Base implementation of _tokenize module #6240

Closed
ShaharNaveh wants to merge 4 commits into RustPython:main from ShaharNaveh:update-token
@ShaharNaveh ShaharNaveh commented Nov 8, 2025

Contributor

Summary by CodeRabbit

  • New Features

    • New Python tokenizer module: Tokenizer iterator that yields token tuples (type, text, start, end, line), accepts a readline callable, optional encoding, and an extra-token mode; emits EOF tokens with position metadata for streaming/tokenization.
  • Chores

    • Internal dependency declarations updated to support the new tokenizer module.


coderabbitai Bot commented Nov 8, 2025

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2d50d628-86fa-4f25-bfbd-1c6a042f4da8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Reorders and augments Ruff-related dependencies in the top-level Cargo.toml and adds workspace dependencies in stdlib/Cargo.toml; registers a new internal tokenize module in stdlib/src/lib.rs; and introduces stdlib/src/tokenize.rs, a _tokenize module declared with RustPython's #[pymodule] macro. The module exposes TokenizerIter, which yields Python token 5-tuples and handles tokenizer state and EOF.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Top-level manifest (Cargo.toml) | Reordered Ruff-related crate entries; added ruff_python_trivia; moved ruff_text_size near ruff_source_file. No API changes. |
| Stdlib manifest updates (stdlib/Cargo.toml) | Added workspace dependencies: ruff_python_ast, ruff_python_parser, ruff_source_file, ruff_text_size (all workspace = true). |
| Module registration (stdlib/src/lib.rs) | Added mod tokenize; and registered _tokenize by including "_tokenize" => tokenize::make_module in module initializers. |
| Tokenizer implementation (stdlib/src/tokenize.rs) | New _tokenize module, declared with RustPython's #[pymodule] macro, exposing Python class TokenizerIter (PyTokenizerIter) and its constructor args. Implements an iterator that reads lines via a Python readline callable, manages internal state with a PyRwLock, emits 5-tuples (type, string, start_pos, end_pos, current_line), remaps operator tokens to TOKEN_OP when extra_tokens is true, and emits an EOF token tuple. |
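
The operator remapping described for stdlib/src/tokenize.rs above can be sketched in plain Rust. This is a minimal illustration rather than the PR's code: `report_type` and its `is_exact_operator` flag are hypothetical names, and only the `TOKEN_OP` value of 55 and the `extra_tokens` switch come from the walkthrough.

```rust
/// Generic operator token type; 55 is the value the walkthrough gives for TOKEN_OP.
const TOKEN_OP: u8 = 55;

/// Hypothetical helper: choose the token type reported to Python.
/// `exact_type` is the specific type the lexer produced, and
/// `is_exact_operator` says whether that type denotes an operator.
fn report_type(exact_type: u8, is_exact_operator: bool, extra_tokens: bool) -> u8 {
    if extra_tokens && is_exact_operator {
        // In extra-token mode every operator collapses to the generic OP type.
        TOKEN_OP
    } else {
        // Otherwise the exact type is passed through unchanged.
        exact_type
    }
}
```

Keeping the remap in one small function makes it easy to test separately from the iterator state.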

Sequence Diagram(s)

sequenceDiagram
    participant PyUser as Python user
    participant TokenIter as TokenizerIter (Py)
    participant State as PyTokenizerIterState
    participant Parser as Ruff parser

    PyUser->>TokenIter: next() / __next__()
    TokenIter->>TokenIter: check internal state & tokens
    alt tokens exhausted
        TokenIter->>PyUser: call readline(vm) -> line
        alt readline returns line
            TokenIter->>State: push_line(line)
            State->>Parser: parse_unchecked_source(line...)
            Parser-->>State: tokens
            State->>TokenIter: next_token()
            TokenIter->>TokenIter: token_kind_value(kind)
            alt extra_tokens and operator
                TokenIter->>TokenIter: remap -> TOKEN_OP (55)
            end
            TokenIter-->>PyUser: (type, string, start, end, current_line)
        else readline -> EOF
            TokenIter->>State: mark EOF
            TokenIter-->>PyUser: EndOfFile tuple (empty string + positions)
        end
    else tokens available
        TokenIter-->>PyUser: next token tuple
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus review on stdlib/src/tokenize.rs concurrency (use of PyRwLock), correctness of token position/range calculations, EOF emission behavior, and token-kind remapping.
  • Verify workspace dependency consistency across Cargo.toml and stdlib/Cargo.toml.

Suggested reviewers

  • youknowone

Poem

A rabbit reads each Python line with care,
Hops through tokens drifting in the air,
Ruff crates in order, ready for the quest,
Tuples hop out tidy, positions dressed,
EOF tiptoes in — the iterator rests. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: implementing a base _tokenize module with a TokenizerIter class and supporting infrastructure. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cad66c and 7bc0197.

⛔ Files ignored due to path filters (15)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_tokenize.py is excluded by !Lib/**
  • Lib/test/tokenizedata/__init__.py is excluded by !Lib/**
  • Lib/test/tokenizedata/bad_coding.py is excluded by !Lib/**
  • Lib/test/tokenizedata/bad_coding2.py is excluded by !Lib/**
  • Lib/test/tokenizedata/badsyntax_3131.py is excluded by !Lib/**
  • Lib/test/tokenizedata/badsyntax_pep3120.py is excluded by !Lib/**
  • Lib/test/tokenizedata/coding20731.py is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-no-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests.txt is excluded by !Lib/**
  • Lib/token.py is excluded by !Lib/**
  • Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (4)
  • Cargo.toml (1 hunks)
  • stdlib/Cargo.toml (1 hunks)
  • stdlib/src/lib.rs (2 hunks)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations

Applied to files:

  • stdlib/Cargo.toml
  • stdlib/src/tokenize.rs
  • Cargo.toml
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to extra_tests/**/*.py : Use ruff to lint custom Python code

Applied to files:

  • Cargo.toml
🧬 Code graph analysis (2)
stdlib/src/lib.rs (2)
Lib/tokenize.py (1)
  • tokenize (466-492)
stdlib/src/socket.rs (1)
  • make_module (7-11)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1329-1335)
  • readline (1350-1356)
  • readline (1576-1581)
  • readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run tests under miri
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Ensure compilation on various targets

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
stdlib/src/tokenize.rs (4)

39-42: Don't swallow non-StopIteration errors from readline.

Catching every exception and treating it as EOF hides real failures (TypeError, UnicodeDecodeError, etc.). Only StopIteration should terminate the stream; all other errors must propagate.

Apply this diff to fix the error handling:

-            let raw_line = match self.readline.invoke((), vm) {
-                Ok(v) => v,
-                Err(_) => return Ok(String::new()),
-            };
+            let raw_line = match self.readline.invoke((), vm) {
+                Ok(v) => v,
+                Err(exc) => {
+                    if exc.fast_isinstance(vm.ctx.exceptions.stop_iteration) {
+                        return Ok(String::new());
+                    }
+                    return Err(exc);
+                }
+            };
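
The rule in this comment (only StopIteration ends the stream, everything else propagates) can be shown without the RustPython VM. The sketch below is an assumption-laden stand-in: `ReadlineError` and `line_or_eof` are hypothetical names that model the Python exception as a Rust enum.

```rust
/// Hypothetical stand-in for the exceptions a Python `readline` call can raise.
#[derive(Debug, PartialEq)]
enum ReadlineError {
    StopIteration,
    Other(String),
}

/// Treat `StopIteration` as end of input (an empty line) and
/// pass every other error back to the caller unchanged.
fn line_or_eof(result: Result<String, ReadlineError>) -> Result<String, ReadlineError> {
    match result {
        Ok(line) => Ok(line),
        Err(ReadlineError::StopIteration) => Ok(String::new()),
        Err(other) => Err(other),
    }
}
```

The explicit `match` keeps the single "this error means EOF" case visible, which a blanket `Err(_)` arm hides.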

116-138: End-of-file tuple needs to come from the real EOF token.

Synthesizing the EOF tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Instead, retrieve the parser's EndOfFile token, set it as prev_token, and let the existing emission path produce the correct tuple.

Apply this diff to use the parser's EOF token:

                 if nline.is_empty() {
                     state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .rev()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break *eof;
+                    }
+                    *zelf.state.write() = state.clone();
+                    return Ok(PyIterReturn::StopIteration(None));
                 }
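
The lookup at the heart of the suggestion above is a reverse search over the token list. A minimal sketch with hypothetical `Kind` and `Token` types (the real ones come from ruff_python_parser and are not reproduced here):

```rust
/// Hypothetical token kinds; only the ones needed for the example.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Name,
    Newline,
    EndOfFile,
}

/// Hypothetical token: a kind plus a half-open byte range into the source.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Token {
    kind: Kind,
    start: usize,
    end: usize,
}

/// Search from the back, since the end-of-file token is expected to come last.
fn find_eof(tokens: &[Token]) -> Option<Token> {
    tokens
        .iter()
        .rev()
        .find(|t| t.kind == Kind::EndOfFile)
        .copied()
}
```

Returning `Option<Token>` forces the caller to decide what happens when no end-of-file token exists, which is the case the diff handles with StopIteration.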

151-154: Don't trim token text when extra_tokens is false.

trim() removes indentation whitespace and other significant padding, causing INDENT tokens to become empty strings. CPython preserves the exact lexeme. Use the slice as-is without trimming.

Apply this diff to preserve the exact lexeme:

-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];
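
A small standalone example shows why trimming is lossy here. It assumes, as the comment implies, that an INDENT token's range covers the leading whitespace of the line; `exact_lexeme` is a hypothetical helper name.

```rust
/// Return the token text exactly as it appears in the source.
/// `start..end` is a byte range and must fall on character boundaries.
fn exact_lexeme(source: &str, start: usize, end: usize) -> &str {
    &source[start..end]
}
```

For the source `"if x:\n    y\n"`, the indentation occupies bytes 6..10. `exact_lexeme` returns those four spaces unchanged, while calling `trim()` on the same slice yields an empty string.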

178-180: extra_tokens must default to False.

CPython's TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks API compatibility.

Apply this diff to make extra_tokens optional with a default:

         #[pyarg(positional)]
         readline: ArgCallable,
-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
         #[pyarg(named, optional)]
         encoding: Option<PyStrRef>,
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)

108-109: Reminder: Address the TODO comment for error handling.

The TODO indicates that error checking (e.g., raising SyntaxError for parse errors) is missing. The state.errors field is populated but never checked.

Do you want me to generate code to check state.errors and raise appropriate SyntaxError exceptions, or open a new issue to track this task?


200-207: Re-parsing the entire source on each line may be inefficient.

The push_line method re-parses the entire accumulated source text each time a new line is added. For large files, this could become a performance bottleneck with O(n²) behavior.

However, this approach may be necessary for correct tokenization since Python's grammar is context-sensitive (e.g., for indent/dedent tracking). Consider profiling to determine if incremental parsing or caching strategies would be beneficial.
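
To make the O(n²) concern concrete, the sketch below counts how many bytes reach the parser under each strategy. It is a back-of-the-envelope model with hypothetical function names, not a measurement of the PR's code.

```rust
/// Bytes handed to the parser when the whole accumulated source is
/// re-parsed after every new line.
fn bytes_parsed_per_line(lines: &[&str]) -> usize {
    let mut accumulated = 0;
    let mut total = 0;
    for line in lines {
        accumulated += line.len();
        total += accumulated; // the full prefix is parsed again
    }
    total
}

/// Bytes handed to the parser when the source is parsed once at the end.
fn bytes_parsed_once(lines: &[&str]) -> usize {
    lines.iter().map(|line| line.len()).sum()
}
```

For 100 lines of 2 bytes each, a single pass handles 200 bytes, while per-line reparsing handles 2 × (1 + 2 + … + 100) = 10,100 bytes.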

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 7bc0197 and 04371d3.

📒 Files selected for processing (1)
  • stdlib/src/tokenize.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (5)
stdlib/src/tokenize.rs (5)

1-23: LGTM!

Module structure and imports follow RustPython conventions correctly.


25-33: LGTM!

The struct definition follows RustPython patterns correctly. The TODO comment is noted but doesn't block the current implementation.


61-72: LGTM!

Debug implementation and empty pyclass impl block are correct.


74-93: LGTM!

Constructor implementation follows RustPython patterns correctly.


210-220: No changes needed to current_line logic.

The current_line method correctly uses the LineRanges trait methods from ruff_source_file: full_line_str(offset) returns the full line containing an offset, and full_lines_str(range) returns text covering all lines that a range spans. The logic properly handles edge cases:

  • Multi-line tokens: Correctly handled via full_lines_str(range) for non-Newline tokens
  • Newline tokens: Correctly isolated via full_line_str(range.start())
  • When prev_token is None: Safely defaults to TokenKind::Unknown with TextRange::default(), which calls full_lines_str and returns lines from the start

The code is sound and requires no modifications.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
stdlib/src/tokenize.rs (5)

121-143: EOF tuple must use the parser's actual EndOfFile token.

The code synthesizes the EOF tuple with hardcoded (-1) positions (lines 132, 135) and reuses state.current_line(), which produces incorrect coordinates and line content. Instead, retrieve the parser's EndOfFile token from state.tokens, set it as prev_token, and let the standard token-emission path (lines 145-175) produce the tuple with correct metadata.

Apply this approach:

                 let nline = zelf.readline(vm)?;
                 if nline.is_empty() {
                     state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break *eof;
+                    }
+                    *zelf.state.write() = state.clone();
+                    return Ok(PyIterReturn::StopIteration(None));
                 }

156-159: Remove trim() to preserve exact token lexemes.

Calling trim() when extra_tokens is false (line 158) strips significant whitespace, causing INDENT tokens to become empty strings. CPython's tokenizer preserves the exact lexeme. Use the slice as-is.

Apply this diff:

-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];

183-184: Make extra_tokens optional with default false.

CPython allows TokenizerIter(readline) without specifying extra_tokens. The current declaration requires the argument, breaking API compatibility.

Apply this diff:

-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,

227-243: Token ordering skips tokens that overlap the current range.

When prev_token is None, self.range() returns TextRange::default() (0..0). The check matches!(range.ordering(self.range()), Ordering::Greater) (line 236) only matches tokens completely after the current range. Tokens starting at position 0 (e.g., 0..1) overlap with 0..0, returning Ordering::Equal, and are skipped. The first token in the source is never yielded.

Fix by comparing start positions:

-                if matches!(range.ordering(self.range()), Ordering::Greater) {
+                if range.start() > self.range().end() {
                     self.prev_token = Some(*token);
                     return self.prev_token;
                 }

286-391: Critical token mapping errors: duplicates and CPython conflicts.

Multiple critical issues remain:

  1. Duplicate value 62 (lines 384-385): Both Comment and TStringStart map to 62
  2. Incorrect Comment mapping (line 384): Maps to 62, but CPython 3.14 has COMMENT=65; 62 belongs to TSTRING_START
  3. FString values (lines 381-383): Values 59, 60, 61 must line up with FSTRING_START, FSTRING_MIDDLE, and FSTRING_END in CPython 3.14
  4. Exclamation value (line 380): Maps to 54, immediately below OP (55); the two must stay distinct
  5. Misleading comments (lines 385-387): "3.14 compatible" claims are inaccurate

These mappings must be corrected to either:

  • Match CPython 3.14 standard token values exactly, or
  • Use reserved/unassigned values (e.g., 100+) to avoid collisions

Run this verification to confirm no remaining duplicates after fixes:

#!/bin/bash
# Extract all token value assignments and check for duplicates
rg -n "TokenKind::\w+ => \d+" stdlib/src/tokenize.rs | \
  gawk '{print $NF}' | sort | uniq -d | \
  xargs -I {} echo "Duplicate token value found: {}"
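
The same duplicate check could also live in a Rust unit test instead of a shell pipeline. The sketch below is one possible shape, assuming the mapping is available as a list of (name, value) pairs; `duplicate_values` is a hypothetical helper and the sample entries in the usage note are illustrative.

```rust
use std::collections::HashMap;

/// Return every numeric value that is assigned to more than one token name.
/// The result is sorted by value so it is stable across runs.
fn duplicate_values<'a>(mapping: &[(&'a str, u8)]) -> Vec<(u8, Vec<&'a str>)> {
    // Group names by the value they map to, keeping insertion order per value.
    let mut by_value: HashMap<u8, Vec<&'a str>> = HashMap::new();
    for &(name, value) in mapping {
        by_value.entry(value).or_default().push(name);
    }
    // Keep only the values shared by two or more names.
    let mut dups: Vec<(u8, Vec<&'a str>)> = by_value
        .into_iter()
        .filter(|(_, names)| names.len() > 1)
        .collect();
    dups.sort_by_key(|(value, _)| *value);
    dups
}
```

Called with `[("Comment", 62), ("TStringStart", 62), ("Name", 1)]`, it reports the single clash on 62 with both names attached.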
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)

189-212: State structure is correct; consider performance implications.

The push_line method reparses the entire accumulated source on each call (line 208). While correct, this has O(n²) complexity for n lines. This is acceptable for now but may need optimization for large files.


113-118: TODO: Implement syntax error handling.

The TODO on line 114 indicates that parse errors from state.errors are not currently raised as SyntaxError. This means invalid Python syntax may be silently ignored. Consider implementing error checking in a follow-up to ensure proper error reporting.

Based on learnings.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 04371d3 and 30f54f6.

⛔ Files ignored due to path filters (1)
  • Lib/test/test_tokenize.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • stdlib/src/tokenize.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (9)
stdlib/src/tokenize.rs (9)

1-23: LGTM: Module structure and imports are well-organized.

The #[pymodule] macro usage follows RustPython patterns, and the TOKEN_OP constant correctly matches CPython's value.


25-33: Struct definition is sound; track the TODO for future work.

The use of PyRwLock for state management is appropriate. The TODO on line 29 regarding PyObject vs ArgCallable can be addressed in a follow-up.


35-64: LGTM: Error handling correctly distinguishes StopIteration from other exceptions.

The implementation now properly treats only StopIteration as EOF (lines 42-43) and propagates all other errors (line 45), fixing the critical issue from the previous review. The encoding logic for both bytes and strings is also correct.


66-77: LGTM: Standard Debug implementation and trait delegation.


79-98: LGTM: Constructor implementation is clean.


100-100: LGTM: Standard SelfIter implementation.


214-225: LGTM: Current line extraction logic is appropriate.


245-267: LGTM: Position helper methods correctly convert to Python conventions.

The start() and end() methods properly return 1-indexed line numbers and 0-indexed column offsets, matching Python's tokenize module.
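
The position convention mentioned here (1-indexed lines, 0-indexed columns) can be illustrated with a standalone helper. This is a sketch, not the PR's start() or end(): `line_col` is a hypothetical name, and it assumes columns are counted in characters rather than bytes.

```rust
/// Convert a byte offset into a (line, column) pair:
/// lines start at 1, columns start at 0 and are counted in characters.
/// `offset` must lie on a character boundary within `source`.
fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let before = &source[..offset];
    // Every newline before the offset moves the position one line down.
    let line = before.matches('\n').count() + 1;
    // The current line starts right after the last newline, or at 0.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let col = before[line_start..].chars().count();
    (line, col)
}
```

For `"x = 1\ny = 2\n"`, offset 0 maps to (1, 0) and offset 10, the `2` on the second line, maps to (2, 4).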


270-284: LGTM: Default state initialization is clean.

@ShaharNaveh ShaharNaveh marked this pull request as draft November 14, 2025 10:26
@youknowone

Member

I missed this, sorry

youknowone added a commit to youknowone/RustPython that referenced this pull request Mar 6, 2026
Port _tokenize.TokenizerIter from PR RustPython#6240 (ShaharNaveh), adapted to
current codebase. Uses ruff_python_parser for tokenization.
Update Lib/tokenize.py from cpython/Lib/tokenize.py.
youknowone pushed a commit to youknowone/RustPython that referenced this pull request Mar 6, 2026
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase.
Uses ruff_python_parser for tokenization via TokenizerIter.
@ShaharNaveh

Contributor Author

> I missed this, sorry

All good. This PR wasn't great anyway. I'm thinking it might be worth implementing the tokenizer by hand instead of relying on ruff, as there are multiple issues with it. I'll try to resolve the conflicts and see how the newer versions of ruff behave.

@github-actions

github-actions Bot commented Mar 6, 2026

Contributor

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] lib: cpython/Lib/tokenize.py
[x] test: cpython/Lib/test/test_tokenize.py (TODO: 2)

dependencies:

  • tokenize

dependent tests: (132 tests)

  • tokenize: test_inspect test_linecache test_tabnanny test_tokenize test_unparse
    • importlib._bootstrap_external: test_importlib test_unittest
      • py_compile: test_argparse test_cmd_line_script test_compileall test_importlib test_multiprocessing_main_handling test_py_compile test_pydoc test_runpy
      • pydoc: test_enum
    • inspect: test_abc test_asyncgen test_buffer test_builtin test_code test_collections test_coroutines test_decimal test_functools test_generators test_grammar test_monitoring test_ntpath test_operator test_patma test_posixpath test_signal test_sqlite3 test_traceback test_type_annotations test_types test_typing test_unittest test_yield_from test_zipimport test_zoneinfo
      • ast: test_ast test_compile test_compiler_codegen test_dis test_fstring test_future_stmt test_site test_ssl test_type_comments test_ucn
      • bdb: test_bdb
      • cmd: test_cmd
      • dataclasses: test__colorize test_copy test_ctypes test_genericalias test_pprint test_regrtest
      • pkgutil: test_pkgutil
      • rlcompleter: test_rlcompleter
      • trace: test_trace
      • xmlrpc.server: test_docxmlrpc test_xmlrpc
    • linecache:
      • timeit: test_timeit
      • traceback: test_asyncio test_code_module test_contextlib test_contextlib_async test_dictcomps test_exceptions test_http_cookiejar test_importlib test_iter test_listcomps test_pyexpat test_setcomps test_socket test_subprocess test_sys test_threadedtempfile test_threading test_unittest test_with
    • traceback:
      • concurrent.futures.process: test_concurrent_futures
      • http.cookiejar: test_urllib2
      • logging: test_asyncio test_hashlib test_logging test_support test_urllib2net
      • multiprocessing: test_asyncio test_concurrent_futures test_fcntl test_memoryview test_re
      • socketserver: test_imaplib test_socketserver test_wsgiref
      • threading: test_android test_asyncio test_bz2 test_concurrent_futures test_ctypes test_email test_fork1 test_frame test_ftplib test_gc test_httplib test_httpservers test_importlib test_io test_itertools test_largefile test_opcache test_pathlib test_poll test_queue test_robotparser test_sched test_smtplib test_super test_syslog test_termios test_threading_local test_time test_urllib2_localnet test_weakref test_winreg test_zstd

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

youknowone pushed a commit to youknowone/RustPython that referenced this pull request Mar 9, 2026
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase.
Uses ruff_python_parser for tokenization via TokenizerIter.
youknowone pushed a commit to youknowone/RustPython that referenced this pull request Mar 10, 2026
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase.
Uses ruff_python_parser for tokenization via TokenizerIter.
youknowone added a commit that referenced this pull request Mar 10, 2026
* Base implementation of _tokenize module

Port from PR #6240 by ShaharNaveh, adapted to current codebase.
Uses ruff_python_parser for tokenization via TokenizerIter.

* Update tokenize from v3.14.3

* Rewrite _tokenize with 2-phase model

Replace per-line reparsing with single-pass tokenization:
- Read all lines via readline, parse once, yield tokens
- Fix token type values (COMMENT=65, NL=66, OP=55)
- Fix NEWLINE/NL end positions and implicit newline handling
- Fix DEDENT positions via look-ahead to next non-DEDENT token
- Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → })
- Emit implicit NL before ENDMARKER when source lacks trailing newline
- Raise IndentationError from lexer errors
- Remove 13 expectedFailure marks for now-passing tests

---------

Co-authored-by: ShaharNaveh <shaharnaveh@users.noreply.github.com>
Co-authored-by: CPython Developers <>
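
Two items from the commit message above lend themselves to a short sketch: the first phase of the two-phase model (read every line before parsing) and the FSTRING_MIDDLE brace unescaping. The function names are hypothetical, `readline` is modeled as a closure that returns `None` at end of input, and the parse phase is left out.

```rust
/// Phase 1 of the two-phase model: drain `readline` until it signals the end
/// of input (modeled here as `None`) and join the lines into one buffer.
fn read_all(mut readline: impl FnMut() -> Option<String>) -> String {
    let mut source = String::new();
    while let Some(line) = readline() {
        source.push_str(&line);
    }
    source
}

/// Collapse doubled braces in the literal part of an f-string:
/// `{{` becomes `{` and `}}` becomes `}`.
fn unescape_braces(middle: &str) -> String {
    middle.replace("{{", "{").replace("}}", "}")
}
```

With the whole source in one buffer, the parser runs once and tokens are yielded from its output, which avoids the per-line reparsing the earlier reviews flagged.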
youknowone pushed a commit to youknowone/RustPython that referenced this pull request Mar 19, 2026
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase.
Uses ruff_python_parser for tokenization via TokenizerIter.
youknowone pushed a commit to youknowone/RustPython that referenced this pull request Mar 22, 2026
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase.
Uses ruff_python_parser for tokenization via TokenizerIter.
youknowone added a commit to youknowone/RustPython that referenced this pull request Mar 22, 2026