Base implementation of _tokenize module #6240
📝 Walkthrough
Reorders and augments Ruff-related dependencies in top-level
Sequence Diagram(s)
sequenceDiagram
    participant PyUser as Python user
    participant TokenIter as TokenizerIter (Py)
    participant State as PyTokenizerIterState
    participant Parser as Ruff parser
    PyUser->>TokenIter: next() / __next__()
    TokenIter->>TokenIter: check internal state & tokens
    alt tokens exhausted
        TokenIter->>PyUser: call readline(vm) -> line
        alt readline returns line
            TokenIter->>State: push_line(line)
            State->>Parser: parse_unchecked_source(line...)
            Parser-->>State: tokens
            State->>TokenIter: next_token()
            TokenIter->>TokenIter: token_kind_value(kind)
            alt extra_tokens and operator
                TokenIter->>TokenIter: remap -> TOKEN_OP (55)
            end
            TokenIter-->>PyUser: (type, string, start, end, current_line)
        else readline -> EOF
            TokenIter->>State: mark EOF
            TokenIter-->>PyUser: EndOfFile tuple (empty string + positions)
        end
    else tokens available
        TokenIter-->>PyUser: next token tuple
    end
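The remap branch in the diagram can be shown in isolation. This is a minimal sketch, not the PR's code: `reported_type` and its parameters are hypothetical names, and only the value 55 for `TOKEN_OP` is taken from the diagram.

```rust
/// Generic operator token type, as named in the diagram above.
const TOKEN_OP: u8 = 55;

/// Returns the token type reported to Python.
/// `kind_value` is the exact token type and `is_operator` says whether the
/// token is an operator (both are hypothetical inputs for this sketch).
fn reported_type(kind_value: u8, is_operator: bool, extra_tokens: bool) -> u8 {
    if extra_tokens && is_operator {
        // With extra_tokens enabled, every operator collapses to the generic OP type.
        TOKEN_OP
    } else {
        kind_value
    }
}
```

With `extra_tokens` disabled, or for non-operator tokens, the exact type passes through unchanged.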
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 4
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (15)
- Cargo.lock is excluded by !**/*.lock
- Lib/test/test_tokenize.py is excluded by !Lib/**
- Lib/test/tokenizedata/__init__.py is excluded by !Lib/**
- Lib/test/tokenizedata/bad_coding.py is excluded by !Lib/**
- Lib/test/tokenizedata/bad_coding2.py is excluded by !Lib/**
- Lib/test/tokenizedata/badsyntax_3131.py is excluded by !Lib/**
- Lib/test/tokenizedata/badsyntax_pep3120.py is excluded by !Lib/**
- Lib/test/tokenizedata/coding20731.py is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-no-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests.txt is excluded by !Lib/**
- Lib/token.py is excluded by !Lib/**
- Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (4)
- Cargo.toml (1 hunks)
- stdlib/Cargo.toml (1 hunks)
- stdlib/src/lib.rs (2 hunks)
- stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Format Rust code with the default rustfmt style (run `cargo fmt`)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management
Files: stdlib/src/lib.rs, stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Files: stdlib/src/lib.rs, stdlib/src/tokenize.rs
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
stdlib/src/lib.rs, stdlib/src/tokenize.rs
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
Applied to files:
stdlib/Cargo.toml, stdlib/src/tokenize.rs, Cargo.toml
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to extra_tests/**/*.py : Use ruff to lint custom Python code
Applied to files:
Cargo.toml
🧬 Code graph analysis (2)
stdlib/src/lib.rs (2)
- Lib/tokenize.py (1): tokenize (466-492)
- stdlib/src/socket.rs (1): make_module (7-11)
stdlib/src/tokenize.rs (2)
- vm/src/stdlib/io.rs (1): vm (3324-3326)
- Lib/test/test_tokenize.py (4): readline (1329-1335), readline (1350-1356), readline (1576-1581), readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run tests under miri
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Check Rust code with rustfmt and clippy
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Ensure compilation on various targets
Actionable comments posted: 2
♻️ Duplicate comments (4)
stdlib/src/tokenize.rs (4)
39-42: Don't swallow non-StopIteration errors from `readline`.
Catching every exception and treating it as EOF hides real failures (TypeError, UnicodeDecodeError, etc.). Only `StopIteration` should terminate the stream; all other errors must propagate.
Apply this diff to fix the error handling:
-        let raw_line = match self.readline.invoke((), vm) {
-            Ok(v) => v,
-            Err(_) => return Ok(String::new()),
-        };
+        let raw_line = match self.readline.invoke((), vm) {
+            Ok(v) => v,
+            Err(exc) => {
+                if exc.fast_isinstance(vm.ctx.exceptions.stop_iteration) {
+                    return Ok(String::new());
+                }
+                return Err(exc);
+            }
+        };
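The rule in this comment (only `StopIteration` ends the stream) can be illustrated without the RustPython VM. The sketch below is a stand-alone model under stated assumptions: `PyError` and `read_line` are hypothetical stand-ins, not RustPython types or functions.

```rust
/// Hypothetical stand-in for an exception raised by the Python `readline` callable.
#[derive(Debug, PartialEq)]
enum PyError {
    StopIteration,
    Other(String),
}

/// Maps the outcome of one `readline` call to the tokenizer's view of it.
/// An empty string is used as the end-of-input sentinel.
fn read_line(result: Result<String, PyError>) -> Result<String, PyError> {
    match result {
        Ok(line) => Ok(line),
        // Only StopIteration means "no more input".
        Err(PyError::StopIteration) => Ok(String::new()),
        // Anything else (TypeError, UnicodeDecodeError, ...) must reach the caller.
        Err(other) => Err(other),
    }
}
```

The key point is the last arm: a failing `readline` stays visible to the caller instead of being reported as a clean end of file.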
116-138: End-of-file tuple needs to come from the real EOF token.
Synthesizing the EOF tuple here hardcodes `(line, -1)` and reuses the previous line text, so consumers see the wrong coordinates and line content. Instead, retrieve the parser's `EndOfFile` token, set it as `prev_token`, and let the existing emission path produce the correct tuple.
Apply this diff to use the parser's EOF token:
         if nline.is_empty() {
             state.eof = true;
-            *zelf.state.write() = state.clone();
-
-            let line_num = &state.start().0;
-            let out = vm
-                .ctx
-                .new_tuple(vec![
-                    token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                    vm.ctx.new_str("").into(),
-                    vm.ctx
-                        .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                        .into(),
-                    vm.ctx
-                        .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                        .into(),
-                    vm.ctx.new_str(state.current_line()).into(),
-                ])
-                .into();
-            return Ok(PyIterReturn::Return(out));
+            if let Some(eof) = state
+                .tokens
+                .iter()
+                .rev()
+                .find(|token| token.kind() == TokenKind::EndOfFile)
+            {
+                state.prev_token = Some(*eof);
+                break *eof;
+            }
+            *zelf.state.write() = state.clone();
+            return Ok(PyIterReturn::StopIteration(None));
         }
151-154: Don't trim token text when `extra_tokens` is false.
`trim()` removes indentation whitespace and other significant padding, causing INDENT tokens to become empty strings. CPython preserves the exact lexeme. Use the slice as-is without trimming.
Apply this diff to preserve the exact lexeme:
-        let mut token_repr = &state.source[state.range()];
-        if !zelf.extra_tokens {
-            token_repr = token_repr.trim();
-        }
+        let token_repr = &state.source[state.range()];
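The effect of `trim()` on an indentation lexeme is easy to reproduce with plain string slicing. A minimal sketch, assuming a hypothetical helper `lexeme` and a hand-picked byte range for the INDENT token:

```rust
use std::ops::Range;

/// Returns the text of a token given its byte range in the source.
/// `trim` reproduces the behavior the comment objects to (hypothetical flag).
fn lexeme(source: &str, range: Range<usize>, trim: bool) -> &str {
    let text = &source[range];
    if trim {
        // An all-whitespace lexeme such as indentation becomes empty here.
        text.trim()
    } else {
        text
    }
}
```

For the source `"if x:\n    y = 1\n"`, bytes 6..10 are the four spaces of indentation, so the trimmed variant yields an empty string while the untrimmed one keeps the spaces.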
178-180: `extra_tokens` must default to `False`.
CPython's `TokenizerIter` allows calling `TokenizerIter(readline)` without naming the flag. Requiring the argument here breaks API compatibility.
Apply this diff to make `extra_tokens` optional with a default:
         #[pyarg(positional)]
         readline: ArgCallable,
-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
         #[pyarg(named, optional)]
         encoding: Option<PyStrRef>,
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)
108-109: Reminder: Address the TODO comment for error handling.
The TODO indicates that error checking (e.g., raising SyntaxError for parse errors) is missing. The `state.errors` field is populated but never checked.
Do you want me to generate code to check `state.errors` and raise appropriate SyntaxError exceptions, or open a new issue to track this task?
200-207: Re-parsing the entire source on each line may be inefficient.
The `push_line` method re-parses the entire accumulated source text each time a new line is added. For large files, this could become a performance bottleneck with O(n²) behavior.
However, this approach may be necessary for correct tokenization since Python's grammar is context-sensitive (e.g., for indent/dedent tracking). Consider profiling to determine if incremental parsing or caching strategies would be beneficial.
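The quadratic growth can be made concrete by counting how many bytes each strategy scans. This is a rough cost model, not a benchmark; both function names are hypothetical and it assumes a parse scans its whole input once.

```rust
/// Bytes scanned when the accumulated source is reparsed after every new line.
fn bytes_scanned_reparsing(lines: &[&str]) -> usize {
    let mut accumulated = 0;
    let mut total = 0;
    for line in lines {
        accumulated += line.len();
        // Each reparse scans everything read so far.
        total += accumulated;
    }
    total
}

/// Bytes scanned when all lines are read first and parsed once.
fn bytes_scanned_single_pass(lines: &[&str]) -> usize {
    lines.iter().map(|line| line.len()).sum()
}
```

For n lines of equal length the first count grows with n(n+1)/2 and the second with n, which is the O(n²) versus O(n) gap the comment describes.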
📒 Files selected for processing (1)
stdlib/src/tokenize.rs (1 hunks)
🔇 Additional comments (5)
stdlib/src/tokenize.rs (5)
1-23: LGTM!
Module structure and imports follow RustPython conventions correctly.
25-33: LGTM!
The struct definition follows RustPython patterns correctly. The TODO comment is noted but doesn't block the current implementation.
61-72: LGTM!
Debug implementation and empty `pyclass` impl block are correct.
74-93: LGTM!
Constructor implementation follows RustPython patterns correctly.
210-220: No changes needed to `current_line` logic.
The `current_line` method correctly uses the `LineRanges` trait methods from `ruff_source_file`: `full_line_str(offset)` returns the full line containing an offset, and `full_lines_str(range)` returns text covering all lines that a range spans. The logic properly handles edge cases:
- Multi-line tokens: Correctly handled via `full_lines_str(range)` for non-Newline tokens
- Newline tokens: Correctly isolated via `full_line_str(range.start())`
- When `prev_token` is `None`: Safely defaults to `TokenKind::Unknown` with `TextRange::default()`, which calls `full_lines_str` and returns lines from the start
The code is sound and requires no modifications.
Actionable comments posted: 0
♻️ Duplicate comments (5)
stdlib/src/tokenize.rs (5)
121-143: EOF tuple must use the parser's actual EndOfFile token.
The code synthesizes the EOF tuple with hardcoded `(-1)` positions (lines 132, 135) and reuses `state.current_line()`, which produces incorrect coordinates and line content. Instead, retrieve the parser's `EndOfFile` token from `state.tokens`, set it as `prev_token`, and let the standard token-emission path (lines 145-175) produce the tuple with correct metadata.
Apply this approach:
         let nline = zelf.readline(vm)?;
         if nline.is_empty() {
             state.eof = true;
-            *zelf.state.write() = state.clone();
-
-            let line_num = &state.start().0;
-            let out = vm
-                .ctx
-                .new_tuple(vec![
-                    token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                    vm.ctx.new_str("").into(),
-                    vm.ctx
-                        .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                        .into(),
-                    vm.ctx
-                        .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                        .into(),
-                    vm.ctx.new_str(state.current_line()).into(),
-                ])
-                .into();
-            return Ok(PyIterReturn::Return(out));
+            if let Some(eof) = state
+                .tokens
+                .iter()
+                .find(|token| token.kind() == TokenKind::EndOfFile)
+            {
+                state.prev_token = Some(*eof);
+                break *eof;
+            }
+            *zelf.state.write() = state.clone();
+            return Ok(PyIterReturn::StopIteration(None));
         }
156-159: Remove trim() to preserve exact token lexemes.
Calling `trim()` when `extra_tokens` is false (line 158) strips significant whitespace, causing INDENT tokens to become empty strings. CPython's tokenizer preserves the exact lexeme. Use the slice as-is.
Apply this diff:
-        let mut token_repr = &state.source[state.range()];
-        if !zelf.extra_tokens {
-            token_repr = token_repr.trim();
-        }
+        let token_repr = &state.source[state.range()];
183-184: Make `extra_tokens` optional with default `false`.
CPython allows `TokenizerIter(readline)` without specifying `extra_tokens`. The current declaration requires the argument, breaking API compatibility.
Apply this diff:
-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
227-243: Token ordering skips tokens that overlap the current range.
When `prev_token` is `None`, `self.range()` returns `TextRange::default()` (0..0). The check that `range.ordering(self.range())` is `Ordering::Greater` (line 236) only matches tokens completely after the current range. Tokens starting at position 0 (e.g., 0..1) overlap with 0..0, returning `Ordering::Equal`, and are skipped. The first token in the source is never yielded.
Fix by comparing start positions:
-            if matches!(range.ordering(self.range()), Ordering::Greater) {
+            if range.start() > self.range().end() {
                 self.prev_token = Some(*token);
                 return self.prev_token;
             }
286-391: Critical token mapping errors: duplicates and CPython conflicts.
Multiple critical issues remain:
- Duplicate value 62 (lines 384-385): Both `Comment` and `TStringStart` map to 62
- Incorrect Comment mapping (line 384): Maps to 62, but CPython 3.14 has `COMMENT=61`
- FString conflicts (lines 381-383): Values 59, 60, 61 correspond to `SOFT_KEYWORD`, `ERRORTOKEN`, and `COMMENT` in CPython 3.14
- Exclamation collision (line 380): Maps to 54, which is CPython's `OP` token
- Misleading comments (lines 385-387): "3.14 compatible" claims are inaccurate
These mappings must be corrected to either:
- Match CPython 3.14 standard token values exactly, or
- Use reserved/unassigned values (e.g., 100+) to avoid collisions
Run this verification to confirm no remaining duplicates after fixes:
#!/bin/bash
# Extract all token value assignments and check for duplicates
rg -n "TokenKind::\w+ => \d+" stdlib/src/tokenize.rs | \
  gawk '{print $NF}' | sort | uniq -d | \
  xargs -I {} echo "Duplicate token value found: {}"
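The same duplicate check could also live next to the mapping as a unit test. A minimal sketch in Rust, assuming the mapping is available as a slice of name/value pairs; `duplicate_values` and the table layout are hypothetical and not part of the PR.

```rust
use std::collections::BTreeMap;

/// Returns every numeric value assigned to more than one token name,
/// together with the names that share it, ordered by value.
fn duplicate_values<'a>(table: &[(&'a str, u8)]) -> Vec<(u8, Vec<&'a str>)> {
    let mut by_value: BTreeMap<u8, Vec<&'a str>> = BTreeMap::new();
    for &(name, value) in table {
        by_value.entry(value).or_default().push(name);
    }
    by_value
        .into_iter()
        .filter(|(_, names)| names.len() > 1)
        .collect()
}
```

Fed with the `Comment` and `TStringStart` entries reported above (both 62), the function returns a single entry for 62 listing both names; an empty result means the table is collision-free.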
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)
189-212: State structure is correct; consider performance implications.
The `push_line` method reparses the entire accumulated source on each call (line 208). While correct, this has O(n²) complexity for n lines. This is acceptable for now but may need optimization for large files.
113-118: TODO: Implement syntax error handling.
The TODO on line 114 indicates that parse errors from `state.errors` are not currently raised as `SyntaxError`. This means invalid Python syntax may be silently ignored. Consider implementing error checking in a follow-up to ensure proper error reporting.
Based on learnings.
⛔ Files ignored due to path filters (1)
Lib/test/test_tokenize.py is excluded by !Lib/**
📒 Files selected for processing (1)
stdlib/src/tokenize.rs (1 hunks)
🔇 Additional comments (9)
stdlib/src/tokenize.rs (9)
1-23: LGTM: Module structure and imports are well-organized.
The `#[pymodule]` macro usage follows RustPython patterns, and the TOKEN_OP constant correctly matches CPython's value.
25-33: Struct definition is sound; track the TODO for future work.
The use of `PyRwLock` for state management is appropriate. The TODO on line 29 regarding `PyObject` vs `ArgCallable` can be addressed in a follow-up.
35-64: LGTM: Error handling correctly distinguishes StopIteration from other exceptions.
The implementation now properly treats only `StopIteration` as EOF (lines 42-43) and propagates all other errors (line 45), fixing the critical issue from the previous review. The encoding logic for both bytes and strings is also correct.
66-77: LGTM: Standard Debug implementation and trait delegation.
79-98: LGTM: Constructor implementation is clean.
100-100: LGTM: Standard SelfIter implementation.
214-225: LGTM: Current line extraction logic is appropriate.
245-267: LGTM: Position helper methods correctly convert to Python conventions.
The `start()` and `end()` methods properly return 1-indexed line numbers and 0-indexed column offsets, matching Python's tokenize module.
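The position convention (1-indexed lines, 0-indexed columns) can be illustrated with a small stand-alone helper. This is a sketch only: `position` is a hypothetical function, it counts columns in characters (an assumption of this sketch), and it expects `offset` to fall on a character boundary.

```rust
/// Converts a byte offset into (line, column) using Python's tokenize
/// convention: lines start at 1, columns start at 0.
fn position(source: &str, offset: usize) -> (usize, usize) {
    let before = &source[..offset];
    // Every newline before the offset moves us one line further down.
    let line = before.matches('\n').count() + 1;
    // The current line starts right after the last newline, or at 0.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count();
    (line, column)
}
```

For the source `"x = 1\ny = 2\n"`, offset 0 maps to `(1, 0)` and offset 6, the `y` that opens the second line, maps to `(2, 0)`.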
270-284: LGTM: Default state initialization is clean.
I missed this, sorry
Port _tokenize.TokenizerIter from PR RustPython#6240 (ShaharNaveh), adapted to current codebase. Uses ruff_python_parser for tokenization. Update Lib/tokenize.py from cpython/Lib/tokenize.py.
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter.
All good. This PR wasn't great anyway. I'm thinking it might be worth implementing the tokenizer by hand instead of relying on ruff, as there are multiple issues with it. I'll try to resolve the conflicts and see how the newer versions of ruff behave.
37f6c72 to f2fecb5 (March 6, 2026 10:11)
📦 Library Dependencies
The following Lib/ modules were modified. Here are their dependencies:
[x] lib: cpython/Lib/tokenize.py
dependencies:
dependent tests: (132 tests)
* Base implementation of _tokenize module
  Port from PR #6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter.
* Update tokenize from v3.14.3
* Rewrite _tokenize with 2-phase model
  Replace per-line reparsing with single-pass tokenization:
  - Read all lines via readline, parse once, yield tokens
  - Fix token type values (COMMENT=65, NL=66, OP=55)
  - Fix NEWLINE/NL end positions and implicit newline handling
  - Fix DEDENT positions via look-ahead to next non-DEDENT token
  - Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → })
  - Emit implicit NL before ENDMARKER when source lacks trailing newline
  - Raise IndentationError from lexer errors
  - Remove 13 expectedFailure marks for now-passing tests
---------
Co-authored-by: ShaharNaveh <shaharnaveh@users.noreply.github.com>
Co-authored-by: CPython Developers <>
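The brace unescaping mentioned for FSTRING_MIDDLE can be sketched as a small string transform. This is an illustration of the `{{ → {` and `}} → }` rule only, under the assumption that it is applied to the literal text of the token; `unescape_braces` is a hypothetical name and not the function used in the commit.

```rust
/// Collapses doubled braces in the literal part of an f-string:
/// `{{` becomes `{` and `}}` becomes `}`.
fn unescape_braces(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        // If a brace is immediately repeated, drop the second copy.
        if (c == '{' || c == '}') && chars.peek() == Some(&c) {
            chars.next();
        }
    }
    out
}
```

Text without braces passes through unchanged, and a run of four opening braces collapses to two because pairs are consumed left to right.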