Implement _tokenize and update tokenize from v3.14.3#7392
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (4)
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThe PR adds a new Python-accessible tokenizer module ( Changes
Sequence DiagramsequenceDiagram
participant Python as Python Caller
participant Iter as PyTokenizerIter
participant Reader as readline Callback
participant Parser as Token Parser
participant TokenMgmt as Token State Manager
Python->>Iter: __init__(readline, extra_tokens, encoding)
Iter->>Iter: Initialize state (Reading phase)
Python->>Iter: __next__()
Iter->>Iter: Check current phase
alt Reading Phase
Iter->>Reader: Call readline callback
Reader-->>Iter: Return line text
Iter->>Parser: Parse line → tokens
Parser-->>Iter: Return tokens
Iter->>TokenMgmt: Queue pending tokens
Iter->>Iter: Transition to Yielding phase
end
alt Yielding Phase
Iter->>TokenMgmt: Fetch next token
Iter->>Iter: Handle special cases (FSTRING_MIDDLE, Dedent, NL, EOF)
Iter->>Iter: Build token tuple (type, text, start, end, line)
Iter-->>Python: Return token tuple
end
alt Done Phase
Iter-->>Python: Raise StopIteration
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Sorry, something went wrong.
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] lib: cpython/Lib/tabnanny.py dependencies:
dependent tests: (1 tests)
[x] lib: cpython/Lib/tokenize.py dependencies:
dependent tests: (132 tests)
Legend:
|
Sorry, something went wrong.
ShaharNaveh
left a comment
There was a problem hiding this comment.
Way better than what I did
Sorry, something went wrong.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
crates/stdlib/src/_tokenize.rs (2)
396-419: Extract magic numbers 60 and 63 to named constants.The function uses magic numbers
60(FSTRING_MIDDLE) and63(TSTRING_MIDDLE) directly. For consistency with the other token constants defined at the top of the module (lines 25-29), consider adding named constants.♻️ Suggested refactor
Add to the constants section (after line 29):
const TOKEN_FSTRING_MIDDLE: u8 = 60; const TOKEN_TSTRING_MIDDLE: u8 = 63;Then update the function:
TokenKind::FStringStart => { if depth == 0 { - return 60; // FSTRING_MIDDLE + return TOKEN_FSTRING_MIDDLE; } depth -= 1; } TokenKind::TStringStart => { if depth == 0 { - return 63; // TSTRING_MIDDLE + return TOKEN_TSTRING_MIDDLE; } depth -= 1; } _ => {} } } - 60 // default to FSTRING_MIDDLE + TOKEN_FSTRING_MIDDLE // default🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/_tokenize.rs` around lines 396 - 419, The function find_fstring_middle_type currently returns magic numbers 60 and 63; define named constants (e.g., TOKEN_FSTRING_MIDDLE and TOKEN_TSTRING_MIDDLE) alongside the other token constants at the top of the module and replace the literal returns (and the final default 60) with those constants so the function returns TOKEN_FSTRING_MIDDLE or TOKEN_TSTRING_MIDDLE instead of raw numbers.
107-149: Consider the memory overhead of cloningTokenizerState.The clone-modify-write pattern (line 109 reads/clones, line 140 writes back) avoids holding the lock during potentially slow
readline()calls. However, in theYieldingphase, this clones the entire source string and token vectors on eachnext()call, which could be expensive for large files.A more efficient approach would be to hold the write lock directly in the
Yieldingphase (since no external calls are made):♻️ Suggested optimization (optional)
impl IterNext for PyTokenizerIter { fn next(zelf: &Py<Self>, vm: &VirtualMachine) -> PyResult<PyIterReturn> { - let mut state = zelf.state.read().clone(); - loop { - match &mut state.phase { - TokenizerPhase::Reading { source } => { + // Check phase without holding lock + let is_reading = matches!(&zelf.state.read().phase, TokenizerPhase::Reading { .. }); + + if is_reading { + let mut state = zelf.state.write(); + if let TokenizerPhase::Reading { source } = &mut state.phase { let line = zelf.readline(vm)?; // ... rest of reading logic - } - TokenizerPhase::Yielding { .. } => { - let result = - emit_next_token(&mut state, zelf.extra_tokens, vm)?; - *zelf.state.write() = state; - return Ok(result); } + } else { + let mut state = zelf.state.write(); + return emit_next_token(&mut state, zelf.extra_tokens, vm); } } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/_tokenize.rs` around lines 107 - 149, The current PyTokenizerIter::next clones the entire TokenizerState at the start (zelf.state.read().clone()), which causes repeated heavy clones when in TokenizerPhase::Yielding; instead, only clone/read for the Reading branch but for the Yielding branch acquire a write lock on zelf.state (i.e., get a write guard) and call emit_next_token with a mutable reference into that guarded state (TokenizerPhase::Yielding) so you avoid cloning source/tokens on each next(); keep the readline path unchanged (read/clone/modify/write) but switch to holding the write lock inside the TokenizerPhase::Yielding arm before calling emit_next_token and returning its PyIterReturn.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/stdlib/src/_tokenize.rs`:
- Around line 742-744: The match currently returns 0 (ENDMARKER) for
TokenKind::Unknown (and leaves IpyEscapeCommand and Question as 0); change these
arms so they map to ERRORTOKEN (numeric 67 or the appropriate ERRORTOKEN
constant) instead of 0: update the match handling for TokenKind::Unknown,
TokenKind::IpyEscapeCommand, and TokenKind::Question in the tokenizer function
to return 67 (or token::ERRORTOKEN) so unrecognized/IPython-specific tokens
produce ERRORTOKEN rather than ENDMARKER.
- Around line 189-199: The call to source.full_line_str(...) is invalid; replace
it with a small helper function (e.g., full_line_at(source: &str, byte_offset:
usize) -> &str or String) and call that from the tokenization code (replace
source.full_line_str(TextSize::from(...)) with full_line_at(source,
offset_min)). Implement full_line_at to locate the start and end byte indices
around byte_offset by scanning source bytes for the previous '\n' (or start) and
next '\n' (or end), trimming a trailing '\r' if present so CRLF lines return the
line without '\r'; keep using TextSize::from(offset.min(source.len()) as u32) to
compute the offset passed from the existing code. Also fix the offset
calculation (where you currently do .map(|l| l.len() + 1)) to sum byte lengths
including the actual newline sequence length for each consumed line (detect
'\r\n' vs '\n') or instead compute the byte offset by scanning source bytes for
the (sl-1)-th newline to avoid assuming single-byte line endings; update all
places calling full_line_str to use the new full_line_at helper.
---
Nitpick comments:
In `@crates/stdlib/src/_tokenize.rs`:
- Around line 396-419: The function find_fstring_middle_type currently returns
magic numbers 60 and 63; define named constants (e.g., TOKEN_FSTRING_MIDDLE and
TOKEN_TSTRING_MIDDLE) alongside the other token constants at the top of the
module and replace the literal returns (and the final default 60) with those
constants so the function returns TOKEN_FSTRING_MIDDLE or TOKEN_TSTRING_MIDDLE
instead of raw numbers.
- Around line 107-149: The current PyTokenizerIter::next clones the entire
TokenizerState at the start (zelf.state.read().clone()), which causes repeated
heavy clones when in TokenizerPhase::Yielding; instead, only clone/read for the
Reading branch but for the Yielding branch acquire a write lock on zelf.state
(i.e., get a write guard) and call emit_next_token with a mutable reference into
that guarded state (TokenizerPhase::Yielding) so you avoid cloning source/tokens
on each next(); keep the readline path unchanged (read/clone/modify/write) but
switch to holding the write lock inside the TokenizerPhase::Yielding arm before
calling emit_next_token and returning its PyIterReturn.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 51c73337-da3f-40a7-a355-8dedaf682e2d
⛔ Files ignored due to path filters (4)
Cargo.lockis excluded by!**/*.lockLib/test/test_tabnanny.pyis excluded by!Lib/**Lib/test/test_tokenize.pyis excluded by!Lib/**Lib/tokenize.pyis excluded by!Lib/**
📒 Files selected for processing (3)
crates/stdlib/Cargo.tomlcrates/stdlib/src/_tokenize.rscrates/stdlib/src/lib.rs
Sorry, something went wrong.
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter.
Replace per-line reparsing with single-pass tokenization:
- Read all lines via readline, parse once, yield tokens
- Fix token type values (COMMENT=65, NL=66, OP=55)
- Fix NEWLINE/NL end positions and implicit newline handling
- Fix DEDENT positions via look-ahead to next non-DEDENT token
- Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → })
- Emit implicit NL before ENDMARKER when source lacks trailing newline
- Raise IndentationError from lexer errors
- Remove 13 expectedFailure marks for now-passing tests
042aed4 to
55737ed
Compare
March 10, 2026 02:28
e645761
into
RustPython:main
Mar 10, 2026
* Base implementation of _tokenize module Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter. * Update tokenize from v3.14.3 * Rewrite _tokenize with 2-phase model Replace per-line reparsing with single-pass tokenization: - Read all lines via readline, parse once, yield tokens - Fix token type values (COMMENT=65, NL=66, OP=55) - Fix NEWLINE/NL end positions and implicit newline handling - Fix DEDENT positions via look-ahead to next non-DEDENT token - Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → }) - Emit implicit NL before ENDMARKER when source lacks trailing newline - Raise IndentationError from lexer errors - Remove 13 expectedFailure marks for now-passing tests --------- Co-authored-by: ShaharNaveh <shaharnaveh@users.noreply.github.com> Co-authored-by: CPython Developers <>
close #6240
Summary by CodeRabbit
New Features
Chores