Implement _tokenize and update tokenize from v3.14.3 #7392

youknowone

close #6240

Summary by CodeRabbit

New Features
- Added a tokenizer accessible from Python that iterates over token streams, with f-string handling, encoding awareness, implicit-newline handling, and indentation error reporting.
Chores
- Added workspace dependencies for Ruff-related crates (parser, AST, text-size, source-file) to support the stdlib build.

coderabbitai

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 0f32219c-d6cf-4084-8ad8-5c73335a6b68

📥 Commits

Reviewing files that changed from the base of the PR and between 042aed4 and 55737ed.

⛔ Files ignored due to path filters (4)

Cargo.lock is excluded by !**/*.lock
Lib/test/test_tabnanny.py is excluded by !Lib/**
Lib/test/test_tokenize.py is excluded by !Lib/**
Lib/tokenize.py is excluded by !Lib/**

📒 Files selected for processing (3)

crates/stdlib/Cargo.toml
crates/stdlib/src/_tokenize.rs
crates/stdlib/src/lib.rs

🚧 Files skipped from review as they are similar to previous changes (3)

crates/stdlib/src/lib.rs
crates/stdlib/Cargo.toml
crates/stdlib/src/_tokenize.rs

📝 Walkthrough

Walkthrough

The PR adds a new Python-accessible tokenizer module (_tokenize) implemented in Rust, registers it in stdlib, and adds four Ruff workspace dependencies to crates/stdlib/Cargo.toml. The tokenizer provides a TokenizerIter exposing Python iterator semantics with f-string, indentation, and EOF handling.

Changes

Cohort / File(s)	Summary
Dependency Management `crates/stdlib/Cargo.toml`	Added four workspace dependencies: `ruff_python_parser`, `ruff_python_ast`, `ruff_text_size`, `ruff_source_file`.
Tokenizer Implementation `crates/stdlib/src/_tokenize.rs`	New Rust module (~747 lines) adding `PyTokenizerIter` (Python class `TokenizerIter`) with iterator protocol, multi-phase state (Reading, Yielding, Done), f-string splitting/handling, implicit-NL/EOF logic, indentation/syntax error reporting, and token tuple construction.
Module Registration `crates/stdlib/src/lib.rs`	Declared `mod _tokenize;` and added `_tokenize::module_def(ctx)` to stdlib module registrations.

Sequence Diagram

sequenceDiagram
    participant Python as Python Caller
    participant Iter as PyTokenizerIter
    participant Reader as readline Callback
    participant Parser as Token Parser
    participant TokenMgmt as Token State Manager

    Python->>Iter: __init__(readline, extra_tokens, encoding)
    Iter->>Iter: Initialize state (Reading phase)
    
    Python->>Iter: __next__()
    Iter->>Iter: Check current phase
    
    alt Reading Phase
        Iter->>Reader: Call readline callback
        Reader-->>Iter: Return line text
        Iter->>Parser: Parse line → tokens
        Parser-->>Iter: Return tokens
        Iter->>TokenMgmt: Queue pending tokens
        Iter->>Iter: Transition to Yielding phase
    end
    
    alt Yielding Phase
        Iter->>TokenMgmt: Fetch next token
        Iter->>Iter: Handle special cases (FSTRING_MIDDLE, Dedent, NL, EOF)
        Iter->>Iter: Build token tuple (type, text, start, end, line)
        Iter-->>Python: Return token tuple
    end
    
    alt Done Phase
        Iter-->>Python: Raise StopIteration
    end
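
The phase flow in the diagram can be sketched as a small state machine. This is a minimal, dependency-free illustration and not the code in `_tokenize.rs`: `SketchTokenizer`, `Phase`, and the whitespace-splitting "parser" are hypothetical stand-ins, and the closure plays the role of the `readline` callback (assumed here to return an empty string at end of input). It follows the read-everything-then-parse-once variant described in the commit message.

```rust
use std::collections::VecDeque;

/// Hypothetical phases mirroring the diagram (Reading, Yielding, Done).
enum Phase {
    Reading { source: String },
    Yielding { pending: VecDeque<String> },
    Done,
}

/// Hypothetical iterator; `readline` stands in for the Python callback
/// and is assumed to return an empty string at end of input.
struct SketchTokenizer<F: FnMut() -> String> {
    readline: F,
    phase: Phase,
}

impl<F: FnMut() -> String> SketchTokenizer<F> {
    fn new(readline: F) -> Self {
        Self {
            readline,
            phase: Phase::Reading { source: String::new() },
        }
    }
}

impl<F: FnMut() -> String> Iterator for SketchTokenizer<F> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        loop {
            match &mut self.phase {
                Phase::Reading { source } => {
                    let line = (self.readline)();
                    if line.is_empty() {
                        // End of input: "parse" once, then start yielding.
                        // Splitting on whitespace is a placeholder for a real lexer.
                        let pending = source.split_whitespace().map(str::to_owned).collect();
                        self.phase = Phase::Yielding { pending };
                    } else {
                        source.push_str(&line);
                    }
                }
                Phase::Yielding { pending } => match pending.pop_front() {
                    Some(tok) => return Some(tok),
                    None => self.phase = Phase::Done,
                },
                // Corresponds to raising StopIteration on the Python side.
                Phase::Done => return None,
            }
        }
    }
}
```

Keeping the buffered source inside the `Reading` variant means the type system rules out touching it once the iterator has moved on to yielding.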

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Update ruff to 0.15.4 #7246: Updates Ruff dependency revisions for the same ruff_python_* crates added here.
Update ruff to 0.15.0 #7086: Modifies the same Ruff workspace dependency entries present in this PR.
Update ruff to 0.15.1 #7105: Adjusts workspace usage/revisions for the same Ruff crates referenced in this PR.

Poem

🐰 I hopped through code with tiny feet,
A tokenizer tune I learned to beat,
F-strings twirled and tokens spun,
Readline sang until the job was done. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the main changes: implementing the _tokenize module and updating tokenize from v3.14.3.
Linked Issues check	✅ Passed	All coding requirements from `#6240` are met: the new _tokenize module is implemented with a Tokenizer iterator yielding token tuples, accepting readline callable and optional encoding, and internal dependencies are updated.
Out of Scope Changes check	✅ Passed	All changes directly support the implementation of the _tokenize module and its dependencies as required by `#6240`; no unrelated changes detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


github-actions

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] lib: cpython/Lib/tabnanny.py
[ ] test: cpython/Lib/test/test_tabnanny.py (TODO: 4)

dependencies:

tabnanny

dependent tests: (1 tests)

tabnanny: test_tabnanny

[x] lib: cpython/Lib/tokenize.py
[x] test: cpython/Lib/test/test_tokenize.py (TODO: 7)

dependencies:

tokenize

dependent tests: (132 tests)

tokenize: test_inspect test_linecache test_tabnanny test_tokenize test_unparse
- importlib._bootstrap_external: test_importlib test_unittest
  - py_compile: test_argparse test_cmd_line_script test_compileall test_importlib test_multiprocessing_main_handling test_py_compile test_pydoc test_runpy
  - pydoc: test_enum
- inspect: test_abc test_asyncgen test_buffer test_builtin test_code test_collections test_coroutines test_decimal test_functools test_generators test_grammar test_monitoring test_ntpath test_operator test_patma test_posixpath test_signal test_sqlite3 test_traceback test_type_annotations test_types test_typing test_unittest test_yield_from test_zipimport test_zoneinfo
  - ast: test_ast test_compile test_compiler_codegen test_dis test_fstring test_future_stmt test_site test_ssl test_type_comments test_ucn
  - bdb: test_bdb
  - cmd: test_cmd
  - dataclasses: test__colorize test_copy test_ctypes test_genericalias test_pprint test_regrtest
  - pkgutil: test_pkgutil
  - rlcompleter: test_rlcompleter
  - trace: test_trace
  - xmlrpc.server: test_docxmlrpc test_xmlrpc
- linecache:
  - timeit: test_timeit
  - traceback: test_asyncio test_code_module test_contextlib test_contextlib_async test_dictcomps test_exceptions test_http_cookiejar test_importlib test_iter test_listcomps test_pyexpat test_setcomps test_socket test_subprocess test_sys test_threadedtempfile test_threading test_unittest test_with
- traceback:
  - concurrent.futures.process: test_concurrent_futures
  - http.cookiejar: test_urllib2
  - logging: test_asyncio test_hashlib test_logging test_support test_urllib2net
  - multiprocessing: test_asyncio test_concurrent_futures test_fcntl test_memoryview test_re
  - socketserver: test_imaplib test_socketserver test_wsgiref
  - threading: test_android test_asyncio test_bz2 test_concurrent_futures test_ctypes test_email test_fork1 test_frame test_ftplib test_gc test_httplib test_httpservers test_importlib test_io test_itertools test_largefile test_opcache test_pathlib test_poll test_queue test_robotparser test_sched test_smtplib test_super test_syslog test_termios test_threading_local test_time test_urllib2_localnet test_weakref test_winreg test_zstd

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

ShaharNaveh

Way better than what I did

youknowone

@ShaharNaveh this is obviously based on your work!

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

crates/stdlib/src/_tokenize.rs (2)

396-419: Extract magic numbers 60 and 63 to named constants.

The function uses magic numbers 60 (FSTRING_MIDDLE) and 63 (TSTRING_MIDDLE) directly. For consistency with the other token constants defined at the top of the module (lines 25-29), consider adding named constants.

♻️ Suggested refactor

Add to the constants section (after line 29):

const TOKEN_FSTRING_MIDDLE: u8 = 60;
const TOKEN_TSTRING_MIDDLE: u8 = 63;

Then update the function:

                 TokenKind::FStringStart => {
                     if depth == 0 {
-                        return 60; // FSTRING_MIDDLE
+                        return TOKEN_FSTRING_MIDDLE;
                     }
                     depth -= 1;
                 }
                 TokenKind::TStringStart => {
                     if depth == 0 {
-                        return 63; // TSTRING_MIDDLE
+                        return TOKEN_TSTRING_MIDDLE;
                     }
                     depth -= 1;
                 }
                 _ => {}
             }
         }
-        60 // default to FSTRING_MIDDLE
+        TOKEN_FSTRING_MIDDLE // default

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/_tokenize.rs` around lines 396 - 419, The function
find_fstring_middle_type currently returns magic numbers 60 and 63; define named
constants (e.g., TOKEN_FSTRING_MIDDLE and TOKEN_TSTRING_MIDDLE) alongside the
other token constants at the top of the module and replace the literal returns
(and the final default 60) with those constants so the function returns
TOKEN_FSTRING_MIDDLE or TOKEN_TSTRING_MIDDLE instead of raw numbers.

107-149: Consider the memory overhead of cloning TokenizerState.

The clone-modify-write pattern (line 109 reads/clones, line 140 writes back) avoids holding the lock during potentially slow readline() calls. However, in the Yielding phase, this clones the entire source string and token vectors on each next() call, which could be expensive for large files.

A more efficient approach would be to hold the write lock directly in the Yielding phase (since no external calls are made):

♻️ Suggested optimization (optional)

     impl IterNext for PyTokenizerIter {
         fn next(zelf: &Py<Self>, vm: &VirtualMachine) -> PyResult<PyIterReturn> {
-            let mut state = zelf.state.read().clone();
-
             loop {
-                match &mut state.phase {
-                    TokenizerPhase::Reading { source } => {
+                // Check phase without holding lock
+                let is_reading = matches!(&zelf.state.read().phase, TokenizerPhase::Reading { .. });
+                
+                if is_reading {
+                    let mut state = zelf.state.write();
+                    if let TokenizerPhase::Reading { source } = &mut state.phase {
                         let line = zelf.readline(vm)?;
                         // ... rest of reading logic
-                    }
-                    TokenizerPhase::Yielding { .. } => {
-                        let result =
-                            emit_next_token(&mut state, zelf.extra_tokens, vm)?;
-                        *zelf.state.write() = state;
-                        return Ok(result);
                     }
+                } else {
+                    let mut state = zelf.state.write();
+                    return emit_next_token(&mut state, zelf.extra_tokens, vm);
                 }
             }
         }
     }
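
The trade-off between the two locking styles can be shown in isolation. The sketch below is an assumption-laden illustration: it uses `std::sync::RwLock` rather than whatever lock type `PyTokenizerIter` actually holds, and `State`, `next_by_clone`, and `next_in_place` are hypothetical names.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for the tokenizer state: a large source buffer
/// plus a cursor into already-parsed tokens.
#[derive(Clone)]
struct State {
    source: String,
    tokens: Vec<String>,
    index: usize,
}

/// Clone-modify-write: copies `source` and `tokens` on every call,
/// but never holds the lock across the work in between.
fn next_by_clone(lock: &RwLock<State>) -> Option<String> {
    let mut state = lock.read().unwrap().clone();
    let tok = state.tokens.get(state.index).cloned();
    state.index += 1;
    *lock.write().unwrap() = state;
    tok
}

/// In-place: takes the write guard and clones only the yielded token.
/// Appropriate only when nothing inside the critical section calls back
/// out (for example into a user-supplied readline).
fn next_in_place(lock: &RwLock<State>) -> Option<String> {
    let mut state = lock.write().unwrap();
    let tok = state.tokens.get(state.index).cloned();
    state.index += 1;
    tok
}
```

Both functions return the same tokens; the difference is only how much data is copied per call and how long the lock is held.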

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/_tokenize.rs` around lines 107 - 149, The current
PyTokenizerIter::next clones the entire TokenizerState at the start
(zelf.state.read().clone()), which causes repeated heavy clones when in
TokenizerPhase::Yielding; instead, only clone/read for the Reading branch but
for the Yielding branch acquire a write lock on zelf.state (i.e., get a write
guard) and call emit_next_token with a mutable reference into that guarded state
(TokenizerPhase::Yielding) so you avoid cloning source/tokens on each next();
keep the readline path unchanged (read/clone/modify/write) but switch to holding
the write lock inside the TokenizerPhase::Yielding arm before calling
emit_next_token and returning its PyIterReturn.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/stdlib/src/_tokenize.rs`:
- Around line 742-744: The match currently returns 0 (ENDMARKER) for
TokenKind::Unknown (and leaves IpyEscapeCommand and Question as 0); change these
arms so they map to ERRORTOKEN (numeric 67 or the appropriate ERRORTOKEN
constant) instead of 0: update the match handling for TokenKind::Unknown,
TokenKind::IpyEscapeCommand, and TokenKind::Question in the tokenizer function
to return 67 (or token::ERRORTOKEN) so unrecognized/IPython-specific tokens
produce ERRORTOKEN rather than ENDMARKER.
- Around line 189-199: The call to source.full_line_str(...) is invalid; replace
it with a small helper function (e.g., full_line_at(source: &str, byte_offset:
usize) -> &str or String) and call that from the tokenization code (replace
source.full_line_str(TextSize::from(...)) with full_line_at(source,
offset_min)). Implement full_line_at to locate the start and end byte indices
around byte_offset by scanning source bytes for the previous '\n' (or start) and
next '\n' (or end), trimming a trailing '\r' if present so CRLF lines return the
line without '\r'; keep using TextSize::from(offset.min(source.len()) as u32) to
compute the offset passed from the existing code. Also fix the offset
calculation (where you currently do .map(|l| l.len() + 1)) to sum byte lengths
including the actual newline sequence length for each consumed line (detect
'\r\n' vs '\n') or instead compute the byte offset by scanning source bytes for
the (sl-1)-th newline to avoid assuming single-byte line endings; update all
places calling full_line_str to use the new full_line_at helper.
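
A minimal sketch of the helper described in the second item, assuming plain `&str` input and byte offsets. `full_line_at` is the name proposed in the comment; `line_start_offset` is a hypothetical companion for the offset calculation. Neither is taken from the actual patch.

```rust
/// Hypothetical helper: return the line containing `byte_offset`, without
/// its trailing "\n" or "\r\n". Offsets past the end are clamped.
fn full_line_at(source: &str, byte_offset: usize) -> &str {
    let offset = byte_offset.min(source.len());
    let bytes = source.as_bytes();
    // Start: just after the previous '\n', or the start of the source.
    let start = bytes[..offset]
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    // End: the next '\n' at or after the offset, or the end of the source.
    let end = bytes[offset..]
        .iter()
        .position(|&b| b == b'\n')
        .map_or(source.len(), |i| offset + i);
    // '\n' is ASCII, so both indices fall on char boundaries.
    let line = &source[start..end];
    line.strip_suffix('\r').unwrap_or(line)
}

/// Hypothetical helper: byte offset where 1-based line `line_no` starts,
/// found by scanning for newlines instead of assuming 1-byte line endings.
fn line_start_offset(source: &str, line_no: usize) -> usize {
    if line_no <= 1 {
        return 0;
    }
    source
        .bytes()
        .enumerate()
        .filter(|&(_, b)| b == b'\n')
        .nth(line_no - 2)
        .map_or(source.len(), |(i, _)| i + 1)
}
```

Scanning for `\n` bytes directly sidesteps the `len() + 1` assumption, because a `\r\n` ending is counted by its real position rather than by a fixed width.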


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 51c73337-da3f-40a7-a355-8dedaf682e2d

📥 Commits

Reviewing files that changed from the base of the PR and between 2266ba7 and 042aed4.


Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter.

Replace per-line reparsing with single-pass tokenization:
- Read all lines via readline, parse once, yield tokens
- Fix token type values (COMMENT=65, NL=66, OP=55)
- Fix NEWLINE/NL end positions and implicit newline handling
- Fix DEDENT positions via look-ahead to next non-DEDENT token
- Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → })
- Emit implicit NL before ENDMARKER when source lacks trailing newline
- Raise IndentationError from lexer errors
- Remove 13 expectedFailure marks for now-passing tests
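
The FSTRING_MIDDLE brace unescaping mentioned above ({{ → {, }} → }) could look roughly like the following. This is a hedged sketch: `unescape_fstring_braces` is a hypothetical name, and the real module may perform this step differently, for instance while splitting tokens.

```rust
/// Hypothetical helper: collapse doubled braces in the literal part of an
/// f-string, so "{{" becomes "{" and "}}" becomes "}".
fn unescape_fstring_braces(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        // Skip the second brace of a doubled pair.
        if (c == '{' || c == '}') && chars.peek() == Some(&c) {
            chars.next();
        }
    }
    out
}
```

For example, the literal text `{{x}}` comes out as `{x}`, while text without braces is returned unchanged.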

* Base implementation of _tokenize module

Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter.

* Update tokenize from v3.14.3

* Rewrite _tokenize with 2-phase model

Replace per-line reparsing with single-pass tokenization:
- Read all lines via readline, parse once, yield tokens
- Fix token type values (COMMENT=65, NL=66, OP=55)
- Fix NEWLINE/NL end positions and implicit newline handling
- Fix DEDENT positions via look-ahead to next non-DEDENT token
- Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → })
- Emit implicit NL before ENDMARKER when source lacks trailing newline
- Raise IndentationError from lexer errors
- Remove 13 expectedFailure marks for now-passing tests

---------

Co-authored-by: ShaharNaveh <shaharnaveh@users.noreply.github.com>
Co-authored-by: CPython Developers <>

youknowone requested a review from ShaharNaveh March 9, 2026 13:56

ShaharNaveh reviewed Mar 9, 2026


ShaharNaveh approved these changes Mar 9, 2026


youknowone marked this pull request as ready for review March 10, 2026 02:14

youknowone changed the title from "Implement _tokenize" to "Implement _tokenize and update tokenize from v3.14.3" Mar 10, 2026

coderabbitai Bot reviewed Mar 10, 2026


ShaharNaveh and others added 3 commits March 10, 2026 11:28

Base implementation of _tokenize module …

5c5aff8

Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter.

Update tokenize from v3.14.3

bf2b993

youknowone force-pushed the update-tokenize branch from 042aed4 to 55737ed March 10, 2026 02:28

youknowone deleted the update-tokenize branch March 10, 2026 03:27
