Optimize unpack, str.add and fastlocals#7293

youknowone

Summary by CodeRabbit

New Features
- String concatenation via + supported for direct string operands
Performance
- Faster integer arithmetic and range iteration via new fast paths
- Improved comparison performance for common numeric types
- Faster global and builtin lookups via exact-dict fast paths and caching
- Optimized sequence unpacking and loop iteration dispatch
Stability
- Reduced overhead in signal checking with a lightweight fast path

coderabbitai

📝 Walkthrough

Walkthrough

Adds exact-dict chaining, string + concatenation, a fast range-iterator step, extensive frame execution fast-paths (builtins cache, int ops, unpacking, FOR_ITER), consolidated comparison recursion guarding, and a read-only fast path for signal checking.

Changes

Cohort / File(s)	Summary
Dict / Exact path `crates/vm/src/builtins/dict.rs`	Added `PyExact<PyDict>` import and `get_chain_exact` to enable exact-dict key chaining; `PyDict::get_chain` delegates to the exact-path when both dicts are exact.
Range iterator `crates/vm/src/builtins/range.rs`	Added `PyRangeIterator::next_fast` for a protocol-free fast next-value computation.
String number methods `crates/vm/src/builtins/str.rs`	Added `add` handler to `PyStr`'s `PyNumberMethods` to implement `+` for two `PyStr` operands; adjusted `remainder` to return `NotImplemented` when left is not `PyStr`.
Frame execution & fast-paths `crates/vm/src/frame.rs`	Changed `with_exec` to accept `&VirtualMachine`; added `ExecutingFrame.builtins_dict` cache; added `int_add`/`int_sub`, `unpack_fast`, `for_iter_jump_target`; optimized LOAD_GLOBAL/LoadBuildClass and updated call sites (run/resume/gen_throw/yield_from_target).
Comparison recursion guard `crates/vm/src/protocol/object.rs`	Consolidated recursion guard by adding `_cmp_inner` and wrapping `_cmp` with a single `vm.with_recursion`, removing inner per-call recursion wrappers.
Signal check fast-path `crates/vm/src/signal.rs`	Early read-only check of `ANY_TRIGGERED` added to avoid atomic swap/cache-line invalidation when no signals are pending.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Bytecode as Bytecode/VM Dispatcher
participant Frame as ExecutingFrame
participant VM as VirtualMachine
participant Globals as Globals Dict
participant Builtins as Builtins (PyExact)

Bytecode->>Frame: LOAD_GLOBAL(name)
Frame->>VM: with_exec(vm)
Frame->>Globals: check globals dict (exact?)
alt Globals is exact and has name
    Globals-->>Frame: return value
else Globals not exact or missing
    Frame->>Builtins: consult builtins_dict cache
    alt builtins cache hit
        Builtins-->>Frame: return value
    else
        Frame->>Builtins: fallback full dict lookup (may lock)
        Builtins-->>Frame: return value or None
    end
end
Frame-->>Bytecode: push result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Optimize fast_locals and atomic ordering #7289: Related changes to frame local-access concurrency and fastlocals locking patterns.
Fix unpack msg #7175: Related unpacking fast-path changes for exact tuple/list handling used by unpack_fast.
Optimize coroutine exception handling and fix gen_throw traceback #7166: Related core frame execution API changes (with_exec/ExecutingFrame) and call-site updates.

Suggested reviewers

ShaharNaveh

Poem

🐇 I nibble chains where dicts are exact,
I stitch two strings with a hop and tact,
Frames tuck builtins into a snug little tract,
Ints race fast, iterators skip a beat,
Signals glance quick — then quiet and neat.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: optimizations to unpacking, string concatenation via add, and fast local variable access.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions

Code has been automatically formatted

The code in this PR has been formatted using:

cargo fmt --all
Please pull the latest changes before pushing again:

git pull origin worktree-perf-step1

Push elements directly from tuple/list slice in reverse order instead of cloning into a temporary Vec first.

Add Relaxed load guard before the Acquire swap to avoid cache-line invalidation on every instruction dispatch when no signal is pending.

Pre-compute builtins.downcast_ref::<PyDict>() at frame entry and reuse the cached reference in load_global_or_builtin and LoadBuildClass. Also add get_chain_exact to skip redundant exact_dict type checks.

binary_op1 can now resolve str+str addition directly via the number slot instead of falling through to the sequence concat path.

Address CodeRabbit review: f_locals() could access fastlocals without synchronization when called from another thread. Use try_lock on the state mutex so concurrent access is properly serialized.

downcast_ref::<PyDict>() matches dict subclasses, causing get_chain_exact to bypass custom __getitem__ overrides. Use downcast_ref_if_exact to only fast-path exact dict types.

Move the recursion depth check to wrap the entire _cmp body instead of each individual call_cmp direction, reducing Cell read/write pairs and scopeguard overhead per comparison.

- FOR_ITER: detect PyRangeIterator and bypass generic iterator protocol (atomic slot load + indirect call) - COMPARE_OP: inline int/float comparison for exact types, skip rich_compare dispatch and with_recursion overhead - BINARY_OP: inline int add/sub with i64 checked arithmetic to avoid BigInt heap allocation and binary_op1 dispatch

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/frame.rs`:
- Around line 4097-4105: The fast-path for exact PyFloat in frame.rs wrongly
treats NaN as equal by using partial_cmp(...).is_some_and(...); change this so
you only use cmp_op.eval_ord(ord) when partial_cmp returns Some(ord), and if
partial_cmp returns None (NaN case) fall back to the existing rich-compare
handling instead of pushing a boolean. Concretely, inside the block that checks
a.downcast_ref_if_exact::<PyFloat>(vm) and
b.downcast_ref_if_exact::<PyFloat>(vm), test partial_cmp(...); if Some(ord) push
ctx.new_bool(cmp_op.eval_ord(ord)), otherwise call or jump to the rich_compare
path (the same logic used for non-fast-path comparisons) so that operations like
!= obey Python's NaN semantics.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6b3a5b and 125ceca.

📒 Files selected for processing (6)

crates/vm/src/builtins/dict.rs
crates/vm/src/builtins/range.rs
crates/vm/src/builtins/str.rs
crates/vm/src/frame.rs
crates/vm/src/protocol/object.rs
crates/vm/src/signal.rs

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/frame.rs`:
- Around line 395-407: In locals(), don't assume try_lock() succeeded: check the
result of self.state.try_lock() and only perform unsafe {
self.fastlocals.borrow() } when the guard is Some; if try_lock() returns None
return early with a PyResult error (e.g. a runtime error via the VM) or an
appropriate fallback instead of unsafely reading fastlocals while another thread
may be executing the frame; update the locals() function to inspect the try_lock
outcome before the unsafe borrow to prevent data races.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 125ceca and 29c0cf2.

📒 Files selected for processing (4)

crates/vm/src/builtins/dict.rs
crates/vm/src/builtins/range.rs
crates/vm/src/frame.rs
crates/vm/src/protocol/object.rs

get_chain_exact bypasses __missing__ on dict subclasses. Move get_chain_exact to PyExact<PyDict> impl with debug_assert, and have get_chain delegate to it. Store builtins_dict as Option<&PyExact<PyDict>> to enforce exact type at compile time. Use PyRangeIterator::next_fast() instead of pub(crate) fields. Fix comment style issues.

coderabbitai

♻️ Duplicate comments (1)

crates/vm/src/frame.rs (1)

395-407: ⚠️ Potential issue | 🔴 Critical

Avoid unsynchronized fastlocals access when try_lock() fails.

Line 399 allows None from try_lock(), but Line 407 still performs unsafe { self.fastlocals.borrow() }. If another thread owns the mutex, this is a data race/UB.

🐛 Proposed safety fix

 pub fn locals(&self, vm: &VirtualMachine) -> PyResult<ArgMapping> {
-    // Acquire the state mutex to synchronize with frame execution.
-    // If try_lock fails, the frame is executing on this thread (e.g.
-    // trace callback accessing f_locals), so fastlocals access is safe.
-    let _guard = self.state.try_lock();
+    // Acquire the state mutex to synchronize with frame execution.
+    // If unavailable, avoid unsynchronized fastlocals reads.
+    let _guard = match self.state.try_lock() {
+        Some(guard) => guard,
+        None => return Ok(self.locals.clone()),
+    };
@@
-        // SAFETY: Either _guard holds the state mutex (frame not executing),
-        // or we're in a trace callback on the same thread that holds it.
+        // SAFETY: `_guard` holds the state mutex, so fastlocals access is exclusive.
         let fastlocals = unsafe { self.fastlocals.borrow() };

Use this to verify there is no explicit same-thread ownership check before the unsafe borrow:

#!/bin/bash
set -euo pipefail

echo "1) locals() lock/unsafe flow"
rg -n "pub fn locals\\(|try_lock\\(|fastlocals\\.borrow\\(" crates/vm/src/frame.rs -C4

echo
echo "2) Any thread-identity guard in frame path"
rg -n "ThreadId|current_thread|owner\\.load|owner" crates/vm/src/frame.rs -C3

echo
echo "3) Mutex backend and try_lock definitions/usages"
fd lock.rs | xargs rg -n "PyMutex|try_lock|parking_lot|Mutex" -C2

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/frame.rs` around lines 395 - 407, The locals() method currently
calls self.state.try_lock() and ignores a failed lock (None), but always
proceeds to unsafely access self.fastlocals.borrow(), which can race if another
thread owns the mutex; modify locals() so that when try_lock() returns None you
either (a) acquire the lock with self.state.lock() (or spin until safe) before
doing the unsafe borrow, or (b) check a same-thread/owner marker on the frame
(e.g. an owner ThreadId) and only perform unsafe { self.fastlocals.borrow() }
when the current thread is the owner; update references to _guard, try_lock(),
state, locals(), and fastlocals.borrow() accordingly to ensure the
unsynchronized path cannot perform the unsafe borrow.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/vm/src/frame.rs`:
- Around line 395-407: The locals() method currently calls self.state.try_lock()
and ignores a failed lock (None), but always proceeds to unsafely access
self.fastlocals.borrow(), which can race if another thread owns the mutex;
modify locals() so that when try_lock() returns None you either (a) acquire the
lock with self.state.lock() (or spin until safe) before doing the unsafe borrow,
or (b) check a same-thread/owner marker on the frame (e.g. an owner ThreadId)
and only perform unsafe { self.fastlocals.borrow() } when the current thread is
the owner; update references to _guard, try_lock(), state, locals(), and
fastlocals.borrow() accordingly to ensure the unsynchronized path cannot perform
the unsafe borrow.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29c0cf2 and 4c24868.

📒 Files selected for processing (4)

crates/vm/src/builtins/dict.rs
crates/vm/src/builtins/range.rs
crates/vm/src/frame.rs
crates/vm/src/protocol/object.rs

🚧 Files skipped from review as they are similar to previous changes (2)

crates/vm/src/builtins/range.rs
crates/vm/src/builtins/dict.rs

* Remove intermediate Vec allocation in unpack_sequence fast path Push elements directly from tuple/list slice in reverse order instead of cloning into a temporary Vec first. * Use read-only atomic load before swap in check_signals Add Relaxed load guard before the Acquire swap to avoid cache-line invalidation on every instruction dispatch when no signal is pending. * Cache builtins downcast in ExecutingFrame for LOAD_GLOBAL Pre-compute builtins.downcast_ref::<PyDict>() at frame entry and reuse the cached reference in load_global_or_builtin and LoadBuildClass. Also add get_chain_exact to skip redundant exact_dict type checks. * Add number Add slot to PyStr for direct str+str dispatch binary_op1 can now resolve str+str addition directly via the number slot instead of falling through to the sequence concat path. * Guard FastLocals access in locals() with try_lock on state mutex Address CodeRabbit review: f_locals() could access fastlocals without synchronization when called from another thread. Use try_lock on the state mutex so concurrent access is properly serialized. * Use exact type check for builtins_dict cache downcast_ref::<PyDict>() matches dict subclasses, causing get_chain_exact to bypass custom __getitem__ overrides. Use downcast_ref_if_exact to only fast-path exact dict types. * Consolidate with_recursion in _cmp to single guard Move the recursion depth check to wrap the entire _cmp body instead of each individual call_cmp direction, reducing Cell read/write pairs and scopeguard overhead per comparison. * Add opcode-level fast paths for FOR_ITER, COMPARE_OP, BINARY_OP - FOR_ITER: detect PyRangeIterator and bypass generic iterator protocol (atomic slot load + indirect call) - COMPARE_OP: inline int/float comparison for exact types, skip rich_compare dispatch and with_recursion overhead - BINARY_OP: inline int add/sub with i64 checked arithmetic to avoid BigInt heap allocation and binary_op1 dispatch * Also check globals is exact dict for LOAD_GLOBAL fast path get_chain_exact bypasses __missing__ on dict subclasses. Move get_chain_exact to PyExact<PyDict> impl with debug_assert, and have get_chain delegate to it. Store builtins_dict as Option<&PyExact<PyDict>> to enforce exact type at compile time. Use PyRangeIterator::next_fast() instead of pub(crate) fields. Fix comment style issues.

youknowone force-pushed the worktree-perf-step1 branch 3 times, most recently from b7d0f60 to b537519 Compare March 1, 2026 15:42

youknowone added 8 commits March 2, 2026 00:45

Remove intermediate Vec allocation in unpack_sequence fast path …

ef3c5ef

Push elements directly from tuple/list slice in reverse order instead of cloning into a temporary Vec first.

Use read-only atomic load before swap in check_signals …

b6760e1

Add Relaxed load guard before the Acquire swap to avoid cache-line invalidation on every instruction dispatch when no signal is pending.

Cache builtins downcast in ExecutingFrame for LOAD_GLOBAL …

98ea46e

Pre-compute builtins.downcast_ref::<PyDict>() at frame entry and reuse the cached reference in load_global_or_builtin and LoadBuildClass. Also add get_chain_exact to skip redundant exact_dict type checks.

Add number Add slot to PyStr for direct str+str dispatch …

f3a8d8d

binary_op1 can now resolve str+str addition directly via the number slot instead of falling through to the sequence concat path.

Guard FastLocals access in locals() with try_lock on state mutex …

67ed603

Address CodeRabbit review: f_locals() could access fastlocals without synchronization when called from another thread. Use try_lock on the state mutex so concurrent access is properly serialized.

Use exact type check for builtins_dict cache …

b1cf47c

downcast_ref::<PyDict>() matches dict subclasses, causing get_chain_exact to bypass custom __getitem__ overrides. Use downcast_ref_if_exact to only fast-path exact dict types.

Consolidate with_recursion in _cmp to single guard …

0af466c

Move the recursion depth check to wrap the entire _cmp body instead of each individual call_cmp direction, reducing Cell read/write pairs and scopeguard overhead per comparison.

youknowone force-pushed the worktree-perf-step1 branch from b537519 to abaad3c Compare March 1, 2026 15:46

youknowone marked this pull request as ready for review March 1, 2026 15:46

coderabbitai Bot reviewed Mar 1, 2026

View reviewed changes

youknowone force-pushed the worktree-perf-step1 branch from 125ceca to 29c0cf2 Compare March 1, 2026 16:05

coderabbitai Bot reviewed Mar 1, 2026

View reviewed changes

youknowone force-pushed the worktree-perf-step1 branch from 29c0cf2 to 4c24868 Compare March 1, 2026 16:15

coderabbitai Bot reviewed Mar 1, 2026

View reviewed changes

youknowone deleted the worktree-perf-step1 branch March 2, 2026 03:37

This was referenced Mar 2, 2026

Lazy locals dict for NEWLOCALS frames #7307

Merged

More fork safety #7380

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize unpack, str.add and fastlocals#7293