Optimize unpack, str.__add__ and fastlocals#7293
Conversation
📝 WalkthroughWalkthroughAdds exact-dict chaining, string Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Sorry, something went wrong.
|
Code has been automatically formatted The code in this PR has been formatted using:
git pull origin worktree-perf-step1 |
Sorry, something went wrong.
b7d0f60 to
b537519
Compare
March 1, 2026 15:42
Push elements directly from tuple/list slice in reverse order instead of cloning into a temporary Vec first.
Add Relaxed load guard before the Acquire swap to avoid cache-line invalidation on every instruction dispatch when no signal is pending.
Pre-compute builtins.downcast_ref::<PyDict>() at frame entry and reuse the cached reference in load_global_or_builtin and LoadBuildClass. Also add get_chain_exact to skip redundant exact_dict type checks.
binary_op1 can now resolve str+str addition directly via the number slot instead of falling through to the sequence concat path.
Address CodeRabbit review: f_locals() could access fastlocals without synchronization when called from another thread. Use try_lock on the state mutex so concurrent access is properly serialized.
downcast_ref::<PyDict>() matches dict subclasses, causing get_chain_exact to bypass custom __getitem__ overrides. Use downcast_ref_if_exact to only fast-path exact dict types.
Move the recursion depth check to wrap the entire _cmp body instead of each individual call_cmp direction, reducing Cell read/write pairs and scopeguard overhead per comparison.
- FOR_ITER: detect PyRangeIterator and bypass generic iterator protocol (atomic slot load + indirect call) - COMPARE_OP: inline int/float comparison for exact types, skip rich_compare dispatch and with_recursion overhead - BINARY_OP: inline int add/sub with i64 checked arithmetic to avoid BigInt heap allocation and binary_op1 dispatch
b537519 to
abaad3c
Compare
March 1, 2026 15:46
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/vm/src/frame.rs`:
- Around line 4097-4105: The fast-path for exact PyFloat in frame.rs wrongly
treats NaN as equal by using partial_cmp(...).is_some_and(...); change this so
you only use cmp_op.eval_ord(ord) when partial_cmp returns Some(ord), and if
partial_cmp returns None (NaN case) fall back to the existing rich-compare
handling instead of pushing a boolean. Concretely, inside the block that checks
a.downcast_ref_if_exact::<PyFloat>(vm) and
b.downcast_ref_if_exact::<PyFloat>(vm), test partial_cmp(...); if Some(ord) push
ctx.new_bool(cmp_op.eval_ord(ord)), otherwise call or jump to the rich_compare
path (the same logic used for non-fast-path comparisons) so that operations like
!= obey Python's NaN semantics.
ℹ️ Review info
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
crates/vm/src/builtins/dict.rscrates/vm/src/builtins/range.rscrates/vm/src/builtins/str.rscrates/vm/src/frame.rscrates/vm/src/protocol/object.rscrates/vm/src/signal.rs
Sorry, something went wrong.
125ceca to
29c0cf2
Compare
March 1, 2026 16:05
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/vm/src/frame.rs`:
- Around line 395-407: In locals(), don't assume try_lock() succeeded: check the
result of self.state.try_lock() and only perform unsafe {
self.fastlocals.borrow() } when the guard is Some; if try_lock() returns None
return early with a PyResult error (e.g. a runtime error via the VM) or an
appropriate fallback instead of unsafely reading fastlocals while another thread
may be executing the frame; update the locals() function to inspect the try_lock
outcome before the unsafe borrow to prevent data races.
ℹ️ Review info
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
crates/vm/src/builtins/dict.rscrates/vm/src/builtins/range.rscrates/vm/src/frame.rscrates/vm/src/protocol/object.rs
Sorry, something went wrong.
get_chain_exact bypasses __missing__ on dict subclasses. Move get_chain_exact to PyExact<PyDict> impl with debug_assert, and have get_chain delegate to it. Store builtins_dict as Option<&PyExact<PyDict>> to enforce exact type at compile time. Use PyRangeIterator::next_fast() instead of pub(crate) fields. Fix comment style issues.
29c0cf2 to
4c24868
Compare
March 1, 2026 16:15
There was a problem hiding this comment.
♻️ Duplicate comments (1)
crates/vm/src/frame.rs (1)
395-407: ⚠️ Potential issue | 🔴 CriticalAvoid unsynchronized
fastlocalsaccess whentry_lock()fails.Line 399 allows
Nonefromtry_lock(), but Line 407 still performsunsafe { self.fastlocals.borrow() }. If another thread owns the mutex, this is a data race/UB.🐛 Proposed safety fix
pub fn locals(&self, vm: &VirtualMachine) -> PyResult<ArgMapping> { - // Acquire the state mutex to synchronize with frame execution. - // If try_lock fails, the frame is executing on this thread (e.g. - // trace callback accessing f_locals), so fastlocals access is safe. - let _guard = self.state.try_lock(); + // Acquire the state mutex to synchronize with frame execution. + // If unavailable, avoid unsynchronized fastlocals reads. + let _guard = match self.state.try_lock() { + Some(guard) => guard, + None => return Ok(self.locals.clone()), + }; @@ - // SAFETY: Either _guard holds the state mutex (frame not executing), - // or we're in a trace callback on the same thread that holds it. + // SAFETY: `_guard` holds the state mutex, so fastlocals access is exclusive. let fastlocals = unsafe { self.fastlocals.borrow() };Use this to verify there is no explicit same-thread ownership check before the unsafe borrow:
#!/bin/bash set -euo pipefail echo "1) locals() lock/unsafe flow" rg -n "pub fn locals\\(|try_lock\\(|fastlocals\\.borrow\\(" crates/vm/src/frame.rs -C4 echo echo "2) Any thread-identity guard in frame path" rg -n "ThreadId|current_thread|owner\\.load|owner" crates/vm/src/frame.rs -C3 echo echo "3) Mutex backend and try_lock definitions/usages" fd lock.rs | xargs rg -n "PyMutex|try_lock|parking_lot|Mutex" -C2🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/frame.rs` around lines 395 - 407, The locals() method currently calls self.state.try_lock() and ignores a failed lock (None), but always proceeds to unsafely access self.fastlocals.borrow(), which can race if another thread owns the mutex; modify locals() so that when try_lock() returns None you either (a) acquire the lock with self.state.lock() (or spin until safe) before doing the unsafe borrow, or (b) check a same-thread/owner marker on the frame (e.g. an owner ThreadId) and only perform unsafe { self.fastlocals.borrow() } when the current thread is the owner; update references to _guard, try_lock(), state, locals(), and fastlocals.borrow() accordingly to ensure the unsynchronized path cannot perform the unsafe borrow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@crates/vm/src/frame.rs`:
- Around line 395-407: The locals() method currently calls self.state.try_lock()
and ignores a failed lock (None), but always proceeds to unsafely access
self.fastlocals.borrow(), which can race if another thread owns the mutex;
modify locals() so that when try_lock() returns None you either (a) acquire the
lock with self.state.lock() (or spin until safe) before doing the unsafe borrow,
or (b) check a same-thread/owner marker on the frame (e.g. an owner ThreadId)
and only perform unsafe { self.fastlocals.borrow() } when the current thread is
the owner; update references to _guard, try_lock(), state, locals(), and
fastlocals.borrow() accordingly to ensure the unsynchronized path cannot perform
the unsafe borrow.
ℹ️ Review info
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
crates/vm/src/builtins/dict.rscrates/vm/src/builtins/range.rscrates/vm/src/frame.rscrates/vm/src/protocol/object.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/vm/src/builtins/range.rs
- crates/vm/src/builtins/dict.rs
Sorry, something went wrong.
3a81f94
into
RustPython:main
Mar 2, 2026
* Remove intermediate Vec allocation in unpack_sequence fast path Push elements directly from tuple/list slice in reverse order instead of cloning into a temporary Vec first. * Use read-only atomic load before swap in check_signals Add Relaxed load guard before the Acquire swap to avoid cache-line invalidation on every instruction dispatch when no signal is pending. * Cache builtins downcast in ExecutingFrame for LOAD_GLOBAL Pre-compute builtins.downcast_ref::<PyDict>() at frame entry and reuse the cached reference in load_global_or_builtin and LoadBuildClass. Also add get_chain_exact to skip redundant exact_dict type checks. * Add number Add slot to PyStr for direct str+str dispatch binary_op1 can now resolve str+str addition directly via the number slot instead of falling through to the sequence concat path. * Guard FastLocals access in locals() with try_lock on state mutex Address CodeRabbit review: f_locals() could access fastlocals without synchronization when called from another thread. Use try_lock on the state mutex so concurrent access is properly serialized. * Use exact type check for builtins_dict cache downcast_ref::<PyDict>() matches dict subclasses, causing get_chain_exact to bypass custom __getitem__ overrides. Use downcast_ref_if_exact to only fast-path exact dict types. * Consolidate with_recursion in _cmp to single guard Move the recursion depth check to wrap the entire _cmp body instead of each individual call_cmp direction, reducing Cell read/write pairs and scopeguard overhead per comparison. * Add opcode-level fast paths for FOR_ITER, COMPARE_OP, BINARY_OP - FOR_ITER: detect PyRangeIterator and bypass generic iterator protocol (atomic slot load + indirect call) - COMPARE_OP: inline int/float comparison for exact types, skip rich_compare dispatch and with_recursion overhead - BINARY_OP: inline int add/sub with i64 checked arithmetic to avoid BigInt heap allocation and binary_op1 dispatch * Also check globals is exact dict for LOAD_GLOBAL fast path get_chain_exact bypasses __missing__ on dict subclasses. Move get_chain_exact to PyExact<PyDict> impl with debug_assert, and have get_chain delegate to it. Store builtins_dict as Option<&PyExact<PyDict>> to enforce exact type at compile time. Use PyRangeIterator::next_fast() instead of pub(crate) fields. Fix comment style issues.
Summary by CodeRabbit
New Features
Performance
Stability