◐ Shell
reader mode source ↗
Skip to content

Optimize unpack, str.__add__ and fastlocals#7293

Merged
youknowone merged 9 commits into
RustPython:mainfrom
youknowone:worktree-perf-step1
Mar 2, 2026
Merged

Optimize unpack, str.__add__ and fastlocals#7293
youknowone merged 9 commits into
RustPython:mainfrom
youknowone:worktree-perf-step1

Conversation

@youknowone

@youknowone youknowone commented Mar 1, 2026

Copy link
Copy Markdown
Member

Summary by CodeRabbit

  • New Features

    • String concatenation via + supported for direct string operands
  • Performance

    • Faster integer arithmetic and range iteration via new fast paths
    • Improved comparison performance for common numeric types
    • Faster global and builtin lookups via exact-dict fast paths and caching
    • Optimized sequence unpacking and loop iteration dispatch
  • Stability

    • Reduced overhead in signal checking with a lightweight fast path

@coderabbitai

coderabbitai Bot commented Mar 1, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Adds exact-dict chaining, string + concatenation, a fast range-iterator step, extensive frame execution fast-paths (builtins cache, int ops, unpacking, FOR_ITER), consolidated comparison recursion guarding, and a read-only fast path for signal checking.

Changes

Cohort / File(s) Summary
Dict / Exact path
crates/vm/src/builtins/dict.rs
Added PyExact<PyDict> import and get_chain_exact to enable exact-dict key chaining; PyDict::get_chain delegates to the exact-path when both dicts are exact.
Range iterator
crates/vm/src/builtins/range.rs
Added PyRangeIterator::next_fast for a protocol-free fast next-value computation.
String number methods
crates/vm/src/builtins/str.rs
Added add handler to PyStr's PyNumberMethods to implement + for two PyStr operands; adjusted remainder to return NotImplemented when left is not PyStr.
Frame execution & fast-paths
crates/vm/src/frame.rs
Changed with_exec to accept &VirtualMachine; added ExecutingFrame.builtins_dict cache; added int_add/int_sub, unpack_fast, for_iter_jump_target; optimized LOAD_GLOBAL/LoadBuildClass and updated call sites (run/resume/gen_throw/yield_from_target).
Comparison recursion guard
crates/vm/src/protocol/object.rs
Consolidated recursion guard by adding _cmp_inner and wrapping _cmp with a single vm.with_recursion, removing inner per-call recursion wrappers.
Signal check fast-path
crates/vm/src/signal.rs
Early read-only check of ANY_TRIGGERED added to avoid atomic swap/cache-line invalidation when no signals are pending.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Bytecode as Bytecode/VM Dispatcher
participant Frame as ExecutingFrame
participant VM as VirtualMachine
participant Globals as Globals Dict
participant Builtins as Builtins (PyExact)

Bytecode->>Frame: LOAD_GLOBAL(name)
Frame->>VM: with_exec(vm)
Frame->>Globals: check globals dict (exact?)
alt Globals is exact and has name
    Globals-->>Frame: return value
else Globals not exact or missing
    Frame->>Builtins: consult builtins_dict cache
    alt builtins cache hit
        Builtins-->>Frame: return value
    else
        Frame->>Builtins: fallback full dict lookup (may lock)
        Builtins-->>Frame: return value or None
    end
end
Frame-->>Bytecode: push result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • ShaharNaveh

Poem

🐇 I nibble chains where dicts are exact,
I stitch two strings with a hop and tact,
Frames tuck builtins into a snug little tract,
Ints race fast, iterators skip a beat,
Signals glance quick — then quiet and neat.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: optimizations to unpacking, string concatenation via add, and fast local variable access.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Mar 1, 2026

Copy link
Copy Markdown
Contributor

Code has been automatically formatted

The code in this PR has been formatted using:

  • cargo fmt --all
    Please pull the latest changes before pushing again:
git pull origin worktree-perf-step1

@youknowone youknowone force-pushed the worktree-perf-step1 branch 3 times, most recently from b7d0f60 to b537519 Compare March 1, 2026 15:42
Push elements directly from tuple/list slice in reverse order
instead of cloning into a temporary Vec first.
Add Relaxed load guard before the Acquire swap to avoid cache-line
invalidation on every instruction dispatch when no signal is pending.
Pre-compute builtins.downcast_ref::<PyDict>() at frame entry and reuse
the cached reference in load_global_or_builtin and LoadBuildClass.
Also add get_chain_exact to skip redundant exact_dict type checks.
binary_op1 can now resolve str+str addition directly via the number
slot instead of falling through to the sequence concat path.
Address CodeRabbit review: f_locals() could access fastlocals without
synchronization when called from another thread. Use try_lock on the
state mutex so concurrent access is properly serialized.
downcast_ref::<PyDict>() matches dict subclasses, causing
get_chain_exact to bypass custom __getitem__ overrides.
Use downcast_ref_if_exact to only fast-path exact dict types.
Move the recursion depth check to wrap the entire _cmp body
instead of each individual call_cmp direction, reducing Cell
read/write pairs and scopeguard overhead per comparison.
- FOR_ITER: detect PyRangeIterator and bypass generic iterator
  protocol (atomic slot load + indirect call)
- COMPARE_OP: inline int/float comparison for exact types,
  skip rich_compare dispatch and with_recursion overhead
- BINARY_OP: inline int add/sub with i64 checked arithmetic
  to avoid BigInt heap allocation and binary_op1 dispatch
@youknowone youknowone force-pushed the worktree-perf-step1 branch from b537519 to abaad3c Compare March 1, 2026 15:46
@youknowone youknowone marked this pull request as ready for review March 1, 2026 15:46

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/frame.rs`:
- Around line 4097-4105: The fast-path for exact PyFloat in frame.rs wrongly
treats NaN as equal by using partial_cmp(...).is_some_and(...); change this so
you only use cmp_op.eval_ord(ord) when partial_cmp returns Some(ord), and if
partial_cmp returns None (NaN case) fall back to the existing rich-compare
handling instead of pushing a boolean. Concretely, inside the block that checks
a.downcast_ref_if_exact::<PyFloat>(vm) and
b.downcast_ref_if_exact::<PyFloat>(vm), test partial_cmp(...); if Some(ord) push
ctx.new_bool(cmp_op.eval_ord(ord)), otherwise call or jump to the rich_compare
path (the same logic used for non-fast-path comparisons) so that operations like
!= obey Python's NaN semantics.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6b3a5b and 125ceca.

📒 Files selected for processing (6)
  • crates/vm/src/builtins/dict.rs
  • crates/vm/src/builtins/range.rs
  • crates/vm/src/builtins/str.rs
  • crates/vm/src/frame.rs
  • crates/vm/src/protocol/object.rs
  • crates/vm/src/signal.rs

@youknowone youknowone force-pushed the worktree-perf-step1 branch from 125ceca to 29c0cf2 Compare March 1, 2026 16:05

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/frame.rs`:
- Around line 395-407: In locals(), don't assume try_lock() succeeded: check the
result of self.state.try_lock() and only perform unsafe {
self.fastlocals.borrow() } when the guard is Some; if try_lock() returns None
return early with a PyResult error (e.g. a runtime error via the VM) or an
appropriate fallback instead of unsafely reading fastlocals while another thread
may be executing the frame; update the locals() function to inspect the try_lock
outcome before the unsafe borrow to prevent data races.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 125ceca and 29c0cf2.

📒 Files selected for processing (4)
  • crates/vm/src/builtins/dict.rs
  • crates/vm/src/builtins/range.rs
  • crates/vm/src/frame.rs
  • crates/vm/src/protocol/object.rs

get_chain_exact bypasses __missing__ on dict subclasses.
Move get_chain_exact to PyExact<PyDict> impl with debug_assert,
and have get_chain delegate to it. Store builtins_dict as
Option<&PyExact<PyDict>> to enforce exact type at compile time.

Use PyRangeIterator::next_fast() instead of pub(crate) fields.
Fix comment style issues.
@youknowone youknowone force-pushed the worktree-perf-step1 branch from 29c0cf2 to 4c24868 Compare March 1, 2026 16:15

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment
♻️ Duplicate comments (1)
crates/vm/src/frame.rs (1)

395-407: ⚠️ Potential issue | 🔴 Critical

Avoid unsynchronized fastlocals access when try_lock() fails.

Line 399 allows None from try_lock(), but Line 407 still performs unsafe { self.fastlocals.borrow() }. If another thread owns the mutex, this is a data race/UB.

🐛 Proposed safety fix
 pub fn locals(&self, vm: &VirtualMachine) -> PyResult<ArgMapping> {
-    // Acquire the state mutex to synchronize with frame execution.
-    // If try_lock fails, the frame is executing on this thread (e.g.
-    // trace callback accessing f_locals), so fastlocals access is safe.
-    let _guard = self.state.try_lock();
+    // Acquire the state mutex to synchronize with frame execution.
+    // If unavailable, avoid unsynchronized fastlocals reads.
+    let _guard = match self.state.try_lock() {
+        Some(guard) => guard,
+        None => return Ok(self.locals.clone()),
+    };
@@
-        // SAFETY: Either _guard holds the state mutex (frame not executing),
-        // or we're in a trace callback on the same thread that holds it.
+        // SAFETY: `_guard` holds the state mutex, so fastlocals access is exclusive.
         let fastlocals = unsafe { self.fastlocals.borrow() };

Use this to verify there is no explicit same-thread ownership check before the unsafe borrow:

#!/bin/bash
set -euo pipefail

echo "1) locals() lock/unsafe flow"
rg -n "pub fn locals\\(|try_lock\\(|fastlocals\\.borrow\\(" crates/vm/src/frame.rs -C4

echo
echo "2) Any thread-identity guard in frame path"
rg -n "ThreadId|current_thread|owner\\.load|owner" crates/vm/src/frame.rs -C3

echo
echo "3) Mutex backend and try_lock definitions/usages"
fd lock.rs | xargs rg -n "PyMutex|try_lock|parking_lot|Mutex" -C2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/frame.rs` around lines 395 - 407, The locals() method currently
calls self.state.try_lock() and ignores a failed lock (None), but always
proceeds to unsafely access self.fastlocals.borrow(), which can race if another
thread owns the mutex; modify locals() so that when try_lock() returns None you
either (a) acquire the lock with self.state.lock() (or spin until safe) before
doing the unsafe borrow, or (b) check a same-thread/owner marker on the frame
(e.g. an owner ThreadId) and only perform unsafe { self.fastlocals.borrow() }
when the current thread is the owner; update references to _guard, try_lock(),
state, locals(), and fastlocals.borrow() accordingly to ensure the
unsynchronized path cannot perform the unsafe borrow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/vm/src/frame.rs`:
- Around line 395-407: The locals() method currently calls self.state.try_lock()
and ignores a failed lock (None), but always proceeds to unsafely access
self.fastlocals.borrow(), which can race if another thread owns the mutex;
modify locals() so that when try_lock() returns None you either (a) acquire the
lock with self.state.lock() (or spin until safe) before doing the unsafe borrow,
or (b) check a same-thread/owner marker on the frame (e.g. an owner ThreadId)
and only perform unsafe { self.fastlocals.borrow() } when the current thread is
the owner; update references to _guard, try_lock(), state, locals(), and
fastlocals.borrow() accordingly to ensure the unsynchronized path cannot perform
the unsafe borrow.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29c0cf2 and 4c24868.

📒 Files selected for processing (4)
  • crates/vm/src/builtins/dict.rs
  • crates/vm/src/builtins/range.rs
  • crates/vm/src/frame.rs
  • crates/vm/src/protocol/object.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/vm/src/builtins/range.rs
  • crates/vm/src/builtins/dict.rs

Hide details View details @youknowone youknowone merged commit 3a81f94 into RustPython:main Mar 2, 2026
23 of 24 checks passed
@youknowone youknowone deleted the worktree-perf-step1 branch March 2, 2026 03:37
youknowone added a commit to youknowone/RustPython that referenced this pull request Mar 22, 2026
* Remove intermediate Vec allocation in unpack_sequence fast path

Push elements directly from tuple/list slice in reverse order
instead of cloning into a temporary Vec first.

* Use read-only atomic load before swap in check_signals

Add Relaxed load guard before the Acquire swap to avoid cache-line
invalidation on every instruction dispatch when no signal is pending.

* Cache builtins downcast in ExecutingFrame for LOAD_GLOBAL

Pre-compute builtins.downcast_ref::<PyDict>() at frame entry and reuse
the cached reference in load_global_or_builtin and LoadBuildClass.
Also add get_chain_exact to skip redundant exact_dict type checks.

* Add number Add slot to PyStr for direct str+str dispatch

binary_op1 can now resolve str+str addition directly via the number
slot instead of falling through to the sequence concat path.

* Guard FastLocals access in locals() with try_lock on state mutex

Address CodeRabbit review: f_locals() could access fastlocals without
synchronization when called from another thread. Use try_lock on the
state mutex so concurrent access is properly serialized.

* Use exact type check for builtins_dict cache

downcast_ref::<PyDict>() matches dict subclasses, causing
get_chain_exact to bypass custom __getitem__ overrides.
Use downcast_ref_if_exact to only fast-path exact dict types.

* Consolidate with_recursion in _cmp to single guard

Move the recursion depth check to wrap the entire _cmp body
instead of each individual call_cmp direction, reducing Cell
read/write pairs and scopeguard overhead per comparison.

* Add opcode-level fast paths for FOR_ITER, COMPARE_OP, BINARY_OP

- FOR_ITER: detect PyRangeIterator and bypass generic iterator
  protocol (atomic slot load + indirect call)
- COMPARE_OP: inline int/float comparison for exact types,
  skip rich_compare dispatch and with_recursion overhead
- BINARY_OP: inline int add/sub with i64 checked arithmetic
  to avoid BigInt heap allocation and binary_op1 dispatch

* Also check globals is exact dict for LOAD_GLOBAL fast path

get_chain_exact bypasses __missing__ on dict subclasses.
Move get_chain_exact to PyExact<PyDict> impl with debug_assert,
and have get_chain delegate to it. Store builtins_dict as
Option<&PyExact<PyDict>> to enforce exact type at compile time.

Use PyRangeIterator::next_fast() instead of pub(crate) fields.
Fix comment style issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant