Optimize unpack, str.__add__ and fastlocals by youknowone · Pull Request #7293 · RustPython/RustPython

youknowone

This was referenced

Mar 2, 2026

youknowone added a commit to youknowone/RustPython that referenced this pull request on Mar 22, 2026

* Remove intermediate Vec allocation in unpack_sequence fast path

Push elements directly from tuple/list slice in reverse order
instead of cloning into a temporary Vec first.
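The idea can be sketched as follows. This is a minimal illustration, not the RustPython code: `push_unpacked`, the generic element type, and the `String` error are hypothetical stand-ins for the frame's value stack and its exception type.

```rust
/// Push `elements` onto `stack` in reverse, so the first element ends up on
/// top of the stack, without building a temporary Vec first.
/// (Hypothetical helper; the real fast path works on the VM's object stack.)
fn push_unpacked<T: Clone>(
    stack: &mut Vec<T>,
    elements: &[T],
    expected: usize,
) -> Result<(), String> {
    if elements.len() != expected {
        return Err(format!(
            "expected {} values to unpack, got {}",
            expected,
            elements.len()
        ));
    }
    // Reserve once, then clone each element straight from the borrowed slice.
    stack.reserve(expected);
    stack.extend(elements.iter().rev().cloned());
    Ok(())
}
```

Unpacking `[1, 2, 3]` onto a stack leaves `1` on top, so the stores that follow pop the targets in source order.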

* Use read-only atomic load before swap in check_signals

Add Relaxed load guard before the Acquire swap to avoid cache-line
invalidation on every instruction dispatch when no signal is pending.
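A minimal sketch of the load-then-swap pattern, assuming the pending-signal flag is an `AtomicBool`. The function name `take_pending_signal` is hypothetical and does not come from the PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Returns true if a signal was pending, clearing the flag in that case.
fn take_pending_signal(flag: &AtomicBool) -> bool {
    // Cheap read-only check: in the common case (nothing pending) the cache
    // line stays shared and no write is issued.
    if !flag.load(Ordering::Relaxed) {
        return false;
    }
    // Only when the flag looks set do we pay for the read-modify-write.
    // The swap result is authoritative: another thread may have cleared the
    // flag between the load and the swap.
    flag.swap(false, Ordering::Acquire)
}
```

The Relaxed load may observe a slightly stale value, which in this sketch only delays handling until a later call.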

* Cache builtins downcast in ExecutingFrame for LOAD_GLOBAL

Pre-compute builtins.downcast_ref::<PyDict>() at frame entry and reuse
the cached reference in load_global_or_builtin and LoadBuildClass.
Also add get_chain_exact to skip redundant exact_dict type checks.

* Add number Add slot to PyStr for direct str+str dispatch

binary_op1 can now resolve str+str addition directly via the number
slot instead of falling through to the sequence concat path.

* Guard FastLocals access in locals() with try_lock on state mutex

Address CodeRabbit review: f_locals() could access fastlocals without
synchronization when called from another thread. Use try_lock on the
state mutex so concurrent access is properly serialized.
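The shape of a `try_lock` guard can be sketched with `std::sync::Mutex`. The `Frame` struct, its `state` field type, and the choice to return `None` on contention are assumptions made for illustration; the PR text does not say what RustPython does when the lock is already held.

```rust
use std::sync::{Mutex, TryLockError};

/// Hypothetical frame whose local slots are protected by a state mutex.
struct Frame {
    state: Mutex<Vec<i64>>,
}

/// Copy the locals only if the state lock can be taken without blocking.
fn snapshot_locals(frame: &Frame) -> Option<Vec<i64>> {
    match frame.state.try_lock() {
        // Lock acquired: reading the slots is serialized with other users.
        Ok(guard) => Some((*guard).clone()),
        // Someone else holds the lock: do not touch the slots.
        Err(TryLockError::WouldBlock) => None,
        // A panic poisoned the lock; the data is still readable.
        Err(TryLockError::Poisoned(poisoned)) => Some((*poisoned.into_inner()).clone()),
    }
}
```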

* Use exact type check for builtins_dict cache

downcast_ref::<PyDict>() matches dict subclasses, causing
get_chain_exact to bypass custom __getitem__ overrides.
Use downcast_ref_if_exact to only fast-path exact dict types.

* Consolidate with_recursion in _cmp to single guard

Move the recursion depth check to wrap the entire _cmp body
instead of each individual call_cmp direction, reducing Cell
read/write pairs and scopeguard overhead per comparison.
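A rough sketch of the consolidation, using a thread-local `Cell` as the depth counter. All names, the limit of 1000, and the entry counter are hypothetical, and this simplified guard is not panic-safe the way a scopeguard-based one would be.

```rust
use std::cell::Cell;

thread_local! {
    // Current nesting depth, plus a count of guard entries kept only to make
    // the saving observable in this sketch.
    static DEPTH: Cell<usize> = Cell::new(0);
    static GUARD_ENTRIES: Cell<usize> = Cell::new(0);
}

const RECURSION_LIMIT: usize = 1000; // assumed value

/// Run `f` one level deeper, failing if the limit is already reached.
fn with_recursion<R>(f: impl FnOnce() -> R) -> Result<R, &'static str> {
    let depth = DEPTH.with(|d| d.get());
    if depth >= RECURSION_LIMIT {
        return Err("maximum recursion depth exceeded");
    }
    DEPTH.with(|d| d.set(depth + 1));
    GUARD_ENTRIES.with(|c| c.set(c.get() + 1));
    let result = f();
    DEPTH.with(|d| d.set(depth));
    Ok(result)
}

fn guard_entries() -> usize {
    GUARD_ENTRIES.with(|c| c.get())
}

fn current_depth() -> usize {
    DEPTH.with(|d| d.get())
}

/// One guard wraps both directions: try `forward(a, b)` first and fall back
/// to `reflected(b, a)` when the forward call reports "not implemented".
fn cmp_single_guard(
    a: i64,
    b: i64,
    forward: impl Fn(i64, i64) -> Option<bool>,
    reflected: impl Fn(i64, i64) -> Option<bool>,
) -> Result<Option<bool>, &'static str> {
    with_recursion(|| forward(a, b).or_else(|| reflected(b, a)))
}
```

With a guard per direction, a comparison that falls back to the reflected call would enter the guard twice; here it enters once.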

* Add opcode-level fast paths for FOR_ITER, COMPARE_OP, BINARY_OP

- FOR_ITER: detect PyRangeIterator and bypass generic iterator
  protocol (atomic slot load + indirect call)
- COMPARE_OP: inline int/float comparison for exact types,
  skip rich_compare dispatch and with_recursion overhead
- BINARY_OP: inline int add/sub with i64 checked arithmetic
  to avoid BigInt heap allocation and binary_op1 dispatch
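The BINARY_OP case can be illustrated with `i64::checked_add` and `i64::checked_sub`. In this sketch the `Int` enum is hypothetical and `i128` stands in for the arbitrary-precision slow path; the real code falls back to BigInt arithmetic.

```rust
/// Hypothetical result type: `Small` is the allocation-free fast path,
/// `Big` stands in for the BigInt slow path.
#[derive(Debug, PartialEq)]
enum Int {
    Small(i64),
    Big(i128),
}

fn add_ints(a: i64, b: i64) -> Int {
    match a.checked_add(b) {
        // No overflow: stay in machine-word arithmetic.
        Some(sum) => Int::Small(sum),
        // Overflow: hand over to the wider (here: i128) representation.
        None => Int::Big(i128::from(a) + i128::from(b)),
    }
}

fn sub_ints(a: i64, b: i64) -> Int {
    match a.checked_sub(b) {
        Some(diff) => Int::Small(diff),
        None => Int::Big(i128::from(a) - i128::from(b)),
    }
}
```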

* Also check globals is exact dict for LOAD_GLOBAL fast path

get_chain_exact bypasses __missing__ on dict subclasses.
Move get_chain_exact to PyExact<PyDict> impl with debug_assert,
and have get_chain delegate to it. Store builtins_dict as
Option<&PyExact<PyDict>> to enforce exact type at compile time.

Use PyRangeIterator::next_fast() instead of pub(crate) fields.
Fix comment style issues.
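The compile-time enforcement described for `PyExact<PyDict>` can be sketched with a wrapper type whose only constructor performs the exactness check. Everything below is a simplified stand-in: `Dict`, `ExactDict`, the `is_exact` flag, and `downcast_exact` are hypothetical and do not mirror RustPython's actual types or signatures.

```rust
mod exact {
    use std::collections::HashMap;

    /// Simplified dict object; `is_exact` is false for subclass instances.
    pub struct Dict {
        pub is_exact: bool,
        pub items: HashMap<String, i64>,
    }

    /// Proof that the wrapped dict is not a subclass. The field is private,
    /// so code outside this module must go through `downcast_exact`.
    pub struct ExactDict<'a>(&'a Dict);

    impl Dict {
        /// Only exact dicts can be wrapped; subclasses yield `None` and must
        /// take the generic lookup path that honors their overrides.
        pub fn downcast_exact(&self) -> Option<ExactDict<'_>> {
            self.is_exact.then(|| ExactDict(self))
        }
    }

    impl<'a> ExactDict<'a> {
        /// Look up `key` in `self`, then in `other`, with no per-call type
        /// checks: both operands are exact by construction.
        pub fn get_chain_exact(&self, other: &ExactDict<'_>, key: &str) -> Option<i64> {
            debug_assert!(self.0.is_exact && other.0.is_exact);
            self.0
                .items
                .get(key)
                .or_else(|| other.0.items.get(key))
                .copied()
        }
    }
}
```

A frame could then cache the result of the exactness check once at entry and reuse it for every global lookup, while a dict subclass never reaches the fast path.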
