
Remove Frame mutex and use DataStack bump allocator for LocalsPlus by youknowone · Pull Request #7333 · RustPython/RustPython

@youknowone changed the title from "Worktree frame datastack" to "frame datastack"

Mar 3, 2026

coderabbitai[bot]


@youknowone changed the title from "frame datastack" to "Remove Frame mutex and use DataStack bump allocator for LocalsPlus"

Mar 4, 2026

youknowone added a commit to youknowone/RustPython that referenced this pull request

Mar 22, 2026
…ustPython#7333)

* Remove PyMutex<FrameState> from Frame, use UnsafeCell fields directly

Move stack, cells_frees, prev_line out of the mutex-protected FrameState
into Frame as FrameUnsafeCell fields. This eliminates mutex lock/unlock
overhead on every frame execution (with_exec).

Safety relies on the same single-threaded execution guarantee that
FastLocals already uses.
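
A minimal sketch of the idea, not the code from this PR: per-frame state lives in `UnsafeCell` fields and is reached through short-lived borrows rather than a lock. The `Frame` here is a hypothetical stand-in with a `Vec<i64>` stack; the accessor names are illustrative only.

```rust
use std::cell::UnsafeCell;

/// Hypothetical stand-in for a frame whose hot fields used to sit behind a mutex.
pub struct Frame {
    stack: UnsafeCell<Vec<i64>>,
    prev_line: UnsafeCell<u32>,
}

impl Frame {
    pub fn new() -> Self {
        Frame {
            stack: UnsafeCell::new(Vec::new()),
            prev_line: UnsafeCell::new(0),
        }
    }

    pub fn push(&self, value: i64) {
        // SAFETY: UnsafeCell makes Frame !Sync, so only one thread can reach
        // this cell, and the mutable borrow ends before the method returns.
        let stack = unsafe { &mut *self.stack.get() };
        stack.push(value);
    }

    pub fn pop(&self) -> Option<i64> {
        // SAFETY: same single-threaded, non-escaping borrow as in `push`.
        let stack = unsafe { &mut *self.stack.get() };
        stack.pop()
    }

    pub fn set_prev_line(&self, line: u32) {
        // SAFETY: no other reference to `prev_line` exists during this write.
        unsafe { *self.prev_line.get() = line }
    }

    pub fn prev_line(&self) -> u32 {
        // SAFETY: plain copy out of the cell; no concurrent writer can exist.
        unsafe { *self.prev_line.get() }
    }
}
```

Compared with a mutex, each access is a plain pointer dereference; the cost is that the single-threaded guarantee must be upheld by the surrounding code rather than by the type.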

* Add thread-local DataStack for bump-allocating frame data

Introduce DataStack with linked chunks (16KB initial, doubling) and
push/pop bump allocation. Add datastack field to VirtualMachine.
Not yet wired to frame creation.
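
The chunked bump allocation described above can be sketched in safe Rust. This is an illustration under stated assumptions, not the PR's `DataStack`: it hands out `Mark` handles instead of raw pointers, keeps chunks in a `Vec` rather than a linked list, and all names other than `DataStack`, `push`, and `pop` are hypothetical.

```rust
const INITIAL_CHUNK_BYTES: usize = 16 * 1024;

struct Chunk {
    data: Box<[u8]>,
    top: usize, // bump offset: bytes in use at the front of `data`
}

impl Chunk {
    fn with_capacity(bytes: usize) -> Self {
        Chunk { data: vec![0u8; bytes].into_boxed_slice(), top: 0 }
    }
}

/// Handle recording where an allocation started (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Mark {
    chunk: usize,
    offset: usize,
}

pub struct DataStack {
    chunks: Vec<Chunk>,
}

impl DataStack {
    pub fn new() -> Self {
        DataStack { chunks: vec![Chunk::with_capacity(INITIAL_CHUNK_BYTES)] }
    }

    /// Bump-allocate `size` bytes, adding a larger chunk when the current one is full.
    pub fn push(&mut self, size: usize) -> Mark {
        let last = self.chunks.last().expect("at least one chunk");
        let capacity = last.data.len();
        if capacity - last.top < size {
            // Double the previous capacity until the request fits.
            let mut new_cap = capacity.checked_mul(2).expect("capacity overflow");
            while new_cap < size {
                new_cap = new_cap.checked_mul(2).expect("capacity overflow");
            }
            self.chunks.push(Chunk::with_capacity(new_cap));
        }
        let idx = self.chunks.len() - 1;
        let chunk = &mut self.chunks[idx];
        let mark = Mark { chunk: idx, offset: chunk.top };
        chunk.top += size;
        mark
    }

    /// Release everything allocated at or after `mark` (strict LIFO order).
    pub fn pop(&mut self, mark: Mark) {
        self.chunks.truncate(mark.chunk + 1);
        self.chunks[mark.chunk].top = mark.offset;
    }

    pub fn chunk_count(&self) -> usize {
        self.chunks.len()
    }

    pub fn current_capacity(&self) -> usize {
        self.chunks[self.chunks.len() - 1].data.len()
    }

    pub fn used_in_current(&self) -> usize {
        self.chunks[self.chunks.len() - 1].top
    }
}
```

Allocation is an offset bump in the common case; a new chunk is only created when a request does not fit, and popping a mark in an older chunk drops all newer chunks.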

* Unify FastLocals and BoxVec stack into LocalsPlus

Replace separate FastLocals (Box<[Option<PyObjectRef>]>) and
BoxVec<Option<PyStackRef>> with a single LocalsPlus struct that
stores both in a contiguous Box<[usize]> array. The first
nlocalsplus slots are fastlocals and the rest is the evaluation
stack. Typed access is provided through transmute-based methods.

Remove BoxVec import from frame.rs.
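
The layout can be illustrated with a simplified sketch. It assumes plain `usize` values with `0` meaning "empty" and omits the transmute-based typed access entirely; apart from `LocalsPlus`, `fastlocals_mut`, and `stack_truncate`, the method names are hypothetical.

```rust
/// One contiguous buffer: [ fastlocals (nlocalsplus slots) | evaluation stack ].
pub struct LocalsPlus {
    data: Box<[usize]>,
    nlocalsplus: usize,
    stack_len: usize,
}

impl LocalsPlus {
    /// Returns None if the total slot count overflows usize.
    pub fn new(nlocalsplus: usize, max_stack: usize) -> Option<Self> {
        let total = nlocalsplus.checked_add(max_stack)?;
        Some(LocalsPlus {
            data: vec![0; total].into_boxed_slice(),
            nlocalsplus,
            stack_len: 0,
        })
    }

    pub fn fastlocals(&self) -> &[usize] {
        &self.data[..self.nlocalsplus]
    }

    pub fn fastlocals_mut(&mut self) -> &mut [usize] {
        &mut self.data[..self.nlocalsplus]
    }

    /// The live part of the evaluation stack, bottom first.
    pub fn stack(&self) -> &[usize] {
        &self.data[self.nlocalsplus..self.nlocalsplus + self.stack_len]
    }

    /// Push onto the stack region; hands the value back if the region is full.
    pub fn stack_push(&mut self, value: usize) -> Result<(), usize> {
        let idx = self.nlocalsplus + self.stack_len;
        match self.data.get_mut(idx) {
            Some(slot) => {
                *slot = value;
                self.stack_len += 1;
                Ok(())
            }
            None => Err(value),
        }
    }

    pub fn stack_pop(&mut self) -> Option<usize> {
        if self.stack_len == 0 {
            return None;
        }
        self.stack_len -= 1;
        let idx = self.nlocalsplus + self.stack_len;
        Some(std::mem::replace(&mut self.data[idx], 0))
    }

    /// Shrink the stack to `new_len` entries, clearing the dropped slots.
    pub fn stack_truncate(&mut self, new_len: usize) {
        if new_len < self.stack_len {
            let start = self.nlocalsplus + new_len;
            let end = self.nlocalsplus + self.stack_len;
            self.data[start..end].fill(0);
            self.stack_len = new_len;
        }
    }
}
```

Keeping both regions in one buffer means a frame needs a single allocation, and the stack base is just an index offset from the locals.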

* Use DataStack for LocalsPlus in non-generator function calls

Normal function calls now bump-allocate LocalsPlus data from the
per-thread DataStack instead of a separate heap allocation.
Generator/coroutine frames continue using heap allocation since
they outlive the call.

On frame exit, data is copied to the heap (materialize_to_heap)
to preserve locals for tracebacks, then the DataStack is popped.

VirtualMachine.datastack is wrapped in UnsafeCell for interior
mutability (safe because frame allocation is single-threaded LIFO).
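
The push, run, copy-out, pop sequence can be sketched as follows. This is a rough illustration that models the per-thread stack as a `Vec<usize>`; `call_with_datastack` is a hypothetical helper, and the copy-out step only stands in for what the commit calls `materialize_to_heap`.

```rust
/// Run `body` with `nslots` scratch slots taken from the top of `datastack`,
/// then copy those slots to the heap and release them from the stack.
pub fn call_with_datastack<F>(datastack: &mut Vec<usize>, nslots: usize, body: F) -> Box<[usize]>
where
    F: FnOnce(&mut [usize]),
{
    // Push: reserve zeroed slots above the current top.
    let base = datastack.len();
    let new_top = base.checked_add(nslots).expect("slot count overflow");
    datastack.resize(new_top, 0);

    // Execute the frame body against its slice of the stack.
    body(&mut datastack[base..]);

    // Materialize: copy the frame's slots to a heap allocation so they
    // outlive the stack region (for example, for traceback inspection).
    let snapshot: Box<[usize]> = datastack[base..].to_vec().into_boxed_slice();

    // Pop: restore the stack to where it was before the call.
    datastack.truncate(base);
    snapshot
}
```

The caller's portion of the stack is untouched after the call, while the returned box still holds the callee's final slot values.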

* Fix clippy: import Layout from core::alloc instead of alloc::alloc

* Fix vectorcall compatibility with LocalsPlus API

Update vectorcall dispatch functions to use localsplus stack
accessors instead of direct stack field access. Add
stack_truncate method to LocalsPlus. Update vectorcall fast
path in function.rs to use datastack and fastlocals_mut().

* Add datastack, nlocalsplus, ncells, tstate to cspell dictionary

* Fix DataStack pop() for non-monotonic allocation addresses

Check both bounds of the current chunk when determining if a
pop base is in the current chunk. The previous check (base >=
chunk_start) fails on Windows where newer chunks may be
allocated at lower addresses than older ones.
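
To see why the lower bound alone is not enough, consider this sketch with addresses modeled as plain `usize` values. Both function names are hypothetical, and treating the upper bound as inclusive is this sketch's choice, not something stated in the commit.

```rust
/// The earlier check as described: only compares against the chunk start.
pub fn lower_bound_only(base: usize, chunk_start: usize) -> bool {
    base >= chunk_start
}

/// Two-sided check: `base` must lie within [chunk_start, chunk_end].
/// The inclusive upper bound keeps a zero-sized allocation at the very
/// end of the chunk inside it (an assumption of this sketch).
pub fn base_in_chunk(base: usize, chunk_start: usize, chunk_end: usize) -> bool {
    chunk_start <= base && base <= chunk_end
}
```

If the current chunk spans `0x1000..0x3000` and an older chunk sits at a higher address, a base of `0x9100` from the older chunk passes the lower-bound check but is correctly rejected once the upper bound is also tested.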

* Fix stale comments: release_datastack -> materialize_localsplus

* Fix non-threading mode for parallel test execution

Two fixes for Cell-based types used in static items under non-threading
mode, which cause data races when the Rust test runner runs tests on
parallel threads:

1. LazyLock: use std::sync::LazyLock when std is available instead of
   wrapping core::cell::LazyCell with a false `unsafe impl Sync`.
   The LazyCell wrapper is kept only for no-std (truly single-threaded).

2. gc_state: use static_cell! (thread-local in non-threading mode)
   instead of OnceLock, so each thread gets its own GcState with
   Cell-based PyRwLock/PyMutex that are not accessed concurrently.
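
Both fixes can be illustrated with standard-library pieces. This sketch uses `std::sync::LazyLock` (stable since Rust 1.80) for the first point and `thread_local!` as a stand-in for the project's `static_cell!` macro for the second; `SHARED_TABLE`, `GC_COUNT`, and the two functions are hypothetical names.

```rust
use std::cell::Cell;
use std::sync::LazyLock;

// Fix 1: a lazily initialized static that is genuinely Sync, with no
// hand-written `unsafe impl Sync` around a LazyCell.
static SHARED_TABLE: LazyLock<Vec<u32>> = LazyLock::new(|| vec![1, 2, 3]);

pub fn shared_table() -> &'static [u32] {
    SHARED_TABLE.as_slice()
}

// Fix 2: Cell-based state is kept per thread, so parallel test threads
// never touch the same Cell.
thread_local! {
    static GC_COUNT: Cell<u32> = Cell::new(0);
}

/// Increment this thread's counter and return the new value.
pub fn bump_gc_count() -> u32 {
    GC_COUNT.with(|count| {
        count.set(count.get() + 1);
        count.get()
    })
}
```

A counter bumped on one thread is invisible to another, which is the property that makes `Cell`-based state safe under a parallel test runner.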

* Fix CallAllocAndEnterInit to use LocalsPlus stack API

* Use checked arithmetic in LocalsPlus and DataStack allocators

* Address code review: checked arithmetic, threading feature deps, Send gate

- Use checked arithmetic for nlocalsplus in Frame::new
- Add "std" to threading feature dependencies in rustpython-common
- Gate GcState Send impl with #[cfg(feature = "threading")]
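
As a small illustration of the checked-arithmetic item, slot counts can be combined with `checked_add` and `checked_mul` so that overflow surfaces as `None` instead of wrapping. The function names and the exact set of terms are assumptions, not the signature of `Frame::new`.

```rust
/// Total slots for a frame, or None if the sum overflows usize.
pub fn total_slots(nlocals: usize, ncells: usize, nfrees: usize, max_stack: usize) -> Option<usize> {
    nlocals
        .checked_add(ncells)?
        .checked_add(nfrees)?
        .checked_add(max_stack)
}

/// Byte size of `slots` machine words, or None on overflow.
pub fn bytes_for_slots(slots: usize) -> Option<usize> {
    slots.checked_mul(std::mem::size_of::<usize>())
}
```

Callers can then turn `None` into an error before any allocation is attempted.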

* Clean up comments: remove redundant/stale remarks, fix CPython references