Specialized ops#7322

youknowone

Summary by CodeRabbit

Performance & Stability
- Faster instruction execution via expanded adaptive specialization and improved inline caches.
- Better thread-safety and stronger memory ordering for reliable concurrent execution and cache visibility.
- Cache invalidation timing adjusted to reduce stale specializations.
Refactor
- Centralized specialization/deoptimization lifecycle for consistent runtime behavior.
- JIT cache moved to mutex-guarded storage for safer concurrent updates.
Behavior
- Slot lookup now distinguishes native vs Python-level results for more accurate dispatch.

coderabbitai

📝 Walkthrough

Walkthrough

This PR makes bytecode slots atomic for concurrent specialization, centralizes adaptive inline-caching and deoptimization flows, moves JIT cache storage to a mutex-protected optional, strengthens several atomic ordering choices across types/dicts, and distinguishes Python-level methods from native slots in slot lookup/update logic.

Changes

Cohort / File(s)	Summary
Atomic Bytecode Operations `crates/compiler-core/src/bytecode.rs`	Replaced non-atomic in-place opcode/arg/cache/adaptive-counter mutability with per-slot atomics (AtomicU8/AtomicU16). Added `compare_exchange_op`, `read_op`, `read_arg`, `read_cache_u16`/`write_cache_u16`, and adaptive counter read/write methods; updated serialization/quicken to use atomic reads; `unsafe impl Sync for CodeUnits`.
Adaptive Inline Caching & Deoptimization `crates/vm/src/frame.rs`	Large refactor centralizing specialization lifecycle: added `adaptive()`, `commit_specialization()`, `deoptimize()`/`deoptimize_at()`, unified adaptive counters and commit paths, many specialized handlers (binary int/float, LoadAttr variants, Call optimizations, ForIter variants, ToBool, CompareOp, StoreSubscr, ContainsOp, etc.), and consolidated guards/version checks.
JIT Cache Storage `crates/vm/src/builtins/function.rs`	Replaced `OnceCell<CompiledCode>` with `PyMutex<Option<CompiledCode>>` for `jitted_code`. Reworked all accesses to lock-based reads/writes and added cache invalidation on `__code__` setter and in `__jit__`.
Slot Lookup Result Distinction `crates/vm/src/types/slot.rs`	Introduced `SlotLookupResult<T>` (`NativeSlot`, `PythonMethod`, `NotFound`) and updated `lookup_slot_in_mro()` and all slot update sites/macros to branch on these outcomes, ensuring Python-level methods yield wrappers and MRO inheritance when not found.
Atomic Ordering Adjustments `crates/vm/src/dict_inner.rs`, `crates/vm/src/builtins/type.rs`	Strengthened atomic orderings: dict version loads use `Acquire` and bumps use `Release`; `PyType::modified()` now uses `SeqCst`. `SetAttr` now invalidates type version before mutation.
Stack Analysis Minor Change `crates/vm/src/builtins/frame.rs`	Stack analysis now calls `deoptimize()` on the resolved base opcode (after `to_base()`), ensuring de-specialization is applied during stack effect analysis.

Sequence Diagram(s)

sequenceDiagram
    participant Frame as ExecutingFrame
    participant CodeUnits as Code Storage
    participant Adaptive as Adaptive Engine
    participant Specialized as Specialized Handler
    participant Base as Base Handler

    Frame->>CodeUnits: read_op(index) (Acquire)
    CodeUnits-->>Frame: opcode
    Frame->>Adaptive: check adaptive counter & candidate (Relaxed)
    Adaptive->>Adaptive: increment counter (Relaxed)
    Adaptive->>Specialized: attempt specialization
    alt Success
        Specialized->>CodeUnits: compare_exchange_op(index, expected, new) (Release)
        CodeUnits-->>Specialized: CAS success
        Specialized-->>Frame: execute specialized path
    else Fail
        Specialized->>CodeUnits: write base opcode (Release)
        CodeUnits-->>Base: base opcode written
        Specialized-->>Frame: fall back to base handler
    end

sequenceDiagram
    participant Setter as PyType SetAttr
    participant Version as Version Tag
    participant Readers as Concurrent Readers

    Setter->>Version: modified() (SeqCst store)
    Version->>Readers: readers load version (Acquire)
    Setter->>Setter: perform attribute insert/remove
    Setter-->>Readers: changes ordered via version operations

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Instrumented instruction #7212: Overlaps with CodeUnits opcode mutation API changes (in-place replace_op → atomic per-slot ops).
Add newtype of CodeUnits #6241: Related CodeUnits/newtype and parsing/construction APIs—touches same compiler-core bytecode representation.
Implement LOAD_ATTR inline caching with adaptive specialization #7292: Related adaptive-specialization and deoptimization work touching both bytecode instrumentation and VM adaptive/frame logic.

Suggested reviewers

ShaharNaveh

🐰 Hops excitedly
Atoms guard the opcode lane,
Slots now tell if Python reign—
Caches nudge their version flag,
Specializers leap, then sag.
Hooray—deopt, commit, and play!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'Specialized ops' is vague and generic, using a non-descriptive term that fails to convey meaningful information about the specific changes in this substantial multi-file refactoring.	Consider a more descriptive title that captures the main objectives, such as 'Implement adaptive specialization framework for bytecode operations' or 'Refactor to atomic-based synchronization with specialized opcode paths'.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] lib: cpython/Lib/doctest.py
[ ] test: cpython/Lib/test/test_doctest (TODO: 6)

dependencies:

doctest

dependent tests: (33 tests)

doctest: test_builtin test_cmd test_code test_collections test_ctypes test_decimal test_deque test_descrtut test_difflib test_doctest test_doctest2 test_enum test_extcall test_generators test_getopt test_heapq test_http_cookies test_itertools test_listcomps test_math test_metaclass test_pep646_syntax test_pickle test_pickletools test_setcomps test_statistics test_syntax test_threading_local test_typing test_unpack test_unpack_ex test_weakref test_zipimport

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/vm/src/builtins/type.rs (1)

2171-2196: ⚠️ Potential issue | 🔴 Critical

Post-mutation cache invalidation is missing after pre-invalidation.

At Line 2174, zelf.modified() runs before the attribute write. find_name_in_mro() can reassign tp_version_tag and repopulate cache entries from pre-mutation state in that window. Since this path has no second invalidation after commit, stale cache entries can survive.

💡 Proposed fix (keep pre-invalidate, add post-commit invalidate)

         // Invalidate inline caches before modifying attributes.
         // This ensures other threads see the version invalidation before
         // any attribute changes, preventing use-after-free of cached descriptors.
         zelf.modified();

         if let PySetterValue::Assign(value) = value {
             zelf.attributes.write().insert(attr_name, value);
         } else {
             let prev_value = zelf.attributes.write().shift_remove(attr_name); // TODO: swap_remove applicable?
             if prev_value.is_none() {
                 return Err(vm.new_attribute_error(format!(
                     "type object '{}' has no attribute '{}'",
                     zelf.name(),
                     attr_name,
                 )));
             }
         }

         if attr_name.as_wtf8().starts_with("__") && attr_name.as_wtf8().ends_with("__") {
             if assign {
                 zelf.update_slot::<true>(attr_name, &vm.ctx);
             } else {
                 zelf.update_slot::<false>(attr_name, &vm.ctx);
             }
         }
+        // Flush any entries repopulated between pre-invalidation and commit.
+        zelf.modified();
         Ok(())

#!/bin/bash
set -euo pipefail

# Verify invalidation placement in setattro
rg -n -C8 'fn setattro|modified\(\)|attributes\.write\(\)\.(insert|shift_remove)' crates/vm/src/builtins/type.rs

# Verify cache/version reassignment path that can run concurrently
rg -n -C8 'fn find_name_in_mro|assign_version_tag|tp_version_tag\.load|tp_version_tag\.store' crates/vm/src/builtins/type.rs

Expected result: modified() appears before mutation in setattro, and no post-mutation invalidation currently exists.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/builtins/type.rs` around lines 2171 - 2196, The pre-mutation
cache invalidation (zelf.modified()) in setattro is currently invoked before the
attribute write but there is no post-commit invalidation, allowing
find_name_in_mro to repopulate caches from the old state; fix this by keeping
the existing pre-invalidation and adding a second invalidation immediately after
the mutation commit (after attributes.write().insert or shift_remove completes
and before calling update_slot::<...>), i.e., invoke zelf.modified() again to
ensure tp_version_tag/cache entries are refreshed and prevent stale cache reuse.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/compiler-core/src/bytecode.rs`:
- Around line 509-510: The code casts CodeUnit pointers to *const AtomicU16 (see
usages like units.as_ptr().wrapping_add(index) as *const AtomicU16 and the
subsequent .store calls), but CodeUnit is currently #[repr(C)] with two u8
fields and only 1-byte alignment, which violates atomic alignment requirements;
fix by changing the struct declaration to #[repr(C, align(2))] for CodeUnit and
add a compile-time assertion to ensure alignment (e.g.
assert_eq!(std::mem::align_of::<CodeUnit>(), 2) or a const_assert) so the
AtomicU16 casts are safe.
- Around line 485-499: The public safe readers (read_op, read_arg,
read_adaptive_counter) currently compute raw pointers with wrapping_add(index)
and dereference without bounds checks, which can cause UB on out-of-range
indices; update each function to load units first (as in read_cache_u16) and
assert the index (and any additional byte offsets, e.g., index+1 for arg) is
within units.len() before computing pointers and dereferencing so out-of-bounds
inputs panic instead of producing UB, keeping the existing pointer/dereference
logic otherwise.

In `@crates/vm/src/frame.rs`:
- Around line 4717-4723: The specialized LoadGlobalBuiltin path incorrectly
accepts dict subclasses via downcast_ref::<PyDict>(), which can bypass subclass
mapping; change the guard to require an exact PyDict type before doing the
direct dict lookup: first check that self.builtins is exactly the dict type
(e.g. via an exact-type check against vm.ctx.types.dict_type or a
PyDict::is_exact method), only then downcast_ref::<PyDict>() and use
cached_builtins_ver/current_builtins_ver and the fast get_item_opt path;
otherwise fall back to the generic global lookup; apply the same exact-type
check change to the other occurrence mentioned (around the 7313–7319 region).
- Around line 2442-2445: The current synchronization using
self.code.quickened.swap(true, atomic::Ordering::Relaxed) plus
atomic::fence(atomic::Ordering::Release) is misleading because there is no
matching Acquire reader; update the quickening logic in frame.rs: either remove
the explicit atomics (drop the swap + fence) if the actual memory ordering is
already provided by the replace_op (Release) ↔ read_op (Acquire) protocol in
CodeUnits, or make the intent explicit by switching to an acquire-aware pattern
(e.g., use swap/store with Release paired with an explicit Acquire load where
readers check quickened) and add a short comment referencing the CodeUnits
protocol; target the quickening site (self.code.quickened,
self.code.instructions.quicken(), atomic::fence) when making the change.

---

Outside diff comments:
In `@crates/vm/src/builtins/type.rs`:
- Around line 2171-2196: The pre-mutation cache invalidation (zelf.modified())
in setattro is currently invoked before the attribute write but there is no
post-commit invalidation, allowing find_name_in_mro to repopulate caches from
the old state; fix this by keeping the existing pre-invalidation and adding a
second invalidation immediately after the mutation commit (after
attributes.write().insert or shift_remove completes and before calling
update_slot::<...>), i.e., invoke zelf.modified() again to ensure
tp_version_tag/cache entries are refreshed and prevent stale cache reuse.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d106c3 and 2954b60.

📒 Files selected for processing (7)

crates/compiler-core/src/bytecode.rs
crates/vm/src/builtins/frame.rs
crates/vm/src/builtins/function.rs
crates/vm/src/builtins/type.rs
crates/vm/src/dict_inner.rs
crates/vm/src/frame.rs
crates/vm/src/types/slot.rs

Optimizes user-defined class instantiation MyClass(args...) when tp_new == object.__new__ and __init__ is a simple PyFunction. Allocates the object directly and calls __init__ via invoke_exact_args, bypassing the generic type.__call__ dispatch path.

Change jitted_code from OnceCell to PyMutex<Option<CompiledCode>> so it can be cleared on __code__ assignment. The setter now sets the cached JIT code to None to prevent executing stale machine code.

- range iterator: deduplicate fast_next/next_fast - Replace raw pointer reads/writes in CodeUnits with atomic operations (AtomicU8/AtomicU16) for thread safety - Add read_op (Acquire), read_arg (Relaxed), compare_exchange_op - Use Release ordering in replace_op to synchronize cache writes - Dispatch loop reads opcodes atomically via read_op/read_arg - Fix adaptive counter access: use read/write_adaptive_counter instead of read/write_cache_u16 (was reading wrong bytes) - Add pre-check guards to all specialize_* functions to prevent concurrent specialization races - Move modified() before attribute changes in type.__setattr__ to prevent use-after-free of cached descriptors - Use SeqCst ordering in modified() for version invalidation - Add Release fence after quicken() initialization

For __getattribute__: only use getattro_wrapper when the type itself defines the attribute; otherwise inherit native slot from base class via MRO. For __setattr__/__delattr__: only store setattro_wrapper when the type has its own __setattr__ or __delattr__; otherwise keep the inherited base slot.

write_cache_u32 at cache_base+3 writes 2 code units (positions 3 and 4), but STORE_ATTR only has 4 cache entries (positions 0-3). This overwrites the next instruction with the upper 16 bits of the slot offset. Changed to write_cache_u16/read_cache_u16 since member descriptor offsets fit within u16 (max 65535 bytes).

has_python_cmp incorrectly treated method_descriptor as Python-level comparison methods, causing richcompare slot to use wrapper dispatch instead of inheriting the native slot.

partial_cmp returns None for NaN comparisons. is_some_and incorrectly returned false for all NaN comparisons, but NaN != x should be true per IEEE 754 semantics.

Change lookup_slot_in_mro to return a 3-state SlotLookupResult enum (NativeSlot/PythonMethod/NotFound) instead of Option<T>. Previously, both "found a Python-level method" and "found nothing" returned None, causing incorrect slot inheritance. For example, class Test(Mixin, TestCase) would inherit object.slot_init from Mixin via inherit_from_mro instead of using init_wrapper to dispatch TestCase.__init__. Apply this fix consistently to all slot update sites: update_main_slot!, update_sub_slot!, TpGetattro, TpSetattro, TpDescrSet, TpHash, TpRichcompare, SqAssItem, MpAssSubscript.

- deoptimize() / deoptimize_at(): replace specialized op with base op - adaptive(): decrement warmup counter or call specialize function - commit_specialization(): replace op on success, backoff on failure - execute_binary_op_int() / execute_binary_op_float(): typed binary ops Removes 10 duplicate deoptimize_* functions, consolidates 13 adaptive counter blocks, 6 binary op handlers, and 7 specialize tail patterns. Also replaces inline deopt blocks in LoadAttr/StoreAttr handlers.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/vm/src/frame.rs (2)

7391-7407: ⚠️ Potential issue | 🔴 Critical

Guard UnpackSequenceTwoTuple behind expected arity (UNPACK_SEQUENCE 2)

Instruction::UnpackSequenceTwoTuple is selected from runtime tuple length only. If this opcode is UNPACK_SEQUENCE N where N != 2, a later execution can incorrectly succeed with 2 elements instead of raising. This is a correctness break and can corrupt stack expectations.

💡 Proposed fix

@@
-            Instruction::UnpackSequence { size } => {
-                self.adaptive(|s, ii, cb| s.specialize_unpack_sequence(vm, ii, cb));
-                self.unpack_sequence(size.get(arg), vm)
+            Instruction::UnpackSequence { size } => {
+                let expected_size = size.get(arg);
+                self.adaptive(|s, ii, cb| s.specialize_unpack_sequence(vm, expected_size, ii, cb));
+                self.unpack_sequence(expected_size, vm)
             }
@@
-    fn specialize_unpack_sequence(
+    fn specialize_unpack_sequence(
         &mut self,
         vm: &VirtualMachine,
+        expected_size: u32,
         instr_idx: usize,
         cache_base: usize,
     ) {
@@
-        let new_op = if let Some(tuple) = obj.downcast_ref_if_exact::<PyTuple>(vm) {
-            if tuple.len() == 2 {
+        let new_op = if let Some(tuple) = obj.downcast_ref_if_exact::<PyTuple>(vm) {
+            if expected_size == 2 && tuple.len() == 2 {
                 Some(Instruction::UnpackSequenceTwoTuple)
             } else {
                 Some(Instruction::UnpackSequenceTuple)
             }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/frame.rs` around lines 7391 - 7407, The specialization picks
Instruction::UnpackSequenceTwoTuple based solely on the runtime tuple length,
which can violate the original UNPACK_SEQUENCE N expectation; update
specialize_unpack_sequence to first inspect the existing instruction at
instr_idx (match on Instruction::UnpackSequence { count }) and only choose
UnpackSequenceTwoTuple when count == 2 in addition to the obj being an exact
PyTuple of length 2; keep all other paths unchanged so UNPACK_SEQUENCE N (N !=
2) still preserves the generic UnpackSequence behavior.

7475-7486: ⚠️ Potential issue | 🟠 Major

Avoid unchecked narrowing of slot offset in StoreAttrSlot cache

offset as u16 can truncate. If that happens, specialization can write to the wrong slot index at runtime. Please guard the conversion and back off specialization when it does not fit.

💡 Proposed fix

             if let Some(ref descr) = cls_attr
                 && let Some(member_descr) = descr.downcast_ref::<PyMemberDescriptor>()
                 && let MemberGetter::Offset(offset) = member_descr.member.getter
             {
-                unsafe {
-                    self.code
-                        .instructions
-                        .write_cache_u32(cache_base + 1, type_version);
-                    self.code
-                        .instructions
-                        .write_cache_u16(cache_base + 3, offset as u16);
-                    self.code
-                        .instructions
-                        .replace_op(instr_idx, Instruction::StoreAttrSlot);
-                }
+                if let Ok(slot_offset) = u16::try_from(offset) {
+                    unsafe {
+                        self.code
+                            .instructions
+                            .write_cache_u32(cache_base + 1, type_version);
+                        self.code
+                            .instructions
+                            .write_cache_u16(cache_base + 3, slot_offset);
+                        self.code
+                            .instructions
+                            .replace_op(instr_idx, Instruction::StoreAttrSlot);
+                    }
+                } else {
+                    unsafe {
+                        self.code
+                            .instructions
+                            .write_adaptive_counter(cache_base, ADAPTIVE_BACKOFF_VALUE);
+                    }
+                }
             } else {

As per coding guidelines: "Follow Rust best practices for error handling and memory management".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/frame.rs` around lines 7475 - 7486, The code currently casts
offset as u16 unconditionally in the StoreAttrSlot cache path (see cls_attr,
member_descr and MemberGetter::Offset(offset) and the call to write_cache_u16),
which can truncate large offsets; add a guard that checks whether offset fits in
u16 (e.g. offset <= u16::MAX) before taking the specialization/unsafe path and
writing the cache (write_cache_u32/write_cache_u16); if the offset does not fit,
back off the specialization and fall through to the generic store-attribute path
instead of writing a truncated offset into the cache.

♻️ Duplicate comments (2)

crates/compiler-core/src/bytecode.rs (2)

485-499: ⚠️ Potential issue | 🔴 Critical

Safe readers dereference raw pointers without bounds validation.

Lines [485-499] and [567-574] are safe methods, but they compute/deref raw pointers without checking index < len. Out-of-range calls can produce UB instead of a panic.

Proposed fix

 pub fn read_op(&self, index: usize) -> Instruction {
     let units = unsafe { &*self.0.get() };
+    assert!(index < units.len(), "read_op: index out of bounds");
     let ptr = units.as_ptr().wrapping_add(index) as *const AtomicU8;
     let byte = unsafe { &*ptr }.load(Ordering::Acquire);
     unsafe { mem::transmute::<u8, Instruction>(byte) }
 }

 pub fn read_arg(&self, index: usize) -> OpArgByte {
     let units = unsafe { &*self.0.get() };
+    assert!(index < units.len(), "read_arg: index out of bounds");
     let ptr = units.as_ptr().wrapping_add(index) as *const u8;
     let arg_ptr = unsafe { ptr.add(1) } as *const AtomicU8;
     OpArgByte::from(unsafe { &*arg_ptr }.load(Ordering::Relaxed))
 }

 pub fn read_adaptive_counter(&self, index: usize) -> u8 {
     let units = unsafe { &*self.0.get() };
+    assert!(index < units.len(), "read_adaptive_counter: index out of bounds");
     let ptr = units.as_ptr().wrapping_add(index) as *const u8;
     let arg_ptr = unsafe { ptr.add(1) } as *const AtomicU8;
     unsafe { &*arg_ptr }.load(Ordering::Relaxed)
 }

#!/bin/bash
set -e
sed -n '483,500p' crates/compiler-core/src/bytecode.rs
sed -n '567,575p' crates/compiler-core/src/bytecode.rs

As per coding guidelines: "Follow Rust best practices for error handling and memory management".

Also applies to: 567-574

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/compiler-core/src/bytecode.rs` around lines 485 - 499, The safe reader
methods read_op and read_arg perform raw pointer arithmetic/derefs without
validating index, which can cause UB for out-of-range indices; modify both
functions to first obtain units (as now), check that index < units.len() and if
not panic with a clear message (e.g., index out of bounds) before doing any
pointer arithmetic/dereference, then proceed with the existing AtomicU8 load and
transmute/OpArgByte conversion; ensure the check uses the same units reference
used for pointer math so bounds are validated against the exact slice.

342-350: ⚠️ Potential issue | 🔴 Critical

AtomicU16 pointer casts require alignment that CodeUnit does not guarantee.

Lines [509-523] cast CodeUnit* to AtomicU16*, but CodeUnit at Lines [342-350] is #[repr(C)] with byte-sized fields, so alignment is not guaranteed to be 2. That is UB for atomic u16 accesses.

Proposed fix

-#[repr(C)]
+#[repr(C, align(2))]
 #[derive(Copy, Clone, Debug)]
 pub struct CodeUnit {
     pub op: Instruction,
     pub arg: OpArgByte,
 }

-const _: () = assert!(mem::size_of::<CodeUnit>() == 2);
+const _: () = {
+    assert!(mem::size_of::<CodeUnit>() == 2);
+    assert!(mem::align_of::<CodeUnit>() >= mem::align_of::<u16>());
+};

#!/bin/bash
set -e
sed -n '340,352p' crates/compiler-core/src/bytecode.rs
rg -n 'as \*const AtomicU16' crates/compiler-core/src/bytecode.rs

As per coding guidelines: "Follow Rust best practices for error handling and memory management".

Also applies to: 509-523

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/compiler-core/src/bytecode.rs` around lines 342 - 350, The CodeUnit
struct is not guaranteed 2-byte aligned but later code casts CodeUnit pointers
to AtomicU16 (unsafe for misaligned data); fix by making CodeUnit 2-byte
aligned—add #[repr(C, align(2))] to the CodeUnit definition (keeping the
existing fields op: Instruction and arg: OpArgByte) and verify the size
assertion still holds, then confirm the atomic pointer casts around the code
that converts CodeUnit* to AtomicU16* (the casts at the section around lines
that perform "as *const AtomicU16"/"as *mut AtomicU16") are now safe due to the
enforced alignment.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/vm/src/frame.rs`:
- Around line 7391-7407: The specialization picks
Instruction::UnpackSequenceTwoTuple based solely on the runtime tuple length,
which can violate the original UNPACK_SEQUENCE N expectation; update
specialize_unpack_sequence to first inspect the existing instruction at
instr_idx (match on Instruction::UnpackSequence { count }) and only choose
UnpackSequenceTwoTuple when count == 2 in addition to the obj being an exact
PyTuple of length 2; keep all other paths unchanged so UNPACK_SEQUENCE N (N !=
2) still preserves the generic UnpackSequence behavior.
- Around line 7475-7486: The code currently casts offset as u16 unconditionally
in the StoreAttrSlot cache path (see cls_attr, member_descr and
MemberGetter::Offset(offset) and the call to write_cache_u16), which can
truncate large offsets; add a guard that checks whether offset fits in u16 (e.g.
offset <= u16::MAX) before taking the specialization/unsafe path and writing the
cache (write_cache_u32/write_cache_u16); if the offset does not fit, back off
the specialization and fall through to the generic store-attribute path instead
of writing a truncated offset into the cache.

---

Duplicate comments:
In `@crates/compiler-core/src/bytecode.rs`:
- Around line 485-499: The safe reader methods read_op and read_arg perform raw
pointer arithmetic/derefs without validating index, which can cause UB for
out-of-range indices; modify both functions to first obtain units (as now),
check that index < units.len() and if not panic with a clear message (e.g.,
index out of bounds) before doing any pointer arithmetic/dereference, then
proceed with the existing AtomicU8 load and transmute/OpArgByte conversion;
ensure the check uses the same units reference used for pointer math so bounds
are validated against the exact slice.
- Around line 342-350: The CodeUnit struct is not guaranteed 2-byte aligned but
later code casts CodeUnit pointers to AtomicU16 (unsafe for misaligned data);
fix by making CodeUnit 2-byte aligned—add #[repr(C, align(2))] to the CodeUnit
definition (keeping the existing fields op: Instruction and arg: OpArgByte) and
verify the size assertion still holds, then confirm the atomic pointer casts
around the code that converts CodeUnit* to AtomicU16* (the casts at the section
around lines that perform "as *const AtomicU16"/"as *mut AtomicU16") are now
safe due to the enforced alignment.

ℹ️ Review info

Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cdcead23-a33a-4961-b807-327211593155

📥 Commits

Reviewing files that changed from the base of the PR and between 2954b60 and 0b66b19.

📒 Files selected for processing (7)

crates/compiler-core/src/bytecode.rs
crates/vm/src/builtins/frame.rs
crates/vm/src/builtins/function.rs
crates/vm/src/builtins/type.rs
crates/vm/src/dict_inner.rs
crates/vm/src/frame.rs
crates/vm/src/types/slot.rs

🚧 Files skipped from review as they are similar to previous changes (2)

crates/vm/src/dict_inner.rs
crates/vm/src/builtins/frame.rs

- CONTAINS_OP_SET: add frozenset support in handler and specialize - TO_BOOL_ALWAYS_TRUE: cache type version instead of checking slots - LOAD_GLOBAL_BUILTIN: cache builtins dict version alongside globals - mark_stacks: deoptimize specialized opcodes for correct reachability

github-actions

Code has been automatically formatted

The code in this PR has been formatted using:

cargo fmt --all
Please pull the latest changes before pushing again:

git pull origin specialization2

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

crates/compiler-core/src/bytecode.rs (2)

485-499: ⚠️ Potential issue | 🔴 Critical

Add bounds checks to all safe atomic readers.

read_op, read_arg, and read_adaptive_counter are safe APIs but dereference pointers derived from unchecked index. Out-of-range input is UB instead of a panic.

#!/bin/bash
set -euo pipefail
FILE="crates/compiler-core/src/bytecode.rs"

# Verify safe readers and whether they assert bounds.
rg -n "pub fn read_op|pub fn read_arg|pub fn read_adaptive_counter|assert!" "$FILE" -A8 -B2

🔧 Suggested fix

     pub fn read_op(&self, index: usize) -> Instruction {
         let units = unsafe { &*self.0.get() };
+        assert!(index < units.len(), "read_op: index out of bounds");
         let ptr = units.as_ptr().wrapping_add(index) as *const AtomicU8;
         let byte = unsafe { &*ptr }.load(Ordering::Acquire);
         // SAFETY: Only valid Instruction values are stored via replace_op/compare_exchange_op.
         unsafe { mem::transmute::<u8, Instruction>(byte) }
     }

     pub fn read_arg(&self, index: usize) -> OpArgByte {
         let units = unsafe { &*self.0.get() };
+        assert!(index < units.len(), "read_arg: index out of bounds");
         let ptr = units.as_ptr().wrapping_add(index) as *const u8;
         let arg_ptr = unsafe { ptr.add(1) } as *const AtomicU8;
         OpArgByte::from(unsafe { &*arg_ptr }.load(Ordering::Relaxed))
     }

     pub fn read_adaptive_counter(&self, index: usize) -> u8 {
         let units = unsafe { &*self.0.get() };
+        assert!(index < units.len(), "read_adaptive_counter: index out of bounds");
         let ptr = units.as_ptr().wrapping_add(index) as *const u8;
         let arg_ptr = unsafe { ptr.add(1) } as *const AtomicU8;
         unsafe { &*arg_ptr }.load(Ordering::Relaxed)
     }

As per coding guidelines: "Follow Rust best practices for error handling and memory management".

Also applies to: 567-574

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/compiler-core/src/bytecode.rs` around lines 485 - 499, The safe atomic
readers (read_op, read_arg, read_adaptive_counter) currently compute pointers
from an unchecked index causing UB on out-of-range input; fix by reading units =
unsafe { &*self.0.get() } then perform explicit bounds checks (e.g.,
assert!(index < units.len()) for read_op and read_adaptive_counter, and
assert!(index + 1 < units.len()) for read_arg because it accesses the next byte)
before creating/wrapping the pointer; only after those checks compute the
ptr/arg_ptr and perform the AtomicU8 load so out-of-range indices panic instead
of producing UB.

509-510: ⚠️ Potential issue | 🔴 Critical

Guarantee 2-byte alignment before casting to AtomicU16.

The CodeUnit* -> AtomicU16* cast requires 2-byte alignment. With #[repr(C)] and two 1-byte fields, this alignment is not guaranteed, so these atomic accesses are UB-prone.

#!/bin/bash
set -euo pipefail
FILE="crates/compiler-core/src/bytecode.rs"

# Show CodeUnit layout declaration + AtomicU16 pointer casts.
rg -n "struct CodeUnit|repr\\(C|AtomicU16|size_of::<CodeUnit>|align_of::<CodeUnit>" "$FILE" -A3 -B2

🔧 Suggested fix

-#[repr(C)]
+#[repr(C, align(2))]
 #[derive(Copy, Clone, Debug)]
 pub struct CodeUnit {
     pub op: Instruction,
     pub arg: OpArgByte,
 }

-const _: () = assert!(mem::size_of::<CodeUnit>() == 2);
+const _: () = {
+    assert!(mem::size_of::<CodeUnit>() == 2);
+    assert!(mem::align_of::<CodeUnit>() >= mem::align_of::<u16>());
+};

As per coding guidelines: "Follow Rust best practices for error handling and memory management".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/compiler-core/src/bytecode.rs` around lines 509 - 510, The code casts
CodeUnit pointers to AtomicU16 pointers (units.as_ptr().wrapping_add(index) as
*const AtomicU16) but CodeUnit (with #[repr(C)] and two 1-byte fields) may not
be 2-byte aligned, causing UB; fix by guaranteeing 2-byte alignment on CodeUnit
(e.g., annotate the CodeUnit struct with #[repr(C, align(2))]) so the cast is
valid, then replace the cast with a safe pointer cast like
units.as_ptr().wrapping_add(index).cast::<AtomicU16>() and keep the unsafe {
&*ptr }.store(value, Ordering::Relaxed) usage.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/builtins/function.rs`:
- Around line 711-716: The update to __code__ must be atomic with JIT-cache
invalidation to avoid running stale jitted code; modify set___code__ so it
acquires the jitted_code mutex and clears the cached entry before performing
self.code.swap_to_temporary_refs, or hold the same jitted_code lock while doing
both operations so no other thread can observe cleared/uncleared state in
between. Specifically, in set___code__, lock jitted_code (the Mutex used at
*self.jitted_code.lock()), set it to None, then call
self.code.swap_to_temporary_refs(code, vm) while still holding the lock, and
finally release the lock so the swap and cache invalidation occur atomically.

In `@crates/vm/src/builtins/type.rs`:
- Around line 2171-2175: The current pre-invalidation (calling zelf.modified()
once before attribute mutation) leaves a window where a concurrent
find_name_in_mro may cache results under the new tag during the mutation; fix
this by invoking zelf.modified() both before starting the mutation and once more
immediately after finishing attribute modifications so any caches created during
the mutation are replay-invalidated; update the code path where attributes are
changed (the block containing the single zelf.modified() call) to perform a
second modified() post-mutation and ensure this behavior aligns with the
cache-checking logic in find_name_in_mro.

---

Duplicate comments:
In `@crates/compiler-core/src/bytecode.rs`:
- Around line 485-499: The safe atomic readers (read_op, read_arg,
read_adaptive_counter) currently compute pointers from an unchecked index
causing UB on out-of-range input; fix by reading units = unsafe { &*self.0.get()
} then perform explicit bounds checks (e.g., assert!(index < units.len()) for
read_op and read_adaptive_counter, and assert!(index + 1 < units.len()) for
read_arg because it accesses the next byte) before creating/wrapping the
pointer; only after those checks compute the ptr/arg_ptr and perform the
AtomicU8 load so out-of-range indices panic instead of producing UB.
- Around line 509-510: The code casts CodeUnit pointers to AtomicU16 pointers
(units.as_ptr().wrapping_add(index) as *const AtomicU16) but CodeUnit (with
#[repr(C)] and two 1-byte fields) may not be 2-byte aligned, causing UB; fix by
guaranteeing 2-byte alignment on CodeUnit (e.g., annotate the CodeUnit struct
with #[repr(C, align(2))]) so the cast is valid, then replace the cast with a
safe pointer cast like units.as_ptr().wrapping_add(index).cast::<AtomicU16>()
and keep the unsafe { &*ptr }.store(value, Ordering::Relaxed) usage.

ℹ️ Review info

Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 3cc6b83b-aec4-4da8-a316-976f49a64f3c

📥 Commits

Reviewing files that changed from the base of the PR and between 0b66b19 and dd89151.

📒 Files selected for processing (7)

crates/compiler-core/src/bytecode.rs
crates/vm/src/builtins/frame.rs
crates/vm/src/builtins/function.rs
crates/vm/src/builtins/type.rs
crates/vm/src/dict_inner.rs
crates/vm/src/frame.rs
crates/vm/src/types/slot.rs

🚧 Files skipped from review as they are similar to previous changes (2)

crates/vm/src/builtins/frame.rs
crates/vm/src/dict_inner.rs

* Add CALL_ALLOC_AND_ENTER_INIT specialization Optimizes user-defined class instantiation MyClass(args...) when tp_new == object.__new__ and __init__ is a simple PyFunction. Allocates the object directly and calls __init__ via invoke_exact_args, bypassing the generic type.__call__ dispatch path. * Invalidate JIT cache when __code__ is reassigned Change jitted_code from OnceCell to PyMutex<Option<CompiledCode>> so it can be cleared on __code__ assignment. The setter now sets the cached JIT code to None to prevent executing stale machine code. * Atomic operations for specialization cache - range iterator: deduplicate fast_next/next_fast - Replace raw pointer reads/writes in CodeUnits with atomic operations (AtomicU8/AtomicU16) for thread safety - Add read_op (Acquire), read_arg (Relaxed), compare_exchange_op - Use Release ordering in replace_op to synchronize cache writes - Dispatch loop reads opcodes atomically via read_op/read_arg - Fix adaptive counter access: use read/write_adaptive_counter instead of read/write_cache_u16 (was reading wrong bytes) - Add pre-check guards to all specialize_* functions to prevent concurrent specialization races - Move modified() before attribute changes in type.__setattr__ to prevent use-after-free of cached descriptors - Use SeqCst ordering in modified() for version invalidation - Add Release fence after quicken() initialization * Fix slot wrapper override for inherited attributes For __getattribute__: only use getattro_wrapper when the type itself defines the attribute; otherwise inherit native slot from base class via MRO. For __setattr__/__delattr__: only store setattro_wrapper when the type has its own __setattr__ or __delattr__; otherwise keep the inherited base slot. * Fix StoreAttrSlot cache overflow corrupting next instruction write_cache_u32 at cache_base+3 writes 2 code units (positions 3 and 4), but STORE_ATTR only has 4 cache entries (positions 0-3). This overwrites the next instruction with the upper 16 bits of the slot offset. Changed to write_cache_u16/read_cache_u16 since member descriptor offsets fit within u16 (max 65535 bytes). * Exclude method_descriptor from has_python_cmp check has_python_cmp incorrectly treated method_descriptor as Python-level comparison methods, causing richcompare slot to use wrapper dispatch instead of inheriting the native slot. * Fix CompareOpFloat NaN handling partial_cmp returns None for NaN comparisons. is_some_and incorrectly returned false for all NaN comparisons, but NaN != x should be true per IEEE 754 semantics. * Fix invoke_exact_args borrow in CallAllocAndEnterInit * Distinguish Python method vs not-found in slot MRO lookup Change lookup_slot_in_mro to return a 3-state SlotLookupResult enum (NativeSlot/PythonMethod/NotFound) instead of Option<T>. Previously, both "found a Python-level method" and "found nothing" returned None, causing incorrect slot inheritance. For example, class Test(Mixin, TestCase) would inherit object.slot_init from Mixin via inherit_from_mro instead of using init_wrapper to dispatch TestCase.__init__. Apply this fix consistently to all slot update sites: update_main_slot!, update_sub_slot!, TpGetattro, TpSetattro, TpDescrSet, TpHash, TpRichcompare, SqAssItem, MpAssSubscript. * Extract specialization helper functions to reduce boilerplate - deoptimize() / deoptimize_at(): replace specialized op with base op - adaptive(): decrement warmup counter or call specialize function - commit_specialization(): replace op on success, backoff on failure - execute_binary_op_int() / execute_binary_op_float(): typed binary ops Removes 10 duplicate deoptimize_* functions, consolidates 13 adaptive counter blocks, 6 binary op handlers, and 7 specialize tail patterns. Also replaces inline deopt blocks in LoadAttr/StoreAttr handlers. * Improve specialization guards and fix mark_stacks - CONTAINS_OP_SET: add frozenset support in handler and specialize - TO_BOOL_ALWAYS_TRUE: cache type version instead of checking slots - LOAD_GLOBAL_BUILTIN: cache builtins dict version alongside globals - mark_stacks: deoptimize specialized opcodes for correct reachability * Auto-format: cargo fmt --all --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

youknowone force-pushed the specialization2 branch 5 times, most recently from b3456e0 to 2954b60 Compare March 3, 2026 19:46

youknowone marked this pull request as ready for review March 3, 2026 20:55

coderabbitai Bot reviewed Mar 3, 2026

View reviewed changes

youknowone force-pushed the specialization2 branch from 2954b60 to 0b66b19 Compare March 4, 2026 04:56

youknowone added 10 commits March 4, 2026 13:58

Invalidate JIT cache when __code__ is reassigned …

4ab342d

Change jitted_code from OnceCell to PyMutex<Option<CompiledCode>> so it can be cleared on __code__ assignment. The setter now sets the cached JIT code to None to prevent executing stale machine code.

Exclude method_descriptor from has_python_cmp check …

7bd5300

has_python_cmp incorrectly treated method_descriptor as Python-level comparison methods, causing richcompare slot to use wrapper dispatch instead of inheriting the native slot.

Fix CompareOpFloat NaN handling …

5c81306

partial_cmp returns None for NaN comparisons. is_some_and incorrectly returned false for all NaN comparisons, but NaN != x should be true per IEEE 754 semantics.

Fix invoke_exact_args borrow in CallAllocAndEnterInit

5a1b29b

coderabbitai Bot reviewed Mar 4, 2026

View reviewed changes

youknowone force-pushed the specialization2 branch from 0b66b19 to dd89151 Compare March 4, 2026 05:10

Auto-format: cargo fmt --all

21c29ca

coderabbitai Bot reviewed Mar 4, 2026

View reviewed changes

youknowone changed the title ~~Specialized ops 2~~ Mar 4, 2026

youknowone deleted the specialization2 branch March 4, 2026 06:39

coderabbitai Bot mentioned this pull request Mar 4, 2026

Apply vectorcall for more places #7336

Merged

youknowone changed the title ~~Specialized ops~~ Mar 4, 2026

youknowone changed the title ~~Align specialization guards and caching with CPython~~ Mar 4, 2026

This was referenced Mar 22, 2026

marshal #7467

Merged

Add global type mutex for thread-safe type operations #7416

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specialized ops#7322