Implement LOAD_ATTR inline caching with adaptive specialization by youknowone · Pull Request #7292 · RustPython/RustPython

youknowone

Add type version counter (tp_version_tag) to PyType with subclass
invalidation cascade. Add cache read/write methods (u16/u32/u64)
to CodeUnits. Implement adaptive specialization in load_attr that
replaces the opcode with specialized variants on first execution:

- LoadAttrMethodNoDict: cached method lookup for slotted types
- LoadAttrMethodWithValues: cached method with dict shadow check
- LoadAttrInstanceValue: direct dict lookup skipping descriptors

Specialized opcodes guard on type_version_tag and deoptimize back
to generic LOAD_ATTR with backoff counter on cache miss.

BINARY_OP: Specialize int add/subtract/multiply and float
add/subtract/multiply with type guards and deoptimization.

CALL: Add func_version to PyFunction, specialize simple
function calls (CallPyExactArgs, CallBoundMethodExactArgs)
with invoke_exact_args fast path that skips FuncArgs
allocation and fill_locals_from_args.

Move counter initialization from compile-time to RESUME execution,
matching CPython's _PyCode_Quicken pattern. Store counter in CACHE
entry's arg byte to preserve op=Instruction::Cache for dis/JIT.
Add PyCode.quickened flag for one-time initialization.

- deoptimize() maps specialized opcodes back to their base adaptive variant
- original_bytes() produces deoptimized bytecode with zeroed CACHE entries
- co_code now returns deoptimized bytes, _co_code_adaptive returns current bytes
- Marshal serialization uses original_bytes() instead of raw transmute

- cache_entries() returns correct count for instrumented opcodes
- deoptimize() maps instrumented opcodes back to base
- quicken() skips adaptive counter for instrumented opcodes
- instrument_code Phase 3 deoptimizes specialized opcodes and
  clears CACHE entries to prevent stale pointer dereferences

- Add bounds checks to read_cache_u16/u32/u64
- Fix quicken() aliasing UB by using &mut directly
- Add JumpBackwardJit/JumpBackwardNoJit to deoptimize()
- Guard can_specialize_call with NEWLOCALS flag check
- Use compare_exchange_weak for version tag to prevent wraparound
- Propagate dict lookup errors in LoadAttrMethodWithValues
- Apply adaptive backoff on version tag assignment failure
- Remove duplicate imports in frame.rs

This was referenced

Mar 3, 2026

youknowone added a commit to youknowone/RustPython that referenced this pull request

Mar 22, 2026

…Python#7292)

* Implement LOAD_ATTR inline caching with adaptive specialization

Add type version counter (tp_version_tag) to PyType with subclass
invalidation cascade. Add cache read/write methods (u16/u32/u64)
to CodeUnits. Implement adaptive specialization in load_attr that
replaces the opcode with specialized variants on first execution:

- LoadAttrMethodNoDict: cached method lookup for slotted types
- LoadAttrMethodWithValues: cached method with dict shadow check
- LoadAttrInstanceValue: direct dict lookup skipping descriptors

Specialized opcodes guard on type_version_tag and deoptimize back
to generic LOAD_ATTR with backoff counter on cache miss.

* Add BINARY_OP and CALL adaptive specialization

BINARY_OP: Specialize int add/subtract/multiply and float
add/subtract/multiply with type guards and deoptimization.

CALL: Add func_version to PyFunction, specialize simple
function calls (CallPyExactArgs, CallBoundMethodExactArgs)
with invoke_exact_args fast path that skips FuncArgs
allocation and fill_locals_from_args.

* Lazy quickening for adaptive specialization counters

Move counter initialization from compile-time to RESUME execution,
matching CPython's _PyCode_Quicken pattern. Store counter in CACHE
entry's arg byte to preserve op=Instruction::Cache for dis/JIT.
Add PyCode.quickened flag for one-time initialization.

* Add Instruction::deoptimize() and CodeUnits::original_bytes()

- deoptimize() maps specialized opcodes back to their base adaptive variant
- original_bytes() produces deoptimized bytecode with zeroed CACHE entries
- co_code now returns deoptimized bytes, _co_code_adaptive returns current bytes
- Marshal serialization uses original_bytes() instead of raw transmute

* Fix monitoring and specialization interaction

- cache_entries() returns correct count for instrumented opcodes
- deoptimize() maps instrumented opcodes back to base
- quicken() skips adaptive counter for instrumented opcodes
- instrument_code Phase 3 deoptimizes specialized opcodes and
  clears CACHE entries to prevent stale pointer dereferences

* Address review: bounds checks, UB fix, version overflow, error handling

- Add bounds checks to read_cache_u16/u32/u64
- Fix quicken() aliasing UB by using &mut directly
- Add JumpBackwardJit/JumpBackwardNoJit to deoptimize()
- Guard can_specialize_call with NEWLOCALS flag check
- Use compare_exchange_weak for version tag to prevent wraparound
- Propagate dict lookup errors in LoadAttrMethodWithValues
- Apply adaptive backoff on version tag assignment failure
- Remove duplicate imports in frame.rs