Specialized ops by youknowone · Pull Request #7322 · RustPython/RustPython

youknowone
youknowone added a commit to youknowone/RustPython that referenced this pull request
* Add CALL_ALLOC_AND_ENTER_INIT specialization

Optimizes user-defined class instantiation MyClass(args...)
when tp_new == object.__new__ and __init__ is a simple
PyFunction. Allocates the object directly and calls __init__
via invoke_exact_args, bypassing the generic type.__call__
dispatch path.

* Invalidate JIT cache when __code__ is reassigned

Change jitted_code from OnceCell to PyMutex<Option<CompiledCode>> so
it can be cleared on __code__ assignment. The setter now sets the
cached JIT code to None to prevent executing stale machine code.

* Atomic operations for specialization cache

- range iterator: deduplicate fast_next/next_fast
- Replace raw pointer reads/writes in CodeUnits with atomic
  operations (AtomicU8/AtomicU16) for thread safety
- Add read_op (Acquire), read_arg (Relaxed), compare_exchange_op
- Use Release ordering in replace_op to synchronize cache writes
- Dispatch loop reads opcodes atomically via read_op/read_arg
- Fix adaptive counter access: use read/write_adaptive_counter
  instead of read/write_cache_u16 (was reading wrong bytes)
- Add pre-check guards to all specialize_* functions to prevent
  concurrent specialization races
- Move modified() before attribute changes in type.__setattr__
  to prevent use-after-free of cached descriptors
- Use SeqCst ordering in modified() for version invalidation
- Add Release fence after quicken() initialization

* Fix slot wrapper override for inherited attributes

For __getattribute__: only use getattro_wrapper when the type
itself defines the attribute; otherwise inherit native slot from
base class via MRO.

For __setattr__/__delattr__: only store setattro_wrapper when
the type has its own __setattr__ or __delattr__; otherwise keep
the inherited base slot.

* Fix StoreAttrSlot cache overflow corrupting next instruction

write_cache_u32 at cache_base+3 writes 2 code units (positions 3 and 4),
but STORE_ATTR only has 4 cache entries (positions 0-3). This overwrites
the next instruction with the upper 16 bits of the slot offset.

Changed to write_cache_u16/read_cache_u16 since member descriptor offsets
fit within u16 (max 65535 bytes).

* Exclude method_descriptor from has_python_cmp check

has_python_cmp incorrectly treated method_descriptor as Python-level
comparison methods, causing richcompare slot to use wrapper dispatch
instead of inheriting the native slot.

* Fix CompareOpFloat NaN handling

partial_cmp returns None for NaN comparisons. is_some_and incorrectly
returned false for all NaN comparisons, but NaN != x should be true
per IEEE 754 semantics.

* Fix invoke_exact_args borrow in CallAllocAndEnterInit

* Distinguish Python method vs not-found in slot MRO lookup

Change lookup_slot_in_mro to return a 3-state SlotLookupResult
enum (NativeSlot/PythonMethod/NotFound) instead of Option<T>.

Previously, both "found a Python-level method" and "found nothing"
returned None, causing incorrect slot inheritance. For example,
class Test(Mixin, TestCase) would inherit object.slot_init from
Mixin via inherit_from_mro instead of using init_wrapper to
dispatch TestCase.__init__.

Apply this fix consistently to all slot update sites:
update_main_slot!, update_sub_slot!, TpGetattro, TpSetattro,
TpDescrSet, TpHash, TpRichcompare, SqAssItem, MpAssSubscript.

* Extract specialization helper functions to reduce boilerplate

- deoptimize() / deoptimize_at(): replace specialized op with base op
- adaptive(): decrement warmup counter or call specialize function
- commit_specialization(): replace op on success, backoff on failure
- execute_binary_op_int() / execute_binary_op_float(): typed binary ops

Removes 10 duplicate deoptimize_* functions, consolidates 13 adaptive
counter blocks, 6 binary op handlers, and 7 specialize tail patterns.
Also replaces inline deopt blocks in LoadAttr/StoreAttr handlers.

* Improve specialization guards and fix mark_stacks

- CONTAINS_OP_SET: add frozenset support in handler and specialize
- TO_BOOL_ALWAYS_TRUE: cache type version instead of checking slots
- LOAD_GLOBAL_BUILTIN: cache builtins dict version alongside globals
- mark_stacks: deoptimize specialized opcodes for correct reachability

* Auto-format: cargo fmt --all

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>