Specialized ops by youknowone · Pull Request #7322 · RustPython/RustPython
youknowone added a commit to youknowone/RustPython that referenced this pull request
* Add CALL_ALLOC_AND_ENTER_INIT specialization

Optimizes user-defined class instantiation MyClass(args...) when tp_new == object.__new__ and __init__ is a simple PyFunction. Allocates the object directly and calls __init__ via invoke_exact_args, bypassing the generic type.__call__ dispatch path.

* Invalidate JIT cache when __code__ is reassigned

Change jitted_code from OnceCell to PyMutex<Option<CompiledCode>> so it can be cleared on __code__ assignment. The setter now sets the cached JIT code to None to prevent executing stale machine code.

* Atomic operations for specialization cache

- range iterator: deduplicate fast_next/next_fast
- Replace raw pointer reads/writes in CodeUnits with atomic operations (AtomicU8/AtomicU16) for thread safety
- Add read_op (Acquire), read_arg (Relaxed), compare_exchange_op
- Use Release ordering in replace_op to synchronize cache writes
- Dispatch loop reads opcodes atomically via read_op/read_arg
- Fix adaptive counter access: use read/write_adaptive_counter instead of read/write_cache_u16 (was reading wrong bytes)
- Add pre-check guards to all specialize_* functions to prevent concurrent specialization races
- Move modified() before attribute changes in type.__setattr__ to prevent use-after-free of cached descriptors
- Use SeqCst ordering in modified() for version invalidation
- Add Release fence after quicken() initialization

* Fix slot wrapper override for inherited attributes

For __getattribute__: only use getattro_wrapper when the type itself defines the attribute; otherwise inherit native slot from base class via MRO.

For __setattr__/__delattr__: only store setattro_wrapper when the type has its own __setattr__ or __delattr__; otherwise keep the inherited base slot.

* Fix StoreAttrSlot cache overflow corrupting next instruction

write_cache_u32 at cache_base+3 writes 2 code units (positions 3 and 4), but STORE_ATTR only has 4 cache entries (positions 0-3). This overwrites the next instruction with the upper 16 bits of the slot offset. Changed to write_cache_u16/read_cache_u16 since member descriptor offsets fit within u16 (max 65535 bytes).

* Exclude method_descriptor from has_python_cmp check

has_python_cmp incorrectly treated method_descriptor as Python-level comparison methods, causing richcompare slot to use wrapper dispatch instead of inheriting the native slot.

* Fix CompareOpFloat NaN handling

partial_cmp returns None for NaN comparisons. is_some_and incorrectly returned false for all NaN comparisons, but NaN != x should be true per IEEE 754 semantics.

* Fix invoke_exact_args borrow in CallAllocAndEnterInit

* Distinguish Python method vs not-found in slot MRO lookup

Change lookup_slot_in_mro to return a 3-state SlotLookupResult enum (NativeSlot/PythonMethod/NotFound) instead of Option<T>. Previously, both "found a Python-level method" and "found nothing" returned None, causing incorrect slot inheritance.

For example, class Test(Mixin, TestCase) would inherit object.slot_init from Mixin via inherit_from_mro instead of using init_wrapper to dispatch TestCase.__init__.

Apply this fix consistently to all slot update sites: update_main_slot!, update_sub_slot!, TpGetattro, TpSetattro, TpDescrSet, TpHash, TpRichcompare, SqAssItem, MpAssSubscript.

* Extract specialization helper functions to reduce boilerplate

- deoptimize() / deoptimize_at(): replace specialized op with base op
- adaptive(): decrement warmup counter or call specialize function
- commit_specialization(): replace op on success, backoff on failure
- execute_binary_op_int() / execute_binary_op_float(): typed binary ops

Removes 10 duplicate deoptimize_* functions, consolidates 13 adaptive counter blocks, 6 binary op handlers, and 7 specialize tail patterns. Also replaces inline deopt blocks in LoadAttr/StoreAttr handlers.

* Improve specialization guards and fix mark_stacks

- CONTAINS_OP_SET: add frozenset support in handler and specialize
- TO_BOOL_ALWAYS_TRUE: cache type version instead of checking slots
- LOAD_GLOBAL_BUILTIN: cache builtins dict version alongside globals
- mark_stacks: deoptimize specialized opcodes for correct reachability

* Auto-format: cargo fmt --all

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
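
Illustration for "Fix CompareOpFloat NaN handling": a minimal sketch of why `is_some_and` on the result of `partial_cmp` gives the wrong answer for `!=`. The `CmpOp` enum and the `compare_buggy`/`compare_fixed` functions are hypothetical stand-ins written for this note, not the code in the PR; only the `partial_cmp`/`is_some_and` behaviour comes from the Rust standard library.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for the comparison kinds a float compare handles.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CmpOp {
    Lt,
    Le,
    Eq,
    Ne,
    Gt,
    Ge,
}

/// Maps an ordering of two comparable floats to the operator's result.
fn eval(op: CmpOp, ord: Ordering) -> bool {
    match op {
        CmpOp::Lt => ord == Ordering::Less,
        CmpOp::Le => ord != Ordering::Greater,
        CmpOp::Eq => ord == Ordering::Equal,
        CmpOp::Ne => ord != Ordering::Equal,
        CmpOp::Gt => ord == Ordering::Greater,
        CmpOp::Ge => ord != Ordering::Less,
    }
}

/// Buggy shape: `partial_cmp` yields `None` when either operand is NaN,
/// so `is_some_and` returns false for every operator, including `!=`.
pub fn compare_buggy(op: CmpOp, a: f64, b: f64) -> bool {
    a.partial_cmp(&b).is_some_and(|ord| eval(op, ord))
}

/// Corrected shape: unordered operands make only `!=` true.
pub fn compare_fixed(op: CmpOp, a: f64, b: f64) -> bool {
    match a.partial_cmp(&b) {
        Some(ord) => eval(op, ord),
        None => op == CmpOp::Ne,
    }
}
```

With these definitions, `compare_buggy(CmpOp::Ne, f64::NAN, 1.0)` is false while `compare_fixed(CmpOp::Ne, f64::NAN, 1.0)` is true; all other operators stay false for NaN operands in both versions.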
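
Illustration for "Fix StoreAttrSlot cache overflow corrupting next instruction": a toy model of the layout described above, assuming each instruction and each cache entry occupies one 16-bit code unit and that a 32-bit write stores its low half first. The helper names mirror the ones in the commit message, but their signatures, the `sample_units` function, and the placeholder opcode values are assumptions made for this sketch.

```rust
/// STORE_ATTR has 4 cache entries (relative positions 0-3), per the commit message.
pub const STORE_ATTR_CACHE_ENTRIES: usize = 4;

/// Hypothetical placeholder encodings, not real opcode values.
pub const STORE_ATTR_UNIT: u16 = 0x00AA;
pub const NEXT_INSTR_UNIT: u16 = 0x00BB;

/// Writes one code unit.
pub fn write_cache_u16(units: &mut [u16], index: usize, value: u16) {
    units[index] = value;
}

/// Reads one code unit.
pub fn read_cache_u16(units: &[u16], index: usize) -> u16 {
    units[index]
}

/// Writes two consecutive code units: low 16 bits, then high 16 bits.
pub fn write_cache_u32(units: &mut [u16], index: usize, value: u32) {
    units[index] = value as u16;
    units[index + 1] = (value >> 16) as u16;
}

/// Builds [STORE_ATTR, cache0, cache1, cache2, cache3, NEXT] and returns
/// the code units together with cache_base (the index of cache0).
pub fn sample_units() -> (Vec<u16>, usize) {
    let mut units = vec![0u16; 1 + STORE_ATTR_CACHE_ENTRIES + 1];
    units[0] = STORE_ATTR_UNIT;
    units[1 + STORE_ATTR_CACHE_ENTRIES] = NEXT_INSTR_UNIT;
    (units, 1)
}
```

In this model, `write_cache_u32(&mut units, cache_base + 3, offset)` touches relative positions 3 and 4, and position 4 is the next instruction; `write_cache_u16` at the same index stays inside the cache.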
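
Illustration for "Distinguish Python method vs not-found in slot MRO lookup": a simplified sketch of why a three-state result matters. Only the enum name and its variants come from the commit message. The `Attr` type, the representation of the MRO as a slice, the `u32` slot payload, and both lookup functions are hypothetical simplifications, not RustPython's actual signatures.

```rust
/// Three-state result named in the commit message; payload type is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SlotLookupResult<T> {
    NativeSlot(T),
    PythonMethod,
    NotFound,
}

/// Hypothetical: what a class in the MRO defines for a given dunder name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Attr {
    /// A native slot, identified here by an arbitrary number.
    Native(u32),
    /// A Python-level method that needs wrapper dispatch.
    Python,
}

/// Walks the MRO front to back; `None` means the class does not define the name.
pub fn lookup_slot_sketch(mro: &[Option<Attr>]) -> SlotLookupResult<u32> {
    for entry in mro {
        match entry {
            Some(Attr::Native(slot)) => return SlotLookupResult::NativeSlot(*slot),
            Some(Attr::Python) => return SlotLookupResult::PythonMethod,
            None => continue,
        }
    }
    SlotLookupResult::NotFound
}

/// The old two-state shape: "Python method" and "nothing found" both become None.
pub fn lookup_two_state_sketch(mro: &[Option<Attr>]) -> Option<u32> {
    match lookup_slot_sketch(mro) {
        SlotLookupResult::NativeSlot(slot) => Some(slot),
        SlotLookupResult::PythonMethod | SlotLookupResult::NotFound => None,
    }
}
```

For an MRO shaped like the `class Test(Mixin, TestCase)` example, `[None, None, Some(Attr::Python), Some(Attr::Native(0))]`, the three-state lookup reports `PythonMethod`, so a caller can choose the wrapper. The two-state version returns `None`, which is indistinguishable from an MRO that defines nothing.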
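
Illustration for "Atomic operations for specialization cache": a sketch of the Acquire/Release pairing the bullet list describes, using only `std::sync::atomic`. The `Unit` struct and its layout are hypothetical. The method names echo the commit message, but the bodies and the orderings chosen for `compare_exchange_op` are assumptions rather than the PR's implementation.

```rust
use std::sync::atomic::{AtomicU16, AtomicU8, Ordering};

/// Hypothetical instruction cell: one opcode byte plus one 16-bit cache entry.
pub struct Unit {
    op: AtomicU8,
    cache: AtomicU16,
}

impl Unit {
    pub fn new(op: u8) -> Self {
        Unit {
            op: AtomicU8::new(op),
            cache: AtomicU16::new(0),
        }
    }

    /// Acquire load: a reader that sees a specialized opcode also sees the
    /// cache writes that happened before the matching Release store.
    pub fn read_op(&self) -> u8 {
        self.op.load(Ordering::Acquire)
    }

    /// Cache entries use Relaxed accesses; publication goes through the opcode.
    pub fn write_cache(&self, value: u16) {
        self.cache.store(value, Ordering::Relaxed);
    }

    pub fn read_cache(&self) -> u16 {
        self.cache.load(Ordering::Relaxed)
    }

    /// Release store: publishes earlier cache writes along with the new opcode.
    pub fn replace_op(&self, new_op: u8) {
        self.op.store(new_op, Ordering::Release);
    }

    /// Swaps the opcode only if it still has the expected value, so two
    /// threads racing to specialize cannot both succeed.
    pub fn compare_exchange_op(&self, expected: u8, new_op: u8) -> bool {
        self.op
            .compare_exchange(expected, new_op, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```

The intended usage order is: write the cache entries, then call `replace_op` (or `compare_exchange_op`) to install the specialized opcode; the dispatch loop calls `read_op` first and reads the cache only after seeing the specialized opcode.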