gh-144586: Use CPU-specific instructions for `_Py_yield` (AArch64 only) by dpdani · Pull Request #149784 · python/cpython
This PR adds a nano_delay function to avoid relying on the OS scheduler when using _Py_yield to back off from a contended mutex. Only AArch64 code paths have been added.
The _PyMutex_LockTimed function was updated to use an exponential backoff, which improves acquisition throughput on highly contended locks.
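To make the backoff idea concrete, here is a minimal sketch of a spin lock that doubles its wait between failed attempts. It is not the code from this PR: `cpu_relax`, `next_backoff`, `try_lock_with_backoff`, `spin_unlock`, and the spin limits are all hypothetical names and values chosen for illustration, and the sketch assumes GCC or Clang inline assembly syntax.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU-level pause hint. The instruction used by the PR is not
 * shown here; "isb" and "pause" are common choices, assumed for this sketch. */
static inline void cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    atomic_signal_fence(memory_order_seq_cst);  /* compiler barrier only */
#endif
}

/* Illustrative limits, not values taken from CPython. */
enum { BACKOFF_MIN_SPINS = 1, BACKOFF_MAX_SPINS = 1024 };

/* Double the spin count, saturating at BACKOFF_MAX_SPINS. */
static inline uint32_t next_backoff(uint32_t spins)
{
    return spins >= BACKOFF_MAX_SPINS / 2 ? BACKOFF_MAX_SPINS : spins * 2;
}

/* Try to acquire the lock, waiting exponentially longer after each failure.
 * Returns false once max_attempts acquisition attempts have failed. */
static bool try_lock_with_backoff(atomic_flag *lock, int max_attempts)
{
    assert(max_attempts > 0);
    uint32_t spins = BACKOFF_MIN_SPINS;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (!atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
            return true;  /* flag was clear: we now own the lock */
        }
        for (uint32_t i = 0; i < spins; i++) {
            cpu_relax();  /* stay on the CPU instead of calling the scheduler */
        }
        spins = next_backoff(spins);
    }
    return false;
}

/* Release the lock taken by try_lock_with_backoff. */
static void spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

Capping the spin count keeps the worst-case wait bounded, while the doubling spreads out retries from competing threads so they hit the contended cache line less often.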
The nano_delay implementation based on the wfet instruction is omitted from this PR because it requires runtime dispatching: not all AArch64 CPUs implement the instruction, so a compile-time check with compiler macros would not be sufficient. A follow-up PR could add it.
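For context, runtime dispatching of this kind could look roughly like the sketch below, which is not part of this PR. It assumes Linux on AArch64, where `getauxval(AT_HWCAP2)` reports CPU features and recent headers define `HWCAP2_WFXT`; on any other platform the check simply reports false. The function names and the `unsigned iterations` signature are hypothetical, and the wfet variant is a stub that spins, since emitting the real instruction needs assembler support for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  /* getauxval, AT_HWCAP2 */
#endif

/* Hypothetical signature; the PR's nano_delay may differ. */
typedef void (*nano_delay_fn)(unsigned iterations);

/* Ask the kernel at runtime whether the CPU implements WFET/WFIT. */
static bool cpu_has_wfxt(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP2_WFXT)
    return (getauxval(AT_HWCAP2) & HWCAP2_WFXT) != 0;
#else
    return false;  /* feature unknown or unsupported: assume absent */
#endif
}

/* Baseline delay: a plain busy loop that the compiler may not remove. */
static void nano_delay_spin(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++) {
    }
}

/* Stand-in for a wfet-based delay. It only spins here; a real version
 * would execute the wfet instruction with a deadline. */
static void nano_delay_wfet_stub(unsigned iterations)
{
    nano_delay_spin(iterations);
}

/* Pick the implementation that matches the detected CPU feature. */
static nano_delay_fn select_nano_delay(bool has_wfxt)
{
    return has_wfxt ? nano_delay_wfet_stub : nano_delay_spin;
}

/* Resolve the implementation on first use, then call it. Not thread-safe
 * as written; a real implementation would resolve once at startup. */
static void nano_delay_dispatch(unsigned iterations)
{
    static nano_delay_fn impl = NULL;
    if (impl == NULL) {
        impl = select_nano_delay(cpu_has_wfxt());
    }
    assert(impl != NULL);
    impl(iterations);
}
```

The point of the function pointer is that the same binary runs on CPUs with and without the feature, which a compile-time macro check cannot guarantee.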
This change shows performance improvements on the lockbench benchmark, tested with the following parameters:
- low contention: `--work-inside 5 --work-outside 50 --num-locks 24 --acquisitions 3 --random-locks`
- high contention: `--work-inside 5 --work-outside 5`
Runs on the Graviton 3 machine, which has a high core count, exhibited major scalability bottlenecks past a certain number of processors, and the same behavior was reproduced on a number of other machines. The problem is also present on main, and I will open a separate issue for it.



