gh-144586: Use CPU-specific instructions for `_Py_yield` (AArch64 only) by dpdani · Pull Request #149784 · python/cpython

This PR adds a nano_delay function so that _Py_yield can back off from a contended mutex without relying on the OS scheduler. Only AArch64 code paths have been added.
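To illustrate the idea of a short delay that stays on the CPU, here is a minimal sketch, not the code from this PR. The names cpu_relax, now_ns and nano_delay_sketch are hypothetical, and the choice of isb as the AArch64 pause instruction is an assumption; the instruction the PR actually uses is not stated in this description. The portable clock_gettime call stands in for whatever time source a real implementation would use, which might instead be a hardware counter.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std=c11 */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: tell the core we are spinning, without a syscall.
 * The per-architecture instructions are assumptions for illustration. */
static inline void cpu_relax(void)
{
#if defined(__GNUC__) && defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("pause" ::: "memory");
#elif defined(__GNUC__)
    __asm__ volatile("" ::: "memory");  /* compiler barrier only */
#endif
}

/* Hypothetical helper: monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for at least `ns` nanoseconds without yielding to the scheduler. */
static void nano_delay_sketch(uint64_t ns)
{
    uint64_t deadline = now_ns() + ns;
    while (now_ns() < deadline) {
        cpu_relax();
    }
}
```

The thread never calls sched_yield or sleeps, so the wait length is bounded by the requested delay rather than by when the scheduler next runs the thread.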

The _PyMutex_LockTimed function was updated to use an exponential backoff, which improves acquisition throughput on highly contended locks.
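As a rough illustration of exponential backoff on a contended lock, the sketch below uses a plain C11 spin lock. It is not _PyMutex_LockTimed: it has no timeout and never parks the thread. The type sketch_lock, the helpers, and the BACKOFF_MIN and BACKOFF_MAX bounds are all hypothetical values chosen for the example.

```c
#include <stdatomic.h>

/* Hypothetical bounds on the number of pause iterations between attempts. */
#define BACKOFF_MIN 1u
#define BACKOFF_MAX 1024u

typedef struct {
    atomic_flag locked;
} sketch_lock;

/* Hypothetical helper: CPU-level pause; the instructions are assumptions. */
static inline void cpu_relax(void)
{
#if defined(__GNUC__) && defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("pause" ::: "memory");
#elif defined(__GNUC__)
    __asm__ volatile("" ::: "memory");  /* compiler barrier only */
#endif
}

/* Double the wait, saturating at BACKOFF_MAX. */
static unsigned next_backoff(unsigned cur)
{
    return (cur >= BACKOFF_MAX / 2) ? BACKOFF_MAX : cur * 2;
}

static void sketch_lock_acquire(sketch_lock *l)
{
    unsigned spins = BACKOFF_MIN;
    /* test_and_set returns the previous value: true means someone holds it. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        for (unsigned i = 0; i < spins; i++) {
            cpu_relax();
        }
        spins = next_backoff(spins);
    }
}

static void sketch_lock_release(sketch_lock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Each failed attempt doubles the wait before the next one, so waiting threads touch the lock's cache line less often as contention persists, which leaves more bandwidth for the thread that holds the lock.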

This PR omits a nano_delay implementation based on the wfet instruction because it would require runtime dispatch: not all AArch64 CPUs implement the instruction, so a compile-time macro check would not be sufficient. A follow-up PR could add it.
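For context, runtime dispatch of this kind could look roughly like the sketch below. It assumes Linux on AArch64, where the kernel reports the feature through getauxval(AT_HWCAP2) and the HWCAP2_WFXT bit when the installed headers define it; on any other platform the check simply reports false. The names cpu_has_wfxt, delay_fn, delay_generic, delay_wfet and select_delay are hypothetical, and the two delay functions are empty stand-ins rather than real implementations.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#  include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */
#  include <asm/hwcap.h>  /* HWCAP2_WFXT, if the kernel headers are recent enough */
#endif

/* Ask the running kernel whether this CPU implements WFET/WFIT.
 * A compile-time macro only reflects the build target, not the
 * machine the binary eventually runs on. */
static bool cpu_has_wfxt(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP2_WFXT)
    return (getauxval(AT_HWCAP2) & HWCAP2_WFXT) != 0;
#else
    return false;
#endif
}

typedef void (*delay_fn)(uint64_t ns);

/* Stand-in for a delay that works on every AArch64 CPU. */
static void delay_generic(uint64_t ns) { (void)ns; }

/* Stand-in for a wfet-based delay; the instruction itself is omitted here. */
static void delay_wfet(uint64_t ns) { (void)ns; }

/* Pick an implementation once, at startup, and call through the pointer. */
static delay_fn select_delay(void)
{
    return cpu_has_wfxt() ? delay_wfet : delay_generic;
}
```

Selecting the function once and caching the pointer keeps the feature check off the lock's hot path.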

This change shows performance improvements on the lockbench benchmark, tested with the following parameters:

low contention: --work-inside 5 --work-outside 50 --num-locks 24 --acquisitions 3 --random-locks
high contention: --work-inside 5 --work-outside 5

Runs on the Graviton 3 machine, which has a high core count, hit major scalability bottlenecks past a certain number of processors, and the same behavior was reproduced on several other machines. The problem also exists on main, and I will open a separate issue for it.

Issue: Improve _Py_yield to use light weight cpu instruction #144586