gh-144586: Use CPU-specific instructions for `_Py_yield` (AArch64 only) by dpdani · Pull Request #149784 · python/cpython

This PR adds a nano_delay function so that _Py_yield no longer relies on the OS scheduler when backing off from a contended mutex. Only AArch64 code paths have been added.
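As a rough illustration of delaying without involving the scheduler, the sketch below busy-waits on a tick counter. It is not the code from this PR: every `demo_*` name is hypothetical, the choice of the `CNTVCT_EL0` counter and the `yield` hint is an assumption, and it presumes a GCC or Clang compiler and an OS that lets user space read the counter (as Linux does). On other architectures it falls back to C11 `timespec_get` so the example still compiles.

```c
#include <stdint.h>
#include <time.h>

/* All demo_* names are hypothetical; this is not the PR's implementation. */

/* Current value of a tick counter. */
static inline uint64_t demo_ticks(void)
{
#if defined(__aarch64__)
    uint64_t t;
    /* CNTVCT_EL0 is the virtual count of the Arm generic timer.
       The isb keeps the read from being executed early. */
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(t) : : "memory");
    return t;
#else
    /* Fallback for other targets: wall-clock nanoseconds (not monotonic). */
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
#endif
}

/* Frequency of demo_ticks() in Hz. */
static inline uint64_t demo_ticks_per_second(void)
{
#if defined(__aarch64__)
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
#else
    return UINT64_C(1000000000);
#endif
}

/* Hint that the core is spinning; a plain compiler barrier elsewhere. */
static inline void demo_cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ volatile("yield" : : : "memory");
#else
    __asm__ volatile("" : : : "memory");
#endif
}

/* Spin for at least `ns` nanoseconds without sleeping in the kernel.
   Intended for short delays; ns * frequency must fit in 64 bits. */
static void demo_nano_delay(uint64_t ns)
{
    uint64_t start = demo_ticks();
    uint64_t wait = ns * demo_ticks_per_second() / UINT64_C(1000000000);
    while (demo_ticks() - start < wait) {
        demo_cpu_relax();
    }
}
```

A caller would use something like `demo_nano_delay(200)` between lock attempts. The thread stays on its core for the whole delay, which only pays off when the expected wait is much shorter than a context switch.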

The _PyMutex_LockTimed function was updated to use an exponential backoff, which improves acquisition throughput on highly contended locks.
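The general shape of exponential backoff can be sketched as follows. This is a minimal stand-alone example, not the `_PyMutex_LockTimed` code: the `backoff_*` names, the spin bounds, and the attempt limit are all assumptions made for illustration, and the sketch leaves out timeouts and any blocking fallback.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bounds, chosen only for illustration. */
#define BACKOFF_MIN_SPINS 4u
#define BACKOFF_MAX_SPINS 1024u

typedef struct {
    atomic_flag held;
} backoff_lock;

/* One unit of delay between attempts. */
static inline void backoff_pause(void)
{
#if defined(__aarch64__)
    __asm__ volatile("yield" : : : "memory");
#else
    __asm__ volatile("" : : : "memory");   /* compiler barrier only */
#endif
}

/* Double the spin count, capped at BACKOFF_MAX_SPINS. */
static inline unsigned backoff_next(unsigned spins)
{
    unsigned doubled = spins * 2u;
    return doubled > BACKOFF_MAX_SPINS ? BACKOFF_MAX_SPINS : doubled;
}

/* Try to take the lock up to max_attempts times. After each failed
   attempt, pause for `spins` iterations and then double `spins`, so
   contending threads retry less and less often. */
static bool backoff_lock_acquire(backoff_lock *lock, unsigned max_attempts)
{
    unsigned spins = BACKOFF_MIN_SPINS;
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        if (!atomic_flag_test_and_set_explicit(&lock->held,
                                               memory_order_acquire)) {
            return true;    /* flag was clear: we now hold the lock */
        }
        for (unsigned i = 0; i < spins; i++) {
            backoff_pause();
        }
        spins = backoff_next(spins);
    }
    return false;
}

static void backoff_lock_release(backoff_lock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

Growing the delay after each failure reduces how often waiting threads hit the shared cache line, which is the usual reason backoff helps throughput under heavy contention.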

This PR omits the nano_delay implementation based on the wfet instruction because it requires runtime dispatching: not all AArch64 CPUs implement this feature. Compiler macros are not a sufficient check, since they describe the build target rather than the CPU the binary actually runs on. A follow-up PR could add it.
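For reference, one way such a runtime check might look on Linux is to query the ELF hardware capabilities. The sketch below is an assumption about how a follow-up could detect the feature, not something this PR contains: `demo_cpu_has_wfxt` is a hypothetical name, the check only works where the installed headers define `HWCAP2_WFXT`, and every other platform is simply reported as unsupported.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>    /* getauxval, AT_HWCAP2 */
#include <asm/hwcap.h>   /* HWCAP2_* bits from the kernel headers */
#endif

/* Hypothetical helper: true if the running CPU and kernel report
   support for the WFET/WFIT instructions (FEAT_WFxT). */
static bool demo_cpu_has_wfxt(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP2_WFXT)
    return (getauxval(AT_HWCAP2) & HWCAP2_WFXT) != 0;
#else
    /* Unknown platform or headers too old to know the bit:
       report unsupported so callers use the portable path. */
    return false;
#endif
}
```

A dispatcher would typically call this once at startup, cache the result, and pick between a wfet-based delay and the portable one.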

This change shows performance improvements on the lockbench benchmark, tested with the following parameters:

  • low contention: --work-inside 5 --work-outside 50 --num-locks 24 --acquisitions 3 --random-locks
  • high contention: --work-inside 5 --work-outside 5

[Figures: lockbench results for graviton_3_high, graviton_3_low, m4_max_high, and m4_max_low]

The run on the Graviton 3 machine, which has a high core count, showed major scalability bottlenecks past a certain number of processors; this was also reproduced on a number of other machines. The problem also exists on main, and I will open a separate issue for it.