GH-135904: Improve the JIT's performance on macOS by brandtbucher · Pull Request #136528 · python/cpython

brandtbucher


This PR makes a few minor tweaks to the JIT that together give a 1.7% overall speedup on macOS:

- Our AArch64 code doesn't need to be 8-byte aligned; only the data does. Currently, we guarantee the data's alignment by padding all code anyway, since the data follows immediately after it. This is wasteful, since it means about half of all stencils end in a `nop`. Instead, don't pad any stencils, and just align the data when it's compiled. 🤦🏼
- The textual assembly "optimizer" pass has a bug where it interprets lines commented with `;` as instructions. By recognizing these comment lines, we can remove more zero-length jumps at the end of stencils. 🤦🏼
- During this same pass, we can represent the address of the next instruction (the end of the template, or the `_JIT_CONTINUE` label) as a "local" label, which allows the assembler to resolve it at compile time and encode it more efficiently. There's a special (platform-dependent) label prefix to signal this.
- Finally, instead of declaring the jump targets (`_JIT_CONTINUE`, `_JIT_ERROR_TARGET`, and `_JIT_JUMP_TARGET`) as `extern` symbols, just declare them as local functions. This results in more efficient jumps, and also lets us remove a somewhat hacky preprocessing step for the textual assembly on Windows that forced these efficient jumps.
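The "about half" figure in the first point follows from AArch64's fixed 4-byte instructions: a stencil's code size is always a multiple of 4, so padding it to 8 adds either nothing or exactly one 4-byte `nop`, depending on whether its instruction count is even or odd. The sketch below compares the two schemes under a simplified model that is an assumption, not CPython's actual layout code: a compiled trace is all of its stencils' code followed by their data. The function names and stencil sizes are hypothetical.

```python
# Simplified model (an assumption, not CPython's actual layout code):
# a compiled trace is all of its stencils' code, followed by their data.
INSTRUCTION_SIZE = 4  # every AArch64 instruction is 4 bytes wide
DATA_ALIGNMENT = 8    # the data that follows the code must be 8-byte aligned


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to a multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def pad_every_stencil(code_sizes: list[int]) -> tuple[int, int]:
    """Old scheme: pad each stencil's code to 8 bytes.

    Returns (total code bytes, number of nops inserted between stencils).
    """
    total = nops = 0
    for size in code_sizes:
        padded = align_up(size, DATA_ALIGNMENT)
        nops += (padded - size) // INSTRUCTION_SIZE
        total += padded
    return total, nops


def align_data_once(code_sizes: list[int]) -> tuple[int, int]:
    """New scheme: concatenate unpadded code, then align where the data starts.

    Returns (offset of the data, padding bytes placed after the last stencil).
    """
    code_end = sum(code_sizes)
    data_start = align_up(code_end, DATA_ALIGNMENT)
    return data_start, data_start - code_end


sizes = [12, 16, 20, 12]  # hypothetical stencil code sizes, in bytes
print(pad_every_stencil(sizes))  # (72, 3)
print(align_data_once(sizes))    # (64, 4)
```

With these made-up sizes, the old scheme spends three `nop`s (12 bytes) between stencils, where they can end up on the executed path, while the new one needs at most a single 4-byte gap in front of the data.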
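A minimal sketch of the comment-aware cleanup and the local-label rewrite from the second and third points might look like the following. Every name in it (`optimize`, `strip_comment`, `is_instruction`) is hypothetical and not CPython's actual optimizer code. It assumes Apple's AArch64 assembler syntax, where `;` starts a comment and an `L` prefix marks an assembler-local label (ELF targets use `.L`), and it simplifies symbol names by ignoring the extra leading underscore Mach-O gives C symbols.

```python
import re

# Hypothetical sketch, not CPython's optimizer. Assumes Apple AArch64 syntax:
# ";" starts a comment and an "L" prefix marks an assembler-local label.
COMMENT = ";"
LOCAL_LABEL_PREFIX = "L"  # ELF targets would use ".L"
CONTINUE = "_JIT_CONTINUE"  # simplified: Mach-O would prepend another "_"
BRANCH = re.compile(r"b\s+(?P<target>[\w.$]+)")  # unconditional branch only


def strip_comment(line: str) -> str:
    """Return the code part of a line (naively ignores ";" inside strings)."""
    return line.split(COMMENT, 1)[0].strip()


def is_instruction(line: str) -> bool:
    """Blank lines, comments, labels, and directives are not instructions."""
    code = strip_comment(line)
    return bool(code) and not code.endswith(":") and not code.startswith(".")


def optimize(lines: list[str]) -> list[str]:
    lines = list(lines)
    # Find the last real instruction, skipping trailing comment lines.
    last = max(
        (i for i, line in enumerate(lines) if is_instruction(line)), default=None
    )
    # A final branch to _JIT_CONTINUE jumps to the very next byte: drop it.
    if last is not None:
        match = BRANCH.fullmatch(strip_comment(lines[last]))
        if match and match["target"] == CONTINUE:
            del lines[last]
    # Point remaining references at a local label defined at the end, so the
    # assembler can resolve them itself instead of leaving them to the linker.
    local = LOCAL_LABEL_PREFIX + CONTINUE
    lines = [
        re.sub(rf"\b{CONTINUE}\b", local, line) if is_instruction(line) else line
        for line in lines
    ]
    lines.append(f"{local}:")
    return lines


stencil = [
    "\tldr x0, [x1]",
    "\tcbz x0, _JIT_CONTINUE",
    "\tb _JIT_CONTINUE",
    "\t; -- End function",
]
for line in optimize(stencil):
    print(line)
```

If the trailing `; -- End function` line were treated as an instruction, as in the buggy pass, it would look like the stencil's last instruction and the zero-length `b _JIT_CONTINUE` would survive.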

This is amazing! I can't review it, but thanks for all your assembler wizardry on this.


Thanks. The PR looks great; thanks for making the JIT so much clearer to reason about.
