Hot-cold splitting for JIT stencils

Fidget-Spinner

Feature or enhancement

Proposal:

We have a textual assembly parser for the stencils, and it already knows which blocks are hot and which are cold. With that, it is not too hard to teach it to split the blocks into separate hot and cold sections.

Currently, this is the stencil for _BINARY_OP_ADD_INT:

    // _BINARY_OP_ADD_INT_r23.o:      file format elf64-x86-64
    // 
    // Disassembly of section .text:
    // 
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 55                            pushq   %rbp
    // 1: 48 83 ec 10                   subq    $0x10, %rsp
    // 5: 48 89 74 24 08                movq    %rsi, 0x8(%rsp)
    // a: 48 89 fb                      movq    %rdi, %rbx
    // d: 4c 89 fd                      movq    %r15, %rbp
    // 10: 4c 89 ff                      movq    %r15, %rdi
    // 13: 48 83 e7 fe                   andq    $-0x2, %rdi
    // 17: 48 89 de                      movq    %rbx, %rsi
    // 1a: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00             callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01                   cmpq    $0x1, %rax
    // 28: 75 15                         jne     0x3f <_JIT_ENTRY+0x3f>
    // 2a: 49 89 ef                      movq    %rbp, %r15
    // 2d: 48 89 df                      movq    %rbx, %rdi
    // 30: 48 8b 74 24 08                movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10                   addq    $0x10, %rsp
    // 39: 5d                            popq    %rbp
    // 3a: e9 00 00 00 00                jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4
    // 3f: 49 89 c7                      movq    %rax, %r15
    // 42: 48 89 ef                      movq    %rbp, %rdi
    // 45: 48 89 de                      movq    %rbx, %rsi
    // 48: 48 83 c4 10                   addq    $0x10, %rsp
    // 4c: 5d                            popq    %rbp

With hot-cold splitting, it will be split into:

_BINARY_OP_ADD_INT_r23.HOT:
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 55                            pushq   %rbp
    // 1: 48 83 ec 10                   subq    $0x10, %rsp
    // 5: 48 89 74 24 08                movq    %rsi, 0x8(%rsp)
    // a: 48 89 fb                      movq    %rdi, %rbx
    // d: 4c 89 fd                      movq    %r15, %rbp
    // 10: 4c 89 ff                      movq    %r15, %rdi
    // 13: 48 83 e7 fe                   andq    $-0x2, %rdi
    // 17: 48 89 de                      movq    %rbx, %rsi
    // 1a: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00             callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01                   cmpq    $0x1, %rax
    // 28: 75 15                         jne     0x3f <_JIT_ENTRY+0x3f>
    // 3f: 49 89 c7                      movq    %rax, %r15
    // 42: 48 89 ef                      movq    %rbp, %rdi
    // 45: 48 89 de                      movq    %rbx, %rsi
    // 48: 48 83 c4 10                   addq    $0x10, %rsp
    // 4c: 5d                            popq    %rbp

_BINARY_OP_ADD_INT_r23.COLD:
    // 2a: 49 89 ef                      movq    %rbp, %r15
    // 2d: 48 89 df                      movq    %rbx, %rdi
    // 30: 48 8b 74 24 08                movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10                   addq    $0x10, %rsp
    // 39: 5d                            popq    %rbp
    // 3a: e9 00 00 00 00                jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4
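
The split above can be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: `Block`, its `hot` flag, the labels, and `split_hot_cold` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for a parsed basic block (not the real parser type)."""

    label: str
    instructions: list[str] = field(default_factory=list)
    hot: bool = False


def split_hot_cold(blocks: list[Block]) -> tuple[list[Block], list[Block]]:
    """Partition blocks into (hot, cold), keeping the original order within each."""
    hot = [block for block in blocks if block.hot]
    cold = [block for block in blocks if not block.hot]
    return hot, cold


# Simplified version of the listing above: the middle block is the cold one.
blocks = [
    Block("entry", ["cmpq $0x1, %rax", "jne .L3f"], hot=True),
    Block(".L2a", ["movq %rbp, %r15", "jmp _JIT_JUMP_TARGET"], hot=False),
    Block(".L3f", ["movq %rax, %r15", "popq %rbp"], hot=True),
]
hot, cold = split_hot_cold(blocks)
print([block.label for block in hot])
print([block.label for block in cold])
```

In this sketch `entry` originally fell through into `.L2a`, so once that block moves, the branch at the end of `entry` has to be retargeted. The split alone does not do that.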

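Running the existing jump inversion and zero-length jump removal passes then gives us the following, where the `jne` to the next instruction has become a `je` to the cold section: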

_BINARY_OP_ADD_INT_r23.HOT:
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 55                            pushq   %rbp
    // 1: 48 83 ec 10                   subq    $0x10, %rsp
    // 5: 48 89 74 24 08                movq    %rsi, 0x8(%rsp)
    // a: 48 89 fb                      movq    %rdi, %rbx
    // d: 4c 89 fd                      movq    %r15, %rbp
    // 10: 4c 89 ff                      movq    %r15, %rdi
    // 13: 48 83 e7 fe                   andq    $-0x2, %rdi
    // 17: 48 89 de                      movq    %rbx, %rsi
    // 1a: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00             callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01                   cmpq    $0x1, %rax
    // 28: 75 15                         je    _BINARY_OP_ADD_INT_r23.COLD
    // 3f: 49 89 c7                      movq    %rax, %r15
    // 42: 48 89 ef                      movq    %rbp, %rdi
    // 45: 48 89 de                      movq    %rbx, %rsi
    // 48: 48 83 c4 10                   addq    $0x10, %rsp
    // 4c: 5d                            popq    %rbp

_BINARY_OP_ADD_INT_r23.COLD:
    // 2a: 49 89 ef                      movq    %rbp, %r15
    // 2d: 48 89 df                      movq    %rbx, %rdi
    // 30: 48 8b 74 24 08                movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10                   addq    $0x10, %rsp
    // 39: 5d                            popq    %rbp
    // 3a: e9 00 00 00 00                jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4
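
A rough sketch of how those two passes could combine on the hot section. It assumes (this is not stated above) that the lost fall-through into the cold block has first been made explicit as a trailing `jmp COLD`; the `Block` type, labels, and function names are hypothetical, and only `je`/`jne` are handled.

```python
from dataclasses import dataclass, field

# Only the two condition codes used in the listing; a real table would be larger.
INVERSE = {"je": "jne", "jne": "je"}


@dataclass
class Block:
    """Hypothetical basic block: a label plus textual instructions."""

    label: str
    instructions: list[str] = field(default_factory=list)


def invert_jumps(blocks: list[Block]) -> None:
    """Rewrite `jcc A; jmp B` as `j!cc B; jmp A` when A is the next block."""
    for block, following in zip(blocks, blocks[1:]):
        if len(block.instructions) < 2:
            continue
        branch, jump = block.instructions[-2:]
        b_op, _, b_target = branch.partition(" ")
        j_op, _, j_target = jump.partition(" ")
        if b_op in INVERSE and j_op == "jmp" and b_target == following.label:
            block.instructions[-2:] = [
                f"{INVERSE[b_op]} {j_target}",
                f"jmp {b_target}",
            ]


def remove_zero_length_jumps(blocks: list[Block]) -> None:
    """Drop a trailing `jmp X` when X is the block that immediately follows."""
    for block, following in zip(blocks, blocks[1:]):
        if block.instructions and block.instructions[-1] == f"jmp {following.label}":
            block.instructions.pop()


hot = [
    Block("entry", ["cmpq $0x1, %rax", "jne L3f", "jmp COLD"]),
    Block("L3f", ["movq %rax, %r15", "popq %rbp"]),
]
invert_jumps(hot)
remove_zero_length_jumps(hot)
print(hot[0].instructions)
```

After both passes the hot path falls straight through into `L3f`, and the only branch left in `entry` is the `je` to the cold section.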

We then lay out each trace with the HOT sections placed back to back and leave all the COLD sections at the end. I think this is as good as it gets for machine code flow/layout unless we start writing things by hand.
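
One way to picture that layout is as an offset calculation over the trace. The sketch below is only illustrative: `layout_trace`, the `_HYPOTHETICAL_UOP` stencil, and the byte sizes are assumptions made up for the example, not measurements.

```python
def layout_trace(
    trace: list[str], sizes: dict[str, tuple[int, int]]
) -> tuple[list[int], list[int]]:
    """Return (hot_offsets, cold_offsets), one entry per position in the trace.

    `sizes` maps a stencil name to (hot_size, cold_size) in bytes.
    """
    hot_offsets: list[int] = []
    cold_offsets: list[int] = []
    cursor = 0
    # First pass: every hot section, in trace order, back to back.
    for name in trace:
        hot_offsets.append(cursor)
        cursor += sizes[name][0]
    # Second pass: every cold section, appended after all the hot code.
    for name in trace:
        cold_offsets.append(cursor)
        cursor += sizes[name][1]
    return hot_offsets, cold_offsets


# Illustrative sizes only.
sizes = {
    "_BINARY_OP_ADD_INT_r23": (56, 21),
    "_HYPOTHETICAL_UOP": (10, 0),
}
trace = ["_BINARY_OP_ADD_INT_r23", "_HYPOTHETICAL_UOP", "_BINARY_OP_ADD_INT_r23"]
hot_offsets, cold_offsets = layout_trace(trace, sizes)
print(hot_offsets, cold_offsets)
```

With this arrangement the common path through the trace never has to jump over cold code.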

This builds on #142228.

In the future, to reduce the jitted memory even further, we can de-duplicate common cold stencil fragments. E.g. if we see multiple _BINARY_OP_ADD_INT_r23 in a trace, they can all jump to a single shared _BINARY_OP_ADD_INT_r23.COLD instead of each stencil having its own copy. That should be a separate PR, however.
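
A possible shape for that de-duplication, sketched under the assumption that cold fragments are shared only when their final bytes are identical (fragments containing per-instance patched values might not qualify). `deduplicate_cold` and the sample bytes are hypothetical.

```python
def deduplicate_cold(fragments: list[bytes], start: int) -> tuple[list[int], int]:
    """Assign one offset per distinct fragment; identical fragments share it.

    Returns the offset for each input fragment and the end offset of the
    cold region.
    """
    seen: dict[bytes, int] = {}
    offsets: list[int] = []
    cursor = start
    for fragment in fragments:
        if fragment not in seen:
            # First time this fragment appears: emit it at the cursor.
            seen[fragment] = cursor
            cursor += len(fragment)
        offsets.append(seen[fragment])
    return offsets, cursor


# Made-up fragments: the first and third are byte-identical.
fragments = [b"\x49\x89\xef", b"\x90", b"\x49\x89\xef"]
offsets, end = deduplicate_cold(fragments, start=100)
print(offsets, end)
```

Here the third fragment reuses the first one's offset, so the cold region takes 4 bytes instead of 7.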

I will work on this.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

gh-143158: Hot cold code splitting for JIT compiler #149292