Hot-cold splitting for JIT stencils
Feature or enhancement
Proposal:
We have a textual assembly parser for the stencils, and it already knows which blocks are hot and which are cold. Given that, it should not be too hard to teach it to split blocks into separate hot and cold sections.
Currently, this is the stencil for _BINARY_OP_ADD_INT:
// _BINARY_OP_ADD_INT_r23.o: file format elf64-x86-64
//
// Disassembly of section .text:
//
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 48 83 ec 10 subq $0x10, %rsp
// 5: 48 89 74 24 08 movq %rsi, 0x8(%rsp)
// a: 48 89 fb movq %rdi, %rbx
// d: 4c 89 fd movq %r15, %rbp
// 10: 4c 89 ff movq %r15, %rdi
// 13: 48 83 e7 fe andq $-0x2, %rdi
// 17: 48 89 de movq %rbx, %rsi
// 1a: 48 83 e6 fe andq $-0x2, %rsi
// 1e: ff 15 00 00 00 00 callq *(%rip) # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020: R_X86_64_GOTPCRELX _PyCompactLong_Add-0x4
// 24: 48 83 f8 01 cmpq $0x1, %rax
// 28: 75 15 jne 0x3f <_JIT_ENTRY+0x3f>
// 2a: 49 89 ef movq %rbp, %r15
// 2d: 48 89 df movq %rbx, %rdi
// 30: 48 8b 74 24 08 movq 0x8(%rsp), %rsi
// 35: 48 83 c4 10 addq $0x10, %rsp
// 39: 5d popq %rbp
// 3a: e9 00 00 00 00 jmp 0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b: R_X86_64_PLT32 _JIT_JUMP_TARGET-0x4
// 3f: 49 89 c7 movq %rax, %r15
// 42: 48 89 ef movq %rbp, %rdi
// 45: 48 89 de movq %rbx, %rsi
// 48: 48 83 c4 10 addq $0x10, %rsp
// 4c: 5d popq %rbp
With hot-cold splitting, it will be split into:
_BINARY_OP_ADD_INT_r23.HOT:
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 48 83 ec 10 subq $0x10, %rsp
// 5: 48 89 74 24 08 movq %rsi, 0x8(%rsp)
// a: 48 89 fb movq %rdi, %rbx
// d: 4c 89 fd movq %r15, %rbp
// 10: 4c 89 ff movq %r15, %rdi
// 13: 48 83 e7 fe andq $-0x2, %rdi
// 17: 48 89 de movq %rbx, %rsi
// 1a: 48 83 e6 fe andq $-0x2, %rsi
// 1e: ff 15 00 00 00 00 callq *(%rip) # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020: R_X86_64_GOTPCRELX _PyCompactLong_Add-0x4
// 24: 48 83 f8 01 cmpq $0x1, %rax
// 28: 75 15 jne 0x3f <_JIT_ENTRY+0x3f>
// 3f: 49 89 c7 movq %rax, %r15
// 42: 48 89 ef movq %rbp, %rdi
// 45: 48 89 de movq %rbx, %rsi
// 48: 48 83 c4 10 addq $0x10, %rsp
// 4c: 5d popq %rbp
_BINARY_OP_ADD_INT_r23.COLD:
// 2a: 49 89 ef movq %rbp, %r15
// 2d: 48 89 df movq %rbx, %rdi
// 30: 48 8b 74 24 08 movq 0x8(%rsp), %rsi
// 35: 48 83 c4 10 addq $0x10, %rsp
// 39: 5d popq %rbp
// 3a: e9 00 00 00 00 jmp 0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b: R_X86_64_PLT32 _JIT_JUMP_TARGET-0x4
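The split itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the parser's actual code: the Block class, its hot flag, the split_hot_cold helper, and the .L_deopt / .L_continue labels are all hypothetical names, and the sketch assumes the parser has already marked each block as hot or cold.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for one basic block produced by the assembly parser."""

    label: str
    instructions: list[str] = field(default_factory=list)
    hot: bool = True  # assumed to be filled in by the parser's existing analysis


def split_hot_cold(blocks: list[Block]) -> tuple[list[Block], list[Block]]:
    """Partition blocks into (hot, cold), preserving source order within each group."""
    hot = [block for block in blocks if block.hot]
    cold = [block for block in blocks if not block.hot]
    return hot, cold


# Simplified shape of the example above: entry, a cold exit, and the hot continuation.
blocks = [
    Block("_JIT_ENTRY", ["cmpq $0x1, %rax", "jne .L_continue"]),
    Block(".L_deopt", ["jmp _JIT_JUMP_TARGET"], hot=False),
    Block(".L_continue", ["movq %rax, %r15"]),
]
hot, cold = split_hot_cold(blocks)
```

After the split, the hot continuation directly follows the entry block, which is what makes the later jump inversion step worthwhile.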
Running the existing jump inversion and zero-length jump removal passes then gives us:
_BINARY_OP_ADD_INT_r23.HOT:
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 48 83 ec 10 subq $0x10, %rsp
// 5: 48 89 74 24 08 movq %rsi, 0x8(%rsp)
// a: 48 89 fb movq %rdi, %rbx
// d: 4c 89 fd movq %r15, %rbp
// 10: 4c 89 ff movq %r15, %rdi
// 13: 48 83 e7 fe andq $-0x2, %rdi
// 17: 48 89 de movq %rbx, %rsi
// 1a: 48 83 e6 fe andq $-0x2, %rsi
// 1e: ff 15 00 00 00 00 callq *(%rip) # 0x24 <_JIT_ENTRY+0x24>
// 0000000000000020: R_X86_64_GOTPCRELX _PyCompactLong_Add-0x4
// 24: 48 83 f8 01 cmpq $0x1, %rax
// 28: 75 15 je _BINARY_OP_ADD_INT_r23.COLD
// 3f: 49 89 c7 movq %rax, %r15
// 42: 48 89 ef movq %rbp, %rdi
// 45: 48 89 de movq %rbx, %rsi
// 48: 48 83 c4 10 addq $0x10, %rsp
// 4c: 5d popq %rbp
_BINARY_OP_ADD_INT_r23.COLD:
// 2a: 49 89 ef movq %rbp, %r15
// 2d: 48 89 df movq %rbx, %rdi
// 30: 48 8b 74 24 08 movq 0x8(%rsp), %rsi
// 35: 48 83 c4 10 addq $0x10, %rsp
// 39: 5d popq %rbp
// 3a: e9 00 00 00 00 jmp 0x3f <_JIT_ENTRY+0x3f>
// 000000000000003b: R_X86_64_PLT32 _JIT_JUMP_TARGET-0x4
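For readers unfamiliar with the two passes, here is a rough Python sketch of the idea. It is not the existing implementation: the INVERSE table, invert_branch, drop_zero_length_jump, and the .L_* labels are hypothetical, and instructions are modelled as plain strings for brevity.

```python
# Pairs of x86 conditional jumps that test opposite conditions.
INVERSE = {
    "je": "jne", "jne": "je",
    "jl": "jge", "jge": "jl",
    "jg": "jle", "jle": "jg",
}


def invert_branch(op: str, target: str, fallthrough: str, next_label: str) -> tuple[str, str, str]:
    """Invert a conditional branch whose target is the block laid out right after it.

    Returns the new (op, target, fallthrough). After inversion the branch goes to
    the old fallthrough block, and the old target is reached by falling through.
    """
    if target == next_label and op in INVERSE:
        return INVERSE[op], fallthrough, target
    return op, target, fallthrough


def drop_zero_length_jump(instructions: list[str], next_label: str) -> list[str]:
    """Remove a trailing unconditional jump to the block that immediately follows."""
    if instructions and instructions[-1] == f"jmp {next_label}":
        return instructions[:-1]
    return instructions


# "jne .L_continue" with .L_continue laid out next becomes "je .L_deopt".
inverted = invert_branch("jne", ".L_continue", ".L_deopt", ".L_continue")
trimmed = drop_zero_length_jump(["popq %rbp", "jmp .L_next"], ".L_next")
```

In the listing above, this is what turns the jne to the hot continuation into a je to the COLD section.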
We then lay out each trace with the HOT sections placed contiguously and all the COLD sections left at the end. I think this is as good as it gets for machine code flow/layout unless we start writing things by hand.
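A minimal sketch of that layout, assuming each stencil has already been split into hot and cold byte strings. The Stencil class and layout_trace function are hypothetical names used only for illustration, and relocation patching is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    """Hypothetical container for one stencil's already-split machine code."""

    name: str
    hot: bytes
    cold: bytes = b""


def layout_trace(trace: list[Stencil]) -> tuple[bytes, list[int], list[int]]:
    """Emit every hot section in trace order, then every cold section.

    Returns the combined code plus the start offset of each stencil's hot and
    cold section, which a real implementation would need when patching jumps.
    """
    code = bytearray()
    hot_offsets = []
    for stencil in trace:
        hot_offsets.append(len(code))
        code += stencil.hot
    cold_offsets = []
    for stencil in trace:
        cold_offsets.append(len(code))
        code += stencil.cold
    return bytes(code), hot_offsets, cold_offsets


trace = [Stencil("A", b"\x01\x02", b"\xaa"), Stencil("B", b"\x03", b"\xbb\xbc")]
code, hot_offsets, cold_offsets = layout_trace(trace)
```

With this layout the common path runs straight through the hot bytes, and the cold bytes are only touched when a guard fails.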
This builds on #142228.
In the future, to reduce JIT memory usage even further, we can de-duplicate common cold stencil fragments. E.g. if we see multiple _BINARY_OP_ADD_INT_r23 in a trace, they can all jump to a common _BINARY_OP_ADD_INT_r23.COLD instead of having one copy for each stencil. That should be a separate PR from this one, however.
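One possible shape for that de-duplication, sketched in Python. The dedupe_cold helper is hypothetical, and the sketch makes a strong assumption: two cold fragments are only shared when their bytes are identical after patching, so fragments whose jump targets differ would still get separate copies.

```python
def dedupe_cold(cold_fragments: list[bytes]) -> tuple[bytes, list[int]]:
    """Emit each distinct cold fragment once.

    Returns the combined cold area and, for each input fragment, the offset of
    the shared copy that its hot section should jump to.
    """
    code = bytearray()
    seen: dict[bytes, int] = {}
    offsets = []
    for fragment in cold_fragments:
        if fragment not in seen:
            seen[fragment] = len(code)
            code += fragment
        offsets.append(seen[fragment])
    return bytes(code), offsets


# The first and third fragments are identical, so they share one copy.
cold_area, offsets = dedupe_cold([b"\xaa\xab", b"\xcc", b"\xaa\xab"])
```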
I will work on this.
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response