gh-104909: Split BINARY_OP into micro-ops by gvanrossum · Pull Request #104910 · python/cpython
At a higher level, I'm worried that this will be pretty cumbersome to carry out further unless we design a good solution for sharing locals/opargs/etc. across uops, or perhaps even redesign the instruction format so that the parts of the stream appear in the order they're used.
I don't have any key insights here, but I sense that we'll be sort of fighting with the current generator/instruction format otherwise.
Thanks for this observation -- one reason I started pulling on this particular thread now was to make sure that things like this would surface sooner rather than later.
I am beginning to think that one thing we're running into here is that the needs of the optimizer, the machine code generator, and the Tier-2 interpreter are all somewhat different. My own focus has been largely on the Tier-2 interpreter, since I'm more confident that I can build one. Assuming we keep it a stack-based VM using the same stack as the base interpreter (not a foregone conclusion yet, but it makes some things easier), the only way for uops to pass things to each other is via the stack. The generator has a strategy for this for the base interpreter where such items aren't really pushed onto the evaluation stack, but stored in local variables, with the hope that the C compiler does something semi-intelligent with those (see e.g. how _tmp_1 and _tmp_2 are used in the generated code for BINARY_OP_ADD_INT, or _tmp_1 in LOAD_LOCALS).
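To make the contrast concrete, here is a toy Python model of the two strategies. It is only a sketch: the uop names in the comments (GUARD_BOTH_INT, ADD_INT) and both function names are hypothetical, and the real generated code is C, not Python. The first function lets two uops communicate only through the evaluation stack; the second keeps the intermediate values in local temporaries, in the spirit of _tmp_1 and _tmp_2.

```python
# Toy model only: hypothetical uop names, not CPython's actual micro-ops.

def run_via_stack(stack):
    """Two separate uops that share data only through the evaluation stack."""
    # uop 1 (hypothetical GUARD_BOTH_INT): peek at both operands, check types.
    right, left = stack[-1], stack[-2]
    if not (type(left) is int and type(right) is int):
        raise TypeError("guard failed: operands are not both int")
    # uop 2 (hypothetical ADD_INT): re-read the operands from the stack,
    # because nothing else survives from the previous uop.
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)
    return stack


def run_via_temporaries(stack):
    """Fused form: the same two steps pass values through local variables."""
    # Pop once into temporaries that play the role of _tmp_1 and _tmp_2.
    tmp_1 = stack.pop()  # right operand
    tmp_2 = stack.pop()  # left operand
    if not (type(tmp_2) is int and type(tmp_1) is int):
        # Restore the stack before bailing out, so the caller sees it unchanged.
        stack.append(tmp_2)
        stack.append(tmp_1)
        raise TypeError("guard failed: operands are not both int")
    tmp_2 = tmp_2 + tmp_1  # the result reuses a temporary
    stack.append(tmp_2)
    return stack


print(run_via_stack([1, 2, 3]))
print(run_via_temporaries([1, 2, 3]))
```

Both calls leave the same stack behind; the difference is only in where the intermediate values live while the steps run.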
This strategy may also work for the optimizer -- I assume it'll be maintaining some kind of internal representation of what's on the stack anyway. But for the machine code generator we would need something different -- in fact a register-based VM might make more sense here.
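As a rough illustration of how an abstract model of the stack could bridge toward a register-style form, the sketch below walks a list of stack operations while tracking names instead of values, and emits three-address instructions. Everything here is an assumption made for illustration: the function name, the tuple encoding, and the small opcode set are hypothetical and are not what the optimizer or machine code generator actually does.

```python
# Toy translation from stack-style ops to register-style ops.
# The opcode set and tuple encoding are hypothetical, for illustration only.

def to_register_form(uops):
    stack = []      # abstract stack: holds register/variable names, not values
    out = []        # emitted three-address instructions
    next_reg = 0    # counter for fresh virtual registers
    for op, arg in uops:
        if op == "LOAD_FAST":
            # A local variable already has a name, so no instruction is needed.
            stack.append(arg)
        elif op == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            dst = f"r{next_reg}"
            next_reg += 1
            out.append(("ADD", dst, left, right))
            stack.append(dst)
        elif op == "STORE_FAST":
            out.append(("MOVE", arg, stack.pop()))
        else:
            raise ValueError(f"unsupported op: {op}")
    return out


program = [
    ("LOAD_FAST", "a"),
    ("LOAD_FAST", "b"),
    ("BINARY_ADD", None),
    ("STORE_FAST", "c"),
]
print(to_register_form(program))
```

The stack pushes and pops disappear in the output because the abstract stack has already resolved where each operand comes from.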
In any case, I'd love to hear from @markshannon about this, which probably means we'll have to wait until he and I are both back from our respective vacations.