If fusing a certain pair would significantly improve performance of most code, you'd just add that fused instruction to your bytecode and let the C compiler optimize the combined code in the interpreter. I have to assume CPython has already done that for all the low-hanging fruit.
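To make "add that fused instruction to your bytecode" concrete, here's a minimal sketch of a toy stack interpreter. The opcode names (`LOAD_CONST_ADD` and friends) and the `fuse` pass are made up for illustration; they are not CPython's actual opcodes or its real specialization machinery. The toy has no jumps, so the fusing pass doesn't have to worry about jump targets landing between the two fused instructions.

```python
# Toy stack interpreter. All opcode names here are hypothetical,
# chosen for illustration only; they are not CPython's.
LOAD_CONST, ADD, LOAD_CONST_ADD, RETURN = range(4)


def run(code, consts, start=0):
    """Interpret `code`; return (result, number of dispatches)."""
    stack = [start]
    pc = 0
    dispatches = 0
    while True:
        op, arg = code[pc]
        pc += 1
        dispatches += 1
        if op == LOAD_CONST:
            stack.append(consts[arg])
        elif op == ADD:
            rhs = stack.pop()
            stack[-1] += rhs
        elif op == LOAD_CONST_ADD:
            # Fused handler: no push/pop round trip, and one dispatch
            # instead of two.
            stack[-1] += consts[arg]
        elif op == RETURN:
            return stack.pop(), dispatches
        else:
            raise ValueError(f"unknown opcode {op}")


def fuse(code):
    """Rewrite each LOAD_CONST immediately followed by ADD into one
    LOAD_CONST_ADD. Safe only because this toy bytecode has no jumps."""
    out = []
    i = 0
    while i < len(code):
        if (i + 1 < len(code)
                and code[i][0] == LOAD_CONST
                and code[i + 1][0] == ADD):
            out.append((LOAD_CONST_ADD, code[i][1]))
            i += 2
        else:
            out.append(code[i])
            i += 1
    return out


consts = [2, 3]
plain = [(LOAD_CONST, 0), (ADD, 0), (LOAD_CONST, 1), (ADD, 0), (RETURN, 0)]
fused = fuse(plain)
print(run(plain, consts, start=10))
print(run(fused, consts, start=10))
```

Both runs compute the same result; the fused version just goes through the dispatch loop fewer times. In a C interpreter the fused handler is also a single function body the C compiler can optimize as a whole, which is the part a copy-and-patch JIT can't do on its own.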

In fact, for such a fused instruction to be optimized that way on a copy-and-patch JIT, it'd need to exist as a new bytecode in the interpreter. A JIT that fuses instructions is no longer a copy-and-patch JIT.

A copy-and-patch JIT reduces interpretation overhead by making sure the branches in the executed machine code are the branches in the code to be interpreted, not branches in the interpreter.
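A rough analogy of that, as a sketch: a real copy-and-patch JIT copies precompiled machine-code stencils and patches operands and addresses into their holes. Here I'm using Python source text as a stand-in for machine code, and the `STENCILS` table and `copy_and_patch` function are hypothetical names of my own. The point is only that the stitched output is straight-line code with no dispatch loop left in it.

```python
# Stand-in for copy-and-patch: one fixed "stencil" per opcode, with a
# hole ({arg}) that gets patched with the operand. Real implementations
# patch machine code; Python source text is used here for illustration.
STENCILS = {
    "LOAD_CONST": "    stack.append({arg!r})\n",
    "ADD": "    rhs = stack.pop()\n    stack[-1] += rhs\n",
    "RETURN": "    return stack.pop()\n",
}


def copy_and_patch(code):
    """Copy each opcode's stencil, patch its hole, and stitch the
    results into one function. Returns (function, generated source)."""
    body = "".join(STENCILS[op].format(arg=arg) for op, arg in code)
    source = "def compiled(start):\n    stack = [start]\n" + body
    namespace = {}
    exec(source, namespace)
    return namespace["compiled"], source


program = [
    ("LOAD_CONST", 2), ("ADD", None),
    ("LOAD_CONST", 3), ("ADD", None),
    ("RETURN", None),
]
compiled, source = copy_and_patch(program)
print(source)
print(compiled(10))
```

Note that each stencil is pasted verbatim: the `append` followed immediately by `pop` is still there in the output, because nothing looks across stencil boundaries. Getting rid of that pair needs either a fused stencil that already exists, or a compiler that isn't copy-and-patch anymore.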

This makes a huge difference in more naive interpreters, but not so much in a heavily optimized threaded-code interpreter.

The 10% is great, and nothing to sneeze at for a first commit. But I'd actually like some realistic analysis of next steps for improvement, because I'm skeptical that instruction fusing and the other things being hand-waved are it. Certainly not on a copy-and-patch JIT.

For context: I spent significant effort trying to add such instruction fusing to a simple WASM AOT compiler and got nowhere (the equivalent of constant loading was precisely one of the pairs). Only moving to a much smarter JIT (capable of looking at whole basic blocks of instructions) started making a difference.