Interesting hack. This is roughly equivalent to writing a dynamic code generator (JIT) for generating tiny little trampolines, except it's architecture-independent. It achieves portability by patching up some existing code (for an unknown instruction set) rather than generating code from scratch (which would require you to understand the instruction set).
It would need some changes to be truly architecture-independent; for example it should use memcmp() instead of this idiom which can throw SIGBUS on some platforms due to the unaligned access:
if (*((intptr_t *)code) == look) {
I used a similar hack once when I wanted to create an ELF object file at runtime, which is necessary to use the GDB JIT interface (http://sourceware.org/gdb/onlinedocs/gdb/JIT-Interface.html). I didn't want to implement the whole ELF format myself, but I needed this ELF file to point to some JIT code that is located at an address that is only known at runtime. I achieved this by generating the ELF object file with gcc at build time, but used a similarly distinctive constant for the address so I could patch it up at runtime.
If you wanted to do this through straight-up codegen (instead of patching) this would be really easy with DynASM, and you could use my tutorial here as a guide: http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-s...
It isn't truly arch-independent as long as it assumes PARAMETER_CONSTANT and FUNCTION_CONSTANT will be stored as direct immediates in the generated code. On some archs, for instance, 0xFEEDBEEF might be too big a constant, and the compiler would then be forced to move it into a register (or a stack slot) in pieces.
Edit: and of course, you run the risk that on some archs, 0xFEEDBEEF is actually a valid encoding for some instruction. :)
For an example of a CPU that behaves differently, look at RISC CPUs. 32-bit PowerPC, for example, would translate an immediate long load into an immediate 'load short into high word and zero out low word' and a signed immediate addition (it would load $DEAE first, then add -$4111 to get $DEADBEEF, since the low half $BEEF is negative as a signed 16-bit immediate).
The list of problems is way longer, by the way. This code makes assumptions about pointer size (I don't think it will run on x64 with common ABIs).
There also is no guarantee that function pointers point to the memory where the function's code can be found (there could be a trampoline in-between, or a devious compiler writer could encrypt function addresses).
Neither is there a guarantee that functions defined sequentially get addresses that are laid out sequentially (there is no portable way to figure out the size of a function in bytes).
Finally, I don't think there is a guarantee that one can read a function's code (its memory could be marked 'execute only').
I guess those more familiar with the C standard will find more portability issues.
ARM (quite common these days) will sometimes do this as well, as the immediate value has to fit inside the fixed-width instruction, and the instruction itself is only 32 bits (some of which are needed to encode the operation). While it is slightly more likely you will see a PC-relative load, that requires an extra data fetch; so, instead, a "movt" (move top) instruction is used to set the upper 16 bits of a register after first setting the lower 16 bits. This requires just as much space as the load plus its separate data literal, and runs entirely out of the instruction cache.
Not certain if you are disagreeing, so to clarify: you can either do a load like that or two moves. The load requires one instruction word and one data word, so two words. The moves require two instruction words, so also two words. The load, however, requires a data fetch. This fetch has to come from somewhere nearby, as the load instruction cannot offset very far, and so is going to be in the same segment as the code (called "text", for historic reasons: this has nothing to do with strings, which will be stored in the "data" segment, or sometimes even a "strings" segment), and most of the time will be directly after (or, for long functions, inside) the code for that function. This is a trade-off: moves are fast (cycle counts of instructions are not all the same, so "single step" doesn't mean much), have better guarantees that they will always produce the same result (code changing has much heavier penalties than data changing), and avoid using up an asynchronous memory access that other parts of your code might be saturating (although I honestly am not certain if ARM does this like other architectures do). It thereby comes down to circumstance, configuration, and the whims of the people who wrote your compiler as to whether you will get mov+movt or a single ldr pc,.X+data.
"It thereby comes down to circumstance, configuration, and the whims of the people who wrote your compiler as to whether you will get mov+movt or a single ldr pc,.X+data."
I'm very interested in how dynamically generated code interacts with instruction and micro-op caching on modern processors. I'm working on intersecting lists of compressed integers really quickly, and JIT-like approaches seem to be the frontier for fast decompression schemes. It seems likely you've thought about this. Any suggestions?
> I'm very interested in how dynamically generated code interacts with instruction and micro-op caching on modern processors.
My basic mental model is that whenever you write to an executable page you pay a significant overhead. Basically all of the performance caveats that apply to self-modifying code would apply here; a good starting point might be http://en.wikipedia.org/wiki/Self-modifying_code
In my application the code generation is a one-time up-front cost that is easily amortized over the subsequent execution, so I haven't had a need to explore this question in more depth.
> JIT-like approaches seem to be the frontier for fast decompression schemes.
Interesting, I have been feeling lately like this might be an area where JIT-like approaches could yield a big benefit, but I haven't seen any actual work in this area. Do you have references to anyone doing work like that?