Interesting hack. This is roughly equivalent to writing a dynamic code generator (JIT) for generating tiny little trampolines, except it's architecture-independent. It achieves portability by patching up some existing code (for an unknown instruction set) rather than generating code from scratch (which would require you to understand the instruction set).
It would need some changes to be truly architecture-independent; for example it should use memcmp() instead of this idiom which can throw SIGBUS on some platforms due to the unaligned access:
if (*((intptr_t *)code) == look) {
I used a similar hack once when I wanted to create an ELF object file at runtime, which is necessary to use the GDB JIT interface (http://sourceware.org/gdb/onlinedocs/gdb/JIT-Interface.html). I didn't want to implement the whole ELF format myself, but I needed this ELF file to point to some JIT code that is located at an address that is only known at runtime. I achieved this by generating the ELF object file with gcc at build time, but used a similarly distinctive constant for the address so I could patch it up at runtime.
If you wanted to do this through straight-up codegen (instead of patching) this would be really easy with DynASM, and you could use my tutorial here as a guide: http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-s...
It isn't truly arch-independent as long as it assumes PARAMETER_CONSTANT and FUNCTION_CONSTANT will be stored as direct immediates in the generated code. On some archs, for instance, 0xFEEDBEEF might be too big a constant, and the compiler would then be forced to move it into a register (or a stack slot) in pieces.
Edit: and of course, you run the risk that on some archs, 0xFEEDBEEF is actually a valid encoding for some instruction. :)
For an example of a CPU that behaves differently, look at RISC CPUs. 32-bit PowerPC, for example, would translate an immediate long load into an immediate 'load short into high word and zero out low word' and a signed immediate addition (it would load $DEAE first, then add -$4111 to get $DEADBEEF, since the low half $BEEF is negative as a signed 16-bit immediate).
The list of problems is way longer, by the way. This code makes assumptions about pointer size (I don't think it will run on x64 with common ABIs).
There also is no guarantee that function pointers point to the memory where the function's code can be found (there could be a trampoline in-between, or a devious compiler writer could encrypt function addresses).
Neither is there a guarantee that functions defined sequentially get addresses that are laid out sequentially (there is no portable way to figure out the size of a function in bytes).
Finally, I don't think there is a guarantee that one can read a function's code (its memory could be marked 'execute only').
I guess those more familiar with the C standard will find more portability issues.
ARM (quite common these days) will sometimes do this as well, as the immediate value has to fit inside the fixed-width instruction, and the instruction itself is only 32 bits (some of which are needed to encode the operation). While it is slightly more likely you will see a PC-relative load, that requires an extra data fetch; so, instead, a "movt" (move top) instruction is used to set the upper 16 bits of a register after first setting the lower 16 bits. This requires just as much space as the load plus its separate data literal, and runs entirely out of the instruction cache.
Not certain if you are disagreeing, so to clarify: you can either do a load like that or two moves. The load requires one instruction word and one data word, so two words. The moves require two instruction words, so also two words. The load, however, requires a data fetch. This fetch has to come from somewhere nearby, as the load instruction cannot offset very far, and so is going to be in the same segment as the code (called "text", for historic reasons: this has nothing to do with strings, which will be stored in the "data" segment, or sometimes even a "strings" segment), and most of the time will be directly after (or, for long functions, inside) the code for that function. This is a trade-off: moves are fast (cycle counts of instructions are not all the same, so "single step" doesn't mean much), have better guarantees that they will always produce the same result (code changing has much heavier penalties than data changing), and avoid using up an asynchronous memory access that other parts of your code might be saturating (although I honestly am not certain if ARM does this like other architectures do). It thereby comes down to circumstance, configuration, and the whims of the people who wrote your compiler as to whether you will get mov+movt or a single ldr pc,.X+data.
"It thereby comes down to circumstance, configuration, and the whims of the people who wrote your compiler as to whether you will get mov+movt or a single ldr pc,.X+data."
I'm very interested in how dynamically generated code interacts with instruction and micro-op caching on modern processors. I'm working on intersecting lists of compressed integers really quickly, and JIT-like approaches seem to be the frontier for fast decompression schemes. It seems likely you've thought about this. Any suggestions?
> I'm very interested in how dynamically generated code interacts with instruction and micro-op caching on modern processors.
My basic mental model is that whenever you write to an executable page you pay a significant overhead. Basically all of the performance caveats that apply to self-modifying code would apply here; a good starting point might be http://en.wikipedia.org/wiki/Self-modifying_code
In my application the code generation is a one-time up-front cost that is easily amortized over the subsequent execution, so I haven't had a need to explore this question in more depth.
> JIT-like approaches seem to be the frontier for fast decompression schemes.
Interesting, I have been feeling lately like this might be an area where JIT-like approaches could yield a big benefit, but I haven't seen any actual work in this area. Do you have references to anyone doing work like that?