
The fun fact being that older CPUs decode ENDBR64 as a slightly weird NOP (with no architectural effects), but it'll fault on original Pentiums: https://stackoverflow.com/questions/56120231/how-do-old-cpus...
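For anyone who wants to see why older decoders treat it as a NOP, here is a rough sketch of the byte layout. The encodings are the ones I remember from Intel's CET documentation (ENDBR64 = F3 0F 1E FA, ENDBR32 = F3 0F 1E FB); `split_modrm` and `describe` are made-up helper names for illustration, not part of any real disassembler.

```python
# ENDBR encodings: an F3 prefix on opcode 0F 1E, which sits in the
# 0F 18..0F 1F range that newer pre-CET CPUs treat as hint NOPs.
ENDBR64 = bytes([0xF3, 0x0F, 0x1E, 0xFA])
ENDBR32 = bytes([0xF3, 0x0F, 0x1E, 0xFB])


def split_modrm(b):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return (b >> 6) & 0b11, (b >> 3) & 0b111, b & 0b111


def describe(insn):
    """Hypothetical helper: classify a 4-byte F3 0F 1E xx sequence."""
    if len(insn) != 4 or insn[:3] != ENDBR64[:3]:
        return "other"
    if insn == ENDBR64:
        return "endbr64"
    if insn == ENDBR32:
        return "endbr32"
    return "other F3 0F 1E form"


# The last byte is an ordinary register-form ModRM byte (mod = 3).
print(describe(ENDBR64), split_modrm(ENDBR64[3]))
print(describe(ENDBR32), split_modrm(ENDBR32[3]))
```

A CPU that predates the multi-byte NOP opcodes has no definition for 0F 1E at all, which fits the linked answer's point about original Pentiums faulting.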


There's a good question in the comments there that I still don't see the answer to. How does this work if there's an interrupt between the branch and the endbranch? Does the OS need to save/restore the "branchness" bit?


Yes, on Arm the branch type is saved in the BTYPE field of SPSR_EL1. SPSR_EL1 is the Saved Program Status Register for Exception Level 1 (kernel mode), and BTYPE stands for Branch Type. https://developer.arm.com/documentation/ddi0595/2021-12/AArc...


there is no branchness bit, if there's an endbranch you can jump to it


Ah so when you return from an interrupt, the check is no longer done?


I'd assume so, since it wouldn't be a call/jmp coming from a computed address in a register. That said, I haven't read the documentation for any of this. But an interrupt involves a stack pointer change and other things that a normal return doesn't, which is why handlers use the IRET instruction and not the RET one.


Various architectures do other interesting things with NOPs. IIRC, one convention on PowerPC had something vaguely related to debugging or tracing (I can't remember the details or find any references right now).


Not just architectures, but different OSes and ABIs have found ways to repurpose no-ops. One example[1] is Windows using the 2-byte "MOV EDI, EDI" as a hot-patch point: it gets replaced by a "JMP $-5" instruction, which jumps to 5 bytes before the start of the function, into a spot reserved for patching. Those 5 bytes are enough to contain a full jump instruction that can then jump wherever you need it to.

[1]: "Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?" https://devblogs.microsoft.com/oldnewthing/20110921-00/?p=95...
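To make the two-step patch concrete, here is a small simulation on a byte buffer rather than live code. It assumes 32-bit x86 encodings (8B FF for MOV EDI, EDI, EB F9 for the short backward jump, E9 rel32 for the long jump); `hot_patch`, `base`, `entry` and `target` are names invented for this sketch, and a real patcher would also have to deal with page protections and other threads.

```python
import struct

MOV_EDI_EDI = bytes([0x8B, 0xFF])
# EB F9: rel8 = -7, measured from the end of this 2-byte instruction,
# so it lands 5 bytes before the function entry ("JMP $-5").
JMP_SHORT_BACK = bytes([0xEB, 0xF9])


def hot_patch(image, base, entry, target):
    """Simulate hot-patching inside a bytearray.

    image  -- bytearray holding the code (hypothetical in-memory copy)
    base   -- address that image[0] corresponds to
    entry  -- address of the function's first instruction
    target -- address the function should be redirected to
    """
    off = entry - base
    if off < 5:
        raise ValueError("no room for the 5-byte patch area")
    if bytes(image[off:off + 2]) != MOV_EDI_EDI:
        raise ValueError("function does not start with MOV EDI, EDI")
    # The long jump occupies [entry - 5, entry); its rel32 is relative
    # to the address after it, which is exactly the entry address.
    rel32 = target - entry
    image[off - 5:off] = b"\xE9" + struct.pack("<i", rel32)
    # Only after the long jump is in place, swap in the short jump.
    image[off:off + 2] = JMP_SHORT_BACK


code = bytearray(b"\x90" * 5 + MOV_EDI_EDI + b"\x55\x8B\xEC")
hot_patch(code, base=0x1000, entry=0x1005, target=0x2000)
print(code.hex(" "))
```

Writing the 5-byte jump first means the function stays callable at every intermediate step; the final switch is a single 2-byte overwrite of an instruction that did nothing anyway.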


Interesting, thanks for pointing this out! Just yesterday I was gazing at some program containing two consecutive xor rax, rax. I thought what’s the point? But as you point out it might be a NOP sled designed to be that specific length.


That would be surprising. xor is often used like that to set a register to 0, which is far from a nop. I'm not sure why it would do it twice, but it might be as simple as the compiler being stupid.


The second one is effectively a nop though.

The fact that it’s xor rax, rax rather than xor eax, eax is also interesting as it’s one byte longer for exactly the same effect (modifying the bottom 32 bits of a register clears the upper 32 bits). It makes me think there’s something weird going on other than compiler stupidity. I’d be interested in seeing the code it was compiled from.
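A quick sketch of both points, the size difference and the identical result. The byte sequences are the common encodings (31 C0 and 48 31 C0, where 48 is the REX.W prefix); the register model is a toy written for this comment, not an emulator.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

XOR_EAX_EAX = bytes([0x31, 0xC0])        # 2 bytes
XOR_RAX_RAX = bytes([0x48, 0x31, 0xC0])  # REX.W prefix adds 1 byte


def write_r32(value32):
    """Model the x86-64 rule: a 32-bit register write zero-extends,
    so the new 64-bit register value is just the 32-bit result."""
    return value32 & MASK32


def xor_eax_eax(rax):
    eax = rax & MASK32
    return write_r32(eax ^ eax)


def xor_rax_rax(rax):
    return (rax ^ rax) & MASK64


rax = 0xDEADBEEF_CAFEBABE
print(len(XOR_EAX_EAX), len(XOR_RAX_RAX))
print(xor_eax_eax(rax) == xor_rax_rax(rax) == 0)
```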


I wonder if this is still true. Whenever I go to hook Win32 API functions, I use an off-the-shelf length disassembler to create a trampoline with the first n bytes of instructions and a jmp back, and then just patch in a jmp to my hook, but if this hot-patch point exists it'd be a lot less painful since you can avoid basically all of that.

Though, I guess even if it was, it'd be silly to rely on it even on x86 only. Maybe it would still make for a nice fast-path? Dunno.


Good read. Thank you.

This just worsens my fear of changing "unnecessary" code when I don't know the original motivation for it.


Intel VTune will do this with 5-byte NOPs directly. I think LLVM's XRay tracing suite also did this with a much bigger NOP, to capture more information.


RISC-V has a whole HINT space that's basically just morphs of load immediate into zero register.

AArch64 has a similar space: https://developer.arm.com/documentation/ddi0596/2020-12/Base...

And yes, PowerPC has a similar space as well holding hints like 'give priority to the other hardware threads on this core' and the like. https://utcc.utoronto.ca/~cks/space/blog/tech/PowerPCInstruc...
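To illustrate the RISC-V case: the canonical NOP is `addi x0, x0, 0`, and as I read the base ISA spec, other ADDI forms that write x0 fall into the HINT space. Below is a sketch of the I-type encoding; `encode_addi` and `is_addi_hint` are illustrative names, and only the ADDI slice of the HINT space is covered.

```python
OPCODE_OP_IMM = 0b0010011  # major opcode shared by ADDI and friends
FUNCT3_ADDI = 0b000


def encode_addi(rd, rs1, imm):
    """Encode `addi rd, rs1, imm` as a 32-bit RISC-V I-type word."""
    if not (0 <= rd < 32 and 0 <= rs1 < 32):
        raise ValueError("register index out of range")
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate does not fit in 12 bits")
    return (((imm & 0xFFF) << 20) | (rs1 << 15) |
            (FUNCT3_ADDI << 12) | (rd << 7) | OPCODE_OP_IMM)


def is_addi_hint(rd, rs1, imm):
    """ADDI writing x0 is a HINT unless it is the canonical NOP."""
    return rd == 0 and not (rs1 == 0 and imm == 0)


print(hex(encode_addi(0, 0, 0)))  # canonical NOP
print(hex(encode_addi(0, 0, 1)))  # architecturally still does nothing
print(is_addi_hint(0, 0, 0), is_addi_hint(0, 0, 1))
```

Since x0 is hardwired to zero, none of these change architectural state, which is what makes the space safe to hand out for hints.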


I was wondering where I had read about PowerPC, and this is exactly the article! So, it was for thread priority. Strikes me as an odd design choice; this probably should've been something managed more explicitly by the OS.


I think the idea of exposing it to user space is to better handle concurrency before trapping into the kernel.

So consider the case of a standard mutex in the contended case. Normally the code will spin for a little bit before informing the kernel scheduler on the off chance that the thread that owns the lock is currently scheduled on another hardware thread. In that case it's in the best interest of the thread trying to grab the lock to shift most of the intracore priority to any other hardware threads so that it can potentially help the other hardware thread holding the lock get to a point where it gives up the lock quicker.
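The spin-then-block shape looks roughly like this. It is only a structural sketch: Python cannot issue a priority-hint instruction, so `cpu_relax` is an empty placeholder standing in for it, and `SpinThenBlockLock` and its `spins` parameter are names made up here.

```python
import threading


def cpu_relax():
    """Placeholder for a hardware hint (e.g. a priority-lowering NOP).
    Python has no way to emit one, so this intentionally does nothing."""


class SpinThenBlockLock:
    """Try the lock a few times in user space before blocking."""

    def __init__(self, spins=100):
        self._lock = threading.Lock()
        self._spins = spins

    def acquire(self):
        # Fast path: the owner may be running on a sibling hardware
        # thread and about to release, so poll briefly first.
        for _ in range(self._spins):
            if self._lock.acquire(blocking=False):
                return "spun"
            cpu_relax()  # this is where the hint would go
        # Slow path: give up and let the scheduler park this thread.
        self._lock.acquire()
        return "blocked"

    def release(self):
        self._lock.release()


lock = SpinThenBlockLock(spins=3)
print(lock.acquire())  # uncontended, so the fast path succeeds
lock.release()
```

On real hardware the hint inside the spin loop is what hands pipeline resources to the sibling thread that may be holding the lock.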


https://www.ibm.com/docs/en/aix/7.3?topic=h-hpmstat-command:

“random_samp_ele_crit=name

Specifies the random criteria for selecting the instructions for sampling. Valid values for this option are as follows:

ALL_INSTR

All instructions are eligible. This value is the default setting.

LOAD_STORE

The operation is routed to the Load Store Unit (LSU); for example, load, store.

PROB_NOP

Sample only special no-operation instructions, which are called Probe NOP events.

[…]”


Some MIPS cores had a superscalar NOP that would stall every ALU by one cycle, which was necessary because they lacked synchronization instructions.


That’s a really clever use of the opcode space. Thanks for passing that along.


NOP on intels is in fact xchg eax, eax
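The encoding makes this visible: XCHG EAX, r32 is the one-byte opcode 90+r, where r is the register number, and EAX is register 0. The helper below is a made-up name for illustration. One caveat as far as I know: in 64-bit mode the bare 0x90 byte is treated as a true NOP rather than a 32-bit exchange, so it does not clear the upper half of RAX.

```python
# Register numbers used by the 90+r short form of XCHG EAX, r32.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_xchg_eax(reg):
    """Encode `xchg eax, reg` using the one-byte 90+r form."""
    return bytes([0x90 + REG32.index(reg)])


print(encode_xchg_eax("eax").hex())  # same byte as NOP
print(encode_xchg_eax("ecx").hex())
```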



