It's beautiful! Several overdue improvements to keep x86 competitive with ARM. I love competition.
Intel's new PUSH2/POP2 are similar to ARM's LDP/STP. I think these are extremely underrated instructions. Loads and stores are quite expensive, but these processors already support 128-bit loads and stores for vector instructions, so moving two 64-bit registers in one access isn't asking for anything new.
Zen 3 and the Apple M1 can both do 3 loads per cycle, but with LDP, the M1 can load 2x the scalar registers per cycle – kinda crazy. It's a shame compilers aren't better at using these instructions, and that the Intel paired load/store is restricted to stack push/pop.
Conditional loads and stores are the biggest surprise to me so far. But they kind of make sense – you already have predicated loads and stores happening on the vector side, so it's nice to see that as an option in scalar code too.
Should also allow for a conditional trap by NULL-pointer-deref (or by writing to RIP+0 if you have W^X and want to save a byte?)
Adding that to my ARM wish-list.
Conditional compare is great – hopefully x86 people will improve compiler support – it's a surprisingly tricky problem with only one flags register. And ARM doesn't have conditional test, so that's exciting!
64-bit absolute jump... Nice to have, but why? You can't use it in position independent code. You can't use it as a call-out-of-JIT. Is it just a jump-out-of-JIT? Wasn't "mov rax, target ; jmp rax" fine? Maybe for dynamic linkers, JITs, or dynamic instrumentation?
@jasonmolenda @pervognsen Nice! Yeah, that came to mind, but "mov + jmp" is 12 bytes, and I'm pretty sure JMPABS is 11 bytes? Saving a byte is nice, but I can't imagine it'll make a huge difference?
(It's given as "REX2,NO66,NO67,NOREP MAP0 W0 A1 target64" – REX2 would be the new 2-byte prefix. I guess "MAP0 W0" are the bits in the REX2 payload? "A1" would be a 1-byte opcode, then 8-bytes for "target64".)
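To sanity-check the byte counts, here's a rough sketch that just assembles both sequences by hand. The "mov rax, imm64 ; jmp rax" bytes are the standard encoding; the JMPABS bytes are my reading of the spec line above, assuming the REX2 prefix byte is 0xD5 and that "MAP0 W0" gives an all-zero payload byte. The function names are made up for illustration.

```python
import struct

def mov_jmp_rax(target: int) -> bytes:
    # mov rax, imm64: REX.W (0x48) + opcode B8 + 8-byte immediate = 10 bytes
    # jmp rax: FF E0 = 2 bytes
    return b"\x48\xb8" + struct.pack("<Q", target) + b"\xff\xe0"

def jmpabs(target: int) -> bytes:
    # Assumption: REX2 prefix byte 0xD5, payload byte 0x00 (MAP0, W0),
    # then the 1-byte opcode A1 and the 8-byte absolute target.
    return b"\xd5\x00\xa1" + struct.pack("<Q", target)

target = 0x00007FFF12345678
print(len(mov_jmp_rax(target)), len(jmpabs(target)))  # 12 11
```

JMPABS also leaves RAX untouched, which might matter more than the byte.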
@pervognsen @jasonmolenda I don't really – @comex's substitute comes to mind, but is unmaintained (and was maybe only intended to target function prologues, rather than arbitrary locations?) I haven't kept up with what people are using instead.
(IIRC it just refuses to inject if the hook would overlap a possible jump target (as determined by scanning forwards), then rewrites the replaced instructions to fix any RIP-relative operands.)
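(Roughly this kind of check, as I remember it. This is a sketch and not substitute's actual code: the function name is hypothetical, and it assumes the forward scan has already produced a collection of possible jump target addresses.)

```python
def hook_overlaps_jump_target(hook_start, hook_len, jump_targets):
    """Return True if a known jump target lands inside the bytes the hook overwrites."""
    # Assumption: a target equal to hook_start is fine, since jumping there
    # simply executes the hook. Only targets strictly inside the overwritten
    # bytes would land in the middle of the injected jump.
    hook_end = hook_start + hook_len
    return any(hook_start < t < hook_end for t in jump_targets)

# An 11-byte hook at 0x1000 overwrites 0x1000..0x100a.
print(hook_overlaps_jump_target(0x1000, 11, {0x1004, 0x2000}))  # True: refuse
print(hook_overlaps_jump_target(0x1000, 11, {0x1000, 0x100b}))  # False: ok
```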
@pervognsen @dougall @jasonmolenda My library didn't step; it checked which instruction in the patched region the thread was at, and moved the PC to the corresponding instruction in the relocated code.
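In sketch form, the PC fix-up looks something like this. It's an illustration of the idea, not the library's code: `remap_pc` and the offset table are hypothetical, and it assumes you recorded where each original instruction ended up when relocating.

```python
def remap_pc(pc, patch_start, patch_len, reloc_start, offset_map):
    """Move a thread's PC from the patched region to the relocated copy.

    offset_map maps each original instruction's offset within the patched
    region to the offset of its replacement in the relocated code.
    """
    if not (patch_start <= pc < patch_start + patch_len):
        return pc  # thread is elsewhere, nothing to fix
    off = pc - patch_start
    if off not in offset_map:
        # A stopped thread's PC should always sit on an instruction boundary.
        raise ValueError(f"PC {pc:#x} is not at an instruction boundary")
    return reloc_start + offset_map[off]

# Original instructions at offsets 0, 5, 8; the second one grew when it was
# rewritten to use absolute addresses, so the third moved from 8 to 18.
offsets = {0: 0, 5: 5, 8: 18}
print(hex(remap_pc(0x1008, 0x1000, 12, 0x9000, offsets)))  # 0x9012
```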
For the relocated code, PC-relative instructions were converted to sequences of instructions that did the same thing using absolute addresses. This is for x86: https://github.com/comex/substitute/blob/95f2beda374625dd503bfb51a758b6f6ced57887/lib/x86/arch-transform-dis.inc.h#L24 On x86 it had to use the stack, unlike ARM (either 32- or 64-bit).
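The address arithmetic behind that is simple enough to sketch. This is only the math, not the linked transform code, and the helper names are made up: a RIP-relative operand is relative to the address of the next instruction, so once the instruction moves you either recompute the displacement or, if it no longer fits in 32 bits, fall back to an absolute-address sequence.

```python
def rip_relative_target(insn_addr, insn_len, disp32):
    # RIP-relative operands are relative to the address of the *next* instruction.
    return (insn_addr + insn_len + disp32) & 0xFFFFFFFFFFFFFFFF

def rebased_disp32(target, new_addr, insn_len):
    """Return the disp32 that reaches target from new_addr, or None if out of range."""
    disp = target - (new_addr + insn_len)
    return disp if -2**31 <= disp < 2**31 else None

# A 7-byte instruction at 0x1000 with disp32 = 0x100 refers to 0x1107.
target = rip_relative_target(0x1000, 7, 0x100)
print(hex(target))                                 # 0x1107
print(rebased_disp32(target, 0x2000, 7))           # -3840: still encodable
print(rebased_disp32(target, 0x7000_0000_0000, 7)) # None: needs an absolute address
```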
@dougall @pervognsen @jasonmolenda Aside from being unmaintained, my library also only worked on macOS.
The thing about injecting into the middle of a function is that you must have already looked at the function in a disassembler, or else you wouldn't know where to inject or which registers or stack locations have the data you want. So you could just manually ensure the injection point doesn't overlap a jump target. Though it would be neat if a library let you just inject anywhere.