It's beautiful! Several overdue improvements to keep x86 competitive with ARM. I love competition.
Intel's new PUSH2/POP2 are similar to ARM's LDP/STP. I think these are extremely underrated instructions. Loads and stores are quite expensive, but these processors already support 128-bit loads and stores for vector instructions.
Zen 3 and the Apple M1 can both do 3 loads per cycle, but with LDP, the M1 can load 2x the scalar registers per cycle – kinda crazy. It's a shame compilers aren't better at using these instructions, and that the Intel paired load/store is restricted to stack push/pop.
Conditional loads and stores are the biggest surprise to me so far. But they kind of make sense – you already have predicated loads and stores happening on the vector side, so it's nice to see that as an option in scalar code too.
Should also allow for a conditional trap by NULL-pointer-deref (or by writing to RIP+0 if you have W^X and want to save a byte?)
Adding that to my ARM wish-list.
Conditional compare is great – hopefully x86 people will improve compiler support – it's a surprisingly tricky problem with only one flags register. And ARM doesn't have conditional test, so that's exciting!
64-bit absolute jump... Nice to have, but why? You can't use it in position independent code. You can't use it as a call-out-of-JIT. Is it just a jump-out-of-JIT? Wasn't "mov rax, target ; jmp rax" fine? Maybe for dynamic linkers, JITs, or dynamic instrumentation?
Flag suppression is great for doing more complex things with flags. It's not like they had a choice, but it's awkward to have to use longer encodings for it. Flag setting forms will remain the default, so x86 hardware can't take advantage of that in the way that ARM can (e.g. only half the integer ports can read/write flags on M1).
I'm also a fan of ARM's CBZ, which acts as a flag-preserving branch-if-zero. Technically x86 has LOOP, but they'd have to make that fast for it to be an alternative.
@jasonmolenda @pervognsen Nice! Yeah, that came to mind, but "mov + jmp" is 12 bytes, and I'm pretty sure JMPABS is 11 bytes? Saving a byte is nice, but I can't imagine it'll make a huge difference?
(It's given as "REX2,NO66,NO67,NOREP MAP0 W0 A1 target64" – REX2 would be the new 2-byte prefix. I guess "MAP0 W0" are the bits in the REX2 payload? "A1" would be a 1-byte opcode, then 8-bytes for "target64".)
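To make the byte counts concrete, here's a rough sketch that builds both sequences. The "mov rax, imm64 ; jmp rax" bytes are the standard encoding. The JMPABS bytes are my reading of the encoding string above and are an assumption: a REX2 prefix byte of 0xD5, a zero payload byte for "MAP0 W0", then the 0xA1 opcode and the 8-byte target. The function names are just for illustration.

```python
import struct

def mov_jmp_rax(target):
    # mov rax, imm64: REX.W prefix (0x48), opcode 0xB8 (+rax), 8-byte immediate
    # jmp rax: opcode 0xFF with ModRM byte 0xE0 (/4, register rax)
    return b"\x48\xb8" + struct.pack("<Q", target) + b"\xff\xe0"

def jmpabs(target):
    # ASSUMPTION: REX2 prefix byte 0xD5, payload 0x00 (MAP0, W0),
    # 1-byte opcode 0xA1, then the 8-byte absolute target.
    return b"\xd5\x00\xa1" + struct.pack("<Q", target)

target = 0x1122334455667788
print(len(mov_jmp_rax(target)), len(jmpabs(target)))
```

So it's one byte shorter, and it also doesn't clobber a scratch register, which might matter more than the byte.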
@pervognsen @jasonmolenda I don't really – @comex's substitute comes to mind, but is unmaintained (and was maybe only intended to target function prologues, rather than arbitrary locations?) I haven't kept up with what people are using instead.
(IIRC it just refuses to inject if the hook would overlap a possible jump target (as determined by scanning forwards), then rewrites the replaced instructions to fix any RIP-relative operands.)
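The overlap check described above might look roughly like this. It's a sketch from memory of the idea, not the library's actual code: the function name is hypothetical, and jump_targets is assumed to be a set of addresses already collected by scanning forwards. I'm also assuming that a jump landing exactly on the first hooked byte is fine, since that's where the hook itself starts.

```python
def hook_overlaps_jump_target(hook_start, hook_len, jump_targets):
    # A target strictly inside the overwritten bytes would land in the
    # middle of the injected jump, so the hook has to be refused.
    # ASSUMPTION: a target equal to hook_start is allowed.
    hook_end = hook_start + hook_len
    return any(hook_start < t < hook_end for t in jump_targets)

# A 5-byte hook at 0x1000 with a branch target at 0x1003 is refused.
print(hook_overlaps_jump_target(0x1000, 5, {0x1003, 0x2000}))
```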
@pervognsen @dougall @jasonmolenda My library didn't step; it checked which instruction in the patched region the thread was at, and moved the PC to the corresponding instruction in the relocated code.
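Roughly, the PC fix-up amounts to a table lookup. This is only an illustrative sketch, not the code from the library: remap_pc, insn_map and trampoline_base are hypothetical names, and insn_map is assumed to hold one (offset in the patched region, offset in the relocated code) pair per original instruction.

```python
def remap_pc(pc, patch_start, insn_map, trampoline_base):
    # insn_map: list of (old_offset, new_offset) pairs, one per instruction
    # that was moved out of the patched region (hypothetical layout).
    for old_off, new_off in insn_map:
        if pc == patch_start + old_off:
            return trampoline_base + new_off
    # The thread is not stopped at any patched instruction: leave it alone.
    return pc

# Three original instructions at offsets 0, 2, 5; the second one grew
# when it was rewritten, so the third sits later in the relocated code.
insn_map = [(0, 0), (2, 2), (5, 14)]
print(hex(remap_pc(0x1005, 0x1000, insn_map, 0x7000)))
```

The offsets differ between the two columns because rewritten instructions can be longer than the originals.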
For the relocated code, PC-relative instructions were converted to sequences of instructions that did the same thing using absolute addresses. This is for x86: https://github.com/comex/substitute/blob/95f2beda374625dd503bfb51a758b6f6ced57887/lib/x86/arch-transform-dis.inc.h#L24 On x86 it had to use the stack, unlike ARM (either 32 or 64).
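The first step of that conversion is working out which absolute address a RIP-relative operand refers to. A minimal sketch of just that arithmetic (not the linked code; the function name is made up): on x86-64 the displacement is relative to the address of the next instruction.

```python
def rip_relative_target(insn_addr, insn_len, disp32):
    # RIP-relative operands are measured from the end of the instruction,
    # i.e. the address of the next one. disp32 is a signed 32-bit value.
    return (insn_addr + insn_len + disp32) & 0xFFFFFFFFFFFFFFFF

# A 7-byte "mov rax, [rip+0x10]" at 0x1000 reads from 0x1017.
print(hex(rip_relative_target(0x1000, 7, 0x10)))
```

Once the instruction moves, that address has to be materialised some other way, which is where the extra instructions come from.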
@dougall @pervognsen @jasonmolenda Aside from being unmaintained, my library also only worked on macOS.
The thing about injecting into the middle of a function is that you must have already looked at the function in a disassembler, or else you wouldn't know where to inject or which registers or stack locations have the data you want. So you could just manually ensure the injection point doesn't overlap a jump target. Though it would be neat if a library let you just inject anywhere.
@dougall *puts on tinfoil hat* CCMP lets you turn whatever condition you want into OF set (just do a CCMP on !cond of reg with itself and set OF to 1 on the "not set" path) and then you can just use INTO!
Ok sure, *technically* not allowed in x86-64 ever, but still!!11