It's beautiful! Several overdue improvements to keep x86 competitive with ARM. I love competition.

https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

Introducing Intel® Advanced Performance Extensions (Intel® APX)

These extensions expand the x86 instruction set with access to registers and features that improve performance.

Intel's new PUSH2/POP2 are similar to ARM's LDP/STP. I think these are extremely underrated instructions. Loads and stores are quite expensive, but these processors already support 128-bit loads and stores for vector instructions, so moving two 64-bit registers in a single access fits the existing datapath.

Zen 3 and the Apple M1 can both do 3 loads per cycle, but with LDP, the M1 can load twice as many scalar registers per cycle (six instead of three) – kinda crazy. It's a shame compilers aren't better at using these instructions, and that the Intel paired load/store is restricted to stack push/pop.

The "Balanced PUSH/POP Hint" is a little odd, but mirrors an optimisation used by the M1 that detects matching pushes and pops with the register numbers typically used by compilers in function prologues and epilogues, and performs fast forwarding. There's a store-to-load-forwarding-like penalty in cases where the instructions are incorrectly paired, but don't actually alias, so I guess this hint could avoid that.

Conditional loads and stores are the biggest surprise to me so far. But they kind of make sense – you already have predicated loads and stores happening on the vector side, so it's nice to see that as an option in scalar code too.

This should also allow for a conditional trap via a NULL-pointer dereference (or by writing to RIP+0 if you have W^X and want to save a byte?)

Adding that to my ARM wish-list.
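A rough sketch in C of the kind of code a conditional load helps with. The function name and signature are mine, purely for illustration. The point is that the legacy `CMOVcc reg, mem` form always performs the memory read, so it can fault even when the condition is false. That is why compilers usually fall back to a branch for this pattern unless they can prove the pointer is safe to dereference.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read *p only when cond is true.
 *
 * The load must not happen when cond is false (p may be NULL or unmapped),
 * so a plain CMOV with a memory source is not a legal translation:
 * it would read memory regardless of the condition.
 * A conditionally-faulting load could, in principle, make this branchless. */
int64_t load_or_default(const int64_t *p, int cond, int64_t fallback) {
    return cond ? *p : fallback;
}
```

Calling `load_or_default(NULL, 0, 7)` has to return 7 without touching memory, which is exactly the case where fault suppression matters.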

Conditional compare is great – hopefully x86 people will improve compiler support – it's a surprisingly tricky problem with only one flags register. And ARM doesn't have conditional test, so that's exciting!
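For anyone who hasn't seen conditional compare in action, here's a minimal C example of the pattern it targets. The function name is made up, and the assembly in the comment is what I'd typically expect from an AArch64 compiler at -O2, not a guaranteed output.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical example: two comparisons chained with &&.
 *
 * Typical AArch64 code (an assumption about compiler output):
 *     cmp   x0, #0          // flags from a == 0
 *     ccmp  x1, #5, #0, eq  // if eq: flags from b == 5; else force "ne"
 *     cset  w0, eq          // result = final eq flag
 *
 * Without a conditional compare, this needs either a branch or two
 * setcc-style results combined with an AND. */
bool both_match(int64_t a, int64_t b) {
    return a == 0 && b == 5;
}
```

The whole chain threads through the single flags register, which is the part that makes compiler support tricky once the conditions get more complicated.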

64-bit absolute jump... Nice to have, but why? You can't use it in position independent code. You can't use it as a call-out-of-JIT. Is it just a jump-out-of-JIT? Wasn't "mov rax, target ; jmp rax" fine? Maybe for dynamic linkers, JITs, or dynamic instrumentation?

@dougall Yeah, curious about the absolute jump. They definitely wouldn't include it for no reason but I can't think of a strong motivator either.

@pervognsen @dougall I can't imagine this was a consideration, but one thing we've looked at in lldb is hot-patching instructions in a binary to jump to code we've jitted into the process. We need to replace instruction(s) with our jump, and the more instructions we have to replace, the trickier this gets. (The one time we've tried this technique was for conditional breakpoints: evaluating the condition in the process, and putting a breakpoint in the evals-true case.)

@jasonmolenda @pervognsen Nice! Yeah, that came to mind, but "mov + jmp" is 12 bytes, and I'm pretty sure JMPABS is 11 bytes? Saving a byte is nice, but I can't imagine it'll make a huge difference?

(It's given as "REX2,NO66,NO67,NOREP MAP0 W0 A1 target64" – REX2 would be the new 2-byte prefix. I guess "MAP0 W0" are the bits in the REX2 payload? "A1" would be a 1-byte opcode, then 8 bytes for "target64".)
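To make the byte counting concrete, here's a small C sketch that emits both sequences. The `mov rax, imm64 ; jmp rax` bytes are the standard encoding. The JMPABS bytes are my reading of the spec line above and should be treated as an assumption: 0xD5 as the REX2 prefix byte, an all-zero payload for "MAP0 W0", then the A1 opcode. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Write a 64-bit value little-endian, as x86 stores immediates. */
static size_t put_u64(uint8_t *out, uint64_t v) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(v >> (8 * i));
    return 8;
}

/* mov rax, target ; jmp rax  (clobbers rax) */
size_t encode_mov_jmp(uint8_t *out, uint64_t target) {
    size_t n = 0;
    out[n++] = 0x48;               /* REX.W */
    out[n++] = 0xB8;               /* MOV rax, imm64 */
    n += put_u64(out + n, target);
    out[n++] = 0xFF;               /* JMP r/m64 (opcode extension /4) */
    out[n++] = 0xE0;               /* ModRM: mod=11, reg=4, rm=rax */
    return n;
}

/* JMPABS target64, ASSUMED encoding: REX2 prefix + payload, A1, imm64. */
size_t encode_jmpabs(uint8_t *out, uint64_t target) {
    size_t n = 0;
    out[n++] = 0xD5;               /* REX2 prefix byte (assumption) */
    out[n++] = 0x00;               /* REX2 payload: MAP0, W0 (assumption) */
    out[n++] = 0xA1;               /* opcode from the spec line */
    n += put_u64(out + n, target);
    return n;
}
```

With a 16-byte buffer, `encode_mov_jmp` returns 12 and `encode_jmpabs` returns 11, so under these assumptions it's a one-byte saving, plus not needing a scratch register.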

@dougall @jasonmolenda Anything but single-byte code injection is such a nightmare on x86. Do you know if there's a smallish, self-contained library? There seem to be so many prerequisites to handle the general case; at a minimum you need the CFG for the enclosing function. It doesn't seem many steps short of a full recompiler, unfortunately, and those tend to be heavyweight.

@dougall @jasonmolenda And speaking of the general/worst case, what if there are absolute return addresses on thread stacks? Now you need to do the moral equivalent of what JITs call on-stack replacement to remap the old code addresses to new addresses and fix them up on all thread stacks/registers.