It's beautiful! Several overdue improvements to keep x86 competitive with ARM. I love competition.

https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

Introducing Intel® Advanced Performance Extensions (Intel® APX)

Intel's new PUSH2/POP2 are similar to ARM's LDP/STP. I think paired loads and stores like these are extremely underrated. Loads and stores are quite expensive, but these processors already support 128-bit loads and stores for vector instructions, so moving two 64-bit registers in a single access shouldn't need much extra hardware.

Zen 3 and the Apple M1 can both do 3 loads per cycle, but with LDP, the M1 can load 2x the scalar registers per cycle – kinda crazy. It's a shame compilers aren't better at using these instructions, and that the Intel paired load/store is restricted to stack push/pop.
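
For anyone who hasn't looked at where LDP/STP show up, here's a rough C sketch of the kind of code that benefits. The struct and function names are made up for illustration, and whether a given AArch64 compiler actually emits LDP/STP here depends on the compiler and its flags.

```c
#include <stdint.h>

/* Two adjacent 64-bit fields: 16 bytes in total, the same width as
   one 128-bit vector load. */
struct pair {
    uint64_t lo;
    uint64_t hi;
};

/* Reads both fields. On AArch64 the two loads are a candidate for a
   single LDP instead of two separate LDRs. */
uint64_t sum_pair(const struct pair *p) {
    return p->lo + p->hi;
}

/* Copies one pair: a candidate for one LDP followed by one STP. */
void copy_pair(struct pair *dst, const struct pair *src) {
    dst->lo = src->lo;
    dst->hi = src->hi;
}
```

With one load port issuing a paired load, both fields arrive in a single operation, which is where the "2x the scalar registers per cycle" figure comes from.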

The "Balanced PUSH/POP Hint" is a little odd, but mirrors an optimisation used by the M1 that detects matching pushes and pops with the register numbers typically used by compilers in function prologues and epilogues, and performs fast forwarding. There's a store-to-load-forwarding-like penalty in cases where the instructions are incorrectly paired, but don't actually alias, so I guess this hint could avoid that.

Conditional loads and stores are the biggest surprise to me so far. But they kind of make sense – you already have predicated loads and stores happening on the vector side, so it's nice to see that as an option in scalar code too.
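
Roughly what I mean, as a C model rather than real APX code: the memory access only happens when the condition holds, so a false condition can't fault. The function names and the `fallback` parameter are assumptions of this sketch; what the real instruction leaves in the destination register when the condition is false is defined by Intel's spec, not by this model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Conditional load: memory is read only when cond is true.
   `fallback` is a stand-in chosen for this sketch. */
uint64_t cond_load(bool cond, const uint64_t *src, uint64_t fallback) {
    return cond ? *src : fallback;
}

/* Conditional store: memory is written only when cond is true. */
void cond_store(bool cond, uint64_t *dst, uint64_t value) {
    if (cond) {
        *dst = value;
    }
}
```

So `cond_load(false, NULL, 9)` is fine and just returns the fallback, while `cond_load(true, NULL, 9)` would dereference NULL. The difference from writing this in plain C today is that the hardware version needs no branch.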

This should also allow for a conditional trap via a NULL-pointer dereference (or via a write to RIP+0, if you have W^X and want to save a byte?)

Adding that to my ARM wish-list.

@dougall *puts on tinfoil hat* CCMP lets you turn whatever condition you want into OF set (just do a CCMP on !cond of reg with itself and set OF to 1 on the "not set" path) and then you can just use INTO!

Ok sure, *technically* not allowed in x86-64 ever, but still!!11
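
For anyone trying to follow the flag juggling, here's a small C model of the trick. It only tracks OF, SF, ZF and CF, and all of the names (`cmp64`, `ccmp64`, `cond_to_of`) are made up for this sketch. It models the idea of a conditional compare with default flag values, not the exact encoding or every flag the real instruction touches.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four status flags this sketch tracks. */
struct flags {
    bool of, sf, zf, cf;
};

/* Flags produced by a 64-bit compare, i.e. by computing a - b. */
static struct flags cmp64(uint64_t a, uint64_t b) {
    uint64_t r = a - b;
    struct flags f;
    f.zf = (r == 0);
    f.sf = (r >> 63) != 0;
    f.cf = a < b; /* unsigned borrow */
    /* Signed overflow: a and b differ in sign, and so do a and r. */
    f.of = (((a ^ b) & (a ^ r)) >> 63) != 0;
    return f;
}

/* Conditional compare: if take is true, compare a with b;
   otherwise the flags come from the default value dfv. */
struct flags ccmp64(bool take, uint64_t a, uint64_t b, struct flags dfv) {
    return take ? cmp64(a, b) : dfv;
}

/* The trick: compare a register with itself under !cond, with OF=1
   in the default flags. reg - reg never overflows, so OF == cond. */
bool cond_to_of(bool cond, uint64_t reg) {
    struct flags dfv = { .of = true, .sf = false, .zf = false, .cf = false };
    return ccmp64(!cond, reg, reg, dfv).of;
}
```

When `cond` is true the compare is skipped and OF comes from the default flags (1); when `cond` is false the compare of the register with itself runs and OF is 0. An INTO afterwards would trap exactly when `cond` held.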