It's beautiful! Several overdue improvements to keep x86 competitive with ARM. I love competition.

https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

Introducing Intel® Advanced Performance Extensions (Intel® APX)


Intel's new PUSH2/POP2 are similar to ARM's LDP/STP. I think these are extremely underrated instructions. Loads and stores are quite expensive, but these processors already support 128-bit loads and stores for vector instructions.

Zen 3 and the Apple M1 can both do 3 loads per cycle, but with LDP, the M1 can load 2x the scalar registers per cycle – kinda crazy. It's a shame compilers aren't better at using these instructions, and that the Intel paired load/store is restricted to stack push/pop.

The "Balanced PUSH/POP Hint" is a little odd, but mirrors an optimisation used by the M1 that detects matching pushes and pops with the register numbers typically used by compilers in function prologues and epilogues, and performs fast forwarding. There's a store-to-load-forwarding-like penalty in cases where the instructions are incorrectly paired, but don't actually alias, so I guess this hint could avoid that.

Conditional loads and stores are the biggest surprise to me so far. But they kind of make sense – you already have predicated loads and stores happening on the vector side, so it's nice to see that as an option in scalar code too.

Should also allow for a conditional trap by NULL-pointer-deref (or by writing to RIP+0 if you have W^X and want to save a byte?)

Adding that to my ARM wish-list.

Conditional compare is great – hopefully x86 people will improve compiler support – it's a surprisingly tricky problem with only one flags register. And ARM doesn't have conditional test, so that's exciting!
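For anyone who hasn't used conditional compare: a rough model of how it lets `a == x && b == y` chain through a single flags register without a branch. This is only a sketch that tracks the zero flag. The helper names (`cmp_zf`, `ccmp_zf`, `both_equal`) are made up for illustration, and the real instructions write a full set of flags.

```python
def cmp_zf(a: int, b: int) -> bool:
    """Model of a plain compare: ZF is set when the operands are equal."""
    return a == b


def ccmp_zf(cond: bool, a: int, b: int, default_zf: bool) -> bool:
    """Model of a conditional compare (ZF only).

    If the incoming condition holds, do the compare and return its ZF.
    Otherwise skip the compare and return the flag value baked into the
    instruction.
    """
    return (a == b) if cond else default_zf


def both_equal(a: int, x: int, b: int, y: int) -> bool:
    zf = cmp_zf(a, x)              # cmp  a, x
    zf = ccmp_zf(zf, b, y, False)  # ccmp b, y if "equal", else force ZF = 0
    return zf                      # consumed by a single je / sete


print(both_equal(1, 1, 2, 2), both_equal(1, 1, 2, 3), both_equal(0, 1, 2, 2))
# prints: True False False
```

The tricky part for compilers is picking the default flag value and the condition on each link so the final flags mean the right thing for arbitrary mixes of `&&` and `||`.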

64-bit absolute jump... Nice to have, but why? You can't use it in position independent code. You can't use it as a call-out-of-JIT. Is it just a jump-out-of-JIT? Wasn't "mov rax, target ; jmp rax" fine? Maybe for dynamic linkers, JITs, or dynamic instrumentation?

@dougall Yeah, curious about the absolute jump. They definitely wouldn't include it for no reason but I can't think of a strong motivator either.
@pervognsen @dougall I can't imagine this was a consideration, but one thing we've looked at in lldb is hot-patching instructions in a binary to jump to code we've jitted into the process. We need to replace instruction(s) with our jump, and the more instructions we have to replace, the trickier this gets. (the one time we've tried this technique was for conditional breakpoints, evaluating the condition in the process, putting a breakpoint in the evals-true case.)

@jasonmolenda @pervognsen Nice! Yeah, that came to mind, but "mov + jmp" is 12 bytes, and I'm pretty sure JMPABS is 11 bytes? Saving a byte is nice, but I can't imagine it'll make a huge difference?

(It's given as "REX2,NO66,NO67,NOREP MAP0 W0 A1 target64" – REX2 would be the new 2-byte prefix. I guess "MAP0 W0" are the bits in the REX2 payload? "A1" would be a 1-byte opcode, then 8 bytes for "target64".)
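To sanity-check the byte counts, here's a small sketch that assembles both sequences by hand. The `mov rax, imm64 ; jmp rax` bytes are the standard encodings. The JMPABS layout is an assumption based on the reading above: a REX2 prefix byte of 0xD5 with an all-zero payload (MAP0, W0), then opcode A1, then the 64-bit target. The target address is arbitrary.

```python
import struct


def mov_rax_jmp_rax(target: int) -> bytes:
    """mov rax, imm64 ; jmp rax"""
    mov = b"\x48\xb8" + struct.pack("<Q", target)  # REX.W + B8 + imm64: 10 bytes
    jmp = b"\xff\xe0"                              # FF /4, ModRM E0 = jmp rax: 2 bytes
    return mov + jmp


def jmpabs(target: int) -> bytes:
    """Assumed JMPABS layout: REX2 prefix, opcode A1, absolute 64-bit target."""
    rex2 = b"\xd5\x00"  # assumption: 0xD5 prefix byte + zero payload (MAP0, W0)
    return rex2 + b"\xa1" + struct.pack("<Q", target)


target = 0x00007FFF12345678  # arbitrary example address
old = mov_rax_jmp_rax(target)
new = jmpabs(target)
print(len(old), len(new))
# prints: 12 11
```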

@dougall @jasonmolenda @pervognsen Apart from saving a byte, it also allows jumping without dirtying a register, but I still fail to imagine the intended use.
@amonakov @jasonmolenda @pervognsen True, yeah... Maybe it is for debugger hooks? Having to subtract from RSP to skip the red zone, then push to save the value from the temporary register would make it more like 17 bytes vs 11 bytes.
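One way to arrive at those numbers, as a hedged sketch: the exact hook sequence isn't spelled out, so the instruction choice here is an assumption. It uses `add rsp, -128` to step over the red zone (this clobbers flags; the flag-preserving `lea rsp, [rsp-128]` is one byte longer), then `push rax`, then the 12-byte `mov rax, imm64 ; jmp rax`. The JMPABS bytes assume a 0xD5 REX2 prefix with a zero payload.

```python
import struct

target = 0x00007FFF12345678  # arbitrary example address
imm64 = struct.pack("<Q", target)

skip_red_zone = b"\x48\x83\xc4\x80"        # add rsp, -128 (REX.W 83 /0 ib): 4 bytes
save_rax = b"\x50"                         # push rax: 1 byte
mov_jmp = b"\x48\xb8" + imm64 + b"\xff\xe0"  # mov rax, imm64 ; jmp rax: 12 bytes

hook_without_jmpabs = skip_red_zone + save_rax + mov_jmp
hook_with_jmpabs = b"\xd5\x00\xa1" + imm64  # assumed JMPABS encoding: 11 bytes

print(len(hook_without_jmpabs), len(hook_with_jmpabs))
# prints: 17 11
```

The code jumped to would still have to pop the saved register and restore RSP, so the saving in the patched-over bytes isn't the whole story.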
@dougall @jasonmolenda @pervognsen oh don't get me started on debugger hooks. I once discovered, during a seminar, that if you put a breakpoint on an AVX instruction, GDB will segfault your program when single-stepping from that instruction (because it tries to "relocate" the instruction, using a 1999-vintage opcode map to check if the instruction has a ModRM byte). I don't think the debugger is supposed to modify the debuggee like that, silently.
More details in my report: https://sourceware.org/bugzilla/show_bug.cgi?id=28999
28999 – amd64_get_insn_details wrong for some AVX instructions

@dougall @jasonmolenda @pervognsen I added a link to the bug report in my toot, which reminded me: the poor soul who discovered it before me hit it in a situation where the incorrectly-relocated instruction accessed data at a wrong address without immediately segfaulting.
Four breakpoints (via debug registers) should be enough for everyone. Just simply ask the user to opt into intrusive breakpointing if they need more than that.
@amonakov @dougall @pervognsen It's been 16 years since I worked on gdb but what???? A software breakpoint replaces the first byte with 0xcc/int3 on Intel, that traps to gdb, the user resumes execution, gdb puts the original instruction byte back, instruction-steps the process, then puts the breakpoint back. They're moving the instruction to execute it? Never heard of a software breakpoint algorithm that would do that, unless repeated icache flushing was a problem? Maybe a failure of my imagination!
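The classic in-place scheme described here can be sketched as a toy model. This is not gdb's or lldb's actual code: `ToyDebugger` and its methods are hypothetical names, the "process memory" is just a bytearray, and single-stepping is a callback.

```python
INT3 = 0xCC  # one-byte x86 breakpoint opcode


class ToyDebugger:
    def __init__(self, code: bytearray):
        self.code = code   # stand-in for the debuggee's code bytes
        self.saved = {}    # breakpoint address -> original first byte

    def insert_breakpoint(self, addr: int) -> None:
        self.saved[addr] = self.code[addr]
        self.code[addr] = INT3  # overwrite the first byte with int3

    def step_over_breakpoint(self, addr: int, single_step) -> None:
        self.code[addr] = self.saved[addr]  # put the original byte back
        single_step(addr)                   # execute exactly one instruction
        self.code[addr] = INT3              # re-arm the breakpoint


# nop ; mov rbp, rsp ; ret
code = bytearray(b"\x90\x48\x89\xe5\xc3")
dbg = ToyDebugger(code)
dbg.insert_breakpoint(1)

seen = []  # records what the "CPU" sees at the breakpoint address while stepping
dbg.step_over_breakpoint(1, lambda addr: seen.append(code[addr]))
print(hex(seen[0]), hex(code[1]))
# prints: 0x48 0xcc
```

Between restoring the byte and re-arming it, the breakpoint is simply not there, so any other thread still running could pass through that address unnoticed.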
@jasonmolenda @dougall @pervognsen What you described leads to potentially-missed breakpoints in multithreaded debuggees (unless you stop all the threads, which can be many), so they added this "displaced stepping" algorithm and made it the default 🙁
@amonakov @dougall @pervognsen oh yes! this non-stop debugging I've heard about, that was added after I stopped working on gdb. OK that makes sense. lldb stops all threads and instruction steps the thread that has a breakpoint when you resume, then resumes all threads.
@jasonmolenda @dougall @pervognsen Non-stop indeed! We have bugs in yo debugger so you can debug while you debug. A relentless! Debugging! Experience!