It's beautiful! Several overdue improvements to keep x86 competitive with ARM. I love competition.

https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

Introducing Intel® Advanced Performance Extensions (Intel® APX)

Intel's new PUSH2/POP2 are similar to ARM's LDP/STP. I think these are extremely underrated instructions. Loads and stores are quite expensive, but these processors already support 128-bit loads and stores for vector instructions.

Zen 3 and the Apple M1 can both do 3 loads per cycle, but with LDP, the M1 can load 2x the scalar registers per cycle – kinda crazy. It's a shame compilers aren't better at using these instructions, and that the Intel paired load/store is restricted to stack push/pop.

The "Balanced PUSH/POP Hint" is a little odd, but it mirrors an optimisation used by the M1, which detects matching pushes and pops (using the register numbers compilers typically use in function prologues and epilogues) and performs fast forwarding between them. There's a store-to-load-forwarding-like penalty when the instructions are incorrectly paired but don't actually alias, so I guess this hint could avoid that.

Conditional loads and stores are the biggest surprise to me so far. But they kind of make sense – you already have predicated loads and stores happening on the vector side, so it's nice to see that as an option in scalar code too.

This should also allow a conditional trap via a NULL-pointer dereference (or by writing to RIP+0, if you have W^X and want to save a byte?)

Adding that to my ARM wish-list.

Conditional compare is great – hopefully x86 people will improve compiler support – it's a surprisingly tricky problem with only one flags register. And ARM doesn't have conditional test, so that's exciting!

64-bit absolute jump... Nice to have, but why? You can't use it in position independent code. You can't use it as a call-out-of-JIT. Is it just a jump-out-of-JIT? Wasn't "mov rax, target ; jmp rax" fine? Maybe for dynamic linkers, JITs, or dynamic instrumentation?

Flag suppression is great for doing more complex things with flags. It's not like they had a choice, but it's awkward to have to use longer encodings for it. Flag setting forms will remain the default, so x86 hardware can't take advantage of that in the way that ARM can (e.g. only half the integer ports can read/write flags on M1).

I'm also a fan of ARM's CBZ, which acts as a flag-preserving branch-if-zero. Technically x86 has LOOP, but they'd have to make that fast for it to be an alternative.

@dougall Yeah, curious about the absolute jump. They definitely wouldn't include it for no reason but I can't think of a strong motivator either.
@pervognsen Yeah, I can imagine using it instead of a GOT, with the linker patching the instructions to avoid using the BTB unnecessarily, but that's neither compatible with W^X memory protections (assuming lazy linking), nor is it good for ahead-of-time linking (like the dyld_shared_cache on macOS), unless you've given up on ASLR. I guess it also saves BTB for jumping out of JIT (or calling via a local call-to-jump), but it just doesn't quite add up.
@pervognsen @dougall I can't imagine this was a consideration, but one thing we've looked at in lldb is hot-patching instructions in a binary to jump to code we've jitted into the process. We need to replace instruction(s) with our jump, and the more instructions we have to replace, the trickier this gets. (the one time we've tried this technique was for conditional breakpoints, evaluating the condition in the process, putting a breakpoint in the evals-true case.)

@jasonmolenda @pervognsen Nice! Yeah, that came to mind, but "mov + jmp" is 12 bytes, and I'm pretty sure JMPABS is 11 bytes? Saving a byte is nice, but I can't imagine it'll make a huge difference?

(It's given as "REX2,NO66,NO67,NOREP MAP0 W0 A1 target64" – REX2 would be the new 2-byte prefix. I guess "MAP0 W0" are the bits in the REX2 payload? "A1" would be a 1-byte opcode, then 8-bytes for "target64".)
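To make the byte counts above concrete, here is a rough sketch that assembles both sequences by hand. The mov/jmp bytes are the standard x86-64 encodings. The JMPABS bytes assume the REX2 prefix is 0xD5 followed by an all-zero payload byte (MAP0, W0, no extended registers), which is a reading of the format string above and has not been checked against an assembler. The function names are made up for illustration.

```python
import struct

def mov_rax_jmp_rax(target: int) -> bytes:
    """Encode `mov rax, imm64 ; jmp rax` (standard x86-64 encodings)."""
    mov = b"\x48\xb8" + struct.pack("<Q", target)  # REX.W, B8+rax, imm64
    jmp = b"\xff\xe0"                              # FF /4, ModRM selects rax
    return mov + jmp

def jmpabs(target: int) -> bytes:
    """Encode JMPABS target64, ASSUMING REX2 is D5 plus a zero payload byte."""
    rex2 = b"\xd5\x00"   # assumption: prefix byte 0xD5, payload = MAP0, W0
    opcode = b"\xa1"     # the "A1" opcode from the format string
    return rex2 + opcode + struct.pack("<Q", target)

target = 0x00007FFF12345678  # arbitrary example address
print(len(mov_rax_jmp_rax(target)), len(jmpabs(target)))
```

Under those assumptions the two lengths come out as 12 and 11 bytes, matching the estimate above.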

@dougall @jasonmolenda Anything but single-byte code injection is such a nightmare on x86. Do you know if there's a smallish, self-contained library? There seem to be so many prerequisites to handle the general case; at a minimum you need the CFG for the enclosing function. It doesn't seem many steps short of a full recompiler, unfortunately, and those tend to be heavyweight.

@pervognsen @jasonmolenda I don't really – @comex's substitute comes to mind, but is unmaintained (and was maybe only intended to target function prologues, rather than arbitrary locations?) I haven't kept up with what people are using instead.

(IIRC it just refuses to inject if the hook would overlap a possible jump target (as determined by scanning forwards), then rewrites the replaced instructions to fix any RIP-relative operands.)
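A minimal sketch of the overlap check described above, assuming the set of possible jump targets has already been collected by some disassembly pass. The function and parameter names are hypothetical and are not taken from substitute itself.

```python
from typing import Iterable

def hook_is_safe(hook_start: int, hook_len: int,
                 jump_targets: Iterable[int]) -> bool:
    """Return False if any known jump target lands inside the hook.

    A target equal to hook_start is fine: control arrives at the first
    byte of the hook and simply runs it. A target strictly inside the
    overwritten bytes would land in the middle of the injected jump.
    """
    hook_end = hook_start + hook_len
    return not any(hook_start < t < hook_end for t in jump_targets)

# Example: a 12-byte hook at 0x1000 with a branch target at 0x1004
# is refused, while a target at 0x100c (just past the hook) is not.
print(hook_is_safe(0x1000, 12, {0x1004}))
print(hook_is_safe(0x1000, 12, {0x1000, 0x100C}))
```

This only covers targets the scan can see; indirect jumps into the region are not ruled out by a check like this.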

https://github.com/comex/substitute

@pervognsen @jasonmolenda @comex Oh, and I think it does a dance of stopping all other threads and then stepping any of them that happen to be currently executing in the hook region? (The end of a call operation would also be considered a possible jump target and injection would fail. But this still doesn't rule out some other code elsewhere in the binary jumping into the middle of your hook. And it's impractical to rule out indirect jumps into the middle of the hook.)
@dougall @jasonmolenda @comex Yeah, every time I've contemplated doing this in a debugger/instrumentation, I've come to the conclusion that the never-ending tail of fundamentally hard issues makes it untenable as a general solution (e.g. good luck "stepping to the end" for a while (!quit) { ... } loop). However, you can do some kind of best-effort local code modification and then fall back to int3 when you fail.
@pervognsen @dougall @comex Yeah doing this in arbitrary code becomes very tricky; it wasn't implemented beyond a prototype. I can't remember how he picked which instructions to execute in the jitted block (did he avoid anything with pc-relative encodings??) and how he avoided inserting this across a branch point. It is nice to not need to use a register. I didn't think this was the real motivator, but an external debugger is one place where an absolute address might be useful.
@pervognsen @dougall @comex all that being said, conditional breakpoints in a debugger, where the debugger has to halt the process every time, are nearly useless in a hot codepath so there's a real benefit for trying to find a way to do this in-process, even if it isn't possible in all cases and we have to fall back to stop & compare.
@jasonmolenda @dougall @comex Yup, absolutely. At some point mode switching (perhaps via an int3 handler) to a fully software-based machine code emulator becomes more attractive once certain int3 break/probe points become too hot. That has its own massive engineering challenges but gives a lot of flexibility and can have decent performance (certainly compared to int3 thrashing).
@jasonmolenda @pervognsen @dougall As a frequent user of conditional breakpoints and breakpoint commands, I’d love to see that. Even better would be to also support JITting a subset of breakpoint commands. (Ever used GDB's dprintf?)

@pervognsen @dougall @jasonmolenda My library didn't step; it checked which instruction in the patched region the thread was at, and moved the PC to the corresponding instruction in the relocated code.

For the relocated code, PC-relative instructions were converted to sequences of instructions that did the same thing using absolute addresses. This is for x86: https://github.com/comex/substitute/blob/95f2beda374625dd503bfb51a758b6f6ced57887/lib/x86/arch-transform-dis.inc.h#L24 On x86 it had to use the stack, unlike ARM (either 32 or 64).
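The PC fix-up described above can be sketched roughly as follows. This is an illustration of the idea only, not code from the library: the function name, its parameters, and the offset table are all hypothetical, and the table is assumed to come from whatever pass relocated the instructions.

```python
def remap_pc(pc, patch_start, patch_len, relocated_start, offset_map):
    """Return the PC a stopped thread should resume at after patching.

    offset_map maps each original instruction offset inside the patched
    region to that instruction's offset in the relocated copy.
    """
    if not (patch_start <= pc < patch_start + patch_len):
        return pc  # thread is outside the patched region: leave it alone
    rel = pc - patch_start
    if rel not in offset_map:
        raise ValueError("PC is not on an instruction boundary")
    return relocated_start + offset_map[rel]

# Three original instructions at offsets 0, 4 and 7. The relocated copy
# of the second one is longer (say, a PC-relative instruction expanded
# into an absolute-address sequence), so the third moves to offset 16.
offsets = {0: 0, 4: 4, 7: 16}
print(hex(remap_pc(0x1007, 0x1000, 11, 0x9000, offsets)))  # 0x9010
```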

@comex @pervognsen @jasonmolenda Oh, right – that makes more sense. (Thanks for publishing it! I really enjoyed reading it however many years ago – lots of great tricks and ideas.)

@dougall @pervognsen @jasonmolenda Aside from being unmaintained, my library also only worked on macOS.

The thing about injecting into the middle of a function is that you must have already looked at the function in a disassembler, or else you wouldn't know where to inject or which registers or stack locations have the data you want. So you could just manually ensure the injection point doesn't overlap a jump target. Though it would be neat if a library let you just inject anywhere.

@dougall @jasonmolenda And speaking of the general/worst case, what if there are absolute return addresses on thread stacks? Now you need to do the moral equivalent of what JITs call on-stack replacement to remap the old code addresses to new addresses and fix them up on all thread stacks/registers.
@pervognsen @dougall @jasonmolenda The general case generally needs to handle code that reads itself. Unfortunately this usually requires bundling an emulator.
@dougall @jasonmolenda @pervognsen Apart from saving a byte, it also allows jumping without dirtying a register, but I still fail to imagine the intended use.
@amonakov @jasonmolenda @pervognsen True, yeah... Maybe it is for debugger hooks? Having to subtract from RSP to skip the red zone, then push to save the value from the temporary register would make it more like 17 bytes vs 11 bytes.
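As a rough check of the 17-byte figure, here is one way the register-preserving hook could be encoded. It assumes `add rsp, -128` is used to skip the red zone (other choices such as `lea` are a byte longer) and that the jitted code at the target undoes the push and the stack adjustment. The function name is made up for illustration.

```python
import struct

def hook_preserving_rax(target: int) -> bytes:
    """Skip the red zone, save RAX, then jump to target through RAX."""
    skip_red_zone = b"\x48\x83\xc4\x80"              # add rsp, -128
    save_rax = b"\x50"                               # push rax
    load = b"\x48\xb8" + struct.pack("<Q", target)   # mov rax, imm64
    jump = b"\xff\xe0"                               # jmp rax
    return skip_red_zone + save_rax + load + jump

JMPABS_LEN = 2 + 1 + 8  # REX2 prefix, opcode byte, 64-bit target

print(len(hook_preserving_rax(0x00007FFF12345678)), JMPABS_LEN)
```

That gives 17 bytes against 11, so the saving at the injection site is six bytes, not one, when the hook must not clobber anything.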
@dougall @jasonmolenda @pervognsen oh don't get me started on debugger hooks. I once discovered, during a seminar, that if you put a breakpoint on an AVX instruction, GDB will segfault your program when single-stepping from that instruction (because it tries to "relocate" the instruction, using a 1999-vintage opcode map to check if the instruction has a ModRM byte). I don't think the debugger is supposed to modify the debuggee like that, silently.
More details in my report: https://sourceware.org/bugzilla/show_bug.cgi?id=28999
28999 – amd64_get_insn_details wrong for some AVX instructions

@dougall @jasonmolenda @pervognsen I added a link to the bug report in my toot, which reminded me: the poor soul who discovered it before me hit it in a situation where the incorrectly-relocated instruction accessed data at a wrong address without immediately segfaulting.
Six breakpoints (via debug registers) should be enough for everyone. Just simply ask the user to opt into intrusive breakpointing if they need more than that.
@amonakov @dougall @pervognsen It's been 16 years since I worked on gdb but what???? A software breakpoint replaces the first byte with 0xcc/int3 on intel, that traps to gdb, user resumes execution, it puts the original instruction byte back, instruction-steps the process, puts the breakpoint back. They're moving the instruction to execute it? Never heard of a software breakpoint algorithm that would do that, unless repeated icache flushing was a problem? Maybe a failure of my imagination!
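The classic step-over sequence described above can be modelled with a toy sketch like this one. It treats debuggee memory as a plain bytearray and takes the single-step operation as a callback, so it is a model of the bookkeeping only, not of any real debugger's API. All names are hypothetical.

```python
INT3 = 0xCC  # one-byte x86 breakpoint opcode

class ToyBreakpoints:
    """Tracks software breakpoints in a bytearray standing in for memory."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.saved = {}  # address -> original first byte

    def insert(self, addr: int) -> None:
        self.saved[addr] = self.memory[addr]
        self.memory[addr] = INT3

    def step_over(self, addr: int, single_step) -> None:
        """Resume past a breakpoint that has just been hit at addr."""
        self.memory[addr] = self.saved[addr]  # put the original byte back
        single_step()                         # instruction-step the thread
        self.memory[addr] = INT3              # put the breakpoint back

memory = bytearray(b"\x55\x48\x89\xe5")  # push rbp ; mov rbp, rsp
bps = ToyBreakpoints(memory)
bps.insert(0)
seen = []
bps.step_over(0, lambda: seen.append(bytes(memory)))
print(seen[0].hex(), memory.hex())
```

While `single_step` runs, the original byte is in place, so any other thread executing that address during the window would not trap.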
@jasonmolenda @dougall @pervognsen What you described leads to potentially-missed breakpoints in multithreaded debuggees (unless you stop all the threads, which can be many), so they added this "displaced stepping" algorithm and made it the default 🙁
@amonakov @dougall @pervognsen oh yes! this non-stop debugging I've heard about, that was added after I stopped working on gdb. OK that makes sense. lldb stops all threads and instruction steps the thread that has a breakpoint when you resume, then resumes all threads.
@jasonmolenda @dougall @pervognsen Non-stop indeed! We have bugs in yo debugger so you can debug while you debug. A relentless! Debugging! Experience!
@dougall My guess would be for use as import thunks for large model code (>2GB text) without burning indirect target predictor slots on it
@dougall My guess is the absolute jump is going there so the dynamic linker can directly put the target address in the plt using a relocation, avoiding a load from the got and a register use.

@dougall *puts on tinfoil hat* CCMP lets you turn whatever condition you want into OF set (just do a CCMP on !cond of reg with itself and set OF to 1 on the "not set" path) and then you can just use INTO!

Ok sure, *technically* not allowed in x86-64 ever, but still!!11