It's beautiful! Several overdue improvements to keep x86 competitive with ARM. I love competition.

https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

Introducing Intel® Advanced Performance Extensions (Intel® APX)

Intel's new PUSH2/POP2 are similar to ARM's LDP/STP. I think these are extremely underrated instructions. Loads and stores are quite expensive, but these processors already support 128-bit loads and stores for vector instructions.

Zen 3 and the Apple M1 can both do 3 loads per cycle, but with LDP, the M1 can load 2x the scalar registers per cycle – kinda crazy. It's a shame compilers aren't better at using these instructions, and that the Intel paired load/store is restricted to stack push/pop.

The "Balanced PUSH/POP Hint" is a little odd, but mirrors an optimisation used by the M1 that detects matching pushes and pops with the register numbers typically used by compilers in function prologues and epilogues, and performs fast forwarding. There's a store-to-load-forwarding-like penalty in cases where the instructions are incorrectly paired, but don't actually alias, so I guess this hint could avoid that.

Conditional loads and stores are the biggest surprise to me so far. But they kind of make sense – you already have predicated loads and stores happening on the vector side, so it's nice to see that as an option in scalar code too.

Should also allow for a conditional trap by NULL-pointer-deref (or by writing to RIP+0 if you have W^X and want to save a byte?)

Adding that to my ARM wish-list.

Conditional compare is great – hopefully x86 people will improve compiler support – it's a surprisingly tricky problem with only one flags register. And ARM doesn't have conditional test, so that's exciting!

64-bit absolute jump... Nice to have, but why? You can't use it in position independent code. You can't use it as a call-out-of-JIT. Is it just a jump-out-of-JIT? Wasn't "mov rax, target ; jmp rax" fine? Maybe for dynamic linkers, JITs, or dynamic instrumentation?

@dougall Yeah, curious about the absolute jump. They definitely wouldn't include it for no reason but I can't think of a strong motivator either.

@pervognsen @dougall I can't imagine this was a consideration, but one thing we've looked at in lldb is hot-patching instructions in a binary to jump to code we've jitted into the process. We need to replace instruction(s) with our jump, and the more instructions we have to replace, the trickier this gets. (the one time we've tried this technique was for conditional breakpoints, evaluating the condition in the process, putting a breakpoint in the evals-true case.)

@jasonmolenda @pervognsen Nice! Yeah, that came to mind, but "mov + jmp" is 12 bytes, and I'm pretty sure JMPABS is 11 bytes? Saving a byte is nice, but I can't imagine it'll make a huge difference?

(It's given as "REX2,NO66,NO67,NOREP MAP0 W0 A1 target64" – REX2 would be the new 2-byte prefix. I guess "MAP0 W0" are the bits in the REX2 payload? "A1" would be a 1-byte opcode, then 8 bytes for "target64".)
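
To make the byte counting above concrete, here is a rough sketch that builds both sequences and compares their lengths. The `mov rax, imm64` / `jmp rax` bytes are the standard x86-64 encodings. The JMPABS layout is an assumption based on the reading in this thread plus the REX2 prefix byte being 0xD5 with an all-zero payload for MAP0/W0; check it against the APX specification before relying on it. The function names are made up for illustration.

```python
import struct

def mov_jmp(target: int) -> bytes:
    """mov rax, imm64 ; jmp rax (clobbers RAX)."""
    # 48 B8 = REX.W + MOV RAX, imm64; FF E0 = JMP RAX
    return b"\x48\xB8" + struct.pack("<Q", target) + b"\xFF\xE0"

def jmpabs(target: int) -> bytes:
    """Assumed JMPABS layout: REX2 prefix, payload, opcode A1, target64."""
    # D5 = REX2 prefix byte (assumption), 00 = payload with MAP0 and W0,
    # A1 = 1-byte opcode, followed by the 8-byte little-endian target.
    return b"\xD5\x00\xA1" + struct.pack("<Q", target)

target = 0x00007FFF12345678  # arbitrary example address
print(len(mov_jmp(target)), len(jmpabs(target)))  # prints: 12 11
```

Under those assumptions the saving is one byte, plus not needing a scratch register.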

@dougall @jasonmolenda Anything but single-byte code injection is such a nightmare on x86. Do you know if there's a smallish, self-contained library? There seem to be so many prerequisites to handle the general case; at a minimum you need the CFG for the enclosing function. It doesn't seem many steps short of a full recompiler, unfortunately, and those tend to be heavyweight.

@pervognsen @jasonmolenda I don't really – @comex's substitute comes to mind, but is unmaintained (and was maybe only intended to target function prologues, rather than arbitrary locations?) I haven't kept up with what people are using instead.

(IIRC it just refuses to inject if the hook would overlap a possible jump target (as determined by scanning forwards), then rewrites the replaced instructions to fix any RIP-relative operands.)
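
A minimal sketch of the "refuse if the hook overlaps a possible jump target" check described above. This is not substitute's code. It assumes the instruction boundaries and the set of candidate jump targets have already been found by some disassembler pass, and every name here is hypothetical.

```python
def patched_region(hook_start, hook_len, insns):
    """Extend the hook to whole instructions.

    insns: ascending list of (address, length) starting at hook_start.
    Returns (start, end), where end is the first byte after the last
    instruction that the hook overwrites.
    """
    end = hook_start
    for addr, length in insns:
        if end >= hook_start + hook_len:
            break
        end = addr + length
    if end < hook_start + hook_len:
        raise ValueError("not enough decoded instructions to cover the hook")
    return hook_start, end

def can_inject(hook_start, hook_len, insns, jump_targets):
    """Refuse when any known jump target lands strictly inside the region."""
    start, end = patched_region(hook_start, hook_len, insns)
    # A target equal to `start` is fine: it lands on the hook's first byte.
    return not any(start < t < end for t in jump_targets)

insns = [(0x1000, 4), (0x1004, 3), (0x1007, 5), (0x100C, 2)]
print(can_inject(0x1000, 11, insns, {0x1007}))  # target inside the hook
print(can_inject(0x1000, 11, insns, {0x100C}))  # target just past the hook
```

As the thread notes, a check like this only covers targets the scan can see; indirect jumps into the region are not ruled out.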

https://github.com/comex/substitute

@pervognsen @jasonmolenda @comex Oh, and I think it does a dance of stopping all other threads and then stepping any of them that happen to be currently executing in the hook region? (The end of a call operation would also be considered a possible jump target and injection would fail. But this still doesn't rule out some other code elsewhere in the binary jumping into the middle of your hook. And it's impractical to rule out indirect jumps into the middle of the hook.)

@dougall @jasonmolenda @comex Yeah, every time I've contemplated doing this in a debugger/instrumentation, I've come to the conclusion that the never-ending tail of fundamentally hard issues makes it untenable as a general solution (e.g. good luck "stepping to the end" for a while (!quit) { ... } loop). However, you can do some kind of best-effort local code modification and then fall back to int3 when you fail.

@pervognsen @dougall @comex Yeah doing this in arbitrary code becomes very tricky; it wasn't implemented beyond a prototype. I can't remember how he picked which instructions to execute in the jitted block (did he avoid anything with pc-relative encodings??) and how he avoided inserting this across a branch point. It is nice to not need to use a register. I didn't think this was the real motivator, but an external debugger is one place where an absolute address might be useful.

@pervognsen @dougall @comex all that being said, conditional breakpoints in a debugger, where the debugger has to halt the process every time, are nearly useless in a hot codepath so there's a real benefit for trying to find a way to do this in-process, even if it isn't possible in all cases and we have to fall back to stop & compare.

@jasonmolenda @dougall @comex Yup, absolutely. At some point mode switching (perhaps via an int3 handler) to a fully software-based machine code emulator becomes more attractive once certain int3 break/probe points become too hot. That has its own massive engineering challenges but gives a lot of flexibility and can have decent performance (certainly compared to int3 thrashing).

@jasonmolenda @pervognsen @dougall As a frequent user of conditional breakpoints and breakpoint commands, I’d love to see that. Even better would be to also support JITting a subset of breakpoint commands. (Ever used GDB's dprintf?)

@pervognsen @dougall @jasonmolenda My library didn't step; it checked which instruction in the patched region the thread was at, and moved the PC to the corresponding instruction in the relocated code.
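
The PC-moving idea can be sketched roughly as follows. This is an illustration only, not the library's implementation: `remap_pc` and `relocation_map` are hypothetical names, and the map from each original instruction address to its copy in the relocated code is assumed to be built while relocating.

```python
def remap_pc(pc, patch_start, patch_end, relocation_map):
    """Return the PC a stopped thread should resume at after patching.

    relocation_map: original instruction address -> relocated address,
    with one entry per instruction in [patch_start, patch_end).
    """
    if not (patch_start <= pc < patch_end):
        return pc  # thread is outside the patched region; leave it alone
    # A stopped thread sits on an instruction boundary, so the lookup
    # should succeed; a KeyError means the map is incomplete.
    return relocation_map[pc]

# Three original instructions copied to a trampoline at 0x9000.
relocation_map = {0x1000: 0x9000, 0x1004: 0x9004, 0x1007: 0x9010}
print(hex(remap_pc(0x1004, 0x1000, 0x100C, relocation_map)))
print(hex(remap_pc(0x2000, 0x1000, 0x100C, relocation_map)))
```

The relocated addresses need not be evenly spaced, since a rewritten PC-relative instruction may expand into a longer sequence.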

For the relocated code, PC-relative instructions were converted to sequences of instructions that did the same thing using absolute addresses. This is for x86: https://github.com/comex/substitute/blob/95f2beda374625dd503bfb51a758b6f6ced57887/lib/x86/arch-transform-dis.inc.h#L24 On x86 it had to use the stack, unlike ARM (either 32- or 64-bit).
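
One way to see why absolute addresses are attractive here: a RIP-relative operand is encoded as a signed 32-bit displacement from the end of the instruction, so once the instruction is moved the displacement has to be recomputed, and it may no longer fit. The sketch below shows only that arithmetic. It is not the library's approach (which, per the post above, rewrites to absolute-address sequences), and `rebase_disp32` is a hypothetical helper.

```python
def rebase_disp32(old_addr, insn_len, old_disp, new_addr):
    """Recompute a RIP-relative disp32 for an instruction moved to new_addr.

    Returns the new displacement, or None if the target is out of range
    of a signed 32-bit displacement from the new location.
    """
    # RIP-relative operands are relative to the *next* instruction.
    target = old_addr + insn_len + old_disp
    new_disp = target - (new_addr + insn_len)
    if -2**31 <= new_disp < 2**31:
        return new_disp
    return None

# Moved nearby: the displacement can simply be adjusted.
print(rebase_disp32(0x1000, 7, 0x200, 0x2000))
# Moved to a distant trampoline: no disp32 can reach the target.
print(rebase_disp32(0x1000, 7, 0x200, 0x7000_0000_0000))
```

When the result is out of range, the relocated code has to reach the original target some other way, such as materialising the absolute address.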

@comex @pervognsen @jasonmolenda Oh, right – that makes more sense. (Thanks for publishing it! I really enjoyed reading it however many years ago – lots of great tricks and ideas.)

@dougall @pervognsen @jasonmolenda Aside from being unmaintained, my library also only worked on macOS.

The thing about injecting into the middle of a function is that you must have already looked at the function in a disassembler, or else you wouldn't know where to inject or which registers or stack locations have the data you want. So you could just manually ensure the injection point doesn't overlap a jump target. Though it would be neat if a library let you just inject anywhere.