I spent some time looking at GBA stuff after posting about it a few days ago. It's been so long since I've touched ARM32 that I'd forgotten the insane shit you can do in one instruction, e.g. LDMEQFD SP!, {R0, R2-R5, PC}.
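
A rough sketch of what that one instruction does, as a toy Python model. The dict-based register file, the word-addressed `memory` dict and the `ldm_eq_fd` helper are all made up for illustration; only the semantics (conditional on EQ, pop from a full-descending stack, write-back to SP, PC loaded last as a branch) follow the ARM32 instruction.

```python
# Register numbering order; LDM always assigns the lowest-numbered
# register to the lowest memory address.
REG_ORDER = [f"R{i}" for i in range(13)] + ["SP", "LR", "PC"]

def ldm_eq_fd(regs, memory, z_flag, reg_list):
    """Toy model of LDMEQFD SP!, {reg_list}: a conditional multi-register pop."""
    if not z_flag:
        # EQ condition fails: the whole instruction behaves as a no-op.
        return regs
    addr = regs["SP"]
    for name in sorted(reg_list, key=REG_ORDER.index):
        regs[name] = memory[addr]   # loading PC here is the "return"
        addr += 4
    regs["SP"] = addr               # "!" writes the final address back to SP
    return regs

# Six words on the stack: five saved registers plus a return address.
regs = {name: 0 for name in REG_ORDER}
regs["SP"] = 0x1000
memory = {0x1000 + 4 * i: v for i, v in enumerate([10, 20, 30, 40, 50, 0x8000])}
ldm_eq_fd(regs, memory, z_flag=True, reg_list=["R0", "R2", "R3", "R4", "R5", "PC"])
```

So a single predicated instruction restores five registers, pops the stack and returns from the function.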

Nick Ludlam 19h ago

@pervognsen Is that still technically RISC? Or has RISC just shifted its baseline because of how execution architecture has matured?

Per Vognsen 19h ago

@nick I'm pretty sure that LDM would have worked as-is with the original ARM1 instruction set so this was there in the beginning. ARM has never been RISC in any meaningful sense. It's a load/store architecture with a bunch of GPRs but that's about it. I guess if you wanted to be snide, you could say that it shares in the earliest RISC tradition of shipping parts of your microarchitecture as the ISA (barrel shifter, predication, etc) like MIPS did with branch delay slots and imprecise exceptions.

Fabian Giesen 19h ago

@pervognsen @nick FWIW the string-ish multi-loads were in early POWER as well and that was definitely sticker-label RISC

Per Vognsen

@rygorous @nick The pièce de résistance is combining it with predication and the PC as a pseudo-GPR. Now we're cooking.

Fabian Giesen 19h ago

@pervognsen @nick reference on early POWER multi-loads https://bitsavers.org/pdf/ibm/IBM_Journal_of_Research_and_Development/341/ibmrd3401E.pdf pp. 7-10 starting with "The RS/6000 architecture has adopted the following strategy for dealing with misaligned data."

The load-multiple section starts on p. 9 with "Another aspect of including string operations..."

Fabian Giesen 19h ago

@pervognsen @nick I will say that they are IMO bang on the money here on _all_ counts - calling out that

a) mem copies/string copies etc. are important and usually unaligned
b) Alpha-esque "we give you a way to do SWAR loops for this" only gets you so far,
c) for load/store multiple, that function prologues/epilogues are the key use case

other ISAs have struggled to learn that lesson 30 years later...

Wolf480pl 19h ago

@rygorous @pervognsen @nick

> The architecture allows for the partial completion of an operation and the generation of an alignment-check interrupt when the data crosses a cache-line boundary. System software can then complete the instruction by fixing up the affected registers or memory locations.

this has EINTR vibes

Fabian Giesen 19h ago

@wolf480pl @pervognsen @nick also how REP MOVS/STOS, the new ARM mem block copies/sets, ARM SVE loads/stores (first fault lane!) etc. work! (At page not cache line level)

Fabian Giesen 19h ago

@wolf480pl @pervognsen @nick specifically it's very interesting that, 30 years after POWER initially defined this (and, mind, they deprecated this for most of the intervening time), we're now back to a world where more and more ISAs are coming around to their original PoV, for pretty much the exact reasons they gave

Owen Anderson 18h ago

@rygorous re: prologues/epilogues, it's also interesting to observe that register windows identified the problem correctly, just not the right solution.

Fabian Giesen 18h ago

@resistor Yup!

And I think part of the reason the uptake was so delayed was a big detour in the middle: when RISCs were originally defined, it was rare for compilers to do aggressive global opts or aggressive inlining.

First-order, especially for small frequently-called subroutines, inlining is better !/$ than making call sequences cheap.

But now we're all the way around to aggressive inlining + deep superscalar + giant code working sets.

Fabian Giesen 18h ago

@resistor And suddenly we care a lot about decreasing call overhead again because inlining even big-ish fns just to avoid prologue/epilogue overhead is in many ways a cure worse than the disease again

Tom Forsyth 17h ago

@rygorous @wolf480pl @pervognsen @nick And gather/scatter 🙂

Fabian Giesen 17h ago

@TomF @wolf480pl @pervognsen @nick well they don't actually work so.... (ever since GDS)

Tom Forsyth 17h ago

@rygorous @wolf480pl @pervognsen @nick Oh, I had not kept up to date with this. Fun!

Wolf480pl 16h ago

@TomF
@rygorous what's gather/scatter?

Wolf480pl 16h ago

@TomF
@rygorous
oh, this stuff? https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)

specifically the AVX2 implementation of it?

Tom Forsyth 16h ago

@wolf480pl @rygorous e.g. https://www.felixcloutier.com/x86/vgatherdps:vgatherdpd

Fabian Giesen 16h ago

@TomF @wolf480pl @pervognsen @nick I mean the instructions are still there but they just bail into full microcode fallback now

Tom Forsyth 14h ago

@rygorous @wolf480pl @pervognsen @nick I'm a little surprised these cores don't have a segregated mode on a chicken bit for all their register files by now. How many bugs of essentially the same format is this now?

Fabian Giesen 14h ago

@TomF not nearly as many as there are different named exploits, a lot of them were Intel doctoring around on symptoms because the real underlying issue was a fundamental problem with the cache access path design that was unfixable without a major uArch rev

Fabian Giesen 14h ago

@TomF specifically the Spectre stuff (which boils down to data-dependent branches cause data to leak into branch history) was exploitable ~everywhere, on every uArch and every ISA, and arguably not really Intel's fault, it's a fundamental issue with speculation.

The thing that really reamed Intel, Meltdown/L1TF and friends, was an unforced mistake in their L1 access path design.

Fabian Giesen 14h ago

@TomF Namely, everyone else either does privilege checks up front, or at most did them in parallel with the access path and made sure to mux in 0 on the data returns in case of privilege check failure.

Intel did the privilege checks in parallel/late and made the instruction raise an exception on retirement, but still forwarded the actual privileged data (that you weren't supposed to be able to read) onwards to dependent insns regardless.

Fabian Giesen 14h ago

@TomF As for GDS, I am really surprised that all the Spectre-era exploits apparently did not cause Intel to do an internal audit of all speculative state and see if it might leak to attackers.

I am not surprised that the bug exists in Skylake/SKX era uArchs, and it would be totally fine if Intel found this in a post-Spectre security audit but kept quiet about it until it was discovered externally or similar, but it doesn't look like that's what happened.

Fabian Giesen 14h ago

@TomF Instead, from the response (and the fact that it affects many post-SKL uArchs), the likely conclusion is that they still hadn't gone over all shared and potentially security-sensitive state in the memory access path with a fine-toothed comb by 2023, 5 years after Meltdown, which is disappointing to say the least.

Tom Forsyth 13h ago

@rygorous Five years at Intel is like six months anywhere else.

Fabian Giesen 13h ago

@TomF one would assume that by the third time you step on that particular rake, you maybe start looking for these issues on your own and try to prevent them even if someone hasn't fed you a PoC exploit yet

Josh Jersild 13h ago

@rygorous @TomF "surely all the rakes have been stepped on by now"

Tom Forsyth 13h ago

@JoshJers @rygorous Just going to turn Transactional Memory on again, BRB.

Fabian Giesen 12h ago

@TomF @JoshJers look, it's simple math, there's a finite number of possible bugs there so eventually we have to run out

...right?

Fabian Giesen 12h ago

@TomF @JoshJers it's a simple plan, we just take the most cursed problem in comp arch (memory access) and make it worse, what could go wrong?

Josh Jersild 11h ago

@rygorous @TomF I see no flaws in this plan

Per Vognsen 19h ago

@wolf480pl @rygorous @nick Regarding EINTR vibes, this is also true with something like REP MOVSB at page boundaries if there are soft faults. Or interrupts for that matter, but it happens even with exceptions, analogous to the cache line case.
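
As a toy illustration of that restartability (not how any real core implements it): the copy's progress lives in the architectural registers, so a fault halfway through just leaves RSI/RDI/RCX pointing at the remaining work, and re-executing the same instruction finishes the job. `PageFault`, `unmapped_pages` and the bytearray-backed memory are assumptions of this sketch.

```python
PAGE = 4096

class PageFault(Exception):
    """Stand-in for the hardware exception; carries the faulting address."""

def rep_movsb(mem, regs, unmapped_pages):
    """Toy REP MOVSB: copy RCX bytes from [RSI] to [RDI], ascending addresses."""
    while regs["rcx"] > 0:
        for addr in (regs["rsi"], regs["rdi"]):
            if addr // PAGE in unmapped_pages:
                # Registers already reflect every byte copied so far.
                raise PageFault(addr)
        mem[regs["rdi"]] = mem[regs["rsi"]]
        regs["rsi"] += 1
        regs["rdi"] += 1
        regs["rcx"] -= 1

mem = bytearray(3 * PAGE)
mem[PAGE - 4:PAGE + 4] = b"abcdefgh"          # source straddles pages 0 and 1
regs = {"rsi": PAGE - 4, "rdi": 2 * PAGE, "rcx": 8}
unmapped = {1}                                # page 1 is not mapped yet

try:
    rep_movsb(mem, regs, unmapped)
except PageFault:
    remaining_after_fault = regs["rcx"]       # partial progress is visible here
    unmapped.clear()                          # "OS" maps the page in
    rep_movsb(mem, regs, unmapped)            # same instruction, resumed
```

The difference from the POWER quote is who does the fixing up: here the instruction is simply re-run, rather than system software completing it by hand.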

Wolf480pl 18h ago

@pervognsen @rygorous @nick

hmm are there any Unix syscalls that can partially happen and then return EINTR? I guess not... read() and write() can partially complete but then they return a length, and you don't get to know if it was short because of a signal...
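
For reference, the usual caller-side answer to short writes is a retry loop like the sketch below. `write_all` is a hypothetical helper name; `os.write` is the real call and returns the number of bytes actually written. Python 3.5+ already retries on EINTR internally (PEP 475), so the `InterruptedError` branch is mostly there to mirror what the equivalent C loop has to do.

```python
import os

def write_all(fd, data):
    """Keep calling write() until every byte of `data` has been accepted."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        try:
            # May write fewer bytes than requested; the return value says how many.
            written = os.write(fd, view[total:])
        except InterruptedError:
            # Interrupted before anything was written: just try again.
            continue
        total += written
    return total
```

The caller never learns why a write was short, only how far it got, which is enough to resume.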

so it looks like IBM's string instructions requiring "fixing up registers or memory" is even worse

Wolf480pl 18h ago

@pervognsen @rygorous @nick
but my point was more about "it's an edge case we don't want to handle, let's create a new edge case one layer up and let those folks handle it"

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick check out how MIPS handles exceptions triggered from branch delay slots one day :P

Wolf480pl 18h ago

@rygorous @pervognsen @nick
sounds fun!

but first I'll try to guess what it does :P

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick (you can't just save the address of the faulting instruction and resume there, because if it's in a branch delay slot, now you end up falling through the branch instead of executing it)

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick likewise why are MIPS k0 and k1 registers "reserved for the kernel"? Can't the kernel save its own regs when it needs to? And why do they need to be reserved all the time, can't they just be reserved around syscalls or something? :P

Nick Ludlam 18h ago

@rygorous @wolf480pl @pervognsen If you're really feeling masochistic you could try out some RISC-V. Or maybe there's some pleasure in the strictness over there.

Wolf480pl 18h ago

@rygorous @pervognsen @nick
Hmm so upon an exception, a MIPS CPU only:
- disables interrupts
- saves PC in EPC
- fills the Cause register
- jumps to a hard-wired address
?

So it doesn't save any of the GPRs for you, and unlike in ARM, there is no separate copy of a subset of registers for each type of exception?

Wolf480pl 18h ago

@rygorous @pervognsen @nick

So to save a register, you need an address to save it to. I'm guessing on MIPS you don't get to put a literal address in the store instruction.

So you need to put the address in a register.

Some other CPUs may save the stack pointer for you, and replace it with one defined in the exception vector. But not MIPS.

So you will have to clobber one of the user's registers to build an address to save the registers to.

Hence k0.

Wolf480pl 18h ago

@rygorous @pervognsen @nick
I was thinking you'd need both k0 and k1 when returning from an exception, but seems like this is not the case?

let's say the address to your register save area is in $t0

- store return address in k0

- for all regs except t0, k0:
lw <reg> <off>($t0)

- lw $t0 <off>($t0)

- jr $k0

I must be missing sth on entry then.

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick sort of. the idea is k0/k1 are permanently roped off for use of the exception handler, _especially_ the TLB miss (soft fault) handler, and ideally you don't save any regs in there at all, you just try and make do with just the 2 regs.

If you take exceptions on TLB miss, you want there to be as little state-saving around it as humanly possible.

Wolf480pl 17h ago

@rygorous @pervognsen @nick
wow...

That makes sense.

Kinda reminds me of how, on x86_64-unknown-linux-gnu, the PLT thunk that calls into the dynamic linker when the address in the GOT is not filled in yet can only clobber one register, RAX

Alexander Monakov 11h ago

@wolf480pl @rygorous @pervognsen @nick

well, not exactly!

it is very definitely not allowed to clobber RAX, because AL carries the count of SSE registers with floating-point arguments when calling a variadic function!

hence, Glibc moves RAX to R11 after returning from the full resolver to the asm stub, restores RAX, uses R11 to make the final jump into the resolved function

Alexander Monakov 11h ago

@wolf480pl @rygorous @pervognsen @nick and, going off on a tangent, one of the bugs I consider quite famous is this: there's a range of Glibc versions where, if you call a function that receives 512-bit vector arguments via PLT, their upper 256-bit halves are zeroed out on the first call

(because of course the old dynamic linker has no idea what AVX-512 even is, it just saves/restores 256-bit YMM registers, and 256-bit loads do not merge into the 512-bit register, they zero out the high part)

Alexander Monakov 11h ago

@wolf480pl @rygorous @pervognsen @nick

(didn't happen with the even older dynamic linker that had no idea what AVX even is, because 128-bit loads _are_ merging into the 256-bit register)

Fabian Giesen 9h ago

@amonakov @wolf480pl @pervognsen @nick only if not VEX encoded!
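
A toy model of the merge-vs-zero distinction, with registers as plain Python ints. The function names are invented for this sketch; the behaviour modelled is that legacy-SSE-encoded 128-bit writes leave the upper bits of the wider register alone, while VEX-encoded writes zero everything above the written width.

```python
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1

def load128_legacy_sse(reg, value):
    """Legacy SSE encoding: low 128 bits replaced, upper bits preserved (merge)."""
    return (reg & ~MASK128) | (value & MASK128)

def load_vex(value, width):
    """VEX encoding: the low `width` bits are written, everything above is zeroed."""
    return value & ((1 << width) - 1)

def save_restore_ymm_only(zmm):
    """What a resolver that only knows about YMM does to a live ZMM argument."""
    saved = zmm & MASK256          # only the low 256 bits get saved
    # ... the resolver runs and may clobber the register ...
    return load_vex(saved, 256)    # the VEX restore wipes bits 511:256

zmm_arg = (0xABCD << 256) | 0x1234
after_first_call = save_restore_ymm_only(zmm_arg)
merged = load128_legacy_sse((0xAA << 128) | 0x1, 0x2)
```

In this model the upper half of `zmm_arg` is gone after the round trip, while the legacy-SSE load keeps the `0xAA` sitting above bit 127.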

Alexander Monakov 9h ago

@rygorous @wolf480pl @pervognsen @nick
thanks!

I should probably mention that history is not going to repeat itself if ZMM width is doubled, because the dynamic linker now uses the forward-compatible XSAVE instruction, which dumps extended state on its own given a long-enough buffer

So if it breaks again, it will be in a new and exciting manner

Fabian Giesen 9h ago

@amonakov @wolf480pl @pervognsen @nick I think we're good on vector width for the next 1.5 decades at least, they pushed into 512b way earlier than it really made sense to. (Granted, which is a large part of why they then proceeded to not actually ship AVX-512 on most SKUs for the next 10 years after.)

The APX-induced GPR-breakage-maybe will hit this year, and I think our next big sorta-ABI-breaking thing is probably going to have to be cache line size.

Fabian Giesen 9h ago

@amonakov @wolf480pl @pervognsen @nick That one is more transparent than most but we're not gonna get double-cacheline-wide vector regs. Unaligned is one thing but that is just too gross to seriously contemplate

Wolf480pl 18h ago

@rygorous @pervognsen @nick
ok so this will probably make anyone who actually knows MIPS cringe, but

if you somehow knew the branch target (maybe it's saved somewhere, next to the faulting instruction's address)

and if branches in delay slots are legal (I'd be impressed if they are)

maybe you could (in pseudocode):

jr $reg_saved_pc
jr $reg_saved_branch_target

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick branches inside MIPS branch delay slots are not legal, but the branch target address is not saved for you anywhere

Wolf480pl 18h ago

@rygorous @pervognsen @nick
bummer

is it legal for an instruction in a branch delay slot to be branch target?

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick yes

Wolf480pl 18h ago

@rygorous @pervognsen @nick
ok, now I'm convinced it's impossible to return from an exception on MIPS

yet they somehow do it so I'll have to look it up

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick ta daaa

Wolf480pl 18h ago

@rygorous @pervognsen @nick
This is smart.

I suspected such a flag would be needed, but I didn't think the "should I subtract 4 or not" logic would be in hardware.

It's the opposite of EINTR, it actively tries to make the upper layer's life easier.
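
A minimal sketch of that flag scheme as I understand it, assuming the usual MIPS layout where bit 31 of the Cause register is BD ("branch delay") and EPC is backed up to the branch whenever BD is set. The helper names are hypothetical.

```python
CAUSE_BD = 1 << 31  # Cause.BD: the exception hit an instruction in a delay slot

def faulting_pc(epc, cause):
    """Address of the instruction that actually faulted."""
    # With BD set, EPC points at the branch, so the victim is the next word.
    return epc + 4 if cause & CAUSE_BD else epc

def resume_pc(epc):
    """Where the handler returns to: always EPC.

    When BD was set this re-executes the branch and then its delay slot,
    so the branch is not lost.
    """
    return epc
```

The handler only needs the BD check when it wants to inspect or emulate the faulting instruction; plain restart is the same in both cases.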

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick IIRC it didn't use to be

This is the R3000 version. I believe the R2000 (?) signaled it by setting bit 1 of EPC (which is always 4B aligned otherwise) and expected the handler to fix it up. Something along those lines.

Wolf480pl 18h ago

@rygorous @pervognsen @nick
and does setting bit 0 of EPC unlock the secret cow level?

wait no that'd be ARM

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick if by secret cow level you mean jazelle, then yes

Wolf480pl 18h ago

@rygorous @pervognsen @nick
I meant Thumb

Fabian Giesen 18h ago

@wolf480pl @pervognsen @nick that's bit 1