I spent some time looking at GBA stuff after posting about it a few days ago. It's been so long since I've touched ARM32 that I'd forgotten the insane shit you can do in one instruction, e.g. LDMEQFD SP!, {R0, R2-R5, PC}.
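
For readers who haven't touched ARM32 either, here is a rough Python sketch of what that one instruction does. It is a toy model, not an emulator: `ldm_eq_fd`, `REG_ORDER`, the dict-based register file and the word-addressed `memory` dict are all made up for illustration, and the "PC just moves to the next instruction" part of a failed condition is not modeled.

```python
# Register numbering order; LDM always loads the lowest-numbered register
# from the lowest address, whatever order the list was written in.
REG_ORDER = ["R0", "R1", "R2", "R3", "R4", "R5", "R6", "R7",
             "R8", "R9", "R10", "R11", "R12", "SP", "LR", "PC"]


def ldm_eq_fd(regs, z_flag, memory, reg_list):
    """Toy model of LDMEQFD SP!, {reg_list}.

    regs:   dict of register name -> value (hypothetical representation)
    memory: dict of byte address -> 32-bit word (hypothetical representation)
    """
    if not z_flag:
        # EQ predicate fails: the whole instruction does nothing.
        return dict(regs)

    out = dict(regs)
    addr = regs["SP"]
    # FD (full descending) on a load is the same as IA: start at SP, go upward.
    for name in sorted(reg_list, key=REG_ORDER.index):
        out[name] = memory[addr]
        addr += 4
    # The "!" means write the final address back to the base register.
    out["SP"] = addr
    return out
```

So with Z set this pops five GPRs and then the PC, i.e. it restores saved registers and returns from the function in one conditional instruction. With Z clear nothing happens at all.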

Nick Ludlam 20h ago

@pervognsen Is that still technically RISC? Or has RISC just shifted its baseline because of how execution architecture has matured?

Per Vognsen 20h ago

@nick I'm pretty sure that LDM would have worked as-is with the original ARM1 instruction set so this was there in the beginning. ARM has never been RISC in any meaningful sense. It's a load/store architecture with a bunch of GPRs but that's about it. I guess if you wanted to be snide, you could say that it shares in the earliest RISC tradition of shipping parts of your microarchitecture as the ISA (barrel shifter, predication, etc) like MIPS did with branch delay slots and imprecise exceptions.

Fabian Giesen 20h ago

@pervognsen @nick FWIW the string-ish multi-loads were in early POWER as well and that was definitely sticker-label RISC

Per Vognsen 20h ago

@rygorous @nick The pièce de résistance is combining it with predication and the PC as a pseudo-GPR. Now we're cooking.

Fabian Giesen 20h ago

@pervognsen @nick reference on early POWER multi-loads https://bitsavers.org/pdf/ibm/IBM_Journal_of_Research_and_Development/341/ibmrd3401E.pdf pp. 7-10 starting with "The RS/6000 architecture has adopted the following strategy for dealing with misaligned data."

The load-multiple section starts on p. 9 with "Another aspect of including string operations..."

Fabian Giesen 20h ago

@pervognsen @nick I will say that they are IMO bang on the money here on _all_ counts - calling out that

a) mem copies/string copies etc. are important and usually unaligned
b) Alpha-esque "we give you a way to do SWAR loops for this" only gets you so far,
c) for load/store multiple, that function prologues/epilogues are the key use case

other ISAs have struggled to learn that lesson 30 years later...
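
For anyone unfamiliar with the SWAR loops in (b), here is a minimal sketch of the general idea, using the classic "does this word contain a zero byte" bit trick. It is only an illustration in Python: it is not Alpha's actual byte-manipulation instructions, and `has_zero_byte` and `swar_strlen` are names invented here.

```python
MASK64 = (1 << 64) - 1
ONES = 0x0101010101010101   # 0x01 in every byte lane
HIGHS = 0x8080808080808080  # 0x80 in every byte lane


def has_zero_byte(word):
    """True if any of the 8 bytes packed into the 64-bit int `word` is zero."""
    # Subtracting 1 from each lane borrows into the lane's high bit only if
    # the lane was 0x00 (or its high bit was already set, which ~word masks out).
    return ((word - ONES) & ~word & HIGHS & MASK64) != 0


def swar_strlen(buf):
    """Length of the NUL-terminated string at the start of `buf` (bytes)."""
    offset = 0
    # Main loop: test 8 bytes per step instead of 1.
    while offset + 8 <= len(buf):
        word = int.from_bytes(buf[offset:offset + 8], "little")
        if has_zero_byte(word):
            break
        offset += 8
    # Finish byte by byte: either the tail, or the word known to hold the NUL.
    while offset < len(buf) and buf[offset] != 0:
        offset += 1
    return offset
```

Note that this sketch slices the buffer at whatever offset it likes. On real hardware the word loads have to be aligned or the ISA has to tolerate misalignment, and that is where point (a) comes in.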

Wolf480pl 20h ago

@rygorous @pervognsen @nick

> The architecture allows for the partial completion of an operation and the generation of an alignment-check interrupt when the data crosses a cache-line boundary. System software can then complete the instruction by fixing up the affected registers or memory locations.

this has EINTR vibes

Fabian Giesen 20h ago

@wolf480pl @pervognsen @nick also how REP MOVS/STOS, the new ARM mem block copies/sets, ARM SVE loads/stores (first fault lane!) etc. work! (At page not cache line level)
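
A toy Python model of that restartable behaviour, loosely in the REP MOVS style where the source, destination and count registers record how far the copy got. Everything here is an assumption for illustration: `PageFault`, `rep_movsb`, `run_with_handler`, the dict-based `state` and `memory`, and the "handler" that simply maps the missing page.

```python
class PageFault(Exception):
    """Raised by the toy copy when it touches an unmapped page."""


def rep_movsb(state, memory, mapped_pages, page_size=4096):
    """Copy state['count'] bytes from state['src'] to state['dst'].

    On a fault the registers in `state` are left describing the remaining
    work, so re-executing the same instruction resumes where it stopped.
    """
    while state["count"] > 0:
        for addr in (state["src"], state["dst"]):
            if addr // page_size not in mapped_pages:
                raise PageFault(addr)
        memory[state["dst"]] = memory.get(state["src"], 0)
        state["src"] += 1
        state["dst"] += 1
        state["count"] -= 1


def run_with_handler(state, memory, mapped_pages, page_size=4096):
    """Stand-in for the OS: map the faulting page, then restart the copy."""
    faults = 0
    while True:
        try:
            rep_movsb(state, memory, mapped_pages, page_size)
            return faults
        except PageFault as fault:
            mapped_pages.add(fault.args[0] // page_size)
            faults += 1
```

The EINTR comparison fits: the operation returns early with partial progress visible, and something outside it decides to carry on. In this model the restart is just re-running the instruction, whereas the quoted POWER text has system software finish the job itself.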

Fabian Giesen

@wolf480pl @pervognsen @nick specifically it's very interesting that, 30 years after POWER initially defined this (and, mind, they deprecated this for most of the intervening time), we're now back to a world where more and more ISAs are coming around to their original PoV, for pretty much the exact reasons they gave

Owen Anderson 19h ago

@rygorous re: prologues/epilogues, it's also interesting to observe that register windows identified the problem correctly, just not the right solution.

Fabian Giesen 19h ago

@resistor Yup!

And I think part of the reason the uptake was so delayed was a big detour in the middle: when RISCs were originally defined, it was rare for compilers to do aggressive global opts or aggressive inlining.

First-order, especially for small frequently-called subroutines, inlining is better !/$ than making call sequences cheap.

But now we're all the way around to aggressive inlining + deep superscalar + giant code working sets.

Fabian Giesen 19h ago

@resistor And suddenly we care a lot about decreasing call overhead again because inlining even big-ish fns just to avoid prologue/epilogue overhead is in many ways a cure worse than the disease again