I spent some time looking at GBA stuff after posting about it a few days ago. It's been so long since I've touched ARM32 that I'd forgotten the insane shit you can do in one instruction, e.g. LDMEQFD SP!, {R0, R2-R5, PC}.
@pervognsen Is that still technically RISC? Or has RISC just shifted its baseline because of how execution architecture has matured?
@nick I'm pretty sure that LDM would have worked as-is with the original ARM1 instruction set so this was there in the beginning. ARM has never been RISC in any meaningful sense. It's a load/store architecture with a bunch of GPRs but that's about it. I guess if you wanted to be snide, you could say that it shares in the earliest RISC tradition of shipping parts of your microarchitecture as the ISA (barrel shifter, predication, etc) like MIPS did with branch delay slots and imprecise exceptions.
@pervognsen @nick FWIW the string-ish multi-loads were in early POWER as well and that was definitely sticker-label RISC
@rygorous @nick The pièce de résistance is combining it with predication and the PC as a pseudo-GPR. Now we're cooking.
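For anyone who hasn't stared at ARM32 in a while, here's a toy Python sketch of everything that one instruction packs in. The dict-of-registers machine model is purely illustrative, not real ARM semantics:

```python
def ldmeqfd(regs, mem, reglist):
    """Toy model of ARM LDMEQFD SP!, {R0, R2-R5, PC}: if the EQ
    condition (Z flag set) holds, pop each listed register from a
    full-descending stack, lowest-numbered register from the lowest
    address, then write back the updated SP. Loading PC is a return."""
    if not regs["Z"]:              # EQ predicate: Z flag must be set
        return regs                # condition false: the whole thing is a no-op
    sp = regs["SP"]
    for r in reglist:              # e.g. ["R0", "R2", "R3", "R4", "R5", "PC"]
        regs[r] = mem[sp]          # pop one word per register
        sp += 4
    regs["SP"] = sp                # writeback ('!')
    return regs
```

One instruction: predicated, pops six registers, adjusts the stack, and returns by loading PC.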

@pervognsen @nick reference on early POWER multi-loads https://bitsavers.org/pdf/ibm/IBM_Journal_of_Research_and_Development/341/ibmrd3401E.pdf pp. 7-10 starting with "The RS/6000 architecture has adopted the following strategy for dealing with misaligned data."

Load-multiple section starts on p. 9: "Another aspect of including string operations..."

@pervognsen @nick I will say that they are IMO bang on the money here on _all_ counts - calling out that

a) mem copies/string copies etc. are important and usually unaligned
b) Alpha-esque "we give you a way to do SWAR loops for this" only gets you so far,
c) for load/store multiple, that function prologues/epilogues are the key use case

other ISAs have struggled to learn that lesson 30 years later...
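For context on (b), a toy Python sketch of what the Alpha-style recipe boils down to: an unaligned 64-bit load becomes two aligned loads plus shift-and-combine (roughly the LDQ_U/extract dance; illustrative only, using a bytearray as memory):

```python
WORD = 8  # 64-bit words, Alpha-style

def unaligned_read(mem, addr):
    """Toy model of an Alpha-esque unaligned 64-bit load: two aligned
    loads plus shifts to stitch the straddling bytes back together."""
    base = addr & ~(WORD - 1)                       # round down to alignment
    lo = int.from_bytes(mem[base:base + WORD], "little")
    hi = int.from_bytes(mem[base + WORD:base + 2 * WORD], "little")
    shift = (addr - base) * 8
    if shift == 0:
        return lo                                   # already aligned
    return ((lo >> shift) | (hi << (64 - shift))) & (2**64 - 1)
```

Fine for one load; in a copy loop the bookkeeping and the tail cases are where it "only gets you so far".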

@rygorous @pervognsen @nick

> The architecture allows for the partial completion of an operation and the
> generation of an alignment-check interrupt when the data crosses a cache-line
> boundary. System software can then complete the instruction by fixing up the
> affected registers or memory locations.

this has EINTR vibes

@wolf480pl @rygorous @nick Regarding EINTR vibes, this is also true of something like REP MOVSB at page boundaries if there are soft faults. Or plain interrupts, for that matter, but it happens even with exceptions, which is analogous to the cache-line case.
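A toy Python sketch of why REP MOVSB gets away with being interruptible: the count and pointers are architectural registers that advance as the copy proceeds, so after a fault or interrupt the same instruction just resumes. The fault model here is invented purely for illustration:

```python
def rep_movsb(mem, rsi, rdi, rcx, fault_at=None):
    """Toy REP MOVSB: copies rcx bytes from rsi to rdi in mem
    (a bytearray). RCX/RSI/RDI advance byte by byte, so if a 'page
    fault' hits mid-copy the returned state is consistent and the
    instruction can simply be re-executed to finish the job."""
    while rcx:
        if rdi == fault_at:
            return rsi, rdi, rcx          # fault: stopped mid-copy, state intact
        mem[rdi] = mem[rsi]
        rsi, rdi, rcx = rsi + 1, rdi + 1, rcx - 1
    return rsi, rdi, rcx
```

No fixup needed by software: re-running the instruction with the updated state completes the copy.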

@pervognsen @rygorous @nick

hmm are there any Unix syscalls that can partially happen and then return EINTR? I guess not... read() and write() can partially complete but then they return a length, and you don't get to know if it was short because of a signal...

so it looks like IBM's string instructions, with their "fixing up registers or memory" requirement, are even worse

@pervognsen @rygorous @nick
but my point was more about "it's an edge case we don't want to handle, let's create a new edge case one layer up and let those folks handle it"
@wolf480pl @pervognsen @nick check out how MIPS handles exceptions triggered from branch delay slots one day :P
@wolf480pl @pervognsen @nick (you can't just save the address of the faulting instruction and resume there, because if it's in a branch delay slot, now you end up falling through the branch instead of executing it)
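A toy Python model of the falling-through problem (the ops and the once-only fault are invented for illustration): with a MIPS-style fixup the handler sees EPC pointing at the branch plus a BD bit in Cause and re-executes the branch; naively resuming at the faulting delay-slot instruction loses the pending branch:

```python
def execute(program, handle_fault_with_bd):
    """Toy in-order machine with branch delay slots. Ops:
    ('branch', target) runs the next instruction (its delay slot)
    before jumping; ('fault', _) faults once, then succeeds on retry;
    ('add', _) is a plain instruction; ('halt', _) stops. Returns the
    trace of executed instruction indices."""
    trace, pc, faulted = [], 0, set()
    while True:
        op, arg = program[pc]
        if op == "halt":
            return trace
        if op == "branch":
            ds = pc + 1                          # delay slot
            if program[ds][0] == "fault" and ds not in faulted:
                faulted.add(ds)                  # take the exception...
                pc = pc if handle_fault_with_bd else ds
                continue                         # ...and resume here
            trace.append(ds)                     # delay slot executes
            pc = arg                             # then the branch takes
        else:
            if op == "fault" and pc not in faulted:
                faulted.add(pc)
                continue                         # retry same instruction
            trace.append(pc)
            pc += 1
```

With the BD fixup the branch is re-executed and the fall-through path never runs; without it, the "skipped" instruction after the delay slot executes anyway.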
@wolf480pl @pervognsen @nick likewise why are MIPS k0 and k1 registers "reserved for the kernel"? Can't the kernel save its own regs when it needs to? And why do they need to be reserved all the time, can't they just be reserved around syscalls or something? :P

@rygorous @pervognsen @nick
Hmm so upon an exception, a MIPS CPU only:
- disables interrupts
- saves PC in EPC
- fills the Cause register
- jumps to a hard-wired address
?

So it doesn't save any of the GPRs for you, and unlike in ARM, there is no separate copy of a subset of registers for each type of exception?

1/

@rygorous @pervognsen @nick

So to save a register, you need an address to save it to. I'm guessing on MIPS you don't get to put a literal address in the store instruction.

So you need to put the address in a register.

Some other CPUs may save the stack pointer for you, and replace it with one defined in the exception vector. But not MIPS.

So you will have to clobber one of the user's registers to build an address to save the registers to.

Hence k0.

2/

@wolf480pl @pervognsen @nick sort of. The idea is that k0/k1 are permanently roped off for use by the exception handler, _especially_ the TLB miss (soft fault) handler, and ideally you don't save any regs in there at all, you just try to make do with those two regs.

If you take exceptions on TLB miss, you want there to be as little state-saving around it as humanly possible.

@rygorous @pervognsen @nick
wow...

That makes sense.

Kinda reminds me of how on x86_64-unknown-linux-gnu, the PLT thunk that calls into the dynamic linker when the GOT entry isn't filled in yet can only clobber RAX

@wolf480pl @rygorous @pervognsen @nick

well, not exactly!

it is very definitely not allowed to clobber RAX, because AL carries the count of SSE registers with floating-point arguments when calling a variadic function!

hence, Glibc moves RAX to R11 after returning from the full resolver to the asm stub, restores RAX, uses R11 to make the final jump into the resolved function

@wolf480pl @rygorous @pervognsen @nick and, going off a tangent, one of the bugs I consider quite famous, is: there's a range of Glibc versions where, if you call a function that receives 512-bit vector arguments via PLT, their upper 256-bit halves are zeroed out on the first call

(because of course the old dynamic linker has no idea what AVX-512 even is, it just saves/restores the 256-bit YMM registers, and 256-bit loads don't merge into the 512-bit register, they zero out the high part)

@wolf480pl @rygorous @pervognsen @nick

(didn't happen with the even older dynamic linker that had no idea what AVX even is, because 128-bit loads _do_ merge into the 256-bit register)
@amonakov @wolf480pl @pervognsen @nick only if not VEX encoded!
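A toy Python model of the merge-vs-zero distinction being discussed, treating a 512-bit register as an int (widths and names are for illustration only):

```python
def load(reg, value, width, vex):
    """Toy model of how x86 vector loads treat the untouched upper
    bits of a wider architectural register: a legacy-SSE (non-VEX)
    128-bit load merges, i.e. preserves bits 128..511, while VEX/EVEX
    loads zero everything above the written width. Saving/restoring
    ZMM state through 256-bit VEX loads therefore clobbers the
    upper halves."""
    mask = (1 << width) - 1
    if vex:
        return value & mask                   # upper bits zeroed
    return (reg & ~mask) | (value & mask)     # upper bits preserved
```

That zeroing is exactly how a YMM-only save/restore path wipes the top halves of ZMM registers.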

@rygorous @wolf480pl @pervognsen @nick
thanks!

I should probably mention that history is not going to repeat itself if ZMM width is doubled, because the dynamic linker is using forward-compatible xsave instruction now, which dumps extended state on its own given a long-enough buffer

So if it breaks again, it will be in a new and exciting manner

@amonakov @wolf480pl @pervognsen @nick I think we're good on vector width for the next 1.5 decades at least, they pushed into 512b way earlier than it really made sense to. (Granted, which is a large part of why they then proceeded to not actually ship AVX-512 on most SKUs for the next 10 years after.)

The APX-induced GPR-breakage-maybe will hit this year, and I think our next big sorta-ABI-breaking thing is probably going to have to be cache line size.

@amonakov @wolf480pl @pervognsen @nick That one is more transparent than most but we're not gonna get double-cacheline-wide vector regs. Unaligned is one thing but that is just too gross to seriously contemplate
@amonakov @wolf480pl @pervognsen @nick honestly 512b still seems a bit questionable, and I was really surprised that Zen 5 actually went for all SIMD pipes 512b wide. By all indications, they can't really keep that fed, to no one's surprise.
@rygorous @amonakov Nova Lake's P-cores are supposed to have 512-bit SIMD data paths, right?

@pervognsen @amonakov don't know! They've announced so far that this time, for real, it's gonna support AVX10.2, but Alder Lake was gonna ship with AVX512 too, right up until it didn't.

At this point they've changed their minds so many times on this that I'll only believe it when it's on shelves.

I am quite excited about getting APX for integer code though, which they seem less likely to back out of. I hope?

@pervognsen Not least because that's enough regs to port some sweet tricks over that Oodle AArch64 versions have been shipping with for 5+ years. :)

@pervognsen Plus at this point the E-cores are real beasts as well.

They're well above Skylake class by now.

If you have random data/code access or need all the mem BW you can get, the P-cores are your best bet, but the E-cores are actually pretty good at code that is running loops and not just chewing through mem BW and branch history entries like they're going out of fashion

Skymont: Intel’s E-Cores reach for the Sky

The 2020s were a fun time for Intel’s Atom line, which went from an afterthought to playing a major role across Intel’s high performance client offerings.

Chips and Cheese
@pervognsen "old" site is probably better since you can actually click on the images and get a readable version https://old.chipsandcheese.com/2024/10/03/skymont-intels-e-cores-reach-for-the-sky/
@pervognsen anyway: 3x(32B/cycle, 3 insn/cycle) instruction decoding, 8-wide rename/dispatch, 416-entry ROB, ~280 entry int RF, ~280x128b entry FP RF, 4x 128b FP ALUs (including a 128b FMA in every FP port), 3 loads/cycle, 2 stores/cycle, 3 branches/cycle, 8 integer ALUs... yup, those all sound like classic teensy tiny small core numbers to me, nothing to see here!
@pervognsen just having a normal one with my normal width backend with 26 ports
@amonakov @wolf480pl @pervognsen @nick Intel's play of either 3x256b or 2x512b seemed sensible to me. AMD going all-in on 4x 512b ALU + 2x load/store is... either future-looking or foolhardy. :P