is this bullshit? or does ISA not really matter in some fictitious world where we can normalize for process and other factors?

https://www.techpowerup.com/340779/amd-claims-arm-isa-doesnt-offer-efficiency-advantage-over-x86

@regehr

There are two different questions here:

  • Is AArch64 a better ISA for modern microarchitectures than x86-64?
  • Does the architecture limit the performance of an implementation?

The latter is trivially true. We have a load of examples of dead ends. Stack machines make extracting instruction-level parallelism really hard, so they lost completely to register machines.

Complex microcode makes out-of-order execution hard because you have to be able to serialise machine state on interrupt and decoded microops may have a bunch of state that isn't architectural. This one is quite interesting because it's a very sharp step change. Building a microcode engine that works is quite easy: serialise the pipeline, disable interrupts, run a bunch of microops, reenable interrupts. Building one that is efficient and allows multiple microcoded instructions to run in parallel is really hard, but if you do it then the complexity is amortised across a potentially large number of instructions. x86 chips took the first approach until fairly recently because there was a lot of lower-hanging fruit and microcoded instructions were rare. Having one instruction that requires complex microcode is absolutely the worst case.

Different ISAs favour different implementation choices. AArch32's choice to make the program counter architectural was great on simple pipelines, for example. It made PC-relative addressing trivial (just use PC as the base of any load or add) and made short relative jumps just an add to the PC. This became more annoying for more complex pipelines because you can't tell whether an instruction is a jump until you've done full decode (on most other RISCy ISAs, you can tell from the major opcode), which impacts where you do branch prediction. Similarly, all of the predication in AArch32 is great for avoiding using the branch predictor in common cases with simple pipelines, but you need that state anyway on big out-of-order machines. Thumb-2's if-then-else instruction provided a denser way of packing predication that scales nicely up to dual-issue in-order cores, but really hurts if you want to decode multiple instructions in parallel.

The question of AArch64 vs x86-64 is much more interesting.

Register rename is the biggest single consumer of power and the bottleneck on a lot of very high-end implementations. Complex addressing modes really help reduce this overhead, but so do memory-register operations where you avoid needing to keep a rename register live for a value that's used only once.

At MS, we did a lot of work on dataflow architectures to try to avoid this. This was largely driven by two observations:

  • Around 2/3 of values are used exactly once.
  • Around 2/3 of values (not the same 2/3, but an overlapping set) are used only within the basic block where they are created.

The theory was that, by encoding this directly in the ISA (input operands were implicit, output operands were the distance in executed instruction stream to the instruction that consumed the result) you'd be able to significantly reduce rename register pressure. Unfortunately, it turned out that speculative execution required you to do something that looked a lot like register rename for these values.
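The encoding idea is easier to see in a toy interpreter. This is my own sketch, not the actual MS design (the operand layout and opcode names here are invented for illustration): each instruction names where its result goes, as a distance in the executed instruction stream to the consumer, instead of naming a destination register.

```python
# Toy dataflow-style encoding (invented for illustration): instructions
# have no destination registers; each result is forwarded directly to the
# operand slots of later instructions, named by distance.

def run(program):
    """program: list of (op, targets); each target is a (distance, slot)
    pair saying which later instruction consumes this result."""
    slots = [[None, None] for _ in program]   # operand slots, no registers
    result = None
    for i, (op, targets) in enumerate(program):
        a, b = slots[i]
        if op == "const":
            result, targets = targets[0], targets[1:]   # literal in field 0
        elif op == "add":
            result = a + b
        elif op == "mul":
            result = a * b
        for dist, slot in targets:
            slots[i + dist][slot] = result   # forward straight to consumer
    return result

# Computes (2 + 3) * 4 without naming a single register.
prog = [
    ("const", (2, (2, 0))),   # literal 2 -> operand 0 of instruction 2
    ("const", (3, (1, 1))),   # literal 3 -> operand 1 of instruction 2
    ("add",   ((2, 0),)),     # 2 + 3    -> operand 0 of instruction 4
    ("const", (4, (1, 1))),   # literal 4 -> operand 1 of instruction 4
    ("mul",   ()),            # (2 + 3) * 4
]
```

Every value here is produced, consumed once, and gone — which is exactly why speculation forces you to keep something rename-like around anyway: a mispredicted branch means un-delivering forwarded values.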

AArch64 intentionally tries to provide a useful common set of fused operations. x86-64 does it largely by accident, but there isn't a clear winner here.

The one big win that we found has not really made it into any instruction set, which continues to surprise me. A cheap way of marking a register as dead can massively improve performance. I've seen a 2x speedup on x86 from putting an xor rax, rax at the end of a tight loop because the pipeline was stalling having to keep all of the old rax values around in rename registers, even though no possible successor blocks used them. If I were designing a new ISA, for high-performance systems I'd be tempted to do one of the following:

  • Have a short instruction with a bitmap of dead registers that compilers could insert at the end of a basic block, on the back arc of a loop, to mark multiple registers as dead.
  • Put an extra bit in each source operand to mark it as a kill.

The latter hurts density, but would probably be a bigger win because it would let you rewrite a load of operations from allocating rename registers to using forwarding in the front end.
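A back-of-envelope model of why kill information matters. This accounting is my own toy illustration, not any shipped rename design: without kill bits, a value stays pinned until its architectural register is overwritten; with them, the last reader frees it immediately.

```python
# Toy rename-pressure model (my own illustration): count peak live values
# with and without per-operand kill bits.

def peak_live(trace, kill_bits):
    """trace: list of (dest, srcs, kills); kills flags last-use sources."""
    live = set()
    peak = 0
    for dest, srcs, kills in trace:
        if kill_bits:
            for s, k in zip(srcs, kills):
                if k:
                    live.discard(s)   # kill bit: freed at last use
        live.add(dest)                # writing (re)allocates
        peak = max(peak, len(live))
    return peak

# r1 and r2 feed r3 once and are then dead, but nothing ever overwrites
# them, so without kill bits they stay pinned in rename registers.
trace = [
    ("r1", (), ()),
    ("r2", (), ()),
    ("r3", ("r1", "r2"), (True, True)),
]
```

Here `peak_live(trace, kill_bits=False)` is 3 and `peak_live(trace, kill_bits=True)` is 2; scale the pattern up to a tight loop's worth of once-used values and the gap is what the xor trick above was buying back.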

The other bottleneck is parallel decode. A64 ditched the variable-length encoding that T32 introduced, because fixed-length instructions make orthogonal decoding trivial: fetch 16 bytes, decode four instructions. Apple's implementations make a lot of use of this and also have a nice sideways forwarding path that allows values produced in one instruction to be directly forwarded to a consumer in the same bundle without going via register rename (if the value isn't clobbered, they still need to allocate a rename register).

x86-64 is staggeringly bad here. As Stephen Dolan quipped, x86 chips don't have an instruction decoder, they have an instruction parser. Instructions can be very long, or as short as a single byte. Mostly this doesn't matter on modern x86-64 chips because 90% of dynamic execution is in loops and modern x86-64 chips do at least some caching of decoded things (at the very least, caching the locations of instruction boundaries, often caching of decoded micro-ops) in loops.
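The decode-versus-parse contrast can be caricatured in a few lines. These are toy byte layouts, nothing like the real encodings:

```python
import struct

# Fixed 32-bit instructions: boundaries are known before decoding starts,
# so a 16-byte fetch splits into four independent decode lanes.
def decode_bundle_a64(fetch16: bytes):
    return list(struct.unpack("<4I", fetch16))   # all four lanes in parallel

# Caricature of variable-length decode: each instruction's length depends
# on its own bytes, so lane N can't start until lanes 0..N-1 are measured.
def find_boundaries_x86ish(code: bytes, count: int):
    offsets, pc = [], 0
    for _ in range(count):
        offsets.append(pc)
        pc += 1 + (code[pc] & 0x0F)   # length is data-dependent: serial!
    return offsets
```

The serial length-finding loop is what the micro-op and boundary caches exist to amortise: pay the parsing cost once per loop body, not once per iteration.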

And this is where it gets really interesting. The extra decode steps and the extra caches add area, but they also save instruction cache by providing a dense instruction encoding (x86-64 isn't perfect there, it's not that close to a Huffman encoding over common instruction sequences, but it's a moderately good approximation). Is that a good tradeoff? It almost certainly varies between process nodes (the relative power and area costs of SRAM vs logic vary a surprising amount).

The thing that I find really surprising is the memory model. Arm's weak memory model is supposed to make high-performance implementations easier by enabling more reorderings in the core, whereas x86's TSO is far more constrained. Apple's processors have a mode that (as I understand it) decodes loads and stores into something with the same semantics as the load-acquire / store-release instructions but with the normal addressing modes. It doesn't seem to hurt performance (it's only used by Rosetta 2, so it's hard to do an exact comparison, but x86-64 emulation is ludicrously fast on these machines, so it isn't hurting that much. I'd love to see some benchmarks enabling it by default on everything).

Doing any kind of apples-to-apples comparison is really hard because things in the architecture force implementation choices. You would not implement an x86-64 and an AArch64 core in exactly the same way. There was a paper at ISCA in 2015 that I absolutely hated (not least because it was used to justify a load of bad design decisions in RISC-V) that claimed it did this, but when you looked at their methodology they'd simulated cores that were not at all how you would implement the ISAs that they were discussing.

@david_chisnall this was fascinating, thanks for writing it. i rarely get to read about how ISA design affects uarch design :)
@david_chisnall @regehr While not taped out as a silicon chip, the ForwardCom ISA by Agner Fog has a "clear" instruction to mark (vector) registers unused. Agner is known as the x86 optimization guru, so it is natural that Agner's ISA has that instruction.

@omasanori @regehr

It's not a separate instruction, but x86 does document that xoring a register with itself is a fast clear, and implementations internally point that register at a canonical zero rename register, so it doesn't consume a rename register. The thing I want is either a way of doing it in any instruction with no extra instructions, or a way of zeroing a set of registers in a single instruction.

You could probably also do it by making a handful of registers explicitly temporary registers, so that reading them implicitly zeroed them. I'm not sure how many you'd want though. On a system with 32 registers, four would probably be enough.
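The rename-stage special case for the zeroing idiom can be sketched like so. This is a toy model of the documented idiom, not any real core's rename logic:

```python
# Toy rename stage (my own sketch): "xor r, r" is recognised as an idiom
# and pointed at a shared canonical zero instead of allocating.
ZERO = "p0"   # canonical always-zero physical register

def rename_one(rat, free_list, instr):
    """rat: architectural -> physical register map."""
    op, dest, srcs = instr
    if op == "xor" and srcs == (dest, dest):
        rat[dest] = ZERO              # idiom: no allocation, value is zero
    else:
        rat[dest] = free_list.pop()   # normal path consumes a physical reg
    return rat[dest]

rat, free = {}, ["p3", "p2", "p1"]
rename_one(rat, free, ("xor", "rax", ("rax", "rax")))   # free list untouched
```

The win I'm after is the same mechanism triggered by a kill bit or a temporary-register read, so the free-list slot comes back without spending an instruction on it.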

@david_chisnall @regehr I'd point out that ForwardCom's clear instruction can clear multiple registers at once but my sentence was unclear on that point, sorry.

@omasanori @regehr

Oh, great! Yes, that's what I want. And Agner is the person I'd expect to understand the value of it.

@david_chisnall @omasanori @regehr I like the explicit temporary registers concept. Very cute.
@regehr @david_chisnall Yeah, the idea that ISA doesn’t matter at all makes no technical sense. It’s very understandable why AMD has decided to stick with x86, but it is fundamentally a business decision.
@regehr @david_chisnall Is there anything wrong with the set of abstract instructions that x86 provides? No, it’s basically fine and obviously possible to implement with high efficiency. But the encoding for those instructions is really bad, and while that can be overcome, that comes with real overheads and restrictions. The same instruction set with a different encoding would be more efficiently implementable, without question.
@regehr @david_chisnall There are several major decisions that cannot reasonably be considered trade-offs and simply make the encoding worse. We can debate the value of variable-length encodings, but if you’re going to use one, you should at least reap the benefits of density; x86-64 gets some of that, but it throws a lot of it away on things like REX prefixes

@rjmccall @regehr

And, from a business perspective, x86 has a lot of advantages for AMD:

They can add new instructions to differentiate their products. Arm no longer gives out licenses that allow you to add custom instructions (blame Intel for that. Wireless MMX ruined it for everyone).

They own a load of patents required to implement x86 and have a cross-licensing agreement with Intel (and Via). That's a market where no one else can enter easily, whereas Arm will sell architecture licenses to anyone with a big pile of money and the patience to sit through the nonsense their lawyers insist on for a couple of years.

The Arm ecosystem is growing, but there's a lot more x86-only software than there is Arm-only software. Android and iOS are the only ecosystems where AArch64-only software is common and AMD doesn't have any parts in the right power envelope for that market currently.

Apple is not an AMD competitor. Apple isn't going to say 'oh, these chips are 20% faster, let's buy from AMD instead of using our own designs' even if AMD did manage this, and Apple isn't selling their chips to other people. And instruction set is not a dominant factor for most people. People don't buy Apple laptops because of AArch64 and they don't buy Dell because of x86-64; they buy them first for the ability to run the software that they care about, and second for performance and battery life. In the Arm world, AMD would be competing against a load of companies with products from the tiny to the huge. In the x86 market, they're competing against Via, who are MIA, and Intel, who are committing the world's slowest corporate suicide.

@rjmccall @regehr

'Ruining it for everyone' is actually Intel's unofficial motto.

@david_chisnall might you have a reference/title for that ISCA'15 paper?
https://dl.acm.org/doi/proceedings/10.1145/2749469
Didn't seem to have one that sounded relevant. Thanks.
@david_chisnall @regehr https://www.sciencedirect.com/science/article/pii/S1383762124000390 measures the TSO impact on M1 Ultra using Linux (where TSO is switchable per process using the @AsahiLinux downstream kernel). On the measured parallel floating point benchmarks from SPEC 2017 TSO decreases the score by ~9% on average

@janne @regehr @AsahiLinux

Thanks! That's very interesting, and supports Arm's design choice. A 9% speedup from an ISA choice is a big win. And, given how much effort Apple went to for x86 emulation, it's probably the closest comparison that's possible.

@david_chisnall @regehr @AsahiLinux probably the best comparison we'll ever get, but I'd say it is a worst-case estimate. While x86 emulation performance was important to Apple, I doubt maximal perf under TSO was a design criterion, especially when it hurts non-TSO performance.
A TSO-only version of the same design would probably take a smaller perf hit. It still has to be significant, though, otherwise the switchable behaviour could have been avoided and the M1 would have been a TSO arm64 design like Fujitsu's A64FX and Nvidia's Carmel/Denver.
@david_chisnall @regehr re tso, i just saw this paper that runs a bunch of benchmarks on Apple M1 with and without tso and concludes it’s 9% slower https://www.sciencedirect.com/science/article/pii/S1383762124000390

@david_chisnall @regehr

Isn't this one of those questions where theory comes up short and the answer has to come from practice?

Presumably AMD, Intel and Apple did the best they could, but as far as I can tell, Apple's Mx CPUs clearly, often dramatically, use less power than AMD's and Intel's?

Isn't that a proof of the pudding, so to speak?

@bsdphk @regehr

Apple had three choices:

  • AArch64 (they also had a very nice ‘we founded ARM’ architecture license).
  • RISC-V
  • Something in house.

Doing x86 was never on the cards, so we don’t know if they would have been able to do something similar in terms of performance with x86. They’d already invested a lot in the toolchain for AArch64 and their kernel supported it, so that probably made it an easy choice.

A lot of the performance from their cores comes from controlling the whole system. They learned the lesson from the G4, which easily outperformed an Intel CPU on workloads that fitted in cache but normally spent most of its time waiting for memory. At each price-performance point, they know the exact amount of memory, the exact timings of that memory, and can scale the number of memory controllers, the design of the memory controllers, depth and width of store queues, and so on quite precisely to avoid bottlenecks. Intel and AMD can’t do this because a single CPU SKU has to end up in a hundred different laptop models, paired with arbitrary third-party-specified amounts and speeds of memory.

We had a similar choice when I was at Microsoft. We also considered PowerPC. For Xbox 360, MS had already done ports of Windows to PowerPC and Visual Studio could target it. And since the VirtualPC acquisition (and used for backwards compat on the 360), MS even owned an x86-on-PowerPC emulator. This was largely discounted because the ISA is ‘open source’ according to IBM press releases but even MS lawyers couldn’t get IBM to explain what that meant (what is the license? Does it cover implementations with custom extensions? Does it include IBM patents necessary for implementation?). We also looked at RISC-V but discarded it because it’s a terrible ISA.

The choices for us were either:

  • AArch64, which had a mature ecosystem (including a Windows port), but which didn’t allow custom extensions.
  • An in-house ISA. The custom ISA that we designed was really nice. We learned a load of lessons from AArch64 (both in terms of what worked well and what didn’t). The perf folks estimated that we could, everything else being equal, get a modest performance improvement relative to AArch64 (under 20%, I can’t remember the exact number). But we’d need to spend several billion on the software ecosystem. Importantly, being even 20% faster than an in-house core that we didn’t build was not actually the relevant metric: we needed to be 20% faster than competitors’ cores, and going with AArch64 left us able to license third-party cores easily if the in-house efforts didn’t work (and, for various reasons, that was the path they took in the end).

My main lessons from this project were:

  • A really good ISA is no more than a few percent better than a quite good one, and a mediocre one is probably not more than 10% worse than a quite good one, all other things being equal.
  • ISA design is really hard and no one teaches it (that’s why I wrote the CACM article).
  • There are a lot more non-core places to win/lose performance than places in the core. Working on high-performance CHERI designs, there were a lot of places where the performance impact was 2-3% overall and a few where it was larger. In contrast, people working on the core are really happy with things that give a 1% speedup and often work on things that give less (you do a hundred things that each give a 0.5% speedup and now you’ve got a nice core).
  • Having complete control over your memory hierarchy is an enormous benefit.

@david_chisnall @regehr

The fact that x86 was "never on the cards" was probably because Apple had many years experience with the self proclaimed champion in that space not producing the CPU Apple wanted ?

@bsdphk @regehr

No, it was that x86 is a patent minefield, with a bunch of patents on things necessary to implement various bits of the ISA. Intel, AMD, and Via have a cross-licensing agreement that covers them, but it basically means no new x86 implementers. Unless you want to stick to a 20-year-old version of x86.

@david_chisnall @regehr
> Put an extra bit in each source operand to mark it as a kill.

Fun fact, this was a feature of the very first ISA (https://h14s.p5r.org/2012/11/analytical-programming.html). Now computer scientists don't read old papers do they? :)

@david_chisnall @regehr I wonder what's going on in E2k land these days. Just a decade ago they didn't have register rename at all. The heart of the design was a large (64 or 128) multiport (4 minimum) register file. They claimed that performance was competitive against x86 made on a similar process (although sadly not world-beating).
@pro @regehr
No idea, I’ve heard various interesting things about their architecture but never any detailed documentation in English. Without register rename, you can’t speculatively execute past a read of a speculatively written register, which limits what you can do quite a lot. Lots of architectural registers may make that less frequent but you’re still going to hit places where the ABI requires reuse (e.g. argument registers).
@david_chisnall @regehr this is the kind of well-informed deep-dive I love to see on here, thanks for taking the time to write it.
@david_chisnall Is your work on dataflow architectures published somewhere?

@aedancullen

No, it’s very sad. Doug Burger gave a keynote at ISCA where he talked about the E2 architecture. There were two follow-on ISAs that improved on that, and then a RISC ISA. I think most of the folks who worked on that project have left MS (Microsoft’s senior leadership does not create an environment conducive to people wanting to work on projects with a multi-year time to market), so that expertise is now scattered across the industry, and it’s probably too hard to collect enough people to write up some decent publications. Midori (clean-slate OS written in .NET) suffered a similar fate, where the only publications relate to an earlier research prototype.

@david_chisnall @aedancullen there's some non-academic stuff about midori from joe duffy which I think includes the later stages of it? https://joeduffyblog.com/2015/11/03/blogging-about-midori/

Of course it will be far from comprehensive, but still plenty of interesting details (I really like the error model part of it)

@ignaloidas @aedancullen

Yup. Joe’s blogs are about the only public info about Midori, which was the result of well over a hundred person-years of effort at MS before being cancelled.