is this bullshit? or does ISA not really matter in some fictitious world where we can normalize for process and other factors?
https://www.techpowerup.com/340779/amd-claims-arm-isa-doesnt-offer-efficiency-advantage-over-x86
There are two different questions here: does the difference between AArch64 and x86-64 specifically matter, and can ISA choice affect efficiency at all?
The latter is trivially true. We have a load of examples of dead ends. Stack machines make extracting instruction-level parallelism really hard, so they lost completely to register machines.
Complex microcode makes out-of-order execution hard because you have to be able to serialise machine state on interrupt and decoded microops may have a bunch of state that isn't architectural. This one is quite interesting because it's a very sharp step change. Building a microcode engine that works is quite easy: serialise the pipeline, disable interrupts, run a bunch of microops, reenable interrupts. Building one that is efficient and allows multiple microcoded instructions to run in parallel is really hard, but if you do it then the complexity is amortised across a potentially large number of instructions. x86 chips took the first approach until fairly recently because there was a lot of lower-hanging fruit and microcoded instructions were rare. Having one instruction that requires complex microcode is absolutely the worst case.
Different ISAs favour different implementation choices. AArch32's choice to make the program counter architectural was great on simple pipelines, for example. It made PC-relative addressing trivial (just use PC as the base of any load or add) and made short relative jumps just an add to the PC. This became more annoying for more complex pipelines because you can't tell whether an instruction is a jump until you've done full decode (on most other RISCy ISAs, you can tell from the major opcode), which impacts where you do branch prediction. Similarly, all of the predication in AArch32 is great for avoiding using the branch predictor in common cases with simple pipelines, but you need that state anyway on big out-of-order machines. Thumb-2's if-then-else instruction provided a denser way of packing predication that scales nicely up to dual-issue in-order cores, but really hurts if you want to decode multiple instructions in parallel.
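A toy sketch of the decode problem this creates (my own illustration, not a real decoder): in AArch32, any data-processing instruction whose destination is r15 is effectively a branch, so you need full decode to spot it, whereas on most RISCy ISAs the major opcode alone identifies jumps.

```python
PC = 15  # r15 is the program counter in AArch32

def is_branch_aarch32(opcode: str, rd: int) -> bool:
    # Dedicated branches are branches...
    if opcode in ("b", "bl", "bx"):
        return True
    # ...but so is *any* data-processing op or load that writes r15.
    return rd == PC

def is_branch_riscy(opcode: str, rd: int) -> bool:
    # The major opcode is enough; the destination register is irrelevant.
    return opcode in ("jal", "jalr", "beq")

# "add pc, pc, r1" is a jump in AArch32 even though its opcode says ALU op.
assert is_branch_aarch32("add", rd=PC)
assert not is_branch_aarch32("add", rd=3)
assert not is_branch_riscy("add", rd=PC)  # writing the PC isn't even encodable
```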
The question of AArch64 vs x86-64 is much more interesting.
Register rename is the biggest single consumer of power and the bottleneck on a lot of very high-end implementations. Complex addressing modes really help reduce this overhead, but so do memory-register operations where you avoid needing to keep a rename register live for a value that's used only once.
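A back-of-the-envelope sketch of that effect (instruction sequences and register names invented for illustration): an x86-64 load with a complex addressing mode computes the address inside the AGU, while an equivalent load/store-RISC sequence materialises single-use temporaries that each need a rename register.

```python
# One x86-64 instruction: mov rax, [rbx + rcx*8 + 16]
x86_sequence = [
    ("mov", "rax", ["rbx", "rcx"]),   # address arithmetic stays in the AGU
]

# A RISC-style equivalent, in SSA form so each temporary is distinct.
riscy_sequence = [
    ("slli", "t0", ["rcx"]),          # t0 = rcx << 3
    ("add",  "t1", ["rbx", "t0"]),    # t1 = rbx + t0
    ("ld",   "a0", ["t1"]),           # a0 = [t1 + 16]
]

def single_use_temps(seq):
    """Count results that are produced and then consumed exactly once."""
    uses = {}
    for _, _, srcs in seq:
        for s in srcs:
            uses[s] = uses.get(s, 0) + 1
    return sum(1 for _, dst, _ in seq if uses.get(dst, 0) == 1)

assert single_use_temps(x86_sequence) == 0
assert single_use_temps(riscy_sequence) == 2  # t0 and t1 live only to feed the load
```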
At MS, we did a lot of work on dataflow architectures to try to avoid this. This was largely driven by two observations: register rename is the biggest single consumer of power, and most values are consumed exactly once, shortly after they're produced.
The theory was that, by encoding this directly in the ISA (input operands were implicit, output operands were the distance in executed instruction stream to the instruction that consumed the result) you'd be able to significantly reduce rename register pressure. Unfortunately, it turned out that speculative execution required you to do something that looked a lot like register rename for these values.
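A minimal sketch of the dataflow idea (my own toy encoding, not the actual MS ISA): an instruction names no input registers; it encodes only the forward distance to the instruction that consumes its result, so values flow directly to consumers rather than through a register file.

```python
def run(program, initial):
    # program: list of (operation, distance-to-consumer); distance 0 = final result
    inboxes = [[] for _ in program]
    inboxes[0] = list(initial)
    result = None
    for i, (op, dist) in enumerate(program):
        value = op(*inboxes[i])
        if dist == 0:
            result = value
        else:
            inboxes[i + dist].append(value)  # forwarded, never via a register file
    return result

# Compute (2 + 3) * 4: the add forwards its result one instruction ahead.
program = [
    (lambda a, b: a + b, 1),   # add -> consumed by the next instruction
    (lambda x: x * 4, 0),      # multiply -> final result
]
assert run(program, [2, 3]) == 20
```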
AArch64 intentionally tries to provide a useful common set of fused operations. x86-64 does it largely by accident, but there isn't a clear winner here.
The one big win that we found has not really made it into any instruction set, which continues to surprise me. A cheap way of marking a register as dead can massively improve performance. I've seen a 2x speedup on x86 from putting an xor rax, rax at the end of a tight loop because the pipeline was stalling having to keep all of the old rax values around in rename registers, even though no possible successor blocks used them. If I were designing a new ISA, for high-performance systems I'd be tempted to do one of the following:
Add an instruction that marks a set of registers as dead.
Put an extra bit in each source operand to mark it as a kill.
The latter hurts density, but would probably be a bigger win because it would let you rewrite a load of operations from allocating rename registers to using forwarding in the front end.
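A hypothetical sketch of why the kill bit helps (the model and trace format are invented): a rename register stays live until the hardware can prove its value is dead. With a kill bit on the last use it can be freed immediately; without one, it is held until the architectural register is overwritten, or forever in a loop that never rewrites it.

```python
def peak_live(trace, honour_kill_bits):
    """trace: list of (dest, [(src, is_last_use), ...]).
    Returns the peak number of rename registers live at once."""
    live, peak = set(), 0
    for dest, srcs in trace:
        for src, killed in srcs:
            if honour_kill_bits and killed:
                live.discard(src)        # freed at the last use
        live.discard(dest)               # old mapping of dest dies on overwrite
        live.add(dest)                   # new physical register for dest
        peak = max(peak, len(live))
    return peak

# A tight loop body: each value is used exactly once and never overwritten.
trace = [(f"v{i}", [(f"v{i-1}", True)] if i else []) for i in range(8)]
assert peak_live(trace, honour_kill_bits=True) == 1   # freed as we go
assert peak_live(trace, honour_kill_bits=False) == 8  # everything held live
```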
The other bottleneck is parallel decode. A64 ditched the variable-length encoding that T32 introduced, because fixed-width instructions make orthogonal decoding trivial: fetch 16 bytes, decode four instructions. Apple's implementations make a lot of use of this and also have a nice sideways forwarding path that allows values produced by one instruction to be forwarded directly to a consumer in the same bundle without going via register rename (if the value isn't clobbered, they still need to allocate a rename register).
x86-64 is staggeringly bad here. As Stephen Dolan quipped, x86 chips don't have an instruction decoder, they have an instruction parser. Instructions can be very long, or as short as a single byte. Mostly this doesn't matter on modern x86-64 chips because 90% of dynamic execution is in loops, and modern x86-64 chips do at least some caching of decode results in loops (at the very least caching the locations of instruction boundaries, often caching decoded micro-ops).
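The contrast can be sketched in a toy model (the length-prefix scheme below is made up and far simpler than real x86, which needs prefix/opcode/ModRM parsing): with fixed 4-byte instructions the boundaries of N instructions are pure arithmetic, known before looking at a single byte, so decoders can work in parallel; with variable lengths, instruction i's start isn't known until instructions 0..i-1 have been at least partially parsed.

```python
def fixed_width_boundaries(buf, width=4):
    # Trivially parallel: no data dependence between boundaries.
    return list(range(0, len(buf) - width + 1, width))

def variable_length_boundaries(buf):
    # Inherently serial: each boundary depends on decoding the previous one.
    starts, i = [], 0
    while i < len(buf):
        starts.append(i)
        i += buf[i]  # toy length prefix in the first byte
    return starts

assert fixed_width_boundaries(bytes(16)) == [0, 4, 8, 12]
assert variable_length_boundaries(bytes([1, 3, 0, 0, 2, 0])) == [0, 1, 4]
```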
And this is where it gets really interesting. The extra decode steps and the extra caches add area, but they also save instruction cache by providing a dense instruction encoding (x86-64 isn't perfect there, it's not that close to a Huffman encoding over common instruction sequences, but it's a moderately good approximation). Is that a good tradeoff? It almost certainly varies between processes (the relative power and area costs of SRAM vs logic vary a surprising amount).
The thing that I find really surprising is the memory model. Arm's weak memory model is supposed to make high-performance implementations easier by enabling more reorderings in the core, whereas x86's TSO is far more constrained. Apple's processors have a mode that (as I understand it) decodes loads and stores into something with the same semantics as the load-acquire / store-release instructions but with the normal addressing modes. It doesn't seem to hurt performance: it's only used by Rosetta 2, so it's hard to do an exact comparison, but x86-64 emulation is ludicrously fast on these machines, so it isn't hurting that much. I'd love to see some benchmarks enabling it by default on everything.
Doing any kind of apples-to-apples comparison is really hard because things in the architecture force implementation choices. You would not implement an x86-64 and an AArch64 core in exactly the same way. There was a paper at ISCA in 2015 that I absolutely hated (not least because it was used to justify a load of bad design decisions in RISC-V) that claimed it did this, but when you looked at their methodology they'd simulated cores that were not at all how you would implement the ISAs that they were discussing.
See also: How to design an ISA.
It's not a separate instruction, but x86 does document that XORing a register with itself is a fast way to clear it, and implementations internally point that register at a canonical zero rename register, so it doesn't consume a rename register. The thing I want is either a way of doing it in any instruction with no extra instructions, or a way of zeroing a set of registers in a single instruction.
You could probably also do it by making a handful of registers explicitly temporary registers, so that reading them implicitly zeroed them. I'm not sure how many you'd want though. On a system with 32 registers, four would probably be enough.
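A sketch of the temporary-register idea (register names and the four-register count are just the guess above): reading one of a few designated registers implicitly zeroes it, so its previous value is dead by construction and the rename logic never has to keep it around.

```python
class RegFile:
    TEMPS = {"t0", "t1", "t2", "t3"}  # four read-zeroing temporaries

    def __init__(self):
        self.regs = {}

    def write(self, name, value):
        self.regs[name] = value

    def read(self, name):
        value = self.regs.get(name, 0)
        if name in self.TEMPS:
            self.regs[name] = 0  # the read is also the kill: no re-reads
        return value

rf = RegFile()
rf.write("t0", 42)
rf.write("r5", 42)
assert rf.read("t0") == 42 and rf.read("t0") == 0   # second read sees zero
assert rf.read("r5") == 42 and rf.read("r5") == 42  # normal registers persist
```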
Oh, great! Yes, that's what I want. And Agner is the person I'd expect to understand the value of it.
And, from a business perspective, x86 has a lot of advantages for AMD:
They can add new instructions to differentiate their products. Arm no longer gives out licenses that allow you to add custom instructions (blame Intel for that. Wireless MMX ruined it for everyone).
They own a load of patents required to implement x86 and have a cross-licensing agreement with Intel (and Via). That's a market where no one else can enter easily, whereas Arm will sell architecture licenses to anyone with a big pile of money and the patience to sit through the nonsense their lawyers insist on for a couple of years.
The Arm ecosystem is growing, but there's a lot more x86-only software than there is Arm-only software. Android and iOS are the only ecosystems where AArch64-only software is common and AMD doesn't have any parts in the right power envelope for that market currently.
Apple is not an AMD competitor. Apple isn't going to say 'oh, these chips are 20% faster, let's buy from AMD instead of using our own designs' even if AMD did manage this, and Apple isn't selling their chips to other people. And instruction set is not a dominant factor for most people. People don't buy Apple laptops because of AArch64 and they don't buy Dell because of x86-64; they buy them for the ability to run the software that they care about first, and for performance and battery life second. In the Arm world, AMD would be competing against a load of companies with products from the tiny to the huge. In the x86 market, they're competing against Via, who are MIA, and Intel, who are committing the world's slowest corporate suicide.
Thanks! That's very interesting, and supports Arm's design choice. A 9% speedup from an ISA choice is a big win. And, given how much effort Apple went to for x86 emulation, it's probably the closest comparison that's possible.
Isn't this one of those questions where theory comes up short and the answer has to come from practice?
Presumably AMD, Intel, and Apple did the best they could, but as far as I can tell, Apple's Mx CPUs clearly, often dramatically, use less power than AMD's and Intel's.
Isn't that the proof of the pudding, so to speak?
Apple had three choices: keep buying x86 chips, design their own x86 cores, or design their own AArch64 cores.
Doing x86 was never on the cards, so we don’t know if they would have been able to do something similar in terms of performance with x86. They’d already invested a lot in the toolchain for AArch64 and their kernel supported it, so that probably made it an easy choice.
A lot of the performance from their cores comes from controlling the whole system. They learned the lesson from the G4, which easily outperformed an Intel CPU on workloads that fitted in cache but normally spent most of its time waiting for memory. At each price-performance point, they know the exact amount of memory, the exact timings of that memory, and can scale the number of memory controllers, the design of the memory controllers, depth and width of store queues, and so on quite precisely to avoid bottlenecks. Intel and AMD can’t do this because a single CPU SKU has to end up in a hundred different laptop models, paired with arbitrary third-party-specified amounts and speeds of memory.
We had a similar choice when I was at Microsoft. We also considered PowerPC. For Xbox 360, MS had already done ports of Windows to PowerPC and Visual Studio could target it. And since the VirtualPC acquisition (and used for backwards compat on the 360), MS even owned an x86-on-PowerPC emulator. This was largely discounted because the ISA is ‘open source’ according to IBM press releases but even MS lawyers couldn’t get IBM to explain what that meant (what is the license? Does it cover implementations with custom extensions? Does it include IBM patents necessary for implementation?). We also looked at RISC-V but discarded it because it’s a terrible ISA.
The choices for us were either:
AArch64, which had a mature ecosystem (including a Windows port), but which didn’t allow custom extensions.
An in-house ISA. The custom ISA that we designed was really nice. We learned a load of lessons from AArch64 (both in terms of what worked well and what didn’t). The perf folks estimated that we could, everything else being equal, get a modest performance improvement relative to AArch64 (under 20%, I can’t remember the exact number). But we’d need to spend several billion on the software ecosystem. Importantly, being even 20% faster than an in-house core that we didn’t build was not actually the relevant metric, we needed to be 20% faster than competitors’ cores and going with AArch64 left us able to license third-party cores easily if the in-house efforts didn’t work (and, for various reasons, that was the path they took in the end).
My main lessons from this project were:
The fact that x86 was "never on the cards" was probably because Apple had many years' experience with the self-proclaimed champion in that space not producing the CPU Apple wanted?
No, it was that x86 is a patent minefield, with a bunch of patents on things necessary to implement various bits of the ISA. Intel, AMD, and Via have a cross-licensing agreement that covers them, but it basically means no new x86 implementers. Unless you want to stick to a 20-year-old version of x86.
@david_chisnall @regehr
> Put an extra bit in each source operand to mark it as a kill.
Fun fact: this was a feature of the very first ISA (https://h14s.p5r.org/2012/11/analytical-programming.html). Now computer scientists don't read old papers, do they? :)
This is my third post about Babbage’s calculating engines. The first two were about the difference engine: why it was important at the time and how it worked. This post is about the analytical engine. The analytical engine is famously the first programmable computing machine, and there was much programming involved both in designing and operating it. In this post I’ll take a look at the various ways you could program the engine and the way programs were used to control the engine internally.
No, it’s very sad. Doug Burger gave a keynote at ISCA where he talked about the E2 architecture. There were two follow-on ISAs that improved on that, and then a RISC ISA. I think most of the folks who worked on that project have left MS (Microsoft’s senior leadership does not create an environment conducive to people wanting to work on projects with a multi-year time to market), so that expertise is now scattered across the industry, and it’s probably too hard to collect enough people to write up some decent publications. Midori (a clean-slate OS written in .NET) suffered a similar fate, where the only publications relate to an earlier research prototype.
Yup. Joe’s blogs are about the only public info about Midori, which was the result of well over a hundred person-years of effort at MS before being cancelled.