@cstross @oddhack @laird
Wow, it’s been a long time since I read that paper. The i860 was designed as a general-purpose processor and, as the name hints, as a successor to the x86 line. I remember seeing a load of ads for it in BYTE as the next big thing. It didn’t succeed there, but did end up in a load of graphics accelerators.
I think that undersells NVIDIA somewhat though. It turns out that 2D graphics has quite a lot in common with other common compute tasks. A lot of the performance-critical part is simply memcpy (or BitBLT, as Dan Ingalls called it), maybe with some blending, and the maximum number of sprite pixels you’ll want to composite is a linear function of the number of destination pixels (you don’t need to bother drawing things that are completely occluded, and calculating occlusion for 2D scenes is not hard). You can make a 2D accelerator that’s faster than a generic RISC chip. Most 2D vector graphics operations on bitmaps or geometry (skew, scale, rotate, and so on) can be encoded as multiplies of 3x1 vectors by a 3x3 matrix, but specialising a chip for 3x3-by-3x1 matrix multiplication is a load of work. And your colour blending is RGBA, so that needs other dedicated hardware. At the same time, you want fast scalar floating point on your RISC chip for other things, and you can easily keep up with display resolutions rendering PostScript-style graphics.
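For anyone who hasn’t seen the trick: a whole family of 2D transforms collapses into one 3x3 matrix applied to points in homogeneous coordinates. A minimal sketch in C (the struct names and the scale-then-translate example are mine, not anything a real accelerator exposed):

```c
/* A 2D point in homogeneous coordinates: (x, y, 1). */
typedef struct { float x, y, w; } Vec3;

/* Row-major 3x3 matrix. */
typedef struct { float m[3][3]; } Mat3;

/* One 3x3-by-3x1 multiply covers translate, scale, rotate,
 * and skew -- this is the operation a 2D chip would specialise. */
static Vec3 mat3_apply(Mat3 a, Vec3 v) {
    Vec3 r;
    r.x = a.m[0][0]*v.x + a.m[0][1]*v.y + a.m[0][2]*v.w;
    r.y = a.m[1][0]*v.x + a.m[1][1]*v.y + a.m[1][2]*v.w;
    r.w = a.m[2][0]*v.x + a.m[2][1]*v.y + a.m[2][2]*v.w;
    return r;
}

/* Example: scale by 2, then translate by (10, 5), as one matrix. */
static Mat3 scale_then_translate(void) {
    Mat3 a = {{{2, 0, 10},
               {0, 2,  5},
               {0, 0,  1}}};
    return a;
}
```

The point of the homogeneous third coordinate is that translation (which is not linear in 2D) becomes just another column of the same matrix, so composed transforms stay a single multiply per point.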
In the ‘90s, ‘Windows Accelerators’ were common. They let GUIs offload window drawing (lines, rectangles, simple sprite blitting) to the graphics card. They were mostly a speedup because PCs often didn’t have FPUs back then.
A lot of RISC chips were, in spite of the hype, not actually very good. The i860, on paper, was around double the performance of a 486 (and used fewer transistors!), but actually compiling code to target it was very hard and in real-world performance it was typically slower. Bolting a decent FPU on the side of a RISC chip was quite easy (especially when it didn’t need to handle baroque things like x87’s 80-bit floating point representation and the bizarre ‘do this operation as binary floating point but then apply a correction so that the error is what it would have been if you’d done it as binary-coded decimal’ instructions that x86 chips needed).
It wasn’t that RISC chips were good at graphics so much as that RISC chips were cheap, and the places where they sacrificed performance didn’t matter for 2D graphics, so they were a lot cheaper per unit of performance than doing the same work on the main CPU (even when that CPU was another RISC chip that had learned a few more lessons and had a more balanced performance profile).
3D is different in a bunch of ways. The data in a 3D scene is either XYZW vectors for geometry or RGBA vectors for colours. And a lot of the primitive operations you do on colours are the same as the ones you do on vertices. You can have a common data representation for both colour and geometry, which means you can spend more effort on vector operations on this one data type.
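A minimal sketch of what the shared representation buys you (the names are illustrative, not any real GPU’s API): one 4-wide type and the same component-wise operations serve both positions and colours, so the hardware only needs one kind of ALU.

```c
/* One 4-wide vector type serves both geometry (x,y,z,w)
 * and colour (r,g,b,a). */
typedef struct { float v[4]; } Vec4;

/* Component-wise multiply: modulates a colour by a light
 * colour, or scales a vertex per-axis -- same operation. */
static Vec4 vec4_mul(Vec4 a, Vec4 b) {
    Vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* Linear interpolation: blends two colours (compositing) or
 * two positions (vertex morphing) with identical code. */
static Vec4 vec4_lerp(Vec4 a, Vec4 b, float t) {
    Vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + t * (b.v[i] - a.v[i]);
    return r;
}
```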
CPUs also got 4-way vector units at around this time. AMD called theirs 3DNow! because they expected it to enable fast software 3D. Quake 2 had a mode to use 3DNow! and it was a bit faster than the default software renderer. And it still looked much worse than the OpenGL version.
Because the second thing about 3D graphics is that it’s embarrassingly parallel. You want to do the same thing to every vertex in the scene. And the same thing to every texel on a triangle. This is also true of 2D, but with 2D you hit diminishing returns way earlier. Humans need around 25fps to perceive smooth motion. For dynamically rendered scenes there will be a little bit of jitter from slight variations in the underlying motion, and so 60fps is a nice place to aim for. If you can render a 2D scene in 10ms, you’re well ahead of what you need. Adding parallelism to render it in 1ms provides no benefit. But with 3D, the data is much bigger. Even the first NVIDIA cards handled scenes with millions of triangles. You simply wouldn’t build a 2D scene that big. And, even then, you’d see pixelation if you walked up to a wall because it was a low-resolution texture that looked fine at a distance, but putting a texture in video memory that was as big as the screen (so looked good when you got close to it) was completely infeasible to do for every surface in a 3D scene. Texture compression made this somewhat possible, but more modern 3D accelerators have thousands of times as much texture memory as they need for a single frame buffer.
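‘The same thing to every vertex’ means a loop with no cross-iteration dependency, which is the definition of embarrassingly parallel. A hypothetical sketch (the OpenMP pragma stands in for the thousands of independent hardware threads a GPU would actually use, and is safely ignored by a compiler without OpenMP):

```c
typedef struct { float x, y, z, w; } Vertex;

/* Each iteration reads and writes only v[i], so every
 * iteration can run on a separate lane or core. */
static void scale_all(Vertex *v, int n, float s) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        v[i].x *= s;
        v[i].y *= s;
        v[i].z *= s;
    }
}
```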
And that increase in memory brings me to the third thing. CPUs are optimised around the idea that workloads exhibit a lot of locality of reference in both spatial and temporal dimensions. If you access some data, you are likely to access it again soon. You are also likely to access nearby data soon. At the same time, data access patterns are hard to predict. For graphics, this is far less true than for a lot of other workloads. Most memory tends to be touched once per scene, so caches don’t help, and a lot of memory is read to render each scene, so you need to be able to stream a lot more data through the compute units than elsewhere. At the same time, most memory accesses are predictable, so you can program the access pattern (which may be something complex like recursive Z shapes) into the memory controller and stream data at DRAM line rates. All of this ends up with a very different design to most CPUs, and trying to either build it out of cheap RISC cores or build something that works well as both a GPU and CPU will involve a lot of compromises that will hurt either or both workloads.
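Those ‘recursive Z shapes’ are Morton (Z-order) indexing: interleave the bits of the x and y texel coordinates and you get an address layout where texels that are close in 2D are close in memory, so a 2x2 bilinear fetch usually lands in one DRAM burst rather than two distant rows. A sketch of the standard bit-interleaving version (function names are mine):

```c
#include <stdint.h>

/* Spread the low 16 bits of x so each bit lands in an even
 * position: ...dcba -> ...0d0c0b0a (classic bit-twiddling). */
static uint32_t spread_bits(uint32_t x) {
    x &= 0xFFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Morton (Z-order) index of texel (x, y): x bits in even
 * positions, y bits in odd positions. */
static uint32_t morton2(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

Walking addresses 0, 1, 2, 3, … under this mapping traces the nested Z pattern over the texture, which is exactly the traversal order a memory controller can be programmed to prefetch.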
I touched on a lot of this a few years ago in my CACM article There’s No Such Thing as a General-Purpose Processor.