Do modern GPUs still do blitting?
@sleet01 @argv_minus_one @fvzappa
> Apparently GPUs themselves do a lot of fast memory block copies via DMA (…) shaders
*me grimacing*… kind-of… sort-of…
Okay, first things first: GPUs do still have dedicated hardware that enables bit blitting – specifically, the part of the raster engine responsible for resolving antialiased framebuffers. Graphics APIs still expose this through functions carrying 'blit' in their name:
https://registry.khronos.org/OpenGL-Refpages/gl4/html/glBlitFramebuffer.xhtml
https://docs.vulkan.org/refpages/latest/refpages/source/vkCmdBlitImage.html
@sleet01 @argv_minus_one @fvzappa
Second: There are certain aspects of blitting operations that are outside the scope of shaders. Specifically, raster logic operations combine source *and* destination values. If you wanted to implement that in a shader, you'd have to feed the destination buffer back into the shader as a source, which technically can be done, but is slooooow.
So things like alpha blending and ROPs are done by the raster engine, which doubles as the blit engine.
@sleet01 @argv_minus_one @fvzappa
No worries – GPUs are weird beasts and in places kind of counterintuitive. In a way my whole career is founded on other engineers having misconceptions about GPUs. :-)
Mind you, the raster engine isn't merely there for 2D acceleration; it also forms a vital part of 3D rendering. Besides blitting and ROPs, it implements depth testing and blending.
@sleet01 @argv_minus_one @fvzappa
Basically every operation that takes a generated fragment (a tuple of pixel values) and merges it in place into the destination framebuffer pixel. If you did that in a fragment shader you'd build a data-path feedback loop, which gets messy when multiple elements in a single draw call hit the same destination pixels. You can use memory barriers to order the writes, but that's inefficient.
@sleet01 @argv_minus_one @fvzappa
Depth testing is basically an in-place compare-and-select operation. And if the fragment shader doesn't modify the depth value, depth testing can be executed before the fragment shader runs, potentially saving a lot of compute for rejected fragments.
@argv_minus_one @fvzappa Modern GPUs face fundamentally different challenges, so solutions designed for the original computers don't necessarily help on the new ones.
Old computers were slow to run instructions and had limited memory space, but ample memory bandwidth. So having one function that can do X different actions with a few opcodes saves memory space, and you can spend extra memory bandwidth on reads and writes because that's available.
Nowadays we have ample space and the cores are extremely fast, but the limiting factor is memory bandwidth. I believe a pixel shader of something like fewer than 100 instructions wouldn't be able to saturate the cores at all, so most of the time they're just waiting for memory to arrive. A solution that doubles memory-bandwidth consumption doesn't help with that.
I believe this is slightly misleading. There wasn't really a canonical microcode for Alto. Each language implemented its own VM in Alto bytecodes. It was a bytecode, so you had 255 instructions (I think 0 was reserved? I might be misremembering) that you'd use to implement the common operations for your language. There were Algol, Smalltalk, and a few other VMs.
The Smalltalk bytecode, which included BitBLT, was documented in the Smalltalk Blue Book.
Mostly unrelated, but meeting Dan Ingalls was probably the time in my life when it's been hardest to not make happy fanboy squee noises.
Dan is a Hero
His work was some of the prior art that helped to break the Cadtrak patent back in the day.
@fvzappa And the optimizations you mention are a great example of on-the-fly (or JIT) code generation, explained by Raymond Chen in https://devblogs.microsoft.com/oldnewthing/20180209-00/?p=97995
The original paper describing the optimizations was written by Rob Pike, Leo Guibas and Dan Ingalls (Unix and Smalltalk people working together!) and can be found at https://pdos.csail.mit.edu/~rsc/pike84bitblt.pdf
@ssavitzky perhaps relevant to your interests...
Xerox PARC had visionaries in abundance; what it lacked was upper management that was able to actually do something with it.
@fvzappa Most of that should still reside in the Squeak code base as well, in case people are interested in a "historical" implementation.
Also, Dan is one of the nicest and most modest people I've ever had the pleasure of meeting.