I declare that today, Nov. 19, 2025, is the 50th anniversary of BitBLT, a routine so fundamental to computer graphics that we don't even think about it having an origin. A working (and later optimized) implementation was devised on the Xerox Alto by members of the Smalltalk team. It made it easy to copy and move arbitrary rectangles of bits within a graphical bitmap. It was this routine that made Smalltalk's graphical interface possible. Below is part of a PARC-internal memo detailing it:
BitBLT was implemented in microcode on the Alto and exposed to the end-user as just another assembly language instruction, alongside your regular old Nova instructions -- this is how foundational it was. And since it was an integral part of the Alto, it enabled all sorts of interesting experimentation with graphics: user interfaces and human/computer interaction, font rasterization, laser printing... maybe a game or three...
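For readers who have never bumped into it, the core idea fits in a few lines. The C sketch below operates on bytes rather than the Alto's packed 1-bit, word-aligned bitmaps, so it only shows the shape of the operation; the real routine also dealt with bit alignment, masking, overlapping regions, and copy direction:

```c
/* Rough sketch of what a BitBLT-style routine does, in software terms.
   Illustrative only -- not the Alto microcode, which worked on packed
   1-bit pixels in 16-bit words with shifting and masking. */
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_COPY, OP_OR, OP_XOR, OP_AND_NOT } blt_op;

/* Copy a w*h rectangle from src to dst, combining each source byte with
   the destination according to op. pitch = bytes per row of the bitmap. */
static void blt(uint8_t *dst, size_t dst_pitch,
                const uint8_t *src, size_t src_pitch,
                size_t w, size_t h, blt_op op)
{
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            uint8_t s = src[y * src_pitch + x];
            uint8_t *d = &dst[y * dst_pitch + x];
            switch (op) {
            case OP_COPY:    *d = s;             break;
            case OP_OR:      *d |= s;            break;
            case OP_XOR:     *d ^= s;            break;
            case OP_AND_NOT: *d &= (uint8_t)~s;  break;
            }
        }
    }
}
```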

@fvzappa

Do modern GPUs still do blitting?

@argv_minus_one @fvzappa Apparently GPUs themselves do a lot of fast memory block copies via DMA (kind of like blitting without the XOR operations), but use shader programs to do what blitter hardware used to do for small memory areas, on a pixel-by-pixel scale.

@sleet01 @argv_minus_one @fvzappa

> Apparently GPUs themselves do a lot of fast memory block copies via DMA (…) shaders

*me grimacing*… kind-of… sort-of…

Okay, first things first: GPUs still do have dedicated hardware that also enables bit blitting: specifically, the part of the raster engine that's responsible for resolving antialiased frame buffers. Graphics APIs still expose this with functions carrying 'blit' in their name:

https://registry.khronos.org/OpenGL-Refpages/gl4/html/glBlitFramebuffer.xhtml

https://docs.vulkan.org/refpages/latest/refpages/source/vkCmdBlitImage.html
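As a rough sketch of what the OpenGL entry point looks like in use (the FBO names and the helper function are made up for illustration; it assumes fbo_msaa is a complete multisampled framebuffer and fbo_resolve a single-sample one of the same size):

```c
#include <GL/glew.h>  /* or any other loader exposing GL 3.0+ entry points */

/* Hypothetical resolve of a multisampled framebuffer via the blit engine.
   fbo_msaa and fbo_resolve are assumed to be complete FBOs of equal size. */
void resolve_msaa(GLuint fbo_msaa, GLuint fbo_resolve, int width, int height)
{
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo_msaa);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo_resolve);

    /* The blit/raster engine copies and resolves the rectangle;
       no shader program is involved. */
    glBlitFramebuffer(0, 0, width, height,
                      0, 0, width, height,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);
}
```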

@sleet01 @argv_minus_one @fvzappa

Second: there are certain aspects of blitting operations that are outside the scope of shaders, specifically raster logic operations (ROPs), which combine source and destination values. If you wanted to implement that in a shader, you'd have to feed the destination buffer back into the shader as a source, which technically can be done, but is slooooow.

So things like alpha blending and ROPs are done through the raster engine, which is also the blit engine.
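As an illustration, classic XOR-style raster ops are still exposed as fixed-function state in desktop OpenGL, handled by the raster back end rather than by a shader reading the framebuffer back (sketch assumes a current GL context):

```c
#include <GL/gl.h>  /* desktop OpenGL; assumes a current context */

void begin_xor_drawing(void)
{
    /* Each written pixel becomes src XOR dst; the read-modify-write
       happens in the raster back end, not in the fragment shader. */
    glEnable(GL_COLOR_LOGIC_OP);
    glLogicOp(GL_XOR);
}

void end_xor_drawing(void)
{
    glDisable(GL_COLOR_LOGIC_OP);
}
```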

@datenwolf @argv_minus_one @fvzappa Apologies, I wasn't _trying_ to invoke Cunningham's Law ^_^
I'd totally forgotten about the 2D acceleration stuff, since it's mostly mentioned (now, at least) in the context of GUI acceleration.
Thanks for the corrections!

@sleet01 @argv_minus_one @fvzappa

No worries – GPUs are weird beasts and in places kind of counterintuitive. In a way my whole career is founded on other engineers having misconceptions about GPUs. :-)

Alas, the raster engine isn't merely there for 2D acceleration, but also forms a vital part of 3D rendering. Besides blitting and ROPing, it also implements depth testing and blending.

@datenwolf @argv_minus_one @fvzappa Blending sounds reasonable, but depth testing? Is that because it can compare int values quickly, and the depth is stored as a 2D bitmap? I vaguely recall something like that...

@sleet01 @argv_minus_one @fvzappa

Basically, it's every operation that takes a generated fragment (a pixel value tuple) and merges it in place into the destination framebuffer pixel. If you did that in a fragment shader, you'd build a data-path feedback loop, which gets messy when multiple elements in a single draw call hit the same destination pixels. You can use memory barriers to order the writes, but that's inefficient.
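Alpha blending is the classic example of such an in-place merge, and it's configured as fixed-function state; the sketch below uses plain desktop OpenGL calls and assumes a current GL context:

```c
#include <GL/gl.h>  /* assumes a current OpenGL context */

void enable_alpha_blending(void)
{
    /* The fragment shader only emits a source color; the merge
       dst = src.a * src + (1 - src.a) * dst
       is computed by the raster back end for every covered pixel. */
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
}
```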

@sleet01 @argv_minus_one @fvzappa

Depth testing is basically an in-place compare-and-select operation. And if the fragment shader doesn't modify the depth value, depth testing is executed before the fragment shader, potentially saving a lot of compute for rejected fragments.
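In OpenGL terms, that compare-and-select is just fixed-function state as well (again assuming a current GL context):

```c
#include <GL/gl.h>  /* assumes a current OpenGL context */

void enable_depth_test(void)
{
    /* Compare the incoming fragment's depth against the depth buffer and
       keep the fragment only if it is nearer; rejected fragments never
       touch the color buffer. */
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);
}
```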

@argv_minus_one @fvzappa Modern GPUs have fundamentally different challenges, so the solutions that worked on the original computers don't necessarily help on the new ones.

Old computers were slow at running instructions and had limited memory space, but had ample memory bandwidth. So having one function that can do X different actions with few opcodes saves memory, and you can spend the extra memory bandwidth on reads and writes because it's available.

Nowadays we have ample space and the cores are extremely fast, but the limiting factor is memory bandwidth. I believe a pixel shader with something like fewer than 100 instructions can't saturate the cores at all, so most of the time they're just waiting for memory to arrive. A solution that doubles memory bandwidth consumption doesn't help with that.
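A quick back-of-the-envelope illustration of that point (the numbers are assumptions picked for round figures, not measurements):

```c
/* Writing a 3840x2160 RGBA8 target once per frame at 60 Hz already moves
   roughly 2 GB/s; having every fragment also read the destination back
   doubles that, before textures and geometry are even counted. */
#include <stdio.h>

int main(void)
{
    const double width = 3840, height = 2160, bytes_per_pixel = 4, fps = 60;
    double write_only        = width * height * bytes_per_pixel * fps; /* bytes/s */
    double read_modify_write = 2.0 * write_only;                       /* bytes/s */
    printf("write-only:        %.2f GB/s\n", write_only / 1e9);
    printf("read-modify-write: %.2f GB/s\n", read_modify_write / 1e9);
    return 0;
}
```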


@nina_kali_nina @dascandy @argv_minus_one @fvzappa
Does anything similar exist in Vulkan?
@brouhaha @nina_kali_nina @dascandy @argv_minus_one @fvzappa Don't know if you saw it, but a Vulkan-link was posted in another response: https://chaos.social/@datenwolf/115575595195740316
@pianosaurus
Thanks for bringing that to my attention!

@fvzappa

I believe this is slightly misleading. There wasn't really a canonical microcode for Alto. Each language implemented its own VM in Alto bytecodes. It was a bytecode, so you had 255 instructions (I think 0 was reserved? I might be misremembering) that you'd use to implement the common operations for your language. There were Algol, Smalltalk, and a few other VMs.

The Smalltalk bytecode, which included BitBLT, was documented in the Smalltalk Blue Book.

Mostly unrelated, but meeting Dan Ingalls was probably the time in my life when it's been hardest to not make happy fanboy squee noises.

@david_chisnall The microcode built into the Alto's 1K microcode ROM included the BitBLT routine. The Alto's microcode engine was actually specialized for Nova instruction decoding; the "native" instruction set was an extended Nova ISA. The Smalltalk emulator was bytecode oriented, as was Mesa. But there was no such thing as "Alto bytecode"; the Alto's microcode was implemented in a 32-bit horizontal format that directly controlled the datapaths, ALU, and memory (and many special functions).
@david_chisnall The later D-machines (Dolphin, Dorado, Dandelion (Star), etc.) were designed to execute bytecodes efficiently (specifically for Mesa, but Smalltalk also took advantage of this). The Dorado could (in theory) execute 16 million bytecodes/sec which was pretty impressive in 1979.