RE: https://mas.to/@gabrielesvelto/116171755938331430

It is amazing that computers work at all.

One of my favorite research papers along these lines is "When the CRC and TCP checksum disagree": https://dl.acm.org/doi/10.1145/347059.347561

By looking at cases where the Ethernet CRC indicated no error, but the TCP checksum was invalid, the authors found a whole host of bugs in networking hardware, such as faulty DMA engines and buggy router memory.
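
Here is a minimal sketch of how the two checks can disagree. It is a toy model, not the paper's methodology. It assumes corruption happens inside a hypothetical router, after the incoming frame's CRC was verified and before the outgoing frame's CRC is computed. `zlib.crc32` stands in for the Ethernet FCS, and the checksum covers only the payload (real TCP also covers a pseudo-header and the TCP header). The helper names are made up for illustration.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def link_frame(payload: bytes) -> bytes:
    """Append a CRC-32 trailer (stand-in for the Ethernet FCS)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def link_crc_ok(frame: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare to the trailer."""
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == fcs


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


segment = b"hello, end-to-end checks"
tcp_sum = internet_checksum(segment)  # computed once, by the sender

# Hop 1: the frame reaches the (hypothetical) router intact.
frame_in = link_frame(segment)
assert link_crc_ok(frame_in)

# Assumed fault: one bit flips in router memory after the CRC check.
corrupted = flip_bit(frame_in[:-4], 3)

# Hop 2: the router computes a fresh CRC over the already-corrupted bytes.
frame_out = link_frame(corrupted)

crc_ok = link_crc_ok(frame_out)                          # link layer is happy
tcp_ok = internet_checksum(frame_out[:-4]) == tcp_sum    # end-to-end check is not
print(f"CRC ok: {crc_ok}, TCP checksum ok: {tcp_ok}")
```

The link-layer CRC only protects each hop, so damage done between hops gets a clean CRC on the way out. Only the end-to-end checksum, computed once at the sender, can notice it.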

Hardware data corruption is everywhere!

(There are no doubt newer studies; I'm just personally acquainted with one of the authors.)

@markgritter this should also work for PCI Express, for the same reasons

@markgritter When I worked at Google, the most amazing thing to me was learning that the special sauce that made Google work behind the scenes is that at every level of their service design, from frontend to storage, they assume every component can fail. The result is multiple layers of failover handlers, possibly dozens to hundreds of them depending on the nature of the operation. And it turns out that when you do that, you end up with a system that can work without huge chunks of what other systems might consider "core", and that's okay.

Why did they do it? Because while other companies were trying to solve search with hardware bought from SGI, Larry and Sergey started with no money and rolled their own machines. Machines that looked like this.

Note the sagging plywood backboards for the rack units.

Google's founders couldn't afford better hardware, so they had to build software that would run on bullshit that sucked. And the rest is history.

@markgritter TIL 10% of Firefox crashes are due to bit flips. 🤯