A few years ago I designed a way to detect bit-flips in Firefox crash reports and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound and a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here's a few numbers to give you an idea of how large the problem is. 🧵 1/5
In the last week we received ~470,000 crash reports. These do not represent all crashes because it's an opt-in system; the real number of crashes will be several times larger. Still, out of these, ~25,000 crashes have been flagged as having a potential bit-flip. That's roughly one crash in twenty potentially caused by bad/flaky memory, which is huge! And because it's a conservative heuristic we're underestimating the real number; it's probably at least twice as much. 2/5
In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%. This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem. 3/5
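For anyone who wants to check the arithmetic behind the "one in twenty" and "up to 10%" figures, here is a back-of-the-envelope sketch. It uses only the approximate counts quoted in the thread; the variable names are made up for illustration, and the factor of two is the thread's own rough guess about how much the conservative heuristic undercounts, not a measured value.

```python
# Approximate figures quoted in the thread (one week of opt-in crash reports).
total_reports = 470_000
flagged_bit_flips = 25_000

# Share of reports the heuristic flagged as a potential bit-flip.
flagged_share = flagged_bit_flips / total_reports

# Assumption taken from the thread: the conservative heuristic misses
# at least half of the real cases, so double the flagged share.
estimated_share = 2 * flagged_share

print(f"flagged:   {flagged_share:.1%} (about 1 in {round(1 / flagged_share)})")
print(f"estimated: {estimated_share:.1%}")
# prints:
# flagged:   5.3% (about 1 in 19)
# estimated: 10.6%
```

The ~15% figure cannot be reproduced the same way, because the thread does not say how many of the reports were resource-exhaustion crashes.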
And to reinforce this estimate I've looked at the numbers we got from the users who run the memory tester after having experienced a crash: for every two crashes we think are caused by a bit-flip the memory tester found one genuine hardware issue. Keep in mind that this is not doing an extensive test of all the machine's RAM, it only checks up to 1 GiB of memory and runs for no longer than 3 seconds... and it has found lots of real issues! 4/5
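To give a feel for what a size- and time-bounded memory check looks like, here is a minimal sketch. It is not the tester Firefox ships: the function name, the chunk size, and the 0x55/0xAA patterns are all assumptions chosen for illustration, and only the 1 GiB and 3 second limits come from the post above. A real tester would be native code and would have to deal with CPU caches, which a Python buffer like this mostly reads back from instead of DRAM.

```python
import time


def quick_memory_check(size_bytes, time_budget_s=3.0, chunk=1 << 20):
    """Write bit patterns into a buffer and read them back within a time budget.

    Hypothetical helper for illustration only.
    Returns (bytes_checked, offsets_of_chunks_that_read_back_wrong).
    """
    buf = bytearray(size_bytes)
    deadline = time.monotonic() + time_budget_s
    bad_offsets = []
    checked = 0
    for offset in range(0, size_bytes, chunk):
        # Stop as soon as the time budget is spent, even if memory remains.
        if time.monotonic() >= deadline:
            break
        n = min(chunk, size_bytes - offset)
        # 0x55 and 0xAA are alternating-bit patterns, so every bit in the
        # chunk gets exercised as both 0 and 1.
        for pattern in (0x55, 0xAA):
            expected = bytes([pattern]) * n
            buf[offset:offset + n] = expected
            if buf[offset:offset + n] != expected:
                bad_offsets.append(offset)
                break
        checked += n
    return checked, bad_offsets


# Small demo run: 4 MiB instead of the 1 GiB cap mentioned in the thread.
checked, bad = quick_memory_check(4 << 20)
print(f"checked {checked} bytes, {len(bad)} bad chunk(s)")
```

Calling `quick_memory_check(1 << 30)` would mirror the 1 GiB cap, at the cost of allocating that much memory in the process.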
And for the record I'm looking at this mostly on computers and phones, but this affects *every* device. Routers, printers, etc... you name it. That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job. 5/5

@gabrielesvelto this makes me think

  • decreasing size in RAM would inherently decrease physical errors
  • there will be undercounting from bit-flips which don't cause crashes (if a bit flips in the text I'm entering now, it'd be a typo not a crash)
  • maybe non-RAM physical errors could be estimated by looking at crashes from machines with ECC?

but of those the most fascinating outcomes to me are that

  • the same logically correct codebase, compiled to two binaries of different size, should crash less often when run as the smaller binary
  • changes to code that reduce its compiled size will decrease crashes, if correctness is unchanged

@datum yes, absolutely. Coincidentally, the bulk of Firefox code was compiled for size, not speed, by default, as smaller code proved faster in such a large codebase. Nowadays it's a complex PGO/LTO dance but the focus on small executable footprint has remained.

@datum
> decreasing size in RAM would inherently decrease physical errors

actually, the opposite is true: smaller feature sizes on DDR5 lead to significantly more errors. DRAM cells are closer together, have higher leakage currents, and are more likely to interfere with each other. as DRAM shrinks, attacks like rowhammer as well as transient issues from normal execution become *far* more frequent.

edit: sorry I misunderstood your post, you meant code size, not physical size