A few years ago I designed a way to detect bit-flips in Firefox crash reports, and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound and a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here are a few numbers to give you an idea of how large the problem is. 🧵 1/5
In the last week we received ~470,000 crash reports. These do not represent all crashes because it's an opt-in system; the real number of crashes will be several times larger. Still, out of these, ~25,000 crashes have been detected as having a potential bit-flip. That's roughly one crash in every twenty potentially caused by bad/flaky memory, which is huge! And because it's a conservative heuristic we're underestimating the real number; it's probably going to be at least twice as much. 2/5
In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%. This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem. 3/5
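For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch using only the rounded figures quoted above. The variable names are mine, and the last value (the share of resource-exhaustion crashes) is not a reported number: it is only what the "around 15%" figure would imply if the other numbers are taken at face value.

```python
# Rounded figures quoted in the thread (one week of opt-in crash reports).
reports = 470_000          # total crash reports received
flagged = 25_000           # reports flagged as a potential bit-flip

flagged_share = flagged / reports            # about 0.053, i.e. ~5.3%
one_in = reports / flagged                   # about 18.8, i.e. roughly 1 in 20

# The heuristic is described as conservative, with the real number
# "at least twice as much", so double the flagged count.
doubled_share = 2 * flagged / reports        # about 0.106, i.e. ~10%

# ASSUMPTION: solve 2 * flagged / (reports - excluded) = 0.15 for `excluded`
# to see how many resource-exhaustion crashes the 15% figure would imply.
# This is an inference from rounded numbers, not data from the thread.
target_share = 0.15
implied_excluded = reports - 2 * flagged / target_share
implied_excluded_share = implied_excluded / reports

print(f"flagged share:        {flagged_share:.1%}")
print(f"one crash in every:   {one_in:.1f}")
print(f"doubled share:        {doubled_share:.1%}")
print(f"implied excluded:     {implied_excluded:,.0f} ({implied_excluded_share:.0%})")
```

The implied figure is only as precise as the "up to 10%" and "around 15%" estimates it is derived from, so treat it as a sanity check rather than a measurement.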
And to reinforce this estimate I've looked at the numbers we got from the users who run the memory tester after having experienced a crash: for every two crashes we think are caused by a bit-flip the memory tester found one genuine hardware issue. Keep in mind that this is not doing an extensive test of all the machine's RAM, it only checks up to 1 GiB of memory and runs for no longer than 3 seconds... and it has found lots of real issues! 4/5
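To give a rough idea of what a short, bounded memory check looks like, here is a toy sketch of a pattern test with a size cap and a time budget. This is not the tester Firefox ships: the function names, the patterns, and the defaults are all assumptions for illustration, and a Python `bytearray` goes through so many layers that it is a poor way to catch real RAM faults.

```python
import time

def fill(buf, pattern):
    """Write the same byte pattern to every position in buf."""
    buf[:] = bytes([pattern]) * len(buf)

def find_mismatches(buf, pattern):
    """Return (offset, flipped_bits_mask) for every byte that differs from pattern."""
    return [(i, b ^ pattern) for i, b in enumerate(buf) if b != pattern]

def quick_test(buf, patterns=(0x00, 0xFF, 0x55, 0xAA), time_budget_s=3.0):
    """Write and read back each pattern until done or the time budget runs out.

    Hypothetical helper: returns the first list of mismatches found, or []
    if every pattern that fit in the budget read back correctly.
    """
    deadline = time.monotonic() + time_budget_s
    for pattern in patterns:
        if time.monotonic() >= deadline:
            break  # out of time: report what we have, i.e. nothing found
        fill(buf, pattern)
        bad = find_mismatches(buf, pattern)
        if bad:
            return bad
    return []

# Usage: test a small buffer, then simulate a single stuck bit by hand.
buf = bytearray(1 << 20)           # 1 MiB here; the real tester checks up to 1 GiB
clean_result = quick_test(buf)

fill(buf, 0x55)
buf[100] ^= 0x08                   # inject a one-bit flip at offset 100
injected_result = find_mismatches(buf, 0x55)
```

An empty result from a test this short only means nothing was caught in the region and time that were checked, which is why a hit rate of one confirmed issue per two suspected crashes is notable.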
And for the record I'm looking at this mostly on computers and phones, but this affects *every* device. Routers, printers, etc... you name it. That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those. Good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job. 5/5

@gabrielesvelto As a personal anecdote, I built a PC in 2017 with 16 GB of DDR4 RAM that I got from Amazon (Germany). I had to return it after extensive testing with the free version of PassMark's MemTest86: it had failing bits. The replacement did pass the heavy testing. If there's one thing I wanted that PC to be, it was very stable and reliable.

A few years later I got a second 16 GB kit to expand the PC to 32 GB. I had to return that kit as well because it also had errors. The replacement again passed the extensive testing. This is still the PC I'm writing from now, in fact.

Manufacturers and their QA teams must be aware of their failure rates, but they likely do not care, since ignoring them saves costs and raises profits. They still sell kits with some failures, because not many users subject their PCs/RAM to the torture of these long RAM tests (4 full passes or more, for sanity's sake, take hours). And crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately. From my experience, the "RAM Test" offered by Windows was an absolute joke. It never found anything on the kits where MemTest86 would find failures in about one of every two runs.

I remember watching a YouTuber testing a gaming build he had just put together, and he used Prime95 to test it for only a few minutes. The computer did not crash, and in his view that was good enough for a gaming PC: it was fine as long as Prime95 did not crash quickly, and even better if it lasted a few minutes. I disagree, in particular because in that run of his, even though Prime95 did not crash, it showed calculation error warnings. That could have happened because of RAM issues. Any calculation error warning from Prime95 is quite a hardware stability/reliability red flag, just like any finding from MemTest86.

It is a failure of the industry that ECC RAM is still not standard at least for PCs, laptops, and cellphones. Maybe it should be standard for all consumer electronics in fact.

@raulinbonn @gabrielesvelto ECC RAM is only worth anything if the error corrections are being monitored and there's a process in place to replace the faulty component. If not, then it's just a waste of money to generate warnings in a log file nobody reads.
@raulinbonn @gabrielesvelto @stark ECC RAM detects and corrects single-bit errors, making machines that use it immune to this issue.
The BMC reports if a DIMM has gone bad and is spewing errors.
@stark @raulinbonn I disagree, even simple SECDED ECC would significantly lengthen the lifetime of consumer electronics. It takes a while before a system develops multiple bit failures within the same chunk that cannot be corrected (unless it's a catastrophic failure such as an entire bit/word-line going bust).
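
For readers who haven't seen SECDED (single error correction, double error detection) up close, here is a minimal sketch using an extended Hamming(8,4) code: 4 data bits protected by 3 Hamming parity bits plus one overall parity bit. The function names and bit layout are my own choices for illustration. ECC DIMMs typically apply the same idea to much wider words (8 check bits per 64 data bits), and the exact codes used in real memory controllers vary.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of 0/1).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions, with parity at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    bits = [0] * 8
    bits[3] = nibble & 1
    bits[5] = (nibble >> 1) & 1
    bits[6] = (nibble >> 2) & 1
    bits[7] = (nibble >> 3) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) % 2             # makes total parity even
    return bits

def decode(codeword):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(codeword)                   # work on a copy
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall = sum(bits) % 2                 # 0 for a valid codeword
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one. The syndrome is its position
        # (0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        return None, "uncorrectable"
    data = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return data, status

# Usage: one flipped bit is repaired, two flipped bits are only detected.
word = encode(0b1011)
one_flip = list(word)
one_flip[6] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
single_result = decode(one_flip)
double_result = decode(two_flips)
```

This is the behaviour the argument relies on: a single failing bit in a word keeps being corrected silently, and data is only lost once a second bit in the same word fails. With three or more flips a code like this can miscorrect, which is the catastrophic-failure case mentioned above.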