A few years ago I designed a way to detect bit-flips in Firefox crash reports and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound and a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here's a few numbers to give you an idea of how large the problem is. 🧡 1/5
In the last week we received ~470000 crash reports. These do not represent all crashes because it's an opt-in system; the real number of crashes will be several times larger. Still, out of these, ~25000 crashes have been detected as having a potential bit-flip. That's one crash in every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number; it's probably going to be at least twice as much. 2/5
In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%. This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem. 3/5
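For anyone who wants to check the arithmetic, here is a minimal sketch using only the rounded figures quoted in this thread. The variable names are made up for illustration, and the factor of two is the rough "at least twice as much" guess from the previous post, not a measured value.

```python
# Approximate weekly figures quoted in the thread.
total_reports = 470_000      # opt-in crash reports received in one week
flagged_bitflips = 25_000    # reports flagged as a potential bit-flip

# Assumption taken from the thread: the heuristic is conservative, so the
# real count is guessed to be about twice the flagged one.
underestimate_factor = 2

flagged_share = flagged_bitflips / total_reports
estimated_share = flagged_share * underestimate_factor

print(f"flagged:   {flagged_share:.1%} (about 1 in {round(1 / flagged_share)})")
print(f"estimated: {estimated_share:.1%}")
```

The flagged share comes out a little above 5% and the doubled estimate a little above 10%, which is where the rounded "one in twenty" and "10%" figures come from.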
And to reinforce this estimate I've looked at the numbers we got from the users who run the memory tester after having experienced a crash: for every two crashes we think are caused by a bit-flip the memory tester found one genuine hardware issue. Keep in mind that this is not doing an extensive test of all the machine's RAM, it only checks up to 1 GiB of memory and runs for no longer than 3 seconds... and it has found lots of real issues! 4/5
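To give an idea of what a short, time-boxed memory check can look like, here is a toy sketch. It is not the tester Firefox ships: the function name, the bit patterns, and the chunk size are all assumptions for illustration, and a Python-level loop like this mostly exercises CPU caches rather than DRAM. Only the 3 second budget mirrors the limit mentioned above.

```python
import time

# Hypothetical pattern set: all zeros, all ones, and two alternating-bit bytes.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)

def quick_memory_check(size_bytes, time_budget_s=3.0, chunk=1 << 20):
    """Write bit patterns to a buffer and read them back.

    Returns a list of (offset, expected, actual) for bytes that read back
    wrong. Stops early once the time budget is used up.
    """
    buf = bytearray(size_bytes)
    deadline = time.monotonic() + time_budget_s
    failures = []
    for pattern in PATTERNS:
        for start in range(0, size_bytes, chunk):
            if time.monotonic() > deadline:
                return failures  # out of time: report what was found so far
            end = min(start + chunk, size_bytes)
            expected = bytes([pattern]) * (end - start)
            buf[start:end] = expected          # write the pattern
            if buf[start:end] != expected:     # read it back and compare
                for i in range(start, end):
                    if buf[i] != pattern:
                        failures.append((i, pattern, buf[i]))
    return failures

if __name__ == "__main__":
    bad = quick_memory_check(16 * 1024 * 1024)
    print(f"{len(bad)} mismatching bytes found")
```

A real tester has to work much closer to the hardware (bypassing caches, touching as much physical memory as it can get), but the write, read back, compare loop under a hard deadline is the general shape.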
And for the record I'm looking at this mostly on computers and phones, but this affects *every* device. Routers, printers, etc... you name it. That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job. 5/5
@gabrielesvelto is it lots of different devices, each one experiencing rare crashes at random, or is there a small number of really shitty computers accounting for a large share of the crashes?
@guenther I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate.
@guenther generally speaking a single machine won't send a lot of crashes. It's very common that they only have one bad bit across their whole installed RAM. They'll hit it eventually, especially if it's in the lower address ranges, but not all of the time. And in order to crash, some important data needs to end up there, like a pointer or an instruction.
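To illustrate why a single bad bit under a pointer is enough to crash, here is a small sketch. The address is a made-up example and the helper names are hypothetical; this is not the heuristic used on the crash reports, just the basic "differs by exactly one bit" idea.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted."""
    return value ^ (1 << bit)

def differs_by_one_bit(a: int, b: int) -> bool:
    """True when a and b differ in exactly one bit position."""
    diff = a ^ b
    # A non-zero number with a single bit set has no bits in common
    # with itself minus one.
    return diff != 0 and diff & (diff - 1) == 0

pointer = 0x00007F3A_1C2B4000        # hypothetical valid heap address
corrupted = flip_bit(pointer, 40)    # what a stuck or flipped bit could produce

print(hex(pointer), hex(corrupted))
print("one bit apart:", differs_by_one_bit(pointer, corrupted))
print("distance in bytes:", abs(pointer - corrupted))
```

One flipped high bit moves the pointer about a terabyte away from where it should point, almost certainly into unmapped memory, so the next dereference crashes. The same flip in a buffer of pixel data would just be a slightly wrong color that nobody notices.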
@gabrielesvelto @guenther why would the lower address ranges be special? This is confusing to me. I can't imagine firefox runs on any OS without virtual memory (or ASLR), so it doesn't seem like that should correlate strongly with any physical aspect.
@vathpela @guenther I meant in the lower *physical* address ranges, because they're more likely to be used early on even on a lightly loaded machine. I once had a laptop with a bad bit at the very end of the physical range. I would hit it only when running Firefox OS builds, which were massive (basically building Firefox + a good chunk of Android's base system at the same time).
@gabrielesvelto @guenther That still seems really weird to me - why would firefox be likely to get a low physical address? If anything is likely to have a higher chance of getting that memory, I would think it would be the kernel (which has genuine lowmem requirements sometimes for e.g. dma bufs and such on some platforms), but a userland process seems odd.
@vathpela @guenther oh it's not for Firefox specifically. Users with bad bits in lower address ranges will be more likely to encounter problems with *everything*, including the kernel. I also don't literally mean the *lowest* ranges. Say, if you have a bad bit in the first GiB of physical memory you'll see its effects far more often than if you have it in the last one on a 32 GiB machine.
@gabrielesvelto @vathpela @guenther The Linux kernel has CONFIG_SHUFFLE_PAGE_ALLOCATOR to randomize which memory gets allocated first, which distros generally enable, but probably no one activates it with the boot param page_alloc.shuffle=y ;)
@vbabka @gabrielesvelto @guenther all boot params are policy failures :)