A few years ago I designed a way to detect bit-flips in Firefox crash reports, and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests, and I'm now 100% positive that the heuristic is sound: a lot of the crashes we see come from users with bad memory or similarly flaky hardware. Here are a few numbers to give you an idea of how large the problem is. 🧵 1/5
In the last week we received ~470,000 crash reports. These do not represent all crashes because it's an opt-in system; the real number of crashes will be several times larger. Still, ~25,000 of these have been detected as containing a potential bit-flip. That's one crash in every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number; it's probably at least twice as high. 2/5
In other words, up to 10% of all the crashes Firefox users see are not software bugs: they're caused by hardware defects! If I subtract crashes caused by resource exhaustion (such as out-of-memory crashes), this number goes up to around 15%. This is somewhat skewed because users with flaky hardware will crash more often than users with functioning machines, but even then it dwarfs all the previous estimates I've seen for this problem. 3/5
And to reinforce this estimate, I've looked at the numbers from the users who ran the memory tester after experiencing a crash: for every two crashes we think were caused by a bit-flip, the memory tester found one genuine hardware issue. Keep in mind that this is not an extensive test of all the machine's RAM: it only checks up to 1 GiB of memory and runs for no longer than 3 seconds... and it has found lots of real issues! 4/5
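A tester like this boils down to writing known patterns into a buffer and reading them back; any mismatch is a bit that failed to hold its value. Here is a purely illustrative sketch in Python — the function name, parameters, and patterns are made up, and the real Firefox tester works on raw memory at a much lower level:

```python
import time

def quick_memory_test(size_bytes=1 << 20, time_budget_s=3.0):
    """Toy memory test: write known patterns, read them back.

    Any mismatch indicates a bit that did not hold its value.
    A real tester maps raw physical pages, defeats caches, and
    uses patterns designed to stress neighboring cells.
    """
    buf = bytearray(size_bytes)
    deadline = time.monotonic() + time_budget_s
    errors = []
    for pattern in (0x00, 0xFF, 0x55, 0xAA):
        if time.monotonic() > deadline:
            break  # stay within the time budget
        for i in range(size_bytes):
            buf[i] = pattern
        for i in range(size_bytes):
            if buf[i] != pattern:
                errors.append((i, pattern, buf[i]))  # (offset, wrote, read)
    return errors
```

On healthy memory this returns an empty list; on a machine with a stuck or flaky bit inside the tested range it would report the offending offsets.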
And for the record, I'm looking at this mostly on computers and phones, but this affects *every* device. Routers, printers, etc... you name it. That fancy ARM-based MacBook with RAM soldered onto the CPU package? We've got plenty of crashes from those; good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job. 5/5
@gabrielesvelto is it lots of different devices, each one experiencing rare crashes at random, or is there a small number of really shitty computers accounting for a large share of the crashes?
@gabrielesvelto and what is the ratio of people who ever get a (bit-flip) crash out of all those who opted into this telemetry?
@guenther I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate.
@guenther generally speaking, a single machine won't send a lot of crashes. It's very common for them to have only one bad bit across their whole installed RAM. They'll hit it eventually, especially if it's in the lower address ranges, but not all of the time. And for a crash to happen, some important data needs to end up there, like a pointer or an instruction.
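To see why a pointer is such a sensitive target, consider what a single flipped bit does to an address. A toy illustration (the address here is made up):

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, as a faulty DRAM cell might."""
    return value ^ (1 << bit)

addr = 0x7F3A00001000              # a hypothetical heap pointer
print(hex(flip_bit(addr, 30)))     # prints 0x7f3a40001000
```

A flip in a high bit moves the pointer a gigabyte away, almost certainly into unmapped memory, so the next dereference crashes; the same flip in ordinary payload data may go entirely unnoticed.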
@gabrielesvelto @guenther why would the lower address ranges be special? This is confusing to me. I can't imagine firefox runs on any OS without virtual memory (or ASLR), so it doesn't seem like that should correlate strongly with any physical aspect.
@vathpela @guenther I meant the lower *physical* address ranges, because they're more likely to be used early on, even on a lightly loaded machine. I once had a laptop with a bad bit at the very end of the physical range; I would only hit it when running Firefox OS builds, which were massive (basically building Firefox + a good chunk of Android's base system at the same time).
@gabrielesvelto @guenther That still seems really weird to me - why would firefox be likely to get a low physical address? If anything is likely to have a higher chance of getting that memory, I would think it would be the kernel (which has genuine lowmem requirements sometime for e.g. dma bufs and such on some platforms), but a userland process seems odd.
@vathpela @guenther oh, it's not Firefox specifically. Users with bad bits in lower address ranges will be more likely to encounter problems with *everything*, including the kernel. I also don't literally mean the *lowest* ranges. Say, if you have a bad bit in the first GiB of physical memory you'll see its effects far more often than if it's in the last GiB of a 32 GiB machine.
@gabrielesvelto @vathpela @guenther The Linux kernel has CONFIG_SHUFFLE_PAGE_ALLOCATOR to randomize which memory gets allocated first, and distros generally build with it enabled, but probably nobody actually activates it with the boot param page_alloc.shuffle=y ;)
@vbabka @gabrielesvelto @guenther all boot params are policy failures :)

@vbabka If it is not enabled by default, then it is not important.

@oleksandr @vbabka @gabrielesvelto @guenther no seriously, basing things on boot params should just be considered a bug. It's always a bad choice, usually thought to be necessary because of some other bad choice or trade-off.
@vathpela @oleksandr @gabrielesvelto @guenther it's meant for hardening and there's some performance trade-off, which is typical. But IMHO it's better if hardening options can be enabled just by boot parameters and not require a different distro kernel flavor.
@vbabka @oleksandr @gabrielesvelto @guenther you should be able to turn it on with a running kernel.
@vbabka @oleksandr @gabrielesvelto @guenther IMO it's never about boot time vs compile time, and always about being able to turn it on and transition in to it. Of course that's sometimes the hardest way and why we make bad trade-offs, but it also keeps us from being able to enable a lot of features we want on the boot path.
@vbabka @oleksandr @gabrielesvelto @guenther Obviously I have some bias here, but it's because people want things from booting that command line variability makes intractable.
@vbabka @oleksandr @gabrielesvelto @guenther and other OSes simply do not have this problem at all. It's optional.

@vathpela How would that even work? Any memory allocated up until the point where you switch the setting would have to remain in place, negating the benefits of randomized allocation for everything that starts early. And both allocators would have to work with the same in-memory format, which may or may not be possible?

@guenther @vbabka @oleksandr @gabrielesvelto right, like I said there are always difficulties and trade-offs. You might have to flip the switch and then re-start tasks or kexec or other things, or something else (who knows, someone would have to design it).
@vathpela @vbabka @oleksandr @gabrielesvelto @guenther sounds like a stupid comment. A running kernel has already completed the majority of memory allocations it will ever need, so toggling such an option by then would have no effect. Unless you want such toggling to force a complete realloc of all kernel memory, which would be even more stupid.

@hyc And while you are solving this, make it possible to page out kernel memory :D.

@gabrielesvelto I had some faulty memory that only showed symptoms after new memory was added to the machine: the physical memory layout changed, which pushed the faulty bit of RAM somewhere that was used a lot more often. Took me a few days to nail that one down.
@gabrielesvelto @vathpela @guenther Similar to my recent RAM issues where the crash would only happen when the build would fill > 125GB of RAM because the broken bits were at the end ...
@gabrielesvelto i have found every one of your discussions of this topic immensely fascinating and have been able to revise many assumptions i had about the cpu and memory system. i want to additionally commend you for both identifying that more invasive telemetry could have been useful and then making it unequivocal that it's always opt-in and still anonymized on top of that. i have had to push back very strongly on this sort of thing before and it takes my breath away to find someone else with extremely high standards for measurement work and user safety
@gabrielesvelto @guenther plus it makes sense that with a global userbase some % of users have kit running in places with nearby sources of radioactivity, possibly unknown to them. Imagine folks living near the CEZ or Fukushima, but radioactivity also occurs naturally in some parts of the earth and is invisible to the layman.
@guenther @gabrielesvelto ah yes, bit-flips georg, who lives in a plutonium mine and gets a thousand bit-flip related crashes every day, is an outlier and should not have been counted
@gabrielesvelto hopefully those MacBooks could run Linux with the badram option.
@gabrielesvelto I did not know that bit flips refer to reproducible bad RAM issues... I thought they were random...
@adingbatponder people and research have usually focused on random bit-flips caused by high-energy radiation and similar phenomena. Actual RAM going bad is a poorly documented and researched problem, mostly because the industry doesn't care. This is a more extensive thread on the issue: https://fosstodon.org/@gabrielesvelto/112407741329145666
> Memory errors in consumer devices such as PCs and phones are not something you hear much about, yet they are probably one of the most common ways these machines fail. I'll use this thread to explain how this happens, how it affects you and what you can do about it. But I'll also talk about how the industry failed to address it and how we must force them to, for the sake of sustainability. 🧵 1/17
@gabrielesvelto @adingbatponder At the end of that thread: " I'd also like to point out that we've got preliminary data on the topic, but I fully intend to write a proper article with a detailed analysis of the data. 17/17"

Was that article published, or is it approaching publication? I'd be very interested.

@raulinbonn @adingbatponder I never had the time to write it, it's on my TODO list for this year
@gabrielesvelto @adingbatponder Great! In that thread you also mention something else that is quite important: RAM will only deteriorate and get worse with age and usage. So even though this is the absolute worst time in history to be shopping for new RAM, I plan to re-run RAM tests on this PC some time soon to see how it's doing.
@gabrielesvelto @adingbatponder I had a coworker whose PhD was on analyzing bit errors; he concluded that running without at least ECC RAM, particularly in a data center setting, was madness. But it also had serious repercussions for users doing data analysis on their (non-ECC) desktops for research.
@trouble @adingbatponder yes, at the datacenter level the amount of errors you get is enormous. SECDED ECC doesn't cut it there anymore so usually more robust detection/correction systems are used.

@gabrielesvelto As a personal anecdote, I built a PC in 2017 with 16 GB of DDR4 RAM that I got from Amazon (Germany). I had to return it after extensive testing with Passmark's free version of memtest86: it had failing bits. The replacement did pass the heavy testing. If there was one thing I wanted that PC to be, it was stable and reliable.

A few years later I got a second 16 GB kit to expand the PC to 32 GB. I had to return that kit as well; it also had errors. The replacement again passed the extensive testing. This is in fact still the PC I'm writing from now.

Manufacturers and their QA teams must be aware of their failure rates, but they likely don't care, to save costs and make higher profits. They still sell kits with some failures, because not many users subject their PCs/RAM to the torture of these long RAM tests (4 full passes or more, for sanity's sake, takes hours). And crashing here and there with normal usage is unfortunately considered almost "normal" to some extent. In my experience, the "RAM Test" offered by Windows was an absolute joke: it never found anything on kits where Memtest86 would find failures in roughly every other run.

I remember watching a YouTuber testing a gaming build he had just put together: he ran prime95 for only a few minutes. The computer did not crash, and according to him that was good enough for a gaming PC. I disagree, in particular because in that run, even though Prime95 did not crash, it showed calculation error warnings, which could well have been caused by RAM issues. In his view, for gaming it was fine as long as Prime95 didn't crash quickly. But any calculation error warning from Prime95 is a hardware stability/reliability red flag, just like any finding from memtest86.

It is a failure of the industry that ECC RAM is still not standard at least for PCs, laptops, and cellphones. Maybe it should be standard for all consumer electronics in fact.

@raulinbonn yes, both hardware and big software vendors have handwaved this problem away for years by claiming that software bugs are more common. In my testing, hardware issues are common enough that they often drown out the software issues.

@raulinbonn @gabrielesvelto
I had a G.Skill DDR4 kit (RipJaws V, 2x16GB) go bad after about 4 years (I had crashes, and memtest86 confirmed it was the memory). I RMA'd it and got a replacement which worked fine at first but broke after 2½ years. I then switched to more expensive ECC RAM from Kingston (thankfully a few months before the prices skyrocketed); no issues since. I hope it stays that way..

So yeah, memory failure is not uncommon at all :-/

@raulinbonn @gabrielesvelto
BTW, I agree that new computers must run prime95 and memtest86 without any errors. Any miscalculation indicates faults that might, under other circumstances, also lead to crashes; "it's good enough for gaming" is pure BS

@raulinbonn

> And crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately.

Are these people aware that a bit flip in some file system code could nuke somebody's hard drive?

@argv_minus_one @raulinbonn yes, the worst outcome of a bit-flip is when data that will be written to disk happens to overlap it, because the corruption then makes it all the way to the drive. And BTW this is one of the reasons why competent filesystems should always implement checksums for both data and metadata: it increases the chances of detecting these issues early, before they do permanent damage.
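The checksumming idea is simple enough to sketch: store a checksum next to each block on write, verify it on read, and a flipped bit becomes a detectable error instead of silent corruption. A toy sketch using CRC32 (real filesystems use stronger checksums and keep them in separate metadata; the 4-byte header layout here is made up):

```python
import zlib

def write_block(data: bytes) -> bytes:
    """Prepend a CRC32 checksum to the data, as a checksumming
    filesystem conceptually does for data and metadata blocks."""
    crc = zlib.crc32(data)
    return crc.to_bytes(4, "little") + data

def read_block(block: bytes) -> bytes:
    """Verify the stored checksum before returning the data."""
    crc = int.from_bytes(block[:4], "little")
    data = block[4:]
    if zlib.crc32(data) != crc:
        raise IOError("checksum mismatch: block corrupted")
    return data

block = bytearray(write_block(b"important data"))
block[7] ^= 0x01              # simulate a single bit-flip in the payload
try:
    read_block(bytes(block))  # detected instead of silently returned
except IOError as e:
    print(e)
```

The checksum can't repair the flipped bit, but it turns silent corruption into a loud, early error.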

@gabrielesvelto

Checksums will let you detect that there is a problem, but won't actually save your data. If a bit flip causes the file system driver to write to the wrong LBA or corrupt a key file system data structure or something, the damage will still be quite permanent unless you have a backup.

@argv_minus_one @raulinbonn yes, detection can only do so much. Even in our case there are indirect crashes that we simply cannot detect. For example in Rust code a bit-flip will often cause an invariant to be broken, so the code will fail cleanly but the origin of the crash will be lost as it happened before the point where we can detect it.

@raulinbonn @gabrielesvelto I can concur: I've had RAM modules that would fail memtest brand new.

However, if you want reliable data, I suggest looking for data on server RAM going bad.
Servers have ECC and multiple failover modes (failed addresses get remapped, spare modules, etc.). I'm pretty sure some research exists on how fast RAM chips go bad.

@atis @raulinbonn there have been lots of studies about memory failure in servers but they're not very relevant to what users see because of the different conditions they happen in. Servers run in controlled environments, with cleaner power delivery, controlled temperature, lower clocks and shorter lifespans than client devices. So even the same physical DRAM chips will behave differently between a server and a client device.
@gabrielesvelto @raulinbonn I mean, it would give some sort of baseline.
As for timespan, I would disagree: a server running 24/7 for 10 years easily beats any office user (40 hours/week) or typical home user (even less)

@atis @raulinbonn in usage hours yes, a server will always beat a client device (well, maybe not an always-on one such as a phone) however there is an extremely strong correlation in our data between machine age and the number of observed failures. Anyway this is an older but still relevant study if you're interested: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf

The rates that they reported are way, way lower than what I see.

@raulinbonn @gabrielesvelto ECC RAM is only worth anything if the error corrections are being monitored and there's a process in place to replace the faulty component. If not then it's just a waste of money to generate warnings in a log file nobody reads.
@raulinbonn @gabrielesvelto @stark ECC RAM detects and corrects single-bit errors, making those systems immune to this issue.
The BMC reports when a DIMM has gone bad and is spewing errors
@stark @raulinbonn I disagree: even simple SECDED ECC would significantly lengthen the lifetime of consumer electronics. It takes a while before a system develops multiple bit failures within the same chunk that cannot be corrected (unless it's a catastrophic failure such as an entire bit-line/word-line going bust)
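For intuition on how single-error correction works, here is the simplest case: a Hamming(7,4) code, where three parity bits pinpoint the position of any single flipped bit so it can be inverted back. This is only a toy; real SECDED DRAM codes work on 64-bit words plus 8 check bits, and add an overall parity bit so double errors are detected rather than miscorrected.

```python
def encode(nibble):
    """Hamming(7,4): turn 4 data bits into a 7-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]   # covers codeword positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]   # covers codeword positions 4,5,6,7
    # codeword positions 1..7: p1 p2 d0 p4 d1 d2 d3
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]

def decode(bits):
    """Recompute the parities; a nonzero syndrome is the 1-based
    position of the flipped bit, which we invert to correct it."""
    b = bits[:]
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s4 = b[3] ^ b[4] ^ b[5] ^ b[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    if syndrome:
        b[syndrome - 1] ^= 1  # correct the single flipped bit
    nibble = b[2] | (b[4] << 1) | (b[5] << 2) | (b[6] << 3)
    return nibble, syndrome
```

Flipping any single bit of a codeword still decodes to the original nibble, which is why a module with one slowly-failing bit keeps working for a long time under ECC.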
@raulinbonn @gabrielesvelto It's too easy to blame the industry when, on the other side, almost no user is ready to pay the ~12% premium for ECC RAM and the additional logic on the mainboard.