Memory errors in consumer devices such as PCs and phones are not something you hear much about, yet they are probably one of the most common ways these machines fail.

I'll use this thread to explain how this happens, how it affects you and what you can do about it. But I'll also talk about how the industry failed to address it and how we must force them to, for the sake of sustainability. 🧵 1/17

First of all, let's talk briefly about how memory works. What you have in your PC or phone is what we call dynamic random access memory: memory that stores bits by putting a minuscule amount of charge into vanishingly small capacitors (or not putting it in, if we're storing a zero).

These capacitors continuously leak this charge, so it needs to be refreshed periodically - every few milliseconds - which is why it's called "dynamic". 2/17

This design is *extremely* analog in nature. When your machine needs to read some bits, the capacitors holding them are connected to a bunch of wires. The very small voltage difference that appears on each wire is detected by a circuit that turns it into a clear 0 or 1 value (this is called a sense amplifier). 3/17
So how can this fail? In a huge number of ways. Circuits age with time and use. The ability of the individual capacitors to hold the charge goes down slowly over time, the transistors in the sense amplifiers degrade, points of contact oxidize, etc... Past a certain point this can make the whole process end up outside of the thresholds required to reliably read, write and retain the bits in memory. 4/17
This can lead to different failures: a very common one is a stuck bit, which ends up always being read as 1 or 0, regardless of what was written into it. Another type is timing-dependent failures, which cause a bit to flip, but only if it isn't touched soon enough by an access or a refresh. More catastrophic errors can affect entire lines - which is what happens when a sense amplifier starts to fail. 5/17
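
(A minimal sketch of what a memory tester does conceptually, written as a user-space C loop purely for illustration. Real testers like memtest86+ run on bare metal so they can walk physical addresses; the buffer size and patterns below are arbitrary choices for this example, not what any specific tool uses.)

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WORDS (1u << 20)   /* 8 MiB of 64-bit words; an arbitrary illustrative size */

/* Write a pattern to the whole buffer, then read it back and count mismatches.
 * A bit that reads back wrong for every pattern is a stuck bit; one that only
 * fails after the buffer has sat untouched for a while points at a retention
 * (refresh) problem instead. */
static size_t check_pattern(volatile uint64_t *buf, uint64_t pattern) {
    size_t errors = 0;
    for (size_t i = 0; i < WORDS; i++)
        buf[i] = pattern;
    for (size_t i = 0; i < WORDS; i++)
        if (buf[i] != pattern) {
            printf("mismatch at word %zu: wrote %016llx, read %016llx\n",
                   i, (unsigned long long)pattern, (unsigned long long)buf[i]);
            errors++;
        }
    return errors;
}

int main(void) {
    volatile uint64_t *buf = malloc(WORDS * sizeof(uint64_t));
    if (!buf)
        return 1;
    size_t errors = 0;
    errors += check_pattern(buf, 0x5555555555555555ull);   /* 0101... */
    errors += check_pattern(buf, 0xaaaaaaaaaaaaaaaaull);   /* 1010... */
    errors += check_pattern(buf, 0x0000000000000000ull);
    errors += check_pattern(buf, 0xffffffffffffffffull);
    printf("%zu error(s) found\n", errors);
    free((void *)buf);
    return errors != 0;
}
```
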
Either way, even a single bit error that happens once in a blue moon can be catastrophic on a consumer machine. Sometimes it will just cause a pixel to slightly change color, but sometimes it will affect an important computation and lead to a crash. Or worse: it'll corrupt some user data before it's written to disk, and once it is, the damage has become permanent. 6/17
If your machine exhibits rare but hard-to-explain crashes, if you're forced to reinstall programs - or even the operating system - because of mysterious failures, or if you experience random reboots or BSODs, then it's very likely that your memory is failing and you need to replace it. 7/17

Diagnosing it is hard. Windows has a memory diagnostic tool which will catch the worst offenders and is easy to use: https://www.microsoft.com/en-us/surface/do-more-with-surface/how-to-use-windows-memory-diagnostic

It's not enough though: some errors can only be caught with more extensive testing. I recommend the open-source memtest86+ tool (https://memtest.org/) or the closed-source memtest86 one (https://www.memtest86.com/). 8/17

Naturally what happens on PCs also happens on phones, network devices, printers, TVs, etc... but you can't test them. This is a disaster because these failures are common, and they become more and more common as the device ages. If we want to have repairable devices that last for a long time, the industry will have to change its practices, but more about this later. 9/17

Now you might wonder: how often does this actually happen? The common wisdom on this topic is that hardware failures are so rare that software bugs will always dwarf them. As I found out, this is demonstrably false.

While investigating Firefox crashes I've come to the conclusion that several of the most common issues we were dealing with were likely caused by flaky hardware. This led me to come up with a simple heuristic to detect crashes potentially caused by bit-flips. 10/17
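
(The thread doesn't spell out the heuristic itself, so here is a hypothetical sketch of the general idea in C: if the address a crash faulted on differs by exactly one bit from a plausible "intended" address found in the crash dump, a flipped bit is a likely culprit.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch, not Mozilla's actual code: treat a crash as a
 * potential bit flip when the faulting address differs from a plausible
 * "intended" address (e.g. a valid pointer found near it in the dump)
 * in exactly one bit position. */
static bool looks_like_bit_flip(uint64_t faulting_addr, uint64_t candidate_addr) {
    uint64_t diff = faulting_addr ^ candidate_addr;
    /* non-zero and a power of two means exactly one differing bit,
     * e.g. 0x7f1200001008 vs 0x7f1200000008 differ only in bit 12 -> true */
    return diff != 0 && (diff & (diff - 1)) == 0;
}
```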

Deploying this heuristic to Mozilla's crash reporting infrastructure has been eye opening: if I take the 10 most common crashes on Windows, 7 are out-of-memory conditions - that is, not bugs - and 3 are likely caused by bad memory.

You read that right: three of the ten most common reasons why Firefox crashes on Windows come down to memory that's gone bad. 11/17

Now there are a few things worth mentioning. The first is that users with bad hardware will be over-represented in this category, since their machines crash far more often than others'.

The second thing is that Firefox is exceptionally stable: we've driven down its crash rate by more than 70% in the last few years. But Firefox is also a 30-million-lines-of-code monster. There are bugs in there, but they're less common than hardware failures! 12/17

Plotting these types of crashes against time yields interesting trends: the more machines age, the more likely they are to encounter hardware-related failures. You might think that's obvious, and indeed it is, but until now the industry has looked the other way, based on the hand-wavy excuse that hardware failures were less common than bugs. 13/17
So what needs to change? First of all, error detection and correction must become commonplace. You can already build a desktop machine with ECC memory (https://en.wikipedia.org/wiki/ECC_memory), but it's uncommon in laptops, even mobile workstations, and completely absent on phones and other consumer appliances. This will measurably lengthen the usable life of these devices. 14/17

Note that detection is more important than correction. The user needs to know that something is wrong without having to run a memory testing program. Think of the lights that turn on in a car when something is malfunctioning, or the error beeps your washing machine makes when it thinks it's leaking water. These are extremely common; computing devices need them too. 15/17
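
(For readers who want to see concretely how detection and correction work, here's a toy Hamming(7,4) example in C. Real ECC DIMMs use a SECDED code over 64-bit words rather than this 4-bit toy, but the principle - parity bits whose failure pattern points at the flipped bit - is the same. This is an illustration only, not what any particular memory controller implements.)

```c
#include <stdint.h>
#include <stdio.h>

/* Toy Hamming(7,4) code: 4 data bits protected by 3 parity bits, enough to
 * locate (and therefore correct) any single flipped bit. */

static uint8_t encode(uint8_t data) {               /* data = 4 bits, d0..d3 */
    uint8_t d0 = data & 1, d1 = (data >> 1) & 1, d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;                      /* covers codeword positions 1,3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;                      /* covers positions 2,3,6,7 */
    uint8_t p4 = d1 ^ d2 ^ d3;                      /* covers positions 4,5,6,7 */
    /* codeword layout, positions 1..7: p1 p2 d0 p4 d1 d2 d3 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

/* XOR together the positions of all set bits: 0 means the word is consistent,
 * anything else is the position of the flipped bit. */
static uint8_t syndrome(uint8_t codeword) {
    uint8_t s = 0;
    for (int pos = 1; pos <= 7; pos++)
        if ((codeword >> (pos - 1)) & 1)
            s ^= (uint8_t)pos;
    return s;
}

int main(void) {
    uint8_t stored = encode(0xB);                   /* store the nibble 1011 */
    stored ^= 1 << 4;                               /* simulate a flip at position 5 */
    uint8_t s = syndrome(stored);
    printf("syndrome: %u\n", (unsigned)s);          /* detection: prints 5 */
    if (s)
        stored ^= 1 << (s - 1);                     /* correction: flip it back */
    printf("recovered: %s\n", stored == encode(0xB) ? "yes" : "no");
    return 0;
}
```
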
Finally, hardware design must change to make devices repairable and prolong their useful life. Yes, I'm looking at non-ECC memory soldered onto the motherboard or, worse, onto the same substrate as the CPU. 16/17
To end the thread, I'd like to thank my colleagues Alex Franchuk and @willcage, who did the implementation work, and my boss Gian-Carlo Pascutto, who plotted crashes against machine age. I'd also like to point out that we've only got preliminary data on the topic so far, but I fully intend to write a proper article with a detailed analysis of the data. 17/17
@gabrielesvelto
Thanks for an interesting thread! It brought back memories of 90s computing, which was full of repeatedly running memory checkers and reseating chips - and of how that taught me, for a while, to only accept motherboards which supported ECC memory. Indeed, that's no longer an option available for our devices.

@gabrielesvelto @willcage IBM designed parity checks everywhere in the System/360, so that the system would quickly stop in the event of a hardware failure. ECC was implemented in main memory systems when it was discovered that solid state RAM was subject to transient failures from cosmic rays (really). Early PCs used memory parity checks (which weren't adequate) until Apple abandoned them for cost reasons. Bad mistake.

https://www.bbc.com/future/article/20221011-how-space-weather-causes-computer-errors

@ravenonthill @gabrielesvelto @willcage what was the year when Apple abandoned them?

@flyingsaceur @ravenonthill @gabrielesvelto @willcage

1976... I understand the point they are making, but it is somewhat disingenuous.

Apple started manufacturing computers targeted at enthusiasts who wanted something affordable. The lack of parity was not exclusive to Apple; few home computers had it. Five years later, when the IBM PC came out, it was aimed at businesses and did support parity (and was very expensive).

Having said that, I also bemoan the lack of ECC capable systems daily.

@lgerbarg @ravenonthill @gabrielesvelto @willcage Yeah, that's what I thought. I haven't thought about ECC DRAM in like ten years and that's bothering me. But no discussion of bit flips would be complete without the expert testimony from Bookout v. Toyota, that's a real horror classic.
@ravenonthill @gabrielesvelto @willcage
Squawk! Pieces of eight.
Squawk! Pieces of eight.
Squawk! Pieces of seven.
Program halt, parroty error.
@gabrielesvelto @willcage Looking forward to your article, Gabriele.

@gabrielesvelto @willcage

Thanks for the thread, and the interesting data. As the author of another memory-testing tool, I'd like to point out that memory errors are also quite commonly triggered by things other than the DRAM actually going bad. This is one of the reasons that swapping memory sometimes doesn't make the errors go away.

Includes: flaky power supply, overheating CPU or RAM, micro-cracks or scratches in the mainboard, oxidized contacts in memory slots or on modules ...

1/x

@gabrielesvelto @willcage

... the list goes on. Sometimes I'm amazed anything computer-related works at all.

My tool, by the way, is called `memtester` - not really of use for Windows users, but it works on anything vaguely Posix-like, including Linux, the BSDs, all commercial Unices, and some weird stuff like VxWorks -- and VMS, the weirdest of them all.

It's actually pretty good at finding intermittent errors that memtest86+ misses.

</plug>

#memory #error

@cazabon @willcage yeah, there are definitely a bunch of other reasons. I've seen transient memory errors go away on old boxes by pulling the DIMMs out, cleaning the contacts on the motherboard and also "drenching" the chips with some contact cleaner spray, so that it gets rid of the dust that might have accumulated around the solder balls below the chip.
@gabrielesvelto Man, great thread! Please give a reference to the recent Voyager 1 issue and turn this into a blog post.

@gabrielesvelto @willcage

So this reminded me of a Linux kernel feature I'd read about some time back where the kernel could be told on boot to avoid certain bad sections of RAM, so I searched for it and it's there (see the Grub option GRUB_BADRAM). But I also discovered that there's now a kernel option ("memtest=") that will run a memory test on boot and automatically avoid addresses it detects as bad, which is *really* *cool*.

@gabrielesvelto @willcage

I think there's a lot that can be done in this vein at the OS level. Consider an OS thread that, during times of low activity, tests individual pages of RAM and adds failures to the kernel's block list (and also exports it to userland so it can be saved for the next boot).

That would let one continue to use failing RAM (mostly) safely for as long as it was still *mostly* okay and keep it out of e-waste longer.
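
(A very rough user-space sketch of that scrubbing idea in C, just to make it concrete: write a reproducible pseudo-random pattern to a page, read it back, and flag the page if anything differs. A real implementation would have to live in the kernel, operate on physical pages that are currently free, and feed failures into the kernel's bad-page list; none of that is possible from a normal process.)

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define PAGE_WORDS (PAGE_SIZE / sizeof(uint64_t))

/* Small xorshift generator so every page can be filled with a reproducible
 * pseudo-random pattern derived from its own seed. */
static uint64_t next_word(uint64_t *state) {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    return *state;
}

/* Fill one page with the pattern, then replay the generator and compare:
 * any difference means the page couldn't hold its contents. */
static int scrub_page(uint64_t *page, uint64_t seed) {
    uint64_t state = seed ? seed : 1;
    for (size_t i = 0; i < PAGE_WORDS; i++)
        page[i] = next_word(&state);
    state = seed ? seed : 1;
    for (size_t i = 0; i < PAGE_WORDS; i++)
        if (page[i] != next_word(&state))
            return -1;          /* failed: a kernel would block-list this page */
    return 0;
}

int main(void) {
    uint64_t *page = malloc(PAGE_SIZE);
    if (!page)
        return 1;
    printf("page %s the scrub\n", scrub_page(page, 0xdeadbeefull) ? "failed" : "passed");
    free(page);
    return 0;
}
```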

@gabrielesvelto Working with an existing non-ECC system, couldn't some of this be caught if the OS ran a low-priority process scrubbing memory, e.g. writing and checking a random but checksummed pattern on free pages (similar to ZFS scrubbing)? Even better, important data structures should be checksummed (something I actually did in a database engine I wrote many decades ago).
@tommythorn one of the things we plan to do is running a brief, low-priority scan in processes that crashed. If we find some bad memory we'll notify the user and flag future crashes from their machine as low-value so we don't spend too much time looking at them.
@gabrielesvelto My dream for a while has been that ECC memory becomes as commonplace as encryption suddenly did circa 2013, and not just some weird thing only I and a few of my nerd friends do because we're overcautious and weird. 😌

@gabrielesvelto Really depressing that we've reached the physical limits of creating "memory we're confident will actually store its value reliably" :(.

We've gone from PARITY CHECK 1/2 to "memory works fine without detection or correction" to "oh, now not even a parity check is enough". In that sense, it's WORSE than 40 years ago :P.

@cr1901 yes, it is worse than 40 years ago! This is an area where we've actively regressed
@gabrielesvelto @cr1901 maybe that's why it sometimes felt like those old machines were rock-solid in spite of their limitations: hardware has become less reliable faster than software became more reliable.

@jbqueru @gabrielesvelto @cr1901

Also many of the machines just had simple software. Something like Applesoft BASIC was just easier to keep bug-free (ish) than the giant software of today.

Software of the same size and complexity is more reliable today, but what they actually shipped back then was much smaller and simpler.

@gabrielesvelto Maybe saving someone else a Web search: Error correction code memory (ECC memory) is a type of computer data storage that uses an error correction code (ECC) to detect and correct n-bit data corruption which occurs in memory: https://en.wikipedia.org/wiki/ECC_memory.

@tomayac good point, I'll add the link to the post
@gabrielesvelto LPDDR5 has inline ecc, all the time,
@ColmDonoghue that only applies to the link. It's so noisy that it wouldn't work without it. The memory arrays are unprotected.
@gabrielesvelto Some of this is user error too, because of the way people love to overclock PCs: you cut the RAM timings and up the clocks on your new Ultimate RGB Gaming PC until it can run a game for twenty minutes without you noticing a problem, then leave it like that forever. It's not really stable; it just flips an important bit once every 6 hours instead of 6 times an hour now.

@gabrielesvelto

How can memory errors cause the same crash on two different computers unless they both have the same error at the same address? (Which is of course astronomically unlikely.)

@ali1234 because statistically they happen more frequently in pieces of code that touch a lot of memory. Firefox's JavaScript garbage collector is one such example. It traverses the heap using the GC's typical mark & sweep behavior and touches thousands upon thousands of objects, crossing an enormous number of pointers. Because it's far more likely to hit a bad bit than the rest of the code, it will show up far more often. Same for code that traverses huge hash tables, etc...
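
(A contrived C illustration of why pointer-chasing code is so exposed: flip a single bit in a stored pointer and the next traversal lands on a wild address. Here the flip is done by hand; on a failing machine the DRAM does it for you.)

```c
#include <stdint.h>
#include <stdio.h>

struct node { struct node *next; int value; };

int main(void) {
    struct node b = { NULL, 2 };
    struct node a = { &b, 1 };

    /* flip one bit in the stored pointer by hand; on a failing machine
     * the memory does this for you */
    uintptr_t p = (uintptr_t)a.next;
    p ^= (uintptr_t)1 << 40;
    a.next = (struct node *)p;

    printf("pointer was %p, is now %p\n", (void *)&b, (void *)a.next);
    /* Dereferencing a.next at this point would almost certainly hit an
     * unmapped address and crash. The more pointers a workload chases - a
     * GC marking the heap, code walking a huge hash table - the more likely
     * it is to stumble over the one bad bit. */
    return 0;
}
```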

@gabrielesvelto

I see. A similar phenomenon was observed in initializing bitcoin full nodes, where verifying the full chain (currently 5TB and growing) exposed a lot of memory errors due to hash mismatches.

That happened essentially because everyone was doing the exact same calculations on the exact same data, in the exact same order, and expecting a known result.

I would have expected a lot more randomness in Firefox, but I forgot how much manual memory management it does.

@ali1234 this is very interesting, yeah that's another workload which ends up touching a ton of memory
@gabrielesvelto I appreciate your focus on improving diagnosis. Those over-represented crashes distract people from real bugs. I don't understand why you included OOMs though -- allocation failures can and will happen, and it's our responsibility to handle them gracefully.

@rlb we've reduced OOM crashes massively in the past few years: https://hacks.mozilla.org/2022/11/improving-firefox-stability-with-this-one-weird-trick/

We already handle gracefully all the failures that we can realistically handle, but for a lot of them there's nothing we can do. That's especially true on non-Windows platforms where allocations never fail. On both Linux and Android the kernel will kill processes to save memory without informing them or allowing for any type of reaction, so graceful handling is impossible.

@gabrielesvelto nice! The oomkill behavior is definitely hard to cope with. I've gotten a lot of mileage out of using the cgroups memory pressure signals, but it's certainly not perfect.
@gabrielesvelto with the number of installs you have, you are probably also getting cosmic-ray-induced errors

@gabrielesvelto FYI, this seems to be a nonnegligible cause of death of Nintendo 3DS units. some units either exhibit strange behavior, corruption in text characters (which turn out to be single bitflips), or just straight up refuse to boot

it's possible to demonstrate these are indeed DRAM* errors by using the boot9strap jailbreak (with ntrboot if not installed beforehand), as these run from SoC-internal SRAM instead of DRAM. booting the OS then typically fails, and it can also be used as a point to run a memtest

problem is that replacing the DRAM chips is *very* difficult. not only would it require BGA rework (because it's so small they couldn't not solder it on the mobo), Nintendo also used epoxy 'underfill' to glue the DRAM chip to the PCB to deter RAM probing attacks (as those were used against the DSi, the 3DS' predecessor), see the "white glue" here: https://giltesa.com/wp-content/uploads/2013/12/Nintendo_3DS_PCB-Top.jpg

*: DRAM is more often called FCRAM on the 3DS because that's the type of DRAM by fujitsu it uses

@pcy @gabrielesvelto early-revision SNES consoles also fail regularly, and it seems most people point to the VRAM as the failure point. Although I'm not sure if that's been confirmed.
youtube video ID (of a video capture I made): e3P1c9eXKqo

@gabrielesvelto though, when running such a memtest, in most cases the errors seem to have the same pattern throughout the entire DRAM.

this means it's probably a solderball crack under the DRAM or SoC, rather than actual semiconductor failure, though the latter is also sometimes observed

@gabrielesvelto thanks for an interesting thread!

I read talk, some years ago, about memtest86+ not being maintained and actually not working at all for most cases (iirc it was in the context of the Ubuntu live CD boot option)

Do you know how it looks these days?

@kamstrup that was true for a while; then it was forked into PCMemTest by Martin Whitaker and merged back with the memtest86+ codebase, and now it's maintained by both the original author (Sam Demeulemeester) and Martin Whitaker. The 7.0 major release dates back to this January.

@gabrielesvelto FWIW, I wrote a "memtest.js" version that runs in the browser. I even got some bad RAM sticks from Mozilla RelEng to verify that it could detect real failures!

Live version still works, too: https://dolske.net/hacks/memtest.js/live/

@gabrielesvelto Of course there are some limitations, since JS (thankfully) doesn't have bare-metal access. But I wanted to see if periodically testing whatever chunks of memory the browser/OS gave out would work well enough.

That is, it can't say "all clear", but it can say "problems found". The idea being to eventually have the browser itself run a small background check, which over time should either detect any bad bits or give confidence that things seem OK.

@gabrielesvelto Alas it was just a side project for fun, so I set it aside after the proof-of-concept.

It seems like an interesting problem space, so I hope you get good results!

Oh, the old code: https://github.com/dolske/memtest.js

(It was also an excuse to play with the then-new asm.js, for the hot bitwise-op loops. That code is lost, but IIRC it wasn't any faster because whatever JIT we used then already did a good job.)
