Memory errors in consumer devices such as PCs and phones are not something you hear much about, yet they are probably one of the most common ways these machines fail.

I'll use this thread to explain how this happens, how it affects you and what you can do about it. But I'll also talk about how the industry failed to address it and how we must force them to, for the sake of sustainability. 🧵 1/17

First of all let's talk briefly about how memory works. What you have in your PC or phone is called dynamic random-access memory (DRAM): memory that stores bits by putting a minuscule amount of charge into vanishingly small capacitors (or not putting it in if we're storing a zero).

These capacitors continuously leak this charge, so it needs to be refreshed periodically - typically every few tens of milliseconds - which is why it's called "dynamic". 2/17
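
To make the refresh requirement concrete, here is a toy Python model. It assumes the charge decays exponentially and that a cell still reads as 1 while more than half of its charge is left. Both are simplifications, and `tau_ms`, the 0.5 threshold and the numbers in the note below are made up for illustration, not real DRAM parameters.

```python
import math


def read_bit(initial_charge, tau_ms, elapsed_ms, threshold=0.5):
    """Return the bit seen after `elapsed_ms` of leakage (toy exponential model)."""
    charge = initial_charge * math.exp(-elapsed_ms / tau_ms)
    return 1 if charge > threshold else 0


def survives(tau_ms, refresh_interval_ms):
    """True if a stored 1 is still read as 1 right before the next refresh."""
    return read_bit(1.0, tau_ms, refresh_interval_ms) == 1
```

With these made-up numbers, `survives(200, 64)` is true while `survives(80, 64)` is false: a cell that leaks faster loses its bit before the next refresh comes around.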

This design is *extremely* analog in nature. When your machine needs to read some bits, the capacitors holding them are connected to a bunch of wires. The very small voltage difference that appears on the wires is detected by a circuit called a sense amplifier, which turns it into a clear 0 or 1 value. 3/17

So how can this fail? In a huge number of ways. Circuits age with time and use. The ability of the individual capacitors to hold their charge slowly goes down, the transistors in the sense amplifiers degrade, points of contact oxidize, etc... Past a certain point this can push the whole process outside of the thresholds required to reliably read, write and retain the bits in memory. 4/17

This can lead to different failures: a very common one is a stuck bit, which is always read as 1 or 0 regardless of what was written into it. Another type is timing-dependent failures, which cause a bit to flip, but only if it's not touched in due time by an access or a refresh. More catastrophic errors can affect entire lines - which is what happens when a sense amplifier starts to fail. 5/17
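
The stuck-bit case is easy to sketch in Python. `FaultyMemory` below is a hypothetical toy model, not how any real tester talks to hardware. It shows why a test has to write both a pattern and its complement: a bit stuck at 1 looks perfectly healthy as long as you only ever write 1 into it.

```python
class FaultyMemory:
    """Toy byte-addressable memory with optional stuck bits (hypothetical model)."""

    def __init__(self, size, stuck_at_one=None, stuck_at_zero=None):
        self.cells = bytearray(size)
        # Each dict maps an address to a mask of the bits stuck at that value.
        self.stuck_at_one = dict(stuck_at_one or {})
        self.stuck_at_zero = dict(stuck_at_zero or {})

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        value |= self.stuck_at_one.get(addr, 0)           # force stuck-at-1 bits on
        value &= ~self.stuck_at_zero.get(addr, 0) & 0xFF  # force stuck-at-0 bits off
        return value


def find_stuck_bits(mem, size):
    """Write 0x55 and its complement 0xAA; return {addr: mask of bits read back wrong}."""
    bad = {}
    for addr in range(size):
        mask = 0
        for pattern in (0x55, 0xAA):
            mem.write(addr, pattern)
            mask |= mem.read(addr) ^ pattern
        if mask:
            bad[addr] = mask
    return bad
```

For example, `find_stuck_bits(FaultyMemory(16, stuck_at_one={3: 0x04}), 16)` reports address 3 only on the `0xAA` pass, because `0x55` already has that bit set. Timing-dependent failures would not show up in a model like this at all, since nothing here depends on how long a value sits in memory.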

Either way, even a single-bit error that happens once in a blue moon can be catastrophic on a consumer machine. Sometimes it will cause a pixel to slightly change color, but sometimes it will affect an important computation and lead to a crash. Or worse: it'll corrupt some user data before it's written to disk, and once it is, the damage is permanent. 6/17

If your machine exhibits rare but hard-to-explain crashes, or if you're forced to reinstall programs - or even the operating system - because of mysterious failures, or you experience random reboots or BSODs, then it's very likely that your memory is failing and you need to replace it. 7/17

Diagnosing it is hard. Windows has a memory diagnostic tool which will catch the worst offenders and is easy to use: https://www.microsoft.com/en-us/surface/do-more-with-surface/how-to-use-windows-memory-diagnostic

It's not enough though: some errors can only be caught with more extensive testing. I recommend either the open-source memtest86+ (https://memtest.org/) or the closed-source memtest86 (https://www.memtest86.com/). 8/17

@gabrielesvelto FWIW, I wrote a "memtest.js" version that runs in the browser. I even got some bad RAM sticks from Mozilla RelEng to verify that it could detect real failures!

Live version still works, too: https://dolske.net/hacks/memtest.js/live/

@gabrielesvelto Of course there are some limitations, since JS (thankfully) doesn't have bare-metal access. But I wanted to see if periodically testing whatever chunks of memory the browser/OS gave out would work well enough.

That is, it can't say "all clear", but it can say "problems found". The idea being to eventually have the browser itself run a small background check, which over time should either detect any bad bits or give confidence that things seem OK.
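
A minimal Python sketch of that "problems found, never all clear" idea, not the memtest.js code itself. `check_chunk`, `verdict` and the `readback` hook are hypothetical names; the hook only exists so a fault can be injected when trying the sketch out. Note that on a real machine a read-back this soon after the write may well be served from CPU caches rather than DRAM, which is one more reason a clean run proves nothing.

```python
def check_chunk(size, patterns=(0x00, 0xFF, 0x55, 0xAA), readback=bytes):
    """Fill a freshly allocated buffer with each pattern; return the set of bad offsets."""
    buf = bytearray(size)  # whatever memory the runtime/OS happens to hand out
    bad = set()
    for pattern in patterns:
        expected = bytes([pattern]) * size
        buf[:] = expected
        seen = readback(buf)  # default: plain copy of the buffer contents
        if seen != expected:
            bad.update(i for i in range(size) if seen[i] != expected[i])
    return bad


def verdict(results):
    """Summarise many rounds: one bad round is conclusive, clean rounds are not."""
    return "problems found" if any(results) else "nothing found so far"
```

Running `verdict(check_chunk(1 << 20) for _ in range(10))` over time would approximate the background check described above, under the assumption that the allocator eventually hands out different physical pages.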

@gabrielesvelto Alas it was just a side project for fun, so I set it aside after the proof-of-concept.

It seems like an interesting problem space, so I hope you get good results!

Oh, the old code: https://github.com/dolske/memtest.js

(It was also an excuse to play with the then-new asm.js, for the hot bitwise-op loops. That code is lost, but IIRC it wasn't any faster because whatever JIT we used then already did a good job.)

@dolske thanks Justin! With @pbone we were thinking of doing just that, statistically testing crashy machines to figure out if we could spot some bad memory.