@vbabka If it is not enabled by default, then it is not important.
@vathpela How would that even work? Any memory allocated up until the point where you switch the setting would have to remain in place, negating the benefits of randomized allocation for everything that starts early. And both allocators would have to work with the same in-memory format, which may or may not be possible.
@hyc And while you are solving this, make it possible to page out kernel memory :D.
Memory errors in consumer devices such as PCs and phones are not something you hear much about, yet they are probably one of the most common ways these machines fail. I'll use this thread to explain how this happens, how it affects you and what you can do about it. But I'll also talk about how the industry failed to address it and how we must force them to, for the sake of sustainability. 🧵 1/17
@gabrielesvelto @adingbatponder At the end of that thread: " I'd also like to point out that we've got preliminary data on the topic, but I fully intend to write a proper article with a detailed analysis of the data. 17/17"
Was that article published, or is it approaching publication? I'd be very interested.
@gabrielesvelto As a personal anecdote, I built a PC in 2017 with 16 GB of DDR4 RAM that I got from Amazon (Germany). I had to return it after extensive testing with PassMark's free version of MemTest86: it had failing bits. The replacement did pass the heavy testing. If there's one thing I wanted that PC to be, it was stable and reliable.
A few years later I got a second 16 GB kit to expand the PC to 32 GB. I had to return that kit as well; it also had errors. The replacement again passed the extensive testing. This is still the PC I'm writing from now, in fact.
Manufacturers and their QA teams must be aware of their failure rates, but they likely tolerate them to cut costs and increase profits. They still sell kits with some failures, because few users subject their RAM to the torture of long tests (4 full passes or more, for sanity's sake, takes hours). And crashing here and there under normal usage is almost considered "normal" to some extent, unfortunately. In my experience, the "RAM test" offered by Windows was an absolute joke: it never found anything on kits where MemTest86 would find failures in roughly 1 of every 2 runs.
I remember watching a YouTuber testing a gaming build he had just put together; he ran Prime95 for only a few minutes. The computer did not crash, and according to him that was good enough for a gaming PC. I disagree, in particular because in that very run, even though Prime95 did not crash, it showed calculation-error warnings, which could have been caused by RAM issues. Any calculation-error warning from Prime95 is a serious hardware stability/reliability red flag, just like any finding from MemTest86.
It is a failure of the industry that ECC RAM is still not standard, at least for PCs, laptops, and cellphones. Maybe it should be standard for all consumer electronics, in fact.
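For readers curious what tools like MemTest86 actually do: the core idea is writing known bit patterns to memory and reading them back. Below is a minimal, illustrative sketch of the moving-inversions idea over a simulated memory with one stuck bit. A real tester runs on bare metal against physical addresses; the fault model, buffer size, and patterns here are made-up assumptions for the sketch.

```python
class FaultyRAM:
    """Simulated memory with one optional stuck-at-0 bit (hypothetical fault model)."""
    def __init__(self, size, stuck_offset=None, stuck_mask=0):
        self._cells = bytearray(size)
        self._off = stuck_offset
        self._mask = stuck_mask

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, i, value):
        if i == self._off:
            value &= ~self._mask & 0xFF  # this bit can never hold a 1
        self._cells[i] = value

    def __getitem__(self, i):
        return self._cells[i]


def moving_inversions(buf, pattern=0x55):
    """Write a pattern, verify it, then do the same with its inverse.
    Returns the offsets whose readback did not match what was written."""
    errors = []
    for pat in (pattern, pattern ^ 0xFF):
        for i in range(len(buf)):
            buf[i] = pat
        for i in range(len(buf)):
            if buf[i] != pat:
                errors.append(i)
    return errors
```

Running many passes with different patterns matters because a fault only shows up when the failing bit is supposed to hold the opposite value — which is part of why a quick single pass can miss what four long passes catch.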
@raulinbonn @gabrielesvelto
I had a G.Skill DDR4 kit (Ripjaws V, 2x16GB) go bad after about 4 years (crashes, and memtest86 confirmed it was the memory). I RMA'd it and got a replacement which worked fine at first but broke after 2½ years. I then switched to more expensive ECC RAM from Kingston (thankfully a few months before the prices skyrocketed); no issues since, and I hope it stays that way.
So yeah, memory failure is not uncommon at all :-/
> And crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately.
Are these people aware that a bit flip in some file system code could nuke somebody's hard drive?
Checksums will let you detect that there is a problem, but won't actually save your data. If a bit flip causes the file system driver to write to the wrong LBA or corrupt a key file system data structure or something, the damage will still be quite permanent unless you have a backup.
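To make that failure mode concrete: a single flipped bit in an in-memory block number silently redirects a write far from its intended target. A toy illustration (the LBA value and bit position are arbitrary, not taken from any real incident):

```python
def flip_bit(value, bit):
    """Toggle one bit, as a cosmic-ray-style RAM corruption would."""
    return value ^ (1 << bit)

lba = 2048                # intended target block (made-up value)
bad = flip_bit(lba, 16)   # one bit flips while the LBA sits in RAM
print(lba, "->", bad)     # 2048 -> 67584: the write lands 65536 blocks away
```

Nothing in the write path notices anything wrong, because the corrupted value is a perfectly valid block number; whatever lived at the new address is simply overwritten.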
@raulinbonn @gabrielesvelto I can concur: I have had RAM modules that failed memtest brand new.
However, if you want reliable data, I suggest looking for studies on server RAM going bad.
Servers have ECC and multiple failover modes (failed addresses get remapped, spare modules, etc.). I'm pretty sure research exists on how quickly RAM chips go bad.
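As a sketch of what ECC buys you: server DIMMs typically use a SECDED code over 64 data bits per word, but the same principle fits in a toy Hamming(7,4) code over a 4-bit value — any single flipped bit is located by the recomputed parity (the "syndrome") and corrected transparently. The encoding below is a textbook Hamming(7,4), not the code any particular DIMM uses.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword layout by 1-based position: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(code):
    """Correct up to one flipped bit; return (data, syndrome).
    A zero syndrome means no error; otherwise it is the 1-based
    position of the bit that was flipped."""
    c = code[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4,5,6,7
    syndrome = s1 + (s2 << 1) + (s3 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1        # flip the offending bit back
    data = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return data, syndrome
```

This is why ECC machines can log and count corrected errors instead of crashing: the logged syndromes are exactly the kind of failure-rate data the studies mentioned above are built on.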
@atis @raulinbonn In usage hours, yes, a server will always beat a client device (well, maybe not an always-on one such as a phone). However, there is an extremely strong correlation in our data between machine age and the number of observed failures. Anyway, this is an older but still relevant study if you're interested: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf
The rates that they reported are way, way lower than what I see.