Gabriele Svelto Mar 4

A few years ago I designed a way to detect bit-flips in Firefox crash reports and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound and a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here's a few numbers to give you an idea of how large the problem is. 🧵 1/5

Gabriele Svelto Mar 4

In the last week we received ~470000 crash reports. These do not represent all crashes because it's an opt-in system; the real number of crashes will be several times larger. Still, out of these, ~25000 crashes have been detected as having a potential bit-flip. That's one crash in every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number; it's probably going to be at least twice as much. 2/5

Gabriele Svelto Mar 4

In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%. This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem. 3/5
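As a quick sanity check on the arithmetic in the two posts above, here is a small sketch that uses only the approximate figures quoted in the thread. The "at least twice as much" factor is the author's own estimate, not something derived here.

```python
# Figures quoted in the thread (both are approximate).
total_reports = 470_000
flagged_bitflip = 25_000

# Share of received crash reports flagged as a potential bit-flip.
flagged_share = flagged_bitflip / total_reports
one_in = total_reports / flagged_bitflip

# Assumption taken from the thread: the heuristic is conservative and the
# real number is "at least twice as much".
doubled_share = 2 * flagged_share

print(f"flagged share: {flagged_share:.1%}")      # 5.3%
print(f"roughly one crash in {one_in:.0f}")       # 19
print(f"doubled estimate: {doubled_share:.1%}")   # 10.6%
```

This lines up with the rounded "one crash every twenty" and "up to 10%" figures in the posts.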

Gabriele Svelto Mar 4

And to reinforce this estimate I've looked at the numbers we got from the users who run the memory tester after having experienced a crash: for every two crashes we think are caused by a bit-flip the memory tester found one genuine hardware issue. Keep in mind that this is not doing an extensive test of all the machine's RAM, it only checks up to 1 GiB of memory and runs for no longer than 3 seconds... and it has found lots of real issues! 4/5
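To give a rough idea of what a short, bounded memory check can look like, here is a minimal sketch. It is not the Firefox memory tester: the function name, the patterns, and the default buffer size are all assumptions made for illustration, and a Python buffer goes through the interpreter, the OS, and the CPU caches, so it exercises physical RAM far less directly than a native tester would.

```python
import time


def quick_memory_test(size_bytes=16 * 1024 * 1024, time_budget_s=3.0):
    """Write known byte patterns to a buffer and read them back.

    Hypothetical helper for illustration only. Returns a list of
    (offset, expected_byte, actual_byte) tuples, one per mismatching byte.
    The time budget is only checked between passes, so a single pass
    can overrun it slightly.
    """
    patterns = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits
    deadline = time.monotonic() + time_budget_s
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        if time.monotonic() >= deadline:
            break  # out of time: stop testing
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected          # write pass
        if buf != expected:        # read-back pass (fast whole-buffer compare)
            # Slow path: locate every byte that did not read back correctly.
            errors.extend(
                (i, pattern, b) for i, b in enumerate(buf) if b != pattern
            )
    return errors


if __name__ == "__main__":
    found = quick_memory_test()
    print(f"{len(found)} mismatching bytes found")
```

An empty result only means this small buffer read back correctly during a brief window; as the post notes, a short test like this covers a fraction of the machine's RAM.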

Gabriele Svelto Mar 4

And for the record I'm looking at this mostly on computers and phones, but this affects *every* device. Routers, printers, etc... you name it. That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job. 5/5

Steven Op de beeck

@gabrielesvelto My mind was going to cheap low-end hardware, but now you’re throwing expensive Apple Silicon SoCs in the mix, it’s a bit harder to believe that they suffer from bitflips at the rates you are implying.

Val Packett 🧉 Mar 4

@stevenodb @gabrielesvelto low-end hardware might sometimes even be less likely to hit this because it's not even trying to be super fast. High-end hardware chasing the fastest speeds is pushing the limits of stability all the time.

@valpackett @stevenodb @gabrielesvelto yeah on that point, the "meta" for hardware tuning has gone from overvolting and overclocking (to use the margin of error) to undervolting, because the products are already running just about as fast as they can, and you get decent gains by making them run cooler instead of faster
i.e., they're already at the limits, and an occasional memory error in 10% of consumer hardware is probably what they're okay with

Steven Op de beeck Mar 5

@izzy @valpackett @gabrielesvelto @astraleureka My AS systems have been rock solid under load, they have literally never crashed, panicked or had crashing applications, nor have they been churning out corrupted files. The rates of bit flips people in this thread are prophesying should have surfaced in the last 5 years in some way that is visible to users. This Firefox report being the first one strikes me as oddly isolated. 10% in a few billion iPhones, iPads and MacBooks is a HUGE amount.

the vessel of morganna Mar 5

@stevenodb @izzy @valpackett @gabrielesvelto most people aren't going to understand what happened when a bitflip occurs, they're just going to think "oh my phone/tablet/laptop is glitching" and reboot it - if there is even a user-visible symptom. if it persists, they're going to get it repaired or buy a new one (or just suffer with it if they can't afford a replacement). after all, not all bitflips are permanent, many are transient and do not reoccur.
crashes are just one symptom - a lot of bitflips lead to other strange symptoms. I recall reading a paper years ago where a researcher generated bitflipped equivalents for very heavily used CDN domains, registered them and analysed the access logs. they found massive quantities of devices inadvertently hitting these domains due to transient bitflips, many of which were *not* persistent.
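The "bitflipped equivalents" of a domain can be enumerated mechanically. Below is a minimal sketch, assuming ASCII hostnames; `bitflip_variants` is a hypothetical helper written for this illustration, not code from the research mentioned above. It flips each bit of each character and keeps only results that are still letters, digits or hyphens. It does not check other hostname rules (such as labels not starting with a hyphen) or whether the resulting TLD exists.

```python
import string

# Characters allowed inside a hostname label (lowercase form).
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitflip_variants(domain):
    """Return names that differ from `domain` by exactly one flipped bit.

    Hypothetical helper: only variants made of valid hostname characters
    are kept, and the dots separating labels are left untouched.
    """
    domain = domain.lower()  # DNS names are case-insensitive
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # keep the label structure intact
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in ALLOWED:
                variants.add(domain[:i] + flipped + domain[i + 1:])
    return variants


if __name__ == "__main__":
    for name in sorted(bitflip_variants("example.com")):
        print(name)
```

For instance, flipping the lowest bit of the first `e` (0x65) gives `d` (0x64), so `dxample.com` is one of the variants of `example.com`.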

Steven Op de beeck Mar 5

@astraleureka @izzy @valpackett @gabrielesvelto that’s interesting, do you have a link?

the vessel of morganna Mar 5

@stevenodb @izzy @valpackett @gabrielesvelto looks like it's been researched a few times; the original I recall was here:
https://www.youtube.com/watch?v=9WcHsT97suU

but there's a more recent bit of research here as well: https://www.bitfl1p.com/

similar tests have been done at the IP level as well, but I can't seem to find that writeup off hand

DEF CON 19 - Artem Dinaburg - Bit-squatting: DNS Hijacking Without Exploitation

@stevenodb @astraleureka @valpackett @gabrielesvelto this was linked elsewhere in the thread
https://web.archive.org/web/20180713212603/http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinaburg_Bitsquatting_WP.pdf
compares manufacturer error ratings to real world observed results

Steven Op de beeck Mar 5

@izzy @valpackett @gabrielesvelto @astraleureka I have understood bit flips to be rare events. They have crashed airplanes and impacted elections. Now it could be that bitflips are more common, but a bitflip with measurable impact is rarer. But computers are exact calculating machines; if bit flips are common, they would also have a cumulative effect, quickly escalating to serious errors. I’m not seeing reports about this in the world. Which is why I’m sceptical, respectfully.

Gabriele Svelto Mar 4

@stevenodb high-end hardware pushes the limits of semiconductors both because of small feature size and high clocks increasing the chances of failure. In the case of DRAM the chance of malfunction increases with higher temperature too, as the ability of trench/stacked capacitors to retain charge degrades with it... and Apple puts its DRAM right next to the CPU, the single hottest place of the whole device.

Daniel Reeders Mar 5

@stevenodb @gabrielesvelto Why is that harder to believe? The SoCs are among the most complicated designs ever made on the smallest process commercially available. RAM isn't on the SoC though and Apple buys their chips from the same places everyone else does.

the vessel of morganna Mar 5

@onekind @stevenodb the RAM isn't on the SoC die, but it is on the package. not having to route super high frequency DDR5 signals across a motherboard (and not having to buffer/reclock them either) provides slightly less chance for bitflips in transit and improves latency, but it has no effect on bitflips actually occurring on the DRAM die itself

Daniel Reeders Mar 5

@astraleureka @stevenodb yes, I understand all that, but it doesn't respond to the question I asked — why does Apple hardware being implicated affect Steven's assessment of Gabriele's account?

the vessel of morganna Mar 5

@onekind @stevenodb like I said, the lack of long traces might provide a small positive benefit, but I doubt it would be significant. I agree with you, DRAM is DRAM and Apple doesn't have some miraculously better chips than other vendors, they're just closer to the memory controller than normal

the vessel of morganna Mar 5

@stevenodb actually, bitflips are becoming far more common as silicon continues to shrink. that's one of the reasons why on-die ECC is mandatory for DDR5 - the reduced feature size means increased leakage, more chance for bitflips to occur. this same shrinkage problem leads to increased error rates in other components as well, not just DRAM.
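For readers unfamiliar with how ECC recovers from a flipped bit, here is a toy sketch using the classic Hamming(7,4) code, which corrects any single-bit error in a 7-bit code word. This is an illustration of the principle only: the code used inside real DDR5 chips is much wider and is not reproduced here, and the function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit code word."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    # Code word layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code_word):
    """Correct at most one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position of the corrected bit, or 0 if no error was seen.
    """
    c = list(code_word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome points at the bad bit
    if position:
        c[position - 1] ^= 1  # flip it back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single bit-flip at position 5
    print(hamming74_decode(word))  # ([1, 0, 1, 1], 5)
```

A single-error-correcting code like this silently fixes one flipped bit per code word, but two flips in the same word will be miscorrected, which is why error correction reduces the problem rather than eliminating it.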