@qualia Unless your dose rate is extraordinarily high I would expect the live stream would be super boring.
But let it run for a year or two with good logging and I'd absolutely read a blog or paper summarizing the results
@azonenberg i'm halfway expecting to see failures almost immediately with it in as direct contact as i can muster to validate the whole idea, at which point i'll back it off, run a lap or two to make sure that excursion hasn't unduly damaged the dimms, and then pick some fixed distances of measured intensity and start walking it in forward
if it does turn out to be extremely boring, i'll definitely see about arranging a longer-term test strategy, since i have wondered about this for a long while
@qualia Well the other question is what you're irradiating (ram vs CPU vs chipset etc).
You're gonna see different natures of failures hitting caches, logic, main RAM, etc
@azonenberg true! i was specifically thinking ram, since it is, afaik, the most prominent/physically large domain where hardware error detection (if not correction) in consumer hardware is not ubiquitous. except storage, maybe? correct me if i'm wrong
this idea is acutely engendered by the firefox error report bit flip thread, if you've seen that floating around
@brouhaha @azonenberg makes sense. i had suspected as much -- i know SSDs have a whole extra region of spare blocks for wear management, but I got to thinking about the phenomena of silent data corruption and second-guessed myself
but on reflection, with how aggressively disk I/O gets cached in the free RAM of non-ECC consumer hardware.. that must account for a substantial share of it
this might make for another interesting test, if I can wind up the single-event-upset events to a practically noticeable level -- get a small ZFS mirror going and redline it with checksummed/deterministic garbage writes & reads; see if/how often it manages to catch-and-correct itself in spite of the RAM's unreliability
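[A minimal sketch of that stress loop, assuming the pool is mounted at some directory path; `deterministic_block` and `stress` are hypothetical names, and the payload is regenerated from a seed so no reference copy has to sit in the same unreliable RAM:]

```python
import hashlib
import os

def deterministic_block(seed: int, size: int = 1 << 20) -> bytes:
    # Reproducible pseudo-random payload: SHA-256 in counter mode,
    # so the expected contents can be recomputed instead of stored.
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])

def stress(pool_path: str, n_files: int = 4) -> int:
    # Write deterministic garbage, read it back, count mismatches.
    # ZFS's own checksums catch on-disk corruption; anything this
    # loop still sees presumably slipped through in RAM.
    mismatches = 0
    for seed in range(n_files):
        path = os.path.join(pool_path, f"block_{seed}.bin")
        with open(path, "wb") as f:
            f.write(deterministic_block(seed))
        with open(path, "rb") as f:
            if f.read() != deterministic_block(seed):
                mismatches += 1
    return mismatches
```

[On healthy hardware `stress` should return 0; under irradiation, nonzero counts here vs. ZFS's own checksum-error counters would separate RAM flips from disk flips.]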
@qualia @azonenberg
If the error hits non-ECC RAM written by the application before a write, and before ZFS computes the hash, then of course ZFS won't detect any error.
Similarly, if data read from the drive into non-ECC RAM gets an error after ZFS has validated the hash, then no error is detected.
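[That timing window is easy to demonstrate; a toy sketch using SHA-256 to stand in for ZFS's block checksum, with `flip_bit` as a hypothetical helper:]

```python
import hashlib

def flip_bit(data: bytes, bit: int) -> bytes:
    # Simulate a single-event upset at the given bit offset.
    b = bytearray(data)
    b[bit // 8] ^= 1 << (bit % 8)
    return bytes(b)

buf = b"application buffer contents"

# Case 1: the flip lands BEFORE the checksum is computed.
# The stored hash covers the already-corrupt data, so read-back
# verifies cleanly and the corruption is invisible.
corrupted = flip_bit(buf, 3)
stored = hashlib.sha256(corrupted).hexdigest()
assert hashlib.sha256(corrupted).hexdigest() == stored

# Case 2: the flip lands AFTER the checksum is computed
# (e.g. on the way to disk). Verification fails, and a
# mirror/raidz copy can repair it.
stored = hashlib.sha256(buf).hexdigest()
corrupted = flip_bit(buf, 3)
assert hashlib.sha256(corrupted).hexdigest() != stored
```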
I'm amazed that commodity mass-market computers have successfully* ignored this issue for so long, as DRAM error rates have steadily increased.
*for some value of "successfully"
@brouhaha @qualia Yeah I was talking to @dlharmon a while back and he had some ideas for a RS-FEC that would give you something like 520 or 530? bits of payload per 576-bit DRAM burst (8-beat burst * 72-bit bus).
Not too useful for general purpose computing where you expect power-of-two cache line sizes, but if you're building a router or oscilloscope or something and just making huge FIFOs, it lets you buy a few percent more bandwidth at the same PHY speed without completely throwing out ECC
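[The payload numbers work out if you assume RS over 8-bit symbols, which is my guess at the scheme, not a detail from the thread; back-of-the-envelope:]

```python
# One DRAM burst: 8 beats on a 72-bit bus = 576 bits = 72 one-byte
# RS symbols. With 2t parity symbols an RS code corrects t symbol
# errors anywhere in the burst.
BURST_BITS = 8 * 72          # 576
SYMBOLS = BURST_BITS // 8    # 72

for parity_symbols in (6, 7):
    t = parity_symbols // 2  # correctable symbol errors
    payload_bits = (SYMBOLS - parity_symbols) * 8
    print(f"{parity_symbols} parity symbols: "
          f"{payload_bits} payload bits, corrects {t} symbols")

# Classic side-band SECDED gives 64 data bits per 72-bit word,
# i.e. 512 payload bits per burst -- so 520/528 is the claimed
# "few percent more bandwidth" (528/512 is about +3%).
```

[6 parity symbols gives 528 payload bits and 7 gives 520, which lines up with the "520 or 530?" figure; both correct 3 bad symbols per burst vs. SECDED's one bit per word.]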