@qualia Unless your dose rate is extraordinarily high I would expect the live stream would be super boring.
But let it run for a year or two with good logging and I'd absolutely read a blog or paper summarizing the results
@azonenberg i'm halfway expecting to see failures almost immediately with it in as direct contact as i can muster to validate the whole idea, at which point i'll back it off, run a lap or two to make sure that excursion hasn't unduly damaged the dimms, and then pick some fixed distances of measured intensity and start walking it in forward
if it does turn out to be extremely boring, i'll definitely see about arranging a longer-term test strategy, since i have wondered about this for a long while
@qualia Well the other question is what you're irradiating (ram vs CPU vs chipset etc).
You're gonna see different natures of failures hitting caches, logic, main RAM, etc
@azonenberg true! i was specifically thinking ram, since it is, afaik, the most prominent/physically large domain where hardware error detection (if not correction) in consumer hardware is not ubiquitous. except storage, maybe? correct me if i'm wrong
this idea is acutely engendered by the firefox error report bit flip thread, if you've seen that floating around
@brouhaha @azonenberg makes sense. i had suspected as much -- i know SSDs have a whole extra region of spare blocks for wear management, but I got to thinking about the phenomena of silent data corruption and second-guessed myself
but on reflection, with how aggressively disk I/O gets cached in the free RAM of non-ECC consumer hardware.. that must account for a substantial share of it
this might make for another interesting test, if I can wind up the single-event-upset events to a practically noticeable level -- get a small ZFS mirror going and redline it with checksummed/deterministic garbage writes & reads; see if/how often it manages to catch-and-correct itself in spite of the RAM's unreliability
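[A minimal sketch of that stress loop, assuming the pool is mounted at some directory path; `deterministic_block` and `stress` are hypothetical names, and the payload is regenerated from a seed so no reference copy has to sit in the same unreliable RAM:]

```python
import hashlib
import os

def deterministic_block(seed: int, size: int = 1 << 20) -> bytes:
    # Reproducible pseudo-random payload: SHA-256 in counter mode,
    # so the expected contents can be recomputed instead of stored.
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])

def stress(pool_path: str, n_files: int = 4) -> int:
    # Write deterministic garbage, read it back, count mismatches.
    # ZFS's own checksums catch on-disk corruption; anything this
    # loop still sees presumably slipped through in RAM.
    mismatches = 0
    for seed in range(n_files):
        path = os.path.join(pool_path, f"block_{seed}.bin")
        with open(path, "wb") as f:
            f.write(deterministic_block(seed))
        with open(path, "rb") as f:
            if f.read() != deterministic_block(seed):
                mismatches += 1
    return mismatches
```

[On healthy hardware `stress` should return 0; under irradiation, nonzero counts here vs. ZFS's own checksum-error counters would separate RAM flips from disk flips.]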
@qualia @azonenberg
If the error hits non-ECC RAM written by the application before a write, and before ZFS computes the hash, then of course ZFS won't detect any error.
Similarly, if data read from the drive into non-ECC RAM gets an error after ZFS has validated the hash, then no error is detected.
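[That timing window is easy to demonstrate; a toy sketch using SHA-256 to stand in for ZFS's block checksum, with `flip_bit` as a hypothetical helper:]

```python
import hashlib

def flip_bit(data: bytes, bit: int) -> bytes:
    # Simulate a single-event upset at the given bit offset.
    b = bytearray(data)
    b[bit // 8] ^= 1 << (bit % 8)
    return bytes(b)

buf = b"application buffer contents"

# Case 1: the flip lands BEFORE the checksum is computed.
# The stored hash covers the already-corrupt data, so read-back
# verifies cleanly and the corruption is invisible.
corrupted = flip_bit(buf, 3)
stored = hashlib.sha256(corrupted).hexdigest()
assert hashlib.sha256(corrupted).hexdigest() == stored

# Case 2: the flip lands AFTER the checksum is computed
# (e.g. on the way to disk). Verification fails, and a
# mirror/raidz copy can repair it.
stored = hashlib.sha256(buf).hexdigest()
corrupted = flip_bit(buf, 3)
assert hashlib.sha256(corrupted).hexdigest() != stored
```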
I'm amazed that commodity mass-market computers have successfully* ignored this issue for so long, as DRAM error rates have steadily increased.
*for some value of "successfully"
@brouhaha @qualia Yeah I was talking to @dlharmon a while back and he had some ideas for a RS-FEC that would give you something like 520 or 530? bits of payload per 576-bit DRAM burst (8-beat burst * 72-bit bus).
Not too useful for general purpose computing where you expect power-of-two cache line sizes, but if you're building a router or oscilloscope or something and just making huge FIFOs, it lets you buy a few percent more bandwidth at the same PHY speed without completely throwing out ECC
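[The payload numbers work out if you assume RS over 8-bit symbols, which is my guess at the scheme, not a detail from the thread; back-of-the-envelope:]

```python
# One DRAM burst: 8 beats on a 72-bit bus = 576 bits = 72 one-byte
# RS symbols. With 2t parity symbols an RS code corrects t symbol
# errors anywhere in the burst.
BURST_BITS = 8 * 72          # 576
SYMBOLS = BURST_BITS // 8    # 72

for parity_symbols in (6, 7):
    t = parity_symbols // 2  # correctable symbol errors
    payload_bits = (SYMBOLS - parity_symbols) * 8
    print(f"{parity_symbols} parity symbols: "
          f"{payload_bits} payload bits, corrects {t} symbols")

# Classic side-band SECDED gives 64 data bits per 72-bit word,
# i.e. 512 payload bits per burst -- so 520/528 is the claimed
# "few percent more bandwidth" (528/512 is about +3%).
```

[6 parity symbols gives 528 payload bits and 7 gives 520, which lines up with the "520 or 530?" figure; both correct 3 bad symbols per burst vs. SECDED's one bit per word.]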