would it be rude to use a little atom board as a bitflip testbench with a wee mote of sealed radium nestled betwixt its sodimms
little brown smudge that makes memtest86+ angry
if i set up a 24/7 livestream of the side effects of a measured & controlled elevated ambient ionizing radiation field on a machine running continuous stability tests would you watch that
i would absolutely check that out: 27.8%
yeah maybe idk: 23.2%
wouldn't watch but findings would be interesting: 31.9%
i do not care: 0.8%
i actively do not want this to happen: 1.5%
qualia what the fuck: 14.8%

@qualia Unless your dose rate is extraordinarily high I would expect the live stream would be super boring.

But let it run for a year or two with good logging and I'd absolutely read a blog or paper summarizing the results

@azonenberg i'm halfway expecting to see failures almost immediately with it in as direct contact as i can muster to validate the whole idea, at which point i'll back it off, run a lap or two to make sure that excursion hasn't unduly damaged the dimms, and then pick some fixed distances of measured intensity and start walking it in forward

if it does turn out to be extremely boring, i'll definitely see about arranging a longer-term test strategy, since i have wondered about this for a long while

@qualia Well the other question is what you're irradiating (ram vs CPU vs chipset etc).

You're gonna see different natures of failures hitting caches, logic, main RAM, etc

@azonenberg true! i was specifically thinking ram, since it is, afaik, the most prominent/physically large domain where hardware error detection (if not correction) in consumer hardware is not ubiquitous. except storage, maybe? correct me if i'm wrong

this idea is acutely engendered by the firefox error report bit flip thread, if you've seen that floating around

@qualia yes i have, and it doesn't surprise me in the slightest. There's a reason I retired my last machine without ECC ram over the weekend.

@azonenberg interestingly! as of recently i do also have a spare machine that takes ECC RAM. that could be a fun apples-to-oranges point of comparison, though I vaguely remember encountering some manner of issue with memtest86+ "seeing" failures behind ECC single-bit correction on a failing DIMM. i may need to explore a test suite running under Linux for all the hairy details and loggability anyhow

@qualia yeah i'm not sure of the details of how to get correctable-error counts out of the ram, maybe over IPMI or something?

For the most part it just works and has been stable, although I'm not irradiating my hardware lol

@azonenberg god i had such a stupid IPMI-related saga with that machine recently, i'll just scrape the MCEs out of dmesg lol
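A rough sketch of what scraping hardware-error lines out of dmesg might look like. The match patterns are assumptions about typical kernel message wording (it varies by kernel version and platform), and `classify` is a made-up helper name, not part of any tool mentioned in the thread.

```python
import re
from collections import Counter

# Assumed patterns: substrings that commonly show up in kernel log lines for
# machine checks, EDAC reports, and unexplained NMIs. Adjust for your kernel.
PATTERNS = {
    "mce": re.compile(r"\bmce:|Machine check", re.IGNORECASE),
    "edac": re.compile(r"\bEDAC\b"),
    "nmi": re.compile(r"NMI received for unknown reason"),
}

# dmesg's default format: "[seconds.microseconds] message"
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def classify(lines):
    """Return (counts per category, list of (timestamp, category, message))."""
    counts = Counter()
    events = []
    for line in lines:
        m = TIMESTAMP.match(line.strip())
        if not m:
            continue  # skip lines without a kernel timestamp
        ts, msg = float(m.group(1)), m.group(2)
        for name, pattern in PATTERNS.items():
            if pattern.search(msg):
                counts[name] += 1
                events.append((ts, name, msg))
                break  # first matching category wins
    return counts, events
```

To use it, feed it the lines of saved `dmesg` output (for example, read from a file or stdin) and log the counts alongside the dose-rate setting for that run.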
@qualia @azonenberg
my fav:
[44135615.739301] Uhhuh. NMI received for unknown reason 3d on CPU 26.
[44135615.739307] Dazed and confused, but trying to continue
[44135615.965025] Uhhuh. NMI received for unknown reason 2d on CPU 43.
[44135615.965031] Dazed and confused, but trying to continue

@astraleureka @azonenberg I got my work laptop's NVIDIA card to throw a bunch of "fell off the bus" errors yesterday while trying to get Optimus switching going + it really not liking having its pstates kicked around

the error makes sense but is still amusingly evocative. no "lp0 on fire" but i'll take it

@qualia @azonenberg my personal laptop was having repeated nvidia bus-death problems a few months back after upgrading the drivers. s76 eventually provided a workaround - disable memory clock gating and just leave it at 8001mhz all the time
the spew of nvidia driver and DRM logs were pretty amusing despite the annoyance of the issue
@azonenberg @qualia there's a special PCI device which you can coax into giving you this information, I believe

@whitequark @azonenberg @qualia there is indeed, you can find it by looking up what the Linux EDAC driver is. e.g. screenshot on this old dual socket Ivy Bridge system.

(EDAC: Error Detection And Correction)
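For a loggable version of the same thing: when an EDAC driver is loaded for the memory controller, Linux exposes per-controller correctable and uncorrectable error counts under sysfs. A minimal sketch, assuming the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` layout; `read_edac_counts` is a hypothetical helper name, and it returns an empty dict if no EDAC driver is present.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller sysfs tree.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {"mc0": {"ce": int|None, "ue": int|None}, ...}."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or a different layout
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts
```

Polling this on a timer and writing the deltas to a log would give a correctable-error rate over time without going through IPMI.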

@azonenberg @qualia rasdaemon? I've had a DIMM with a single bad bit show up using that
@azonenberg @qualia
Nothing standardized, unfortunately. It's entirely dependent on the DRAM controller. In modern x86, that's part of the processor, changes with each new microarchitecture, and is poorly documented.