would it be rude to use a little atom board as a bitflip testbench with a wee smote of sealed radium nestled betwixt its sodimms
little brown smudge that makes memtest86+ angry
if i set up a 24/7 livestream of the side effects of a measured & controlled elevated ambient ionizing radiation field on a machine running continuous stability tests would you watch that
i would absolutely check that out
27.8%
yeah maybe idk
23.2%
wouldn't watch but findings would be interesting
31.9%
i do not care
0.8%
i actively do not want this to happen
1.5%
qualia what the fuck
14.8%
Poll ended.
if you voted "i do not want this to happen" i would love to hear your thoughts/concerns
@qualia only if it has a playlist of ambient and/or synthwave music. Maaaybe other genres of electronic music

@qualia Unless your dose rate is extraordinarily high I would expect the live stream would be super boring.

But let it run for a year or two with good logging and I'd absolutely read a blog or paper summarizing the results

@azonenberg i'm halfway expecting to see failures almost immediately with it in as direct contact as i can muster to validate the whole idea, at which point i'll back it off, run a lap or two to make sure that excursion hasn't unduly damaged the dimms, and then pick some fixed distances of measured intensity and start walking it in forward

if it does turn out to be extremely boring, i'll definitely see about arranging a longer-term test strategy, since i have wondered about this for a long while

@qualia Well the other question is what you're irradiating (ram vs CPU vs chipset etc).

You're gonna see different natures of failures hitting caches, logic, main RAM, etc

@azonenberg true! i was specifically thinking ram, since it is, afaik, the most prominent/physically large domain where hardware error detection (if not correction) in consumer hardware is not ubiquitous. except storage, maybe? correct me if i'm wrong

this idea is acutely engendered by the firefox error report bit flip thread, if you've seen that floating around

@qualia yes i have, and it doesn't surprise me in the slightest. There's a reason I retired my last machine without ECC ram over the weekend.
@azonenberg interestingly! as of recently i do also have a spare machine that takes ECC RAM. that could be a fun apples-to-oranges point of comparison, though I vaguely remember encountering some manner of issue with memtest86+ "seeing" failures behind ECC single-bit correction on a failing DIMM. i may need to explore a test suite running under Linux for all the hairy details and loggability anyhow

@qualia yeah i'm not sure of the details of how to get correctable-error counts out of the ram, maybe over IPMI or something?

For the most part it just works and has been stable although I'm not irradiating my hardware lol

@azonenberg god i had such a stupid IPMI-related saga with that machine recently, i'll just scrape the MCEs out of dmesg lol
@qualia @azonenberg
my fav:
[44135615.739301] Uhhuh. NMI received for unknown reason 3d on CPU 26.
[44135615.739307] Dazed and confused, but trying to continue
[44135615.965025] Uhhuh. NMI received for unknown reason 2d on CPU 43.
[44135615.965031] Dazed and confused, but trying to continue

@astraleureka @azonenberg I got my work laptop's NVIDIA card to throw a bunch of "fell off the bus" errors yesterday while trying to get Optimus switching going + it really not liking having its pstates kicked around

the error makes sense but is still amusingly evocative. no "lp0 on fire" but i'll take it

@qualia @azonenberg my personal laptop was having repeated nvidia bus-death problems a few months back after upgrading the drivers. s76 eventually provided a workaround - disable memory clock gating and just leave it at 8001mhz all the time
the spew of nvidia driver and DRM logs were pretty amusing despite the annoyance of the issue
@azonenberg @qualia there's a special PCI device which you can coax into giving you this information, I believe

@whitequark @azonenberg @qualia there is indeed, you can find it by looking up what the Linux EDAC driver is. e.g. screenshot on this old dual socket Ivy Bridge system.

(EDAC: Error Detection And Correction)
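
For anyone who wants those counts without a screenshot: on Linux, the EDAC subsystem exposes per-memory-controller corrected and uncorrected error counters as `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mc*/`. A minimal sketch of polling them follows, assuming an EDAC driver for the platform is actually loaded (otherwise the directory is absent or empty). The function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} for each mc* dir.

    "ce" is the corrected-error count and "ue" the uncorrected-error count.
    A value of None means the counter file was missing or unreadable.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC driver loaded (or not Linux): nothing to report.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(name, c)
```

Logging the output on a timer alongside the test suite would give a time series of corrected errors to line up against source distance.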

@azonenberg @qualia rasdaemon? I've had a DIMM with a single bad bit show up using that
@azonenberg @qualia
Nothing standardized, unfortunately. It's entirely dependent on the DRAM controller. In modern x86, that's part of the processor, changes with each new microarchitecture, and is poorly documented.
@qualia @azonenberg
Mass storage devices have universally had error correction since the early 1990s. Disk drive error correction was introduced by IBM with the 3330 drive (1970) though error detection was used by earlier IBM disk systems such as the IBM 2311 (1964).
As density has tremendously increased since the late 1980s, it has become technically infeasible to make reliable disk drives without error correction. The same is true of solid state drives.
1/
@qualia @azonenberg
The error correction is internal to the drive. The drive attempts to present itself to the host as an entirely reliable device. Uncorrectable errors will of course be reported as such, but correctable errors are hidden, and only reported by diagnostic commands (e.g., SMART).
Both magnetic disk and solid state drives are dependent on "coding gain", where the error correction is used to achieve higher storage capacity than would be possible without it.
2/

@brouhaha @azonenberg makes sense. i had suspected as much -- i know SSDs have a whole extra region of spare blocks for wear management, but I got to thinking about the phenomenon of silent data corruption and second-guessed myself

but on reflection, with how aggressively disk I/O gets cached in the free RAM of non-ECC consumer hardware.. that must be the substantial source of most of it

this might make for another interesting test, if I can wind up the single-event upsets to a practically noticeable level -- get a small ZFS mirror going and redline it with checksummed/deterministic garbage writes & reads; see if/how often it manages to catch-and-correct itself in spite of the RAM's unreliability
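
A rough sketch of what "deterministic garbage" could look like: every block is derived from a seed and its block index, so the verifier can regenerate the expected bytes instead of keeping a second copy in the same unreliable RAM. The helper names (`block`, `write_pattern`, `verify_pattern`) and the choice of SHAKE-256 as the generator are assumptions for illustration, not anything specified in the thread. Note that a read-back shortly after a write will usually be served from the page cache rather than the disk, so a real run would need to account for that.

```python
import hashlib


def block(seed: bytes, index: int, size: int = 4096) -> bytes:
    """Deterministic pseudo-random block derived from (seed, index)."""
    return hashlib.shake_256(seed + index.to_bytes(8, "little")).digest(size)


def write_pattern(path, seed: bytes, n_blocks: int, size: int = 4096) -> None:
    """Write n_blocks deterministic blocks to path."""
    with open(path, "wb") as f:
        for i in range(n_blocks):
            f.write(block(seed, i, size))


def verify_pattern(path, seed: bytes, n_blocks: int, size: int = 4096):
    """Re-read the file and return [(block_index, flipped_bits), ...].

    Bytes missing from a short read are counted as 8 flipped bits each.
    """
    bad = []
    with open(path, "rb") as f:
        for i in range(n_blocks):
            got = f.read(size)
            want = block(seed, i, size)
            if got != want:
                flips = sum(bin(a ^ b).count("1") for a, b in zip(got, want))
                flips += 8 * (len(want) - len(got))
                bad.append((i, flips))
    return bad
```

Counting flipped bits per bad block, rather than just pass/fail, helps separate single-bit upsets from larger corruption.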

@qualia @azonenberg
If the error hits non-ECC RAM written by the application before a write, and before ZFS computes the hash, then of course ZFS won't detect any error.
Similarly, if data read from the drive into non-ECC RAM gets an error after ZFS has validated the hash, then no error is detected.
I'm amazed that commodity mass-market computers have successfully* ignored this issue for so long, as DRAM error rates have constantly increased.

*for some value of "successfully"
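
The write-side window described above can be shown with a toy model. This is only a sketch: SHA-256 stands in for whichever checksum the filesystem uses, and the function names are hypothetical. A bit that flips before the checksum is computed gets checksummed along with the rest of the buffer, so verification passes on corrupted data. A flip after the checksum is computed is caught.

```python
import hashlib


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def flip_before_checksum_detected(original: bytes, bit: int) -> bool:
    """Upset hits the buffer BEFORE the filesystem hashes it."""
    in_ram = flip_bit(original, bit)      # corrupted while waiting in RAM
    stored_sum = checksum(in_ram)         # hash is computed over bad data
    return checksum(in_ram) != stored_sum # verification: always passes


def flip_after_checksum_detected(original: bytes, bit: int) -> bool:
    """Upset hits the data AFTER the hash was computed."""
    stored_sum = checksum(original)
    on_disk = flip_bit(original, bit)
    return checksum(on_disk) != stored_sum


if __name__ == "__main__":
    data = b"application data" * 4
    print("flip before hash detected:", flip_before_checksum_detected(data, 5))
    print("flip after hash detected: ", flip_after_checksum_detected(data, 5))
```

The read side is symmetric: once the hash has been validated, a later flip in the RAM copy is invisible to the filesystem.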

@brouhaha @qualia I think commodity computer users are just expected to tolerate some level of instability and not complain too loudly because "that's how computers are".

People who demand serious reliability use ECC.

All of my Ceph cluster nodes and endpoints use ECC ram and BlueStore does E2E checksumming of data blocks to storage media and back so there should be no way for a SEU to cause data corruption, you'd need multiple bitflips

@qualia @azonenberg
There is no _technical_ reason for the cost of ECC RAM in a system to be more than 12.5% above the cost of non-ECC RAM.
The actual high cost is due to deliberate product positioning by processor and computer manufacturers, resulting in ECC RAM being a specialty item only marketed for server-class systems.
AMD at least leaves the ECC capability of their memory controllers enabled on many consumer SKUs, but Intel disables it on almost all but Xeons.
@qualia @azonenberg
Oh, you want your memory to be reliable? Better cough up another $1000 for your processor!
– Intel marketing and sales
@brouhaha @qualia Also 12.5% assumes (72, 64) Hamming FEC. I've played with using lower overhead codes to e.g. correct one bitflip per burst of eight 64-bit words rather than one per word, at the cost of more coding complexity.
@brouhaha @qualia (this does mean that a single bad DQ PHY driver/pin won't be corrected easily, but you can detect that at link training time and just fail out the DIMM)
@azonenberg @qualia
Yes, I was only considering the commonly used case. It's also possible to go to a wider word, and the number of overhead bits scales with log2(word length). But at some point, a longer word length would make SECDED inadequate. I'm not sure whether that would be true for a 128-bit word, and I'm too lazy to do the math at the moment.
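
For reference, the check-bit math is short. A Hamming code over k data bits needs the smallest r with 2^r >= k + r + 1, and SECDED adds one overall parity bit on top. The sketch below only computes overhead. It says nothing about whether single-error correction is adequate at a given word length, which depends on the error rate and is the part left open above. The function names are illustrative.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Hamming check bits plus one overall parity bit for double-error detection."""
    return hamming_check_bits(data_bits) + 1


if __name__ == "__main__":
    for k in (64, 128, 512):
        c = secded_check_bits(k)
        print(f"{k:4d} data bits: {c:2d} check bits, {100 * c / k:.2f}% overhead")
```

For 64 data bits this gives 8 check bits, which is the familiar (72, 64) code and the 12.5% figure. A 128-bit word needs 9 and a 512-bit burst needs 11.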

@brouhaha @qualia Yeah I was talking to @dlharmon a while back and he had some ideas for a RS-FEC that would give you something like 520 or 530? bits of payload per 576 bit (8 word burst * 72 bit bus) DRAM bus.

Not too useful for general purpose computing where you expect power of two cache line sizes but if you're building a router or oscilloscope or something and just making huge FIFOs, it lets you buy a few percent more bandwidth at the same PHY speed without completely throwing out ECC

@qualia @azonenberg
There is a trend to embed ECC in the DRAM chips, and that is probably a good thing, especially if they do internal scrubbing, but that's not entirely sufficient for high reliability since it's not end-to-end to the memory controllers. The memory bus itself is also a source of errors.
Nevertheless, anything that reduces random bit flips in the DRAM is good.
@qualia @azonenberg
Unfortunately, the error correction internal to a drive is not sufficient for system-level reliability. Errors can occur at any point between the drive interface and system memory. Modern interfaces such as PCIe, SAS, and SATA have error detection across the link, but that also is not really sufficient.
3/
@qualia @azonenberg
End-to-end error control requires that the host file system include error detection and correction in the data sent to/from the drive, as part of what the drive sees as opaque payload.
The ZFS filesystem, originally developed by Sun for Solaris, is an example of a filesystem that has strong data integrity checks for true end-to-end error control.
4/
@qualia some day i want to see a laptop running memtest86+ go through an electron beam irradiator...
@linear @qualia this kills the laptop (probably)
@qualia since when did they let you handle radioactive substances

@Voidhorn we are an exempt quantity household here thankyouverymuch. NRC licenses & spiceses are expensive and i have enough liabilities as it is

i've been a nuclear nerd since about second grade but actually started stewardship of responsibilities-extending-beyond-my-lifespan about ten years ago or so now

@qualia oh interesting.......

....

unrelated please explode something

@qualia this would make a great 24/7 livestream

@qualia "fish plays pokemon" but it's "radium plays memtest86+"

... just me?

@whitequark would

i have an hdmi capture dongle somewhere too eee i love this bad idea

@qualia sealed radium? That isn't gonna emit much radiation that will make it through to sodimms. Would need to wait for daughter isotopes of daughter isotopes to start beta emitting, which, depending on the sample size, would NOT happen very frequently. I could calculate this actually...
@vikxin i assure you it and its myriad daughters are quite lively already as measured by NaI(Tl) gamma spectroscopy. i will Not be performing a chemical extraction of Just The Radium for this
@qualia did you get one of these too

@vikxin no but i have been eyeing one. those are CsI or GAGG(Ce) (the new ones) but don't have the best resolution or stopping power, and i've learned are also prone to a lot of backscatter noise in the spectrum just due to the small size of the thing and closeness of readout electronics

i have a, iirc, 1.25" NaI(Tl) on a 3" PMT (bit excessive) on an HP preamp base, and then either a Canberra Model 35+ MCA or a Canberra 556 AIM MCA system plus three other bins of miscellaneous signal processing legos. the sample, scintillator, PMT, and preamp live inside a graded Al-Pb pig

i haven't thrown a Ba133m line at it but iirc it kinda ballparks in that 7% FWHM resolution range. it's fun. it does not fit in my pocket

@qualia the spectrum reading on mine is...not good. It's basically useless unless you have something specific that's way louder than background. I assume. I've never verified this
@vikxin go leave it on your toilet and integrate your results for a day and see if you can find the uranium
@qualia wouldn't it be more effective to find a granite rock? Who knows what material my toilet is made of

@vikxin well sure but i mean you do possess one way to find out

marble countertop, container of salt substitute, operating HEPA air filter; pick your NORM

@qualia I've tried HEPA filter actually! I must have to leave it there for a VERY long time, since it didn't look elevated over background when I tried
@qualia by "salt substitute" I'm assuming you mean KCl, since, you know, 🍌

@vikxin yeah the filter has to be running several hours, your house has to be pretty sealed up, and some areas simply don't have high radon. which is fortunate really. get your CO2 sensor up for a while and try again

i can pick up the ∆CPM on a metal β/γ GM probe so that little guy surely can under the right conditions

@qualia oh I had a radon sensor here for a bit, before bringing it to my parents. It was really, really low

@vikxin lucky

my house has shot past 4pCi/L on cold still winter days where nobody's leaving the house. this ironically is usually when everyone is already sick