Digging into the drive in my NAS that faulted, I'm reminded that magnetic hard drives are preposterously magical technology.

Case in point, using Seagate's tools I can get the drive to tell me how much it's adjusted the fly height of each of its 18 heads over the drive's lifetime, to compensate for wear and stuff. The drive provides these numbers in _thousandths of an angstrom_, or 0.1 _picometers_.

For reference, one helium atom is about 49 picometers in diameter. The drive is adjusting each head individually, in increments of a fraction of a helium atom, to keep them at the right height. I can't find numbers for modern drives, but what I can find from circa ten years ago is that the overall fly height had been reduced to under a nanometer, so the drive head is hovering on a gas bearing that's maybe 10-20 helium atoms thick, and adjusting its position even more minutely than that.
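To put that in units my brain can parse, here's a quick back-of-envelope conversion, assuming the raw value really is in thousandths of an angstrom as the tool reports (this is a sketch, not a documented SMART attribute):

```python
# Rough scale check: convert a fly-height adjustment reported in
# thousandths of an angstrom into picometers and helium-atom diameters.
# (Raw unit taken from the vendor tool's output as described above.)

ANGSTROM_PM = 100.0            # 1 angstrom = 100 picometers
UNIT_PM = ANGSTROM_PM / 1000   # reported unit: 0.001 angstrom = 0.1 pm
HELIUM_DIAMETER_PM = 49.0      # approximate helium atom diameter

def fly_height_adjustment(raw_thousandths_angstrom: int) -> None:
    pm = raw_thousandths_angstrom * UNIT_PM
    print(f"{pm:.1f} pm, or {pm / HELIUM_DIAMETER_PM:.3f} helium atoms")

# e.g. the ~0.5 pm lifetime adjustment mentioned below would be a raw value of 5:
fly_height_adjustment(5)   # -> 0.5 pm, or 0.010 helium atoms
```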

This is _extremely_ silly. You can buy a box that contains not just one, but several copies of a mechanism capable of sub-picometer altitude control, and store shitposts on it! That's wild.

Anyway my sad drive apparently looks like it had a head impact, not a full crash but I guess clipped a tiny peak on the platter and splattered a couple thousand sectors. Yow. But I'm told this isn't too uncommon, and isn't the end of the world? Which is, again, just ludicrous to think of. The drive head that appears to have bonked something has adjusted its altitude by almost 0.5 picometers in its 2.5 years in service. Is that a lot? I have no idea!

Aside from having to resilver the array and the reallocated sector count taking a big spike, the drive is now fine and both SMART and vendor data say it could eat this many sectors again 8-9 times before hitting the warranty RMA threshold. Which is very silly. But I guess I should keep an eye on it.

@danderson Personally I'd retire the drive under the assumption that particulate generated from the impact will likely contaminate it and make future damage more likely.

I aggressively replace drives at the first sign of trouble. Any increase in failed sector count is enough for me to no longer trust it.

@azonenberg @danderson I'm sure it's fine. What's the worst that can happen?

Actually, I think you would both like the MBI report on the Titan submersible implosion: https://media.defense.gov/2025/Aug/05/2003773004/-1/-1/0/SUBMERSIBLE%20TITAN%20MBI%20REPORT%20(04AUG2025).PDF

@willglynn @danderson already read it cover to cover.

No surprises having already read the preliminary and the NTSB material analysis report (which is itself a very good read).

The part I find most interesting is that the RTM system *worked*. It detected a catastrophic delamination of the hull well in advance of the actual implosion. They just ignored the data.

Does it work reliably enough to make (well constructed and properly engineered) carbon fiber pressure hulls safe for limited cycle applications? Open question and after this incident it's not likely anyone will put the work into finding out.

But in this particular case, it did in fact give significant advance warning. The warnings being ignored is a company culture problem not a technological problem.

@azonenberg @danderson 💯

The simplest explanation I have is that OceanGate couldn't afford to design, build, and operate a safe carbon fiber submersible, but they could (just barely) afford to design, build and operate *a* carbon fiber submersible.

Combine that with OceanGate being a corporate veneer for Rush's personal endeavor, and you end up with Rush being willing to gamble himself and his passengers on the safety of the sub. If it works, great, he carries on. If it fails… well, he doesn't see a path forward after that anyway.

@willglynn @danderson Yeah but "company cut corners to save money and people died as a result" is not surprising at all.

The thing working at least a little, surviving to its full target depth a couple of times, *and* RTM detecting the impending failure with enough lead time the hull could have been safely retired after the dive with nobody dying?

That's surprising.

@azonenberg @danderson Agreed! That anything worked as much as it did is the twist in this story.

I also think it's surprising that, despite making no provisions for inspections, they found a giant crack in the first hull — and that despite the company culture and the decision to repair it, they ultimately did retire it before it got everyone killed.

@willglynn @danderson Yeah, the company culture was an accident waiting to happen, there's no doubt about that.

I bet ultrasound from the inside (with the liner removed) after the dive 80 incident, or even a visual inspection, would have shown obvious signs of damage. But we'll never know.

@willglynn @danderson Like, with better engineering and safety culture this actually could have worked.

Make it thicker as originally planned. Use a PVHO certified window. Better winding with fewer defects, more accurate cure modeling, etc. Test several full-scale prototypes to destruction with RTM data streaming outside the chamber so you know what actual failure and pre-failure look like.

And develop SOPs for retiring hulls when signs of failure are detected.

It would have cost a lot more, but it could have worked.

@azonenberg @willglynn @danderson Aspects of both the Challenger and Columbia disasters here. From Challenger, having the data that clearly shows a problem but ignoring it. From Columbia, suspecting a serious problem but not investigating it. In all cases a gross failure of safety culture.

@azonenberg Yeah after more thought, it seems okay right now, but even if I accept that a few sector reallocs is a fact of life on modern drives and broadly fine, >7k reallocated is outside my comfort zone.

Then I got carried away, and so now the new server is getting a fresh 160TB worth of drives, and once the data's migrated over, the current pool will get dismantled and the remaining healthy drives within redistributed to the two backup NAS pools, so that they vaguely keep up with the growth in the primary.

@danderson I retire drives after any increase in reported bad sectors after deployment.

The rationale is that you'll have some number of factory defects that are relatively stable and not going to worsen, but new defects appearing later on are concerning: they suggest particulate contamination, some sort of electrical fault, ESD damage, age related flash bitcell damage, etc. Any of these could potentially affect many other storage locations in the future.
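A minimal sketch of how one might automate that retire-on-any-increase policy, assuming smartctl is installed and the drive exposes a Reallocated_Sector_Ct attribute (attribute names and raw-value formats vary by vendor and interface, so treat this as illustrative, not a turnkey tool):

```python
import json
import subprocess
from pathlib import Path

BASELINE = Path("/var/lib/drive-baselines.json")  # hypothetical baseline store

def reallocated_count(dev: str) -> int:
    """Read the raw Reallocated_Sector_Ct from smartctl's attribute table."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])   # RAW_VALUE is the last column (usually a plain int)
    raise RuntimeError(f"no Reallocated_Sector_Ct attribute found on {dev}")

def check(dev: str) -> None:
    baselines = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    now = reallocated_count(dev)
    then = baselines.get(dev, now)   # first run: record the as-deployed count
    if now > then:
        print(f"{dev}: reallocated sectors grew {then} -> {now}; retire it")
    baselines.setdefault(dev, then)
    BASELINE.write_text(json.dumps(baselines))

check("/dev/sda")
```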

@azonenberg yeah it's fair. I'm somewhat blessed that this is the first time I've had to think about my policy, because all my drives to date have had a perfect 0 defects, or went from "fine" to "dead" basically instantly, which made the decision easy.

This event where the drive took a big hit but survived and where the smart data says even the concerning amount of sector reallocation only consumed 13% of the factory spares (RMA threshold is 90% of spares used), made me wonder. Especially since a replacement is $400, if I could persuade myself that I understand what the issue was and feel good that it's not a predictor of future sadness, ...

But yeah, a friend with deeper knowledge of hard drives (from working on them at cloud scale, where you get more insight from the vendors), who is usually somewhat sanguine about having a few reallocated sectors here and there as the drive ages, said they'd consider replacing this drive because that drive head feels like it might not be long for this world under load. So... yeah.
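For what it's worth, the back-of-envelope math I was doing on the spare pool looks roughly like this (the spare-pool size is inferred from "~7k sectors ≈ 13% used", not a number the drive actually reports, so take it as a sketch):

```python
# Back-of-envelope spare-pool headroom, using the numbers mentioned above.
# The drive doesn't directly expose its spare-pool size; it's inferred here.

reallocated = 7000          # sectors reallocated after the head impact
fraction_used = 0.13        # SMART/vendor data: ~13% of factory spares consumed
rma_threshold = 0.90        # warranty RMA threshold: 90% of spares used

spare_pool = reallocated / fraction_used            # ~54k spare sectors (inferred)
headroom = (rma_threshold - fraction_used) * spare_pool

print(f"inferred spare pool: ~{spare_pool:,.0f} sectors")
print(f"headroom before RMA threshold: ~{headroom:,.0f} more reallocations")
```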

@danderson Yeah especially in a case like this where you may have sustained a high energy impact it's a major concern that there could be abrasive particles scattered all over the disk surface waiting for you to hit them, causing a cascading failure.

@danderson It also depends on how much redundancy and resilience you have in your infrastructure.

I'd be a lot more willing to run a questionable drive in Ceph BlueStore with 3N replication and end to end checksumming than as say a laptop hdd with no fault tolerance.

@azonenberg @danderson This is exactly my strategy.

I have a handful of HDDs scattered about in low- or no-redundancy applications. If any of them sneeze or cough, they get sent to a Ceph cluster instead, where most of my HDDs already are. Many go on to live long service lives! Others do not. Regardless, it's no longer a data loss concern.

@azonenberg Yeah, I have quite a lot of redundancy: the drive's in a raidz2 pool, and all the data within gets replicated to a separate onsite backup NAS (also raidz2), and a critical subset also gets replicated to an offsite NAS (raidz1 - subset just because I haven't gotten around to shipping more drives there and growing the pool yet).

So, in terms of risk, I have a _lot_ of redundancy to go before this suspect drive causes data loss. But otoh, for Reasons, I discovered the other day that my primary had a faulted drive for a month before I noticed, and my onsite backup brained its UEFI nvram and won't boot any more, and that the offsite backup had filled up and wasn't replicating recent changes any more 😬 So, in practice I'm spending a fair bit of my redundancy on "I can neglect the system for a while due to mental health and even if it all bitrots it's still like N+1.8 overall".

I did do several full ZFS scrubs and also an extended smart test, and they all came back clean (no increase in reallocations, no motion in other early failure indicators)... But without knowing a lot more about modern drive physics and firmware, I don't feel I have enough information to make a confident call to risk it, so to speak.

@azonenberg @danderson That's pretty wasteful.
@szakib @azonenberg @danderson Depends on how much your data, your time, and your risk tolerance are worth

@arclight @szakib @danderson Yeah I've had multiple catastrophic or near catastrophic data loss incidents in the past including...

Our old 386 dying days after we freed up disk space by moving files one floppy at a time to the WinME system

A friend's house burning down containing a server with the only copy of weeks of work for a customer who had a contract clause requiring all data be stored on premises. The customer SSH'd into the burning building and saved most of the critical stuff before the power or fiber cables melted.

The HDD in my wife's brand new art computer dying days after I had installed it, before I had time to enroll it in my backup system. She lost a few days of work but most of her stuff was still on the old box.

These days everything I care about is Ceph with 3 way replication across 3 servers and end to end checksumming, backed up nightly to a second off site Ceph cluster.

@szakib @arclight @danderson RAID doesn't handle silent corruption well and unless you do 3 drive raid1 or have external checksums, if you find corruption you don't know which version of the object to trust.

Ceph (at least with BlueStore) has full end to end checksumming a la ZFS.
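A toy illustration of the "which copy do you trust" problem, assuming plain replication with no checksums (purely illustrative, not how any particular RAID or Ceph implementation actually works internally):

```python
import hashlib
from collections import Counter

def pick_replica(copies: list[bytes]) -> bytes | None:
    """With 2 mismatched copies there's no winner; with 3+ you can majority-vote."""
    counts = Counter(copies)
    best, votes = counts.most_common(1)[0]
    return best if votes > len(copies) // 2 else None

def pick_with_checksum(copies: list[bytes], expected_sha256: str) -> bytes | None:
    """An external checksum (ZFS/BlueStore style) settles it even with only 2 copies."""
    for c in copies:
        if hashlib.sha256(c).hexdigest() == expected_sha256:
            return c
    return None

good, flipped = b"hello world", b"hellp world"
print(pick_replica([good, flipped]))           # None: a 2-way mirror can't decide
print(pick_replica([good, good, flipped]))     # b'hello world': 3-way can vote
print(pick_with_checksum([flipped, good],
                         hashlib.sha256(good).hexdigest()))  # b'hello world'
```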

@azonenberg @arclight @szakib @danderson "The customer SSH'd into the burning building and saved most of the critical stuff before the power or fiber cables melted." 😰
@f4grx @azonenberg @arclight @danderson Backups?
My point is still to the OP: individual drive health should not matter.
@szakib @f4grx @azonenberg @danderson In an ideal world, it shouldn't. It always comes back to risk tolerance and budget. Defense-in-depth isn't cheap and besides, you can always find a use or a home for a working but suspect drive.

@arclight @szakib @f4grx @danderson I've had good luck RMAing drives once they start throwing smart errors even if they appear to work.

If I can get a free replacement for an in-warranty drive before it fails catastrophically, I'm going to.

@azonenberg @arclight @szakib @f4grx @danderson oh the drive mfrs are great about this. Best warranty support in the business.
@azonenberg @arclight @szakib @f4grx @danderson They will often cross-ship for at most a nominal fee if you give them a credit card number.

@kevinriggle @arclight @szakib @f4grx @danderson Yeah.

That said, my rule is also that data is expensive and storage media is cheap.

If a drive so much as blinks wrong at me, it's removed from active service as quickly as I can source a replacement. Yes, I have enough redundancy that loss of one drive shouldn't cause any horrible issues. But why tickle the dragon's tail if you can avoid it?

@azonenberg @arclight @szakib @f4grx @danderson Two things in life are certain: death and lost data. Guess which has occurred
@azonenberg @arclight @szakib @f4grx @danderson which is to say this is basically the correct perspective on data storage devices I believe yes
@azonenberg @arclight @szakib @f4grx @danderson I had a client who made me walk away from a degraded RAID array and I almost literally moved heaven and earth until I had at least documented its status and informed them what next steps were
@azonenberg @arclight @szakib @f4grx @danderson (had enough status that the hotel I was staying in literally kicked someone next door so I could stay long enough to at least document the job)

@kevinriggle @arclight @szakib @f4grx @danderson Yep lol.

To put more concrete numbers on it: a Micron 7450 Pro 3.84 TB server-grade M.2 22110 SSD, one of the workhorses of my fleet (but not the only drive I run), is $745.

If your hourly rate is $250/hr, three hours of lost work is worth as much as the drive. How many hours of work would a major data loss incident take? Even if you have backups, restoring from them and identifying what's been lost, much less recreating anything lost, is not immediate or free.

Even shutting down a cluster node, removing a suspect drive, inserting a new one, and running a few shell commands to provision the new drive has a cost in time that could otherwise be billable. But that's a small, known cost that you can incur during a planned maintenance window (accepting the reduced redundancy during that period).

@szakib @f4grx @arclight @danderson In that particular case we were contractually prohibited from having any offsite storage.

Stupid rule, and one that 15 years older/wiser me would never have agreed to. But it was the rule at the time.

@f4grx @azonenberg @arclight @szakib @danderson Good illustration of the "RAID's not backup" maxim.