Digging into the drive in my NAS that faulted, I'm reminded that magnetic hard drives are preposterously magical technology.

Case in point, using Seagate's tools I can get the drive to tell me how much it's adjusted the fly height of each of its 18 heads over the drive's lifetime, to compensate for wear and stuff. The drive provides these numbers in _thousandths of an angstrom_, or 0.1 _picometers_.

For reference, one helium atom is about 49 picometers in diameter. The drive is adjusting each head individually, in increments of a fraction of a helium atom, to keep them at the right height. I can't find numbers for modern drives, but what I can find from circa ten years ago is that the overall fly height had been reduced to under a nanometer, so the drive head is hovering on a gas bearing that's maybe 10-20 helium atoms thick, and adjusting its position even more minutely than that.
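As a sanity check on those units (using the ~49 pm helium diameter and the ~1 nm fly height figures above; all of these are rough numbers from the post, not precise physics), the arithmetic works out like this:

```python
# Rough scale check on the fly-height numbers quoted above.
# All figures are approximate and taken from the post itself.
ANGSTROM_PM = 100.0                   # 1 angstrom = 100 picometers
ADJUST_STEP_PM = 0.001 * ANGSTROM_PM  # drive reports in thousandths of an angstrom
HELIUM_DIAMETER_PM = 49.0             # helium atom diameter, per the post
FLY_HEIGHT_PM = 1000.0                # ~1 nm overall fly height, circa-2015 figure

atoms_in_gap = FLY_HEIGHT_PM / HELIUM_DIAMETER_PM
step_vs_atom = HELIUM_DIAMETER_PM / ADJUST_STEP_PM

print(f"reporting step: {ADJUST_STEP_PM} pm")              # 0.1 pm
print(f"gap thickness: ~{atoms_in_gap:.0f} helium atoms")  # ~20
print(f"one step is 1/{step_vs_atom:.0f} of an atom")      # 1/490
```

So the reported adjustment granularity is a few hundred times finer than a single helium atom, while the gap itself is only a couple dozen atoms thick.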

This is _extremely_ silly. You can buy a box that contains not just one, but several copies of a mechanism capable of sub-picometer altitude control, and store shitposts on it! That's wild.

Anyway, my sad drive apparently looks like it had a head impact: not a full crash, but I guess it clipped a tiny peak on the platter and splattered a couple thousand sectors. Yow. But I'm told this isn't too uncommon, and isn't the end of the world? Which is, again, just ludicrous to think of. The drive head that appears to have bonked something has adjusted its altitude by almost 0.5 picometers in its 2.5 years in service. Is that a lot? I have no idea!

Aside from having to resilver the array and the reallocated sector count taking a big spike, the drive is now fine and both SMART and vendor data say it could eat this many sectors again 8-9 times before hitting the warranty RMA threshold. Which is very silly. But I guess I should keep an eye on it.

@danderson

I would suspect heat warping. The Winchester design in theory should prevent the heads from crashing into a platter.

@danderson Ink-jet printers dispense fluid by the picoliter. Astonishing stuff; the mechanical engineers of my grandparents' generation would not have believed that it was physically possible.
@wollman @danderson They would also be shocked at how expensive the subscription costs are per picolitre.

@danderson I wonder what kind of "wear" requires moving the heads down half a picometer.

I didn't even realize it was possible for wear to happen at that scale.

@apLundell I have no idea, I'm being fed information from smarter people than me who've worked with hard drives at cloud scale and so have deeper knowledge of how the products and lifecycle tick. And trying to parse the information dumps from openSeaChest, which reveals that this drive has Quite A Lot going on in terms of monitoring and fine-tuning the analog signals it's seeing at each drive head.

@danderson Another astonishing thing about HDDs is that in the early 2000s, when most HDDs were 3.5" or 2.5" desktop or laptop things, someone was able to figure out how to mass produce HDDs small AND robust enough for the iPod and similar devices.

At least until flash memory became affordable enough to overtake these micro HDDs.

@KeefJudge @danderson iPods and similar devices don't have very high needs for robustness. At typical HDD data transfer speeds, and at typical compressed audio bitrates, you only need the HDD running for a handful of seconds per hour (running HDDs constantly would destroy these devices' batteries).
And with all kinds of gravity sensors, the players only need to find a small moment of calm to spin up the HDD and to read one buffer worth of compressed file data from it.

It was the same with audio CDs before HDD players (although their buffers were usually limited to around a minute of audio, and CD read speeds were at most 10x the audio stream bitrate). When subjected to constant shaking they couldn't work, but in real-life conditions their sensors could usually find enough calm time to spin up the CD and read data off it (even when players without an anti-shock feature struggled). Of course, players with larger anti-shock buffers, or MP3-CD players, worked better.

@danderson nowhere near as cool as your example, but I diagnosed a seemingly flakey disk using SMART data - one disk in the array had a far higher lifetime reboot count (10s of thousands) than the others (10s). Turned out the problem was a flakey power cable to that drive!

@danderson Personally I'd retire the drive under the assumption that particulate generated from the impact will likely contaminate it and make future damage more likely.

I aggressively replace drives at the first sign of trouble. Any increase in failed sector count is enough for me to no longer trust it.

@azonenberg @danderson I'm sure it's fine. What's the worst that can happen?

Actually, I think you would both like the MBI report on the Titan submersible implosion: https://media.defense.gov/2025/Aug/05/2003773004/-1/-1/0/SUBMERSIBLE%20TITAN%20MBI%20REPORT%20(04AUG2025).PDF

@willglynn @danderson already read it cover to cover.

No surprises having already read the preliminary and the NTSB material analysis report (which is itself a very good read).

The part I find most interesting is that the RTM system *worked*. It detected a catastrophic delamination of the hull well in advance of the actual implosion. They just ignored the data.

Does it work reliably enough to make (well constructed and properly engineered) carbon fiber pressure hulls safe for limited cycle applications? Open question and after this incident it's not likely anyone will put the work into finding out.

But in this particular case, it did in fact give significant advance warning. The warnings being ignored is a company culture problem not a technological problem.

@azonenberg @danderson 💯

The simplest explanation I have is that OceanGate couldn't afford to design, build, and operate a safe carbon fiber submersible, but they could (just barely) afford to design, build and operate *a* carbon fiber submersible.

Combine that with OceanGate being a corporate veneer for Rush's personal endeavor, and you end up with Rush being willing to gamble himself and his passengers on the safety of the sub. If it works, great, he carries on. If it fails… well, he doesn't see a path forward after that anyway.

@willglynn @danderson Yeah but "company cut corners to save money and people died as a result" is not surprising at all.

The thing working at least a little, surviving to its full target depth a couple of times, *and* RTM detecting the impending failure with enough lead time the hull could have been safely retired after the dive with nobody dying?

That's surprising.

@azonenberg @danderson Agreed! That anything worked as much as it did is the twist in this story.

I also think it's surprising that, despite making no provisions for inspections, they found a giant crack in the first hull — and that despite the company culture and the decision to repair it, they ultimately did retire it before it got everyone killed.

@willglynn @danderson Yeah, the company culture was an accident waiting to happen, there's no doubt about that.

I bet an ultrasound from the inside (with the liner removed) after the dive 80 incident, or even a visual inspection, would have shown obvious signs of damage. But we'll never know.

@willglynn @danderson Like, with better engineering and safety culture this actually could have worked.

Make it thicker as originally planned. Use a PVHO-certified window. Better winding with fewer defects, more accurate cure modeling, etc. Test several full-scale prototypes to destruction with RTM data streaming outside the chamber, so you know what actual failure and pre-failure look like.

And develop SOPs for retiring hulls when signs of failure are detected.

It would have cost a lot more, but it could have worked.

@azonenberg @willglynn @danderson Aspects of both the Challenger and Columbia disasters here. From Challenger, having the data that clearly shows a problem but ignoring it. From Columbia, suspecting a serious problem but not investigating it. In all cases a gross failure of safety culture.

@azonenberg Yeah after more thought, it seems okay right now, but even if I accept that a few sector reallocs is a fact of life on modern drives and broadly fine, >7k reallocated is outside my comfort zone.

Then I got carried away, and so now the new server is getting a fresh 160TB worth of drives, and once the data's migrated over, the current pool will get dismantled and the remaining healthy drives within redistributed to the two backup NAS pools, so that they vaguely keep up with the growth in the primary.

@danderson I retire drives after any increase in reported bad sectors after deployment.

The rationale is that you'll have some number of factory defects that are relatively stable and not going to worsen, but new defects appearing later on are concerning: they suggest particulate contamination, some sort of electrical fault, ESD damage, age related flash bitcell damage, etc. Any of these could potentially affect many other storage locations in the future.

@azonenberg yeah it's fair. I'm somewhat blessed that this is the first time I've had to think about my policy, because all my drives to date have had a perfect 0 defects, or went from "fine" to "dead" basically instantly, which made the decision easy.

This event, where the drive took a big hit but survived, and where the SMART data says even this concerning amount of sector reallocation only consumed 13% of the factory spares (the RMA threshold is 90% of spares used), made me wonder. Especially since a replacement is $400, if I could persuade myself that I understand what the issue was and feel good that it's not a predictor of future sadness, ...

But yeah, a friend with deeper knowledge of hard drives (from working on them at cloud scale, where you get more insight from the vendors), who is usually somewhat sanguine about having a few reallocated sectors here and there as the drive ages, said they'd consider replacing this drive because that drive head feels like it might not be long for this world under load. So... yeah.

@danderson Yeah especially in a case like this where you may have sustained a high energy impact it's a major concern that there could be abrasive particles scattered all over the disk surface waiting for you to hit them, causing a cascading failure.

@danderson It also depends on how much redundancy and resilience you have in your infrastructure.

I'd be a lot more willing to run a questionable drive in Ceph BlueStore with 3N replication and end-to-end checksumming than as, say, a laptop HDD with no fault tolerance.

@azonenberg @danderson This is exactly my strategy.

I have a handful of HDDs scattered about in low- or no-redundancy applications. If any of them sneeze or cough, they get sent to a Ceph cluster instead, where most of my HDDs already are. Many go on to live long service lives! Others do not. Regardless, it's no longer a data loss concern.

@azonenberg Yeah, I have quite a lot of redundancy: the drive's in a raidz2 pool, and all the data within gets replicated to a separate onsite backup NAS (also raidz2), and a critical subset also gets replicated to an offsite NAS (raidz1 - subset just because I haven't gotten around to shipping more drives there and growing the pool yet).

So, in terms of risk, I have a _lot_ of redundancy to go before this suspect drive causes data loss. But otoh, for Reasons, I discovered the other day that my primary had a faulted drive for a month before I noticed, and my onsite backup brained its UEFI nvram and won't boot any more, and that the offsite backup had filled up and wasn't replicating recent changes any more 😬 So, in practice I'm spending a fair bit of my redundancy on "I can neglect the system for a while due to mental health and even if it all bitrots it's still like N+1.8 overall".

I did do several full ZFS scrubs and also an extended smart test, and they all came back clean (no increase in reallocations, no motion in other early failure indicators)... But without knowing a lot more about modern drive physics and firmware, I don't feel I have enough information to make a confident call to risk it, so to speak.

@azonenberg @danderson That's pretty wasteful.
@szakib @azonenberg @danderson Depends on how much your data and your time and your risk tolerance is worth

@arclight @szakib @danderson Yeah I've had multiple catastrophic or near catastrophic data loss incidents in the past including...

Our old 386 dying days after we freed up disk space by moving files one floppy at a time to the WinME system

A friend's house burning down with a server containing the only copy of weeks of work for a customer whose contract required all data be stored on premises. The customer SSH'd into the burning building and saved most of the critical stuff before the power or fiber cables melted.

The HDD in my wife's brand new art computer dying days after I had installed it, before I had time to enroll it in my backup system. She lost a few days of work but most of her stuff was still on the old box.

These days everything I care about is Ceph with 3 way replication across 3 servers and end to end checksumming, backed up nightly to a second off site Ceph cluster.

@szakib @arclight @danderson RAID doesn't handle silent corruption well and unless you do 3 drive raid1 or have external checksums, if you find corruption you don't know which version of the object to trust.

Ceph (at least with BlueStore) has full end to end checksumming a la ZFS.

@azonenberg @arclight @szakib @danderson "The customer SSH'd into the burning building and saved most of the critical stuff before the power or fiber cables melted." 😰
@f4grx @azonenberg @arclight @danderson Backups?
My point is still to the OP: individual drive health should not matter.
@szakib @f4grx @azonenberg @danderson In an ideal world, it shouldn't. It always comes back to risk tolerance and budget. Defense-in-depth isn't cheap, and besides, you can always find a use or a home for a working-but-suspect drive.

@arclight @szakib @f4grx @danderson I've had good luck RMAing drives once they start throwing smart errors even if they appear to work.

If I can get a free replacement for an in-warranty drive before it fails catastrophically, I'm going to.

@azonenberg @arclight @szakib @f4grx @danderson oh the drive mfrs are great about this. Best warranty support in the business.
@azonenberg @arclight @szakib @f4grx @danderson They will often cross-ship for at most a nominal fee if you give them a credit card number.

@kevinriggle @arclight @szakib @f4grx @danderson Yeah.

That said, my rule is also that data is expensive and storage media is cheap.

If a drive so much as blinks wrong at me, it's removed from active service as quickly as I can source a replacement. Yes, I have enough redundancy that loss of one drive shouldn't cause any horrible issues. But why tickle the dragon's tail if you can avoid it?

@azonenberg @arclight @szakib @f4grx @danderson Two things in life are certain: death and lost data. Guess which has occurred.
@azonenberg @arclight @szakib @f4grx @danderson which is to say this is basically the correct perspective on data storage devices I believe yes
@azonenberg @arclight @szakib @f4grx @danderson I had a client who made me walk away from a degraded RAID array and I almost literally moved heaven and earth until I had at least documented its status and informed them what next steps were
@azonenberg @arclight @szakib @f4grx @danderson (had enough status that the hotel I was staying in literally kicked someone next door so I could stay long enough to at least document the job)

@kevinriggle @arclight @szakib @f4grx @danderson Yep lol.

To put more concrete numbers on it: a Micron 7450 Pro 3.84 TB server-grade M.2 22110 SSD, one of the workhorses of my fleet (but not the only drive I run), is $745.

If your hourly rate is $250/hr, three hours of lost work is worth as much as the drive. How many hours of work would a major data loss incident take? Even if you have backups, restoring from them and identifying what's been lost, much less recreating anything lost, is not immediate or free.
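The break-even above, as a quick sketch (using the drive price and hourly rate quoted here; purely illustrative, not general advice):

```python
# Illustrative break-even: drive replacement cost vs. billable time.
# Figures are the ones quoted above.
drive_cost_usd = 745.0    # Micron 7450 Pro 3.84 TB, price quoted above
hourly_rate_usd = 250.0   # example consulting rate

break_even_hours = drive_cost_usd / hourly_rate_usd
print(f"~{break_even_hours:.1f} hours of lost work equals one drive")  # ~3.0
```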

Even shutting down a cluster node, removing a suspect drive, inserting a new one, and running a few shell commands to provision the new drive has a cost in time that could otherwise be billable. But that's a small, known cost that you can incur during a planned maintenance window (accepting the reduced redundancy during that period).

@szakib @f4grx @arclight @danderson In that particular case we were contractually prohibited from having any offsite storage.

Stupid rule, and one that 15 years older/wiser me would never have agreed to. But it was the rule at the time.

@f4grx @azonenberg @arclight @szakib @danderson Good illustration of the "RAID's not backup" maxim.
@danderson Amazing! Also, Ångström is a unit I haven't read in a long time.
@danderson chances are the engineers picked those units "just to be safe for a while", not necessarily because that resolution is what the drive is actually capable of adjusting in
@SludgePhD @danderson Yes, this can't possibly be right - the head could at best only be manufactured to a precision of one atom's width, and it can only wear in atom-width lumps; how could it possibly (or usefully) be adjusted in fractions of that?

@denisbloodnok @SludgePhD The drive head is mounted on servos that can be adjusted through deflection, not just moving up/down by increments of one atom. That doesn't limit you to integer atoms.

Wear is also not necessarily mechanical, there's a lot of complex electronics in the pickup mechanism that reads the analog signal off the platter and amplifies/cleans it to recover the bit data. The helium pressure in the case also drops over time, which is going to change the properties of the gas bearing the head is riding on. For any number of reasons, minutely adjusting the head's position may result in significant changes to the head's ability to read the data on the track.

@denisbloodnok @SludgePhD Other fun things you can find in the vendor's debug metrics on the drive: the signal strength of the head's writes changes as the drive head ages, as does the physical position of where the signal gets laid on the platter. The drive includes per-head adjustments for exactly how to do the writes, as well as track position offsets to keep the data in the right place. Modern drives are a giant bag of active closed loop feedback systems keeping everything in place as the component characteristics change minutely with age.
@denisbloodnok @SludgePhD It's also worth asking: why would they lie? They don't have to expose any of this data, it's not part of standard SMART data, it's only accessible through vendor-specific commands. They could just not publish tools to read them, and not document what the values mean. Instead they open-sourced the software to extract this data from the drives, and documented specifically what each value is measuring. Why go to all that trouble and then lie about what the values mean?

I can only note that these technologies have been developed, under high pressure to improve, for roughly three quarters of a century.
That is a very long sequence of changes, and it can go very far.

@danderson @denisbloodnok @SludgePhD

@danderson @SludgePhD Deflection just means you're trying to make an even tinier adjustment somewhere else.

The head is far _bumpier_ than the supposed amount of adjustment here, and so is the surface of the drive.

@denisbloodnok @SludgePhD that's not how the geometry of the drive heads works, but I don't think either of us is getting anything useful out of this argument, so I'll stop now :)

@denisbloodnok @SludgePhD @danderson

Those same observations/arguments were made within the industry, but the data are clear that we can adjust spacing with pm as the unit. HDDs are indeed magical devices, just like the OP said!

@RichardBrockie Please can we _see_ some of this data? (In particular, "with pm as the unit" I think reasonably covers anything up to three orders of magnitude less precise than the original claim.)

@denisbloodnok

Much is confidential. Here’s a recent paper from a collaboration between some of my colleagues and UC Berkeley.

Cheng, Q., Rajauria, S., Schreck, E. et al. In-situ sub-angstrom characterization of laser-lubricant interaction in a thermo-tribological system. Commun Eng 3, 138 (2024). https://doi.org/10.1038/s44172-024-00284-3

@RichardBrockie Perhaps I'm missing something but this appears to be about _measuring_ lubricant thickness at, well, as they say ~0.2 Å resolution. That's pretty tiny but it doesn't seem to have much bearing on the question of whether the drive head can be moved in increments 200 times smaller (ie, much smaller than an atom) or what good that would do.

@denisbloodnok

As I said before, much data is confidential, but our colleagues at Seagate have indicated their unit in their SMART data.

This paper is at the same scale: 0.2 Å is 20 pm. Elsewhere in the thread it was mentioned how we control the fly height. The paper indicates some of the reasons we need to control the fly height.

As to how we measure spacing, here’s a good starting point. http://dx.doi.org/10.1109/TMAG.2011.2157672

@RichardBrockie It's not at the same scale (or about positioning the head). The original claim here was for 0.1pm, 200 times as small.

The second paper appears to be (as you say) about measuring spacing not moving the head in 0.1pm increments, and while I can't read the whole thing the abstract says "a repeatability of less than 0.1 nm was obtained".

So far you've shown me nothing to justify the original claim.

@denisbloodnok

If you read what I wrote, I'm agreeing on the units that STX have published for spacing. I'll point out that Marchon et al. is from 11 years ago and the full PDF is available.

Tribological understanding has progressed. If you’re genuinely interested I’ve shared good starting points for a literature search.

@RichardBrockie If what you mean is that the use of "picometres" is not unreasonable for any figure at all in this space, great. You'll forgive me for being a little confused because, since I was specifically dubious about the claim that the head can move in 0.1 pm increments, I expected you to be saying something in support of that claim and was examining what you were saying on that basis.