Digging into the drive in my NAS that faulted, I'm reminded that magnetic hard drives are preposterously magical technology.

Case in point, using Seagate's tools I can get the drive to tell me how much it's adjusted the fly height of each of its 18 heads over the drive's lifetime, to compensate for wear and stuff. The drive provides these numbers in _thousandths of an angstrom_, or 0.1 _picometers_.

For reference, one helium atom is about 49 picometers in diameter. The drive is adjusting each head individually, in increments of a fraction of a helium atom, to keep them at the right height. I can't find numbers for modern drives, but what I can find from circa ten years ago is that the overall fly height had been reduced to under a nanometer, so the drive head is hovering on a gas bearing that's maybe 10-20 helium atoms thick, and adjusting its position even more minutely than that.
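Back-of-the-envelope version of that comparison, using only the rough figures above (0.1 pm adjustment steps, ~49 pm helium atom, sub-nanometer fly height):

```python
# Rough scale comparison; all figures are approximate, from the text above.
adjustment_step_pm = 0.1    # fly-height tweaks reported in 0.001 angstrom = 0.1 pm
helium_diameter_pm = 49     # approximate diameter of a helium atom
fly_height_pm = 1000        # "under a nanometer", circa ten years ago

# Each adjustment step as a fraction of a helium atom
print(adjustment_step_pm / helium_diameter_pm)

# Thickness of the gas bearing, measured in helium atoms
print(fly_height_pm / helium_diameter_pm)
```

That works out to adjustment steps of roughly 1/500th of a helium atom, over a gap about 20 atoms thick.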

This is _extremely_ silly. You can buy a box that contains not just one, but several copies of a mechanism capable of sub-picometer altitude control, and store shitposts on it! That's wild.

Anyway, my sad drive apparently had a head impact: not a full crash, but I guess it clipped a tiny peak on the platter and splattered a couple thousand sectors. Yow. But I'm told this isn't too uncommon, and isn't the end of the world? Which is, again, just ludicrous to think of. The head that appears to have bonked something has adjusted its altitude by almost 0.5 picometers in its 2.5 years in service. Is that a lot? I have no idea!

Aside from having to resilver the array and the reallocated sector count taking a big spike, the drive is now fine and both SMART and vendor data say it could eat this many sectors again 8-9 times before hitting the warranty RMA threshold. Which is very silly. But I guess I should keep an eye on it.
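Keeping an eye on it mostly means watching the Reallocated_Sector_Ct attribute from `smartctl -A` over time. A minimal sketch of pulling the raw count out of that output; the sample line and its value of 2048 are invented for illustration, not from my drive:

```python
# Hypothetical sketch: extract Reallocated_Sector_Ct from `smartctl -A` output.
# The sample below is made up; in practice you'd feed in real output, e.g. via
# subprocess.run(["smartctl", "-A", "/dev/sda"], capture_output=True, text=True).
sample_output = """\
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       2048
194 Temperature_Celsius     0x0022   036   045   000    Old_age   Always       -       36
"""

def reallocated_sectors(smart_table: str) -> int:
    for line in smart_table.splitlines():
        fields = line.split()
        # ATTRIBUTE_NAME is the second column; RAW_VALUE is the last
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[-1])
    raise ValueError("Reallocated_Sector_Ct not found")

print(reallocated_sectors(sample_output))
```

Logging that number periodically makes the "big spike" pattern obvious, versus a slow steady climb that would worry me more.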

@danderson Personally I'd retire the drive under the assumption that particulate generated from the impact will likely contaminate it and make future damage more likely.

I aggressively replace drives at the first sign of trouble. Any increase in failed sector count is enough for me to no longer trust it.

@azonenberg @danderson That's pretty wasteful.
@szakib @azonenberg @danderson Depends on how much your data and your time and your risk tolerance is worth

@arclight @szakib @danderson Yeah I've had multiple catastrophic or near catastrophic data loss incidents in the past including...

Our old 386 dying days after we freed up disk space by moving files one floppy at a time to the WinME system

A friend's house burning down containing a server with the only copy of weeks of work for a customer whose contract had a clause requiring all data be stored on premises. The customer SSH'd into the burning building and saved most of the critical stuff before the power or fiber cables melted.

The HDD in my wife's brand new art computer dying days after I had installed it, before I had time to enroll it in my backup system. She lost a few days of work but most of her stuff was still on the old box.

These days everything I care about is Ceph with 3 way replication across 3 servers and end to end checksumming, backed up nightly to a second off site Ceph cluster.

@szakib @arclight @danderson RAID doesn't handle silent corruption well, and unless you do 3-drive RAID 1 or have external checksums, when you do find corruption you don't know which copy of the object to trust.

Ceph (at least with BlueStore) has full end-to-end checksumming à la ZFS.
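For reference, BlueStore's checksum algorithm is controlled by the `bluestore_csum_type` option (crc32c by default). A sketch of inspecting it on a running cluster; this is a config fragment assuming a Luminous-or-later cluster with the `ceph` CLI available:

```shell
# Show the checksum algorithm BlueStore uses for object data (default: crc32c)
ceph config get osd bluestore_csum_type

# Deep scrubs are what actually read back and verify data against checksums;
# <pgid> is a placeholder for a real placement group id
ceph pg deep-scrub <pgid>
```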