Digging into the drive in my NAS that faulted, I'm reminded that magnetic hard drives are preposterously magical technology.

Case in point, using Seagate's tools I can get the drive to tell me how much it's adjusted the fly height of each of its 18 heads over the drive's lifetime, to compensate for wear and stuff. The drive provides these numbers in _thousandths of an angstrom_, or 0.1 _picometers_.

For reference, one helium atom is about 49 picometers in diameter. The drive is adjusting each head individually, in increments of a fraction of a helium atom, to keep them at the right height. I can't find numbers for modern drives, but what I can find from circa ten years ago is that the overall fly height had been reduced to under a nanometer, so the drive head is hovering on a gas bearing that's maybe 10-20 helium atoms thick, and adjusting its position even more minutely than that.
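To put those units in more familiar terms, here's a back-of-the-envelope sketch in Python. The raw adjustment value is a made-up example in the same ballpark as the numbers in this thread, not actual drive output:

```python
# Back-of-the-envelope unit conversion for the fly-height telemetry.
# RAW_ADJUSTMENT is a hypothetical example value, not real drive output.
RAW_ADJUSTMENT = 5            # reported in thousandths of an angstrom

PM_PER_MILLIANGSTROM = 0.1    # 1/1000 of an angstrom = 0.1 picometers
HELIUM_DIAMETER_PM = 49       # approximate helium atom diameter, per above
FLY_HEIGHT_PM = 1000          # "under a nanometer", circa ten years ago

adjustment_pm = RAW_ADJUSTMENT * PM_PER_MILLIANGSTROM
print(f"lifetime adjustment: {adjustment_pm} pm "
      f"(~{adjustment_pm / HELIUM_DIAMETER_PM:.2f} helium atoms)")
print(f"fly height: ~{FLY_HEIGHT_PM / HELIUM_DIAMETER_PM:.0f} helium atoms of clearance")
```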

This is _extremely_ silly. You can buy a box that contains not just one, but several copies of a mechanism capable of sub-picometer altitude control, and store shitposts on it! That's wild.

Anyway, my sad drive looks like it had a head impact: not a full crash, but I guess it clipped a tiny peak on the platter and splattered a couple thousand sectors. Yow. But I'm told this isn't too uncommon, and isn't the end of the world? Which is, again, just ludicrous to think of. The drive head that appears to have bonked something has adjusted its altitude by almost 0.5 picometers in its 2.5 years in service. Is that a lot? I have no idea!

Aside from having to resilver the array and the reallocated sector count taking a big spike, the drive is now fine and both SMART and vendor data say it could eat this many sectors again 8-9 times before hitting the warranty RMA threshold. Which is very silly. But I guess I should keep an eye on it.
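If you want to watch that reallocated-sector count from a script rather than eyeballing it, here's a minimal sketch using smartctl's JSON output (assumes smartmontools 7.0 or newer and an ATA drive). The threshold is a placeholder; the actual RMA cutoff is vendor-specific and comes from the vendor's own tooling:

```python
#!/usr/bin/env python3
"""Rough reallocated-sector headroom check via smartctl --json."""
import json
import subprocess

DEVICE = "/dev/sda"            # adjust to the suspect drive
ASSUMED_RMA_THRESHOLD = 20000  # hypothetical limit, not a real vendor number

result = subprocess.run(
    ["smartctl", "--json", "-A", DEVICE],
    capture_output=True, text=True,
    check=False,  # smartctl uses nonzero exit codes for warnings, not just failures
)
attrs = json.loads(result.stdout)["ata_smart_attributes"]["table"]

realloc = next(a for a in attrs if a["id"] == 5)  # Reallocated_Sector_Ct
count = realloc["raw"]["value"]
print(f"reallocated sectors: {count}")
print(f"headroom before the assumed threshold: {ASSUMED_RMA_THRESHOLD - count}")
```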

@danderson Personally I'd retire the drive under the assumption that particulate generated from the impact will likely contaminate it and make future damage more likely.

I aggressively replace drives at the first sign of trouble. Any increase in failed sector count is enough for me to no longer trust it.

@azonenberg @danderson That's pretty wasteful.
@szakib @azonenberg @danderson Depends on how much your data and your time and your risk tolerance are worth.

@arclight @szakib @danderson Yeah, I've had multiple catastrophic or near-catastrophic data loss incidents in the past, including...

Our old 386 dying days after we freed up disk space by moving files one floppy at a time to the WinME system.

A friend's house burning down, containing a server with the only copy of weeks of work for a customer who had a contract clause requiring all data be stored on premises. The customer SSH'd into the burning building and saved most of the critical stuff before the power or fiber cables melted.

The HDD in my wife's brand new art computer dying days after I had installed it, before I had time to enroll it in my backup system. She lost a few days of work but most of her stuff was still on the old box.

These days everything I care about is on Ceph with 3-way replication across 3 servers and end-to-end checksumming, backed up nightly to a second off-site Ceph cluster.
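For anyone curious what the replication side of that looks like, here's a minimal sketch of the Ceph pool settings involved; the pool name and PG count are placeholders, not anyone's actual config:

```python
#!/usr/bin/env python3
"""Sketch of a 3-way replicated Ceph pool (placeholder names, not a real config)."""
import subprocess

POOL = "important-data"  # hypothetical pool name

def ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)

# Replicated pool; the default CRUSH rule places each replica on a different
# host, so three copies end up on three servers.
ceph("osd", "pool", "create", POOL, "128")
ceph("osd", "pool", "set", POOL, "size", "3")      # keep 3 copies
ceph("osd", "pool", "set", POOL, "min_size", "2")  # stay writable with one host down
# BlueStore OSDs checksum data on every read by default, which is where the
# end-to-end checksumming comes from.
```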

@azonenberg @arclight @szakib @danderson "The customer SSH'd into the burning building and saved most of the critical stuff before the power or fiber cables melted." 😰
@f4grx @azonenberg @arclight @danderson Backups?
My point to the OP still stands: individual drive health should not matter.
@szakib @f4grx @azonenberg @danderson In an ideal world, it shouldn't. It always comes back to risk tolerance and budget. Defense-in-depth isn't cheap, and besides, you can always find a use or a home for a working but suspect drive.

@arclight @szakib @f4grx @danderson I've had good luck RMAing drives once they start throwing SMART errors, even if they appear to work.

If I can get a free replacement for an in-warranty drive before it fails catastrophically, I'm going to.

@azonenberg @arclight @szakib @f4grx @danderson oh the drive mfrs are great about this. Best warranty support in the business.

@kevinriggle @arclight @szakib @f4grx @danderson Yeah.

That said, my rule is also that data is expensive and storage media is cheap.

If a drive so much as blinks wrong at me, it's removed from active service as quickly as I can source a replacement. Yes, I have enough redundancy that loss of one drive shouldn't cause any horrible issues. But why tickle the dragon's tail if you can avoid it?

@azonenberg @arclight @szakib @f4grx @danderson Two things in life are certain: death and lost data. Guess which has occurred
@azonenberg @arclight @szakib @f4grx @danderson which is to say, this is basically the correct perspective on data storage devices, I believe, yes
@azonenberg @arclight @szakib @f4grx @danderson I had a client who made me walk away from a degraded RAID array, and I almost literally moved heaven and earth until I had at least documented its status and informed them what the next steps were.
@azonenberg @arclight @szakib @f4grx @danderson (had enough status that the hotel I was staying in literally kicked someone next door so I could stay long enough to at least document the job)

@kevinriggle @arclight @szakib @f4grx @danderson Yep lol.

To put more concrete numbers on it: a Micron 7450 Pro 3.84 TB server-grade M.2 22110 SSD, one of the workhorses of my fleet (but not the only drive I run), is $745.

If your hourly rate is $250/hr, three hours of lost work is worth as much as the drive. How many hours of work would a major data loss incident take? Even if you have backups, restoring from them and identifying what's been lost, much less recreating anything lost, is not immediate or free.

Even shutting down a cluster node, removing a suspect drive, inserting a new one, and running a few shell commands to provision the new drive has a cost in time that could otherwise be billable. But that's a small, known cost that you can incur during a planned maintenance window (accepting the reduced redundancy during that period).
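A quick worked version of that trade-off, using the numbers from the post above; the hour estimates for a planned swap and for a major incident are illustrative guesses, not figures from the thread:

```python
# Cost comparison: early drive replacement vs. riding out a failure.
DRIVE_COST = 745.0    # Micron 7450 Pro 3.84 TB, per the post
HOURLY_RATE = 250.0   # $/hr, per the post

PLANNED_SWAP_HOURS = 1.0  # guess: drain the node, swap the drive, re-provision
INCIDENT_HOURS = 40.0     # guess: restore, audit what was lost, recreate work

print(f"drive cost == {DRIVE_COST / HOURLY_RATE:.1f} hours of billable time")
print(f"planned early replacement: ${DRIVE_COST + PLANNED_SWAP_HOURS * HOURLY_RATE:,.0f}")
print(f"major data loss incident:  ${INCIDENT_HOURS * HOURLY_RATE:,.0f}, plus a new drive anyway")
```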