Reminder that SMART is not magic: in my experience, disks are dead long before SMART stops reporting them as 'PASSED'.

I actually came across the perfect example today: two disks in my backup server are having issues. One is clearly broken, the other gave 2 checksum errors.

Despite this, neither disk is 'failed' according to SMART. One has 456 offline uncorrectable errors / pending sectors, the other one looks fine as far as SMART is concerned.

Don't rely on broken hardware to tell you that it's broken.
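To make that concrete, here's a rough sketch of judging a disk by its raw counters and the filesystem's checksum errors instead of the overall 'PASSED' flag. The attribute names are the ATA-style ones smartctl commonly prints and the watch list is my own assumption (SAS disks like these report different counters, so adjust accordingly). `suspicious_counters` and `verdict` are made-up helper names, not part of any tool.

```python
# Hypothetical watch list: ATA-style attribute names as smartctl commonly
# prints them. SAS disks expose different counters, so this is an assumption.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def suspicious_counters(attributes):
    """Return the watched counters that have a non-zero raw value."""
    return {name: attributes[name] for name in WATCHED if attributes.get(name, 0) > 0}


def verdict(overall_passed, attributes, checksum_errors=0):
    """Classify a disk without trusting the overall SMART flag on its own.

    overall_passed  -- the SMART overall-health result (True means 'PASSED')
    attributes      -- dict of attribute name -> raw value
    checksum_errors -- checksum errors seen by the filesystem (e.g. ZFS)
    """
    if not overall_passed:
        return "failed"
    if suspicious_counters(attributes) or checksum_errors > 0:
        return "suspect"
    return "ok"


# The two disks from this post, both 'PASSED' according to SMART:
print(verdict(True, {"Offline_Uncorrectable": 456, "Current_Pending_Sector": 456}))  # suspect
print(verdict(True, {}, checksum_errors=2))  # suspect
```

Anything that comes back "suspect" deserves a closer look (scrub, long self-test, a spare on order), even though SMART itself still says everything is fine.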

#storage #HDD #ZFS

"haha seagate is bad"
these are enterprise Seagate Constellation SAS disks, not the OEM Seagate disks used in cheap laptops, and they've run for 7-8 years without issues.

"wait is that an SSD in the middle of that HDD pool"
yea at some point a disk failed and getting a 4TB QLC SSD was actually cheaper at the time. it's a mixed array anyway, it wouldn't be a Redundant Array of INEXPENSIVE Disks if I stuffed it full of new enterprise drives haha (smh companies always get this wrong)

@anthropy

A study from Google a few years ago had two conclusions regarding SMART:

  • Disks that reported SMART errors usually failed very soon afterwards.
  • Most disks that failed did so without reporting SMART errors.

@david_chisnall yea I can very much agree with that; usually if SMART starts showing any issues, the disk is already on its way out.

It's not 100% either though: I've had disks survive for over half a decade after getting 8 uncorrectable sectors, without any further issues.

It's also worth noting that Google runs their disks extremely hot compared to most homelabs, between 50 and 70°C during operation, which will cause different issues than always running them below 30°C in your homelab.

@anthropy In my experience, SMART has always warned me long before the HDs/SSDs died.

The last one was about a week ago. For 2 weeks I ignored warnings telling me the SSDs (2 Samsung PM 893) were about to die, because I thought: 2 SSDs at the same time? 🤔🤔🤔

In the end, they died. I had to restore the WMS and CT from the backup.