This is the disk I/O report for the last 30 days on every #Proxmox node in the cluster. Something happened around March 28 that caused high disk usage, and I can’t figure out what. Replication tasks are failing randomly, and in fact all disk operations are slowed down. Meanwhile, there are no significant changes in CPU, RAM, or network usage.
I was hoping to find out which LXCs are causing this, but they all have similar disk I/O graphs.
Well, shit.
#homelab #ProxmoxCluster #HighDiskUsage #zfs #mystery

OK, that was kind of a premature and stupid panic.

The #Beszel agent was not reporting disk I/O until a recent update. All agents update automatically, and that rollout happened right around March 28–30. After that, the data began to flow.

In fact, the sudden disk issues affect only a single #Proxmox node, and it is clearly visible on the I/O pressure stall graph. The spikes before March 30 are backups. Then it went crazy.

#homelab #zfs

Also, this is a single WD Green with #ZFS on it. Yes, I know some #Proxmox guru would hang me from a tree for that, but we got what we got.

#homelab #wdgreen #wd #ssd #Proxmox

@yehor so the problem is Beszel?
Is it showing false data, or is it causing the problem?
@schenklklopfer I'm not sure, but it seems likely. I'll stop all agents and monitor the IO pressure stall graph from #Proxmox.
@schenklklopfer So Beszel was not the culprit. Looks like it was just a coincidence, and the drive is simply dying.
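For anyone who wants to watch the same signal without a dashboard: the I/O pressure stall graph Proxmox shows comes from the kernel's PSI interface, `/proc/pressure/io` (Linux 4.20+). The file path and line format are the kernel's; the parsing code below is my own minimal sketch:

```python
# Parse Linux PSI I/O pressure lines from /proc/pressure/io.
# Line format: "some avg10=0.12 avg60=0.34 avg300=0.56 total=123456"
# avg* are percentages of time tasks stalled on I/O; total is microseconds.

def parse_psi_line(line: str) -> dict:
    """Turn one PSI line into a dict, e.g. {'kind': 'some', 'avg10': 0.0, ...}."""
    kind, *fields = line.split()
    out = {"kind": kind}
    for field in fields:
        key, value = field.split("=")
        out[key] = int(value) if key == "total" else float(value)
    return out

def read_io_pressure(path: str = "/proc/pressure/io") -> list:
    """Read both the 'some' and 'full' stall lines from the kernel."""
    with open(path) as f:
        return [parse_psi_line(line) for line in f]

# Example with a hardcoded sample line (so it runs anywhere):
sample = "some avg10=0.00 avg60=1.25 avg300=0.80 total=123456"
print(parse_psi_line(sample))
```

A sustained non-zero `avg300` on the "full" line is roughly what a long plateau on that graph means: all runnable tasks stalled on disk.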

So I moved all LXCs to another #Proxmox node. The problematic drive’s usage has returned to normal, and the node the containers were moved to is still fine.

Looks like the issue is with the WD Green SSD in that node. It took only 1429 hours to retire it.
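For scale (my own arithmetic, and assuming those 1429 hours are the drive's SMART power-on hours): that's under two months of continuous uptime.

```python
# Convert SMART power-on hours into days/months of continuous uptime.
# Assumption: 1429 is the drive's Power_On_Hours attribute.

def uptime_days(power_on_hours: float) -> float:
    return power_on_hours / 24

hours = 1429
print(f"{hours} h ≈ {uptime_days(hours):.1f} days "
      f"(~{uptime_days(hours) / 30:.1f} months of 24/7 duty)")
```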

I'm running #diskscan on it right now. I have no idea why, since I'm going to replace it with a spare NVMe drive I have anyway.

#homelab #Proxmox #ssd #nvme #drive

@yehor anything interesting in the smartctl output? A SMART self-test could also be useful.
@yehor A spare SSD in 2026!? You should consider yourself a rich person πŸ˜…
@yehor If you are running ZFS on a consumer SSD like the WD Green, it can wear out extremely quickly; maybe that's what happened?
@woof yeah, I think so
@yehor I went through a pair of consumer SSDs in a ZFS mirror a while back; they used up their 500 TB write endurance in only a few years. Now they've been replaced with Samsung SM863 enterprise SSDs, which have 6.2 PB write endurance, so they should last a long time.
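Rough back-of-envelope endurance math on those TBW ratings (the 300 GB/day workload below is my own illustrative number, not anything the poster measured):

```python
# Estimate years until an SSD's rated write endurance (TBW) is used up
# at a constant average write rate. Workload figure is hypothetical.

def years_to_wearout(tbw_terabytes: float, writes_gb_per_day: float) -> float:
    """Years until cumulative host writes reach the drive's TBW rating."""
    days = tbw_terabytes * 1000 / writes_gb_per_day  # 1 TB = 1000 GB
    return days / 365

# 500 TBW consumer drive vs. 6.2 PBW (6200 TBW) enterprise drive,
# both absorbing a steady 300 GB of writes per day:
print(f"consumer:   {years_to_wearout(500, 300):.1f} years")
print(f"enterprise: {years_to_wearout(6200, 300):.1f} years")
```

At that (hypothetical) write rate the consumer drive lasts a few years while the enterprise one lasts decades, which lines up with the experience above.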

@yehor Wait. I see exactly the same thing, but on my bare-metal installation of Home Assistant on a small Intel Atom thin client...

WTF?

@yehor have you tried looking into atop, htop, or iotop? The first and last specialize in per-process I/O analysis, while htop gained a similar capability recently.

https://github.com/Atoptool/atop

@yehor netdata can be very helpful for this type of troubleshooting, obviously not with historical data unless you invent a time machine.
@yehor Not to be a doomer, but this is what I'd expect to see from ransomware in action, right?
@yehor not trying to be smart but did you try rebooting?
@pax0707 yes. No luck

@yehor just saw it’s a spinner. Not ideal, but should work.

For containers I would advise at least an SSD. Lots of small writes can kill the performance of an HDD real fast, especially on an SMR drive.

Could be just that.