I've been running Linux for 15 years and am stuck. Help!

On Ubuntu Desktop, sometimes (usually after a few minutes - sometimes not at all until reboot) all file write operations seem to hang, while opening files is not a problem. Can't kill processes, they just go <defunct>. Extremely slow drawing of windows.

What's going on here? Memory and CPU look good. Open files are not crazy... io load is very low. Not virtualized.

Any ideas?

@_lennart does it happen when you use a live distro off USB, etc? If it doesn't it rules out hardware.
@jmurray2011 good idea! I can try that if I can't figure this out soon
@_lennart also see if this translates on a TTY, try to rule out your GUI
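If you do drop to a TTY, a crude write probe like this (path and sizes are just examples) can show whether plain writes stall even with the GUI out of the picture:

```shell
# Write 64 MiB and force it out to storage; if this hangs or takes ages
# while reads are fine, the problem is below the GUI.
dd if=/dev/zero of=/tmp/writetest bs=1M count=64 conv=fsync
rm -f /tmp/writetest
```

conv=fsync makes dd sync at the end, so a stalled device shows up as the command hanging rather than finishing instantly out of the page cache.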

@_lennart I'm going to hit you with a super wild guess.

Loose or bad SATA cable.

That's some wild shit, good luck with the troubleshooting.

@_lennart

Does the system come back from this state, or is it something you have to reboot to recover from?

Does anything end up in the system logs? If write IO can't happen, that could be problematic.

What storage device(s) are you using? Are you monitoring them with smartctl / smartd? Does it report anything unusual?

[Edit: sorry, I missed that you said it does sometimes come back]

@_lennart you could be hanging up on too many open files or some kind of disk io? could try vmstat to check?
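If vmstat isn't to hand, a rough equivalent straight out of /proc/stat (the 6th field of the "cpu" line is cumulative iowait jiffies, so sampling it twice gives iowait over the interval):

```shell
# Sample cumulative iowait twice, one second apart.
a=$(awk '/^cpu /{print $6}' /proc/stat)
sleep 1
b=$(awk '/^cpu /{print $6}' /proc/stat)
echo "iowait jiffies in the last second: $((b - a))"
```

In vmstat itself, the "b" column (processes blocked waiting on I/O) and "wa" (iowait %) are the ones to watch.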

@_lennart it's also possible you have a crummy disk and either
A) it's got bad sectors, so syslog will scream at you
B) it's having cache issues and getting stuff from actual storage to the OS is choking because of the disk cache.

or, and this hits me from time to time
the stupid fucking local resolver bullshit i always forget to turn off makes everything wait while it does a dns lookup for your machine's hostname, which will never resolve, so you get to wait.

@Viss omg I swear to god if this is a systemd DNS thing haha
@_lennart tail -f /var/log/syslog then try and recreate the problem ;b
@Viss have it tailing, waiting for things to happen again
@_lennart @Viss `dmesg -w` may be more reliable, if disk IO is a concern. With the syslog files being, you know, on disk and everything. Also, haven’t modern Ubuntu releases moved most logging to the systemd journal? `journalctl -f` might be better in that case.
@_lennart Knee-jerk reaction is the disk is going bad. It wouldn't hurt to do a backup as soon as you can and then run a disk check and/or inspect the SMART values.
@_lennart P.S. I've had disks go bad even when the SMART report didn't show anything unusual. It just died suddenly after a slow bout of degradation. It was somewhat similar to what's being reported here.

@_lennart

Two guesses:

First guess: nofiles ulimit is probably too low.

ulimit -a will show you what they're set to. For some insane reason some distros set this to something like 1024, the idea being that a user shouldn't have more than 1024 open files.

IME on a few systems, snap has exacerbated this because of the way snap does things.

You can set this via /etc/security/limits.conf:

@usergroup hard nofile 16384
@usergroup soft nofile 8192

Then log out and back in (limits.conf is applied by pam_limits at login; sysctl -p won't reload it), then see if it happens again. :)
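To double-check the limit actually took effect in a given session (just a sanity check, nothing distro-specific):

```shell
# What this shell got:
ulimit -n
# Per-process view; works for any PID you can read under /proc:
grep 'Max open files' /proc/self/limits
```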

Second guess: Something is actually locking disk access for some reason (long syncs), which you can look at using something like iostat or iotop.

@fennix @_lennart

This is an excellent point! I wouldn't have thought of this as a possibility and it might explain some similar issues I've been frustrated with for a long time.

However, this change doesn't affect a graphical user login. You'd have to do an su - <user> and then the new limit would apply.

I confirmed this by running ulimit -a after logging in and seeing the 1024 limit; only after doing the su to my user did I see the limit increased.

To make the change as well for graphical user login, modify both /etc/systemd/user.conf and /etc/systemd/system.conf by adding the following line:

DefaultLimitNOFILE=16384

At least, this is my understanding

@Jerry @_lennart oh because of course systemd does its own thing... 🤦‍♂️
@_lennart anything in dmesg or logs? Sounds like something down in the kernel or maybe hardware issues.

@_lennart What does dmesg have to say when it's happening? I had a Raspberry Pi that did this because there was a DMA bug in the CPU that needed a workaround. Lots of I/O errors at the time.

Also, what's your 'swappiness' set to, and are you seeing swap in-use go up (even if you have plenty of memory available) at the time?

Are you running a distro that defaults to Transparent Hugepages (e.g., Fedora/CentOS/RHEL), with a workload somewhere that's not good with THP (e.g., redis)?

Can you pop a terminal open with 'sudo top' running and see if maybe a particular kernel process is associated with whatever is going on?
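On the 'sudo top' point: processes stuck in uninterruptible sleep (state D) are the usual smoking gun for blocked I/O, and you can list them directly, e.g.:

```shell
# Tasks in uninterruptible sleep, with the kernel function they're
# blocked in (wchan) when the kernel exposes it.
ps -eo state,pid,wchan:30,comm | awk 'NR==1 || $1 ~ /^D/'
```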

@_lennart I had this exact problem for weeks. Turned out to be a stale sshfs mount that I'd totally forgotten about, triggering a weird I/O hang. I don't remember exactly how I tracked down the fix, but it involved forcing it to unmount. Adding "sshfs" to your searches might help.
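For anyone chasing the same thing, a sketch of how to spot a stale fuse/network mount (the umount path is obviously a placeholder):

```shell
# List fuse/sshfs/NFS/CIFS mounts straight from the kernel's view:
awk '$3 ~ /fuse|nfs|cifs/ {print $1, "on", $2, "type", $3}' /proc/mounts
# If one of them is dead, a lazy unmount usually unwedges things:
# sudo umount -l /path/to/stale/mount
```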
@tychotithonus oh I think something like that with fusefs happened to me in 2007 or so! :D
@_lennart my first instinct is that there's something wrong with the storage. Anything in journalctl when it happens? Maybe install a different kernel version, on the off chance you're hitting an obscure kernel bug. If you're not backing up your data, I'd start right now. Maybe try a different distro and see if you see the same problems there.
@_lennart I've encountered this sort of behavior when my hard drive was having persistent errors.
@_lennart smells like a bad SSD/SATA cable.
@_lennart Hmmm, definitely not normal behavior. Did you check for HW issue? HDD/SSD health stats for your drives?
@_lennart Do you have any network storage mounted?

@_lennart
The first thing that comes to mind when file writes are slow is that you have a failing drive, or perhaps an older SSD which is behaving badly and in need of a TRIM. It's reasonable for read performance to be fine if the problem is with writing.

Slow drawing of GUI elements suggests that there's a performance bottleneck somewhere -- if you are having I/O troubles then that could be a side effect, but it may also be something completely unrelated.

The whole thing may just be a graphical problem which is slowing down everything else. Disabling the GUI entirely and switching to a text VT to retest may at least give you a direction to look in.

I would start by taking another look at your top / iostat / sar / or whatever report. You already mentioned that memory and CPU are good, but if you are seeing high iowait times then that would point to a problem even if the reads/s and writes/s are low. If you have it configured, "sar" can give you detailed reports going back days or weeks, which is helpful for problems which won't stand still when you want them to. The "dmesg" output should include any warnings thrown by the kernel over non-responsive devices or processes, so that's always worth checking on.
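If sysstat isn't installed, /proc/diskstats gives a rough per-device view with no extra tools: counting from the device name, the 9th and 10th stat fields ($12 and $13 in awk terms) are I/Os currently in flight and total milliseconds spent doing I/O.

```shell
# Per-device I/O snapshot; skip loop/ram pseudo-devices.
# inflight stuck above zero while io_ms keeps climbing = the device is stalling.
awk '$3 !~ /^(loop|ram)/ {print $3, "inflight="$12, "io_ms="$13}' /proc/diskstats
```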

Another possibility is thermal problems. If anything is overheating, that could easily lead to throttling, which would look (correctly) like a massive slowdown of the entire system. "sensors" may give you some helpful input there if it has been configured, and if nothing else works you should be able to see all of your internal temperatures in the BIOS.

Whatever it is, good luck.

@_lennart If there's nothing obvious in the logs and you've tried testing storage and ram health, it might be helpful to collect deeper metrics with sysstat/sar or other tools... assuming that they'll be able to save their output 🤔

Check out https://www.brendangregg.com/Perf/linux_observability_tools.png for what tools to use where.