I see some comments about soft lockups during memory pressure. I have struggled with this immensely over the years. I wrote a userspace memory reclaimer daemon and have not had a lockup since: https://gist.github.com/EBADBEEF/f168458028f684a91148f4d3e79... .
The hangs usually happened when I was stressing VFS (the computer was a samba server) along with other workloads. To trigger a hang manually I would read in large files (bigger than available ram) in parallel while running a game. I could get it to hang even with 128GB ram. I tweaked all the vfs settings (swappiness, etc...) to no avail. I tried with and without swap.
In the end it looked like memory was not getting reclaimed fast enough, like linux would wait too long to start reclaiming memory and some critical process would get stuck waiting for some memory. The system would hang for minutes or hours at a time only making the tiniest of progress between reclaims.
If I caught the problem early enough (just as everything started stuttering) I could trigger a reclaim manually by writing to '/sys/fs/cgroup/memory.reclaim' and the system would recover. I wonder if it was specific to btrfs or some specific workload pattern but I was never able to figure it out.
The hangs usually happened when I was stressing VFS (the computer was a samba server) along with other workloads. To trigger a hang manually I would read in large files (bigger than available ram) in parallel while running a game. I could get it to hang even with 128GB ram. I tweaked all the vfs settings (swappiness, etc...) to no avail. I tried with and without swap.
In the end it looked like memory was not getting reclaimed fast enough, like linux would wait too long to start reclaiming memory and some critical process would get stuck waiting for some memory. The system would hang for minutes or hours at a time only making the tiniest of progress between reclaims.
If I caught the problem early enough (just as everything started stuttering) I could trigger a reclaim manually by writing to '/sys/fs/cgroup/memory.reclaim' and the system would recover. I wonder if it was specific to btrfs or some specific workload pattern but I was never able to figure it out.