Computer freezing after a few days and not responding via SSH or physically

denismenchov · April 3, 2024, 2:36pm

Hello,
I have a computer with Plasma 5.24, and after a variable amount of time (anywhere from 1 day to 2 weeks) it freezes completely. After that the display is black, the keyboard and mouse do nothing to wake it up, and I can no longer access it via SSH as I usually can.
At first I thought it was related to suspend or hibernation, so I disabled them with
$ sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
Then I updated the graphics card driver, and then the BIOS.
I also checked the journalctl output, but I don’t really understand it…
Could you please guide me in debugging this?

sfiedler · April 3, 2024, 6:44pm

It freezes completely? Not even SSH? One possible cause is that the RAM is full (you could check memory usage with the KDE System Monitor).
The only other thing that I can think of would be a Linux Kernel bug, as SSH is nothing KDE related…
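
Since the machine cannot be inspected once it has frozen, one way to test the full-RAM theory is to record memory figures to disk at regular intervals and read the last entries after the reboot. The following is only a sketch under assumptions: the function name log_mem_snapshot and the log path are made up for illustration, and it relies on the standard Linux /proc/meminfo and /proc/loadavg files.

```shell
#!/bin/sh
# Hypothetical log location: pick any path on a disk that survives the reboot.
SNAPLOG="${SNAPLOG:-/var/tmp/mem-snapshots.log}"

log_mem_snapshot() {
    # Append one line per call: timestamp, available memory and free swap (kB),
    # and the 1-minute load average.
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    swap=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
    load=$(cut -d' ' -f1 /proc/loadavg)
    printf '%s mem_available_kb=%s swap_free_kb=%s load1=%s\n' \
        "$ts" "$avail" "$swap" "$load" >> "$SNAPLOG"
    # Flush to disk so the last line is less likely to be lost in a hard freeze.
    sync
}
```

It could be run in a loop such as `while true; do log_mem_snapshot; sleep 60; done`. If mem_available_kb trends toward zero in the last lines before a freeze, memory exhaustion becomes a more plausible suspect; if it stays high, it is less likely.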

jsalatas · April 3, 2024, 6:46pm

What kind of computer is that?

denismenchov · April 3, 2024, 7:33pm

It’s a Linux 22.04 which was upgraded from 21.04. Could I see the bug in the journalctl output?

denismenchov · April 3, 2024, 7:34pm

The problem is, when it freezes, my only solution is to restart the computer. I tried to enter command line mode on it without success.

jsalatas · April 3, 2024, 7:35pm

umm… that doesn’t help much

I was expecting to see some hardware specs and model.

denismenchov · April 3, 2024, 7:42pm

Yes of course.
Motherboard : ASUS PRIME X570-Pro
CPU : AMD 3950X
GPU : PNY QUADRO RTX4000
RAM : 2x32 GB CORSAIR DDR4
Storage : 2x CORSAIR FORCE MP600 SERIES NVME 1 TB
We configured RAID 0 for the disks
Hope it helps

jsalatas · April 3, 2024, 7:45pm

Make sure that your BIOS is up to date. Also check the output of the dmesg command (it will be long) for any errors/warnings.

guss77 · April 3, 2024, 8:04pm

I have an issue (now on Plasma 6, but it was happening with 5 as well) where kwin will occasionally eat 100% of the CPU and will be non-responsive. When that happens, nothing on the computer responds - I can’t even access the virtual terminal, but I can connect to the computer using SSH and if I killall -9 kwin_wayland - the system recovers.

There’s a bug open about that, if anyone is interested.

@denismenchov - is it possible that the failure to connect over SSH is due to another reason, such as network connectivity?

denismenchov · April 3, 2024, 8:46pm

The SSH failure is not related to the network, as I can connect to other computers on the same LAN… I will investigate the CPU and RAM.

denismenchov · April 4, 2024, 7:17am

I looked at the dmesg output, but it only lists messages from the current session, and as I can’t connect to the computer after it has frozen, I can’t see what went wrong.
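
dmesg only covers the running kernel, but when the systemd journal is stored persistently, the kernel messages from the boot that ended in the freeze can still be read back after the restart. A minimal sketch, assuming a persistent journal; the wrapper name prev_boot_kernel_warnings is made up for illustration, while the journalctl options are standard ones:

```shell
#!/bin/sh
prev_boot_kernel_warnings() {
    # -k          kernel messages only (what dmesg would have shown)
    # -b -1       the previous boot, i.e. the one that froze
    # -p warning  priority "warning" and anything more severe
    journalctl -k -b -1 -p warning -o short-precise --no-pager
}
```

Piping the result through `tail -n 50` shows the entries closest to the freeze. An empty result does not prove the kernel was healthy, since the final messages may never have reached the disk.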

denismenchov · April 4, 2024, 7:55am

Here is some of the output of the command journalctl -o short-precise -b -1 from before the crash.
The last line is ntpd[1238]: Soliciting pool server XXX...
Before that:

CRON[110966]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
CRON[110967]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
CRON[110966]: pam_unix(cron:session): session closed for user root

then

AUDIT_TRAIL|Centrify Suite|Trusted Path|1.0|2700|Trusted path granted|5|user=XXX pid=XXX utc=XXX centrifyEventID=XXX DASessID=N/A DAInst=N/A status=GRANTED>
<bg-MAIN:CloudConnectorsRefresh> base.cloudconnector.locator Centrify Agent will not use Centrify Connector 'XXX' to provide connectivity to Centrify Identity Platform UR>

and before

audit[XXX]: AVC apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=XXX comm="cups-browsed" capability=23  capname="sys_nice"
kernel: audit: type=1400 audit(XXX:XXX): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=XXX comm="cups-browsed" capability=23  capname="sys_nice"

mlincett · April 4, 2024, 7:57am

Using journalctl -b -1 one should be able to access the system logs from previous boots. But on the chance that the failure is disk-related, the last log messages may be lost. In such cases, one could either redirect the system log to another machine or open a permanent ssh session to the faulty machine and tail the system log from there until the fault occurs.
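
The second suggestion, a permanent SSH session that follows the log, could look roughly like the sketch below, run from another machine on the same LAN. The user, host name, log path, and the function name follow_remote_journal are placeholders chosen for illustration, not values from this thread.

```shell
#!/bin/sh
# Hypothetical values: replace with the real user, host, and local log path.
REMOTE="${REMOTE:-user@frozen-host}"
LOGFILE="${LOGFILE:-$HOME/frozen-host-journal.log}"
SSH="${SSH:-ssh}"

follow_remote_journal() {
    # journalctl -f keeps printing new entries as they arrive;
    # -o short-precise adds microsecond timestamps.
    # ServerAliveInterval lets the client notice when the remote stops answering.
    # tee keeps a local copy that survives the remote freeze.
    "$SSH" -o ServerAliveInterval=15 "$REMOTE" \
        'journalctl -f -o short-precise' | tee -a "$LOGFILE"
}
```

Calling follow_remote_journal and leaving it running means the last lines in the local file are whatever the faulty machine managed to send before it stopped, independent of whether its own disks recorded them.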