On 2016-10-04 20:25, Marc Chamberlin wrote:
On 10/3/2016 7:16 PM, Carlos E. R. wrote:
Thanks, John and Carlos, for your thoughts... I ran the smartctl checks against my disk drives and, yeah, one of the drives is showing some failures, although its overall summary reports that the drive is healthy. I noted that my swap partition was on the drive that is showing errors, so I decided to move the swap partition to a different drive to see if that would help. It didn't. I will go ahead and replace this drive just to be safe and report back if that indeed improves things...
Recently one of my disks developed an error. I figured out, from smartctl, that the affected sectors were in the swap partition. So I just did swapoff on that one, overwrote it entirely with zeroes, and ran smartctl again. Clear. Then I ran mkswap and activated it again. No issues. I am watching it to see whether the number of bad sectors increases, in which case I will consider replacing it.
However, for my two cents' worth, speaking as a computer scientist/engineer myself, I would agree with Carlos. Processes are suspended when they are waiting for I/O operations to complete. That is normal OS behavior. So if a disk drive is having a hard time reading or writing, then that would just result in the process being suspended for a longer period of time, while the drive (using its own internal processor) figures out how to solve the problem it is having. And a suspended process is not using CPU cycles. So methinks that disk drive errors should not result in high CPU usage, like I am seeing.
Right.
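To make that concrete: on Linux, a process blocked on I/O normally sits in state "D" (uninterruptible sleep), which counts toward the load average but not toward CPU usage. Below is a minimal sketch, assuming the usual /proc/<pid>/stat layout, that lists such processes. The function names are made up for illustration.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state

def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while we were looking
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling blocked_processes() a few times during a slow episode would show whether anything is stuck waiting on the disk, even though those processes use no CPU.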
Something is periodically and definitely using up a LOT of CPU time, and top, atop, and ksystemguard are not reporting why, although ksystemguard will graphically show that both of my CPU cores are running at or near 100% usage. And the total amount of CPU time being reported for each of the individual running processes does not add up to anything near 100% usage. Typically I see less than 15% usage total. Once in a while I will see the total of the systemd and/or kdeinit5 processes climb up higher during these episodes of 100% CPU usage, but again nowhere near 100%.
There is a toggle in top to show individual threads instead of one line per process. Wait a minute [...] Yes, it is "H".
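Another angle, when the per-process figures do not add up: the first line of /proc/stat splits total CPU time into categories such as user, system, iowait, irq and softirq. Here is a rough sketch, assuming the standard Linux field order, that samples it twice and prints the share of each category. The helper names are hypothetical.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (first eight fields).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")

def cpu_breakdown(before, after):
    """Percentage of the elapsed CPU time spent in each category."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values()) or 1  # avoid dividing by zero
    return {name: 100.0 * delta / total for name, delta in deltas.items()}

def report(interval=5):
    """Sample /proc/stat twice, 'interval' seconds apart, and print shares."""
    with open("/proc/stat") as f:
        first = parse_cpu_line(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        second = parse_cpu_line(f.read())
    for name, pct in cpu_breakdown(first, second).items():
        print(f"{name:8s} {pct:5.1f}%")

# Example: report(5)
```

If most of the time lands in irq or softirq, that would fit the idea that no ordinary process is to blame. A high iowait share points back at the disk instead.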
I can believe it is something in the kernel itself, which these monitoring tools may not know how to report on. Or perhaps my system has been compromised by one heck of a smart virus or trojan that is able to hide its activities, who knows. I do use it for a number of servers and it is exposed to the internet 24/7. I am not enough of an expert on the internals of the Linux kernel to know how to diagnose it or what tools I should/could use to figure out what it is doing.
One other symptom that is interesting is that the display can become unresponsive to mouse/cursor movements/clicks as well. I don't know whether Linux handles mouse events by interrupt handlers or polls the mouse, but either way I would expect mouse events to be handled at a fairly high priority. The fact that the display becomes unresponsive to these mouse events seems to further point towards the idea that something is going on within the kernel/OS itself.
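One way to test the interrupt theory is to compare two snapshots of /proc/interrupts and see which lines are counting up fastest. The following is only a sketch, assuming the usual layout (a header row of CPU names, then one row per interrupt with per-CPU counts and a description). The function names are invented for this example.

```python
def parse_interrupts(text):
    """Map each interrupt label to (total count across CPUs, description)."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interrupt row
        tokens = rest.split()
        counts = []
        for tok in tokens[:ncpu]:
            if not tok.isdigit():
                break  # some rows have fewer count columns than CPUs
            counts.append(int(tok))
        desc = " ".join(tokens[len(counts):])
        table[label.strip()] = (sum(counts), desc)
    return table

def busiest(before, after, top=5):
    """Return up to 'top' (delta, label, description), largest delta first."""
    rates = []
    for label, (count, desc) in after.items():
        delta = count - before.get(label, (0, ""))[0]
        if delta > 0:
            rates.append((delta, label, desc))
    return sorted(rates, reverse=True)[:top]
```

Reading /proc/interrupts once, waiting a few seconds during a bad episode, reading it again and passing both parsed results to busiest() should show whether one device is flooding the system with interrupts.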
I had a curious issue several years ago. The system was very, very busy. It turned out that it was process number 1, init. The culprit was my modem: when I unplugged it, CPU usage dropped to zero. I rebooted my modem, problem solved. Apparently my modem was triggering many interrupts on the serial port and, again apparently, they were being handled by init.

-- 
Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)