On 09.04.2018 at 19:29, Michal Hocko wrote:
On Mon 09-04-18 16:39:11, Stefan Priebe - Profihost AG wrote:
On 09.04.2018 at 16:19, Michal Hocko wrote:
On Mon 09-04-18 16:05:09, Vlastimil Babka wrote:
On 04/06/2018 08:07 PM, Stefan Priebe - Profihost AG wrote:
Under memory pressure on a hypervisor running ksmd, I had two deadlocks today where the machines rebooted due to lockups.
I see some of them wait in lru_add_drain_all(), I recall some recent issues with that upstream, Michal might know more, adding to CC.
Right. There were some locking changes which cure theoretical deadlocks on cpu hotplug locks. Then we have created a dedicated workqueue for lru draining (ce612879ddc78). The traces below show that lru_add_drain_all is blocked waiting for a kworker to finish its job. If there are many work items and all the kworkers are busy, then it can take quite some time. ksm_scan_thread shouldn't be critical to the system operation, so I do not think that lru_add_drain_all is really a bottleneck here. I suspect that it just shows up as a victim of the overall problem with the system. How many workers are there and how busy are they? Sysrq+t should display that IIRC.
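For checking how many kworkers there are and how busy they look, something like the following generic diagnostic sketch could be used (these are standard procps/sysrq commands, not taken from the thread; the sysrq write needs root and only makes sense on the affected host):

```shell
# Count kworker threads per scheduler state (R = running, D = uninterruptible
# sleep, S = interruptible sleep). Many D-state kworkers under memory pressure
# would support the "workers are all busy" theory.
ps -eo state,comm | awk '$2 ~ /^kworker/ {count[$1]++} END {for (s in count) print s, count[s]}'

# Dump all task stack traces to the kernel log (the Sysrq+t mentioned above);
# requires root and kernel.sysrq enabled:
# echo t > /proc/sysrq-trigger
# dmesg | less
```

The per-state breakdown is usually enough to tell whether the kworker pool itself is saturated or whether lru_add_drain_all is merely queued behind unrelated work.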
Other tasks seem to be blocked in the VFS layer (maybe on an i_mutex lock). Hard to tell who they are waiting for from the given list.
I had this trace on 3 different servers in a row while they all had memory pressure due to a lot of virtual machine migrations. All of them are production servers, so I'm not really willing to reproduce this...
Might it help to post the traces from the other two servers as well?
There is always a chance there will be some pattern there. Btw. what is the hypervisor? I was not on the CC from the beginning, so apologies if this has been answered already.
The hypervisor is kvm / qemu - but sadly I can't provide the other logs. A colleague deleted the history over the weekend to fix a reverse-order bug in our netconsole server.
Sorry. I'll post again if I see this error again.