On Mon 09-04-18 16:05:09, Vlastimil Babka wrote:
> On 04/06/2018 08:07 PM, Stefan Priebe - Profihost wrote:
>> under memory pressure on a hypervisor running ksmd I had two deadlocks
>> today where the machines rebooted due to lockups.
> I see some of them waiting in lru_add_drain_all(); I recall some recent
> upstream issues with that. Michal might know more, adding him to CC.
Right. There were some locking changes which cured theoretical deadlocks
on the cpu hotplug locks. Then we created a dedicated workqueue for lru
draining (ce612879ddc78). The traces below show that lru_add_drain_all is
blocked waiting for a kworker to finish its job. If there are many work
items and all the kworkers are busy, then it can take quite some time.
ksm_scan_thread shouldn't be critical to system operation, so I do not
think that lru_add_drain_all is really the bottleneck here. I suspect
that it just shows up as a victim of the overall problem with the
system. How many workers are there, and how busy are they? SysRq-t
should display that, IIRC.
Other tasks seem to be blocked in the VFS layer (maybe on i_mutex). It
is hard to tell who they are waiting for from the given list.
I had this trace on 3 different servers in a row while they all had
memory pressure due to a lot of virtual machine migrations. All of them
are production servers, so I'm not really willing to reproduce this...
Might it help to post the traces from the other two servers as well?