[opensuse-kernel] SLES12-SP3 or opensuse 42.3 deadlocks under memory pressure?
Hello, under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups. Kernel was build from commit: 4fe1fb26557d69a4d0397113eb89d2dd7d6021b4 What i could grab from netconsole is: 018-04-05 18:05:45 Leftover inexact backtrace: 2018-04-05 18:05:45 2018-04-05 18:05:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:05:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:05:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 2018-04-05 18:05:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 2018-04-05 18:05:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 2018-04-05 18:05:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 2018-04-05 18:05:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:05:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:05:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:05:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:05:45 Call Trace: 2018-04-05 18:05:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 2018-04-05 18:05:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 2018-04-05 18:05:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 2018-04-05 18:05:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 2018-04-05 18:05:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:05:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:05:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 2018-04-05 18:07:45 Leftover inexact backtrace: 2018-04-05 18:07:45 2018-04-05 18:07:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:07:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:07:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 2018-04-05 18:07:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 2018-04-05 18:07:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 2018-04-05 18:07:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 2018-04-05 18:07:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:07:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:07:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:07:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:07:45 Call Trace: 2018-04-05 18:07:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 2018-04-05 18:07:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 2018-04-05 18:07:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 2018-04-05 18:07:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 2018-04-05 18:09:45 Leftover inexact backtrace: 2018-04-05 18:09:45 2018-04-05 18:09:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:09:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:09:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 2018-04-05 18:09:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 2018-04-05 18:09:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 2018-04-05 18:09:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 2018-04-05 18:09:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:09:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:09:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:09:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:09:45 Call Trace: 2018-04-05 18:09:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 2018-04-05 18:09:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 2018-04-05 18:09:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 2018-04-05 18:09:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 2018-04-05 18:09:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:09:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 2018-04-05 18:11:45 Leftover inexact backtrace: 2018-04-05 18:11:45 2018-04-05 18:11:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:11:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:11:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 2018-04-05 18:11:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 2018-04-05 18:11:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 2018-04-05 18:11:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 2018-04-05 18:11:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:11:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:11:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:11:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:11:45 Call Trace: 2018-04-05 18:11:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 2018-04-05 18:11:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 2018-04-05 18:11:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 2018-04-05 18:11:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 2018-04-05 18:11:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:11:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:11:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 2018-04-05 18:13:45 Leftover inexact backtrace: 2018-04-05 18:13:45 2018-04-05 18:13:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:13:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:13:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 2018-04-05 18:13:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 2018-04-05 18:13:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 2018-04-05 18:13:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 2018-04-05 18:13:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:13:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:13:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:13:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:13:45 Call Trace: 2018-04-05 18:13:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 2018-04-05 18:13:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 2018-04-05 18:13:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 2018-04-05 18:13:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 2018-04-05 18:13:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:13:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:13:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 2018-04-05 18:15:45 Leftover inexact backtrace: 2018-04-05 18:15:45 2018-04-05 18:15:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:15:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:15:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 2018-04-05 18:15:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 2018-04-05 18:15:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 2018-04-05 18:15:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 2018-04-05 18:15:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:15:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:15:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:15:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:15:45 Call Trace: 2018-04-05 18:15:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 2018-04-05 18:15:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 2018-04-05 18:15:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 2018-04-05 18:15:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 2018-04-05 18:15:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:15:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:15:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 2018-04-05 18:15:45 Leftover inexact backtrace: 2018-04-05 18:15:45 2018-04-05 18:15:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:15:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:15:45 [<ffffffff811d71c6>] SyS_write+0x46/0xa0 2018-04-05 18:15:45 [<ffffffff811d60a9>] vfs_write+0xa9/0x190 2018-04-05 18:15:45 [<ffffffff811d548b>] __vfs_write+0x2b/0x130 2018-04-05 18:15:45 [<ffffffff81250903>] kernfs_fop_write+0x143/0x180 2018-04-05 18:15:45 [<ffffffff81250d5c>] sysfs_kf_write+0x3c/0x50 2018-04-05 18:15:45 [<ffffffff813d44d2>] kobj_attr_store+0x12/0x20 2018-04-05 18:15:45 [<ffffffff811b21c8>] run_store+0x48/0x2c0 2018-04-05 18:15:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:15:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:15:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:15:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:15:45 Call Trace: 2018-04-05 18:15:45 ffffffff81e8d588 ffff883e8d0afd20 ffffffff816e15f5 ffffffff81e8d580 2018-04-05 18:15:45 ffff883e8d0b0000 ffffffff81e8d584 ffff887f7a228000 00000000ffffffff 2018-04-05 18:15:45 ffff883e8d0afd08 ffffffff811d3ca2 ffff883e0000001e ffff887f7a228000 2018-04-05 18:15:45 ksmtuned D ffff883e8d0afd08 0 2259 1 0x00080000 2018-04-05 18:15:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:15:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:15:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 2018-04-05 18:15:45 [<ffffffff810a49f0>] ? kthread_park+0x60/0x60 2018-04-05 18:15:45 Leftover inexact backtrace: 2018-04-05 18:15:45 2018-04-05 18:15:45 DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 2018-04-05 18:15:45 [<ffffffff816e60c5>] ret_from_fork+0x55/0x80 2018-04-05 18:15:45 [<ffffffff810a4add>] kthread+0xed/0x110 2018-04-05 18:15:45 [<ffffffff811b32c5>] ksm_scan_thread+0x85/0x1b0 2018-04-05 18:15:45 [<ffffffff811b2adb>] ksm_do_scan+0x69b/0xe00 2018-04-05 18:15:45 [<ffffffff8116b18d>] lru_add_drain_all+0x13d/0x190 2018-04-05 18:15:45 [<ffffffff8109ecae>] flush_work+0xfe/0x170 2018-04-05 18:15:45 [<ffffffff816e2998>] wait_for_completion+0xa8/0x110 2018-04-05 18:15:45 [<ffffffff816e4a4a>] schedule_timeout+0x23a/0x2d0 2018-04-05 18:15:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:15:45 Call Trace: 2018-04-05 18:15:45 0000000000000038 ffff883f7fbb7c20 ffffffff816e15f5 7fffffffffffffff 2018-04-05 18:15:45 ffff883f7fbb8000 ffff883f7fbb7d58 ffff883f7fbb7d50 ffff883f7fbfa580 2018-04-05 18:15:45 ffff883f7fbb7c08 ffff887f7dad81c0 ffff883f00000031 ffff883f7fbfa580 2018-04-05 18:15:45 ksmd D ffff883f7fbb7c08 0 409 2 0x00080000 2018-04-05 18:15:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:15:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:15:45 INFO: task ksmd:409 blocked for more than 120 seconds. 2018-04-05 18:17:45 Leftover inexact backtrace: 2018-04-05 18:17:45 2018-04-05 18:17:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:17:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:17:45 [<ffffffff811d71c6>] SyS_write+0x46/0xa0 2018-04-05 18:17:45 [<ffffffff811d60a9>] vfs_write+0xa9/0x190 2018-04-05 18:17:45 [<ffffffff811d548b>] __vfs_write+0x2b/0x130 2018-04-05 18:17:45 [<ffffffff81250903>] kernfs_fop_write+0x143/0x180 2018-04-05 18:17:45 [<ffffffff81250d5c>] sysfs_kf_write+0x3c/0x50 2018-04-05 18:17:45 [<ffffffff813d44d2>] kobj_attr_store+0x12/0x20 2018-04-05 18:17:45 [<ffffffff811b21c8>] run_store+0x48/0x2c0 2018-04-05 18:17:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:17:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:17:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:17:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:17:45 Call Trace: 2018-04-05 18:17:45 ffffffff81e8d588 ffff883e8d0afd20 ffffffff816e15f5 ffffffff81e8d580 2018-04-05 18:17:45 ffff883e8d0b0000 ffffffff81e8d584 ffff887f7a228000 00000000ffffffff 2018-04-05 18:17:45 ffff883e8d0afd08 ffffffff811d3ca2 ffff883e0000001e ffff887f7a228000 2018-04-05 18:17:45 ksmtuned D ffff883e8d0afd08 0 2259 1 0x00080000 2018-04-05 18:17:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:17:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:17:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 2018-04-05 18:17:45 [<ffffffff810a49f0>] ? kthread_park+0x60/0x60 2018-04-05 18:17:45 Leftover inexact backtrace: 2018-04-05 18:17:45 2018-04-05 18:17:45 DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 2018-04-05 18:17:45 [<ffffffff816e60c5>] ret_from_fork+0x55/0x80 2018-04-05 18:17:45 [<ffffffff810a4add>] kthread+0xed/0x110 2018-04-05 18:17:45 [<ffffffff811b32c5>] ksm_scan_thread+0x85/0x1b0 2018-04-05 18:17:45 [<ffffffff811b2adb>] ksm_do_scan+0x69b/0xe00 2018-04-05 18:17:45 [<ffffffff8116b18d>] lru_add_drain_all+0x13d/0x190 2018-04-05 18:17:45 [<ffffffff8109ecae>] flush_work+0xfe/0x170 2018-04-05 18:17:45 [<ffffffff816e2998>] wait_for_completion+0xa8/0x110 2018-04-05 18:17:45 [<ffffffff816e4a4a>] schedule_timeout+0x23a/0x2d0 2018-04-05 18:17:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:17:45 Call Trace: 2018-04-05 18:17:45 0000000000000038 ffff883f7fbb7c20 ffffffff816e15f5 7fffffffffffffff 2018-04-05 18:17:45 ffff883f7fbb8000 ffff883f7fbb7d58 ffff883f7fbb7d50 ffff883f7fbfa580 2018-04-05 18:17:45 ffff883f7fbb7c08 ffff887f7dad81c0 ffff883f00000031 ffff883f7fbfa580 2018-04-05 18:17:45 ksmd D ffff883f7fbb7c08 0 409 2 0x00080000 2018-04-05 18:17:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:17:45 INFO: task ksmd:409 blocked for more than 120 seconds. Greets, Stefan -- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
On Fri, 06 Apr 2018 20:07:12 +0200, Stefan Priebe - Profihost AG wrote:
Hello,
under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups.
Kernel was build from commit: 4fe1fb26557d69a4d0397113eb89d2dd7d6021b4
It's relatively new. Do you have any slightly older one that still seemed working? If it's a regression, narrower range would be easier to debug, of course. thanks, Takashi -- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
Am 06.04.2018 um 20:21 schrieb Takashi Iwai:
On Fri, 06 Apr 2018 20:07:12 +0200, Stefan Priebe - Profihost AG wrote:
Hello,
under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups.
Kernel was build from commit: 4fe1fb26557d69a4d0397113eb89d2dd7d6021b4
It's relatively new. Do you have any slightly older one that still seemed working? If it's a regression, narrower range would be easier to debug, of course.
Only ee22fbcbed88635cd7e4f3d8cce91caa790801ff which was running fine before. As i can't reproduce this on the production systems i can't bisect... Nothing interesting from the traces? Greets, Stefan -- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
On Fri 2018-04-06 20:07:12, Stefan Priebe - Profihost AG wrote:
Hello,
under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups.
Kernel was build from commit: 4fe1fb26557d69a4d0397113eb89d2dd7d6021b4
2018-04-05 18:15:45 ksmtuned D ffff883e8d0afd08 0 2259 1 0x00080000 2018-04-05 18:15:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:15:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:15:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 2018-04-05 18:15:45 [<ffffffff810a49f0>] ? kthread_park+0x60/0x60 2018-04-05 18:15:45 Leftover inexact backtrace: 2018-04-05 18:15:45 2018-04-05 18:15:45 DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 2018-04-05 18:15:45 [<ffffffff816e60c5>] ret_from_fork+0x55/0x80 2018-04-05 18:15:45 [<ffffffff810a4add>] kthread+0xed/0x110 2018-04-05 18:15:45 [<ffffffff811b32c5>] ksm_scan_thread+0x85/0x1b0 2018-04-05 18:15:45 [<ffffffff811b2adb>] ksm_do_scan+0x69b/0xe00 2018-04-05 18:15:45 [<ffffffff8116b18d>] lru_add_drain_all+0x13d/0x190 2018-04-05 18:15:45 [<ffffffff8109ecae>] flush_work+0xfe/0x170 2018-04-05 18:15:45 [<ffffffff816e2998>] wait_for_completion+0xa8/0x110 2018-04-05 18:15:45 [<ffffffff816e4a4a>] schedule_timeout+0x23a/0x2d0 2018-04-05 18:15:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:15:45 Call Trace: 2018-04-05 18:15:45 0000000000000038 ffff883f7fbb7c20 ffffffff816e15f5 7fffffffffffffff 2018-04-05 18:15:45 ffff883f7fbb8000 ffff883f7fbb7d58 ffff883f7fbb7d50 ffff883f7fbfa580 2018-04-05 18:15:45 ffff883f7fbb7c08 ffff887f7dad81c0 ffff883f00000031 ffff883f7fbfa580 2018-04-05 18:15:45 ksmd D ffff883f7fbb7c08 0 409 2 0x00080000 2018-04-05 18:15:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:15:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:15:45 INFO: task ksmd:409 blocked for more than 120 seconds. 2018-04-05 18:17:45 Leftover inexact backtrace: 2018-04-05 18:17:45 2018-04-05 18:17:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:17:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 2018-04-05 18:17:45 [<ffffffff811d71c6>] SyS_write+0x46/0xa0 2018-04-05 18:17:45 [<ffffffff811d60a9>] vfs_write+0xa9/0x190 2018-04-05 18:17:45 [<ffffffff811d548b>] __vfs_write+0x2b/0x130 2018-04-05 18:17:45 [<ffffffff81250903>] kernfs_fop_write+0x143/0x180 2018-04-05 18:17:45 [<ffffffff81250d5c>] sysfs_kf_write+0x3c/0x50 2018-04-05 18:17:45 [<ffffffff813d44d2>] kobj_attr_store+0x12/0x20 2018-04-05 18:17:45 [<ffffffff811b21c8>] run_store+0x48/0x2c0 2018-04-05 18:17:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 2018-04-05 18:17:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 2018-04-05 18:17:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 2018-04-05 18:17:45 [<ffffffff816e15f5>] schedule+0x35/0x80
BTW: I am curious that the backtrace is in the reverse order. It is pretty consusing. Has anyone seen this yet? I wonder which is the order of the other messages. I mean if this backtrace is related to the above: 2018-04-05 18:15:45 INFO: task ksmd:409 blocked for more than 120 seconds. or the below: 2018-04-05 18:17:45 ksmtuned D ffff883e8d0afd08 0 2259 1 0x00080000 Best Regards, Petr
2018-04-05 18:17:45 Call Trace: 2018-04-05 18:17:45 ffffffff81e8d588 ffff883e8d0afd20 ffffffff816e15f5 ffffffff81e8d580 2018-04-05 18:17:45 ffff883e8d0b0000 ffffffff81e8d584 ffff887f7a228000 00000000ffffffff 2018-04-05 18:17:45 ffff883e8d0afd08 ffffffff811d3ca2 ffff883e0000001e ffff887f7a228000 2018-04-05 18:17:45 ksmtuned D ffff883e8d0afd08 0 2259 1 0x00080000 2018-04-05 18:17:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:17:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:17:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 2018-04-05 18:17:45 [<ffffffff810a49f0>] ? kthread_park+0x60/0x60 2018-04-05 18:17:45 Leftover inexact backtrace: 2018-04-05 18:17:45 2018-04-05 18:17:45 DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 2018-04-05 18:17:45 [<ffffffff816e60c5>] ret_from_fork+0x55/0x80 2018-04-05 18:17:45 [<ffffffff810a4add>] kthread+0xed/0x110 2018-04-05 18:17:45 [<ffffffff811b32c5>] ksm_scan_thread+0x85/0x1b0 2018-04-05 18:17:45 [<ffffffff811b2adb>] ksm_do_scan+0x69b/0xe00 2018-04-05 18:17:45 [<ffffffff8116b18d>] lru_add_drain_all+0x13d/0x190 2018-04-05 18:17:45 [<ffffffff8109ecae>] flush_work+0xfe/0x170 2018-04-05 18:17:45 [<ffffffff816e2998>] wait_for_completion+0xa8/0x110 2018-04-05 18:17:45 [<ffffffff816e4a4a>] schedule_timeout+0x23a/0x2d0 2018-04-05 18:17:45 [<ffffffff816e15f5>] schedule+0x35/0x80 2018-04-05 18:17:45 Call Trace: 2018-04-05 18:17:45 0000000000000038 ffff883f7fbb7c20 ffffffff816e15f5 7fffffffffffffff 2018-04-05 18:17:45 ffff883f7fbb8000 ffff883f7fbb7d58 ffff883f7fbb7d50 ffff883f7fbfa580 2018-04-05 18:17:45 ffff883f7fbb7c08 ffff887f7dad81c0 ffff883f00000031 ffff883f7fbfa580 2018-04-05 18:17:45 ksmd D ffff883f7fbb7c08 0 409 2 0x00080000 2018-04-05 18:17:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:17:45 INFO: task ksmd:409 blocked for more than 120 seconds.
Greets, Stefan -- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
-- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
On 04/09/2018, 11:31 AM, Petr Mladek wrote:
2018-04-05 18:17:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2018-04-05 18:17:45 Not tainted 4.4.126+557-ph #1 2018-04-05 18:17:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds.
From the above and this:
pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n", t->comm, t->pid, timeout); pr_err(" %s %s %.*s\n", print_tainted(), init_utsname()->release, (int)strcspn(init_utsname()->version, " "), init_utsname()->version); pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\"" " disables this message.\n");
The whole log is reversed. thanks, -- js suse labs -- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
On 04/06/2018 08:07 PM, Stefan Priebe - Profihost AG wrote:
Hello,
under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups.
I see some of them wait in lru_add_drain_all(), I recall some recent issues with that upstream, Michal might know more, adding to CC.
Kernel was build from commit: 4fe1fb26557d69a4d0397113eb89d2dd7d6021b4
What i could grab from netconsole is:
Let me paste after running via tac to fix the reversed order of lines: 18:17:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:17:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:17:45 ksmd D ffff883f7fbb7c08 0 409 2 0x00080000 18:17:45 ffff883f7fbb7c08 ffff887f7dad81c0 ffff883f00000031 ffff883f7fbfa580 18:17:45 ffff883f7fbb8000 ffff883f7fbb7d58 ffff883f7fbb7d50 ffff883f7fbfa580 18:17:45 0000000000000038 ffff883f7fbb7c20 ffffffff816e15f5 7fffffffffffffff 18:17:45 Call Trace: 18:17:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:17:45 [<ffffffff816e4a4a>] schedule_timeout+0x23a/0x2d0 18:17:45 [<ffffffff816e2998>] wait_for_completion+0xa8/0x110 18:17:45 [<ffffffff8109ecae>] flush_work+0xfe/0x170 18:17:45 [<ffffffff8116b18d>] lru_add_drain_all+0x13d/0x190 18:17:45 [<ffffffff811b2adb>] ksm_do_scan+0x69b/0xe00 18:17:45 [<ffffffff811b32c5>] ksm_scan_thread+0x85/0x1b0 18:17:45 [<ffffffff810a4add>] kthread+0xed/0x110 18:17:45 [<ffffffff816e60c5>] ret_from_fork+0x55/0x80 18:17:45 DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 18:17:45 18:17:45 Leftover inexact backtrace: 18:17:45 [<ffffffff810a49f0>] ? kthread_park+0x60/0x60 18:17:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:17:45 Not tainted 4.4.126+557-ph #1 18:17:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:17:45 ksmtuned D ffff883e8d0afd08 0 2259 1 0x00080000 18:17:45 ffff883e8d0afd08 ffffffff811d3ca2 ffff883e0000001e ffff887f7a228000 18:17:45 ffff883e8d0b0000 ffffffff81e8d584 ffff887f7a228000 00000000ffffffff 18:17:45 ffffffff81e8d588 ffff883e8d0afd20 ffffffff816e15f5 ffffffff81e8d580 18:17:45 Call Trace: 18:17:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:17:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 18:17:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 18:17:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 18:17:45 [<ffffffff811b21c8>] run_store+0x48/0x2c0 18:17:45 [<ffffffff813d44d2>] kobj_attr_store+0x12/0x20 18:17:45 [<ffffffff81250d5c>] sysfs_kf_write+0x3c/0x50 18:17:45 [<ffffffff81250903>] kernfs_fop_write+0x143/0x180 18:17:45 [<ffffffff811d548b>] __vfs_write+0x2b/0x130 18:17:45 [<ffffffff811d60a9>] vfs_write+0xa9/0x190 18:17:45 [<ffffffff811d71c6>] SyS_write+0x46/0xa0 18:17:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 18:17:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 18:17:45 18:17:45 Leftover inexact backtrace: 18:15:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:15:45 Not tainted 4.4.126+557-ph #1 18:15:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:15:45 ksmd D ffff883f7fbb7c08 0 409 2 0x00080000 18:15:45 ffff883f7fbb7c08 ffff887f7dad81c0 ffff883f00000031 ffff883f7fbfa580 18:15:45 ffff883f7fbb8000 ffff883f7fbb7d58 ffff883f7fbb7d50 ffff883f7fbfa580 18:15:45 0000000000000038 ffff883f7fbb7c20 ffffffff816e15f5 7fffffffffffffff 18:15:45 Call Trace: 18:15:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:15:45 [<ffffffff816e4a4a>] schedule_timeout+0x23a/0x2d0 18:15:45 [<ffffffff816e2998>] wait_for_completion+0xa8/0x110 18:15:45 [<ffffffff8109ecae>] flush_work+0xfe/0x170 18:15:45 [<ffffffff8116b18d>] lru_add_drain_all+0x13d/0x190 18:15:45 [<ffffffff811b2adb>] ksm_do_scan+0x69b/0xe00 18:15:45 [<ffffffff811b32c5>] ksm_scan_thread+0x85/0x1b0 18:15:45 [<ffffffff810a4add>] kthread+0xed/0x110 18:15:45 [<ffffffff816e60c5>] ret_from_fork+0x55/0x80 18:15:45 DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 18:15:45 18:15:45 Leftover inexact backtrace: 18:15:45 [<ffffffff810a49f0>] ? kthread_park+0x60/0x60 18:15:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 Not tainted 4.4.126+557-ph #1 18:15:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:15:45 ksmtuned D ffff883e8d0afd08 0 2259 1 0x00080000 18:15:45 ffff883e8d0afd08 ffffffff811d3ca2 ffff883e0000001e ffff887f7a228000 18:15:45 ffff883e8d0b0000 ffffffff81e8d584 ffff887f7a228000 00000000ffffffff 18:15:45 ffffffff81e8d588 ffff883e8d0afd20 ffffffff816e15f5 ffffffff81e8d580 18:15:45 Call Trace: 18:15:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:15:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 18:15:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 18:15:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 18:15:45 [<ffffffff811b21c8>] run_store+0x48/0x2c0 18:15:45 [<ffffffff813d44d2>] kobj_attr_store+0x12/0x20 18:15:45 [<ffffffff81250d5c>] sysfs_kf_write+0x3c/0x50 18:15:45 [<ffffffff81250903>] kernfs_fop_write+0x143/0x180 18:15:45 [<ffffffff811d548b>] __vfs_write+0x2b/0x130 18:15:45 [<ffffffff811d60a9>] vfs_write+0xa9/0x190 18:15:45 [<ffffffff811d71c6>] SyS_write+0x46/0xa0 18:15:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 18:15:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 18:15:45 18:15:45 Leftover inexact backtrace: 18:15:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:15:45 Not tainted 4.4.126+557-ph #1 18:15:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:15:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 18:15:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 18:15:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 18:15:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 18:15:45 Call Trace: 18:15:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:15:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 18:15:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 18:15:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 18:15:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 18:15:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 18:15:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 18:15:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 18:15:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 18:15:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 18:15:45 18:15:45 Leftover inexact backtrace: 18:13:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:13:45 Not tainted 4.4.126+557-ph #1 18:13:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:13:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 18:13:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 18:13:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 18:13:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 18:13:45 Call Trace: 18:13:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:13:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 18:13:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 18:13:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 18:13:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 18:13:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 18:13:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 18:13:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 18:13:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 18:13:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 18:13:45 18:13:45 Leftover inexact backtrace: 18:11:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:11:45 Not tainted 4.4.126+557-ph #1 18:11:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:11:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 18:11:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 18:11:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 18:11:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 18:11:45 Call Trace: 18:11:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:11:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 18:11:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 18:11:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 18:11:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 18:11:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 18:11:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 18:11:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 18:11:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 18:11:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 18:11:45 18:11:45 Leftover inexact backtrace: 18:09:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:09:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:09:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 18:09:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 18:09:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 18:09:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 18:09:45 Call Trace: 18:09:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:09:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 18:09:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 18:09:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 18:09:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 18:09:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 18:09:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 18:09:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 18:09:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 18:09:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 18:09:45 18:09:45 Leftover inexact backtrace: 18:07:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:07:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 18:07:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 18:07:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 18:07:45 Call Trace: 18:07:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:07:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 18:07:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 18:07:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 18:07:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 18:07:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 18:07:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 18:07:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 18:07:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 18:07:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 18:07:45 18:07:45 Leftover inexact backtrace: 18:05:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:05:45 Not tainted 4.4.126+557-ph #1 18:05:45 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:05:45 pve-firewall D ffff887f69793c98 0 2914 1 0x00080000 18:05:45 ffff887f69793c98 ffff887f69793c68 ffffffff00000035 ffff887f735a0000 18:05:45 ffff887f69794000 ffff887f69748364 ffff887f735a0000 00000000ffffffff 18:05:45 ffff887f69748368 ffff887f69793cb0 ffffffff816e15f5 ffff887f69748360 18:05:45 Call Trace: 18:05:45 [<ffffffff816e15f5>] schedule+0x35/0x80 18:05:45 [<ffffffff816e18fe>] schedule_preempt_disabled+0xe/0x10 18:05:45 [<ffffffff816e3325>] __mutex_lock_slowpath+0x95/0x110 18:05:45 [<ffffffff816e33b7>] mutex_lock+0x17/0x30 18:05:45 [<ffffffff811e2ef4>] path_openat+0x3b4/0x1200 18:05:45 [<ffffffff811e591e>] do_filp_open+0x7e/0xe0 18:05:45 [<ffffffff811d504c>] do_sys_open+0x12c/0x210 18:05:45 [<ffffffff811d514e>] SyS_open+0x1e/0x20 18:05:45 [<ffffffff816e5c85>] entry_SYSCALL_64_fastpath+0x1e/0xc9 18:05:45 DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xc9 18:05:45 18:05:45 Leftover inexact backtrace: -- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
On Mon 2018-04-09 16:05:09, Vlastimil Babka wrote:
On 04/06/2018 08:07 PM, Stefan Priebe - Profihost AG wrote:
Hello,
under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups.
I just wonder if the system rebooted on its own or if some human rebooted it.
Let me paste after running via tac to fix the reversed order of lines:
18:17:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:17:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:15:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:13:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:11:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:09:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:07:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:05:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds.
It seems that pve-firewall was unblocked after 10 minutes. It is possible that ksmd and ksmtuned would get unblocked after several more minutes as well if the system was not rebooted.
I had this trace on 3 different Servers in a row while they all had memory pressure due to a lot of virtual machine migrations. All of them are production servers so i'm not really willing to reproduce this...
It might be similar to the traffic jam once a sports match or a concert finishes. It might cause unusually long delays that disappear once the unusual amount of people moves outside the small area. I do not say that there is not a bug somewhere or that the kernel could not do better. I just wonder if it might be reasonable to somehow limit also the number of migrations that happen at the same time. Best Regards, Petr -- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
Am 10.04.2018 um 09:37 schrieb Petr Mladek:
On Mon 2018-04-09 16:05:09, Vlastimil Babka wrote:
On 04/06/2018 08:07 PM, Stefan Priebe - Profihost AG wrote:
Hello,
under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups.
I just wonder if the system rebooted on its own or if some human rebooted it.
Let me paste after running via tac to fix the reversed order of lines:
18:17:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:17:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:15:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:13:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:11:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:09:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:07:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:05:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds.
It seems that pve-firewall was unblocked after 10 minutes. It is possible that ksmd and ksmtuned would get unblocked after several more minutes as well if the system was not rebooted.
Good question. I cannot guarantee that but we've set kernel.hung_task_panic = 0 kernel.softlockup_panic = 0 kernel.hardlockup_panic = 1 kernel.panic_on_oops = 1 kernel.unknown_nmi_panic = 1 kernel.panic = 20 May be it's related to this?
I had this trace on 3 different Servers in a row while they all had memory pressure due to a lot of virtual machine migrations. All of them are production servers so i'm not really willing to reproduce this...
It might be similar to the traffic jam once a sports match or a concert finishes. It might cause unusually long delays that disappear once the unusual amount of people moves outside the small area.
I do not say that there is not a bug somewhere or that the kernel could not do better. I just wonder if it might be reasonable to somehow limit also the number of migrations that happen at the same time.
Best Regards, Petr
-- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
On Tue 2018-04-10 15:44:09, Stefan Priebe - Profihost AG wrote:
Am 10.04.2018 um 09:37 schrieb Petr Mladek:
On Mon 2018-04-09 16:05:09, Vlastimil Babka wrote:
On 04/06/2018 08:07 PM, Stefan Priebe - Profihost AG wrote:
Hello,
under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups.
I just wonder if the system rebooted on its own or if some human rebooted it.
Let me paste after running via tac to fix the reversed order of lines:
18:17:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:17:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:15:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:13:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:11:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:09:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:07:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:05:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds.
It seems that pve-firewall was unblocked after 10 minutes. It is possible that ksmd and ksmtuned would get unblocked after several more minutes as well if the system was not rebooted.
Good question. I cannot guarantee that but we've set kernel.hung_task_panic = 0
The last message in the log is from the hung_task daemon. Therefore it should not cause the reboot.
kernel.softlockup_panic = 0
kernel.hardlockup_panic = 1 kernel.panic_on_oops = 1 kernel.unknown_nmi_panic = 1
kernel.panic = 20
This should give some time to see panic() messages on the console. They were not in the log. It means that either panic() was not called or it was not able to show them on the net console. I am not sure how reliable is netconsole during panic(). Anyway, consoles never 100% reliable. The kernel on SLE12 even defers the console handling to a kthread to avoid softlockups. This might cause even longer delay. It tries harder to flush them during panic() but it is still not guaranteed. Alternative way to see the entire log is the kernel crash dump. I wonder if you have it by chance. Best Regards, Petr PS: We should probably move to bugzilla. -- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
Am 10.04.2018 um 17:01 schrieb Petr Mladek:
On Tue 2018-04-10 15:44:09, Stefan Priebe - Profihost AG wrote:
Am 10.04.2018 um 09:37 schrieb Petr Mladek:
On Mon 2018-04-09 16:05:09, Vlastimil Babka wrote:
On 04/06/2018 08:07 PM, Stefan Priebe - Profihost AG wrote:
Hello,
under memory pressure on a hypervisor running ksmd i had two deadlocks today where the machines rebootet due to lockups.
I just wonder if the system rebooted on its own or if some human rebooted it.
Let me paste after running via tac to fix the reversed order of lines:
18:17:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:17:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 INFO: task ksmd:409 blocked for more than 120 seconds. 18:15:45 INFO: task ksmtuned:2259 blocked for more than 120 seconds. 18:15:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:13:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:11:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:09:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:07:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds. 18:05:45 INFO: task pve-firewall:2914 blocked for more than 120 seconds.
It seems that pve-firewall was unblocked after 10 minutes. It is possible that ksmd and ksmtuned would get unblocked after several more minutes as well if the system was not rebooted.
Good question. I cannot guarantee that but we've set kernel.hung_task_panic = 0
The last message in the log is from the hung_task daemon. Therefore it should not cause the reboot.
kernel.softlockup_panic = 0
kernel.hardlockup_panic = 1 kernel.panic_on_oops = 1 kernel.unknown_nmi_panic = 1
kernel.panic = 20
This should give some time to see panic() messages on the console. They were not in the log. It means that either panic() was not called or it was not able to show them on the net console.
I am not sure how reliable is netconsole during panic(). Anyway, consoles never 100% reliable. The kernel on SLE12 even defers the console handling to a kthread to avoid softlockups. This might cause even longer delay. It tries harder to flush them during panic() but it is still not guaranteed.
Alternative way to see the entire log is the kernel crash dump. I wonder if you have it by chance.
No i don't. I appreciate all your help and comments. I think i just cannot provide enough information in this case. I'm sorry about this. I'll report back and have more information if i see this again. Greets, Stefan
Best Regards, Petr
PS: We should probably move to bugzilla.
-- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
participants (5)
-
Jiri Slaby
-
Petr Mladek
-
Stefan Priebe - Profihost AG
-
Takashi Iwai
-
Vlastimil Babka