[kernel-bugs] [Bug 1172997] New: kernel panic about "SError Interrupt on CPU…" on aarch64, 4.12.14-lp151.28.48-default
http://bugzilla.opensuse.org/show_bug.cgi?id=1172997

Bug ID: 1172997
Summary: kernel panic about "SError Interrupt on CPU…" on aarch64, 4.12.14-lp151.28.48-default
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.1
Hardware: aarch64
OS: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: okurz@suse.com
QA Contact: qa-bugs@suse.de
CC: afaerber@suse.com, mbrugger@suse.com, ptesarik@suse.com, tiwai@suse.com, yousaf.kaukab@suse.com
Found By: ---
Blocker: ---

+++ This bug was initially created as a clone of Bug #1162612 +++

## Observation

The SUSE internal machine "openqaworker-arm-3" is an aarch64 machine (model: Cavium ThunderX CN88XX) running openSUSE Leap 15.1 with kernel 4.12.14-lp151.28.48-default. It has crashed multiple times with a kernel panic and error reports about "SError Interrupt on CPU…", for example:

```
Apr 28 15:17:30 [256898.042772] SError Interrupt on CPU81, code 0xbe000000 -- SError
Apr 28 15:17:30 [256898.042775] CPU: 81 PID: 24143 Comm: worker Tainted: G W 4.12.14-lp151.28.44-default #1 openSUSE Leap 15.1
Apr 28 15:17:30 [256898.042777] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
Apr 28 15:17:30 [256898.042778] task: ffff810cc501c080 task.stack: ffff810cc829c000
Apr 28 15:17:30 [256898.042779] pstate: 60000005 (nZCv daif -PAN -UAO)
Apr 28 15:17:30 [256898.042780] pc : ptep_set_access_flags+0xc0/0x100
Apr 28 15:17:30 [256898.042781] lr : 0xe0010bbfeb9bd3
Apr 28 15:17:30 [256898.042782] sp : ffff810cc829fca0
Apr 28 15:17:30 [256898.042784] x29: ffff810cc829fca0 x28: ffff810cc501c080
Apr 28 15:17:30 [256898.042787] x27: ffff810cc2443468 x26: 0000000000000007
Apr 28 15:17:30 [256898.042789] x25: ffff810f6f63d0c8 x24: 0000000000000054
Apr 28 15:17:30 [256898.042792] x23: ffff810cc2443400 x22: ffff810f6f63d0c8
Apr 28 15:17:30 [256898.042794] x21: 62c6000aaab37e3b x20: 00e0010bbfeb9fd3
Apr 28 15:17:30 [256898.042796] x19: ffff810cce57b1d8 x18: 0000000000000000
Apr 28 15:17:30 [256898.042799] x17: 0000ffffa9823238 x16: 0000aaaaeabfd350
Apr 28 15:17:30 [256898.042801] x15: 000040c4034d32e5 x14: 0000000000000000
Apr 28 15:17:30 [256898.042803] x13: 8080c08013012026 x12: 0000000000000000
Apr 28 15:17:30 [256898.042806] x11: 0000000000000040 x10: 000000000000000e
Apr 28 15:17:30 [256898.042808] x9 : 000000000000ff00 x8 : 0000ffffa937bac8
Apr 28 15:17:30 [256898.042810] x7 : 0000aaab1bb72d48 x6 : 0000000000000024
Apr 28 15:17:30 [256898.042813] x5 : 00e0010bbfeb9b53 x4 : 00e0010bbfeb9b53
Apr 28 15:17:30 [256898.042815] x3 : 0080000000000400 x2 : 00e0010bbfeb9fd3
Apr 28 15:17:30 [256898.042817] x1 : 00e0010bbfeb9bd3 x0 : 00000000001262c6
Apr 28 15:17:30 [256898.042820] Kernel panic - not syncing: Asynchronous SError Interrupt
Apr 28 15:17:30 [256898.042822] CPU: 81 PID: 24143 Comm: worker Tainted: G W 4.12.14-lp151.28.44-default #1 openSUSE Leap 15.1
Apr 28 15:17:30 [256898.042823] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
Apr 28 15:17:30 [256898.042824] Call trace:
Apr 28 15:17:30 [256898.042825] dump_backtrace+0x0/0x188
Apr 28 15:17:30 [256898.042826] show_stack+0x24/0x30
Apr 28 15:17:30 [256898.042827] dump_stack+0x90/0xb0
Apr 28 15:17:30 [256898.042828] panic+0x114/0x28c
Apr 28 15:17:30 [256898.042828] nmi_panic+0x7c/0x80
Apr 28 15:17:30 [256898.042829] arm64_serror_panic+0x80/0x90
Apr 28 15:17:30 [256898.042830] __pte_error+0x0/0x50
Apr 28 15:17:30 [256898.042831] el1_error+0x7c/0xdc
Apr 28 15:17:30 [256898.042832] ptep_set_access_flags+0xc0/0x100
Apr 28 15:17:30 [256898.042833] __handle_mm_fault+0x204/0x500
Apr 28 15:17:30 [256898.042834] handle_mm_fault+0xd4/0x178
Apr 28 15:17:30 [256898.042835] do_page_fault+0x1a0/0x448
Apr 28 15:17:30 [256898.042836] do_mem_abort+0x54/0xb0
Apr 28 15:17:30 [256898.042837] el0_da+0x24/0x28
Apr 28 15:17:30 [256899.126203] SError Interrupt on CPU6, code 0xbe000000 -- SError
Apr 28 15:17:30 [256899.126205] CPU: 6 PID: 19627 Comm: /usr/bin/isotov Tainted: G W 4.12.14-lp151.28.44-default #1 openSUSE Leap 15.1
Apr 28 15:17:30 [256899.126207] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
Apr 28 15:17:30 [256899.126208] task: ffff810cbfc24080 task.stack: ffff810cc9564000
Apr 28 15:17:30 [256899.126209] pstate: 60000005 (nZCv daif -PAN -UAO)
Apr 28 15:17:30 [256899.126210] pc : ptep_clear_flush+0x88/0xd8
Apr 28 15:17:30 [256899.126211] lr : wp_page_copy+0x298/0x6c0
Apr 28 15:17:30 [256899.126212] sp : ffff810cc9567be0
Apr 28 15:17:30 [256899.126213] x29: ffff810cc9567be0 x28: 0000aaaafc866000
Apr 28 15:17:30 [256899.126216] x27: ffff810cb93f8330 x26: ffff810cc1142c00
Apr 28 15:17:30 [256899.126218] x25: 0000aaaafc866000 x24: ffff810cc1142c00
Apr 28 15:17:30 [256899.126220] x23: 00e801048f9ccf53 x22: ffff810f56d2a708
Apr 28 15:17:30 [256899.126223] x21: ffff810f56d2a708 x20: 0000000aaaafc866
Apr 28 15:17:30 [256899.126225] x19: ffff810cb93f8330 x18: 0000000000000000
Apr 28 15:17:30 [256899.126228] x17: 0000aaaafc85f420 x16: 0000aaaafc8693c0
Apr 28 15:17:30 [256899.126230] x15: 4000000b00000001 x14: 0000aaaafc863880
Apr 28 15:17:30 [256899.126232] x13: 0000000000000032 x12: 0000110100000001
Apr 28 15:17:30 [256899.126235] x11: 0000aaaafc866fb8 x10: 0000aaaafc867fd0
Apr 28 15:17:30 [256899.126237] x9 : 1000440300000001 x8 : 0000aaaafc85f440
Apr 28 15:17:30 [256899.126239] x7 : 0000000000000029 x6 : 0000110100000001
Apr 28 15:17:30 [256899.126242] x5 : 0088000000000000 x4 : 00e0000e8ed2ff53
Apr 28 15:17:30 [256899.126244] x3 : 00e001048f9ccfd3 x2 : 0000000000000000
Apr 28 15:17:30 [256899.126247] x1 : 3804000aaaafc866 x0 : 00e0000e8ed2ffd3
Apr 28 15:17:30 [256899.126271] SMP: stopping secondary CPUs
Apr 28 15:17:30 [256899.126272] SMP: failed to stop secondary CPUs 1,6,48,81,84-85
Apr 28 15:17:30 [256899.126273] Kernel Offset: disabled
Apr 28 15:17:30 [256899.126274] CPU features: 0x01,04101128
Apr 28 15:17:30 [256899.126275] Memory Limit: none
Apr 28 15:17:30 [256899.126277] SError Interrupt on CPU85, code 0xbe000000 -- SError
Apr 28 15:17:30 [256899.126279] CPU: 85 PID: 23855 Comm: worker Tainted: G W 4.12.14-lp151.28.44-default #1 openSUSE Leap 15.1
Apr 28 15:17:30 [256899.126280] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
Apr 28 15:17:30 [256899.126281] task: ffff810cc9f0a180 task.stack: ffff810cc99b4000
Apr 28 15:17:30 [256899.126282] pstate: 60000005 (nZCv daif -PAN -UAO)
Apr 28 15:17:30 [256899.126283] pc : ptep_set_access_flags+0xc0/0x100
Apr 28 15:17:30 [256899.126284] lr : 0xe0000a0c000fd1
Apr 28 15:17:30 [256899.126285] sp : ffff810cc99b7c30
Apr 28 15:17:30 [256899.126286] x29: ffff810cc99b7c30 x28: ffff810cc9f0a180
Apr 28 15:17:30 [256899.126288] x27: ffff000008d32000 x26: 0000000000000008
Apr 28 15:17:30 [256899.126291] x25: ffff7e0000000000 x24: ffff810f6b56ced8
Apr 28 15:17:30 [256899.126293] x23: ffff00000933a000 x22: ffff810f6b56ced8
Apr 28 15:17:30 [256899.126295] x21: 5f3e000aaab2f400 x20: 00e8000a0c000f51
Apr 28 15:17:30 [256899.126298] x19: ffff810cdab62bd0 x18: 0000000000000000
Apr 28 15:17:30 [256899.126300] x17: 0000ffffafddac80 x16: 0000aaaae4c21020
Apr 28 15:17:30 [256899.126302] x15: 000089a652d558cc x14: 0000000000000000
Apr 28 15:17:30 [256899.126305] x13: 0d4b4f2030303220 x12: 312e312f50545448
Apr 28 15:17:30 [256899.126307] x11: 0000aaab233d1eb0 x10: 00000000c6cf1b21
Apr 28 15:17:30 [256899.126309] x9 : 0000000000000009 x8 : 0000aaab24a44d00
Apr 28 15:17:31 [256899.126311] x7 : 41203832202c6575 x6 : 54203a657461440a
Apr 28 15:17:31 [256899.126314] x5 : 0000000000100073 x4 : 00e0000a0c000f51
Apr 28 15:17:31 [256899.126316] x3 : 0088000000000480 x2 : 00e8000a0c000f51
Apr 28 15:17:31 [256899.126318] x1 : 00e0000a0c000fd1 x0 : 0000000000125f3e
Apr 28 15:17:31 [256899.725248] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt
Apr 28 16:18:35
```

## Reproducible

Happened multiple times but sporadically, every few days or weeks. On two other, very similar machines running 5.6.0 there are also random crashes, but this error report was never seen there.

## Further details

Internal issue: https://progress.opensuse.org/issues/41882

--
You are receiving this mail because:
You are the assignee for the bug.
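Since the panic output above interleaves reports from several CPUs, a small script can help compare crashes across occurrences. The following is only a minimal sketch, assuming the log is available as plain text in the format quoted in the report; the function name `summarize_serrors` and both regular expressions are illustrative and not part of any existing tool.

```python
import re

# Header of one SError report, e.g.
# "[256898.042772] SError Interrupt on CPU81, code 0xbe000000 -- SError"
SERROR_RE = re.compile(r"SError Interrupt on CPU(\d+), code (0x[0-9a-fA-F]+)")
# Program counter line of a register dump, e.g.
# "pc : ptep_set_access_flags+0xc0/0x100"
PC_RE = re.compile(r"\bpc : (\S+)")


def summarize_serrors(log_text):
    """Return one (cpu, code, pc_symbol) tuple per SError report in the log."""
    reports = []
    current = None
    for line in log_text.splitlines():
        header = SERROR_RE.search(line)
        if header:
            # Start a new report; the pc is filled in when its line is seen.
            current = [int(header.group(1)), header.group(2), None]
            reports.append(current)
            continue
        pc = PC_RE.search(line)
        if pc and current is not None and current[2] is None:
            current[2] = pc.group(1)
    return [tuple(r) for r in reports]


# Lines copied from the log in this report.
sample = """\
Apr 28 15:17:30 [256898.042772] SError Interrupt on CPU81, code 0xbe000000 -- SError
Apr 28 15:17:30 [256898.042780] pc : ptep_set_access_flags+0xc0/0x100
Apr 28 15:17:30 [256899.126203] SError Interrupt on CPU6, code 0xbe000000 -- SError
Apr 28 15:17:30 [256899.126210] pc : ptep_clear_flush+0x88/0xd8
"""

for cpu, code, pc in summarize_serrors(sample):
    print(cpu, code, pc)
```

The "Kernel panic - not syncing: Asynchronous SError Interrupt" line is deliberately not matched, so each report is counted once.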
http://bugzilla.opensuse.org/show_bug.cgi?id=1172997#c1

--- Comment #1 from Richard Fan <richard.fan@suse.com> ---
Hello Oliver,

I have added some core dump messages regarding the "isotovideo" process here. As we discussed in another mail thread, I can see many core dump messages in the journal log like the one below:

-----
Jun 16 21:29:14 openqaworker-arm-3 systemd-coredump[4258]: Process 95237 (/usr/bin/isotov) of user 481 dumped core.

Stack trace of thread 96054:
#0 0x0000aaaac704fb20 Perl_csighandler (/usr/bin/perl)
-----

I can see 82 core dump files generated during the past 3-4 days.

#pwd
/var/lib/systemd/coredump
#ls |grep isotov|wc
82 82 7704

I don't know whether the core dumps cause the CPU to get stuck or not. However, hopefully the messages can provide some help.

BR//Richard.
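The file count in the comment above could also be collected by a script, for example to track it over time. This is a rough sketch only; the helper name `count_coredumps` is made up for illustration, and it counts directory entries by substring match, roughly like `ls | grep isotov | wc -l` (unlike `ls`, it would also count hidden entries).

```python
from pathlib import Path


def count_coredumps(directory, needle):
    """Count entries in `directory` whose file name contains `needle`."""
    return sum(1 for entry in Path(directory).iterdir() if needle in entry.name)
```

On the worker, `count_coredumps("/var/lib/systemd/coredump", "isotov")` would be expected to return the same number as the shell pipeline, assuming the process can read that directory.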
http://bugzilla.opensuse.org/show_bug.cgi?id=1172997#c2

--- Comment #2 from Oliver Kurz <okurz@suse.com> ---
The crashes from isotovideo are also described in https://progress.opensuse.org/issues/53999. At least we know that not every isotovideo crash leads to a system crash, but that does not rule out the possibility that complete system crashes are still occasionally triggered by this.
http://bugzilla.opensuse.org/show_bug.cgi?id=1172997#c3

--- Comment #3 from Matthias Brugger <mbrugger@suse.com> ---
I think we need a memory dump through kdump to be able to understand what is happening. Can you provide one for the next crash you see?
Matthias Brugger <mbrugger@suse.com> changed:

           What    |Removed                         |Added
----------------------------------------------------------------------------
              Flags|                                |needinfo?(chester.lin@suse.com)

Matthias Brugger <mbrugger@suse.com> changed:

           What    |Removed                         |Added
----------------------------------------------------------------------------
              Flags|needinfo?(chester.lin@suse.com) |

Matthias Brugger <mbrugger@suse.com> changed:

           What    |Removed                         |Added
----------------------------------------------------------------------------
           Assignee|kernel-bugs@opensuse.org        |mbrugger@suse.com