[Bug 1163684] New: Kernel oops in s390x tumbleweed
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684

Bug ID: 1163684
Summary: Kernel oops in s390x tumbleweed
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: S/390-64
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: azouhr@opensuse.org
QA Contact: qa-bugs@suse.de
CC: mikef@suse.com, slindomansilla@suse.com
Found By: ---
Blocker: ---

For a while now, the build service has been hitting a certain type of kernel oops, which often goes away after a restartbuild of the affected package. Since the php7:test package seems to hit the issue more consistently, it seems a good candidate for reproducing it:

osc rbl openSUSE:Factory:zSystems php7:test standard s390x

PASS Leak 001: Incorrect 'if ();' optimization [ext/opcache/tests/leak_001.phpt]
[ 1471.342879] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1483s] [ 1471.343112] Failing address: 000003d280000000 TEID: 000003d280000803
[ 1483s] [ 1471.343246] Fault in home space mode while using kernel ASCE.
[ 1483s] [ 1471.343339] AS:0000000063acc007 R3:0000000000000024
[ 1483s] [ 1471.343495] Oops: 003b ilc:2 [#1] SMP
[ 1483s] [ 1471.343577] Modules linked in: sha256_s390 sha_common sd_mod nls_iso8859_1 nls_cp437 vfat fat virtio_rng rng_core virtio_blk xfs btrfs blake2b_generic xor raid6_pq libcrc32c reiserfs squashfs fuse dm_snapshot dm_bufio dm_crypt dm_mod binfmt_misc loop sg scsi_mod
[ 1483s] [ 1471.343946] CPU: 1 PID: 14557 Comm: php Not tainted 5.5.2-1-default #1 openSUSE Tumbleweed (unreleased)
[ 1483s] [ 1471.344053] Hardware name: IBM 2827 H43 400 (KVM/Linux)
[ 1483s] [ 1471.344098] Krnl PSW : 0704c00180000000 0000000062a9d2ec (page_table_free_rcu+0x7c/0x150)
[ 1483s] [ 1471.344188] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 1483s] [ 1471.344256] Krnl GPRS: 00d3df7e00000004 000003d280000034 00000001334dc8d0 0000000000000002
[ 1483s] [ 1471.344325]            0000040001500000 0000000111000000 0000000000000000 0000000062ca82b8
[ 1483s] [ 1471.344393]            000003e0061d7c18 00000001334dc600 0000008000000000 000003d280000000
[ 1483s] [ 1471.344471]            0000000134d13e00 0000000063352d30 0000000062a9d2c6 000003e0061d79e8
[ 1483s] [ 1471.344586] Krnl Code: 0000000062a9d2e0: a7580011 lhi %r5,17
[ 1483s] [ 1471.344586]            0000000062a9d2e4: 4110b034 la %r1,52(%r11)
[ 1483s] [ 1471.344586]           #0000000062a9d2e8: 89506018 sll %r5,24(%r6)
[ 1483s] [ 1471.344586]           >0000000062a9d2ec: 58201000 l %r2,0(%r1)
[ 1483s] [ 1471.344586]            0000000062a9d2f0: b9f72035 xrk %r3,%r5,%r2
[ 1483s] [ 1471.344586]            0000000062a9d2f4: 1842 lr %r4,%r2
[ 1483s] [ 1471.344586]            0000000062a9d2f6: ba431000 cs %r4,%r3,0(%r1)
[ 1483s] [ 1471.344586]            0000000062a9d2fa: ec24fff96076 crj %r2,%r4,6,0000000062a9d2ec
[ 1483s] [ 1471.345092] Call Trace:
[ 1483s] [ 1471.345127] [<0000000062a9d2ec>] page_table_free_rcu+0x7c/0x150
[ 1483s] [ 1471.345231] [<0000000062ca82b8>] free_pgd_range+0x2d8/0x680
[ 1483s] [ 1471.345304] [<0000000062ca86de>] free_pgtables+0x7e/0x140
[ 1483s] [ 1471.345359] [<0000000062cb2c7e>] unmap_region+0xde/0x120
[ 1483s] [ 1471.345402] [<0000000062cb73c2>] mmap_region+0x662/0x700
[ 1483s] [ 1471.345457] [<0000000062cb776e>] do_mmap+0x30e/0x4d0
[ 1483s] [ 1471.345503] [<0000000062c8bcd0>] vm_mmap_pgoff+0xc0/0x120
[ 1483s] [ 1471.345557] [<0000000062cb46f4>] ksys_mmap_pgoff+0x124/0x270
[ 1483s] [ 1471.345612] [<0000000062cb49e2>] __s390x_sys_old_mmap+0x72/0xa0
[ 1483s] [ 1471.345669] [<00000000633365f4>] system_call+0xd8/0x2c8
[ 1483s] [ 1471.345720] Last Breaking-Event-Address:
[ 1483s] [ 1471.345756] [<0000000062abea52>] __local_bh_disable_ip+0x52/0x60
[ 1483s] [ 1471.345826] Kernel panic - not syncing: Fatal exception in interrupt
[ 1484s] ### VM INTERACTION END ###
[ 1484s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[ 1484s] or the build host has a kernel or hardware problem...

gave up after 9 failed build attempts...

Building the package on a SLES12 kernel worked without problems. The last time the package php7:test built successfully was:

2020-02-04 10:24:13 669482aecc8852f512a26aa6d813bd41 7.4.2-1.4 72 3362

Since php7:test is quite late in the build tree, I guess that this problem was introduced with kernel 5.5.1.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
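For readers who want to compare traces like the one above across several logs, here is a minimal Python sketch that pulls the call-trace frames out of oops text. It is not part of the original report. The function name `parse_frames` and the regular expression are illustrative assumptions, and the sample input is two lines copied from the trace above.

```python
import re

# Matches call-trace entries such as
#   [<0000000062a9d2ec>] page_table_free_rcu+0x7c/0x150
# The pattern is an assumption based on the trace in this report only.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{16})>\]\s+([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(log_text):
    """Return (address, symbol, offset, size) tuples in order of appearance."""
    return [
        (int(addr, 16), symbol, int(offset, 16), int(size, 16))
        for addr, symbol, offset, size in FRAME_RE.findall(log_text)
    ]


# Two lines taken from the oops above.
sample = (
    "[ 1483s] [ 1471.345127] [<0000000062a9d2ec>] page_table_free_rcu+0x7c/0x150\n"
    "[ 1483s] [ 1471.345231] [<0000000062ca82b8>] free_pgd_range+0x2d8/0x680\n"
)

frames = parse_frames(sample)
for addr, symbol, offset, size in frames:
    # The symbol start is the frame address minus the offset into the symbol.
    print(f"{symbol}: frame at {addr:#x}, symbol starts at {addr - offset:#x}")
```

Comparing the ordered list of symbols from two logs is usually enough to tell whether two oopses took the same path.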
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c1

--- Comment #1 from Berthold Gunreben <azouhr@opensuse.org> ---
Created attachment 831930
  --> http://bugzilla.opensuse.org/attachment.cgi?id=831930&action=edit
tar archive of buildlogs with hanging build jobs

Let me give a small update. I collected almost 50 build logs that contain some sort of kernel oops, and I'll attach them to this bug. This is just to demonstrate the bad condition of the current Tumbleweed kernel. Please tell me if I can help to improve the situation.
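As a rough aid for triaging a batch of build logs like the attached ones, a small Python sketch can count how many logs contain each oops signature. The signature strings are copied from the oops in the original report. The function names and the sample log names and contents are hypothetical, and the contents of the attached archive are not reproduced here.

```python
from collections import Counter

# Signature strings taken verbatim from the oops in the original report.
SIGNATURES = (
    "Unable to handle kernel pointer dereference",
    "Oops:",
    "Kernel panic - not syncing",
)


def classify_log(text):
    """Return the set of signatures that occur anywhere in one build log."""
    return {sig for sig in SIGNATURES if sig in text}


def summarize(logs):
    """Count, per signature, how many logs contain it.

    `logs` maps a log name to its full text.
    """
    counts = Counter()
    for text in logs.values():
        counts.update(classify_log(text))
    return counts


# Hypothetical sample input; real logs would be read from the tar archive.
logs = {
    "php7-test.log": (
        "[ 1471.342879] Unable to handle kernel pointer dereference in virtual kernel address space\n"
        "[ 1483s] [ 1471.343495] Oops: 003b ilc:2 [#1] SMP\n"
        "[ 1483s] [ 1471.345826] Kernel panic - not syncing: Fatal exception in interrupt\n"
    ),
    "clean.log": "PASS Leak 001: Incorrect 'if ();' optimization\n",
}

counts = summarize(logs)
for signature in SIGNATURES:
    print(f"{counts[signature]:3d}  {signature}")
```

Plain substring matching is deliberately crude. It only shows which logs deserve a closer look and does not group oopses by cause.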
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684

Berthold Gunreben <azouhr@opensuse.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ro@suse.de

http://bugzilla.opensuse.org/show_bug.cgi?id=1163684

Berthold Gunreben <azouhr@opensuse.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ihno@suse.com

http://bugzilla.opensuse.org/show_bug.cgi?id=1163684

Berthold Gunreben <azouhr@opensuse.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hannsj_uhl@de.ibm.com,
                   |                            |tstaudt@de.ibm.com

http://bugzilla.opensuse.org/show_bug.cgi?id=1163684

Sergio Lindo Mansilla <slindomansilla@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sergiolindo.empresa@gmail.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c10

--- Comment #10 from Berthold Gunreben <azouhr@opensuse.org> ---
(In reply to LTC BugProxy from comment #9)
> So, could you try to narrow this down on THP impact by disabling THP for the KVM guests, e.g. by adding the "transparent_hugepage=never" kernel parameter for the guests? The panics do always happen in the guests, not the host, right?

Sorry that I did not notice the comment earlier. Rudi updated the obs-worker yesterday and added transparent_hugepage=never to the guest kernel command line. The tentative result is that no hangs have occurred since this parameter was set. I will update here if the situation changes.
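To check whether a guest was really booted with this parameter, one can look at /proc/cmdline and at /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is shown in square brackets. The Python sketch below only parses strings in those two formats. The function names and the sample command line are hypothetical, and taking the last occurrence of the parameter is an assumption of this sketch.

```python
def thp_from_cmdline(cmdline):
    """Return the value of transparent_hugepage= on a kernel command line.

    Returns None if the parameter is absent. If it appears more than once,
    the last occurrence is returned (an assumption of this sketch).
    """
    value = None
    for token in cmdline.split():
        if token.startswith("transparent_hugepage="):
            value = token.split("=", 1)[1]
    return value


def active_thp_mode(sysfs_text):
    """Return the bracketed entry from text such as 'always madvise [never]'."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


# Hypothetical guest command line; on a live system read /proc/cmdline instead.
sample_cmdline = "root=/dev/vda1 transparent_hugepage=never quiet"
# Format of /sys/kernel/mm/transparent_hugepage/enabled with THP disabled.
sample_sysfs = "always madvise [never]\n"

print("cmdline setting:", thp_from_cmdline(sample_cmdline))
print("active mode:", active_thp_mode(sample_sysfs))
```

If the two values disagree, something changed the setting after boot, which matters when comparing workers with and without the workaround.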
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c11

--- Comment #11 from Berthold Gunreben <azouhr@opensuse.org> ---
Just let me confirm again: since transparent_hugepage=never has been active, no oops has happened. It looks like there is some issue in that area. If you have more ideas on how to debug this further, please tell me.

BTW: I am not very familiar with other architectures, but from what I have seen so far, these issues did not happen on other architectures (all of which seem to be little endian, though).
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c15

Berthold Gunreben <azouhr@opensuse.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(ro@suse.de)       |

--- Comment #15 from Berthold Gunreben <azouhr@opensuse.org> ---
(In reply to Miroslav Beneš from comment #13)
> Berthold, is the issue still happening? Gerald mentioned a known issue with 2G hugepages and 5.5.2 kernel. TW is now on 5.7.x, so it would be nice to know if there is some development. If the issue is still present, could you provide information Gerald asked for, please?
>
> Also CCing Vlastimil for the sake of completeness.

So, we removed transparent_hugepage=never and it didn't take long until we had hanging jobs again. I am attaching a log from gettext-runtime-mini. It also seems to affect build speed: a number of jobs still seem to run normally although they show more than 600% build time. We are reverting to transparent_hugepage=never in order to keep the build system functional.
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c16

--- Comment #16 from Berthold Gunreben <azouhr@opensuse.org> ---
Created attachment 840244
  --> http://bugzilla.opensuse.org/attachment.cgi?id=840244&action=edit
buildlog with hanging build job from gettext-runtime-mini

http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c17

--- Comment #17 from Berthold Gunreben <azouhr@opensuse.org> ---
Created attachment 840253
  --> http://bugzilla.opensuse.org/attachment.cgi?id=840253&action=edit
another hanging worker with gettext-runtime
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c21

Berthold Gunreben <azouhr@opensuse.org> changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
              Flags|needinfo?(azouhr@opensuse.org) |needinfo?(ro@suse.de)

--- Comment #21 from Berthold Gunreben <azouhr@opensuse.org> ---
(In reply to Jiri Slaby from comment #20)
> 5.9.11 contains the commit. Kernel:stable is currently building it. Submitted to factory: https://build.opensuse.org/request/show/850892

Sorry, I am not able to change the kernel myself. Rudi, could you please give the latest kernel a try on OBS? Please remove transparent_hugepage=never from the command line for the test.
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c22

--- Comment #22 from Ruediger Oertel <ro@suse.com> ---
Sure, as soon as 5.9.11 has been built. At the moment I'm seeing 5.9.10 binaries in openSUSE:Factory:zSystems kernel-default.
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c23

Sarah Julia Kriesch <sarah.kriesch@ibm.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sarah.kriesch@ibm.com

--- Comment #23 from Sarah Julia Kriesch <sarah.kriesch@ibm.com> ---
@Ruediger Can you verify this bug fix?
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c24

Berthold Gunreben <azouhr@opensuse.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(ro@suse.de)       |

--- Comment #24 from Berthold Gunreben <azouhr@opensuse.org> ---
(In reply to Sarah Julia Kriesch from comment #23)
> @Ruediger Can you verify this bug fix?

Rudi removed the transparent_hugepage=never parameter yesterday. So far, I have not seen any more builds with that issue. From my point of view, this issue seems to be fixed.
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c25

--- Comment #25 from Sarah Julia Kriesch <sarah.kriesch@ibm.com> ---
Thank you for this nice pre-Christmas present for everyone involved! :)
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c26

--- Comment #26 from Ruediger Oertel <ro@suse.com> ---
https://github.com/openSUSE/obs-build/pull/642 merged to finalize dropping the parameter
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c27

Sarah Kriesch <ada.lovelace@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ada.lovelace@gmx.de

--- Comment #27 from Sarah Kriesch <ada.lovelace@gmx.de> ---
We got a kernel error message in openQA after dropping the parameter "transparent_hugepage=never", and the system can no longer boot:

[ 0.210957] ima: No TPM chip found, activating TPM-bypass!
[ 0.210960] ima: Allocated hash algorithm: sha256
[ 0.210966] ima: No architecture policies found
[ 0.210972] evm: Initialising EVM extended attributes:
[ 0.210973] evm: security.selinux
[ 0.210974] evm: security.apparmor
[ 0.210975] evm: security.ima
[ 0.210976] evm: security.capability
[ 0.210977] evm: HMAC attrs: 0x1
[ 0.211243] VFS: Cannot open root device "(null)" or unknown-block(1,0): error -6
[ 0.211245] Please append a correct "root=" boot option; here are the available partitions:
[ 0.211246] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(1,0)
[ 0.211249] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.10.1-1-default #1 openSUSE Tumbleweed
[ 0.211250] Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
[ 0.211251] Call Trace:
[ 0.211254] [<000000003252e15c>] show_stack+0x8c/0xd8
[ 0.211256] [<0000000032533550>] dump_stack+0x90/0xc0
[ 0.211258] [<000000003252eae2>] panic+0x112/0x308
[ 0.211261] [<0000000032a39b98>] mount_block_root+0x2e0/0x368
[ 0.211263] [<0000000032a39e0a>] prepare_namespace+0x162/0x198
[ 0.211265] [<0000000032a3965a>] kernel_init_freeable+0x2c2/0x2d0
[ 0.211267] [<0000000032536692>] kernel_init+0x22/0x150
[ 0.211269] [<0000000032546b20>] ret_from_fork+0x28/0x2c
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 31977BCE
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c28

--- Comment #28 from Sarah Kriesch <ada.lovelace@gmx.de> ---
openQA reference: https://openqa.opensuse.org/tests/1529101#

@Gerald: Should we continue in this bug or should we create a new one?
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684
http://bugzilla.opensuse.org/show_bug.cgi?id=1163684#c30

--- Comment #30 from Sarah Julia Kriesch <sarah.kriesch@ibm.com> ---
Hi Petr,

Thank you for your feedback! I saw that this bug is s390x specific, because no other architecture shows this kernel panic at the moment. The advantage of this bug report is that it is open and mirrored to IBM, so we can communicate directly this way. In comparison to a new bugreport, this one needs some time. Gerald, who fixed this bug, wanted to wait before closing it. He would accept reports of after-effects in this bug. I know that this is not best practice at openSUSE, but it is an efficient way to communicate directly.

I have created a new bug report, because that issue is unrelated: bsc1180381