Bug ID: 1180917
Summary: kernel BUG at mm/huge_memory.c:2144!
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: S/390-64
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: firstname.lastname@example.org
Reporter: email@example.com
QA Contact: firstname.lastname@example.org
CC: email@example.com, firstname.lastname@example.org, email@example.com
Found By: ---
Blocker: ---
Recently, a number of packages have failed to build and instead hang in the cc1plus command. In a recent build, I found a kernel trace on a hanging worker:
[ 608s] [ 596.875979] kernel BUG at mm/huge_memory.c:2144!
[ 608s] [ 596.876290] monitor event: 0040 ilc:2 [#1] SMP
[ 608s] [ 596.876374] Modules linked in: sha256_s390 sha_common overlay sd_mod t10_pi nls_iso8859_1 nls_cp437 vfat fat virtio_rng rng_core virtio_blk xfs btrfs blake2b_generic xor raid6_pq libcrc32c crc32_vx_s390 reiserfs squashfs fuse dm_snapshot dm_bufio dm_crypt dm_mod binfmt_misc loop sg scsi_mod
[ 608s] [ 596.876666] CPU: 1 PID: 2660 Comm: cc1plus Not tainted 5.10.5-1-default #1 openSUSE Tumbleweed
[ 608s] [ 596.876750] Hardware name: IBM 2964 N63 400 (KVM/Linux)
[ 608s] [ 596.876797] Krnl PSW : 0704e00180000000 00000000e33fb9ea (__split_huge_pmd+0x62a/0xc30)
[ 608s] [ 596.877158] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 608s] [ 596.877328] Krnl GPRS: 0000000000000000 00000000b2b40215 0000000003c91000 fffffffffffff800
[ 608s] [ 596.877408] 0000000081691a00 00000000f2440237 00000000000000c0 0000000000000000
[ 608s] [ 596.877493] 0000000081691800 000003d083c91030 0000000086d81d48 000003ff71a40000
[ 608s] [ 596.877567] 0000000086c1c000 00000000e47c4268 00000000e33fb748 000003e003ef3a80
[ 608s] [ 596.877652] Krnl Code: 00000000e33fb9de: a71f0400 cghi %r1,1024
[ 608s] [ 596.877652] 00000000e33fb9e2: a784ff18 brc 8,00000000e33fb812
[ 608s] [ 596.877652] #00000000e33fb9e6: af000000 mc 0,0
[ 608s] [ 596.877652] >00000000e33fb9ea: a55b0602 oill %r5,1538
[ 608s] [ 596.877652] 00000000e33fb9ee: a7f4ff06 brc 15,00000000e33fb7fa
[ 608s] [ 596.877652] 00000000e33fb9f2: e32010080004 lg %r2,8(%r1)
[ 608s] [ 596.877652] 00000000e33fb9f8: a7210001 tmll %r2,1
[ 608s] [ 596.877652] 00000000e33fb9fc: a77401f1 brc 7,00000000e33fbdde
[ 608s] [ 596.878162] Call Trace:
[ 608s] [ 596.878190] [<00000000e33fb9ea>] __split_huge_pmd+0x62a/0xc30
[ 608s] [ 596.878257] ([<00000000e33fb6ca>] __split_huge_pmd+0x30a/0xc30)
[ 608s] [ 596.878327] [<00000000e3379116>] zap_p4d_range+0x246/0xbb0
[ 608s] [ 596.878396] [<00000000e33808f6>] zap_page_range+0x1a6/0x2e0
[ 608s] [ 596.878458] [<00000000e33b2e14>] do_madvise.part.0+0x844/0xc70
[ 608s] [ 596.879153] [<00000000e33b32a8>] __s390x_sys_madvise+0x68/0x80
[ 608s] [ 596.879247] [<00000000e3c676bc>] system_call+0xe0/0x2ac
[ 608s] [ 596.879347] Last Breaking-Event-Address:
[ 608s] [ 596.879390] [<00000000e33fb9b2>] __split_huge_pmd+0x5f2/0xc30
[ 608s] [ 596.879465] ---[ end trace 50ad5147a244f7d2 ]---
I don't know whether this is related to boo#1163684.
--- Comment #1 from Berthold Gunreben firstname.lastname@example.org --- Created attachment 845109 --> http://bugzilla.opensuse.org/attachment.cgi?id=845109&action=edit full build log
--- Comment #8 from Berthold Gunreben email@example.com --- (In reply to LTC BugProxy from comment #7)
That's a whole lot of questions, some of which need more explanation.
------- Comment From firstname.lastname@example.org 2021-01-15 13:02 EDT------- (In reply to comment #9)
We are using kernel 5.10.7 in the latest Tumbleweed version. The latest iso image is available under: https://download.opensuse.org/ports/zsystems/tumbleweed/iso/
Hmm, it says "5.10.5-1-default" in the kernel BUG output. In order to match the given line 2144 from "mm/huge_memory.c:2144" and to find the corresponding kernel code, a matching kernel source would be needed.
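To compare the version in the trace against the version of a candidate source tree, the relevant fields can be pulled out of the log mechanically. The following is only a minimal sketch: the regular expressions are written against the lines quoted in this report and are an assumption about the log layout, not a general oops parser, and the helper name `summarize` is hypothetical.

```python
import re

# A few lines copied from the trace in this report.
TRACE = """\
[ 608s] [ 596.875979] kernel BUG at mm/huge_memory.c:2144!
[ 608s] [ 596.876666] CPU: 1 PID: 2660 Comm: cc1plus Not tainted 5.10.5-1-default #1 openSUSE Tumbleweed
[ 608s] [ 596.878190] [<00000000e33fb9ea>] __split_huge_pmd+0x62a/0xc30
[ 608s] [ 596.878327] [<00000000e3379116>] zap_p4d_range+0x246/0xbb0
"""

# Patterns are assumptions based on the lines above.
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
CPU_RE = re.compile(r"Comm: (?P<comm>\S+) .*? (?P<release>\S+) #\d+")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (?P<sym>[\w.]+)\+(?P<off>0x[0-9a-f]+)/0x[0-9a-f]+")


def summarize(trace):
    """Return the BUG location, kernel release and call frames of a trace."""
    bug = BUG_RE.search(trace)
    cpu = CPU_RE.search(trace)
    frames = [(m.group("sym"), int(m.group("off"), 16))
              for m in FRAME_RE.finditer(trace)]
    return {
        "file": bug.group("file") if bug else None,
        "line": int(bug.group("line")) if bug else None,
        "release": cpu.group("release") if cpu else None,
        "frames": frames,
    }


if __name__ == "__main__":
    print(summarize(TRACE))
```

The release string extracted this way is the one that has to match the kernel-source revision before line 2144 can be looked up.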
The kernel used for the builds is a special case. It originates from Tumbleweed, but the build systems allow the kernel to be substituted with special versions, so it is not automatically updated to the latest version. The version string that you see tells the truth.
Is there any other means of kernel source access for openSUSE Tumbleweed, ideally a git repo like for SLES? Seems hard to believe that "open"SUSE kernel source is harder to find / access than SLES code...
It is not hard to find at all. All you need to know is that the different kernel flavors all depend on a central package called kernel-source, which has its own mechanism for integrating patches depending on a variety of conditions. The source can be found in the package http://download.opensuse.org/ports/zsystems/tumbleweed/repo/oss/noarch/kerne...
I downloaded this in case it gets overwritten and is no longer that easily available. Note that one can always rebuild older versions of a package, because OBS does not throw away sources.
Anyway, SUSE developers surely have such access, and since this is a BUG statement in common memory management code anyway, I would suggest letting one of the corresponding SUSE developers have a look first.
That is why the assignee is the openSUSE Kernel Developers.
BTW, some information that might help is the fact(?) that THP worked fine on s390 with Tumbleweed, at least for some very short time, when verifying the other THP fix in LTC bug#184202 / SUSE bug#1163684. IIUC, then it was verified there with 5.9.11, but that is not 100% clear to me from the other bugzilla.
Now that is an interesting question as well. We were never able to reproduce the behavior reliably; it is more of a statistical observation. My feeling is that the kernel at least worked for some time.
One other slightly strange thing is that now only one process leads to issues, namely cc1plus. On the other hand, the compile process is one of the biggest processes to be found (from a memory perspective). Often enough, simply restarting the build makes it succeed.
Please verify on which kernel version it last worked fine. Then, with access to a proper source repo (and not just an ISO), one might be able to see what was changed in between with regard to THP, and maybe madvise.
The changes can be found in the changelog of the rpm. This is the reliable source for knowing what has changed and when. The changelog can be read with rpm (rpm -q --changelog ...) and is also kept next to the spec file in a file with a changes extension.
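To narrow the changelog down to entries that might touch THP or madvise, the output can be filtered by keyword. This is a rough sketch only: the keyword list and the sample text are made up for illustration, the package name passed to `read_changelog` is up to the caller, and the splitting assumes the usual rpm layout in which each entry starts with a line beginning with "* ".

```python
import subprocess

# Hypothetical keyword list; adjust to whatever is being hunted for.
KEYWORDS = ("thp", "huge", "madvise")

# Made-up sample in the usual rpm changelog layout, for illustration only.
SAMPLE = """\
* Mon Jan 11 2021 someone@example.org
- Update to a newer stable release
* Fri Dec 04 2020 someone@example.org
- mm: example fix in the huge page code
"""


def read_changelog(package):
    """Return the output of `rpm -q --changelog <package>`."""
    result = subprocess.run(["rpm", "-q", "--changelog", package],
                            capture_output=True, text=True, check=True)
    return result.stdout


def split_entries(text):
    """Split changelog text into entries, each starting with a '* ' line."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("* ") and current:
            entries.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries


def matching_entries(text, keywords=KEYWORDS):
    """Return the entries that mention any keyword (case-insensitive)."""
    return [entry for entry in split_entries(text)
            if any(word in entry.lower() for word in keywords)]


if __name__ == "__main__":
    for entry in matching_entries(SAMPLE):
        print(entry)
```

On a real system, `matching_entries(read_changelog(name))` would be called with the name of the installed kernel package; a keyword match is only a hint about where to look, not proof that an entry is related.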
With regard to the sources, you can get the sources for the kernel-source package with the command:
osc co -r 77cf39676446e7f7aa15ea53ef337b64 openSUSE:Factory:zSystems kernel-source
The config is found in the file config.tar.bz2 within. The definition of what patch is applied in what case is found in the series.conf file.
Would it be helpful to temporarily add some extra kernel parameter for testing? With boo#1163684 it was very helpful to have a reliable build environment. I know that this kind of issue is hard to debug and hard to find. However, I also believe that it is vital to find it before it hits customers with enterprise distributions. This case hits even less often than boo#1163684 but that does not help those who are hit.
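If a test run with a different THP setting is tried, it helps to record which mode was actually active on the worker when a build hung. The sketch below reads the standard sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, where the active mode is shown in square brackets. Whether booting with `transparent_hugepage=never` is the right experiment for this bug is an assumption, not something established in this report, and the function names are hypothetical.

```python
from pathlib import Path

# Standard sysfs location of the system-wide THP mode.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_mode(text):
    """Return the bracketed value from text such as 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_thp_mode(path=THP_ENABLED):
    """Return the active THP mode, or None if the file cannot be read."""
    try:
        return active_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

Logging this value at the start of each build would make it possible to correlate hangs with the THP mode afterwards.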
Sarah Kriesch email@example.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
CC| |firstname.lastname@example.org
Resolution|--- |INVALID
--- Comment #17 from Sarah Kriesch email@example.com --- I am closing this bug report, since the issue has not been reproducible for such a long time. I will create a new one if it happens again.