[Bug 1206359] New: kernel 6.1: random build failures: "error: <random package>: install failed"
https://bugzilla.suse.com/show_bug.cgi?id=1206359 Bug ID: 1206359 Summary: kernel 6.1: random build failures: "error: <random package>: install failed" Classification: openSUSE Product: openSUSE Tumbleweed Version: Current Hardware: Other OS: Other Status: NEW Severity: Normal Priority: P5 - None Component: Kernel Assignee: kernel-bugs@opensuse.org Reporter: jslaby@suse.com QA Contact: qa-bugs@suse.de CC: dimstar@opensuse.org, jack@suse.com, rgoldwyn@suse.com Found By: --- Blocker: --- kernel 6.1 is submitted to openSUSE:Factory:Staging:J. There, a lot of packages randomly fail. It happens usually at the installation phase and looks like (build rpm):
[ 12s] libdb-4_8-4.8.30-41.81 ######################################## [ 12s] error: libdb-4_8-4.8.30-41.81.x86_64: install failed [ 12s] perl-5.36.0-5.47 ######################################## [ 13s] liblsan0-12.2.1+git537-2.20 ########################################
Note that the failure is intermittent. Only some package installations fail (here libdb only). Note this can happen later in the loop (samba rpm):
error: samba-libs-4.17.3+git.283.2157972742b-3.2.x86_64: install failed ... [ 2288s] failed to install rpms, aborting build ... [ 2288s] [0;1;38;5;185mFailed to write wtmp record, ignoring: Read-only file system[0m [ 2288s] [0;1;38;5;185mFailed to write utmp record: Read-only file system[0m
I'm currently out of ideas. It looks like the VM image becomes RO and then RW randomly. The image is ext4. I cannot reproduce locally with --kvm-type=kvm. Any hint anyone? -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c1
Jiri Slaby
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c2
Dominique Leuenberger
And yet, it looks like host-specific. lamb and old-cirrus machines seem to be affected only. I don't know what's their configuration though. Maybe Dominique can shed some light?
Not really - except if you can do something out of the information provided by
osc workerinfo x86_64:lamb02:1
I'm sure Adrian is better suited to answer the original question:
And yet, it looks like host-specific. lamb and old-cirrus machines seem to be > affected only. I don't know what's their configuration though. Maybe Dominique > can shed some light?
-- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c3
--- Comment #3 from Jan Kara
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c4
Marcus R�ckert
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c5
--- Comment #5 from Marcus R�ckert
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c6
Jiri Slaby
so from the logs it looks like on the newer kernel rpm sometimes fails to launch the processes for the post scriptlets. sadly without any useful error.
If post hooks failed, that would dump: Plugin %s: hook psm_post failed But that didn't happen. So post hooks likely finished w/o error. It's something done after post hooks, right? Michael? -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c7
--- Comment #7 from Jiri Slaby
(In reply to Marcus R�ckert from comment #5)
so from the logs it looks like on the newer kernel rpm sometimes fails to launch the processes for the post scriptlets. sadly without any useful error.
If post hooks failed, that would dump: Plugin %s: hook psm_post failed
But that didn't happen. So post hooks likely finished w/o error. It's something done after post hooks, right? Michael?
Or even before post hooks. Post hooks are run despite errors and nor affect the return status from rpmpsmRun(). The question is why rpm does not note anything about the failure in the debug log at all? -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c8
--- Comment #8 from Jiri Slaby
BTW, can we get kernel logs from the build in OBS if there's anything interesting?
Adding to this, if we run debug runs, can we increase the kernel debug level? Currently, /usr/lib/build/build-vm does: echo 4 > /proc/sysrq-trigger so "echo 8" might show us more. -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c9
--- Comment #9 from Marcus R�ckert
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c10
--- Comment #10 from Dominique Leuenberger
so from the logs it looks like on the newer kernel rpm sometimes fails to launch the processes for the post scriptlets. sadly without any useful error.
From a different build: https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:J... [ 18s] python-rpm-macros-20221117.8687578-2.1######################################## [ 18s] error: python-rpm-macros-20221117.8687578-2.10.noarch: install failed python-rpm-macros does not even have scriptlets; so I think we're following a red herring with the scriptlets -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c11
--- Comment #11 from Jiri Slaby
so from the logs it looks like on the newer kernel rpm sometimes fails to launch the processes for the post scriptlets. sadly without any useful error.
Ah. So this (AFAICT) means: dbAdd() failed, because: * a lot of "adding X entries to XX index" happens inside dbAdd(). * the following runFileTriggers() and runImmedFileTriggers() would emit an error in case of error. * the following runInstScript(psm, RPMTAG_POSTIN) was not run for libdb and it should as in libasan8 above (both have "postinstall program: /sbin/ldconfig"). * then only psm_post is run regardless of previous errors. So it might be interesting to see which of these invocations in rpmdbAdd() fails (increases ret): for (int dbix = 0; dbix < db->db_ndbi; dbix++) { ... ret += idxdbPut(dbi, rpmtag, hdrNum, h); -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c12
--- Comment #12 from Jiri Slaby
ret += idxdbPut(dbi, rpmtag, hdrNum, h);
One things where the above returns RPMRC_FAIL is when this fails: flock(pkgdb->fd, LOCK_EX); Let me check if there are any changes in file locking in between 6.0 and 6.1. -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c13
--- Comment #13 from Jiri Slaby
Let me check if there are any changes in file locking in between 6.0 and 6.1.
And there are none. Can anyone test with this patched rpm: https://build.opensuse.org/package/show/home:jirislaby:debug61/rpm ? There is a patch with many more rpmlog(RPMLOG_DEBUG, ...) to narrow what and how actually fails. -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c14
--- Comment #14 from Jiri Slaby
Can anyone test with this patched rpm: https://build.opensuse.org/package/show/home:jirislaby:debug61/rpm ? There is a patch with many more rpmlog(RPMLOG_DEBUG, ...) to narrow what and how actually fails.
I hit it once in home:jirislaby:debug61:rebuild and it points to rpmidxPutInternal()'s: addnewkey(idxdb, key, keyl, &keyoff) And that does pwrite(), mmap() and mremap() internally. I respinned the debug patch to narrow it further down. -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c15
Jiri Slaby
And that does pwrite(), mmap() and mremap() internally.
And the winner is: [ 30s] error: remapslot (223): mapmem(old=0xffffffffffffffff, osz=1056768, nsz=1060864, fl=3, fd=4, off=868352) failed 12 i.e. mremap fails with ENOMEM. Vlasta, any ideas? -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c16
--- Comment #16 from Vlastimil Babka
(In reply to Jiri Slaby from comment #14)
And that does pwrite(), mmap() and mremap() internally.
And the winner is: [ 30s] error: remapslot (223): mapmem(old=0xffffffffffffffff, osz=1056768, nsz=1060864, fl=3, fd=4, off=868352) failed 12
How exactly is that log related to mremap() syscall and its parameters? Can I assume the address and old/new sizes are 1:1 parameters? if 'fl=3' are flags of mremap() then going by tools/include/uapi/linux/mman.h it would be MREMAP_MAYMOVE | MREMAP_FIXED. With MREMAP_FIXED there should be also fifth parameter saying the new address, which I don't see here. Instead the "fd=4, off=868352" are not mremap() parameters.
i.e. mremap fails with ENOMEM. Vlasta, any ideas?
At least old=0xffffffffffffffff looks quite wrong, but it should result in EINVAL rather than ENOMEM. In general 6.1 includes the maple tree, so that would be my first suspect to look, but we should make sure what the mremap() parameters exactly are and ideally get a /proc/pid/smaps snapshot upon the failure. -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c17
--- Comment #17 from Jiri Slaby
(In reply to Jiri Slaby from comment #15)
(In reply to Jiri Slaby from comment #14)
And that does pwrite(), mmap() and mremap() internally.
And the winner is: [ 30s] error: remapslot (223): mapmem(old=0xffffffffffffffff, osz=1056768, nsz=1060864, fl=3, fd=4, off=868352) failed 12
How exactly is that log related to mremap() syscall and its parameters? Can I assume the address and old/new sizes are 1:1 parameters?
No, sorry for being misleading. The source is: https://github.com/rpm-software-management/rpm/blob/master/lib/backend/ndb/r... so it's in fact: mremap(MAP_FAILED, 1056768, 1060864, MREMAP_MAYMOVE);
if 'fl=3' are flags of mremap() then going by
Those fl, fd and off are used as prot, fd and off resp. for mmap() -- i.e. not in this case.
At least old=0xffffffffffffffff looks quite wrong, but it should result in EINVAL rather than ENOMEM.
Ugh, right. So apparently some previous mmap() failed, i.e. returned -1 == MAP_FAILED.
In general 6.1 includes the maple tree, so that would be my first suspect to look, but we should make sure what the mremap() parameters exactly are and ideally get a /proc/pid/smaps snapshot upon the failure.
Let me catch also the *previous* mmap() giving MAP_FAILED which is put to oldaddr. -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c18
--- Comment #18 from Jiri Slaby
In general 6.1 includes the maple tree, so that would be my first suspect to look, but we should make sure what the mremap() parameters exactly are and ideally get a /proc/pid/smaps snapshot upon the failure.
Let me catch also the *previous* mmap() giving MAP_FAILED which is put to oldaddr.
smaps dump attached. I don't know where this "-1 as first argument" is from. slot->mapped (the old addr) is never assigned MAP_FAILED or -1. error: remapslot (232): mapped=0xffffffffffffffff shift=0 mapmem(old=0xffffffffffffffff, osz=1056768, nsz=1060864, prot=3, fd=4, off=868352) failed 12 -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c19
--- Comment #19 from Jiri Slaby
smaps dump attached. I don't know where this "-1 as first argument" is from. slot->mapped (the old addr) is never assigned MAP_FAILED or -1.
And neither of previous mmap()s fail. -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c20
--- Comment #20 from Jiri Slaby
At least old=0xffffffffffffffff looks quite wrong, but it should result in EINVAL rather than ENOMEM. In general 6.1 includes the maple tree, so that would be my first suspect to look
I have to go now, so for now, I came to a suspicion that this is memory corruption. That 0xffffffffffffffff must be written by something else. So could this mmap() kernel rewrite ended up in something like this in rpm: 1) slots = calloc(); (it's 16320B, so likely via sbrk(), I didn't check) 2) do a lot of mremap() and mmap(), some of them simply returns "slots" address from 1) again. 3) slot[X].mapped gets rewritten by a mistake due to 2) by a bunch of 0xff. 4) we see slot->mapped as -1 here. ? -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c21
--- Comment #21 from Michael Schr�der
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c22
--- Comment #22 from Jiri Slaby
But wouldn't that lead to EINVAL because offset_in_page(addr) is not zero? That sounds more like your debug statement does not match what happens.
I might do some wrong mistake. Feel free to check the patch ;): https://build.opensuse.org/package/view_file/home:jirislaby:debug61/rpm/debu... -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c23
--- Comment #23 from Michael Schr�der
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c24
Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c25
--- Comment #25 from Jiri Slaby
Created attachment 863532 [details] straced-rpm.zst
I was able to get rpm to fail inside strace -f. Full build log attached.
mmap2(NULL, 241664, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0xaa000) = 0xf767e000 mremap(0xf767e000, 241664, 245760, MREMAP_MAYMOVE) = 0xf767e000 mremap(0xf767e000, 245760, 249856, MREMAP_MAYMOVE) = 0xf767e000 mremap(0xf767e000, 249856, 253952, MREMAP_MAYMOVE) = 0xf767e000 mremap(0xf767e000, 253952, 258048, MREMAP_MAYMOVE) = 0xf767e000 mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory) mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory) mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory) mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory) mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory) mremap(0xf767e000, 258048, 262144, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
Looks interesting. -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c26
Jiri Slaby
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c27
Jiri Slaby
mapmem(130): map=0x7ff717f7f000 (old=(nil) osz=0 nsz=1056768 prot=3 fd=4 off=868352 )
off in hex=0x0d4000 nsz in hex=0x102000 So mmap mapped this part of the file (Index.db): 0x0d4000-0x1d6000
mapmem(130): remap=0xffffffffffffffff (old=0x7ff717f7f000 osz=1056768 nsz=1060864 prot=3 fd=4 off=868352 )
Now it wants: 0x0d4000-0x1d7000 but fails:
error: remapslot (240): mapped=0x7ff717f7f000 shift=0 mapmem(old=0x7ff717f7f000, osz=1056768, nsz=1060864, prot=3, fd=4, off=868352) failed 12
The original mapping (mmap above):
7ff717f56000-7ff718081000 rw-s 000ab000 fd:00 656385 /usr/lib/sysimage/rpm/Index.db
This mapping provides: 0x0ab000-0x1d6000 BTW, this was requested earlier by another mmap:
mapmem(130): map=0x7ff717f56000 (old=(nil) osz=0 nsz=167936 prot=3 fd=4 off=700416 )
and was not released until now. The above mapping cannot be enlarged, because this vma follows:
7ff718082000-7ff718087000 rw-s 00047000 fd:00 656385 /usr/lib/sysimage/rpm/Index.db
So the question is why remap did not apply the MAYMOVE rule. Provided it's shared, it should create a new mapping (and preserve the old for the older mmap) and return it. Or not? (I don't know the rules here.) -- You are receiving this mail because: You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206359
https://bugzilla.suse.com/show_bug.cgi?id=1206359#c28
Jiri Slaby
participants (1)
-
bugzilla_noreply@suse.com