[Bug 1183711] New: [Build 8.3] openQA test fails in await_install: Segfault in libzypp/libsolv on upgrade
https://bugzilla.suse.com/show_bug.cgi?id=1183711

Bug ID: 1183711
Summary: [Build 8.3] openQA test fails in await_install: Segfault in libzypp/libsolv on upgrade
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.3
Hardware: Other
URL: https://openqa.opensuse.org/tests/1670893/modules/await_install/steps/1
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: YaST2
Assignee: yast2-maintainers@suse.de
Reporter: fvogt@suse.com
QA Contact: jsrain@suse.com
Found By: openQA
Blocker: Yes

Created attachment 847398 --> https://bugzilla.suse.com/attachment.cgi?id=847398&action=edit
Screenshot of the backtrace

When doing an upgrade to Leap 15.3, YaST segfaults when the actual upgrade starts. Backtrace attached as VM screenshot, I couldn't easily copy it as text... From a quick glance, it seems like repodata_dir2str is passed garbage.

This was found in the live upgrade test, but the non-live build hasn't reached QA yet, so I can't tell whether the DVD upgrade is also affected. It doesn't look like a live specific issue to me so far.

Compared to the last working livecd, mostly YaST and related packages were updated. Most importantly, libzypp and libsolv didn't get changes, but got rebuilt.

## Observation
openQA test in scenario opensuse-15.3-KDE-Live-x86_64-kde_live_upgrade_leap_42.3@64bit-2G fails in [await_install](https://openqa.opensuse.org/tests/1670893/modules/await_install/steps/1)

## Test suite description
Uses the live installer on the kde live media for upgrading the system.

## Reproducible
Fails since (at least) Build [8.3](https://openqa.opensuse.org/tests/1670893) (current job)

## Expected result
Last good: [8.1](https://openqa.opensuse.org/tests/1668222) (or more recent)

## Further details
Always latest result in this scenario: [latest](https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=KDE-Live&machine=64bit-2G&test=kde_live_upgrade_leap_42.3&version=15.3)

--
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c1
--- Comment #1 from Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c2
Stefan Hundhammer
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c3
--- Comment #3 from Fabian Vogt
As you wrote, this happens when trying to upgrade a KDE Live 42.3 system which is not anything that we are supporting; we discussed this numerous times before.
No, this is about upgrading a "normal" 42.3 using the live media. Upgrading a live system doesn't make sense.
Does this also happen with any of the supported systems? Like e.g. Leap 15.2? Do we have an openQA test case with the equivalent failure when trying to upgrade from that supported scenario to Leap 15.3?
As I wrote in the report, there hasn't been an openQA test of that yet.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c4
--- Comment #4 from Stefan Hundhammer
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c5
Ancor Gonzalez Sosa
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c6
Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c7
Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c8
Benjamin Zeller
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c9
--- Comment #9 from Fabian Vogt
A coredump file might help here though.
https://w3.suse.de/~fvogt/boo1183711core.xz This time it didn't crash in libsolv, but the memory corruption caused a "stack level too deep" error in ruby and then it aborted in malloc_printerr.
But depending on the state of libzypp and libsolv when the files are moved, this might indeed be a problem. If we have open fds to the cache, it would be problematic: first, the copied files might be garbage, and second, we'd still continue to use the old files (since the kernel keeps deleted files around as long as a process has them open), or a mix of both.
Do you know in what stage that happens?
Not sure if libsolv keeps the cache files open, adding Michael Schröder as well.
Before copy and link:
lr-x------ 1 root root 64 Mär 31 10:05 27 -> /mnt/var/cache/zypp/solv/@System/solv
lr-x------ 1 root root 64 Mär 31 10:00 28 -> /var/cache/zypp/solv/openSUSE-Leap-15.3-1_0/solv
After copy and link:
lr-x------ 1 root root 64 Mär 31 10:05 27 -> /mnt/var/cache/zypp/solv/@System/solv
lr-x------ 1 root root 64 Mär 31 10:00 28 -> /var/cache/zypp/solv/openSUSE-Leap-15.3-1_0/solv (deleted)
So the 15.3 solv file got unlinked and @System/solv got overwritten!
Excluding @System from the copy prevents the crash. It might work to remove /mnt/var/cache/zypp/ before the copy, so that the open cache isn't modified.
Or cp could be used with --remove-destination.
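The difference between the two fd entries above (fd 27 unchanged, fd 28 marked "(deleted)") can be reproduced in isolation. The following is only a minimal sketch, assuming a POSIX system; the file name and contents are made up, and it does not claim to show how libsolv itself reads its cache. It shows that a reader holding an open descriptor sees new data when the path is rewritten in place, but keeps the old data when the path is unlinked first, which is what cp --remove-destination does.

```python
import os
import tempfile


def read_via_open_fd(replace_in_place: bool) -> bytes:
    """Return what an already-open descriptor sees after the path is rewritten."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "solv")  # hypothetical stand-in for a cache file
        with open(path, "wb") as f:
            f.write(b"ORIGINAL")

        reader = open(path, "rb")  # the long-lived consumer of the file
        try:
            if not replace_in_place:
                # Unlink first, so the new data goes to a fresh inode and
                # the inode behind the open descriptor stays intact.
                os.remove(path)
            # "wb" truncates and rewrites an existing file (same inode);
            # after os.remove() it creates a new file instead.
            with open(path, "wb") as f:
                f.write(b"INSTSYS!")
            reader.seek(0)
            return reader.read()
        finally:
            reader.close()
```

Called with replace_in_place=True the reader gets the new bytes, which corresponds to the overwritten @System/solv; with False it still gets the original bytes, which corresponds to the "(deleted)" entry.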
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c10
Fabian Vogt
(In reply to Benjamin Zeller from comment #8)
Excluding @System from the copy prevents the crash. It might work to remove /mnt/var/cache/zypp/ before the copy, so that the open cache isn't modified.
Yep, that seems to work. I opened a PR: https://github.com/yast/yast-packager/pull/561 Reassigning back to YaST.
Or cp could be used with --remove-destination.
I chose the explicit removal as this avoids mixing local and destination caches completely. If it's an issue that caches of other repos are also cleared, then the cp switch could be used instead.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c11
Benjamin Zeller
https://w3.suse.de/~fvogt/boo1183711core.xz This time it didn't crash in libsolv, but the memory corruption caused a "stack level too deep" error in ruby and then it aborted in malloc_printerr.
Yeah, then it won't be too helpful :/.
So the 15.3 solv file got unlinked and @System/solv got overwritten!
So that means libzypp already had a file in the destination cache and it's overwritten? Maybe we should rethink the order in which this is happening here.
I chose the explicit removal as this avoids mixing local and destination caches completely. If it's an issue that caches of other repos are also cleared, then the cp switch could be used instead.
Still a bit concerned about that: ripping the file away from under libzypp/libsolv in a running transaction sounds like a bad idea. By running transaction I mean somewhere between loading the target and doing a commit. I would still like to get an OK from mlandres and mls about this.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c12
--- Comment #12 from Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c13
--- Comment #13 from Michael Schröder
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c14
Ladislav Slezák
It seems that libzypp/solv has already opened /mnt/var/cache/zypp/solv/@System, and then a different /var/cache/zypp/solv/@System is copied to it. Which sounds like there is some fundamental problem somewhere else.
OK, that is probably the reason: the /mnt/var/.../@System is the original file from the upgraded system, while the /var/.../@System file comes from the inst-sys and is probably just an empty database.
So there are basically these options:
1) Just skip copying the @System file at upgrade, copy the other files
2) Copy the caches only in a fresh installation, do not do that at upgrade
3) Do not copy the cache at all
Option 3) is the safest but does not help to save any RAM; 2) ensures we either copy all or nothing, so the data should be more consistent; 1) is just a solution for the broken @System file, though I do not know if that's enough.
I'm not sure which solution to pick, which one is OK for libzypp. If we do not know, then the safe option is to revert the cache patches completely. Michaels?
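For illustration, option 1) could look roughly like the sketch below. This is not the YaST code (which is Ruby and not shown in this thread); the helper name, the directory layout in the demo and the file contents are assumptions made up for the example. It only shows the idea of copying a cache tree while leaving the @System entry of the destination alone. Note that the other files are still copied over whatever exists at the destination, so this addresses the @System case only.

```python
import os
import shutil
import tempfile


def copy_cache_without_system(src: str, dst: str) -> None:
    """Copy a cache tree but leave any '@System' entry out of the copy."""
    shutil.copytree(
        src,
        dst,
        ignore=shutil.ignore_patterns("@System"),
        dirs_exist_ok=True,  # Python 3.8+: merge into an existing destination
    )


def demo() -> list:
    """Build a made-up inst-sys and target cache, copy, and report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "inst-sys", "solv")
        dst = os.path.join(tmp, "target", "solv")
        for repo in ("@System", "openSUSE-Leap-15.3-1_0"):
            os.makedirs(os.path.join(src, repo))
            with open(os.path.join(src, repo, "solv"), "wb") as f:
                f.write(b"from inst-sys")
        # The upgraded system already has its own @System cache.
        os.makedirs(os.path.join(dst, "@System"))
        with open(os.path.join(dst, "@System", "solv"), "wb") as f:
            f.write(b"from upgraded system")

        copy_cache_without_system(src, dst)

        with open(os.path.join(dst, "@System", "solv"), "rb") as f:
            system_content = f.read()
        return [sorted(os.listdir(dst)), system_content]
```

In the demo the repository cache is copied while the destination's @System/solv keeps the content from the upgraded system.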
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c15
Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c16
--- Comment #16 from Fabian Vogt
(In reply to Michael Schröder from comment #13)
It seems that libzypp/solv has already opened /mnt/var/cache/zypp/solv/@System, and then a different /var/cache/zypp/solv/@System is copied to it. Which sounds like there is some fundamental problem somewhere else.
OK, that is probably the reason, the /mnt/var/.../@System is the original file from the upgraded system, the /var/.../@System file from the inst-sys which is probably just an empty database.
Yes, like I wrote in comment 9.
So there are basically these options:
1) Just skip copying the @System file at upgrade, copy the other files
2) Copy the caches only in a fresh installation, do not do that at upgrade
3) Do not copy the cache at all
Don't forget about option 0), my PR: https://github.com/yast/yast-packager/pull/561
That saves RAM, is consistent and also works for upgrades. Though correctness still needs to be verified by libsolv/libzypp maintainers.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c17
Lukas Ocilka
Setting to P2 as this is a serious issue which might completely block the upgrade process.
In fact, it might IMO be a P1 then. The code is the same for SLE 15 SP3. Needinfo for Stefan. But maybe Ladislav could describe more in which (all) cases this happens...? Thx.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c18
Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
Guillaume GARDET
https://bugzilla.suse.com/show_bug.cgi?id=1183711
Ancor Gonzalez Sosa
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c20
Michael Andres
I'm not sure which solution to pick, which one is OK for libzypp. If we do not know, then the safe option is to revert the cache patches completely.
Don't touch the cache of loaded repos. Unload all repos from the pool, close the target - then moving the repo cache data should be possible.
Regarding the symlinks: zypp cache is directory based. If caches are updated we exchange directories. Symlinking individual files may not have the effect you want.
BTW - did you consider to unload all repos/target, move the cache to /mnt/..., call ZConfig::setRepoCachePath("/mnt/...") and load the stuff again?
And keep in mind that we build and re-build our caches as we need it. The fact that your workflow today does not experience a cache action does not mean it will never happen. Load a new libzypp via a DUD and all your symlinking may be obsolete. You can not control when and why a cache rebuild will be necessary.
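One way to read the remark about directory-based caches is sketched below. This is a generic illustration, assuming a POSIX system; the function, the ".new"/".old" suffixes and the file names are hypothetical, and how libzypp actually exchanges its cache directories is not shown in this thread. The sketch places a per-file symlink inside a cache directory, then "rebuilds" the cache by swapping in a fresh directory; afterwards the path is a regular file again and the symlink target is no longer used.

```python
import os
import tempfile


def rebuild_by_directory_exchange(cache_dir: str, new_content: bytes) -> None:
    """Build a fresh directory next to the cache and swap it into place."""
    fresh = cache_dir + ".new"
    old = cache_dir + ".old"
    os.mkdir(fresh)
    with open(os.path.join(fresh, "solv"), "wb") as f:
        f.write(new_content)
    os.rename(cache_dir, old)    # move the current directory aside
    os.rename(fresh, cache_dir)  # put the rebuilt one in its place
    # Discard the old directory; removing a symlink removes only the link.
    for name in os.listdir(old):
        os.remove(os.path.join(old, name))
    os.rmdir(old)


def demo() -> tuple:
    """Report whether the cache file is a symlink before and after a rebuild."""
    with tempfile.TemporaryDirectory() as tmp:
        target_file = os.path.join(tmp, "target-solv")
        with open(target_file, "wb") as f:
            f.write(b"on target disk")
        cache_dir = os.path.join(tmp, "repo")
        os.mkdir(cache_dir)
        link = os.path.join(cache_dir, "solv")
        os.symlink(target_file, link)  # the per-file symlink
        before = os.path.islink(link)
        rebuild_by_directory_exchange(cache_dir, b"rebuilt")
        after = os.path.islink(link)
        with open(target_file, "rb") as f:
            untouched = f.read()
        return before, after, untouched
```

After the exchange the rebuilt data lives in the cache directory itself, and the file the symlink pointed to is left untouched and stale.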
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c21
--- Comment #21 from Ladislav Slezák
Don't touch the cache of loaded repos. Unload all repos from the pool, close the target - then moving the repo cache data should be possible.
OK, then the 3) option (revert the cache symlinking completely) is the correct solution.
Regarding the symlinks: zypp cache is directory based. If caches are updated we exchange directories. Symlinking individual files may not have the effect you want. BTW - did you consider to unload all repos/target, move the cache to /mnt/..., call ZConfig::setRepoCachePath("/mnt/...") and load the stuff again?
Yes, I think that would be doable. But not for SP3. It would need a lot of changes (including pkg-bindings) and even more testing (to test all scenarios properly: installation, upgrade, AutoYaST...).
And keep in mind that we build and re-build our caches as we need it. The fact that your workflow today does not experience a cache action does not mean it will never happen. Load a new libzypp via a DUD and all your symlinking may be obsolete. You can not control when and why a cache rebuild will be necessary.
We do the symlinking right before starting the package installation, a new DUD cannot be applied at that point. But I see your point...
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c22
--- Comment #22 from Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c23
Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c24
Fabian Vogt
The revert has been implemented in yast2-packager-4.3.21 (https://github.com/yast/yast-packager/pull/562)
Submitted in https://build.suse.de/request/show/238952 to SP3/Leap 15.3.
What's the sr# for TW? I suppose the needinfos can be cleared now.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c25
--- Comment #25 from Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c26
--- Comment #26 from Fabian Vogt
SR for Factory: https://build.opensuse.org/request/show/883636
That one does not include the revert, it doesn't appear to be fixed in master AFAICT.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c27
Fabian Vogt
(In reply to Ladislav Slezák from comment #25)
SR for Factory: https://build.opensuse.org/request/show/883636
That one does not include the revert, it doesn't appear to be fixed in master AFAICT.
It seems like there is still no SR for Tumbleweed, reopening.
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c28
Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
https://bugzilla.suse.com/show_bug.cgi?id=1183711#c29
Ladislav Slezák
https://bugzilla.suse.com/show_bug.cgi?id=1183711
Stefan Weiberg