[Bug 1217083] New: snapshot 20231107 --> 20231108 boot fails
https://bugzilla.suse.com/show_bug.cgi?id=1217083

Bug ID: 1217083
Summary: snapshot 20231107 --> 20231108 boot fails
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: All
OS: Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Bootloader
Assignee: screening-team-bugs@suse.de
Reporter: roeland.jansen@sue.nl
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

Recently I updated from snapshot 20231107 to 20231108 on two laptops running legacy boot. Both hang for some time at the Plymouth console screen. dracut then reports that it cannot find /dev/system/root and ends up in the dracut emergency shell. In the shell, lvm pvscan reports no physical volumes.

It also happens on a virtual machine (also not UEFI).

If I try to recover with a USB stick, no partitions are shown. If I show "all", the installed system is identified as unknown, architecture unknown (and yes, it's a 64-bit TW stick). I could not recover them anymore.

Also, on one of the laptops, grub does not show Windows anymore; on the other it does. Luckily I have a snapshot on my work VM, so I can revert.

The following packages are updated when this happens:
alsa alsa-devel code kernel-firmware-all kernel-firmware-amdgpu kernel-firmware-ath10k kernel-firmware-ath11k kernel-firmware-atheros kernel-firmware-bluetooth kernel-firmware-bnx2 kernel-firmware-brcm kernel-firmware-chelsio kernel-firmware-dpaa2 kernel-firmware-i915 kernel-firmware-intel kernel-firmware-iwlwifi kernel-firmware-liquidio kernel-firmware-marvell kernel-firmware-media kernel-firmware-mediatek kernel-firmware-mellanox kernel-firmware-mwifiex kernel-firmware-network kernel-firmware-nfp kernel-firmware-nvidia kernel-firmware-platform kernel-firmware-prestera kernel-firmware-qcom kernel-firmware-qlogic kernel-firmware-radeon kernel-firmware-realtek kernel-firmware-serial kernel-firmware-sound kernel-firmware-ti kernel-firmware-ueagle kernel-firmware-usb-network libasound2 libatopology2 libbrotli-devel libbrotlicommon1 libbrotlicommon1-x86-64-v3 libbrotlidec1 libbrotlidec1-x86-64-v3 libbrotlienc1 libbrotlienc1-x86-64-v3 libbytesize-lang libbytesize1 libgusb2 libnghttp2-14 libsqlite3-0 libsqlite3-0-x86-64-v3 libsvn_auth_kwallet-1-0 libxxhash0 openSUSE-release openSUSE-release-appliance-custom openSUSE-release-ftp sqlite3-devel sqlite3-tcl subversion subversion-bash-completion subversion-perl sysuser-shadow ucode-amd wmctrl

I tried mounting in rescue and created a new grub.cfg, in which os-prober no longer shows Windows. The interesting part is that the last line of output from dracut -f is "adding boot menu entry for UEFI firmware settings".

Basically I will end up reinstalling Windows and Linux on both laptops (in legacy boot).

If needed, I can install the non-kernel parts and see which package eventually triggers this.

-- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c3
--- Comment #3 from Roeland Jansen <roeland.jansen@sue.nl> ---
It's all Intel, so I would think it's not used at all. I taboo'd the package and a restart was OK there. Everything was back immediately.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c5
--- Comment #5 from Takashi Iwai <tiwai@suse.com> ---
Hm, then it smells really strange. In general, the microcode won't be updated unless it really matches the CPU (the CPU itself checks).

Since this is the only report, although it should have hit far more people, I'm afraid that we're scratching the wrong surface.

Could you double-check whether it's really the ucode-amd package that breaks things? As mentioned, you can control the ucode loading via a boot option (so you can type it in at the GRUB menu at boot time).
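To confirm that a boot option typed in at the GRUB menu actually reached the kernel, the running command line can be inspected after boot. The following is only a minimal sketch. It assumes the option meant here is the kernel's `dis_ucode_ldr` parameter, which disables the microcode loader; the comment that named the option is not part of this thread excerpt. The helper name `cmdline_has_option` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the whitespace-separated kernel command
# line in $1 contains exactly the option given in $2.
cmdline_has_option() {
    for word in $1; do
        [ "$word" = "$2" ] && return 0
    done
    return 1
}

# Assumption: dis_ucode_ldr is the boot option referred to above.
if [ -r /proc/cmdline ]; then
    if cmdline_has_option "$(cat /proc/cmdline)" dis_ucode_ldr; then
        echo "microcode loader disabled for this boot"
    else
        echo "microcode loader not disabled for this boot"
    fi
fi
```

If the system boots with the loader disabled but fails without it, that would point at the microcode itself; if it fails either way, the initrd contents are the more likely suspect.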
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c6
--- Comment #6 from Roeland Jansen <roeland.jansen@sue.nl> ---
Both on the laptop upstairs and on the VM it was definitely the case. What I did was:

Laptop upstairs (in the unbootable state): started rescue, mounted everything on /mnt, including rbind mounts of /proc, /sys and /dev; chrooted, removed only the ucode-amd package, and it started.

The VM I use at work: installed all the packages -- broken; reverted to the snapshot and updated incrementally in chunks, and without ucode-amd it booted.

In a different setting, while updating a client from 12.4 --> 15.5 offline, I had the same problem, where even two out of four PVs were missing. The LVs found were / and /usr. (SLES)

In the meantime (about a week), I believe packages have been updated on our RMTs, and then it started working. That's what I observed; it's not 100% conclusive.

And I just got word that a colleague of mine had the same issues: several LVs not seen. He's kicking the remote RMT to update and will try to re-update from a previous snapshot.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c7
Roeland Jansen <roeland.jansen@sue.nl> changed:

What  |Removed                          |Added
----------------------------------------------------------------------------
Flags |needinfo?(roeland.jansen@sue.nl) |

--- Comment #7 from Roeland Jansen <roeland.jansen@sue.nl> ---
Created attachment 870905 --> https://bugzilla.suse.com/attachment.cgi?id=870905&action=edit
This is how they all end up in the dracut shell; in the lvm section, neither pvscan nor pvs finds any disks.

Note that this happened not only in Tumbleweed but also in sles15.5 and 15.s for SAP.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c8
Takashi Iwai <tiwai@suse.com> changed:

What     |Removed                  |Added
----------------------------------------------------------------------------
Assignee |kernel-bugs@opensuse.org |dracut-maintainers@suse.de
Flags    |                         |needinfo?

--- Comment #8 from Takashi Iwai <tiwai@suse.com> ---
Then it's not an issue of amd-ucode itself, but something screwed up with dracut (or the info fed to dracut). If it were a problem with the CPU ucode, you wouldn't reach that point at all.

Tossed to dracut maintainers.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c9
Antonio Feijoo <antonio.feijoo@suse.com> changed:

What  |Removed   |Added
----------------------------------------------------------------------------
Flags |needinfo? |
CC    |          |antonio.feijoo@suse.com

--- Comment #9 from Antonio Feijoo <antonio.feijoo@suse.com> ---
(In reply to Roeland Jansen from comment #0)
> Recently I updated 20231107 to 20231108 on two laptops running legacy boot.
The previous dracut update was in snapshot 20231101 (059+suse.511.g0bdb16ac); there were no dracut changes between 20231107 and 20231108.
> I tried mounting in rescue and created a new grub.cfg, in which os-prober no longer shows Windows. The interesting part is that the last line of output from dracut -f is "adding boot menu entry for UEFI firmware settings"
This output is not from dracut, but from grub2-mkconfig. dracut does not add boot menu entries.
> It's all Intel, so I would think it's not used at all.
> I taboo'd the package and a restart was OK there. Everything was back immediately.
Then I'm wondering how ucode-amd can be included in your initramfs. Are you building non-hostonly initrds? Can you attach the output of `dracut -f --debug test.img`?
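As a quick sanity check for the vendor mismatch discussed here, the CPU vendor reported by the kernel can be compared with the microcode package that is installed. This is only a rough sketch: the helper name `ucode_pkg_for_cpuinfo` is made up for illustration, and the mapping covers just the two package names that appear in this thread (ucode-intel and ucode-amd).

```shell
#!/bin/sh
# Hypothetical helper: print the microcode package name that matches the CPU
# vendor found in a cpuinfo-style file (default: /proc/cpuinfo).
ucode_pkg_for_cpuinfo() {
    vendor=$(sed -n 's/^vendor_id[[:space:]]*:[[:space:]]*//p' "${1:-/proc/cpuinfo}" | head -n 1)
    case "$vendor" in
        GenuineIntel) echo ucode-intel ;;
        AuthenticAMD) echo ucode-amd ;;
        *)            echo unknown ;;
    esac
}

if [ -r /proc/cpuinfo ]; then
    echo "CPU vendor suggests: $(ucode_pkg_for_cpuinfo /proc/cpuinfo)"
fi
```

On an Intel-only machine, a result of ucode-intel alongside an installed ucode-amd package would match the situation described in this report.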
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c10
--- Comment #10 from Roeland Jansen <roeland.jansen@sue.nl> ---
If I literally throw away that package, it boots. And here is what I just got from my colleague: all packages installed, VMware, and `model name : Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz` as the CPU seen on the VMs. He installed 15.5 (SLES) --> missing LVs and PVs. He just removed ucode-intel and it boots....

Regarding the question about the output of `dracut -f --debug test.img`: I would then need to forcefully f* up one instance. Not sure if we can do that. At work, these VMs certainly cannot be f* up.

The images are not unified ones; they are built specifically on each of the respective images. I have even seen a VM where both ucode-intel AND ucode-amd were installed as RPMs. The plot thickens.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c11
--- Comment #11 from Antonio Feijoo <antonio.feijoo@suse.com> ---
(In reply to Roeland Jansen from comment #1)
> in my case, ucode-amd triggered the dracut emergency shell.
> Both the failed vm (intel) and laptop (intel) dropped in dracut emerg. shell:

(In reply to Roeland Jansen from comment #10)
> he installed 15.5 (SLES) --> missing lv's and pv's
> he just removed ucode-intel and it boots....
so the boot fails with any ucode? that's pretty weird...
> regarding the question about the output of `dracut -f --debug test.img`?
> I then need to forcefully f* up one instance. Not sure if we can do. At work, these vm's certainly cannot be f* up.
You can install the ucode package that you say is breaking your boot, run `dracut -f --debug test.img 2>&1 &> dracut.log` (it does not install the initrd in the /boot partition), and then uninstall the ucode package. Otherwise, without any kind of log, it's quite difficult to guess what may be happening; I've never seen a similar bug report.
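The redirection in the command above is meant to capture both standard output and standard error in dracut.log. A portable way to get the same effect in any POSIX shell is `> file 2>&1`. The sketch below demonstrates this with a stand-in function, `fake_dracut`, which is purely illustrative and does not call dracut.

```shell
#!/bin/sh
# Stand-in for a command that writes to both streams (not the real dracut).
fake_dracut() {
    echo "normal output"
    echo "debug output" >&2
}

log=$(mktemp)
# Redirect stdout to the log first, then point stderr at the same place.
fake_dracut > "$log" 2>&1

# Both lines are now in the log file.
cat "$log"
```

The order matters: writing `2>&1 > file` instead would duplicate stderr onto the original stdout before stdout is redirected, so only the normal output would land in the file.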
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c12
--- Comment #12 from Roeland Jansen <roeland.jansen@sue.nl> ---
A colleague of mine and I will be at the office and will try to reproduce this issue in a new VM and send the test.img.

We will also read the story above and act on it.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c13
--- Comment #13 from Antonio Feijoo <antonio.feijoo@suse.com> ---
(In reply to Roeland Jansen from comment #12)
> A colleague of mine and I will be at the office and will try to reproduce this issue in a new VM and send the test.img
Just to clarify, we are asking you to attach the `dracut.log` file, not the `test.img`.

BTW, you can use this `test.img` initramfs without breaking the default: copy it to your /boot partition and edit the grub entry at boot, when the grub menu is displayed, changing the value of the `initrd` line to /test.img
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c14
--- Comment #14 from Roeland Jansen <roeland.jansen@sue.nl> ---
We're going to redo the test. My colleague mentioned that when the boot started to fail, he had also seen a UEFI entry in grub, like the one I mentioned when I first wrote this report. His environment is also legacy boot.

We'll update you later today, I hope.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c15
--- Comment #15 from Roeland Jansen <roeland.jansen@sue.nl> ---
the debug log to be attached
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c16
--- Comment #16 from Roeland Jansen <roeland.jansen@sue.nl> ---
Created attachment 870934 --> https://bugzilla.suse.com/attachment.cgi?id=870934&action=edit
dracut debug log of a failed system
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c17
--- Comment #17 from Roeland Jansen <roeland.jansen@sue.nl> ---
Created attachment 870936 --> https://bugzilla.suse.com/attachment.cgi?id=870936&action=edit
updated to latest w/o ucode

This is the log when we uninstall ucode (in this case intel) and do the zypper up. It then boots just fine.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c18
--- Comment #18 from Roeland Jansen <roeland.jansen@sue.nl> ---
And as an added bonus: as said, the system from the boot log above (without ucode) boots. After this we installed ucode-intel; it fired off dracut, and the system fails to boot afterwards. It stalls at "Reached target Basic System"; after that, dracut initqueue complains about timeouts because it cannot find all LVs.

I don't think the ucode itself is the issue, but rather something that triggers dracut to create a broken initrd?
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c25
--- Comment #25 from Roeland Jansen <roeland.jansen@sue.nl> ---
Well....

The problem is that it's a TW issue at home and at work (3 or 4 systems) AND it is also the same issue on SLES (*) (at work). 100% same issue, same problem: no PVs found after update. We do know the workaround on all the systems: removing ucode* upfront.

(*) I could just skip it and let it go like "if someone else has this issue, I don't care", but to me that's not helping. Having two different teams look at it doesn't seem to me an effective use of resources?
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c26
--- Comment #26 from Antonio Feijoo <antonio.feijoo@suse.com> ---
(In reply to Roeland Jansen from comment #25)
> well....
> The problem is that it's at home and work a TW issue (3 or 4 systems) AND it also is a same issue on SLES (*) (at work)
> 100% same issue, same problem, no PVs found after update. We do know the workaround - on all the systems, removing ucode* upfront.
> (*) I could just skip it and let it go like "if someone else has this issue, I don't care" but to me that's not helping.
> Two different teams looking at it to me doesn't seem to me effective use of resources?
SLE is a commercial product, therefore all its incidents must be handled through the SUSE Customer Center (https://scc.suse.com/). That does not mean different teams, but different processes. Thank you for your understanding and I hope this does not cause you any inconvenience.

Other than that, I suspect (it's the only thing I can do with the info I have) that you are experiencing at least 2 different issues:

- 1 : Tumbleweed update from 20231107 to 20231108 on two laptops with Intel CPU => problem with ucode-amd (comment #0, comment #3)
- 2 : SLES 15.5, update? laptop or vm? with Intel CPU => problem with ucode-intel (comment #10)

Tumbleweed and SLE have different kernel, dracut, and firmware versions... Tumbleweed is a rolling release, so it's very unlikely that an issue with a TW update is related to an issue in SLE.

BTW, AFAIK microcode only affects physical CPUs; VM guests do not have microcode of their own, so it's even more strange that a ucode package is breaking a VM. And, as Takashi said in comment #8, a problem with microcode should not allow the system to get this far in the boot process.

So, it'd be great if you could provide the Tumbleweed logs requested in comment #11 (`dracut -f --debug test.img 2>&1 &> dracut.log`), and the file /run/initramfs/rdsosreport.txt after trying to boot with the generated initramfs test.img, also passing `rd.debug` on the kernel command line.

Thanks!
https://bugzilla.suse.com/show_bug.cgi?id=1217083
Antonio Feijoo <antonio.feijoo@suse.com> changed:

What  |Removed |Added
----------------------------------------------------------------------------
Flags |        |needinfo?
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c19
--- Comment #19 from Antonio Feijoo <antonio.feijoo@suse.com> ---
> //etc/os-release@4(source): PRETTY_NAME='SUSE Linux Enterprise Server 15 SP5'
The logs you are providing are from SLE, so please provide the Tumbleweed logs. If this is only about a SLE bug, you should open an L3 incident, so it can be addressed correctly. Thank you.
https://bugzilla.suse.com/show_bug.cgi?id=1217083
Takashi Iwai <tiwai@suse.com> changed:

What  |Removed                   |Added
----------------------------------------------------------------------------
Flags |needinfo?(tiwai@suse.com) |
https://bugzilla.suse.com/show_bug.cgi?id=1217083
Antonio Feijoo <antonio.feijoo@suse.com> changed:

What  |Removed   |Added
----------------------------------------------------------------------------
Flags |needinfo? |
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c27
--- Comment #27 from Roeland Jansen <roeland.jansen@sue.nl> ---
SLE15 is a VM for a customer. The TW systems I cannot provide anymore, as they are all fixed, and reloading the ucode stuff does not trigger the issues after the fix. That led my colleague and me to think this: if ucode (intel/amd) is there, it triggers a specific initrd build that b0rks the system (ref comment #8). (If there is no ucode installed --> no additional dracut run, basically.) If you then update the system, putting ucode back and doing a dracut -f does not fail anymore.

What I can do is check whether I can find both ISOs (TW) again and redo this in a VM.

Re 20231107 to 20231108: it happened on a VM (VMware Workstation) on Windows (Intel) and on two physical systems (laptops, also Intel).

The base image 15.5 we used for the customer (SLE) booted fine and failed after the update. Luckily we had a snapshot, so when we went back, removed ucode there (intel) and updated, all was fine.

So our idea is that somewhere between versions/updates there is a specific condition that breaks the boot process.

Give me some time to find the specific snapshots and replay this in a Workstation VM.

Re SLE: we can definitely 'replay' this from the base image and update, but I think the rdsosreport could be a problem. We'll see.
https://bugzilla.suse.com/show_bug.cgi?id=1217083
Antonio Feijoo <antonio.feijoo@suse.com> changed:

What  |Removed |Added
----------------------------------------------------------------------------
Flags |        |needinfo?(tiwai@suse.com)
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c29
Antonio Feijoo <antonio.feijoo@suse.com> changed:

What       |Removed |Added
----------------------------------------------------------------------------
Resolution |---     |NORESPONSE
Status     |NEW     |RESOLVED

--- Comment #29 from Antonio Feijoo <antonio.feijoo@suse.com> ---
Closing this bug for Tumbleweed after 3 months without response. Please reopen it if you can reproduce it with the current Tumbleweed version (snapshot 20240228) and provide its logs (see comment #26), because we didn't have any other reports similar to this one during this time span.

For the SLE case, an incident must be opened through the SUSE Customer Center (https://scc.suse.com/). Thank you.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c31
--- Comment #31 from Maintenance Automation <maint-coord+maintenance-robot@suse.de> ---
SUSE-RU-2024:1081-1: An update that has four fixes can now be installed.

Category: recommended (important)
Bug References: 1217083, 1219841, 1220485, 1221675
Maintenance Incident: [SUSE:Maintenance:33012](https://smelt.suse.de/incident/33012/)
Sources used:
Basesystem Module 15-SP5 (src): dracut-055+suse.382.g80b55af2-150500.3.18.1
openSUSE Leap 15.5 (src): dracut-055+suse.382.g80b55af2-150500.3.18.1
SUSE Linux Enterprise Micro 5.5 (src): dracut-055+suse.382.g80b55af2-150500.3.18.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c33
--- Comment #33 from Maintenance Automation <maint-coord+maintenance-robot@suse.de> ---
SUSE-RU-2024:2697-1: An update that has three fixes can now be installed.

URL: https://www.suse.com/support/update/announcement/2024/suse-ru-20242697-1
Category: recommended (moderate)
Bug References: 1208690, 1217083, 1220485
Maintenance Incident: [SUSE:Maintenance:34686](https://smelt.suse.de/incident/34686/)
Sources used:
openSUSE Leap 15.4 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise Micro for Rancher 5.3 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise Micro 5.3 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise Micro for Rancher 5.4 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise Micro 5.4 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise High Performance Computing ESPOS 15 SP4 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise High Performance Computing LTSS 15 SP4 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise Desktop 15 SP4 LTSS 15-SP4 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise Server 15 SP4 LTSS 15-SP4 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Linux Enterprise Server for SAP Applications 15 SP4 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Manager Proxy 4.3 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Manager Retail Branch Server 4.3 (src): dracut-055+suse.357.g905645c2-150400.3.34.2
SUSE Manager Server 4.3 (src): dracut-055+suse.357.g905645c2-150400.3.34.2

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
https://bugzilla.suse.com/show_bug.cgi?id=1217083#c34
--- Comment #34 from Maintenance Automation <maint-coord+maintenance-robot@suse.de> ---
SUSE-RU-2024:1081-2: An update that has four fixes can now be installed.

URL: https://www.suse.com/support/update/announcement/2024/suse-ru-20241081-2
Category: recommended (important)
Bug References: 1217083, 1219841, 1220485, 1221675
Maintenance Incident: [SUSE:Maintenance:33012](https://smelt.suse.de/incident/33012/)
Sources used:
SUSE Linux Enterprise Micro 5.5 (src): dracut-055+suse.382.g80b55af2-150500.3.18.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.