[Bug 1192126] New: VM enters crash/reset loop inside OVMF on reboots
https://bugzilla.suse.com/show_bug.cgi?id=1192126

Bug ID: 1192126
Summary: VM enters crash/reset loop inside OVMF on reboots
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Virtualization:Tools
Assignee: virt-bugs@suse.de
Reporter: fvogt@suse.com
QA Contact: qa-bugs@suse.de
CC: glin@suse.com, jlee@suse.com
Found By: ---
Blocker: ---

After the upgrade of openQA workers to Leap 15.3, some tests with UEFI do not reboot correctly, they get stuck and enter a reset loop:

https://openqa.opensuse.org/tests/1994829#step/disk_boot/2

For some reason this does not happen at all (or much less often?) on older builds of Tumbleweed, but there's no obvious related change...

Can be reproduced by downloading openSUSE-MicroOS-DVD-x86_64-Snapshot20211027-Media.iso (e.g. from https://openqa.opensuse.org/tests/1994829#downloads) and starting QEMU like this:

qemu-system-x86_64 -accel kvm -m 1024 -cdrom /data/openqa/iso/openSUSE-MicroOS-DVD-x86_64-Snapshot20211027-Media.iso -bios /usr/share/qemu/ovmf-x86_64-ms-code.bin -d cpu_reset

The quickest way to trigger it is to append "startshell=1" to the "Installation" option in the boot menu and enter "reboot -f" in the shell. The VM enters a reset loop and not even doing a full reset using the QEMU UI helps.
QEMU's cpu_reset debug output looks like this:

CPU Reset (CPU 0)
RAX=00000000fffffffe RBX=0000000000010000 RCX=0000000000d52120 RDX=000000000002a4b3
RSI=0000000000000000 RDI=0000008da499b79c RBP=000000000000000a RSP=ffffb9744078bd50
R8 =0000000000000000 R9 =000000000002a491 R10=ffffffff87477f20 R11=ffffffff87477f20
R12=0000000000000000 R13=0000000000000061 R14=00000000fffffffe R15=0000000000000000
RIP=ffffffff85a66501 RFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 00007f87746bb540 ffffffff 00c00000
GS =0000 ffff939f7ce00000 ffffffff 00c00000
LDT=0000 0000000000000000 000fffff 00000000
TR =0040 fffffe0000003000 00004087 00008b00 DPL=0 TSS64-busy
GDT= fffffe0000001000 0000007f
IDT= fffffe0000000000 00000fff
CR0=80050033 CR2=00007f87754000f0 CR3=0000000025db8000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000000000000 CCD=0000000000000000 CCO=DYNAMIC
EFER=0000000000000d01
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=ffffff00ffff0000 0000000000000000 XMM05=0000000000000000 00007f8775138ab4
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000001 0000000000000000
XMM08=0000000000000000 0000000000000000 XMM09=ffffffffffffffff ffffff0000000000
XMM10=0000000000000000 0000000000000000 XMM11=ff00000000ff0000 0000ff0000000000
XMM12=0000000000000000 0000000000000000 XMM13=0000000000000000 0000000000000000
XMM14=0000000000000000 0000000000000000 XMM15=0000000000000000 0000000000000000
CPU Reset (CPU 0)
RAX=0000000000000000 RBX=00000000554d4501 RCX=00000000c0010131 RDX=000000000083ca97
RSI=0000000000000002 RDI=00000000554d4551 RBP=0000000000000003 RSP=000000000081f388
R8 =0000000000000003 R9 =4ca54900e701458c R10=4dda4c0d060cc026 R11=0000000000000040
R12=000000000081f648 R13=00000000008292e7 R14=000000000081f578 R15=000000000081f601
RIP=00000000008345e3 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0008 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0018 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0008 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0008 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0008 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0008 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT= 00000000ffffff30 00000027
IDT= 000000000081fd70 0000021f
CR0=c0000033 CR2=0000000000000000 CR3=0000000000800000 CR4=00000660
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000000000000 CCD=0000000000000000 CCO=DYNAMIC
EFER=0000000000000500
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=0000000000000000 0000000000000000 XMM01=0000000000000000 0000000000000000
XMM02=0000000000000000 0000000000000000 XMM03=0000000000000000 0000000000000000
XMM04=0000000000000000 0000000000000000 XMM05=0000000000000000 0000000000000000
XMM06=0000000000000000 0000000000000000 XMM07=0000000000000000 0000000000000000
XMM08=0000000000000000 0000000000000000 XMM09=0000000000000000 0000000000000000
XMM10=0000000000000000 0000000000000000 XMM11=0000000000000000 0000000000000000
XMM12=0000000000000000 0000000000000000 XMM13=0000000000000000 0000000000000000
XMM14=0000000000000000 0000000000000000 XMM15=0000000000000000 0000000000000000
(repeating several times a second)

The first reset is from the guest and expected, but the subsequent ones are inside FW.

This issue appears with qemu-ovmf-x86_64-202105-3.4.noarch on TW as well as qemu-ovmf-x86_64-202008-10.8.1.noarch on Leap 15.3, but not using qemu-ovmf-x86_64-201911-lp152.6.17.1.noarch from Leap 15.2.

--
You are receiving this mail because:
You are on the CC list for the bug.
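The "-d cpu_reset" output from the report can be post-processed to tell the one expected guest reset apart from a firmware reset loop. The following is a minimal Python sketch and not part of the original report: the function names reset_rips and trailing_repeats are made up for illustration, and it assumes the log contains the "CPU Reset (CPU n)" and "RIP=" markers seen in the dump.

```python
import re

# One match per reset block: the header, then the first RIP= that follows it.
# re.S lets ".*?" span the line breaks between the header and the RIP line.
RESET_BLOCK_RE = re.compile(r"CPU Reset \(CPU \d+\).*?RIP=([0-9a-fA-F]+)", re.S)


def reset_rips(log_text):
    """Return the RIP recorded for each 'CPU Reset' block, in log order."""
    return [int(rip, 16) for rip in RESET_BLOCK_RE.findall(log_text)]


def trailing_repeats(rips):
    """Count how many resets at the end of the log share the same RIP."""
    if not rips:
        return 0
    count = 1
    for previous in reversed(rips[:-1]):
        if previous != rips[-1]:
            break
        count += 1
    return count
```

In the dump from the report, the first RIP (ffffffff85a66501) is a kernel address, while the repeating one (00000000008345e3) is a low address in firmware. A growing trailing count at such a low address is consistent with the loop described there.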
https://bugzilla.suse.com/show_bug.cgi?id=1192126
Dominique Leuenberger
https://bugzilla.suse.com/show_bug.cgi?id=1192126
Charles Arnold
https://bugzilla.suse.com/show_bug.cgi?id=1192126
Lubos Kocman
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c1
--- Comment #1 from Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c2
--- Comment #2 from Fabian Vogt
Apparently the crash is due to a rdmsr instruction raising a fault when reading MSR 0xC0010131 (SEV status), which is not available on the host. This is caused by the MemEncryptSevEsIsEnabled function called from the TpmMmioSevDecryptPei module.
The reason it thinks that SEV is available is that PcdPteMemoryEncryptionAddressOrMask is a non-zero value, probably overwritten by Linux at some point. In qemu-ovmf-x86_64-202105-3.4.noarch, it is at address 0x80b010 and doing "set *(long*)0x80b010 = 0" fixes the reset loop.
According to dmesg, the e820 table marks 0x80b010 as usable. That's where my knowledge of EFI and OVMF stops, so I'll stop debugging here. It appears that either PcdPteMemoryEncryptionAddressOrMask is the wrong type of variable (wrong section?) or the region is not correctly marked as reserved.
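The e820 check mentioned above can be repeated on any guest by looking up the address in the "BIOS-e820:" lines of dmesg. The following is a minimal Python sketch under stated assumptions: the helper name e820_type_at is hypothetical, and it assumes the usual kernel log form "BIOS-e820: [mem 0x<start>-0x<end>] <type>".

```python
import re

# Matches lines such as:
#   [    0.000000] BIOS-e820: [mem 0x0000000000808000-0x000000000080ffff] usable
E820_RE = re.compile(r"BIOS-e820: \[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\] (.+)$")


def e820_type_at(dmesg_text, addr):
    """Return the e820 type string of the range containing addr, or None."""
    for line in dmesg_text.splitlines():
        match = E820_RE.search(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)  # the end address is inclusive
        if start <= addr <= end:
            return match.group(3).strip()
    return None
```

For example, e820_type_at(open("dmesg.txt").read(), 0x80b010) returning "usable" would match the observation in this comment; the file name dmesg.txt is only a placeholder for saved dmesg output.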
https://bugzilla.suse.com/show_bug.cgi?id=1192126
Oliver Kurz
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c5
--- Comment #5 from Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c6
Stefan Hundhammer
https://bugzilla.suse.com/show_bug.cgi?id=1192126
Santiago Zarate
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c7
--- Comment #7 from Joey Lee
> Any news here? Currently all openQA workers are stuck on the old OVMF from 15.2.
I am back on this bug and am looking at how to reserve the area in the OVMF code.
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c8
--- Comment #8 from Joey Lee
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c9
--- Comment #9 from Joey Lee
https://bugzilla.suse.com/show_bug.cgi?id=1192126
Jose Lausuch
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c10
--- Comment #10 from Jose Lausuch
...
Sampling every 5 s to /var/log/YaST2/memsample.zcat
*** Starting YaST2 ***
swapoff: bad useage
Try 'swapoff --help' for more information.
/sbin/inst_setup: 197: cannot create /sys/class/zram-control/hot_remove: Directory nonexistent
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c11
--- Comment #11 from Joey Lee
This issue cannot be reproduced on edk2-stable202111 because of this patchset:
80e67af9afcac3b OvmfPkg: introduce a common work area
ab77b6031b03733 OvmfPkg/ResetVector: update SEV support to use new work area format
b9af5037b270c47 OvmfPkg/ResetVector: move the GHCB page setup in AmdSev.asm
I will try to backport those patches.
There have been too many SEV changes since edk2-stable202008, so backporting the above patches would also require backporting many other SEV patches. I have therefore chosen to apply the workaround patch from comment #8 for Leap 15.3. With it, the PcdSevEsWorkArea will always be reserved as an ACPI_NVS region, like this:

[ 0.000000] efi: mem06: [ACPI Mem NVS| | | | | | | | | | |WB|WT|WC|UC] range=[0x000000000080b000-0x000000000080bfff] (0MB)

The size is 4K. For openSUSE TW, I will update OVMF to edk2-stable202111, which includes the above patches.
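The "efi: memNN" line above can be checked mechanically: the range should be 4K in size and should cover the address 0x80b010 from comment #2. The following is a minimal Python sketch, not taken from the patch; the helper name parse_efi_mem is hypothetical, and it assumes the line format shown in this comment.

```python
import re

# Captures the entry index, the memory type (text before the first '|'),
# and the inclusive start and end addresses of the range.
EFI_MEM_RE = re.compile(
    r"efi: mem(\d+): \[([^|]+)\|[^\]]*\] range=\[0x([0-9a-f]+)-0x([0-9a-f]+)\]"
)


def parse_efi_mem(line):
    """Parse one 'efi: memNN' line into a dict, or return None if no match."""
    match = EFI_MEM_RE.search(line)
    if not match:
        return None
    start = int(match.group(3), 16)
    end = int(match.group(4), 16)
    return {
        "index": int(match.group(1)),
        "type": match.group(2).strip(),
        "start": start,
        "end": end,
        "size": end - start + 1,  # the end address is inclusive
    }
```

For the line quoted in this comment, the range 0x80b000 to 0x80bfff gives 0xfff + 1 = 0x1000 bytes, which is the 4K stated above, and 0x80b010 lies inside it.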
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c12
Joey Lee
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c13
--- Comment #13 from Joey Lee
Created attachment 854707 [details] 0001-OvmfPkg-PlatformPei-Always-reserve-the-SEV-ES-work-a.patch
Updated workaround patch. It now at least checks for the existence of PcdSevEsWorkAreaBase before reserving the SEV-ES work area.
I have applied this patch on edk2-stable202108 for SLE15-SP3 and built it in my home branch: https://build.opensuse.org/project/monitor/home:joeyli:branches:SUSE:SLE-15-... It can be used for testing.
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c14
Timo Jyrinki
https://bugzilla.suse.com/show_bug.cgi?id=1192126
Joey Lee
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c16
Joey Lee
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c18
--- Comment #18 from Joey Lee
Created attachment 854756 [details] ovmf-bsc1192126-OvmfPkg-PlatformPei-Always-reserve-the-SEV-ES-work-a.patch
Updated workaround patch. It always reserves the SEV-ES work area as an ACPI NVS region.
The workaround patch has been pushed to SLE15-SP3/ovmf and is waiting to be merged.
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c19
--- Comment #19 from Joey Lee
(In reply to Joey Lee from comment #16)
> Created attachment 854756 [details] ovmf-bsc1192126-OvmfPkg-PlatformPei-Always-reserve-the-SEV-ES-work-a.patch
> Updated workaround patch. It always reserves the SEV-ES work area as an ACPI NVS region.
> The workaround patch has been pushed to SLE15-SP3/ovmf and is waiting to be merged.
The patch has been merged into SLE15-SP3/ovmf. Waiting for the change to be duplicated to Leap 15.3 in OBS.
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c21
--- Comment #21 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c22
--- Comment #22 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c23
Joey Lee
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c24
--- Comment #24 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c26
Marius Kittler
https://bugzilla.suse.com/show_bug.cgi?id=1192126
https://bugzilla.suse.com/show_bug.cgi?id=1192126#c27
--- Comment #27 from openQA Review