[Bug 1209245] New: Xen VM fails to be destroyed (or crashes completely?) if Linux kernel in HVM guest records a crash with configured crashkernel+kdump
https://bugzilla.suse.com/show_bug.cgi?id=1209245

Bug ID: 1209245
Summary: Xen VM fails to be destroyed (or crashes completely?) if Linux kernel in HVM guest records a crash with configured crashkernel+kdump
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.4
Hardware: Other
URL: https://openqa.suse.de/tests/10650309/modules/kdump_and_crash/steps/151
OS: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Virtualization:Other
Assignee: virt-bugs@suse.de
Reporter: okurz@suse.com
QA Contact: qa-bugs@suse.de
CC: ohering@suse.com, santiago.zarate@suse.com
Found By: openQA
Blocker: Yes

## Observation

openQA test in scenario sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-extratest@svirt-xen-hvm fails in [kdump_and_crash](https://openqa.suse.de/tests/10650309/modules/kdump_and_crash/steps/151) waiting for the VM to be shut down when it is not. The Xen VM fails to be destroyed if the Linux kernel in the HVM guest records a crash with configured crashkernel+kdump.

https://openqa.suse.de/tests/10663267 is a simplified test scenario showing the same. Triggering a deliberate crash+kdump in a VM causes the VM to be listed as still "running" in `virsh list --all`, in contrast to "shut off" previously. Interestingly, the "Id" vanishes in that list and is replaced with `-`. If I just trigger a crash in a VM that does not have crashkernel configured, the system goes to "shut off" after 2s as expected.

I ran `xl` during test execution, and as soon as the kernel crash is deliberately triggered the xl output changes to

```
(null) 363 0 1 --p--d
```

corresponding to what `virsh list --all` reports. As soon as the crash is induced, the corresponding qemu process vanishes and `virsh list --all` reports the machine with "(null)", two seconds before xl reports the machine with Id "-".

Versions: Xen 4.16.3_04-150400.4.22.1.x86_64 and domU kernel SLE 12 SP5 4.12.14-120-default.
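To make the `xl` row quoted in the observation easier to read, here is a minimal Python sketch that decodes the state column. It assumes the six flag positions documented for `xl list` (running, blocked, paused, shutdown, crashed, dying) and the five-column layout of the row as quoted in this report; the function names are hypothetical and not part of any Xen tooling.

```python
# Flag letters in the order used by the State column of `xl list`
# (assumption: r, b, p, s, c, d as described in the xl manual page).
XL_STATE_FLAGS = [
    ("r", "running"),
    ("b", "blocked"),
    ("p", "paused"),
    ("s", "shutdown"),
    ("c", "crashed"),
    ("d", "dying"),
]


def decode_xl_state(state):
    """Return the names of the flags set in a six-character state field."""
    if len(state) != len(XL_STATE_FLAGS):
        raise ValueError("expected 6 characters, got %r" % (state,))
    names = []
    for char, (letter, name) in zip(state, XL_STATE_FLAGS):
        if char == letter:
            names.append(name)
        elif char != "-":
            raise ValueError("unexpected character %r in %r" % (char, state))
    return names


def parse_xl_row(row):
    """Parse one row as quoted in the report: name, id, mem, vcpus, state."""
    name, domid, mem, vcpus, state = row.split()[:5]
    return {
        "name": name,
        "id": int(domid),
        "mem": int(mem),
        "vcpus": int(vcpus),
        "flags": decode_xl_state(state),
    }


if __name__ == "__main__":
    print(parse_xl_row("(null) 363 0 1 --p--d"))
```

Under these assumptions the quoted row describes a domain that is both paused and dying, which fits a domain whose teardown has started but not completed.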
## Test suite description

The openQA test boots a Xen HVM instance on a Leap 15.4 hypervisor host with the OS in the VM being SLE12-SP5-JeOS. The operating system in the VM is configured with "yast kdump" to reserve memory for the crashkernel and enable the kdump collecting service, then rebooted. A kernel crash is then artificially induced with `echo c > /proc/sysrq-trigger`, at which point the VM crashes (as expected). The VM should then be destroyed, which is what the openQA test expects.

## Reproducible

Reproducible every time in openQA tests running Xen HVM on openqaw5-xen.qa.suse.de after upgrading the hypervisor host from SLE15-SP2 to SLE15-SP4 and now Leap 15.4, but it *only* happens if crashkernel+kdump is enabled.

## Expected result

Last good: [20230306-1](https://openqa.suse.de/tests/10630886), when the hypervisor host was still on SLE15-SP2. It shows how kdump and crashkernel are configured, the VM is destroyed and recreated, rebooted, and then a kernel crash is induced which is properly caught and recovered by the crashkernel.

## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=JeOS-for-kvm-and-xen-Updates&machine=svirt-xen-hvm&test=jeos-extratest&version=12-SP5)

https://progress.opensuse.org/issues/125783 has more details on our investigation.

https://openqa.suse.de/tests/10663267/video?filename=video.ogv shows that the kernel crash backtrace stays on screen from the point when the kernel crash is triggered until the end of test execution. This supports my hypothesis that the VM crashes completely and is unable to load the crashkernel.

https://openqa.suse.de/tests/10663651#step/kdump_and_crash/449 shows HVM kdump failing on SLE15-SP4, same as in the originally reported job. The same happens in https://openqa.suse.de/tests/10663654#step/kdump_and_crash/448, also SLE15-SP4.

https://bugzilla.suse.com/show_bug.cgi?id=1201645 looks related or similar.
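Since the failure only shows up when crashkernel+kdump is enabled, it can help to confirm inside the guest that kdump is actually armed before triggering the crash. The following Python sketch is one way to do that, assuming the standard Linux interfaces `/proc/cmdline` and `/sys/kernel/kexec_crash_loaded`; the helper names are hypothetical and this is not what the openQA test itself runs.

```python
from pathlib import Path


def crashkernel_param(cmdline):
    """Return the value of the first crashkernel= parameter, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def kdump_armed(cmdline, kexec_crash_loaded):
    """True if memory is reserved via crashkernel= and a crash kernel is loaded.

    `kexec_crash_loaded` is the text read from /sys/kernel/kexec_crash_loaded,
    which contains "1" when a crash kernel has been loaded.
    """
    reserved = crashkernel_param(cmdline) is not None
    loaded = kexec_crash_loaded.strip() == "1"
    return reserved and loaded


def read_guest_state():
    """Read both inputs from the running guest (Linux only)."""
    cmdline = Path("/proc/cmdline").read_text()
    loaded = Path("/sys/kernel/kexec_crash_loaded").read_text()
    return cmdline, loaded
```

In the guest, `kdump_armed(*read_guest_state())` would be evaluated before running `echo c > /proc/sysrq-trigger`. A `False` result would correspond to the case described above in which no crashkernel is configured and the VM simply goes to "shut off".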
-- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1209245
https://bugzilla.suse.com/show_bug.cgi?id=1209245#c1

--- Comment #1 from Olaf Hering <ohering@suse.com> ---
In my testing with a SLE15-SP2-LTSS dom0 and a SLE12-SP5 HVM domU, kdump in domU works with kernel-default up to 4.12.14-122.127.1, but starts to fail with 4.12.14-122.130.1.
https://bugzilla.suse.com/show_bug.cgi?id=1209245
https://bugzilla.suse.com/show_bug.cgi?id=1209245#c2

--- Comment #2 from Olaf Hering <ohering@suse.com> ---
When running a xen-4.14.5_08-150300.3.40.1 dom0 and domU kernel 4.12.14-122.127.1, domU kdump works with 'echo c > /proc/sysrq-trigger'.

When running a xen-4.16.3_04-150400.4.22.1 dom0 and domU kernel 4.12.14-122.127.1, this is reported in xl dmesg after 'echo c > /proc/sysrq-trigger' in domU:

(XEN) d2 has active grant 0 (MFN: 0x10540ba)
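When comparing hypervisor versions, it may be convenient to pull such messages out of a saved `xl dmesg` log automatically. The Python sketch below does this with a regular expression. The pattern is an assumption derived only from the line quoted in this comment, other Xen versions may word the message differently, and the function name is hypothetical.

```python
import re

# Pattern modelled on the single message quoted in this comment (assumption).
ACTIVE_GRANT_RE = re.compile(
    r"\(XEN\) d(\d+) has active grant (\d+) \(MFN: (0x[0-9a-fA-F]+)\)"
)


def find_active_grants(log_text):
    """Return (domid, grant_ref, mfn) tuples for every matching log line."""
    results = []
    for match in ACTIVE_GRANT_RE.finditer(log_text):
        domid, grant, mfn = match.groups()
        # int() with base 16 accepts the leading "0x" prefix.
        results.append((int(domid), int(grant), int(mfn, 16)))
    return results


if __name__ == "__main__":
    sample = "(XEN) d2 has active grant 0 (MFN: 0x10540ba)"
    for domid, grant, mfn in find_active_grants(sample):
        print("domain %d: grant %d still active at MFN %#x" % (domid, grant, mfn))
```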
https://bugzilla.suse.com/show_bug.cgi?id=1209245
https://bugzilla.suse.com/show_bug.cgi?id=1209245#c3

--- Comment #3 from Olaf Hering <ohering@suse.com> ---
(XEN) d2 has active grant 0 (MFN: 0x105473d)

is also reported with xen-4.16.0_08-150400.2.12 and xen-4.17.0_04-150500.1.1
https://bugzilla.suse.com/show_bug.cgi?id=1209245
https://bugzilla.suse.com/show_bug.cgi?id=1209245#c4

--- Comment #4 from Olaf Hering <ohering@suse.com> ---
It happens to work with xen-4.18.20230314T170323.391f1e13, despite the "(XEN) d1 has active grant 0 (MFN: 0x10556fd)" message ...
https://bugzilla.suse.com/show_bug.cgi?id=1209245
https://bugzilla.suse.com/show_bug.cgi?id=1209245#c5

--- Comment #5 from Olaf Hering <ohering@suse.com> ---
I think we need 1e454c2b5b1172e0fc7457e411ebaba61db8fc87. Not sure why this was not backported upstream, likely because it is a tools change ...
https://bugzilla.suse.com/show_bug.cgi?id=1209245
https://bugzilla.suse.com/show_bug.cgi?id=1209245#c7

Olaf Hering <ohering@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|IN_PROGRESS                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from Olaf Hering <ohering@suse.com> ---
committed to master and SLE15-SP4-Branch
https://bugzilla.suse.com/show_bug.cgi?id=1209245
https://bugzilla.suse.com/show_bug.cgi?id=1209245#c8

--- Comment #8 from Oliver Kurz <okurz@suse.com> ---
can you please follow a submit request or something I could track?
https://bugzilla.suse.com/show_bug.cgi?id=1209245
https://bugzilla.suse.com/show_bug.cgi?id=1209245#c9

--- Comment #9 from Olaf Hering <ohering@suse.com> ---
(In reply to Oliver Kurz from comment #8)
> can you please follow a submit request or something I could track?

I missed the SP4 submission for some pending XSA. It will be part of future submissions of pkg xen.