Bug ID | 1209245 |
---|---|
Summary | Xen VM fails to be destroyed (or crashes completely?) if Linux kernel in HVM guest records a crash with configured crashkernel+kdump |
Classification | openSUSE |
Product | openSUSE Distribution |
Version | Leap 15.4 |
Hardware | Other |
URL | https://openqa.suse.de/tests/10650309/modules/kdump_and_crash/steps/151 |
OS | Other |
Status | NEW |
Severity | Major |
Priority | P5 - None |
Component | Virtualization:Other |
Assignee | virt-bugs@suse.de |
Reporter | okurz@suse.com |
QA Contact | qa-bugs@suse.de |
CC | ohering@suse.com, santiago.zarate@suse.com |
Found By | openQA |
Blocker | Yes |

## Observation

openQA test in scenario sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-extratest@svirt-xen-hvm fails in [kdump_and_crash](https://openqa.suse.de/tests/10650309/modules/kdump_and_crash/steps/151): the test waits for the VM to be shut down, but the VM never reaches that state. The Xen VM fails to be destroyed if the Linux kernel in the HVM guest records a crash with crashkernel+kdump configured. https://openqa.suse.de/tests/10663267 is a simplified test scenario showing the same behavior.

Triggering a deliberate crash+kdump in a VM causes the VM to be listed as still "running" in `virsh list --all`, in contrast to the previous "shut off". Interestingly, the "Id" vanishes in that list and is replaced with `-`. If I just trigger a crash in a VM that does not have crashkernel configured, the system goes to "shut off" after 2s as expected.

I ran `xl` during test execution. As soon as the kernel crash is deliberately triggered, the xl output changes to

```
(null) 363 0 1 --p--d
```

corresponding to what `virsh list --all` reports. As soon as the crash is induced, the corresponding qemu process vanishes and `virsh list --all` reports the machine with "(null)", two seconds before xl reports the machine with Id "-".

Versions: Xen 4.16.3_04-150400.4.22.1.x86_64 and domU kernel SLE 12 SP5 4.12.14-120-default.

## Test suite description

The openQA test boots a Xen HVM instance on a Leap 15.4 hypervisor host, with SLE12-SP5-JeOS as the OS in the VM. The operating system in the VM is configured with "yast kdump" to reserve memory for the crashkernel and to enable the kdump collecting service. The VM is then rebooted before a kernel crash is artificially induced with `echo c > /proc/sysrq-trigger`, at which point the VM crashes (as expected). The VM should then be destroyed, which is what the openQA test expects.
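
The inconsistent state from the observation (a domain still listed as "running" although its Id has been replaced with `-`) could be detected automatically. The following is only a minimal sketch, not part of the openQA test: it assumes the usual three-column `Id  Name  State` table printed by `virsh list --all`, the helper names `parse_virsh_list` and `inconsistent_domains` are hypothetical, and `SAMPLE` is illustrative text rather than output captured from the affected host.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    dom_id: str  # numeric Id as text, or "-" when no Id is reported
    name: str
    state: str   # e.g. "running" or "shut off"


def parse_virsh_list(output: str) -> List[Domain]:
    """Parse the table printed by `virsh list --all` into Domain records."""
    domains = []
    for line in output.splitlines():
        stripped = line.strip()
        # Skip blank lines, the column header and the dashed separator.
        if not stripped or stripped.startswith("Id") or set(stripped) == {"-"}:
            continue
        # Split into at most three fields so "shut off" stays in one piece.
        parts = stripped.split(None, 2)
        if len(parts) == 3:
            domains.append(Domain(*parts))
    return domains


def inconsistent_domains(domains: List[Domain]) -> List[Domain]:
    """Return domains reported as "running" although they have no Id."""
    return [d for d in domains if d.dom_id == "-" and d.state == "running"]


# Illustrative sample only (assumption): "example-guest" is an invented name.
SAMPLE = """\
 Id   Name            State
--------------------------------
 -    (null)          running
 -    example-guest   shut off
"""

if __name__ == "__main__":
    for dom in inconsistent_domains(parse_virsh_list(SAMPLE)):
        print(f"inconsistent domain: name={dom.name} state={dom.state}")
```

On a hypervisor host the same functions could be fed the real listing, for example the `stdout` of `subprocess.run(["virsh", "list", "--all"], capture_output=True, text=True)`.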

## Reproducible

Reproducible every time in openQA tests running Xen HVM on openqaw5-xen.qa.suse.de after upgrading the hypervisor host from SLE15-SP2 to SLE15-SP4 and now to Leap 15.4, but it *only* happens if crashkernel+kdump is enabled.

## Expected result

Last good: [20230306-1](https://openqa.suse.de/tests/10630886), when the hypervisor host was still on SLE15-SP2. That job shows how kdump and crashkernel are configured, the VM is destroyed and recreated, rebooted, and then a kernel crash is induced which is properly caught and recovered by the crashkernel.

## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=JeOS-for-kvm-and-xen-Updates&machine=svirt-xen-hvm&test=jeos-extratest&version=12-SP5)

https://progress.opensuse.org/issues/125783 has more details on our investigation.

https://openqa.suse.de/tests/10663267/video?filename=video.ogv shows that the kernel crash backtrace stays on screen from the point when the kernel crash is triggered until the end of test execution. This supports my hypothesis that the VM crashes completely and is unable to load the crashkernel.

https://openqa.suse.de/tests/10663651#step/kdump_and_crash/449 shows that HVM kdump failed on SLE15-SP4, same as in the originally reported job. The same failure appears in https://openqa.suse.de/tests/10663654#step/kdump_and_crash/448, also on SLE15-SP4.

https://bugzilla.suse.com/show_bug.cgi?id=1201645 looks related or similar.