Bug ID 1209245
Summary Xen VM fails to be destroyed (or crashes completely?) if Linux kernel in HVM guest records a crash with configured crashkernel+kdump
Classification openSUSE
Product openSUSE Distribution
Version Leap 15.4
Hardware Other
URL https://openqa.suse.de/tests/10650309/modules/kdump_and_crash/steps/151
OS Other
Status NEW
Severity Major
Priority P5 - None
Component Virtualization:Other
Assignee virt-bugs@suse.de
Reporter okurz@suse.com
QA Contact qa-bugs@suse.de
CC ohering@suse.com, santiago.zarate@suse.com
Found By openQA
Blocker Yes

## Observation

openQA test in scenario
sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-extratest@svirt-xen-hvm
fails in
[kdump_and_crash](https://openqa.suse.de/tests/10650309/modules/kdump_and_crash/steps/151)
while waiting for the VM to shut down, which never happens. The
Xen VM fails to be destroyed if the Linux kernel in the HVM guest records a
crash with crashkernel+kdump configured.
https://openqa.suse.de/tests/10663267 is a simplified test scenario showing the
same behavior.

Triggering a deliberate crash+kdump in a VM causes the VM to be listed as
still "running" in `virsh list --all`, whereas previously it was listed as
"shut off". Interestingly, the "Id" vanishes in that list and is replaced with
`-`. If I just trigger a crash in a VM that does not have crashkernel
configured, the system goes to "shut off" after 2s as expected. I ran `xl`
during test execution, and as soon as the kernel crash is deliberately
triggered the xl output changes to

```
(null)                                     363     0     1     --p--d
```

corresponding to what `virsh list --all` reports. As soon as the crash is
induced the corresponding qemu process vanishes and `virsh list --all` reports
the machine with "(null)", two seconds before xl reports the machine with Id
"-".
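
To make the state column in the `xl` output above easier to read, here is a minimal sketch that decodes it. It assumes the line comes from `xl list` and that the six positions of the State column stand for r (running), b (blocked), p (paused), s (shutdown), c (crashed) and d (dying), in that order. The helper name `decode_xl_state` is hypothetical and not part of any Xen tooling.

```python
from typing import List

# Assumed flag order of the State column in `xl list`: r, b, p, s, c, d.
XL_STATE_FLAGS = (
    ("r", "running"),
    ("b", "blocked"),
    ("p", "paused"),
    ("s", "shutdown"),
    ("c", "crashed"),
    ("d", "dying"),
)


def decode_xl_state(flags: str) -> List[str]:
    """Translate a State field such as '--p--d' into readable names."""
    if len(flags) != len(XL_STATE_FLAGS):
        raise ValueError(f"expected {len(XL_STATE_FLAGS)} characters, got {flags!r}")
    names = []
    for char, (letter, name) in zip(flags, XL_STATE_FLAGS):
        if char == letter:
            names.append(name)
        elif char != "-":
            raise ValueError(f"unexpected character {char!r} in {flags!r}")
    return names


# The line observed in this report; the State field is the fifth column.
line = "(null)                                     363     0     1     --p--d"
state_field = line.split()[4]
print(decode_xl_state(state_field))  # prints ['paused', 'dying']
```

Under these assumptions the domain is flagged as paused and dying, while the "crashed" flag is not set.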

Observed with Xen 4.16.3_04-150400.4.22.1.x86_64 and domU kernel SLE 12 SP5
4.12.14-120-default.

## Test suite description

The openQA test boots a Xen HVM instance on a Leap 15.4 hypervisor host with
the OS in the VM being SLE12-SP5-JeOS. The operating system in the VM is
configured with "yast kdump" to reserve memory for the crashkernel and enable
the kdump collecting service, then rebooted. A kernel crash is then
artificially induced with `echo c > /proc/sysrq-trigger`, whereupon the VM
crashes (as expected). The VM should then be destroyed, which is what the
openQA test expects.
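
The wait for the VM to go away can be approximated outside of openQA with a small polling loop. The following is only a rough sketch and not the actual openQA implementation: the helper names `virsh_domstate` and `wait_for_state`, the domain name in the usage comment, and the timeout values are all assumptions. It relies on `virsh domstate`, which prints states such as "running" or "shut off".

```python
import subprocess
import time
from typing import Callable


def virsh_domstate(domain: str) -> str:
    """Return the state reported by `virsh domstate` for the given domain."""
    result = subprocess.run(
        ["virsh", "domstate", domain],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if virsh cannot find the domain
    )
    return result.stdout.strip()


def wait_for_state(
    get_state: Callable[[], str],
    wanted: str = "shut off",
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Poll get_state() until it returns `wanted` or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if get_state() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage on the hypervisor host, with an assumed domain name:
# ok = wait_for_state(lambda: virsh_domstate("openQA-SUT-1"), timeout=120)
```

With the behavior described in this report, such a loop would be expected to time out, because the domain stays listed as "running" instead of reaching "shut off".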


## Reproducible

Reproducible every time in openQA tests running Xen HVM on
openqaw5-xen.qa.suse.de after upgrading the hypervisor host from SLE15-SP2 to
SLE15-SP4 and now Leap 15.4, but it *only* happens if crashkernel+kdump is
enabled.


## Expected result

Last good: [20230306-1](https://openqa.suse.de/tests/10630886),
when the hypervisor host was still on SLE15-SP2. That job shows how kdump and
crashkernel are configured, how the VM is destroyed, recreated and rebooted,
and how a kernel crash is then induced which is properly caught and recovered
by the crashkernel.


## Further details

Always latest result in this scenario:
[latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=JeOS-for-kvm-and-xen-Updates&machine=svirt-xen-hvm&test=jeos-extratest&version=12-SP5)

https://progress.opensuse.org/issues/125783 has more details on our
investigation.

https://openqa.suse.de/tests/10663267/video?filename=video.ogv shows that the
kernel crash backtrace stays on screen from the point when the kernel crash is
triggered until the end of test execution. This supports my hypothesis that the
VM crashes completely and is unable to load the crashkernel.

https://openqa.suse.de/tests/10663651#step/kdump_and_crash/449 shows the HVM
kdump failing on SLE15-SP4, same as in the originally reported job.
https://openqa.suse.de/tests/10663654#step/kdump_and_crash/448 shows the same,
also on SLE15-SP4.

https://bugzilla.suse.com/show_bug.cgi?id=1201645 looks related or similar.

