[Bug 1202138] New: PowerPC machine crashes frequently after upgrading to Leap 15.4
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138

Bug ID: 1202138
Summary: PowerPC machine crashes frequently after upgrading to Leap 15.4
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.4
Hardware: PowerPC-64
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: marius.kittler@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

After upgrading a PowerPC machine (the openQA worker qa-power8-4-kvm.qa.suse.de) from Leap 15.3 to Leap 15.4, that machine crashes frequently. It usually does not stay up for more than a few hours. Downgrading the machine (by rolling back to the last BTRFS snapshot with Leap 15.3) allows the machine to run stably again.

Note that there's actually a second machine (the openQA worker qa-power8-5-kvm.qa.suse.de) that we managed to operate without crashes on Leap 15.4. Both machines seem very similar to me, so I'm not sure why one is crashing and the other one is not.

Note that initially neither machine booted on Leap 15.4. The boot seemed to be stuck at some point and the kernel logged messages like
```
[ 197.877239][ C62] watchdog: BUG: soft lockup - CPU#62 stuck for 25s! [swapper/62:0]
```
quite frequently. So I added the kernel parameter `nmi_watchdog=0`. With this parameter both machines could boot on Leap 15.4 (but as mentioned, only qa-power8-5 runs stably without crashing). So the kernel command line used is:
```
root=UUID=89ca2dff-86af-478b-8d4c-2a45ca689fd5 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M nmi_watchdog=0
```

Not sure what other details could be relevant. Unfortunately the journal does not have any interesting message right before the crash.
Via SOL I could once see a kernel panic being logged:
```
QA-Power8-4-kvm login: [ 365.807470][ T3923] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[ 438.050890][ T94] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[ 438.051046][ T94] CPU: 16 PID: 94 Comm: ksof
```
(The filesystem error is likely just a symptom of the crashes.)

Any advice on what I could try? Maybe another kernel parameter? Maybe booting Leap 15.4 but with the kernel version from 15.3 (not sure how I'd do that, though - so any advice would be welcome if that idea sounds helpful)?

By the way, this is `/proc/cpuinfo` on the problematic machine:
```
root=UUID=eebe647f-e867-416e-a0fa-7a6732bfcf9d nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
martchus@QA-Power8-4-kvm:~> cat /proc/cpuinfo
processor : 0
cpu : POWER8, altivec supported
clock : 3857.000000MHz
revision : 2.0 (pvr 004d 0200)
[… repeated 7 more times with processor 8, 16, 24, 32, 40, 48 and 56]
timebase : 512000000
platform : PowerNV
model : 8348-21C
machine : PowerNV 8348-21C
firmware : OPAL
MMU : Hash
```

On the stable machine it looks very similar, but it is actually a model with more processors:
```
processor : 0
cpu : POWER8 (raw), altivec supported
clock : 3857.000000MHz
revision : 2.0 (pvr 004d 0200)
[… repeated 13 more times with processor 8, 16, 24, …]
timebase : 512000000
platform : PowerNV
model : 8335-GCA
machine : PowerNV 8335-GCA
firmware : OPAL
MMU : Hash
```

This is the relevant ticket on the openQA infrastructure tracker (for additional context): https://progress.opensuse.org/issues/114565

--
You are receiving this mail because:
You are the assignee for the bug.
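The two `/proc/cpuinfo` listings above can be compared mechanically. The following is only a minimal sketch: the function names `parse_cpuinfo` and `diff_fields` are made up for illustration, and the sample texts are trimmed excerpts of the listings in this report (one `processor` entry each, because the report elides the others).

```python
def parse_cpuinfo(text):
    """Parse /proc/cpuinfo-style text.

    Returns (number of 'processor' entries, dict of the remaining fields).
    Repeated keys are collapsed; the last value seen wins.
    """
    processors = 0
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and elision markers
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            processors += 1
        else:
            fields[key] = value
    return processors, fields


def diff_fields(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Trimmed excerpts from the report (assumption: one processor entry each).
CRASHING = """\
processor : 0
cpu : POWER8, altivec supported
clock : 3857.000000MHz
revision : 2.0 (pvr 004d 0200)
timebase : 512000000
platform : PowerNV
model : 8348-21C
machine : PowerNV 8348-21C
firmware : OPAL
MMU : Hash
"""

STABLE = """\
processor : 0
cpu : POWER8 (raw), altivec supported
clock : 3857.000000MHz
revision : 2.0 (pvr 004d 0200)
timebase : 512000000
platform : PowerNV
model : 8335-GCA
machine : PowerNV 8335-GCA
firmware : OPAL
MMU : Hash
"""

_, crashing_fields = parse_cpuinfo(CRASHING)
_, stable_fields = parse_cpuinfo(STABLE)
differences = diff_fields(crashing_fields, stable_fields)
for key, (left, right) in differences.items():
    print(f"{key}: {left!r} != {right!r}")
```

On these excerpts only the `cpu`, `machine` and `model` fields differ. That does not explain the crashes, but it narrows down what distinguishes the two workers at this level.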
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c1
Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c2
--- Comment #2 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c3
--- Comment #3 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c4
--- Comment #4 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c5
--- Comment #5 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c6
--- Comment #6 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c9
--- Comment #9 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c10
--- Comment #10 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c11
--- Comment #11 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c12
--- Comment #12 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c13
--- Comment #13 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c14
--- Comment #14 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c15
--- Comment #15 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c16
--- Comment #16 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c17
--- Comment #17 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c18
--- Comment #18 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c19
--- Comment #19 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c20
--- Comment #20 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c21
--- Comment #21 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c22
--- Comment #22 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c23
--- Comment #23 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c24
--- Comment #24 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c25
--- Comment #25 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c26
--- Comment #26 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c27
--- Comment #27 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c28
--- Comment #28 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c29
--- Comment #29 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c30
--- Comment #30 from Marius Kittler
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c32
Oliver Kurz
What kind of hardware is powerqaworker-qam-1.qa.suse.de?
Yes, this is of course also Power8. https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9710 says it's in particular IBM Power System S822LC
So far the problematic workers were POWER8 which use the old KVM code which gets very little testing upstream.
Further, the problematic workers use 'openpower' firmware. That's some specific firmware different from what most machines use, and the little testing that is done by upstream on POWER8 likely happens on different firmware that also supports PowerVM.
Finally the virtualization team does not provide any support for KVM on Power at all. Any support we have is best-effort. I don't have any POWER8 hardware capable of running KVM available.
Let me try to use this channel. Maybe we can find an answer here to a question that was never properly answered elsewhere: What do you consider the best way to run virtual-machine-based tests on PowerPC, and how should that be implemented? So far we have always preferred qemu-based tests because we can use the same approach on x86_64, aarch64, ppc64le (so far) and even s390x. This way we also scale best because machines can be created on the fly based on test parameters, e.g. RAM and storage size as needed for testing. For PowerVM we have a testing "backend", but it has very limited capabilities compared to bare-metal tests, e.g. it interacts with pre-configured LPARs and has no support for saving/loading virtual machine images. Implementing that would require very specific PowerVM knowledge, and any solution would only be valid for PowerPC. So, what is your take on it?
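To illustrate what "machines created on the fly based on test parameters" can look like, here is a rough sketch that turns a few parameters into `qemu-img` and QEMU command lines. It is not how openQA builds its command line: the `VmParams` class, its field names and the default values are invented for this example, and the commands are only constructed, not executed.

```python
from dataclasses import dataclass


@dataclass
class VmParams:
    # Hypothetical parameter names and defaults, chosen for this sketch only.
    ram_mib: int = 4096
    cpus: int = 2
    disk_gib: int = 20
    disk_path: str = "/tmp/test-vm.qcow2"
    qemu_binary: str = "qemu-system-ppc64"


def disk_create_command(params):
    """Command that creates an empty qcow2 image of the requested size."""
    return ["qemu-img", "create", "-f", "qcow2",
            params.disk_path, f"{params.disk_gib}G"]


def vm_start_command(params):
    """Command that starts a KVM guest with the requested RAM, CPUs and disk."""
    return [
        params.qemu_binary,
        "-enable-kvm",
        "-m", str(params.ram_mib),
        "-smp", str(params.cpus),
        "-drive", f"file={params.disk_path},format=qcow2,if=virtio",
        "-nographic",
    ]


params = VmParams(ram_mib=8192, cpus=4, disk_gib=40)
print(" ".join(disk_create_command(params)))
print(" ".join(vm_start_command(params)))
```

Swapping `qemu_binary` for another `qemu-system-*` binary is the only architecture-specific part of this sketch, which is the portability argument made above.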
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c33
--- Comment #33 from Michal Suchanek
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c34
--- Comment #34 from Oliver Kurz
If you want to use KVM on Power it's more likely to work on POWER9. It uses the new KVM code that upstream tests and actively develops.
Nonetheless, the virtualization team does not support it, and it may be broken from time to time and take some time to get fixed. Better chances than POWER8, though.
Thank you. That is helpful information.
For PowerVM both HMC and NovaLink are scriptable, so it's very much possible to create VMs on the fly, but it requires a platform-specific implementation.
For storage and snapshots, FC storage and iSCSI storage are supported by PowerVM, making it possible to move the save/restore/... functionality outside of the test machine. Either requires extra hardware: the default 4x1Gbit NIC is not great for iSCSI, and FC is not available on most machines.
Yes. That's right. All those ideas are good but platform-specific so an expensive investment if we would continue with that.
It's also possible to use some solution that saves/restores the system over network in a platform-independent way, and it may be of use for testing baremetal as well if it ever gets implemented.
Do you mean something like calling "dd" piped to "netcat" from a live system? Because that's what I commonly used to deploy systems easily, including Microsoft Windows XP :D
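For readers unfamiliar with the "dd piped to netcat" approach, the idea is simply to stream a disk image byte for byte over a TCP connection. The sketch below mimics both ends in Python over the loopback interface; the function names and the chunk size are assumptions for illustration, and a real deployment would read from and write to block devices on two different machines.

```python
import socket
import threading

CHUNK = 1 << 20  # 1 MiB per read, comparable to dd's bs=1M


def send_image(src_path, host, port):
    """Sending side: roughly 'dd if=src_path | nc host port'."""
    with open(src_path, "rb") as src, \
            socket.create_connection((host, port)) as conn:
        while chunk := src.read(CHUNK):
            conn.sendall(chunk)


def receive_image(server, dst_path):
    """Receiving side: accept one connection and write all data to dst_path."""
    conn, _ = server.accept()
    with conn, open(dst_path, "wb") as dst:
        while data := conn.recv(CHUNK):
            dst.write(data)


def copy_over_loopback(src_path, dst_path):
    """Demo: run both ends on localhost using an ephemeral port."""
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]
        receiver = threading.Thread(
            target=receive_image, args=(server, dst_path))
        receiver.start()
        send_image(src_path, "127.0.0.1", port)
        receiver.join()
```

The transfer has no integrity check or encryption, which is also true of plain `dd` piped to `netcat`; comparing checksums of source and target afterwards is a cheap safeguard.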
Also iSCSI can be used on most hardware.
Management tools that make creating LPARs easier do exist, but we have not made use of any so far, mostly because Orthos and openQA do the same thing in a cross-platform way. It is failing on Power, though.
And s390 KVM is likely not making full use of the hardware capabilities, either.
Well, it has been working fine for years and supports saving/loading VM images, which is the important factor.
http://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c35
--- Comment #35 from Michal Suchanek