Bug ID:         1202138
Summary:        PowerPC machine crashes frequently after upgrading to Leap 15.4
Classification: openSUSE
Product:        openSUSE Distribution
Version:        Leap 15.4
Hardware:       PowerPC-64
OS:             Other
Status:         NEW
Severity:       Normal
Priority:       P5 - None
Component:      Kernel
Assignee:       kernel-bugs@opensuse.org
Reporter:       marius.kittler@suse.com
QA Contact:     qa-bugs@suse.de
Found By:       ---
Blocker:        ---

After upgrading a PowerPC machine (the openQA worker
qa-power8-4-kvm.qa.suse.de) from Leap 15.3 to Leap 15.4, that machine crashes
frequently. It usually does not stay up for more than a few hours. Downgrading
the machine (by rolling back to the last Btrfs snapshot with Leap 15.3) makes
it run stably again.

Note that there's actually a second machine (the openQA worker
qa-power8-5-kvm.qa.suse.de) that we managed to operate without crashes on Leap
15.4. Both machines seem very similar to me, so I'm not sure why one is
crashing and the other one is not.

Note that initially neither machine booted on Leap 15.4. The boot seemed to
get stuck at some point and the kernel logged messages like
```
[ 197.877239][ C62] watchdog: BUG: soft lockup - CPU#62 stuck for 25s! [swapper/62:0]
```
quite frequently. So I added the kernel parameter `nmi_watchdog=0`. With this
parameter both machines could boot on Leap 15.4 (but as mentioned only
qa-power8-5 runs stable without crashing). So the kernel command line used is:
```
root=UUID=89ca2dff-86af-478b-8d4c-2a45ca689fd5  nospec kvm.nested=1
kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
nmi_watchdog=0
```

Not sure what other details could be relevant. Unfortunately the journal does
not have any interesting message right before the crash. Via SOL I could once
see a kernel panic being logged:
```
QA-Power8-4-kvm login: [  365.807470][ T3923] EXT4-fs error (device sdb1) in
ext4_free_inode:362: Corrupt filesystem
[  438.050890][   T94] Kernel panic - not syncing: corrupted stack end detected
inside scheduler
[  438.051046][   T94] CPU: 16 PID: 94 Comm: ksof
```
(The filesystem error is likely just a symptom of the crashes.)

Any advice on what I could try? Maybe another kernel parameter? Maybe booting
Leap 15.4 but with the kernel version from 15.3 (I'm not sure how I'd do that,
though, so any advice would be welcome if that idea sounds helpful)?

By the way, that's `/proc/cpuinfo` on the problematic machine:
```
root=UUID=eebe647f-e867-416e-a0fa-7a6732bfcf9d  nospec kvm.nested=1
kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M
martchus@QA-Power8-4-kvm:~> cat /proc/cpuinfo 
processor       : 0
cpu             : POWER8, altivec supported
clock           : 3857.000000MHz
revision        : 2.0 (pvr 004d 0200)
[… repeated 7 more times with processor 8, 16, 24, 32, 40, 48 and 56]

timebase        : 512000000
platform        : PowerNV
model           : 8348-21C
machine         : PowerNV 8348-21C
firmware        : OPAL
MMU             : Hash
```

On the stable machine it looks very similar but it is actually a model with
more processors:
```
processor       : 0
cpu             : POWER8 (raw), altivec supported
clock           : 3857.000000MHz
revision        : 2.0 (pvr 004d 0200)
[… repeated 13 more times with processor 8, 16, 24, …]

timebase        : 512000000
platform        : PowerNV
model           : 8335-GCA        
machine         : PowerNV 8335-GCA        
firmware        : OPAL
MMU             : Hash
```
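
In case it helps to compare the two dumps field by field rather than by eye,
something like the following could be used. This is only a rough sketch: the
function names `parse_cpuinfo` and `diff_cpuinfo` are made up for illustration,
and it assumes each input is the raw, unabbreviated output of
`cat /proc/cpuinfo` saved from one machine.

```python
def parse_cpuinfo(text):
    """Collect the distinct values seen for each key in a /proc/cpuinfo dump."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything that is not "key : value"
        fields.setdefault(key.strip(), set()).add(value.strip())
    return fields


def diff_cpuinfo(a, b, ignore=("processor",)):
    """Return {key: (values_in_a, values_in_b)} for keys whose values differ.

    "processor" is ignored by default because the CPU numbering is expected
    to differ between machines with a different number of processors.
    """
    fields_a, fields_b = parse_cpuinfo(a), parse_cpuinfo(b)
    return {
        key: (sorted(fields_a.get(key, ())), sorted(fields_b.get(key, ())))
        for key in sorted(set(fields_a) | set(fields_b))
        if key not in ignore and fields_a.get(key) != fields_b.get(key)
    }
```

Feeding it the contents of two saved dumps (one per machine) returns only the
fields that differ. Judging from the excerpts above, that would be at least
`cpu`, `model` and `machine`.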

That's the relevant ticket on the openQA infrastructure tracker (for additional
context):
https://progress.opensuse.org/issues/114565

