Bug ID 1172541
Summary Mysterious crashes with kernel 5.6.14
Classification openSUSE
Product openSUSE Tumbleweed
Version Current
Hardware x86-64
OS openSUSE Factory
Status NEW
Severity Normal
Priority P5 - None
Component Kernel
Assignee kernel-bugs@opensuse.org
Reporter jimc@jfcarter.net
QA Contact qa-bugs@suse.de
Found By ---
Blocker ---

Version: kernel-default-5.6.14-1.1.x86_64 and aarch64 for Tumbleweed

I did a dist-upgrade on 2020-06-02.  After I waited for
download.opensuse.org to recover from its illness, the upgrade went
smoothly.  But after I rebooted all the machines, two of them went
catatonic 5 to 10 minutes later, and a third (my laptop) froze up a
few minutes after being booted two days later.  After I power cycled
the gateway machine, it again froze about five minutes after booting.

On those hosts I reverted (via the GRUB menu) to
kernel-default-5.6.12-1.3.x86_64 and aarch64.  There were no further
catatonic incidents.  The other hosts continued (on 5.6.14) for up to
36 hours and did not go catatonic (but I eventually reverted them also).

Here are the symptoms:  On two machines I was working on the console
(XFCE) when it happened.  Everything seemed perfectly normal, but
suddenly keystrokes had no effect and continuously updating apps (load
meter etc.) ceased updating.  On the gateway, network through-traffic
was no longer forwarded.  After I rebooted it I checked syslog (debug
level).  After the last normal message (e.g. DHCP lease renewed) there
was a burst of \0's, and the dmesg dump of the next boot appeared
immediately after, starting with "microcode updated early to
revision..." and "Jun  4 08:47:03 jacinth kernel: [    0.000000] Linux
version 5.6.12-1-default (geeko@buildhost) (gcc version 9.3.1 20200406
[revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2] (SUSE Linux)) #1 SMP
Tue May 12 17:44:12 UTC 2020 (9bff61b)", i.e. the very first expected
messages from dmesg.  (I carefully booted it into 5.6.12, not 5.6.14.)
I have not been able to find any useful symptoms that might shed light
on what is killing the machines.
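
In case anyone wants to repeat the log check, here is a rough sketch of
the sort of thing I mean (hypothetical Python; it assumes syslog is
writing plain text to /var/log/messages, which may differ on your
setup):

#!/usr/bin/env python3
# Hypothetical helper: locate the burst of \0's in the syslog file and
# print the last intact messages before it plus the first one after.
# Assumes syslog goes to a plain-text /var/log/messages; adjust the
# path (or pass it as an argument) for your configuration.
import sys

log = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
with open(log, "rb") as f:
    data = f.read()

pos = data.find(b"\x00")                 # start of the NUL burst
if pos == -1:
    sys.exit("no NUL bytes found in " + log)

print("last messages before the NUL burst:")
for line in data[:pos].splitlines()[-5:]:
    print("  " + line.decode(errors="replace"))

end = pos                                # skip over the whole burst
while end < len(data) and data[end] == 0:
    end += 1
rest = data[end:].splitlines()
if rest:
    print("first message after the burst:")
    print("  " + rest[0].decode(errors="replace"))

Run it as root or on a copy of the log, since /var/log/messages is
normally not world-readable.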

Machines running 5.6.12 did not crash, both before the dist-upgrade and
after I reverted.  

For what it's worth, here is some data about the various machines.  
These went catatonic: 
jacinth     Gateway, directory server, never sleeps.  Intel NUC6CAYH,
            Celeron(R) CPU J3455 @ 1.50GHz, Intel HD Graphics 500
            (i915). It runs hostapd as a wireless access point, driver
            8812au from
            rtl8812au-kmp-default-5.6.4.2+git20200318.49e98ff_k5.6.12_1-1.7.x86_64
xena        Laptop, wireless, powered off when the human is sleeping.  
            Acer Aspire A515-54-51D1, Intel(R) Core(TM) i5-8265U 
            CPU @ 1.60GHz, Intel UHD Graphics 620 (Whiskey Lake) (i915)
diamond     Normal desktop.  When it crashed nobody was using it.  At
            night it hibernates.  Intel NUC7i5BNH, Core(TM) i5-7260U CPU
            @ 2.20GHz, Intel Iris Plus Graphics 640 (i915)

These survived up to 36 hours on 5.6.14 (called "non-catatonic" below):
claude      Webserver (1.5 hits/min). VM (KVM) on Jacinth, x86_64, 
            video=cirrus
holly       Desktop replacement.  Raspberry Pi-3B, Broadcom BCM2837 
            @1.2GHz and VideoCore IV (vc4).  aarch64.  RPi's can't sleep.  
iris        Audio-video player (17kbyte/sec), hibernates when unused.  
            Intel NUC6CAYH, Celeron(R) CPU J3455 @ 1.50GHz (like Jacinth),
            Intel HD Graphics 500 (i915)
oso         Development VM (KVM) on Diamond, x86_64, video=cirrus
petra       Development VM (KVM) on Xena, x86_64, video=cirrus
surya       Cloud server, VM (KVM) at Linode, never sleeps :-), Intel(R) 
            Xeon(R) CPU E5-2680 v3 @ 2.50GHz, no GPU at all.  

Thinking that X-Windows activity might be correlated with catatonia,
for about 6 hours I set the non-catatonic machines (except Surya, which
has no console) to show screensaver eye candy continuously.  I was
wrong; none of them crashed.

Iris's hardware is identical to Jacinth's, yet Iris did not go catatonic
while Jacinth failed more times than any other host.

I don't really know what effective action could be taken, beyond trying
to guess which kernel patch may have been the culprit.  Anyway, thank
you for whatever you can do with this info.

