[kernel-bugs] [Bug 1172541] New: Mysterious crashes with kernel 5.6.14
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541

            Bug ID: 1172541
           Summary: Mysterious crashes with kernel 5.6.14
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: x86-64
                OS: openSUSE Factory
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: jimc@jfcarter.net
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Version: kernel-default-5.6.14-1.1.x86_64 and aarch64 for Tumbleweed

I did a dist-upgrade on 2020-06-02. After I waited for download.opensuse.org to
recover from its illness, the upgrade went smoothly. But after I rebooted all
the machines, two of them went catatonic 5 to 10 minutes later, and a third (my
laptop) froze up a few minutes after being booted two days later. After I power
cycled the gateway machine, it again froze about five minutes after booting. On
those hosts I reverted (via the grub menu) to kernel-default-5.6.12-1.3.x86_64
and aarch64; there were no further catatonic incidents. The other hosts
continued on 5.6.14 for up to 36 hours and did not go catatonic (but I
eventually reverted them also).

Here are the symptoms: on two machines I was working on the console (XFCE) when
it happened. Everything seemed perfectly normal, but suddenly keystrokes had no
effect and continuously updating apps (load meter etc.) ceased updating. On the
gateway, network through-traffic was no longer forwarded. After I rebooted it I
checked syslog (debug level). After the last normal message (e.g. DHCP lease
renewed) there was a burst of \0's, and the dmesg dump appeared immediately
after, starting with "microcode updated early to revision..." and "Jun  4
08:47:03 jacinth kernel: [    0.000000] Linux version 5.6.12-1-default
(geeko@buildhost) (gcc version 9.3.1 20200406 [revision
6db837a5288ee3ca5ec504fbd5a765817e556ac2] (SUSE Linux)) #1 SMP Tue May 12
17:44:12 UTC 2020 (9bff61b)", i.e. the very first expected messages from dmesg.
(I carefully booted it into 5.6.12, not 5.6.14.) I have not been able to find
any useful symptoms that might shed light on what is killing the machines.
Machines running 5.6.12 did not crash, either before the dist-upgrade or after
I reverted.

For what it's worth, here is some data about the various machines.

These went catatonic:

jacinth   Gateway, directory server, never sleeps. Intel NUC6CAYH, Celeron(R)
          CPU J3455 @ 1.50GHz, Intel HD Graphics 500 (i915). It runs hostapd as
          a wireless access point, driver 8812au from
          rtl8812au-kmp-default-5.6.4.2+git20200318.49e98ff_k5.6.12_1-1.7.x86_64

xena      Laptop, wireless, powered off when the human is sleeping. Acer
          Aspire A515-54-51D1, Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz, Intel
          UHD Graphics 620 (Whiskey Lake) (i915)

diamond   Normal desktop. When it crashed nobody was using it. At night it
          hibernates. Intel NUC7i5BNH, Core(TM) i5-7260U CPU @ 2.20GHz, Intel
          Iris Plus Graphics 640 (i915)

These survived up to 36 hours on 5.6.14 (called "non-catatonic" below):

claude    Webserver (1.5 hits/min). VM (KVM) on Jacinth, x86_64, video=cirrus

holly     Desktop replacement. Raspberry Pi-3B, Broadcom BCM2837 @ 1.2GHz and
          VideoCore IV (vc4), aarch64. RPi's can't sleep.

iris      Audio-video player (17 kbyte/sec), hibernates when unused. Intel
          NUC6CAYH, Celeron(R) CPU J3455 @ 1.50GHz (like Jacinth), Intel HD
          Graphics 500 (i915)

oso       Development VM (KVM) on Diamond, x86_64, video=cirrus

petra     Development VM (KVM) on Xena, x86_64, video=cirrus

surya     Cloud server, VM (KVM) at Linode, never sleeps :-), Intel(R) Xeon(R)
          CPU E5-2680 v3 @ 2.50GHz, no GPU at all.
Thinking that X-Windows activity might be correlated with catatonia, for about
6 hours I set the non-catatonic machines (except Surya, which has no console)
to show screensaver eye candy continuously. I was wrong; none of them crashed.
Iris's hardware is identical to Jacinth's, yet it did not go catatonic, while
Jacinth failed more times than any other host.

I don't really know what effective action could be taken, beyond trying to
guess which kernel patch may have been the culprit. Anyway, thank you for
whatever you can do with this info.
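[Editorial note, not from the report: the "reverted by the grub menu" step
above can be pinned so a box keeps coming up on the older kernel without
catching the menu by hand. The sketch below uses standard GRUB tooling; the
menu-entry title is only a guess and has to be checked against the local
grub.cfg, and grub2-reboot / grub2-set-default need GRUB_DEFAULT=saved in
/etc/default/grub.]

    # list the entry titles actually present in the generated config
    grep -E "menuentry|submenu" /boot/grub2/grub.cfg

    # boot the older kernel on the next reboot only (one-shot)
    grub2-reboot "openSUSE Tumbleweed, with Linux 5.6.12-1-default"

    # or make it the default until changed back; if the entry sits under an
    # "Advanced options" submenu, prefix it with the submenu title and ">"
    grub2-set-default "openSUSE Tumbleweed, with Linux 5.6.12-1-default"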
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541#c2

--- Comment #2 from James Carter <jimc@jfcarter.net> ---
I installed kernel-default-5.7.0-3.1.gad96a07.x86_64 from OBS Kernel:stable.
Unfortunately the most revealing machines could not use it: Xena uses EFI boot
(you can't switch to "legacy") and the kernel's signature is not acceptable to
the EFI checker. Jacinth is the Wi-Fi access point and needs
rtl8812au-kmp-default-5.6.4.2+$git.k$version, which is not available for
kernel 5.7.0. Anyway, the other machines have been running about 24 hours with
no catatonia. I'll check back when the production version is out.

I was reading changelogs at
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-$version. 5.6.13 and
5.6.16 had nothing relevant; 5.6.15 had a fix for i915 that also seemed
irrelevant; 5.6.14 had a hit:

Commit f18d3beac6bc2e1bddfba202f5327200acbda54c by Chris Wilson. This is a fix
(possibly gone awry) for the following issue and 2 others:
https://gitlab.freedesktop.org/drm/intel/-/issues/1746
OP Arnout Engelen ("1 month ago", about early 2020-05),
"i915_gem_evict_something stalls". He has kernel 5.7.0-rc1 and gets what
sounds a lot like my problem, including no useful log output. He and Chris
Wilson worked out which kconfig items to turn on (rebuild kernel), and he got
400Mb of log files which he gzipped and posted :-) He cross-references two
seemingly relevant bugs from kernels 5.5.2 and 5.3.8. I never saw the actual
changelog for 5.7.0-rc1.

Having got a hit for i915, and since I was becoming impatient reading all the
non-hits, I decided to limit the search to i915 only. The next one was filed
under "5.7": Commit e7cea7905815ac938e6e90b0cb6b91bcd22f6a15, in which Linus
pulls a bunch of DRM fixes, including i915, including something from Chris
Wilson closely related to the one above under 5.6.14.

I'm a little confused by the timing and version interlocking, but I'm calling
it like this: Arnout Engelen is running NixOS with a self-compiled kernel,
which was 5.7.0-rc1. Chris Wilson came up with a fix which landed in 5.6.14
and bit me. I'm guessing that something was wrong with the fix, and Chris
Wilson improved it (I never saw the changelog stanza for this). Linus
eventually got it in time for 5.7.0 (post-rc1). I hope I haven't fixated on a
plausible but irrelevant issue, and the real problem is in some of the changes
that I skipped over.

About the stack traceback: if I read the changelog right, in rare cases it's
possible for the i915 driver to get into an infinite loop looking for tasks to
evict from an assignment table. That would explain sudden catatonia. But
there's no "oops" and nothing is written to syslog or dmesg.
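[Editorial note: the manual changelog reading above can be narrowed to i915
with the stable tree itself. This is only a sketch, assuming a local clone of
linux-stable; it is not something that was actually run for this report.]

    # clone the 5.6 stable branch (large download) ...
    git clone --branch linux-5.6.y \
        https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git linux-stable
    cd linux-stable

    # ... and list only the i915 commits between the good and the bad release
    git log --oneline v5.6.12..v5.6.14 -- drivers/gpu/drm/i915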
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541#c4

--- Comment #4 from James Carter <jimc@jfcarter.net> ---
I installed kernel-default-5.7.1-1.2.x86_64 (except aarch64 on one machine).
The three machines that froze up on kernel-default-5.6.14-1.1.x86_64 also
froze up on 5.7.1. I kept the other six on 5.7.1 for about 48 hours with no
freezeups. Again, there is no trace of the problem in the logs when it
happens.

I have two machines (Intel NUC6CAYH, Celeron J3455, Intel HD Graphics 500,
2x4Gb RAM, Realtek RTL8111/8168/8411 NIC) with identical hardware, for a quick
replacement if the more critical one fails: this is the main router, music
library, directory server, etc. The other is a leaf node that plays audio. The
router is always the first to freeze up, whereas the leaf node has never
frozen yet, despite substantial net traffic for the music. I suspect that
hardware is irrelevant, but it's way too soon to blow off this possibility. I
suspect that the roles exercise a vulnerable data path on the router, but
there is no evidence of this either.

However, when I put 5.7.1 on my "non-failing" hosts I got a rash of connection
failures. This is like chasing ghosts; here's the issue that I worked on
first, since it was the most mission-critical. I don't seriously expect anyone
to figure it out, but I'm getting it on record in case it provides a clue. My
publicly exposed webserver is on a VM (running 5.7.1 on the Celeron). IPv4+6
connections from the wild side come into the main router (running 5.6.12) and
get DNAT to the webserver. HTTP and HTTPS on IPv4 work. IPv6 from the wild
side used to work but doesn't now; with the "comorbid" network issues, though,
IPv6 is too complicated to give any clues. A tester on the webserver VM
attempts https://www.jfcarter.net/ and it times out, both IPv4 and IPv6. I did
lots of troubleshooting including tcpdumps and reboots (into 5.7.1). The
client would initiate a TLS connection with SNI, the server would send its
cert, which the client would accept as valid, but the server then just sat
there, not sending a TLS session ticket. Finally I tried a scorched-earth
solution: rebooted all the leaf nodes and VMs into 5.6.12. The net issues
miraculously vanished (not seen for 2 days), and specifically, web service was
back to normal.

It looks like kernel 5.7.2 is now out. I'm going to install it on one
vulnerable machine and see if it freezes. But in the meantime I'm going to
learn to use git and I'll try to use the bisect feature to find the commit
that messed it up. I wish there were a symptom that appeared immediately and
left the machine alive enough that you could log in remotely and boot back to
5.6.12. But the world isn't arranged for our convenience.

If I ever accidentally uninstall 5.6.12 I'll have a real problem. Several
times I've wanted to revert a bad update, but my only recourse was either to
wait for SuSE to release a fixed version or to fix it myself, e.g.
https://bugzilla.opensuse.org/show_bug.cgi?id=1172256 . The present issue (bug
1172541) finally got me moving to set up an archive of all RPMs installed on
any of my machines, keeping no-longer-installed versions for a month or two,
in case of reversion. Did I reimplement the wheel, that is, does SuSE have
such an archive server? If not, maybe you should.
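[Editorial note: since the comment mentions learning git bisect, this is
roughly what that would look like against the stable tree. A sketch only:
every step means building, installing, and booting the candidate kernel on a
machine that actually freezes, and with no log signature each step needs a
long soak test before it can be called "good". The config path is the usual
openSUSE location, assumed rather than confirmed here.]

    # start from the running kernel's configuration
    cp /boot/config-"$(uname -r)" .config

    git bisect start
    git bisect bad  v5.6.14          # first version seen to freeze
    git bisect good v5.6.12          # last version known good

    # at each step git checks out a candidate commit; build it, install it,
    # reboot into it, and soak-test long enough to trust a "good" verdict
    make olddefconfig
    make -j"$(nproc)"
    sudo make modules_install install

    # after the soak test, report the verdict and repeat
    git bisect good                  # or: git bisect bad

    # when the culprit commit has been identified
    git bisect reset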
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541#c5

--- Comment #5 from Takashi Iwai <tiwai@suse.com> ---
FYI, Tumbleweed has some archives of the former snapshot releases, available
at http://download.opensuse.org/history/

But it doesn't keep them forever, only for a short period. You'd need zypper
addlock to keep a certain package locally.
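[Editorial note: a sketch of what that could look like in practice; the lock
target, repo flag, and zypp.conf value below are examples chosen for this bug,
not part of the comment above.]

    # keep downloaded RPMs in /var/cache/zypp/packages instead of purging them
    zypper mr --keep-packages --all

    # lock the kernel package so the resolver won't remove it (note that a
    # name lock also blocks installing newer kernel-default versions)
    zypper addlock kernel-default

    # openSUSE-specific alternative for kernels only: list the known-good
    # version in /etc/zypp/zypp.conf so the automatic purge keeps it, e.g.
    #   multiversion.kernels = latest,latest-1,running,5.6.12-1.3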
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541#c6

--- Comment #6 from James Carter <jimc@jfcarter.net> ---
"Good" news from a test: I have these two identical machines, one with a VM
and one without; the VM host freezes reliably and the leaf node has never been
seen to freeze. OK, on the leaf node (Iris) I installed
qemu-kvm-5.0.0-3.1.x86_64, libvirt-daemon-6.4.0-1.1.x86_64 and numerous
friends (the latest) and created a new VM called Orion, basically copying the
XML and software selection from my other VMs. For testing I ran a console
session on the guest (VNC through virt-viewer) and set the guest's screensaver
to show random eye candy, giving the guest and host consistent but not
excessive net traffic and CPU load. (I have the impression that the eye candy
is not necessary; just being booted is enough; but that's not assured, and
it's more prudent to load up the guest, to maybe avoid a false negative.)

With kernel 5.6.12 on the host and 5.7.1 on the guest, it ran for 14 hours
without freezing. With 5.7.1 on both the host and the guest, it ran for 5
hours and the host froze up, being rebooted by its watchdog timer. I can't
tell if the guest had any issues just before the host froze; I was in bed at
the time. As usual, the log files on both the host and guest look perfectly
normal, then end with a block of \0, and continue on with normal-looking
output from the rebooted session.

So this does support the hypothesis that machines running kernel 5.6.14+
(here, 5.7.1) will freeze if and only if they are host to a VM. It does not
support the alternative fact that the recent update to libvirt and qemu may
have made the host stop freezing.

In the upgrades to 5.6.14 and kernel-default-5.7.0-3.1.gad96a07.x86_64, two of
the VM hosts would freeze within 5 minutes of booting (tried several times
each), but the hosts that are less mission-critical and more convenient for
testing take a long time to freeze, if they're going to freeze at all. It's
going to be hard to get developers to pay attention to this problem if they
can't reproduce it in-house, so I'm going to try to put together an easily
describable software and VM configuration that will freeze up.

@tiwai, I retrieved kernel-default-5.6.12-1.3.x86_64 from
http://download.opensuse.org/history/ , thanks for the link!
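[Editorial note: the "copying the XML" step above, spelled out as a sketch.
The domain names are illustrative (petra is one of the existing VMs in this
thread, orion the new test guest), and the dumped XML needs a fresh name,
UUID, MAC address, and disk image before it will define cleanly.]

    # dump an existing guest's definition, adapt it, and define the test guest
    virsh dumpxml petra > orion.xml
    # edit orion.xml: <name>, <uuid>, <mac address=...>, and the <disk> source
    virsh define orion.xml
    virsh start orion

    # the console session used for the eye-candy load: VNC via virt-viewer
    virt-viewer orion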
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541

James Carter <jimc@jfcarter.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Summary |Mysterious crashes with     |Mysterious crashes with
                   |kernel 5.6.14               |kernel 5.6.14, suspect KVM
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541
http://bugzilla.opensuse.org/show_bug.cgi?id=1172541#c9

--- Comment #9 from James Carter <jimc@jfcarter.net> ---
@miroslav: During the summer I tried to investigate, but the problem is that
the thing won't fail reliably, which makes bisection very hard. And then USA
politics went into underdrive, which took a lot of time and focus away from
"important" work like the kernel bug.

I did a test last night (2020-10-21 AM). I started with
kernel-default-5.8.14-1.2, but unbound wouldn't start, see bug 1177715. I
reverted to 5.7.9-1-default (on the VM host), started the guest, gave it some
minimal work (another host logs in with X-Windows propagation; the guest pings
various hosts and shows a graph of the results with X-Windows), and let it
run. Every 3 hours both hosts do some monitoring and auditing tasks which fork
off up to 10 tests in parallel, 50-60 tests overall. I've noticed that the
host is most likely to crash during these tests, but my attempts to point the
finger of blame at a specific test were fruitless. The tests at midnight went
OK (no complaints, no crash), but 03:04 was the last message in syslog on both
machines, and the host crashed (no watchdog). So kernel 5.7.9 doesn't have a
fix.

It's interesting that neither the host's nor the guest's syslog has a block of
\0's after the last message, while back in June they did. That means that then
and now the log file's inode is getting written, recording how much data had
been handed to the kernel, but formerly the data itself was not yet synced to
disc, while now it gets synced more promptly. rsyslogd has an option to sync
after every write, which I didn't turn on, and perhaps I should. [Done.]

@joerg: None of my machines has a serial port. I could try buying a pair of
USB serial dongles, with a wire between them. If I get this working I'll post
the result. USB is useless while booting, but if the machine is fully
functional until the crash, the USB dongle should send at least something.

I upgraded the guest's packages, specifically libvirt, qemu-x86 and numerous
friends, to get recent important-seeming bugfixes that had not been installed
on the mothballed guest due to political distractions. I'll try the test
again. Wish me luck, I'm going to need it.
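[Editorial note: a guess at the rsyslogd option referred to above; the
selector line and file path are generic examples, not copied from these
machines. Syncing is disabled globally by default, and the "-" prefix on a
file action also suppresses it, so both have to change.]

    # /etc/rsyslog.conf (or a drop-in under /etc/rsyslog.d/), then restart rsyslog
    $ActionFileEnableSync on
    # no "-" prefix on the file path, so rsyslog syncs after every write
    *.*;mail.none    /var/log/messages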