[Bug 1219522] New: Kernel panic with 6.7.x version
https://bugzilla.suse.com/show_bug.cgi?id=1219522

Bug ID: 1219522
Summary: Kernel panic with 6.7.x version
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: openSUSE Tumbleweed
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: e.kleinmentink@zonnet.nl
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

Created attachment 872422
  --> https://bugzilla.suse.com/attachment.cgi?id=872422&action=edit
dmesg 6.6.11-1 kernel

I have been getting a kernel panic since the 6.7.x kernels. The Caps Lock key is flickering, and if I use the "recovery mode" I usually see:

[T8] psmouse serio1: synaptics: Touchpad model: 1, fw: 8.16, id: 0x1e2b1, caps: 0xf01fa3/0x940300/0x12e800/0x400000, board id: 3276, fw id: 2700068
[T8] psmouse serio1: synaptics: serio: Synaptics pass-through port at isa0060/serio1/input0
[T8] input: SynPS/2 Synaptics TouchPad as /devices/platform/i8042/serio1/input/input2

I have no idea how to debug this. There is no error message. In the past I did a kernel bisect (for Debian, I think), but I could not find a guide for Tumbleweed.

Hardware: Lenovo T580

Grub menu contains:
* 6.7.2-1
* 6.7.1-2
* 6.6.11-1

I have included the "dmesg" output of a normal boot with the "6.6.11-1" kernel.

--
You are receiving this mail because:
You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c1

Takashi Iwai <tiwai@suse.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC   |        |tiwai@suse.com

--- Comment #1 from Takashi Iwai <tiwai@suse.com> ---
First, check with the 6.7.3 kernel in the OBS Kernel:stable repo:
  http://download.opensuse.org/repositories/Kernel:/stable/standard/

If the problem persists, try removing the "verbose" and "splash=...." boot options. This might give you a bit better insight.

If the problem still isn't visible, try booting with the "nomodeset" option instead. If this works, the problem lies in the graphics driver.

In addition, you can try the 6.8-rc kernel in the OBS Kernel:HEAD repo, too:
  http://download.opensuse.org/repositories/Kernel:/HEAD/standard/
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c3

--- Comment #3 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Edwin KM from comment #2)
I am hesitant to install kernels. My grub list contains 3 items. 2 broken. If it will remove my working kernel i can not boot anymore.
Increase the number of installable kernels by editing /etc/zypp/zypp.conf before installing more test kernels. Add entries to the multiversion.kernels line, e.g.
  multiversion.kernels = latest,latest-1,latest-2,latest-3,running
so that the system can keep more kernel packages.
Also unclear which rpm i should install.
Just kernel-default.rpm should suffice.
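The multiversion.kernels advice in comment 3 can be verified before installing test kernels. The following is only a sketch, not part of the original thread: the helper names `multiversion_kernels` and `kernel_slots` are made up for illustration, and the sketch assumes the setting sits on a single uncommented `multiversion.kernels = ...` line as in the example above.

```shell
# Print the value of multiversion.kernels from a zypp.conf-style file.
# The path is a parameter (hypothetical helper) so a copy can be inspected.
multiversion_kernels() {
    conf="${1:-/etc/zypp/zypp.conf}"
    # Keep only the uncommented key; strip "key =" and leading whitespace.
    sed -n 's/^[[:space:]]*multiversion\.kernels[[:space:]]*=[[:space:]]*//p' "$conf" | tail -n 1
}

# Count the comma-separated entries, i.e. how many kernel "slots" are kept.
kernel_slots() {
    multiversion_kernels "$1" | tr ',' '\n' | grep -c .
}
```

Calling `multiversion_kernels` without an argument reads /etc/zypp/zypp.conf; with the example line from comment 3, `kernel_slots` reports five entries.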
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c5

--- Comment #5 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Edwin KM from comment #4)
dracut[I]: *** Including module: zfs ***
dracut-install: Failed to find module 'zfs'
dracut[E]: FAILED: /usr/lib/dracut/dracut-install -D /var/tmp/dracut.uflnAv/initramfs -H -N ^i2o_scsi$ --kerneldir /lib/modules/6.8.0-rc3-1.gae4495f-default/ -m zfs
dracut[F]: installkernel failed in module zfs
warning: %post(kernel-default-6.8~rc3-1.1.gae4495f.x86_64) scriptlet failed, exit status 1
Do you use zfs, i.e. an out-of-tree module? There is no guarantee that the kernel would work with such a module, of course.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c7

--- Comment #7 from Takashi Iwai <tiwai@suse.com> ---
OK, relieved :)

And why are you booting with recovery mode? Didn't the normal boot work in the past? Recovery mode isn't meant for daily use.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c9

--- Comment #9 from Takashi Iwai <tiwai@suse.com> ---
Hm, and if you boot the kernel normally, but with the nomodeset option and with verbose & splash=* removed, do you still not see any messages at the crash? If so, at which point does it crash? Is it at a very early stage?

It's difficult to judge without knowing what's going on.

You can also try the 6.8-rc kernel as mentioned. If it works, there is a good chance that 6.7.x will pick up the fix later. OTOH, if it doesn't work with 6.8-rc, it's something to be addressed upstream.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c12

--- Comment #12 from Takashi Iwai <tiwai@suse.com> ---
Actually, the option to remove is "quiet", not "verbose".

And was the photo taken in a boot session with "nomodeset"? This option disables the native graphics driver, so if the native graphics driver is the problem, the kernel simply continues to use the EFI framebuffer, which is more robust.

Also, at that moment, is the LED flashing? If so, it's weird; a flashing LED usually indicates a kernel panic, and a kernel panic should print as much as possible to the screen.
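To confirm which of the boot options discussed here ("quiet", "splash=*", "nomodeset") were actually in effect for a given boot, the kernel command line can be inspected. This is a minimal sketch rather than anything posted in the thread; the function name `check_boot_options` is hypothetical, and the command line is passed in as a string so the helper can also be tried on a pasted line.

```shell
# Report which of the options from this thread appear in a kernel command
# line. On a live system, pass "$(cat /proc/cmdline)" as the argument.
check_boot_options() {
    # One option per line, then match each word against the three options.
    printf '%s\n' "$1" | tr ' ' '\n' | while IFS= read -r word; do
        case "$word" in
            quiet)     echo "quiet: set" ;;
            splash=*)  echo "splash: set ($word)" ;;
            nomodeset) echo "nomodeset: set" ;;
        esac
    done
}
```

For a test boot as described above, the expected report would list only "nomodeset: set"; any "quiet" or "splash" line means the option was not removed.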
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c15

--- Comment #15 from Takashi Iwai <tiwai@suse.com> ---
Oh, that's an important point that the boot sometimes failed in the past.

First off: don't use recovery mode. It brings nothing but confusion. It's meant for certain purposes, e.g. when the installation failed, but not for debugging or recovering in a case like this. It leads to other problems. So, keep away from it.

So, test only with the normal boot, but with extra options or with options removed.

Do I understand correctly that the very same symptom appears on the normal boot with the nomodeset option and with the "quiet" and "splash=*" options removed? That is, even though the Caps Lock LED is flashing, you see no kernel messages and the screen is frozen? Did you wait long enough (e.g. for a minute) after that?

The second point to check is why the dracut invocation for the 6.8-rc kernel fails. Does it fail in that way only with 6.8-rc kernels? Or did you see similar failures with current or older versions?
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c19

--- Comment #19 from Takashi Iwai <tiwai@suse.com> ---
Thanks. It indicates that some interrupt-related bug was triggered, and it likely persists after a (warm) reboot with 6.6.x.

For now, we can track two things:
- Set up kdump and try to catch the crash on the 6.7.x kernel
- Test the 6.8-rc kernel

The latter was attempted in comment 4, and it showed a zfs error. But this looks really strange. The default dracut package has no zfs module. You must have installed the zfs stuff in addition.

Check the contents of /usr/lib/dracut/modules.d. There should be a directory matching '*zfs' (e.g. "90zfs"). If there is, figure out which package it belongs to:
  rpm -qf /usr/lib/dracut/modules.d/90zfs

If you don't use zfs on your system, there is really no need for that package, and it's better to get rid of it.
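The check described in comment 19 can be scripted so that it does not depend on the directory being named exactly "90zfs". The sketch below is an illustration only; the function names are hypothetical, and the directory is a parameter so the lookup can be tried on a scratch directory before running it on the real system.

```shell
# List dracut module directories whose name ends in "zfs" (e.g. "90zfs").
# Defaults to the path named in comment 19.
find_zfs_dracut_modules() {
    dir="${1:-/usr/lib/dracut/modules.d}"
    for d in "$dir"/*zfs; do
        # Skip the unexpanded glob when nothing matches.
        [ -d "$d" ] && printf '%s\n' "$d"
    done
    return 0
}

# For each match, ask rpm which package owns it. If no package owns the
# directory, rpm prints its own "not owned" message instead.
zfs_dracut_owners() {
    find_zfs_dracut_modules "$1" | while IFS= read -r d; do
        printf '%s: ' "$d"
        rpm -qf "$d"
    done
}
```

If `find_zfs_dracut_modules` prints nothing, no zfs dracut module is installed and the failure in comment 4 would need another explanation.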
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c21

--- Comment #21 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Edwin KM from comment #20)
Installed kernel kernel-debug-6.8~rc3-3.1.g7450939.x86_64.rpm. Now i am back to the "random" chance of a booting system.
What do you mean exactly? If you get a kernel panic even with this kernel, you'll need to report it upstream.

Either way, it's better to set up kdump and try to catch the crash log first.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c22

--- Comment #22 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Edwin KM from comment #20)
Can i go back to a really old kernel? Like a year old?
There are unofficial kernel builds for old versions in my OBS repos, e.g. home:tiwai:kernel:6.6, home:tiwai:kernel:6.5, etc.

But the 6.6.x kernel still works even after rebuilding the initrd without zfs, right? Then the regression was clearly introduced after 6.6.x.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c27

Takashi Iwai <tiwai@suse.com> changed:

What  |Removed |Added
----------------------------------------------------------------------------
CC    |        |e.kleinmentink@zonnet.nl
Flags |        |needinfo?(e.kleinmentink@zonnet.nl)

--- Comment #27 from Takashi Iwai <tiwai@suse.com> ---
Thanks, now it's more interesting. The crash logs consistently show a NULL dereference in the synaptics code.

I'm building a test kernel with an additional NULL check in the OBS home:tiwai:bsc1219522 repo. Once the build finishes (it takes an hour or so), it'll appear at
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1219522/standard/

Please give it a try. It'll still show a kernel warning with the stack trace (intentionally), but it shouldn't really crash, if my guess is correct.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c29

--- Comment #29 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Edwin KM from comment #28)
I get a 404 for that link.
It appears to be a problem with the OBS web UI. You can get the binaries directly via osc instead:
  osc getbinaries home:tiwai:bsc1219522/kernel-source:kernel-default/standard/x86_64
fwiw: If i remember correctly i disabled the touchpad that years ago in the bios.
Obviously the synaptics device is still detected and enabled. That might be the reason for the breakage, though: some inconsistent configuration that confused the kernel driver.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c31

--- Comment #31 from Takashi Iwai <tiwai@suse.com> ---
Could you check whether you still get a kernel warning with a stack trace from the patched test kernel?

And, since you disabled the touchpad in the BIOS, does the touchpad itself not work after boot?

FWIW, the below is the test patch.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c32

--- Comment #32 from Takashi Iwai <tiwai@suse.com> ---
Created attachment 873973
  --> https://bugzilla.suse.com/attachment.cgi?id=873973&action=edit
Test fix patch
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c34

--- Comment #34 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Edwin KM from comment #33)
can you create a older kernel with the patch applied? Something like 6.8.1-1 (or older).
Why?
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c35

--- Comment #35 from Takashi Iwai <tiwai@suse.com> ---
I'm asking because my test patch is merely a workaround, meant for spotting the cause. The bug has to be reported to the upstream devs and addressed more properly. That'll be the final fix.

If the bug is really about the NULL dereference there, the kernel warning should appear, and that has to be verified. Then we need to understand why this NULL dereference happens in the first place. My test kernel is provided for checking that.

So, please upload the dmesg output from the test kernel. Then please confirm whether the touchpad is still actually enabled or not. If the touchpad is dead, it might be a half-baked probe of the touchpad that caused the problem.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c38

--- Comment #38 from Takashi Iwai <tiwai@suse.com> ---
Indeed, there is no kernel WARNING in your log, so it's likely something else that made it work.

OBS Kernel:stable also contains the 6.8.2 kernel. Could you check with that kernel instead? It should work like mine.

Or it's really a timing issue, and in that case, it'd be tough to hunt down properly.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c40

--- Comment #40 from Takashi Iwai <tiwai@suse.com> ---
You don't usually have to play with osc. I suggested osc because publishing on OBS was broken at that time. Just grab kernel-default.rpm from the URL listed in comment 1, and install it via zypper install. It has been upgraded to 6.8.3 in the meantime.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c44

Takashi Iwai <tiwai@suse.com> changed:

What  |Removed                             |Added
----------------------------------------------------------------------------
Flags |needinfo?(e.kleinmentink@zonnet.nl) |

--- Comment #44 from Takashi Iwai <tiwai@suse.com> ---
OK, thanks.

A kernel WARNING with a stack trace like the following is not a real crash; it intentionally shows the stack trace for debugging:

[ 8.036410] ------------[ cut here ]------------
[ 8.037093] WARNING: CPU: 2 PID: 662 at drivers/input/mouse/psmouse-base.c:123 psmouse_from_serio+0x1e/0x30

This appears in both logs. So far, so good.

Meanwhile, in the first log it was followed by another Oops message:

[ 8.094105] RIP: 0010:__mem_cgroup_charge+0xb/0xb0
[ 8.095183] Code: 81 58 01 00 00 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 <41> 89 d4 55 48 89 fd 48 89 f7 53 e8 35 89 ff ff ba 01 00 00 00 48
[ 8.096284] RSP: 0000:ffffb0f2407bbd88 EFLAGS: 00000246
[ 8.097449] RAX: 0017ffffc0000000 RBX: ffffb0f2407bbe08 RCX: 0000000000000000
[ 8.098584] RDX: 0000000000000cc0 RSI: ffff90d080071600 RDI: ffffe0e484c8b8c0
[ 8.099787] RBP: ffffb0f2407bbe08 R08: ffffe0e484c8b8c0 R09: ffff90d3ef3404f0
[ 8.100868] R10: 0000000000000000 R11: 0000000000000001 R12: ffff90d082919840
[ 8.102187] R13: fffffffffffff000 R14: 0000000000000001 R15: 00007ff171d89000
[ 8.103263] do_anonymous_page+0x23e/0x6e0
[ 8.104764] ? pmdp_invalidate+0x130/0x130
[ 8.105930] __handle_mm_fault+0xb4d/0xe60
[ 8.107361] handle_mm_fault+0x17f/0x360
[ 8.108505] do_user_addr_fault+0x15b/0x670
[ 8.109694] exc_page_fault+0x71/0x160
[ 8.110883] asm_exc_page_fault+0x26/0x30
[ 8.112031] RIP: 0033:0x7ff17376c9e4
[ 8.113173] Code: 3a e0 c5 f8 77 c3 c5 fe 6f 4e 20 f7 c1 00 0e 00 00 75 65 49 89 c9 48 8d 4c 16 ff 48 83 ce 3f 4a 8d 7c 0e 01 48 29 f1 48 ff c6 <f3> a4 c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 66 66 2e 0f 1f
[ 8.114373] RSP: 002b:00007ffc29992758 EFLAGS: 00010212
[ 8.115565] RAX: 00007ff171d7d010 RBX: 0000555933e399e0 RCX: 0000000000014010
[ 8.116710] RDX: 0000000000020000 RSI: 00007ff171e4c000 RDI: 00007ff171d89000
[ 8.117976] RBP: 00007ffc29992830 R08: 00007ff171d7d010 R09: fffffffffff3d000
[ 8.119134] R10: 186afaaa2a71579a R11: d9670ae1eee0759f R12: 0000555933e57bda
[ 8.120302] R13: 00007ff1735f2be0 R14: 0000000000020000 R15: 00007ff171d7d010
[ 8.121613] </TASK>
[ 8.122608] ---[ end trace 0000000000000000 ]---

This is unexpected, and it can be a real problem. But as it's not visible in the second log, it might be intermittent.

In any case, the above logs at least indicate that my guess was correct: it was the NULL dereference in the synaptics driver. I'm going to submit the fix patch; it might not be the best fix, but it's obviously better than a crash.
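The distinction drawn in comment 44, between the intentional WARNING at psmouse_from_serio and any further, unexpected trace, can be checked mechanically in a saved dmesg log. The following is a rough sketch under stated assumptions: the function names are hypothetical, the log is assumed to be a plain-text file, and "RIP: 0010:" is used as the marker for a kernel-mode instruction pointer on x86-64 (0033 being the user-mode segment, as in the log above).

```shell
# Count the intentional WARNING lines emitted at psmouse_from_serio.
count_psmouse_warnings() {
    grep -c 'WARNING: .*psmouse_from_serio' "$1"
}

# Print kernel-mode RIP lines that are NOT in psmouse_from_serio; any
# output here points at a trace other than the expected WARNING.
unexpected_rips() {
    grep 'RIP: 0010:' "$1" | grep -v 'psmouse_from_serio' || true
}
```

Run against the first log from this thread, `unexpected_rips` would be expected to print the `__mem_cgroup_charge` line, while a log containing only the intentional WARNING would produce no output.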
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c45

--- Comment #45 from Takashi Iwai <tiwai@suse.com> ---
The upstream submission:
  https://lore.kernel.org/r/20240405084448.15754-1-tiwai@suse.de
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c46

--- Comment #46 from Takashi Iwai <tiwai@suse.com> ---
... and I updated the OBS home:tiwai:bsc1219522 repo with the 6.8.4 kernel now.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c49

--- Comment #49 from Takashi Iwai <tiwai@suse.com> ---
My fix patch is included in the OBS Kernel:stable branch, so later TW kernels will include it. You can use the kernel from the OBS Kernel:stable repo instead, too.

Let's keep testing with a kernel that includes my fix for a while, and see whether the crash happens later or not. My wild guess is that it's an issue that happens only in the early boot stage. If a crash happens later, it's likely something else.

About the acceptance upstream: we just need to wait. Nowadays the response is a bit slow in the input driver subsystem.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c50

Jiri Slaby <jslaby@suse.com> changed:

What       |Removed |Added
----------------------------------------------------------------------------
CC         |        |jslaby@suse.com
Status     |NEW     |RESOLVED
Resolution |---     |FIXED

--- Comment #50 from Jiri Slaby <jslaby@suse.com> ---
Hopefully fixed.
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c51

Jiri Slaby <jslaby@suse.com> changed:

What       |Removed  |Added
----------------------------------------------------------------------------
Status     |RESOLVED |REOPENED
Flags      |         |needinfo?(tiwai@suse.com)
Resolution |FIXED    |---

--- Comment #51 from Jiri Slaby <jslaby@suse.com> ---
(In reply to Jiri Slaby from comment #50)
Hopefully fixed.
But only in downstream. Takashi, could you resend?
  patches.suse/Input-psmouse-add-NULL-check-to-psmouse_from_serio.patch
https://bugzilla.suse.com/show_bug.cgi?id=1219522#c52

Takashi Iwai <tiwai@suse.com> changed:

What  |Removed                   |Added
----------------------------------------------------------------------------
Flags |needinfo?(tiwai@suse.com) |

--- Comment #52 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Jiri Slaby from comment #51)
(In reply to Jiri Slaby from comment #50)
Hopefully fixed.
But only in downstream. Takashi, could you resend?
patches.suse/Input-psmouse-add-NULL-check-to-psmouse_from_serio.patch
Done: https://lore.kernel.org/20241230111554.1440-1-tiwai@suse.de