[Bug 1203539] New: Boot hangs with update to kernel 5.14.21-15040.24.21-default
https://bugzilla.suse.com/show_bug.cgi?id=1203539

            Bug ID: 1203539
           Summary: Boot hangs with update to kernel 5.14.21-15040.24.21-default
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 15.4
          Hardware: x86-64
                OS: openSUSE Leap 15.4
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: jcarricksmith@gmail.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created attachment 861545
  --> https://bugzilla.suse.com/attachment.cgi?id=861545&action=edit
Outputs from journalctl -b -?

After the latest kernel update, booting stalls after the message "Apparmor profiles loaded" is shown on the boot screen. After a pause, the Caps Lock and Num Lock lights flash together. The only way to recover is to hold the power button down.

Booting 5.14.21-150400.24.18-default works fine.

The attachment contains the output from a boot on ...24.21-... (badjournal) and the output from a good session on ...24.18-... (goodjournal).

Many thanks
John

--
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c2

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
                 CC|                                  |jcarricksmith@gmail.com,
                   |                                  |tiwai@suse.com
              Flags|                                  |needinfo?(jcarricksmith@gmail.com)

--- Comment #2 from Takashi Iwai <tiwai@suse.com> ---
Yeah, the most important bits are missing from the log. Any chance of getting the full stack traces?
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c3

--- Comment #3 from Takashi Iwai <tiwai@suse.com> ---
Also, could you try to boot with the intel_iommu=off boot option?
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c4

--- Comment #4 from John Carrick Smith <jcarricksmith@gmail.com> ---
1. Tried with intel_iommu=off on the boot command line and it made no difference.
2. How do I produce the full stack trace please?

Thanks
John
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c5

--- Comment #5 from Takashi Iwai <tiwai@suse.com> ---
(In reply to John Carrick Smith from comment #4)
> 1. Tried with intel_iommu=off on the boot command line and it made no difference.
> 2. How do I produce the full stack trace please?

The journal was cut off for some reason. Maybe you can get more if you sync and unmount via sysrq before the reboot / reset, e.g. at the hang, try the sysrq combos alt-sysrq-k, alt-sysrq-s, alt-sysrq-u, then alt-sysrq-b.
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c6

John Carrick Smith <jcarricksmith@gmail.com> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
              Flags|needinfo?(jcarricksmith@gmail.com)|

--- Comment #6 from John Carrick Smith <jcarricksmith@gmail.com> ---
Created attachment 861609
  --> https://bugzilla.suse.com/attachment.cgi?id=861609&action=edit
Output from journalctl

I tried using the alt-sysrq-? sequence and it didn't appear to make any difference. Then I tried a second time:
1. I left it in the 'hung' state for 3 minutes.
2. Pressed ctrl-alt-del, then ctrl-alt-ins, then ctrl-alt-bksp.
3. Then followed the alt-sysrq-? sequence.

There is more data in the attached file than in the previous one - I hope it is what you are looking for.

John
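The alt-sysrq sequence suggested in comment #5 appeared to make no difference. One possible explanation, which is an assumption and not something this thread establishes, is that the kernel.sysrq setting limits which SysRq functions the keyboard may trigger. Per the kernel's SysRq documentation, 0 disables SysRq, 1 enables everything, and larger values are a bitmask (0x4 keyboard control such as SAK, 0x10 sync, 0x20 remount read-only, 0x80 reboot). A minimal Python sketch that reports which of the k, s, u and b keys a given value allows; `allowed_keys` and `current_sysrq` are hypothetical helper names.

```python
from pathlib import Path

# Bit values from the kernel's SysRq documentation for the function
# groups that the k, s, u and b keys belong to.
SYSRQ_BITS = {
    "k": (0x4, "Secure Access Key (SAK): kill all programs on the current virtual console"),
    "s": (0x10, "sync all mounted filesystems"),
    "u": (0x20, "remount all mounted filesystems read-only"),
    "b": (0x80, "reboot immediately, without syncing or unmounting"),
}


def allowed_keys(sysrq_value: int, keys: str = "ksub") -> dict:
    """Map each SysRq key to whether kernel.sysrq=sysrq_value permits it
    from the keyboard (0 = all off, 1 = all on, >1 = bitmask)."""
    result = {}
    for key in keys:
        bit, _description = SYSRQ_BITS[key]
        if sysrq_value == 0:
            result[key] = False   # SysRq disabled completely
        elif sysrq_value == 1:
            result[key] = True    # every SysRq function enabled
        else:
            result[key] = bool(sysrq_value & bit)
    return result


def current_sysrq(path: str = "/proc/sys/kernel/sysrq") -> int:
    """Read the running kernel's kernel.sysrq value (Linux only)."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    try:
        value = current_sysrq()
    except OSError:
        value = None
    if value is None:
        print("could not read /proc/sys/kernel/sysrq")
    else:
        for key, ok in allowed_keys(value).items():
            state = "allowed" if ok else "BLOCKED"
            print(f"alt-sysrq-{key}: {state} ({SYSRQ_BITS[key][1]})")
```

Note that the mask only restricts SysRq invoked from the keyboard; writing to /proc/sysrq-trigger as root is always allowed. A machine that is hard-locked may of course ignore the keys regardless of this setting.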
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c7

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
                 CC|                                  |jack@suse.com

--- Comment #7 from Takashi Iwai <tiwai@suse.com> ---
Thanks, it caught more useful information. According to the log, something seems to be stuck in blkdev_flush(). At a quick glance, there have been a few changes in the block layer by Jan.

Jan, does it ring a bell?
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c8

--- Comment #8 from Jan Kara <jack@suse.com> ---
For reference, the softlockup is:

watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd-udevd:711]
Modules linked in: idma64 thermal fjes(-) ac acpi_pad intel_hid(N) acpi_cpufreq(-) sparse_keymap fuse configfs ip_tables x_tables ext4 crc16 mbcache jbd2 hid_generic usbhid rtsx_pci_sdmmc i915 mmc_core sd_mod i2c_algo_bit ttm drm_kms_helper nvme crc32_pclmul crc32c_intel syscopyarea sysfillrect sysimgblt nvme_core fb_sys_fops cec ahci rtsx_pci rc_core libahci xhci_pci nvme_common xhci_pci_renesas ghash_clmulni_intel xhci_hcd aesni_intel drm crypto_simd cryptd libata usbcore serio_raw mfd_core t10_pi i2c_hid_acpi battery i2c_hid wmi video pinctrl_cannonlake button v4l2loopback(OEN) videodev mc sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod msr efivarfs
Supported: No, Unsupported modules are loaded
CPU: 0 PID: 711 Comm: systemd-udevd Tainted: G OE N 5.14.21-150400.24.21-default #1 SLE15-SP4 7550826c4c7e8c258239e300508e0c8b2a69bad2
Hardware name: Novatech PB50_70DFx,DDx /PB50_70DFx,DDx , BIOS 1.07.07TNO 09/25/2019
RIP: 0010:smp_call_function_many_cond+0x126/0x560
RAX: 0000000000000011 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000011 RSI: 0000000000000000 RDI: ffff9416c04794c0
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff9416c0434a40
R13: ffff9416c0434a40 R14: ffffffff9f38d650 R15: 00000000000394c0
FS: 00007f340f894b00(0000) GS:ffff9416c0400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564868903ce8 CR3: 000000010a608002 CR4: 00000000007706f0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __brelse+0x30/0x30
 on_each_cpu_cond_mask+0x25/0x40
 kill_bdev.isra.31+0x16/0x30
 blkdev_flush_mapping+0x46/0xf0
 blkdev_put_whole+0x3a/0x50
 blkdev_put+0x57/0x180
 blkdev_close+0x21/0x30
 __fput+0x8f/0x250
 ? __SCT__preempt_schedule_notrace+0x8/0x8
 task_work_run+0x70/0xb0
 exit_to_user_mode_prepare+0x228/0x230
 syscall_exit_to_user_mode+0x18/0x40
 do_syscall_64+0x67/0x80
 ...

So we are actually stuck in the smp_call_function_many_cond() call. The work queued on each CPU is pretty trivial (read a per-CPU variable and drop one refcount). So this is likely not related to any change in the block layer, but rather caused by some CPU being stuck with interrupts disabled, so that we cannot execute the work on it.

Maybe this is somehow related to the USB devices that are probed by mtp-probe when the hang happens?
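Jan's reasoning in comment #8 rests on how a cross-CPU function call waits: the calling CPU queues a small function on the other CPUs, sends them an inter-processor interrupt, and spins until each one has run it. The toy Python model below only illustrates that description and is not kernel code: `Cpu` and `cross_call` are hypothetical names, loop "ticks" stand in for time, and a CPU with `irqs_enabled=False` plays the role of a CPU stuck with interrupts disabled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Cpu:
    cpu_id: int
    irqs_enabled: bool = True
    queue: List[Callable[[int], None]] = field(default_factory=list)


def cross_call(cpus: List[Cpu], caller_id: int, func: Callable[[int], None],
               watchdog_ticks: int = 22) -> Tuple[bool, List[int]]:
    """Queue func on every other CPU, then spin until all have run it.

    Returns (completed, still_pending_cpu_ids). completed is False when the
    caller is still spinning after watchdog_ticks, which is roughly where
    the real kernel's watchdog would report a soft lockup on the caller.
    """
    pending = set()
    for cpu in cpus:
        if cpu.cpu_id != caller_id:
            cpu.queue.append(func)
            pending.add(cpu.cpu_id)

    for _tick in range(watchdog_ticks):
        for cpu in cpus:
            # Only a CPU with interrupts enabled takes the IPI and runs the work.
            if cpu.cpu_id in pending and cpu.irqs_enabled:
                while cpu.queue:
                    cpu.queue.pop(0)(cpu.cpu_id)
                pending.discard(cpu.cpu_id)
        if not pending:
            return True, []
    # The model gives up here; the real caller just keeps spinning.
    return False, sorted(pending)


if __name__ == "__main__":
    refs = {"count": 3}

    def drop_ref(cpu_id: int) -> None:
        # Stand-in for the trivial per-CPU work described in comment #8.
        refs["count"] -= 1

    cpus = [Cpu(0), Cpu(1), Cpu(2), Cpu(3, irqs_enabled=False)]
    print(cross_call(cpus, caller_id=0, func=drop_ref))
    print(refs)
```

Running it prints `(False, [3])` and `{'count': 1}`: CPUs 1 and 2 did their share of the work almost immediately, but the caller can never finish because CPU 3 never services the request. However cheap the queued function is, one unresponsive CPU is enough to hold the caller, which matches the shape of the trace above.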
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c9

--- Comment #9 from Takashi Iwai <tiwai@suse.com> ---
Makes sense. And, at the end of the log, I see

Sep 21 12:26:44 jlinn kernel: watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [mtp-probe:903]

and the corresponding one is:

Sep 21 12:26:20 jlinn mtp-probe[903]: checking bus 1, device 3: "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-7"
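The correlation in comment #9 can be made explicit with a little arithmetic on the two quoted journal lines: both name PID 903, and subtracting the watchdog's reported 22 s from the 12:26:44 report time puts the start of the stall at roughly 12:26:22, about two seconds after mtp-probe began checking usb1/1-7. A small Python sketch of that calculation; the helper names and the placeholder year are assumptions of the sketch, and the watchdog's duration is only approximate.

```python
import re
from datetime import datetime, timedelta

# The journal's short output format omits the year; an arbitrary
# placeholder year is added so strptime has a full date. Only the
# difference between timestamps is used, so its value does not matter.
PLACEHOLDER_YEAR = 2000


def parse_ts(line: str) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' timestamp of a journal line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(
        f"{PLACEHOLDER_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )


def mtp_probe_pid(line: str) -> int:
    """Extract the PID from 'mtp-probe[903]' or '[mtp-probe:903]'."""
    m = re.search(r"mtp-probe[\[:](\d+)\]", line)
    if m is None:
        raise ValueError("no mtp-probe PID in line")
    return int(m.group(1))


def stuck_seconds(line: str) -> int:
    """Extract N from a 'soft lockup - CPU#x stuck for Ns!' message."""
    m = re.search(r"stuck for (\d+)s!", line)
    if m is None:
        raise ValueError("not a soft-lockup line")
    return int(m.group(1))


probe = ('Sep 21 12:26:20 jlinn mtp-probe[903]: checking bus 1, device 3: '
         '"/sys/devices/pci0000:00/0000:00:14.0/usb1/1-7"')
lockup = ("Sep 21 12:26:44 jlinn kernel: watchdog: BUG: soft lockup - "
          "CPU#12 stuck for 22s! [mtp-probe:903]")

same_process = mtp_probe_pid(probe) == mtp_probe_pid(lockup)
gap = parse_ts(lockup) - parse_ts(probe)
stuck_since = parse_ts(lockup) - timedelta(seconds=stuck_seconds(lockup))

print("same mtp-probe process:", same_process)
print(f"lockup reported {gap.total_seconds():.0f}s after the probe message")
print(f"CPU#12 stuck since roughly {stuck_since:%H:%M:%S}")
```

It reports that both lines come from the same process, a 24 s gap between them, and an approximate stall start of 12:26:22. This supports, but does not prove, the idea that probing that USB device is where CPU#12 got stuck.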
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c10

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
              Flags|                                  |needinfo?(jcarricksmith@gmail.com)

--- Comment #10 from Takashi Iwai <tiwai@suse.com> ---
What kind of device is this? I guess almost the same slot should be assigned when booting with the old, working kernel. Please give the hwinfo output and the dmesg output from the working kernel.

And, if it's an external USB device, could you try booting once without it?
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c11

John Carrick Smith <jcarricksmith@gmail.com> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
              Flags|needinfo?(jcarricksmith@gmail.com)|

--- Comment #11 from John Carrick Smith <jcarricksmith@gmail.com> ---
Created attachment 861611
  --> https://bugzilla.suse.com/attachment.cgi?id=861611&action=edit
hwinfo and dmesg data

Requested hwinfo and dmesg data.

I removed external USB devices one by one, and even with none plugged in, the boot hung in the usual place. I have the journals from those if required.

John
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c12

--- Comment #12 from Takashi Iwai <tiwai@suse.com> ---
(In reply to John Carrick Smith from comment #11)
> I removed external USB devices one by one and even with none plugged in the boot hung in the usual place. I have the journals from those if required.

Yes, the log without USB devices is still helpful.
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c13

--- Comment #13 from Takashi Iwai <tiwai@suse.com> ---
And I see that usb1-7 is a Synaptics device, i.e. likely related to a built-in touchpad or similar, which can't be disconnected. That may explain it.

As a shot in the dark, could you try temporarily removing the file /usr/lib/udev/rules.d/69-libmtp.rules? It's probably better to move it somewhere else so that you can restore it later. You might need to recreate the initrd as well (just run mkinitrd).
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c14

--- Comment #14 from John Carrick Smith <jcarricksmith@gmail.com> ---
Created attachment 861652
  --> https://bugzilla.suse.com/attachment.cgi?id=861652&action=edit
Journalctl and mkinitrd output

Rather than removing 69-libmtp.rules from /usr/lib/udev/rules.d/, I followed the idea in the udev man page and linked /etc/udev/rules.d/69-libmtp.rules to /dev/null, which disables the file of the same name in /usr/lib/udev/rules.d/. It made no difference.

I included the output from /var/log/YaST2/mkinitrd.log in case there is anything in there that might help.

The file jrnl_no_USB is the journal from a hang when there were no devices plugged in to the computer.

John
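The approach in comment #14 relies on the override rules described in udev(7): rules files from all rules directories are processed together in lexical order of their names, a file in a higher-priority directory replaces one with the same name elsewhere, and a symlink in /etc pointing to /dev/null disables the packaged file entirely. A minimal Python sketch of that resolution, assuming only the documented behaviour; `effective_rules`, `is_masked` and `mask_rule` are hypothetical helpers, not part of any udev tool, and the relative order of the two /usr directories is an assumption.

```python
import os
from pathlib import Path
from typing import Iterable, List

# Highest priority first. /etc beats /run, which beats the /usr
# directories (udev(7)); the order of the two /usr entries is assumed.
DEFAULT_DIRS = [
    "/etc/udev/rules.d",
    "/run/udev/rules.d",
    "/usr/local/lib/udev/rules.d",
    "/usr/lib/udev/rules.d",
]


def is_masked(path: Path) -> bool:
    """A rules file is disabled if it is a symlink pointing to /dev/null."""
    return path.is_symlink() and os.readlink(path) == "/dev/null"


def effective_rules(dirs: Iterable[str] = DEFAULT_DIRS) -> List[Path]:
    """Return the rules files udev would apply, in processing order."""
    chosen = {}
    for d in dirs:
        directory = Path(d)
        if not directory.is_dir():
            continue
        for entry in directory.iterdir():
            # Only *.rules files count; the first (highest-priority)
            # directory providing a given file name wins.
            if entry.suffix == ".rules" and entry.name not in chosen:
                chosen[entry.name] = entry
    # All files are processed together in lexical order of their names,
    # regardless of directory; masked names are skipped.
    return [chosen[name] for name in sorted(chosen) if not is_masked(chosen[name])]


def mask_rule(name: str, etc_dir: str = "/etc/udev/rules.d") -> Path:
    """Disable a packaged rules file the way comment #14 did (needs root)."""
    link = Path(etc_dir) / name
    link.symlink_to("/dev/null")
    return link


if __name__ == "__main__":
    for rule in effective_rules():
        print(rule)
```

As comment #13 mentioned, the initrd may also need to be rebuilt with mkinitrd after such a change, presumably because the initrd carries its own copy of the udev rules.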
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c16

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
              Flags|                                  |needinfo?(jcarricksmith@gmail.com)

--- Comment #16 from Takashi Iwai <tiwai@suse.com> ---
Hmm, it's still not clear what went wrong, unfortunately. Is there really no crash stack trace?

Anyway, I built a Leap 15.4 kernel with KASAN enabled to check for memory corruption. A test kernel is available in the OBS home:tiwai:kernel:sle15-sp4-kasan repo:
  http://download.opensuse.org/repositories/home:/tiwai:/kernel:/sle15-sp4-kas...

Could you try to download kernel-default.rpm, kernel-default-extra.rpm and kernel-default-optional.rpm from there and install them? You might need to pass the --oldpackage option to zypper install.

Note that KMPs won't work with this kernel; e.g. if you have the Nvidia driver, it won't work. This kernel is also quite heavy and performance will drop significantly, so it's really only for debugging.

Please check the boot with this kernel and see whether it catches something. It won't "fix" anything, but it may show something earlier.
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c17

John Carrick Smith <jcarricksmith@gmail.com> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
              Flags|needinfo?(jcarricksmith@gmail.com)|

--- Comment #17 from John Carrick Smith <jcarricksmith@gmail.com> ---
Created attachment 861741
  --> https://bugzilla.suse.com/attachment.cgi?id=861741&action=edit
journal output from KASAN enabled kernel

Hello Takashi,

I downloaded the KASAN-enabled kernel and installed it as instructed. I needed to install a MOK. The kernel booted to a command prompt login. I have attached the journal to this report.

This was not what I was expecting! I logged in as root successfully.

Thank you for your help
John
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c18

--- Comment #18 from Takashi Iwai <tiwai@suse.com> ---
Hrm, it hasn't caught anything so far. And I noticed that it's a system with the Nvidia driver. Can you reproduce the problem without the Nvidia driver?
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c19

--- Comment #19 from John Carrick Smith <jcarricksmith@gmail.com> ---
The problem still occurs with the Intel driver. I used 'prime-select boot intel' as root and it still hung on boot. When I switched back to the working OS, it didn't give me a login, but ctrl-alt-F1 and a CLI login solved the problem.

John
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c20

--- Comment #20 from Takashi Iwai <tiwai@suse.com> ---
It's not about which driver is being used, but whether the Nvidia driver is loaded or not. Once the driver is loaded and bound, it may break things even if it isn't actively used.

So, if the problem occurs with Intel and we want to exclude Nvidia, let's try to reproduce the bug as cleanly as possible, i.e. without the Nvidia driver loaded at all.
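Comment #20 turns on whether the Nvidia kernel module is loaded at all, not on which GPU prime-select makes active. A minimal Python sketch for checking that, assuming the proprietary driver's modules all have names starting with "nvidia" (nvidia, nvidia_modeset, nvidia_drm, nvidia_uvm). It reads /proc/modules, the same file lsmod reads, where the first field of each line is the module name; `loaded_modules` and `nvidia_modules` are hypothetical helper names.

```python
from pathlib import Path
from typing import List, Set


def loaded_modules(text: str) -> Set[str]:
    """Module names from /proc/modules content (first field of each line)."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def nvidia_modules(text: str) -> List[str]:
    """Loaded modules whose names start with 'nvidia' (assumed naming)."""
    return sorted(m for m in loaded_modules(text) if m.startswith("nvidia"))


if __name__ == "__main__":
    try:
        found = nvidia_modules(Path("/proc/modules").read_text())
    except OSError:
        found = []  # not on Linux, or /proc unavailable
    print("Nvidia modules loaded:", ", ".join(found) if found else "none")
```

Running this on a boot that is meant to be "Intel only" would show whether the Nvidia driver was still loaded in the background, which is the situation comment #20 wants to rule out.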
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c21

--- Comment #21 from John Carrick Smith <jcarricksmith@gmail.com> ---
Would installing a debug kernel help? I don't really understand this; I just wondered.

John
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c25

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                           |Added
----------------------------------------------------------------------------
             Status|NEW                               |RESOLVED
         Resolution|---                               |FIXED

--- Comment #25 from Takashi Iwai <tiwai@suse.com> ---
OK, then I'm closing this bug now. Feel free to reopen it if you encounter the problem with the upcoming kernel release.
https://bugzilla.suse.com/show_bug.cgi?id=1203539#c26

--- Comment #26 from John Carrick Smith <jcarricksmith@gmail.com> ---
Now running with 5.14.21-150400.24.28-default for several days - all OK.

Many thanks.
John