[Bug 1190286] New: Kernel 5.14 freezes in conjunction with Dell Docking Station
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286

Bug ID: 1190286
Summary: Kernel 5.14 freezes in conjunction with Dell Docking Station
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: felix.niederwanger@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 852364
  --> http://bugzilla.opensuse.org/attachment.cgi?id=852364&action=edit
Screenshot of frozen tty1

After the recent update to the 5.14 kernel, my Dell Latitude 7290 laptop freezes about 30 seconds after system startup. I could log in via the display manager and even open a terminal, and then the system froze. I could still move the mouse, but the terminal and everything else was stuck. I could switch to tty1 and tried to log in as root there, but it got stuck after the password prompt (see attached screenshot). After about 5 minutes I had to forcefully shut down the system.

I did a full system rollback to the previous snapshot (including kernel 5.13), then installed all updates except the new kernel and rebooted. The system behaved nicely. Then I updated the kernel as well, rebooted the laptop, and it froze again.

The issue only happens when the laptop is connected to the Dell docking station. When disconnected from the dock, it behaves nicely. When the system is stuck, disconnecting it from the docking station and waiting a bit also helps to unfreeze the system. After reconnecting, the system remains responsive and does not freeze again (I'm writing on it right now). I also restarted the docking station, but this had no effect at all: the laptop was still freezing.

--
You are receiving this mail because:
You are the assignee for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c1
Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c2
--- Comment #2 from Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c3
--- Comment #3 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c4
--- Comment #4 from Felix Niederwanger
# dmesg | grep i9
[ 3.580508] i915 0000:00:02.0: vgaarb: deactivate vga console
[ 3.582892] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 3.583677] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[ 3.599922] [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
[ 3.631395] fbcon: i915 (fb0) is primary device
[ 4.772386] i915 0000:00:02.0: [drm] fb0: i915 frame buffer device
[ 12.011087] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[ 12.288226] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[ 29.349099] i915 0000:00:02.0: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[ 34.950457] i915 0000:00:02.0: [drm] *ERROR* mstb 000000004491b89b port 1: DPCD read on addr 0x4b0 for 1 bytes NAKed
[ 34.959839] i915 0000:00:02.0: [drm] *ERROR* mstb 000000004491b89b port 2: DPCD read on addr 0x4b0 for 1 bytes NAKed
[ 34.970051] i915 0000:00:02.0: [drm] *ERROR* mstb 000000004491b89b port 3: DPCD read on addr 0x4b0 for 1 bytes NAKed
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c5
--- Comment #5 from Felix Niederwanger
[ 550.712988] nvme nvme0: I/O 65 QID 7 timeout, aborting
[ 550.747484] nvme nvme0: Abort status: 0x0
[ 643.897054] nvme nvme0: I/O 101 QID 7 timeout, aborting
[ 643.931494] nvme nvme0: Abort status: 0x0
[ 703.033212] nvme nvme0: I/O 81 QID 7 timeout, aborting
[ 703.041825] nvme nvme0: Abort status: 0x0
[ 763.709540] nvme nvme0: I/O 99 QID 7 timeout, aborting
[ 763.748734] nvme nvme0: Abort status: 0x0
[ 797.241362] nvme nvme0: I/O 109 QID 7 timeout, completion polled
[ 839.225551] nvme nvme0: I/O 102 QID 7 timeout, aborting
[ 839.260030] nvme nvme0: Abort status: 0x0
[ 884.537508] nvme nvme0: I/O 66 QID 7 timeout, aborting
[ 884.572305] nvme nvme0: Abort status: 0x0
[ 915.001594] nvme nvme0: I/O 122 QID 7 timeout, aborting
[ 915.036194] nvme nvme0: Abort status: 0x0
[ 960.313707] nvme nvme0: I/O 148 QID 7 timeout, aborting
Here, the laptop froze occasionally for a couple of seconds before it resumed running. The last time, it was completely frozen for at least 5 minutes at the user login on VT1 (tty1); see the screenshot above. When rebooting I also got the following error message, which is consistent with NVMe issues:
systemd-shutdown[1]: Synching filesystem and block devices - timed out, issuing SIGKILL to PID 10499
This is a Dell Latitude 7290 with a WD Black 1 TB NVMe disk (WDS100T3X0C-00SJG0). These issues only arose after the update to the 5.14 kernel. I can only suspect that the docking station was just a coincidence; however, it still puzzles me why the system unfroze itself after disconnecting the docking station. I rebooted the laptop and so far it behaves fine; no NVMe timeout issues are present in the dmesg log. So it also does not happen on every boot.
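To compare boots with and without the problem, the timeout messages can be pulled out of a saved dmesg log and summarized. The following is only a minimal sketch: the function names `parse_timeouts` and `gaps` and the regular expression are illustrative assumptions, and `SAMPLE` is a short excerpt of the log quoted above.

```python
import re
from collections import Counter

# Matches lines such as:
#   [ 550.712988] nvme nvme0: I/O 65 QID 7 timeout, aborting
# The pattern is an assumption based on the messages quoted in this report.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<dev>nvme\d+): "
    r"I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, (?P<action>[a-z ]+)"
)

# Excerpt of the log from this comment.
SAMPLE = """\
[ 550.712988] nvme nvme0: I/O 65 QID 7 timeout, aborting
[ 550.747484] nvme nvme0: Abort status: 0x0
[ 643.897054] nvme nvme0: I/O 101 QID 7 timeout, aborting
[ 643.931494] nvme nvme0: Abort status: 0x0
[ 703.033212] nvme nvme0: I/O 81 QID 7 timeout, aborting
[ 703.041825] nvme nvme0: Abort status: 0x0
"""


def parse_timeouts(log_text):
    """Return a list of (timestamp, device, qid, action) tuples."""
    events = []
    for m in TIMEOUT_RE.finditer(log_text):
        events.append(
            (float(m["ts"]), m["dev"], int(m["qid"]), m["action"].strip())
        )
    return events


def gaps(events):
    """Return the seconds elapsed between consecutive timeout events."""
    times = [ts for ts, *_ in events]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    events = parse_timeouts(SAMPLE)
    print("timeouts per queue:", dict(Counter(qid for _, _, qid, _ in events)))
    print("seconds between timeouts:", gaps(events))
```

Feeding it the full output of `dmesg` from a failing boot would show whether the timeouts stay on a single queue and how far apart they are.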
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c6
--- Comment #6 from Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c7
Daniel Wagner
> This is a Dell Latitude 7290 with a WD Black 1 TB NVMe disk (WDS100T3X0C-00SJG0). These issues only arose after the update to the 5.14 kernel.
Regarding NVMe: the nvme-pci driver didn't get any changes which could explain this regression from v5.13 to v5.14. Just a wild guess: there is one commit in the nvme core code which could potentially be the source of the problem:

ebd8a93aa4f5 ("nvme: extend and modify the APST configuration algorithm")

All other commits can be ruled out as the trigger for this kind of regression.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c8
--- Comment #8 from Felix Niederwanger
Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c9
--- Comment #9 from Daniel Wagner
> The issue seems to be gone after updating to 5.14.2-1-default.
The upstream changes from 5.14 to 5.14.2 don't include any patches which touch the NVMe or PCI subsystem. There are also no platform changes. I haven't checked whether there was a TW change, but the stable update from upstream is unlikely to ship anything which could explain why it suddenly works again.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c10
--- Comment #10 from Felix Niederwanger
[ 663.178681] nvme nvme0: I/O 316 QID 2 timeout, aborting
[ 663.212948] nvme nvme0: Abort status: 0x0
[ 695.946443] nvme nvme0: I/O 314 QID 2 timeout, aborting
[ 695.946654] nvme nvme0: Abort status: 0x0
[ 735.882658] nvme nvme0: I/O 284 QID 2 timeout, aborting
[ 735.891212] nvme nvme0: Abort status: 0x0
[ 767.630652] nvme nvme0: I/O 285 QID 2 timeout, aborting
[ 767.665078] nvme nvme0: Abort status: 0x0
[ 873.866909] nvme nvme0: I/O 288 QID 2 timeout, aborting
[ 873.901802] nvme nvme0: Abort status: 0x0
[ 904.074947] nvme nvme0: I/O 289 QID 2 timeout, aborting
[ 904.109535] nvme nvme0: Abort status: 0x0
[ 934.283056] nvme nvme0: I/O 290 QID 2 timeout, aborting
[ 934.291613] nvme nvme0: Abort status: 0x0
[ 964.491053] nvme nvme0: I/O 291 QID 2 timeout, aborting
[ 964.525606] nvme nvme0: Abort status: 0x0
[ 994.699056] nvme nvme0: I/O 292 QID 2 timeout, aborting
[ 994.733555] nvme nvme0: Abort status: 0x0
[ 1189.007101] nvme nvme0: I/O 274 QID 2 timeout, aborting
[ 1189.015735] nvme nvme0: Abort status: 0x0
[ 1228.427041] nvme nvme0: I/O 279 QID 2 timeout, aborting
[ 1228.435632] nvme nvme0: Abort status: 0x0
[ 1258.507427] nvme nvme0: I/O 290 QID 2 timeout, aborting
[ 1258.515919] nvme nvme0: Abort status: 0x0
[ 1288.587228] nvme nvme0: I/O 299 QID 2 timeout, aborting
[ 1288.595821] nvme nvme0: Abort status: 0x0
[ 1318.795237] nvme nvme0: I/O 318 QID 2 timeout, completion polled
[ 1449.867351] nvme nvme0: I/O 313 QID 2 timeout, aborting
[ 1449.901707] nvme nvme0: Abort status: 0x0
[ 1479.883443] nvme nvme0: I/O 314 QID 2 timeout, aborting
[ 1479.918007] nvme nvme0: Abort status: 0x0
After rebooting, the system was behaving nicely again. Does anyone have an idea what's going on here? I have not touched the hardware, and these issues only occurred after updating TW to 5.14. Could the NVMe contact be loose? Should I disassemble the laptop and unplug and replug the NVMe drive just to see if that makes a difference? I'm absolutely puzzled here. I am also attaching the dmesg log of the failure state for completeness.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c11
--- Comment #11 from Daniel Wagner
> [ 663.178681] nvme nvme0: I/O 316 QID 2 timeout, aborting
> [ 663.212948] nvme nvme0: Abort status: 0x0
> [...]
> After rebooting, the system was behaving nicely again. Does anyone have an idea what's going on here? I have not touched the hardware, and these issues only occurred after updating TW to 5.14.
The NVMe subsystem issues I/O and the hardware doesn't respond in a timely fashion. There is nothing obviously wrong in the logs. Also, the aborts do not all happen at once, so it's not that something stops working completely.
> Could the nvme contact be loose? Should I disassemble the laptop and unplug and replug the nvme just to see if that makes a difference? I'm absolutely puzzled here.
Me too. One thing you could check is whether the drive is overheating (use sensors) before you start pulling things apart. Also check if there is a firmware update for the drive (fwupdate).
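For logging the drive temperature over time instead of checking it once, the value can be read from the hwmon sysfs interface. This is a minimal sketch under two assumptions that are not confirmed anywhere in this thread: that the NVMe sensor registers with the hwmon name `nvme`, and that `temp1_input` holds the composite temperature in millidegrees Celsius. The function name `read_nvme_temps` is made up for illustration.

```python
from pathlib import Path


def read_nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon directory name: temperature in degrees Celsius}.

    Assumes NVMe sensors expose name == "nvme" and that temp1_input is
    the composite temperature in millidegrees Celsius.
    """
    temps = {}
    for hwmon in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = hwmon / "name"
        temp_file = hwmon / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        if name_file.read_text().strip() != "nvme":
            continue  # skip CPU, battery and other sensors
        temps[hwmon.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps


if __name__ == "__main__":
    for device, celsius in read_nvme_temps().items():
        print(f"{device}: {celsius:.1f} C")
```

Running it in a loop while the timeouts occur would show whether they coincide with a temperature rise.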
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c12
--- Comment #12 from Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c13
--- Comment #13 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c14
--- Comment #14 from Felix Niederwanger
# sensors
...
nvme-pci-3c00
Adapter: PCI adapter
Composite: +24.9°C (low = -5.2°C, high = +83.8°C) (crit = +87.8°C)
Still, I got the same error messages on the first boot. After a reboot, the system behaves nicely again. I checked for available firmware updates, and according to fwupdmgr there are none. I remember doing the last firmware update within the last few months, so this is plausible:
# fwupdmgr get-devices
...
Devices that have been updated successfully:
 • System Firmware (1.19.0 → 1.20.0)

# fwupdmgr get-updates
Devices with no available firmware updates:
 • Dell WD15
 • Dell WD15 Passive Cable
 • Dell WD15 Port Controller 1
 • TPM 2.0
 • UEFI dbx
 • VMM3332 inside Dell WD15/TB16/TB18 wired Dock
 • WDS100T3X0C-00SJG0
Devices with the latest available firmware version:
 • System Firmware
 • Unifying Receiver
No updates available for remaining devices
Guess the next step is the screwdriver?
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c15
--- Comment #15 from Daniel Wagner
> • WDS100T3X0C-00SJG0
It seems I have almost the same device, 'WDS100T3XHC-00SJG0'. Not sure if the '0C' vs 'HC' makes a big difference. The machine with the drive still runs 5.13.12-2-default, so I'll update it and see if I can reproduce the problem.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c16
--- Comment #16 from Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c17
--- Comment #17 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c18
--- Comment #18 from Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c19
--- Comment #19 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c20
--- Comment #20 from Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c21
Hannes Reinecke
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c22
--- Comment #22 from Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c23
--- Comment #23 from Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c24
Felix Niederwanger
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c25
--- Comment #25 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c26
--- Comment #26 from Felix Niederwanger
# /etc/default/grub
nvme_core.default_ps_max_latency_us=5500
If this doesn't help, the next step is to return the SSD, as the hardware might be faulty.

[1] https://forums.debian.net/viewtopic.php?t=146747
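After changing the boot configuration and rebooting, it is worth confirming that the parameter actually reached the kernel. The sketch below is one way to check, under the assumption that `nvme_core` exposes the value at `/sys/module/nvme_core/parameters/default_ps_max_latency_us`. The helper names `cmdline_value` and `effective_value` are hypothetical.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"


def cmdline_value(cmdline, param=PARAM):
    """Return the last value given for `param` on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == param:
            value = val  # later occurrences override earlier ones
    return value


def effective_value(sysfs_root="/sys/module"):
    """Return the value the running kernel reports, or None if unavailable.

    The sysfs location is an assumption about where the module parameter
    is exposed.
    """
    path = Path(sysfs_root) / "nvme_core" / "parameters" / "default_ps_max_latency_us"
    return path.read_text().strip() if path.is_file() else None


if __name__ == "__main__":
    boot_cmdline = Path("/proc/cmdline").read_text()
    print("requested on command line:", cmdline_value(boot_cmdline))
    print("reported by running kernel:", effective_value())
```

If the two values differ, or the first one is `None`, the bootloader configuration was most likely not regenerated after the file was edited.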
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c27
--- Comment #27 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1190286#c28
--- Comment #28 from Felix Niederwanger