TW 20250216 Kernel 6.13.2 USB issues

In January I was running TW 20250106, which used kernel 6.12.8, and I ran into multiple USB corruption issues. At first I thought the USB drive might be failing, but then a 2nd USB drive got corrupted as well. Still on TW 20250106, I booted kernel 6.11.8 instead of 6.12.8 and all USB problems went away; both drives worked fine. From my research I see that lots of people were reporting USB problems with kernel 6.12.8.

For the rest of Jan and up through 02/17/2025 I continued to run TW 20250106 using kernel 6.11.8 ( instead of 6.12.8 ), using multiple USB devices all month long with NO problems.

Then I updated to TW 20250216 using kernel 6.13.2. I see that up through kernel 6.13-rc4 there were still USB problems, but I figured I would try the released 6.13.2 kernel with TW 20250216. It "had" been working fine until earlier today, when I hit a kernel BUG while accessing a USB drive ( a different drive than the ones from last month ).

Here is the journal:

Feb 20 15:39:22 kernel: BUG: unable to handle page fault for address: ffffffad9c1c7800
Feb 20 15:39:22 kernel: #PF: supervisor instruction fetch in kernel mode
Feb 20 15:39:22 kernel: #PF: error_code(0x0010) - not-present page
Feb 20 15:39:22 kernel: PGD 108e03d067 P4D 108e03d067 PUD 0
Feb 20 15:39:22 kernel: Oops: Oops: 0010 [#1] PREEMPT SMP NOPTI
Feb 20 15:39:22 kernel: CPU: 6 UID: 1000 PID: 69034 Comm: python3 Not tainted 6.13.2-1-default #1 openSUSE Tumbleweed cdfe16bec344147391efeacaa0fc0377c0d20a85
Feb 20 15:39:22 kernel: Hardware name: ASUS System Product Name/ROG MAXIMUS Z790 FORMULA, BIOS 1202 04/18/2024
Feb 20 15:39:22 kernel: RIP: 0010:0xffffffad9c1c7800
Feb 20 15:39:22 kernel: Code: Unable to access opcode bytes at 0xffffffad9c1c77d6.
Feb 20 15:39:22 kernel: RSP: 0018:ffffbe6d8c62b8af EFLAGS: 00010246
Feb 20 15:39:22 kernel: RAX: 0000000000000000 RBX: 0000000000002400 RCX: 0000000000000006
Feb 20 15:39:22 kernel: RDX: ffff96d350380000 RSI: ffff96d3e9a39048 RDI: ffff96d350380000
Feb 20 15:39:22 kernel: RBP: 00000000000083ff R08: ffff96d80dc4f600 R09: 0000000000000000
Feb 20 15:39:22 kernel: R10: 0000000000000001 R11: ffff96d3e9a39000 R12: 000000fffbffff00
Feb 20 15:39:22 kernel: R13: ffe70d318e790000 R14: ffffbe6d8c62b978 R15: ffff96d29fe022a0
Feb 20 15:39:22 kernel: FS: 00007fddbafab580(0000) GS:ffff96e1feb00000(0000) knlGS:0000000000000000
Feb 20 15:39:22 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 15:39:22 kernel: CR2: ffffffad9c1c77d6 CR3: 00000009a865a001 CR4: 0000000000f72ef0
Feb 20 15:39:22 kernel: PKRU: 55555554
Feb 20 15:39:22 kernel: Call Trace:
Feb 20 15:39:22 kernel: <TASK>
Feb 20 15:39:22 kernel: ? __die_body.cold+0x19/0x26
Feb 20 15:39:22 kernel: ? page_fault_oops+0x132/0x2a0
Feb 20 15:39:22 kernel: ? exc_page_fault+0x160/0x170
Feb 20 15:39:22 kernel: ? asm_exc_page_fault+0x26/0x30
Feb 20 15:39:22 kernel: ? page_cache_ra_unbounded+0x198/0x200
Feb 20 15:39:22 kernel: ? filemap_get_pages+0x565/0x6f0
Feb 20 15:39:22 kernel: ? filemap_read+0xec/0x370
Feb 20 15:39:22 kernel: ? filemap_read+0x33c/0x370
Feb 20 15:39:22 kernel: ? aa_file_perm+0x122/0x4e0
Feb 20 15:39:22 kernel: ? apparmor_file_permission+0x75/0x190
Feb 20 15:39:22 kernel: ? vfs_read+0x25f/0x330
Feb 20 15:39:22 kernel: ? ksys_read+0x64/0xe0
Feb 20 15:39:22 kernel: ? do_syscall_64+0x82/0x160
Feb 20 15:39:22 kernel: ? do_syscall_64+0x8e/0x160
Feb 20 15:39:22 kernel: ? syscall_exit_to_user_mode+0x37/0x1d0
Feb 20 15:39:22 kernel: ? do_syscall_64+0x8e/0x160
Feb 20 15:39:22 kernel: ? do_syscall_64+0x8e/0x160
Feb 20 15:39:22 kernel: ? do_pselect.constprop.0+0xd7/0x170
Feb 20 15:39:22 kernel: ? syscall_exit_to_user_mode+0x37/0x1d0
Feb 20 15:39:22 kernel: ? do_syscall_64+0x8e/0x160
Feb 20 15:39:22 kernel: ? syscall_exit_to_user_mode+0x37/0x1d0
Feb 20 15:39:22 kernel: ? do_syscall_64+0x8e/0x160
Feb 20 15:39:22 kernel: ? do_pselect.constprop.0+0xd7/0x170
Feb 20 15:39:22 kernel: ? syscall_exit_to_user_mode+0x37/0x1d0
Feb 20 15:39:22 kernel: ? do_syscall_64+0x8e/0x160
Feb 20 15:39:22 kernel: ? switch_fpu_return+0x4e/0xd0
Feb 20 15:39:22 kernel: ? arch_exit_to_user_mode_prepare.isra.0+0x79/0x90
Feb 20 15:39:22 kernel: ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
Feb 20 15:39:22 kernel: </TASK>
Feb 20 15:39:22 kernel: Modules linked in: exfat vhost_net tun vhost vhost_iotlb macvtap macvlan tap nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp>
Feb 20 15:39:22 kernel: snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_acpi_intel_match snd_soc_acpi_intel_sdca_quirks soundwire_generic_allocation snd_soc_acpi soundwi>
Feb 20 15:39:22 kernel: tiny_power_button mc kvm joydev pcspkr thunderbolt wmi_bmof libphy soundcore i2c_mux spi_intel serial_multi_instantiate rfkill intel_vsec pmt_clas>
Feb 20 15:39:22 kernel: CR2: ffffffad9c1c7800
Feb 20 15:39:22 kernel: ---[ end trace 0000000000000000 ]---
Feb 20 15:39:22 kernel: RIP: 0010:0xffffffad9c1c7800
Feb 20 15:39:22 kernel: Code: Unable to access opcode bytes at 0xffffffad9c1c77d6.
Feb 20 15:39:22 kernel: RSP: 0018:ffffbe6d8c62b8af EFLAGS: 00010246
Feb 20 15:39:22 kernel: RAX: 0000000000000000 RBX: 0000000000002400 RCX: 0000000000000006
Feb 20 15:39:22 kernel: RDX: ffff96d350380000 RSI: ffff96d3e9a39048 RDI: ffff96d350380000
Feb 20 15:39:22 kernel: RBP: 00000000000083ff R08: ffff96d80dc4f600 R09: 0000000000000000
Feb 20 15:39:22 kernel: R10: 0000000000000001 R11: ffff96d3e9a39000 R12: 000000fffbffff00
Feb 20 15:39:22 kernel: R13: ffe70d318e790000 R14: ffffbe6d8c62b978 R15: ffff96d29fe022a0
Feb 20 15:39:22 kernel: FS: 00007fddbafab580(0000) GS:ffff96e1feb00000(0000) knlGS:0000000000000000
Feb 20 15:39:22 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 15:39:22 kernel: CR2: ffffffad9c1c77d6 CR3: 00000009a865a001 CR4: 0000000000f72ef0
Feb 20 15:39:22 kernel: PKRU: 55555554
Feb 20 15:39:22 kernel: note: python3[69034] exited with irqs disabled

After that occurs the system seems to still be running fine; however, the kernel is tainted with a value of 128, which says the kernel experienced a death event ( RIP ). This is very much like the issue I saw with kernel 6.12.8 that did not occur with kernel 6.11.8.

Anybody else seeing anything like this ?

--
Regards,
Joe
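For readers who want to check what a taint value such as 128 means, here is a small Python sketch that decodes the bitmask into the single-letter flags the kernel uses. It is an illustration only, not something from the thread: the function name decode_taint is made up, and the table covers only the lower bits as listed in the kernel's "Tainted kernels" documentation. On a running system the current value can be read from /proc/sys/kernel/tainted.

```python
# Taint flags by bit position (bit 0 first). Only the lower bits are listed;
# anything above them is reported as an unnamed bit.
TAINT_FLAGS = [
    ("P", "proprietary module was loaded"),
    ("F", "module was force loaded"),
    ("S", "kernel running on an out of specification system"),
    ("R", "module was force unloaded"),
    ("M", "processor reported a Machine Check Exception"),
    ("B", "bad page referenced or some unexpected page flags"),
    ("U", "taint requested by userspace application"),
    ("D", "kernel died recently, i.e. there was an OOPS or BUG"),
    ("A", "ACPI table overridden by user"),
    ("W", "kernel issued warning"),
]


def decode_taint(value):
    """Return a list of (letter, description) pairs for every set bit."""
    result = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            if bit < len(TAINT_FLAGS):
                result.append(TAINT_FLAGS[bit])
            else:
                result.append(("?", "bit %d (not in this table)" % bit))
        bit += 1
    return result


if __name__ == "__main__":
    for letter, meaning in decode_taint(128):
        print(letter, "-", meaning)
```

A value of 128 is bit 7 alone, which is the flag set after an oops or BUG like the one in the journal above.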

I've had issues in my TW & Leap 15.6 installs with kernels above 6.12.6 regarding "revival from suspend" . . . which have continued into the 6.13.2 range. I have both set to have 4 kernels available, but in the last couple of weeks there have been a bunch of kernels added. I kept thinking, "maybe the new kernel will fix the problem," and then it didn't. I couldn't get back to the 6.12.6 option that did work fine . . . so far my TW install still has problems with reviving the display . . . in the 6.12.8 - 6.13.2 kernels . . . .

I upgraded my Leap 15.6 to 16 via console . . . and there the base kernel is 6.4, so I kept that one and it is working fine for getting the display back after suspend. This time the newer kernels have not been the solution, as they have in the past.

I have not played around with USB flash drives too much. I had the Leap 16 installer cued up on a flash drive in case the console upgrade blew up on me; it shows as mounting through a number of my distros . . . no problems have shown up so far.

On 2/20/25 8:59 PM, Fritz Hudnut wrote:
Hi Fritz,

You can find lots of older kernels in https://download.opensuse.org/repositories/home:/tiwai:/kernel:/

You can also find kernels from the last 20 TW builds in https://download.opensuse.org/history/

--
Regards,
Joe

Joe Salmeri wrote:
Joe: Thanks for the links to the older kernels, the one I had for 6.12.6 was "not found" a few days back. And, I did expand my "multiversion" settings to allow more kernels, but did not know I could specify an exact kernel . . . obviously it has to be installed first, I'll check into it. F

On 2/21/25 11:03 AM, Fritz Hudnut wrote:
Hi Fritz,

Previously I would regularly update multiversion with a kernel to keep, but then I switched to keeping 4 kernels so that I didn't have to regularly update multiversion.

multiversion.kernels = latest,latest-1,latest-2,latest-3,running

That has worked fine for quite a while, but lately it seems like the kernels have come faster. After hitting the USB issues with 6.12.8 and finding that 6.11.8 worked fine, I wanted to make sure that 6.11.8 didn't go away anytime soon, so I am now keeping 4 + 6.11.8.

multiversion.kernels = latest,latest-1,latest-2,latest-3,running,6.11.8-1

Another option to try would be to install kernel-longterm, which is now at 6.12.13-1.1 ( as of TW 20250216 ). I do not know if that fixes the USB issues as I have not tried it.

--
Regards,
Joe
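As an illustration of the change described above, the following Python sketch takes a multiversion.kernels line and appends a pinned kernel version if it is not already listed. It only works on a string; it does not touch any file. The function name pin_kernel is hypothetical, and the assumption that the setting lives in /etc/zypp/zypp.conf comes from general zypper usage, not from this thread.

```python
def pin_kernel(conf_line, version):
    """Return the multiversion.kernels line with `version` added once."""
    key, sep, value = conf_line.partition("=")
    if not sep or key.strip() != "multiversion.kernels":
        raise ValueError("not a multiversion.kernels line: %r" % conf_line)
    # Split the comma-separated list and drop empty entries and stray spaces.
    entries = [entry.strip() for entry in value.split(",") if entry.strip()]
    if version not in entries:
        entries.append(version)
    return "multiversion.kernels = " + ",".join(entries)


if __name__ == "__main__":
    current = "multiversion.kernels = latest,latest-1,latest-2,latest-3,running"
    print(pin_kernel(current, "6.11.8-1"))
```

Calling it a second time with the same version leaves the line unchanged, so it is safe to run repeatedly.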

[QUOTE]
That has worked fine for quite a while but lately it seems like the kernels have come faster and after hitting the USB issues with 6.12.8 and finding that 6.11.8 worked fine I want to make sure that 6.11.8 didn't go away anytime soon so I now am keeping 4 + 6.11.8

multiversion.kernels = latest,latest-1,latest-2,latest-3,running,6.11.8-1
[/QUOTE]

Joe: Exactly, there has been a "kernel storm" of late, which clicked out my last working kernel, the 6.12.6 iteration . . . faster than I could add more "latests" into the zypp mixer . . . . It's ALL for kicks, but then I do like basic function in a machine . . . .

On 2/20/25 10:13 PM, Joe Salmeri wrote:
Anybody else seeing anything like this ?
I recently had USB problems on a Leap laptop. I had to rmmod xhci_hcd and modprobe it again to get my external keyboard+mouse back.

Losing a USB connection to a storage device can cause filesystem corruption (though with ext4 journaling it should not be bad).

If it happens again, I'll collect as many logs as I can.

Ciao
Bernhard M.

On 2025-02-21 08:26, Bernhard M. Wiedemann via openSUSE Factory wrote:
I had the same issue, as well as a possible WOL issue.

I created this incident: https://bugzilla.opensuse.org/show_bug.cgi?id=1236992

and was asked to create an upstream kernel incident on the WOL problem (since I can reproduce it at will): https://bugzilla.kernel.org/show_bug.cgi?id=219782

-pablo

On 2/21/25 8:26 AM, Bernhard M. Wiedemann via openSUSE Factory wrote:
Hi Bernhard,

Yes, I saw the corruption issues occur with 6.12.8.

With 6.13.2 I get these errors, which seem to indicate a kernel bug trying to access a page that is no longer available.

Feb 20 15:39:22 kernel: BUG: unable to handle page fault for address: ffffffad9c1c7800
Feb 20 15:39:22 kernel: #PF: supervisor instruction fetch in kernel mode
Feb 20 15:39:22 kernel: #PF: error_code(0x0010) - not-present page

So far 6.13.2 has not corrupted anything like 6.12.8 did.

--
Regards,
Joe
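The two #PF lines in the journal can be cross-checked against the x86 page-fault error code. The Python sketch below decodes the low five bits of that code; the function name decode_pf_error_code is made up for illustration, and only the architecturally defined low bits are handled.

```python
def decode_pf_error_code(code):
    """Decode the low bits of an x86 page-fault error code."""
    return {
        # Bit 0: set = protection violation, clear = page not present.
        "present": bool(code & 0x01),
        # Bit 1: set = write access, clear = read access.
        "write": bool(code & 0x02),
        # Bit 2: set = fault raised in user mode, clear = supervisor mode.
        "user": bool(code & 0x04),
        # Bit 3: set = reserved bit found set in a page-table entry.
        "reserved_bit": bool(code & 0x08),
        # Bit 4: set = the faulting access was an instruction fetch.
        "instruction_fetch": bool(code & 0x10),
    }


if __name__ == "__main__":
    for name, value in decode_pf_error_code(0x0010).items():
        print(name, value)
```

For 0x0010 only the instruction-fetch bit is set, which is consistent with the journal text: a supervisor-mode instruction fetch from a page that is not present.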

On 2/20/25 4:13 PM, Joe Salmeri wrote:
Just had this happen again.

A common denominator seems to be that it occurs when a python program is running which is reading a bunch of files from a USB drive.

It does not happen every time, though. After it happens the kernel is tainted with a value of 128, so I reboot. After rebooting, I can plug in the same drive and run the same python program to read the data and it completes successfully.

Unlike kernel 6.12.8, which had major USB problems including drive corruption, I have only had this not-present-page issue with kernel 6.13.2.

--
Regards,
Joe

On 24. 02. 25, 20:12, Joe Salmeri wrote:
Your instruction pointer is corrupted. I suggest you run memtest. If that's fine, I would run a kernel with KASAN and retry the scenario. I believe Takashi builds a KASAN-enabled kernel: https://build.opensuse.org/project/monitor/home:tiwai:kernel:stable-kasan

Note it runs slooowly.

--
js
suse labs

On 2/24/25 3:25 PM, Jiri Slaby wrote:
Thanks, Jiri.

I have run memtest on this system in the past for days with no issues. I even ran the extended and more intensive tests, but it has been a while since I ran it.

If it is a memory issue, it seems very odd to me that it only happens with the newer kernels. It has never happened with kernel 6.11.8, both back when it was the current kernel available and also now when I boot 6.11.8 with the new TW builds.

Doesn't that seem ODD to you ?

--
Regards,
Joe

On 2/24/25 3:25 PM, Jiri Slaby wrote:
The system is used by multiple users and is on 24x7, so rebooting or making it run really slow is not a great option.

This morning I tried a different test. The system has 64 GB of memory. Using /tmp, which is in memory, I copied all the contents of the USB device and other files until I had filled up 62 GB of memory. Then I ran the python program multiple times with no page-not-present errors. I also tried running 3 instances of the python program all at the same time, and again no page-not-present errors.

I guess it's possible that if there was a memory issue it was in the 2 GB that was not consumed, but the odds are low.

Thinking back on the sequence of events, I believe each time this has occurred, the following happened:

1. USB device plugged in and mounted
2. rsync ran to sync files to the USB device
3. python program run to verify files
4. USB device unmounted and unplugged
5. Some time elapses....
6. USB device plugged in again and mounted
7. python program run to verify files again
8. page-not-present error occurs

I am not certain that was the sequence both times, BUT, since the major USB problems and device corruption that occurred with kernel 6.12.8, I have been running the verification steps multiple times after the device has been unmounted and unplugged and then later plugged in and remounted, as an extra sanity check to make sure that no corruption exists.

Could it be that the issue has something to do with the kernel disk caching ?

--
Regards,
Joe
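For anyone trying to reproduce the read-heavy verification step from the sequence above, something along these lines might serve as a stand-in. This is only a sketch under assumptions: the thread does not show the actual python program, so the SHA-256 manifest approach and the names sha256_file, build_manifest and compare_manifests are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large files are never read into memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(expected, actual):
    """Report files that are missing, unexpected, or whose contents differ."""
    return {
        "missing": sorted(set(expected) - set(actual)),
        "extra": sorted(set(actual) - set(expected)),
        "changed": sorted(
            name for name in set(expected) & set(actual)
            if expected[name] != actual[name]
        ),
    }
```

One possible use is to build a manifest of the source directory before the rsync, build another from the USB mount point after each remount, and compare the two. Every file gets read in full, which exercises the same read path that shows up in the call trace.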
participants (5)
- Bernhard M. Wiedemann
- Fritz Hudnut
- Jiri Slaby
- Joe Salmeri
- Pablo Sanchez