[Bug 1231800] New: Many problems with kernel prior 6.9 release. AMD platform.
https://bugzilla.suse.com/show_bug.cgi?id=1231800

Bug ID: 1231800
Summary: Many problems with kernel prior 6.9 release. AMD platform.
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: openSUSE Tumbleweed
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: slawek@lach.art.pl
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0
Build Identifier:

I currently have to stay on kernel 6.8, because every newer kernel causes serious problems. I have tried many things, but nothing helps.

First, this bug may be related to the closed-source NVidia driver. The kernel panicked many times and printed debug info containing an NVidia string. I tried to disable the nvidia module. Blacklisting did nothing, so I removed it and tried to run the 6.11 kernel. However, I did not disable the NVidia service, so I may have unintentionally installed the driver again. Meanwhile, Discover notified me about an update and then hung while installing updates. I tried again with zypper and had to accept the NVidia license. Discover had not asked me to accept a license earlier, so I think the NVidia kernel driver was installed again without my knowledge, as described above. While installing via zypper, the system halted and sound played in a loop.

Fortunately, I installed NVidia service, removed the nvidia drivers via zypper, and tested 6.11 again. The system complained many times about an XFS (/home) filesystem error. Plasma 6.2 then reset and loaded its default settings. I am now back on 6.8.

Related info:
https://www.gamingonlinux.com/2024/08/nvidia-driver-with-linux-kernel-6-10-c... .
https://forums.opensuse.org/t/when-kernels-e8c6092e52f8-will-be-applied-on-m...
https://forums.opensuse.org/t/asus-zenbook-flip-15-kernel-6-9-fail-to-boot/1...

Thanks.

Reproducible: Sometimes

Steps to Reproduce:
1. Run the system, play many web games (freecivweb, total battle, forge of empires, the settlers: online, tentlan, etc.) at once, and play a video.

Actual Results:
On 6.8 there are no problems, but newer kernels show multiple problems. I am not sure whether the system itself hangs, because that may be related to the NVidia driver, but sometimes the system complained that a filesystem was mounted read-only.

Expected Results:
Everything works without problems, as on 6.8.

My laptop is an ASUS ZenBook Flip 15.

Operating System: openSUSE Tumbleweed 20241016
KDE Plasma Version: 6.2.1
KDE Frameworks Version: 6.7.0
Qt Version: 6.7.3
Kernel Version: 6.8.9-1-default (64-bit)
Graphics Platform: Wayland
Processors: 8 × AMD Ryzen 7 4700U with Radeon Graphics
Memory: 15.0 GiB of RAM
Graphics Processor: AMD Radeon Graphics
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: ZenBook UX562IQ_UM562IQ
System Version: 1.0

-- 
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c1

--- Comment #1 from Sławomir Lach <slawek@lach.art.pl> ---
Can you give me a hint on how to debug this or gather information? I tried enabling the UEFI backend for pstore, but after a reboot there are no files.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c2

--- Comment #2 from Sławomir Lach <slawek@lach.art.pl> ---
Following a post in the linked thread, I ran inxi -GSaz. The result is shown below.

System:
  Kernel: 6.8.9-1-default arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
    clocksource: tsc avail: hpet,acpi_pm
    parameters: BOOT_IMAGE=/boot/vmlinuz-6.8.9-1-default
    root=UUID=589e5f7b-afcf-4539-b4f5-f76f2363e243 splash=silent
    systemd.show_status=yes quiet security=apparmor mitigations=auto
    rd.shell=0
  Desktop: KDE Plasma v: 6.2.1 tk: Qt v: N/A info: frameworks v: 6.7.0
    wm: kwin_wayland tools: avail: xscreensaver vt: 3 dm: SDDM
    Distro: openSUSE Tumbleweed 20241016
Graphics:
  Device-1: NVIDIA GP107M [GeForce MX350] vendor: ASUSTeK driver: N/A
    alternate: nouveau non-free: 550.xx+ status: current (as of 2024-09;
    EOL~2026-12-xx) arch: Pascal code: GP10x process: TSMC 16nm
    built: 2016-2021 pcie: gen: 3 speed: 8 GT/s lanes: 4 link-max:
    lanes: 16 bus-ID: 01:00.0 chip-ID: 10de:1c96 class-ID: 0302
  Device-2: Advanced Micro Devices [AMD/ATI] Renoir [Radeon Vega Series /
    Radeon Mobile Series] vendor: ASUSTeK driver: amdgpu v: kernel
    arch: GCN-5 code: Vega process: GF 14nm built: 2017-20 pcie: gen: 4
    speed: 16 GT/s lanes: 16 ports: active: HDMI-A-1,eDP-1 empty: none
    bus-ID: 04:00.0 chip-ID: 1002:1636 class-ID: 0300 temp: 53.0 C
  Device-3: IMC Networks USB2.0 HD IR UVC WebCam driver: uvcvideo type: USB
    rev: 2.0 speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 3-2:2
    chip-ID: 13d3:56cb class-ID: 0e02 serial: <filter>
  Display: wayland server: X.org v: 1.21.1.12 with: Xwayland v: 24.1.3
    compositor: kwin_wayland driver: X: loaded: modesetting
    unloaded: fbdev,vesa dri: radeonsi gpu: amdgpu d-rect: 4480x1440
    display-ID: 0
  Monitor-1: HDMI-A-1 pos: right res: 2560x1440 size: N/A modes: N/A
  Monitor-2: eDP-1 pos: primary,left res: 1920x1080 size: N/A modes: N/A
  API: EGL v: 1.5 hw: drv: amd radeonsi platforms: device: 0 drv: radeonsi
    device: 1 drv: swrast gbm: drv: kms_swrast surfaceless: drv: radeonsi
    wayland: drv: radeonsi x11: drv: radeonsi
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.1.7 glx-v: 1.4
    direct-render: yes renderer: AMD Radeon Graphics (radeonsi renoir LLVM
    18.1.8 DRM 3.57 6.8.9-1-default) device-ID: 1002:1636 memory: 500 MiB
    unified: no display-ID: :1.0
  API: Vulkan v: 1.3.296 layers: 6 device: 0 type: integrated-gpu
    name: AMD Radeon Graphics (RADV RENOIR) driver: N/A device-ID: 1002:1636

I have an AMD APU and an NVidia GPU. I wrote that some of the problems (the hangs) may be caused by the NVidia driver, but I am not sure that they are related to the NVidia driver combined with Linux 6.9 or above. I am sure that the XFS filesystem became read-only and that this is not related to the NVidia driver, because the NVidia driver was uninstalled when it happened. So there may be two bugs:
1. Hangs with the NVidia driver
2. FS problems on newer kernels
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c3

--- Comment #3 from Sławomir Lach <slawek@lach.art.pl> ---
I have an external monitor attached and use it together with the laptop's internal display. I have read that KDE Plasma fixed a bug that caused hangs when an external monitor is attached, but I previously had other problems with the NVidia driver (on kernels 6.9/6.10), for example a panic with debug symbols printed on the screen. I do not know whether I had a panic with 6.11 or whether only Plasma hung (sound was playing in a loop, as I said). Should I test kernel 6.11 with Plasma 6.2.1? First, though, I must repair the FS problems (XFS mounted as /home). I should mention that this partition is old: around three years.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c5

--- Comment #5 from Sławomir Lach <slawek@lach.art.pl> ---
6.11 currently boots nearly always without problems. It was installed by the update process (plasma-discover, so packagekit/libzypp); should I remove some packages? I will try to run 6.11, and if an XFS error occurs, I will run dmesg | grep -i xfs in Konsole, save the log to the root filesystem, and then reboot.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c6

--- Comment #6 from Sławomir Lach <slawek@lach.art.pl> ---
I am on 6.11. Today at startup, SDDM complained that my root (BTRFS) filesystem was not writable (/var/tmp/...). I rebooted the newest kernel with rd.break=pre-mount and ran btrfs check /path/to/device/file. I am now working normally and waiting for a GPU problem/hang or a message saying that my home directory is not writable.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c7

--- Comment #7 from Sławomir Lach <slawek@lach.art.pl> ---
cat /zapis-dziennika
[ 8.671729] [ T912] SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled
[ 8.673684] [ T909] XFS (nvme0n1p6): Mounting V5 Filesystem dafe27d8-7d7f-4029-9de4-baaf35303b25
[ 8.683805] [ T909] XFS (nvme0n1p6): Ending clean mount
[ 8.686696] [ T909] xfs filesystem being mounted at /home supports timestamps until 2038-01-19 (0x7fffffff)
[ 990.643551] [ T9505] XFS (nvme0n1p6): Metadata corruption detected at xfs_btree_lookup_get_block+0x112/0x1e0 [xfs], xfs_bnobt block 0x329dd0c8
[ 990.643879] [ T9505] XFS (nvme0n1p6): Unmount and run xfs_repair
[ 990.645579] [ T9505] XFS (nvme0n1p6): Corruption of in-memory data (0x8) detected at xfs_defer_finish_noroll+0x2ee/0x440 [xfs] (fs/xfs/libxfs/xfs_defer.c:722). Shutting down filesystem.
[ 990.645773] [ T9505] XFS (nvme0n1p6): Please unmount the filesystem and rectify the problem(s)

Should I assume that my filesystem is broken and run xfs_repair? Or is this RAM corruption?
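A log like the one in this comment can be triaged mechanically rather than by eye. The following is a minimal sketch in Python, assuming the kernel log has been saved as plain text in the format shown in this thread. The function name `find_fs_errors`, the regular expression, and the keyword list are illustrative choices and not part of any existing tool.

```python
import re

# Matches lines such as "[ 990.643879] [ T9505] XFS (nvme0n1p6): message".
# The pattern is an assumption derived from the excerpts quoted in this bug.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[\s*T\d+\]\s+"
    r"(?P<fs>XFS|BTRFS)[^(]*\((?:device )?(?P<dev>[\w/]+)[^)]*\):\s*(?P<msg>.*)"
)

# Hypothetical keyword list; extend it for other message types.
KEYWORDS = ("corruption", "shutting down", "forced readonly", "xfs_repair")


def find_fs_errors(log_text):
    """Return (timestamp, fs, device, message) for lines that look like errors."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and any(k in match.group("msg").lower() for k in KEYWORDS):
            hits.append(
                (
                    float(match.group("ts")),
                    match.group("fs"),
                    match.group("dev"),
                    match.group("msg"),
                )
            )
    return hits
```

Passing the contents of a saved log file to `find_fs_errors` yields only the lines that mention corruption or a shutdown, together with the device they refer to. It does not distinguish on-disk damage from memory corruption; it only narrows down which lines to read.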
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c9

--- Comment #9 from Sławomir Lach <slawek@lach.art.pl> ---
I am not booting from a snapshot. I checked with xfs_repair, and the tool did not report any problems, only the normal log of what was checked.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c10

--- Comment #10 from Sławomir Lach <slawek@lach.art.pl> ---
Maybe kernels newer than 6.8 enable some hardware feature that is broken in my case, for example because of the specific laptop model or faulty hardware? I will test with SMART, although I have an NVMe disk.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c13

--- Comment #13 from Sławomir Lach <slawek@lach.art.pl> ---
When the system hangs, it is completely unresponsive. Currently I am not seeing hangs. I think they are related to the NVidia drivers, because debug symbols pointing to the NVidia driver sometimes appeared.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c15

--- Comment #15 from Sławomir Lach <slawek@lach.art.pl> ---
On the AMD side, I got:
[ 12.371641] [ T200] pcie_mp2_amd 0000:04:00.7: Failed to discover, sensors not enabled is 0
[ 12.371663] [ T200] pcie_mp2_amd 0000:04:00.7: amd_sfh_hid_client_init failed err -95

These messages look strange to me.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c18

--- Comment #18 from Sławomir Lach <slawek@lach.art.pl> ---
Hi. Yesterday everything seemed to work as expected. Today I played many browser games in Firefox. Then I started a picture-in-picture video in Firefox, and both Firefox and plasmashell froze. I force-closed the Firefox window and Firefox crashed, so I decided to run KRunner (via keyboard shortcut) and kill -9 plasmashell. Then I got a panic (I could tell it was a panic because the Caps Lock LED was blinking). I could not do anything. So maybe there is an error in video decoding on the AMD APU?
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c19

--- Comment #19 from Sławomir Lach <slawek@lach.art.pl> ---
I think it could be an AMD APU-related bug.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c20

--- Comment #20 from Sławomir Lach <slawek@lach.art.pl> ---
Created attachment 878190
  --> https://bugzilla.suse.com/attachment.cgi?id=878190&action=edit
sudo journalctl --no-hostname -k -b > dmesg.txt

Logs.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c23

--- Comment #23 from Sławomir Lach <slawek@lach.art.pl> ---
To be clear: I partially configured kexec, but then read that kdump is similar, so I disabled kexec and enabled kdump. Should I blacklist the suggested kernel modules, or try to trigger the panic with the normal kernel configuration for my hardware/peripherals?
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c25

--- Comment #25 from Sławomir Lach <slawek@lach.art.pl> ---
OK. I am now running linux-6.11.3-2-default and BTRFS went into RO mode. How do I dump memory and send it here?

dmesg | grep -i BTRFS
[ 2.065745] [ T374] Btrfs loaded, assert=on, zoned=yes, fsverity=yes
[ 6.863545] [ T579] BTRFS: device fsid 589e5f7b-afcf-4539-b4f5-f76f2363e243 devid 1 transid 65169 /dev/nvme0n1p4 (259:4) scanned by mount (579)
[ 6.864017] [ T579] BTRFS info (device nvme0n1p4): first mount of filesystem 589e5f7b-afcf-4539-b4f5-f76f2363e243
[ 6.864035] [ T579] BTRFS info (device nvme0n1p4): using crc32c (crc32c-intel) checksum algorithm
[ 6.864042] [ T579] BTRFS info (device nvme0n1p4): using free-space-tree
[ 663.022005] [ T8631] BTRFS info (device nvme0n1p4): qgroup scan completed (inconsistency flag cleared)
[ 1067.405006] [ T12611] file:libxul.so fault:filemap_fault mmap:btrfs_file_mmap [btrfs] read_folio:btrfs_read_folio [btrfs]
[ 1067.405403] [ T12611] file:libxul.so fault:filemap_fault mmap:btrfs_file_mmap [btrfs] read_folio:btrfs_read_folio [btrfs]
[ 3293.670588] [ T594] aops:btree_aops [btrfs] ino:1
[ 3293.670661] [ T594] BTRFS critical (device nvme0n1p4): corrupt leaf: block=293683200 slot=92 extent bytenr=130072576 len=16384 invalid tree parent bytenr, have 18437557654059512064 expect aligned to 4096
[ 3293.670666] [ T594] BTRFS info (device nvme0n1p4): leaf 293683200 gen 65259 total ptrs 133 free space 7048 owner 2
[ 3293.671049] [ T594] BTRFS error (device nvme0n1p4): block=293683200 write time tree block corruption detected
[ 3293.671837] [ T594] BTRFS: error (device nvme0n1p4) in btrfs_commit_transaction:2524: errno=-5 IO failure (Error while writing out
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c26

--- Comment #26 from Sławomir Lach <slawek@lach.art.pl> ---
transaction)
[ 3293.671842] [ T594] BTRFS info (device nvme0n1p4 state E): forced readonly
[ 3293.671846] [ T594] BTRFS warning (device nvme0n1p4 state E): Skipping commit of aborted transaction.
[ 3293.671847] [ T594] BTRFS error (device nvme0n1p4 state EA): Transaction aborted (error -5)
[ 3293.671850] [ T594] BTRFS: error (device nvme0n1p4 state EA) in cleanup_transaction:2018: errno=-5 IO failure
[ 3293.671849] [ T641] BTRFS: error (device nvme0n1p4 state EA) in btrfs_sync_log:3175: errno=-5 IO failure
[ 3894.542995] [ T22033] file:libxul.so fault:filemap_fault mmap:btrfs_file_mmap [btrfs] read_folio:btrfs_read_folio [btrfs]
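The "corrupt leaf" message in comment #25 reports a parent bytenr of 18437557654059512064 and says it expected a value aligned to 4096. The arithmetic behind that complaint can be sketched in Python as follows. This only illustrates the alignment test that the message describes; it is not the kernel's own checking code, and the helper name `is_aligned` is made up for the example.

```python
def is_aligned(value, alignment=4096):
    """Return True if value is a multiple of alignment (a power of two)."""
    return value & (alignment - 1) == 0


# Values copied from the "corrupt leaf" line in comment #25.
parent_bytenr = 18437557654059512064  # the "have" value the kernel rejected
extent_bytenr = 130072576             # extent bytenr from the same line
block = 293683200                     # block number of the leaf

for name, value in (
    ("parent", parent_bytenr),
    ("extent", extent_bytenr),
    ("block", block),
):
    # Print each value in hex together with the alignment verdict.
    print(name, hex(value), "aligned" if is_aligned(value) else "NOT aligned")
```

Only the parent value fails the test. The other two numbers from the same log line are multiples of 4096, which fits the picture of one damaged field rather than a wholesale unreadable block. Whether the damage happened in memory or on disk cannot be told from this check alone.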
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c29

--- Comment #29 from Sławomir Lach <slawek@lach.art.pl> ---
OK. I will try to test in second Monday and day of week. If the system works like a charm, I will enable amd_sfh and try to debug. Currently amd_sfh is disabled and the system works, but I remember days when newer kernels (6.9/6.10/6.11) worked like a charm and then problems appeared the next day.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c31

--- Comment #31 from Sławomir Lach <slawek@lach.art.pl> ---
I got a big banner with the string PASS and a prompt to press Esc or any other key to dismiss it, so the RAM seems good.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c32

--- Comment #32 from Sławomir Lach <slawek@lach.art.pl> ---
OK. It seems to be a problem with amd_sfh or the NVidia driver. I mean that there are now no FS problems (probably caused by the amd_sfh kernel module) and no hangs/panics (probably caused by the NVidia driver). I will buy an external hard drive, make a backup, and remove amd_sfh from the blacklist.
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c33

--- Comment #33 from Sławomir Lach <slawek@lach.art.pl> ---
Backup done. So, if I understood correctly, I should debug the case where the filesystem is remounted read-only? But how do I debug that? Should I type 'echo c > /proc/sysrq-trigger' into a terminal? How do I detect when it occurs? Should I listen for some udisksctl D-Bus signal?
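One possible way to detect the read-only remount asked about here is to poll /proc/mounts and look for the `ro` option on the mount points of interest. The sketch below is in Python and rests on several assumptions: polling is acceptable, the mount points to watch are / and /home, and the function names `read_only_mounts` and `watch` are invented for this example. An XFS shutdown like the one in comment #7 may not be reflected as `ro` in /proc/mounts, so this is more likely to catch the BTRFS "forced readonly" case.

```python
import time


def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    Each line of /proc/mounts has the form:
    device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and "ro" in fields[3].split(","):
            result.append(fields[1])
    return result


def watch(targets=("/", "/home"), interval=5.0):
    """Poll /proc/mounts until one of targets shows up as read-only."""
    while True:
        with open("/proc/mounts") as mounts_file:
            hit = [m for m in read_only_mounts(mounts_file.read()) if m in targets]
        if hit:
            return hit
        time.sleep(interval)
```

When `watch()` returns, the caller could save the kernel log at that moment (for example with the journalctl command from comment #20) before deciding whether to trigger a crash dump.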
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c35

Sławomir Lach <slawek@lach.art.pl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?

--- Comment #35 from Sławomir Lach <slawek@lach.art.pl> ---
uname -a
Linux localhost.localdomain 6.11.5-1-default #1 SMP PREEMPT_DYNAMIC Wed Oct 23 04:27:11 UTC 2024 (b4e3aa9) x86_64 x86_64 x86_64 GNU/Linux

So has the openSUSE team now released a fixed kernel version?
https://bugzilla.suse.com/show_bug.cgi?id=1231800#c37

--- Comment #37 from Sławomir Lach <slawek@lach.art.pl> ---
OK. I will work for a few days, and if everything works, I will install and test the NVidia driver, because I may have hit multiple bugs. However, the NVidia driver could be affected by the amd_sfh bug too, since the AMD driver could write to random memory. I must test.