[Bug 1205774] New: Kernel 6.0.8 + nvidia 525.53 crash on BUG: kernel NULL pointer dereference, #PF: supervisor instruction fetch in kernel mode
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774

Bug ID: 1205774
Summary: Kernel 6.0.8 + nvidia 525.53 crash on BUG: kernel NULL pointer dereference, #PF: supervisor instruction fetch in kernel mode
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: bruno@ioda-net.ch
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 863125
  --> http://bugzilla.opensuse.org/attachment.cgi?id=863125&action=edit
Last kernel 6.0.8 boot and crash

Hello,

I need help finding the root cause of crashes (complete freezes of the workstation). Since the introduction of kernel 6.0.8 and the latest nvidia 525.53, I constantly face complete machine freezes when I'm using my external monitors.

nov 25 16:05:34 qt-kt kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
nov 25 16:05:34 qt-kt kernel: #PF: supervisor instruction fetch in kernel mode
nov 25 16:05:34 qt-kt kernel: #PF: error_code(0x0010) - not-present page
nov 25 16:05:34 qt-kt kernel: PGD 0 P4D 0
nov 25 16:05:34 qt-kt kernel: Oops: 0010 [#1] PREEMPT SMP PTI
nov 25 16:05:34 qt-kt kernel: CPU: 3 PID: 1004 Comm: nvidia-modeset/ Tainted: P OE 6.0.8-1-default #1 openSUSE Tumbleweed 9d20364b934f5aab0a9bdf84e8f45cfdfae39dab
nov 25 16:05:34 qt-kt kernel: Hardware name: Dell Inc. Precision 7510/0YH43H, BIOS 1.29.3 09/18/2022
nov 25 16:05:34 qt-kt kernel: RIP: 0010:0x0
nov 25 16:05:34 qt-kt kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
nov 25 16:05:34 qt-kt kernel: RSP: 0018:ffffadc300703c90 EFLAGS: 00010207
nov 25 16:05:34 qt-kt kernel: RAX: 0000000000000000 RBX: ffff9b2996c75378 RCX: 0000000000000004
nov 25 16:05:34 qt-kt kernel: RDX: 0000000000000010 RSI: ffff9b2996c75378 RDI: ffffadc300703a10
nov 25 16:05:34 qt-kt kernel: RBP: ffff9b292ef8b560 R08: 0000000000000000 R09: 0000000000000040
nov 25 16:05:34 qt-kt kernel: R10: ffff9b2886b28008 R11: ffffadc300703c00 R12: 0000000000000000
nov 25 16:05:34 qt-kt kernel: R13: 0000000000000000 R14: ffffffffc4fbee70 R15: ffffffffc4eb7580
nov 25 16:05:34 qt-kt kernel: FS: 0000000000000000(0000) GS:ffff9b37c44c0000(0000) knlGS:0000000000000000
nov 25 16:05:34 qt-kt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
nov 25 16:05:34 qt-kt kernel: CR2: ffffffffffffffd6 CR3: 000000075be10001 CR4: 00000000003706e0
nov 25 16:05:34 qt-kt kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
nov 25 16:05:34 qt-kt kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
nov 25 16:05:34 qt-kt kernel: Call Trace:
nov 25 16:05:34 qt-kt kernel: <TASK>
nov 25 16:05:34 qt-kt kernel: _nv001281kms+0x194/0x1a0 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? _nv001274kms+0x9a/0xd0 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? _nv001466kms+0x172/0x180 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? _nv001149kms+0xf3/0xa50 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? schedule+0x5a/0xd0
nov 25 16:05:34 qt-kt kernel: ? schedule_timeout+0x10e/0x150
nov 25 16:05:34 qt-kt kernel: ? nvkms_sema_up+0x10/0x10 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? __down_common+0x10b/0x1e0
nov 25 16:05:34 qt-kt kernel: ? nvkms_sema_up+0x10/0x10 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? _nv002457kms+0x2d/0x70 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? nvkms_kthread_q_callback+0x88/0xf0 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? _main_loop+0x77/0x130 [nvidia_modeset 4cdb9c16932eaf3d76d37eaaf2544a31bf3316d8]
nov 25 16:05:34 qt-kt kernel: ? kthread+0xd7/0x100
nov 25 16:05:34 qt-kt kernel: ? kthread_complete_and_exit+0x20/0x20
nov 25 16:05:34 qt-kt kernel: ? ret_from_fork+0x1f/0x30
nov 25 16:05:34 qt-kt kernel: </TASK>
nov 25 16:05:34 qt-kt kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq nf_nat_sip nft_objref af_packet nf_conntrack_sip nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct n>
nov 25 16:05:34 qt-kt kernel: processor_thermal_device_pci_legacy irqbypass iwlwifi processor_thermal_device pcspkr ledtrig_audio processor_thermal_rfim snd_usbmidi_lib efi_pstore snd_intel_sdw_acpi sparse_keymap processor_thermal_mbox cdc_wdm videob>
nov 25 16:05:34 qt-kt kernel: fuse configfs dmi_sysfs ip_tables x_tables cmac algif_hash algif_skcipher af_alg hid_logitech_hidpp hid_logitech_dj hid_generic dm_crypt usbhid essiv authenc trusted asn1_encoder tee bnep btusb btrtl btbcm btintel btmtk >
nov 25 16:05:34 qt-kt kernel: CR2: 0000000000000000
nov 25 16:05:34 qt-kt kernel: ---[ end trace 0000000000000000 ]---

Most of the time this happens after the screens resume: the primary one comes back on, but the second always stays in suspend mode. After switching the monitor off and on, I have to open either the display settings in Plasma or nvidia-settings to turn the second monitor (DisplayPort slave link) back on.

--
You are receiving this mail because:
You are the assignee for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c1

--- Comment #1 from Bruno Friedmann <bruno@ioda-net.ch> ---
Installed kernel:

S  | Name                 | Type    | Version      | Arch   | Repository
---+----------------------+---------+--------------+--------+------------------
i+ | kernel-default       | package | 6.0.7-1.1    | x86_64 | (System Packages)
i+ | kernel-default       | package | 6.0.6-1.1    | x86_64 | (System Packages)
i+ | kernel-default       | package | 6.0.8-1.1    | x86_64 | oss
i+ | kernel-default-devel | package | 6.0.7-1.1    | x86_64 | (System Packages)
i+ | kernel-default-devel | package | 6.0.6-1.1    | x86_64 | (System Packages)
i+ | kernel-default-devel | package | 6.0.8-1.1    | x86_64 | oss
i+ | kernel-devel         | package | 6.0.7-1.1    | noarch | (System Packages)
i+ | kernel-devel         | package | 6.0.6-1.1    | noarch | (System Packages)
i+ | kernel-devel         | package | 6.0.8-1.1    | noarch | oss
i+ | kernel-firmware-all  | package | 20221109-1.1 | noarch | oss
i+ | kernel-macros        | package | 6.0.8-1.1    | noarch | oss
i+ | kernel-obs-build     | package | 6.0.8-1.1    | x86_64 | oss
i+ | kernel-syms          | package | 6.0.7-1.1    | x86_64 | (System Packages)
i+ | kernel-syms          | package | 6.0.6-1.1    | x86_64 | (System Packages)
i+ | kernel-syms          | package | 6.0.8-1.1    | x86_64 | oss

Installed nvidia (official repo):

S  | Name                      | Type    | Version              | Arch   | Repository
---+---------------------------+---------+----------------------+--------+-----------
i+ | kernel-firmware-nvidia    | package | 20221109-1.1         | noarch | oss
i+ | libnvidia-egl-wayland1    | package | 1.1.11-1.1           | x86_64 | oss
i+ | nvidia-computeG06         | package | 525.53-16.1          | x86_64 | nvidia
i+ | nvidia-computeG06-32bit   | package | 525.53-16.1          | x86_64 | nvidia
i+ | nvidia-gfxG06-kmp-default | package | 525.53_k6.0.8_1-16.3 | x86_64 | nvidia
i+ | nvidia-glG06              | package | 525.53-16.1          | x86_64 | nvidia
i+ | nvidia-glG06-32bit        | package | 525.53-16.1          | x86_64 | nvidia
i+ | nvidia-texture-tools      | package | 2.1.2-2.8            | x86_64 | oss
i+ | x11-video-nvidiaG06       | package | 525.53-16.1          | x86_64 | nvidia
i+ | x11-video-nvidiaG06-32bit | package | 525.53-16.1          | x86_64 | nvidia

NAME="openSUSE Tumbleweed"
# VERSION="20221123"
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c2

--- Comment #2 from Bruno Friedmann <bruno@ioda-net.ch> ---
Created attachment 863126
  --> http://bugzilla.opensuse.org/attachment.cgi?id=863126&action=edit
lshw full

The hardware is 6 years old now and always worked until recently. I have another computer running the same TW version, a workstation rather than a laptop, with an AMD CPU and an nvidia card, and that one doesn't seem to have these crashes.
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c3

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ddadap@nvidia.com

--- Comment #3 from Stefan Dirsch <sndirsch@suse.com> ---
Hmm. No idea. Adding my contact at NVIDIA.
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c4

--- Comment #4 from Bruno Friedmann <bruno@ioda-net.ch> ---
With kernel 6.0.10-1.1 and nvidia 525.60.11 I'm now seeing a lot of:

[drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event
[drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event

Next week I will change the DP cable in case it makes a difference.
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c5

--- Comment #5 from Bruno Friedmann <bruno@ioda-net.ch> ---
Under Wayland I get more crashes of this kind:

[170893.490622] plasmashell[3405]: segfault at 8 ip 00007f27a65bd546 sp 00007ffe8c057ac0 error 4 in libQt5Gui.so.5.15.7[7f27a6511000+4ec000]
[170893.490633] Code: 00 00 00 90 48 8b 36 48 03 76 10 e9 d4 1b f6 ff 0f 1f 40 00 53 48 89 fb 48 83 ec 10 64 48 8b 04 25 28 00 00 00 48 89 44 24 08 <48> 8b 46 08 48 8b b0 88 00 00 00 48 85 f6 74 22 48 8b 06 ff 50 18
[170904.203109] [drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c6

--- Comment #6 from Bruno Friedmann <bruno@ioda-net.ch> ---
To Stefan: really strange. Since yesterday I've removed the cmdline parameter simplefb=1, and I simply haven't experienced any crash, kernel freeze or Firefox code error, and both external screens work.
Maybe that parameter is no longer needed?
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c7

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bruno@ioda-net.ch
              Flags|                            |needinfo?(bruno@ioda-net.ch)

--- Comment #7 from Stefan Dirsch <sndirsch@suse.com> ---
(In reply to Bruno Friedmann from comment #6)
> To Stefan: really strange. Since yesterday I've removed the cmdline parameter simplefb=1,

simplefb=1? It should be nosimplefb=1!

> and I simply haven't experienced any crash, kernel freeze or Firefox code error, and both external screens work.
> Maybe that parameter is no longer needed?

Could you double-check which option you had set?
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c8

Bruno Friedmann <bruno@ioda-net.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(bruno@ioda-net.ch)|

--- Comment #8 from Bruno Friedmann <bruno@ioda-net.ch> ---
Yes, sorry, non-ECC memory,
nosimplefb=1 is added when the nvidia KMP is installed.
I've now cleaned this one out and rerun mkinitrd.
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c9

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?(bruno@ioda-net.ch)

--- Comment #9 from Stefan Dirsch <sndirsch@suse.com> ---
(In reply to Bruno Friedmann from comment #8)
> Yes, sorry, non-ECC memory,

?

> nosimplefb=1 is added when the nvidia KMP is installed.

Yes, of course, but I don't know what option you added, or whether you added a simplefb=1 option at all...

> I've now cleaned this one out and rerun mkinitrd.

So you're testing again with nosimplefb=1?
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c10

Bruno Friedmann <bruno@ioda-net.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(bruno@ioda-net.ch)|

--- Comment #10 from Bruno Friedmann <bruno@ioda-net.ch> ---
Currently, with TW 20221205, I'm using these packages:

kernel-default 6.0.10-1.1
kernel-firmware-nvidia 20221130-1.1
nvidia 525.60.11-15.1

I've removed the nosimplefb=1 parameter, and since then I haven't detected any crash, hang, or spurious message about Firefox or other programs segfaulting. My dual external screens work again without freezing the computer.

About nosimplefb: my laptop has the internal Intel GPU deactivated, and only the NVIDIA Quadro GPU is in use. I'm using EFI, and the console already starts at 4K resolution (the GRUB prompt, for example). I don't know if that makes a difference.

From now on I will have to watch every nvidia install, so I can remove the parameter and recreate the initrd afterwards.
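Comment #10 notes that nosimplefb=1 comes back whenever the nvidia KMP is installed, so the kernel command line has to be re-checked after each driver update. Below is a minimal Python sketch of that check. `has_kernel_param` is a hypothetical helper, and the sketch assumes a Linux system where the running kernel's command line is readable from /proc/cmdline. It only reports what the running kernel was booted with, not which bootloader or initrd setting added the parameter. Because it compares whole parameter names, `nosimplefb` is never mistaken for `simplefb`, the mix-up discussed in comments #6 to #9.

```python
from pathlib import Path


def has_kernel_param(name, value=None, cmdline=None):
    """Return True if the kernel command line contains `name` (or `name=value`).

    Hypothetical helper for illustration only. If `cmdline` is None, the
    running kernel's command line is read from /proc/cmdline (Linux only).
    """
    if cmdline is None:
        cmdline = Path("/proc/cmdline").read_text()
    # Parameters are whitespace-separated; split() also drops the trailing newline.
    # Limitation: quoted values containing spaces are not handled.
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key != name:
            continue  # whole-name match: "nosimplefb" never matches "simplefb"
        if value is None or (sep and val == value):
            return True
    return False


if __name__ == "__main__":
    # Literal sample string so the example runs on any machine
    sample = "quiet splash=silent nosimplefb=1"
    print(has_kernel_param("nosimplefb", "1", sample))        # True
    print(has_kernel_param("simplefb", cmdline=sample))       # False
```

On the affected machine, calling `has_kernel_param("nosimplefb", "1")` with no `cmdline` argument would check the running kernel directly.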
http://bugzilla.opensuse.org/show_bug.cgi?id=1205774#c11

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |IN_PROGRESS
          Component|Kernel                      |X11 3rd Party Driver
           Assignee|kernel-bugs@opensuse.org    |gfx-bugs@suse.de
            Summary|Kernel 6.0.8 + nvidia       |Kernel 6.0.8 + nvidia >=
                   |525.53 crash on BUG: kernel |525.53 + nosimplefb=1:
                   |NULL pointer dereference,   |crash on BUG: kernel NULL
                   |#PF: supervisor instruction |pointer dereference, #PF:
                   |fetch in kernel mode        |supervisor instruction
                   |                            |fetch in kernel mode
         QA Contact|qa-bugs@suse.de             |sndirsch@suse.com

--- Comment #11 from Stefan Dirsch <sndirsch@suse.com> ---
This is the first time I've heard of issues with the nvidia driver that go away by re-enabling simplefb/simpledrm. Tracking now.