[Bug 1200048] New: Display sometimes dies after waking up computer

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048

            Bug ID: 1200048
           Summary: Display sometimes dies after waking up computer
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: x86-64
                OS: openSUSE Tumbleweed
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: X11 3rd Party Driver
          Assignee: gfx-bugs@suse.de
          Reporter: aaron.schweiger@gmail.com
        QA Contact: sndirsch@suse.com
          Found By: ---
           Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0
Build Identifier:

After moving my mouse, instead of asking for a password, the display switches to text mode before dying.

I use my NVIDIA for GPGPU, so I tried to install a minimal system:

i+ | kernel-firmware-nvidia    | package | 20220516-1.1             | noarch | Main Repository (OSS)
i+ | nvidia-computeG06         | package | 515.43.04-10.2           | x86_64 | nVidia Graphics Drivers
i+ | nvidia-gfxG06-kmp-default | package | 515.43.04_k5.17.7_1-10.5 | x86_64 | nVidia Graphics Drivers
i+ | libdrm_nouveau2           | package | 2.4.110-1.2              | x86_64 | Main Repository (OSS)

inxi -Gxxx
Graphics:
  Device-1: NVIDIA TU102 [GeForce RTX 2080 Ti] vendor: eVga.com. driver: nvidia v: 515.43.04 bus-ID: 03:00.0 chip-ID: 10de:1e04 class-ID: 0300
  Display: x11 server: X.Org 21.1.3 compositor: kwin_x11 driver: loaded: modesetting unloaded: fbdev,vesa alternate: nouveau,nv,nvidia resolution: 1920x1080~60Hz s-dpi: 96
  OpenGL: renderer: llvmpipe (LLVM 14.0.3 256 bits) v: 4.5 Mesa 22.1.0 direct render: Yes

The system itself is a dual Xeon Dell 7910 with ECC memory.

Here's an Oops that I suspect is the cause of the issue:

May 30 20:57:18 localhost.localdomain kernel: PM: hibernation: hibernation exit
May 30 20:57:18 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
May 30 20:57:18 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Failure reading maximum pixel clock value for display device HDMI-0.
May 30 20:57:18 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
May 30 20:57:18 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
May 30 20:57:18 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
May 30 20:57:18 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
May 30 20:57:18 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
May 30 20:57:18 localhost.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000800
May 30 20:57:18 localhost.localdomain kernel: #PF: supervisor read access in kernel mode
May 30 20:57:18 localhost.localdomain kernel: #PF: error_code(0x0000) - not-present page
May 30 20:57:18 localhost.localdomain kernel: PGD 0 P4D 0
May 30 20:57:18 localhost.localdomain kernel: Oops: 0000 [#1] PREEMPT SMP PTI
May 30 20:57:18 localhost.localdomain kernel: CPU: 58 PID: 15864 Comm: Xorg.bin Tainted: P S W OE 5.17.7-1-default #1 openSUSE Tumbleweed b4ba1fcd97f0731b1076a42506ea31afd2937a1c
May 30 20:57:18 localhost.localdomain kernel: Hardware name: Dell Inc. Precision Tower 7910/0215PR, BIOS A34 10/19/2020
May 30 20:57:18 localhost.localdomain kernel: RIP: 0010:_nv002439kms+0x24/0x110 [nvidia_modeset]
May 30 20:57:18 localhost.localdomain kernel: Code: 75 84 5b c3 66 90 89 f6 41 56 41 55 41 54 55 41 89 d5 48 69 ee c0 2f 00 00 53 48 8b 87 58 04 00 00 49 89 fc 48 89 cf 48 89 cb <4c> 8b b4 28 00 08 00 00 e8 2f ff ff ff 41 83 fd 03 0f 87 c5 00 00
May 30 20:57:18 localhost.localdomain kernel: RSP: 0018:ffff9a46a68236c0 EFLAGS: 00010206
May 30 20:57:18 localhost.localdomain kernel: RAX: 0000000000000000 RBX: ffff9a46a21cf608 RCX: ffff9a46a21cf608
May 30 20:57:18 localhost.localdomain kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9a46a21cf608
May 30 20:57:18 localhost.localdomain kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004
May 30 20:57:18 localhost.localdomain kernel: R10: 0000000000000004 R11: ffffffffc3b62570 R12: ffff8dee86c05008
May 30 20:57:18 localhost.localdomain kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 30 20:57:18 localhost.localdomain kernel: FS: 00007fbcf8028940(0000) GS:ffff8e157f980000(0000) knlGS:0000000000000000
May 30 20:57:18 localhost.localdomain kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 30 20:57:18 localhost.localdomain kernel: CR2: 0000000000000800 CR3: 00000014519b4004 CR4: 00000000001706e0
May 30 20:57:18 localhost.localdomain kernel: Call Trace:
May 30 20:57:18 localhost.localdomain kernel: <TASK>
May 30 20:57:18 localhost.localdomain kernel: ? _nv002567kms+0x6c9/0x2cd0 [nvidia_modeset 0e35234a034c3728c2fa25d8a19b0c9b91201dbf]
May 30 20:57:18 localhost.localdomain kernel: ? alloc_vmap_area+0x94/0x830
May 30 20:57:18 localhost.localdomain kernel: ? kmem_cache_alloc_node+0x1c0/0x370
May 30 20:57:18 localhost.localdomain kernel: ? prepare_alloc_pages.constprop.0+0x82/0x140
May 30 20:57:18 localhost.localdomain kernel: ? __alloc_pages_bulk+0x30e/0x690
May 30 20:57:18 localhost.localdomain kernel: ? vmap_small_pages_range_noflush+0x301/0x4b0
May 30 20:57:18 localhost.localdomain kernel: ? __vmalloc_node_range+0x38f/0x510
May 30 20:57:18 localhost.localdomain kernel: ? _nv000534kms+0x50/0x50 [nvidia_modeset 0e35234a034c3728c2fa25d8a19b0c9b91201dbf]
May 30 20:57:18 localhost.localdomain kernel: ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset 0e35234a034c3728c2fa25d8a19b0c9b91201dbf]
May 30 20:57:18 localhost.localdomain kernel: ? nvkms_ioctl_from_kapi+0x47/0x80 [nvidia_modeset 0e35234a034c3728c2fa25d8a19b0c9b91201dbf]
May 30 20:57:18 localhost.localdomain kernel: ? _nv000019kms+0x691/0x7f0 [nvidia_modeset 0e35234a034c3728c2fa25d8a19b0c9b91201dbf]
May 30 20:57:18 localhost.localdomain kernel: ? drm_connector_list_iter_next+0x81/0xb0 [drm eceed2d906d3f73f699f70947de57244439bee85]
May 30 20:57:18 localhost.localdomain kernel: ? nv_drm_atomic_apply_modeset_config.isra.0+0x48d/0x540 [nvidia_drm 840cbbe698be756b715aead483c5859533ca8672]
May 30 20:57:18 localhost.localdomain kernel: ? drm_atomic_check_only+0x5a7/0x9f0 [drm eceed2d906d3f73f699f70947de57244439bee85]
May 30 20:57:18 localhost.localdomain kernel: ? drm_atomic_commit+0x13/0x60 [drm eceed2d906d3f73f699f70947de57244439bee85]
May 30 20:57:18 localhost.localdomain kernel: ? drm_atomic_helper_set_config+0x6d/0xa0 [drm_kms_helper b3a822f6764fffddcd0766156054ee91056277a4]
May 30 20:57:18 localhost.localdomain kernel: ? drm_mode_setcrtc+0x395/0x780 [drm eceed2d906d3f73f699f70947de57244439bee85]
May 30 20:57:18 localhost.localdomain kernel: ? drm_mode_getcrtc+0x170/0x170 [drm eceed2d906d3f73f699f70947de57244439bee85]
May 30 20:57:18 localhost.localdomain kernel: ? drm_ioctl_kernel+0xbe/0x160 [drm eceed2d906d3f73f699f70947de57244439bee85]
May 30 20:57:18 localhost.localdomain kernel: ? drm_ioctl+0x21c/0x410 [drm eceed2d906d3f73f699f70947de57244439bee85]
May 30 20:57:18 localhost.localdomain kernel: ? drm_mode_getcrtc+0x170/0x170 [drm eceed2d906d3f73f699f70947de57244439bee85]
May 30 20:57:18 localhost.localdomain kernel: ? __x64_sys_ioctl+0x8d/0xc0
May 30 20:57:18 localhost.localdomain kernel: ? do_syscall_64+0x5b/0x80
May 30 20:57:18 localhost.localdomain kernel: ? do_syscall_64+0x67/0x80
May 30 20:57:18 localhost.localdomain kernel: ? exit_to_user_mode_prepare+0x194/0x230
May 30 20:57:18 localhost.localdomain kernel: ? syscall_exit_to_user_mode+0x18/0x40
May 30 20:57:18 localhost.localdomain kernel: ? do_syscall_64+0x67/0x80
May 30 20:57:18 localhost.localdomain kernel: ? exit_to_user_mode_prepare+0x194/0x230
May 30 20:57:18 localhost.localdomain kernel: ? syscall_exit_to_user_mode+0x18/0x40
May 30 20:57:18 localhost.localdomain kernel: ? do_syscall_64+0x67/0x80
May 30 20:57:18 localhost.localdomain kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
May 30 20:57:18 localhost.localdomain kernel: </TASK>
May 30 20:57:18 localhost.localdomain kernel: Modules linked in: af_packet intel_rapl_msr iTCO_wdt intel_pmc_bxt iTCO_vendor_support mei_wdt ucsi_ccg typec_ucsi typec roles dell_wmi sparse_keymap video dell_smm_hwmon nvidia_drm(POE) intel_rapl_common nvidia_modeset(POE) sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm dell_smbios dcdbas nvidia_uvm(POE) irqbypass intel_wmi_thunderbolt dell_wmi_descriptor wmi_bmof pcspkr igb snd_hda_codec_realtek i2c_i801 i2c_algo_bit i2c_smbus lpc_ich snd_hda_codec_generic dca ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg nvidia(POE) snd_intel_sdw_acpi joydev snd_hda_codec snd_hda_core snd_hwdep mei_me drm_kms_helper e1000e snd_pcm mei cec snd_timer rc_core snd syscopyarea sysfillrect sysimgblt i2c_nvidia_gpu fb_sys_fops soundcore nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat
May 30 20:57:18 localhost.localdomain kernel: nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security nfnetlink rfkill ip6table_filter ip6_tables iptable_filter bpfilter dmi_sysfs squashfs loop tiny_power_button xfs drm fuse configfs ip_tables x_tables hid_generic usbhid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel mxm_wmi aesni_intel xhci_pci crypto_simd xhci_pci_renesas cryptd xhci_hcd ehci_pci ehci_hcd mpt3sas usbcore sr_mod cdrom raid_class scsi_transport_sas wmi button btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr
May 30 20:57:18 localhost.localdomain kernel: CR2: 0000000000000800
May 30 20:57:18 localhost.localdomain kernel: ---[ end trace 0000000000000000 ]---
May 30 20:57:18 localhost.localdomain kernel: RIP: 0010:_nv002439kms+0x24/0x110 [nvidia_modeset]
May 30 20:57:18 localhost.localdomain kernel: Code: 75 84 5b c3 66 90 89 f6 41 56 41 55 41 54 55 41 89 d5 48 69 ee c0 2f 00 00 53 48 8b 87 58 04 00 00 49 89 fc 48 89 cf 48 89 cb <4c> 8b b4 28 00 08 00 00 e8 2f ff ff ff 41 83 fd 03 0f 87 c5 00 00
May 30 20:57:18 localhost.localdomain kernel: RSP: 0018:ffff9a46a68236c0 EFLAGS: 00010206

I use a cheap KVM to switch between computers. It seems likely the KVM is constantly disconnecting and reconnecting HDMI as I switch between computers and that is possibly related to the observed behavior. (As a possible work-around, I will try to use a DP-to-HDMI adapter.)

Reproducible: Sometimes

Steps to Reproduce:
1. Wait for computer to idle for a long time
2. Use KVM to use other computer
3. Switch to computer via KVM
4. Move mouse to get out of lock screen --> crash happens

-- 
You are receiving this mail because:
You are on the CC list for the bug.
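For anyone trying to confirm the same failure on their own machine, the pattern in the log above (a run of nvidia-modeset errors followed by a NULL pointer dereference inside nvidia_modeset) can be picked out of a saved kernel log with a few lines of Python. This is only a rough sketch: the regular expressions are derived from the messages quoted in this report, and the function name and result keys are made up for illustration.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this report.
# All names below are illustrative, not part of any existing tool.
NVKMS_ERROR = re.compile(r"kernel: nvidia-modeset: ERROR: GPU:\d+: (.+)$")
NULL_DEREF = re.compile(
    r"kernel: BUG: kernel NULL pointer dereference, address: ([0-9a-f]+)"
)
RIP_LINE = re.compile(r"kernel: RIP: \d+:(\S+) \[(\w+)\]")


def summarize_kernel_log(text):
    """Count nvidia-modeset errors and record the first NULL-pointer Oops."""
    errors = Counter()
    oops_address = None
    faulting_symbol = None
    faulting_module = None
    for line in text.splitlines():
        match = NVKMS_ERROR.search(line)
        if match:
            errors[match.group(1).strip()] += 1
            continue
        match = NULL_DEREF.search(line)
        if match and oops_address is None:
            oops_address = match.group(1)
            continue
        match = RIP_LINE.search(line)
        if match and faulting_symbol is None:
            # Only the first RIP line is kept; the Oops prints it twice.
            faulting_symbol, faulting_module = match.group(1), match.group(2)
    return {
        "errors": errors,
        "oops_address": oops_address,
        "faulting_symbol": faulting_symbol,
        "faulting_module": faulting_module,
    }


sample = "\n".join([
    "May 30 20:57:18 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices",
    "May 30 20:57:18 localhost.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000800",
    "May 30 20:57:18 localhost.localdomain kernel: RIP: 0010:_nv002439kms+0x24/0x110 [nvidia_modeset]",
])
print(summarize_kernel_log(sample))
```

The text passed in could be, for example, the saved output of `journalctl -k`. A non-empty `errors` counter together with `faulting_module == "nvidia_modeset"` would match what is reported here.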

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048#c1

--- Comment #1 from Stefan Dirsch <sndirsch@suse.com> ---
Wow! My packages for the Open Source driver are already in use, although I didn't tell anybody outside of the company about them. Honestly, I only built them for basic testing: whether the driver loads at all and "nvidia-smi --query" works. It definitely makes sense to install nvidia-computeG06 in addition for GPGPU use, so that libcuda is available.

Looks like what happens now is that the modeset X driver is being used, which finds DRM support via the modeset option of the nvidia driver. Theoretically this should be possible, but I'm not sure anybody has ever tried it, since nVidia comes with its own nvidia X driver. So I suggest installing the x11-video-nvidiaG06 package, and probably also the nvidia-glG06 package.

Also, this is still Alpha quality on your GeForce RTX 2080 Ti. You needed to set

  options nvidia NVreg_OpenRmEnableUnsupportedGpus=1

in modprobe.d/50-nvidia-default.conf, right?

Things would be different if this were a gfxcard that you only use for GPGPU. Then I think you could live with only

  nvidia-gfxG06-kmp-default
  kernel-firmware-nvidia-gsp
  nvidia-computeG06

and drive your monitors with another driver (intel, amd, etc.)
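The modprobe.d question above can also be answered mechanically. The following is a minimal sketch, assuming the usual "options <module> <param>=<value> ..." line syntax of modprobe.d files; the helper name is hypothetical, and values containing quoted spaces are not handled.

```python
def parse_modprobe_options(text, module):
    """Return {parameter: value} from every 'options <module> ...' line.

    Hypothetical helper for illustration; it ignores '#' comments and does
    not handle quoted values that contain spaces.
    """
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) < 3 or fields[0] != "options" or fields[1] != module:
            continue
        for item in fields[2:]:
            key, _, value = item.partition("=")
            params[key] = value
    return params


# Example content; on a real system the text would be read from the
# 50-nvidia-default.conf file mentioned above.
example = """
# illustrative file content
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
"""
print(parse_modprobe_options(example, "nvidia"))
```

An empty result for the "nvidia" module would mean the option was never set, which is what the reporter later confirms for his system.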

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048#c2

--- Comment #2 from Stefan Dirsch <sndirsch@suse.com> ---
BTW, where did you find kernel-firmware-nvidia 20220516-1.1? It seems to be a different package than mine.

https://build.opensuse.org/package/show/X11:Drivers:Video/kernel-firmware-nv...

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048#c3

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |CONFIRMED

--- Comment #3 from Stefan Dirsch <sndirsch@suse.com> ---
Oh no. You're not using the OpenSource nvidia modules at all. I was confused.

I'm afraid what you're trying is not supported by nVidia. In theory it should work: the modeset X driver with neither 2D acceleration (GLAMOR via nouveau won't work) nor 3D support for OpenGL. The question is also who really wants this. If you only want to use the nVidia card for GPGPU, this is a different story.

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048#c4

--- Comment #4 from Stefan Dirsch <sndirsch@suse.com> ---
But since you also want to use it for your display ... this doesn't fly. According to inxi, you apparently have only one gfxcard.

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048#c5

--- Comment #5 from Aaron Schweiger <aaron.schweiger@gmail.com> ---
Hi, I have it set up this way so as to use the card in only the most basic configuration -- for both stability and the very low resource use on the card, note 0% GPU-Utilization:

Tue May 31 21:52:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0  On |                  N/A |
| 24%   34C    P8    37W / 250W |     10MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I don't have any special configuration in 50-nvidia-default.conf.

Per your suggestion, I will remove the libdrm_nouveau2; I'll also try x11-video-nvidiaG06.

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048#c6

--- Comment #6 from Aaron Schweiger <aaron.schweiger@gmail.com> ---
I installed x11-video-nvidiaG06; I didn't remove libdrm_nouveau2 (is this a standard library?).

I note the increased usage of the GPU below:

Tue May 31 22:36:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0  On |                  N/A |
| 24%   41C    P5    51W / 250W |     78MiB / 11264MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32080      G   /usr/bin/Xorg.bin                  76MiB |
+-----------------------------------------------------------------------------+

I'll report back if I see a crash with this configuration -- although I suspect this same instability was why I tried to switch to minimal drivers.
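The before/after comparison in comments #5 and #6 can be made less manual by using nvidia-smi's CSV query mode instead of reading the table. A small sketch follows, under these assumptions: the field list and helper names are chosen for illustration, and the two sample lines are typed in from the figures quoted above rather than captured from the tool.

```python
import subprocess

# Fields requested from nvidia-smi; the selection is an illustrative choice.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_usage(csv_line):
    """Parse one 'csv,noheader,nounits' line into {field: int}.

    Assumes every field is numeric; entries such as '[N/A]' would raise
    ValueError here.
    """
    values = [int(value.strip()) for value in csv_line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def read_gpu_usage():
    """Query all GPUs via nvidia-smi (defined here but not called below)."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [parse_gpu_usage(line) for line in output.splitlines() if line.strip()]


# Figures from comment #5 (modesetting X driver) and comment #6 (nvidia X driver).
before = parse_gpu_usage("0, 10, 11264")
after = parse_gpu_usage("27, 78, 11264")
delta = {field: after[field] - before[field] for field in QUERY_FIELDS}
print(delta)
```

On a machine with the driver loaded, calling `read_gpu_usage()` before and after a configuration change would give the same kind of comparison without copying numbers by hand.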

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048#c7

--- Comment #7 from Stefan Dirsch <sndirsch@suse.com> ---
(In reply to Aaron Schweiger from comment #5)
> I have it set up this way so as to use the card in only the most basic configuration -- for both stability and the very low resource use on the card, note 0% GPU-Utilization:
I'm sure a graphics driver can be very stable if you don't use a graphical desktop. ;-)
> I don't have any special configuration in 50-nvidia-default.conf.
Sure. I had wrong assumptions - assuming you would be using the OpenSource nvidia kernel driver ...
> Per your suggestion, I will remove the libdrm_nouveau2; I'll also try x11-video-nvidiaG06.
You don't need to remove the libdrm_nouveau package. It's only used by the nouveau driver, but you're using the nvidia driver, since you installed the nvidia kernel modules. In that case the nouveau driver is disabled.

http://bugzilla.opensuse.org/show_bug.cgi?id=1200048#c8

--- Comment #8 from Stefan Dirsch <sndirsch@suse.com> ---
(In reply to Aaron Schweiger from comment #6)
> I installed x11-video-nvidiaG06; I didn't remove libdrm_nouveau2 (is this a standard library?).
That's fine. See my comment above.
> I note the increased usage of the GPU below:
Well, you're using a graphical desktop. What do you expect?
> I'll report back if I see a crash with this configuration -- although I suspect this same instability was why I tried to switch to minimal drivers.
Ok. There are not many alternatives. Either the nouveau driver works for you or, if not, use the nvidia driver. If that doesn't work either, well ... only the completely unaccelerated "fbdev" driver remains, which may not even support the native resolution of your monitor, let alone multiple monitors.

You are the first person I've heard of trying to mix nvidia kernel modules with the "modeset" X driver. Unfortunately a complete failure. :-(