[Bug 1226116] New: nvidia-open-driver-G06 ... 550.90.07: Fails suspend/recover from powersaving
https://bugzilla.suse.com/show_bug.cgi?id=1226116

            Bug ID: 1226116
           Summary: nvidia-open-driver-G06 ... 550.90.07: Fails suspend/recover from powersaving
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: X11 3rd Party Driver
          Assignee: gfx-bugs@suse.de
          Reporter: burnus@gmx.de
        QA Contact: sndirsch@suse.com
  Target Milestone: ---
          Found By: ---
           Blocker: ---

Created attachment 875386
  --> https://bugzilla.suse.com/attachment.cgi?id=875386&action=edit
nvidia-bug-report.log.gz from nvidia-bug-report.sh --safe-mode

[That's on a Lenovo P1 Gen 6 with "NVIDIA RTX A1000 6GB Laptop GPU".]

When going to suspend mode, it looks quite normal (screen gets dark, power light blinks). However, when one presses a button:

(1) It shows the terminal window - not the Wayland desktop - with two unrelated warnings:
  i2c_hid_acpi i2c-ELAN0686:00: i2c_hid_get_input: incomplete report (31/65280)
  iwlwifi 0000:00:14.3: WRT: Invalid buffer destination

(2) When going to Alt+F2, it shows a login screen but then nothing works (well, SysRq-{S,U,B} does work), neither anything on that screen nor Alt-Shift-F...

When skipping (2):

(3) When going to a terminal (e.g. Alt-F1), login is possible.

Getting some diagnostic output:

* nvidia-bug-report.sh stops early as an access to /proc/driver/nvidia/gpus/0000:01:00.0/information (I think it was that file) got stuck (lsof -p<pid> showed that file). Likewise, nvidia-smi did not output anything (both interruptible by Ctrl-C).

* A reboot failed with some message by systemd related to nvidia-*.service; I thought it was nvidia-powerd.service, but looking at the logs, it could also be nvidia-suspend.service. In any case, it stated something like disabling (or enabling?) wasn't possible while the service was currently enabled (or disabled?).

* "nvidia-bug-report.sh --safe-mode" did work → see attached file.
* * *

dmesg showed many lines of the form:
  kernel: NVRM: kbusVerifyBar2_GM107: MMUTest BAR0 window offset 0x70e000 returned garbage 0x0

The attached .gz file from "nvidia-bug-report.sh --safe-mode" contains both dmesg and "journalctl -b -0" and has the line above 1,750,353 times.

The "journalctl -b -0" output it contains (→ attachment) has:

  Jun 07 20:59:33 tux.net-b.de /usr/bin/nvidia-powerd[1477]: Dbus Connection is established
  Jun 07 21:29:48 tux.net-b.de suspend[3509]: nvidia-suspend.service
  Jun 07 21:29:48 tux.net-b.de logger[3509]: <13>Jun 7 21:29:48 suspend: nvidia-suspend.service
  Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvCheckFailedNoLog: Check failed: pMemDesc->_pInternalMapping != NULL @ mem_desc.c:2260
  Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mem_utils.c:574
  Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Ran out of a critical resource, other than memory [NV_ERR_INSUFFICIENT_RESOURCES] (0x0000001A) returned from memmgrMemCopy(pMemoryManager, &sysSurface, &vidSurface, copySize, TRANSFER_FLAGS_PREFER_CE) @ fbsr_gm107.c:1156

* * *

I think the issue occurred when doing the suspend and not when waking up the system, but I might be mistaken. I thought the wall time showed it, but I am not completely sure as I woke it up quite quickly; however, the quoted assertion fails directly after nvidia-suspend.service, which implies that it happens during the suspend.

BTW: With the older 550.78 driver, leaving the laptop alone for a while (→ power save mode) ended up with a reboot, or with the terminal showing briefly (similar output as above) before rebooting. Thus, the 550.78 issue was definitely a suspend/power-save issue. [The triggered reboot would be harder to diagnose than the issue I have now.]
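A count like the one quoted above for the BAR0 message can be reproduced with a few lines of Python. This is only a minimal sketch: the regular expression is derived from the single dmesg line shown above, count_bar0_messages and count_in_report are hypothetical helper names, and it assumes the attachment is a gzip-compressed text log.

```python
import gzip
import re
from collections import Counter

# Pattern derived from the dmesg line quoted in the report. The hex offset is
# captured so that variants of the address are tallied separately.
BAR0_RE = re.compile(
    r"NVRM: kbusVerifyBar2_GM107: MMUTest BAR0 window offset "
    r"(0x[0-9a-fA-F]+) returned garbage"
)


def count_bar0_messages(lines):
    """Return a Counter mapping each BAR0 offset to its number of occurrences."""
    counts = Counter()
    for line in lines:
        match = BAR0_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_in_report(path):
    """Hypothetical helper: count the messages in a gzip-compressed log file."""
    # errors="replace" keeps the scan going if the log contains invalid bytes.
    with gzip.open(path, "rt", errors="replace") as handle:
        return count_bar0_messages(handle)
```

Calling count_in_report("nvidia-bug-report.log.gz") would then give the per-offset totals; since the report contains both dmesg and journalctl output, the same kernel message may be counted once per source.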
* * *

Installed nvidia packages (rpm -qa '*nvidia*') - all are now 550.90.07-23.1:

  nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64
  kernel-firmware-nvidia-20240519-1.1.noarch
  nvidia-compute-G06-32bit-550.90.07-23.1.x86_64
  nvidia-gl-G06-550.90.07-23.1.x86_64
  nvidia-video-G06-32bit-550.90.07-23.1.x86_64
  nvidia-compute-utils-G06-550.90.07-23.1.x86_64
  nvidia-video-G06-550.90.07-23.1.x86_64
  nvidia-utils-G06-550.90.07-23.1.x86_64
  libnvidia-egl-wayland1-1.1.13-1.3.x86_64
  nvidia-compute-G06-550.90.07-23.1.x86_64
  nvidia-gl-G06-32bit-550.90.07-23.1.x86_64
  kernel-firmware-nvidia-gspx-G06-550.90.07-1.1.x86_64

* * *

Side remarks:

(a) Contrary to the classic drivers, the open kernel driver offers the pageableMemoryAccess property, which permits, via Linux kernel HMM support, migrating memory pages to/from the device when a page is accessed. That's used, e.g., by GCC 15 (mainline) with OpenMP offload support when Unified-Shared Memory (USM) has been requested. See https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html / https://gcc.gnu.org/gcc-15/changes.html

(b) The open kernel driver permits showing the screen on both an external monitor and the laptop screen, which didn't work with the default/classic driver.

(c) The more recent classic/non-'open kernel' driver also tended to crash occasionally: either a reboot (typically when doing 'zypper dup'; possibly due to some systemd interaction) or a freeze with a kernel failure (blinking shift lock; not even SysRq worked), which is a known but unsolved issue for the 550 driver according to the Nvidia Linux forum.

Thus, except for the issue reported in this bug, the open kernel driver is better. :-) And the future (said to be the default with Nvidia's 555 driver).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1226116#c1

--- Comment #1 from Tobias Burnus <burnus@gmx.de> ---
Looking at https://forums.developer.nvidia.com/c/gpu-graphics/linux/148

Today, there was a reply to an issue reported by someone else, pointing to
https://github.com/NVIDIA/open-gpu-kernel-modules/issues/472

That issue has plenty of comments and was opened Mar 11, 2023 for 525.85.05.

Glancing through that issue:
* I didn't see my assert
* but 'MMUTest BAR0 window offset 0x70e000 returned garbage 0x0' showed up in one comment; the variant with 'f' instead of 'e' in the hex address showed up in another, more recent comment.

Plus:
* On May 21, 2024, a comment was:
> We missed calling it out in the changelog explicitly (oops), but this should be fixed with 555.42.02. Please test. I'll leave this bug open while 555.xx is still in beta.
A bit later, some user reported:
> I am also getting a crash on suspend once in a blue moon. Here's the logs whenever the crash happens: [...]
With the reply (on June 4, 2024):
> Acknowledged the crash on suspend issue, we have filed a bug 4683310 internally for tracking purpose.
https://bugzilla.suse.com/show_bug.cgi?id=1226116#c4

Tobias Burnus <burnus@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(burnus@gmx.de)    |

--- Comment #4 from Tobias Burnus <burnus@gmx.de> ---
> For this you need to edit /usr/lib/modprobe.d/50-nvidia-default.conf
I guess you mean:
  /usr/lib/modprobe.d/59-nvidia-default.conf
of nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64

I can confirm that with /proc/driver/nvidia/params showing
  PreserveVideoMemoryAllocations: 0
suspending and waking up works. I also do not see any glitches, but I have not tried much. nvidia-smi also works.

* * *

Initially, I forgot to run:
  dracut --force
  systemctl disable nvidia-suspend.service
  systemctl disable nvidia-hibernate.service
  systemctl disable nvidia-resume.service
and at least 'status' for suspend/resume showed that they were enabled, but it also worked as described above with them enabled.
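The check of /proc/driver/nvidia/params described in comment #4 can be scripted. The following is a minimal sketch that assumes the file consists of "Name: value" lines, as in the PreserveVideoMemoryAllocations line quoted there; parse_params and preserve_video_memory_enabled are hypothetical helper names, not part of any NVIDIA tool.

```python
from pathlib import Path

# Path taken from the comment above; it only exists while the nvidia
# kernel module is loaded.
PARAMS_PATH = Path("/proc/driver/nvidia/params")


def parse_params(text):
    """Parse 'Name: value' lines into a dict (assumed file layout)."""
    params = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def preserve_video_memory_enabled(text):
    """Return True/False for the setting, or None if it is not listed."""
    value = parse_params(text).get("PreserveVideoMemoryAllocations")
    if value is None:
        return None
    return value != "0"


if __name__ == "__main__":
    if PARAMS_PATH.exists():
        print(preserve_video_memory_enabled(PARAMS_PATH.read_text()))
    else:
        print("no params file found (nvidia module not loaded?)")
```

A result of False corresponds to the "PreserveVideoMemoryAllocations: 0" state in which suspend and wake-up worked in the test above.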
https://bugzilla.suse.com/show_bug.cgi?id=1226116#c6

--- Comment #6 from Tobias Burnus <burnus@gmx.de> ---
> May I ask which desktop you're using?
KDE Plasma 6 (Wayland)