New subject: [Bug 1226116] nvidia-open-driver-G06 ... 550.90.07: Fails suspend/recover from powersaving

8 Jun 2024

      https://bugzilla.suse.com/show_bug.cgi?id=1226116

            Bug ID: 1226116
           Summary: nvidia-open-driver-G06 ... 550.90.07: Fails
                    suspend/recover from powersaving
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: X11 3rd Party Driver
          Assignee: gfx-bugs@suse.de
          Reporter: burnus@gmx.de
        QA Contact: sndirsch@suse.com
  Target Milestone: ---
          Found By: ---
           Blocker: ---

Created attachment 875386
  --> https://bugzilla.suse.com/attachment.cgi?id=875386&action=edit
nvidia-bug-report.log.gz from nvidia-bug-report.sh --safe-mode

[That's on a Lenovo P1 Gen 6 with "NVIDIA RTX A1000 6GB Laptop GPU".]

When going into suspend mode, everything looks quite normal (screen goes dark,
power light blinks).

However, when one presses a button:

(1) It shows the terminal window, not the Wayland desktop,
    with two unrelated warnings:
i2c_hid_acpi i2c-ELAN0686:00: i2c_hid_get_input: incomplete report (31/65280)
iwlwifi 0000:00:14.3: WRT: Invalid buffer destination

(2) When switching with Alt+F2, it shows a login screen, but
    then nothing responds: neither input on that screen nor
    Alt-Shift-F... (SysRq-{S,U,B} does still work).

When skipping (2):
(3) When switching to a terminal (e.g. Alt-F1), login is possible.
From there, I tried to get some diagnostic output:

* nvidia-bug-report.sh stopped early because an access to
/proc/driver/nvidia/gpus/0000:01:00.0/information (I think it was that file)
got stuck (lsof -p<pid> showed that file). Likewise, nvidia-smi did not output
anything. Both were interruptible with Ctrl-C.

* A reboot failed with some message from systemd related to nvidia-*.service;
  I thought it was nvidia-powerd.service, but looking at the logs, it
  could also be nvidia-suspend.service. In any case, it stated something like
  disabling (or enabling?) not being possible while the service was currently
  enabled (or disabled?).

* "nvidia-bug-report.sh --safe-mode"
  this did work → see attached file.
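
The hangs described above could be probed without blocking a shell by running
each command under a timeout. This is only a minimal sketch and not part of the
original diagnosis: the helper name probe() and the 10-second limit are
assumptions, and the procfs path is the one recalled above, which may not be
the exact file.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd quietly and classify the outcome.

    Returns 'ok', 'failed', 'hung' or 'missing'. On timeout the child is
    killed, which only helps if the hang is interruptible (as it was with
    Ctrl-C in this report).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    for cmd in (
        ["nvidia-smi"],
        # Path as recalled in the report; it may differ on other systems.
        ["cat", "/proc/driver/nvidia/gpus/0000:01:00.0/information"],
    ):
        print(" ".join(cmd), "->", probe(cmd))
```

A result of 'hung' for both commands would match the behaviour seen after the
failed suspend.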

* * *

dmesg showed many lines of the form:

kernel: NVRM: kbusVerifyBar2_GM107: MMUTest BAR0 window offset 0x70e000
returned garbage 0x0

The attached .gz file from "nvidia-bug-report.sh --safe-mode" contains both
dmesg and "journalctl -b -0" and has the line above 1,750,353 times.
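
A count like the one above can be reproduced from the attached archive with a
few lines of Python. This is a rough sketch: the function names are made up for
illustration, and the pattern assumes that every repeated line has the form
quoted above, with only the offset and the returned value varying.

```python
import gzip
import re

# Matches the start of the quoted NVRM message; the offset may vary.
PATTERN = re.compile(
    r"NVRM: kbusVerifyBar2_GM107: MMUTest BAR0 window offset 0x[0-9a-fA-F]+"
)


def count_matches(lines):
    """Count the lines that contain the NVRM BAR0 message."""
    return sum(1 for line in lines if PATTERN.search(line))


def count_in_gzip(path):
    """Count matching lines in a gzip-compressed log file."""
    with gzip.open(path, "rt", errors="replace") as fh:
        return count_matches(fh)


if __name__ == "__main__":
    import sys

    # Usage: python count_nvrm.py nvidia-bug-report.log.gz
    if len(sys.argv) > 1:
        print(count_in_gzip(sys.argv[1]))
```

Since the archive contains both dmesg and journal output, the same kernel
message may be counted once per source.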

The "journalctl -b -0" output it contains (→ attachment) has:

Jun 07 20:59:33 tux.net-b.de /usr/bin/nvidia-powerd[1477]: Dbus Connection is
established
Jun 07 21:29:48 tux.net-b.de suspend[3509]: nvidia-suspend.service
Jun 07 21:29:48 tux.net-b.de logger[3509]: <13>Jun  7 21:29:48 suspend:
nvidia-suspend.service

Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvCheckFailedNoLog: Check failed:
pMemDesc->_pInternalMapping != NULL @ mem_desc.c:2260
Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvAssertFailedNoLog: Assertion
failed: 0 @ mem_utils.c:574
Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvAssertOkFailedNoLog: Assertion
failed: Ran out of a critical resource, other than memory
[NV_ERR_INSUFFICIENT_RESOURCES] (0x0000001A) returned from
memmgrMemCopy(pMemoryManager, &sysSurface, &vidSurface, copySize,
TRANSFER_FLAGS_PREFER_CE) @ fbsr_gm107.c:1156

* * *

I think the issue occurred while suspending and not while waking up the
system, but I might be mistaken. I thought the wall-clock time showed it, but I
am not completely sure as I woke it up quite quickly; however, the quoted
assertion fails directly after nvidia-suspend.service, which suggests that it
happens during the suspend.
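
The ordering argument can be checked mechanically against the journal output.
The sketch below assumes unwrapped short-format journal lines (the quotes above
are wrapped by the mail client) and the year 2024, since that format omits the
year; the function names are made up for illustration.

```python
from datetime import datetime


def parse_stamp(line, year=2024):
    """Parse the leading 'Mon DD HH:MM:SS' of a short-format journal line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def failure_after_marker(lines, marker="nvidia-suspend.service", year=2024):
    """Find the first NVRM failure logged after the first marker line.

    Returns (marker_time, failure_time), or None if the marker or a
    subsequent NVRM failure line is absent.
    """
    marker_time = None
    for line in lines:
        if marker_time is None:
            if marker in line:
                marker_time = parse_stamp(line, year)
        elif "NVRM:" in line and "failed" in line.lower():
            return marker_time, parse_stamp(line, year)
    return None
```

A gap of zero seconds between the two timestamps is consistent with a failure
during suspend, although it cannot rule out a very quick wake-up.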

BTW: With the older 550.78 driver, leaving the laptop alone for a while (→ power
save mode) ended with a reboot, or with the terminal briefly being shown
(similar output as above) before rebooting. Thus, the 550.78 issue was
definitely a suspend/power-save issue. [The triggered reboot would be harder to
diagnose than the issue I have now.]

* * *

Installed nvidia packages (rpm -qa '*nvidia*'); the G06 userspace packages are
all now 550.90.07-23.1:

nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64
kernel-firmware-nvidia-20240519-1.1.noarch
nvidia-compute-G06-32bit-550.90.07-23.1.x86_64
nvidia-gl-G06-550.90.07-23.1.x86_64
nvidia-video-G06-32bit-550.90.07-23.1.x86_64
nvidia-compute-utils-G06-550.90.07-23.1.x86_64
nvidia-video-G06-550.90.07-23.1.x86_64
nvidia-utils-G06-550.90.07-23.1.x86_64
libnvidia-egl-wayland1-1.1.13-1.3.x86_64
nvidia-compute-G06-550.90.07-23.1.x86_64
nvidia-gl-G06-32bit-550.90.07-23.1.x86_64
kernel-firmware-nvidia-gspx-G06-550.90.07-1.1.x86_64
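
To see at a glance which packages share a version, the list can be grouped by
version-release. This is a small sketch with made-up function names; it relies
on RPM versions and releases never containing a hyphen, so splitting from the
right is safe.

```python
from collections import defaultdict


def split_nvra(package):
    """Split 'name-version-release.arch' into its four parts."""
    nvr, arch = package.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release, arch


def group_by_version(packages):
    """Map each 'version-release' string to the package names carrying it."""
    groups = defaultdict(list)
    for pkg in packages:
        name, version, release, _arch = split_nvra(pkg)
        groups[f"{version}-{release}"].append(name)
    return dict(groups)


if __name__ == "__main__":
    import sys

    # Usage: rpm -qa '*nvidia*' | python group_versions.py
    packages = [line.strip() for line in sys.stdin if line.strip()]
    for version, names in sorted(group_by_version(packages).items()):
        print(version, *sorted(names), sep="\n  ")
```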

* * *

Side remarks:

(a) Contrary to the classic drivers, the open kernel driver offers the
pageableMemoryAccess property, which uses the Linux kernel's HMM support to
migrate memory pages to/from the device when a page is accessed. That's
used, e.g., by GCC 15 (mainline) with OpenMP offload support when
Unified-Shared Memory (USM) has been requested. See
https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html /
https://gcc.gnu.org/gcc-15/changes.html

(b) The open kernel driver permits showing the screen on both an external
monitor and the laptop screen, which didn't work with the default/classic
driver.

(c) The more recent classic (non-'open kernel') driver also tended to crash
occasionally: either a reboot (typically when doing 'zypper dup'; possibly due
to some systemd interaction) or a freeze with a kernel failure (blinking shift
lock; not even SysRq worked), which is a known but unsolved issue for the 550
driver according to the Nvidia Linux forum.

Thus, except for the issue reported in this bug, the open kernel driver is
better. :-)
And it is the future (said to become the default with Nvidia's 555 driver).
-- 
You are receiving this mail because:
You are on the CC list for the bug.
