[Bug 1180742] New: [amdgpu]An AMD Vega series GPU randomly crashes
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742

Bug ID: 1180742
Summary: [amdgpu]An AMD Vega series GPU randomly crashes
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.2
Hardware: x86-64
OS: openSUSE Leap 15.2
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: srid@rkmail.ru
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 844970
--> http://bugzilla.opensuse.org/attachment.cgi?id=844970&action=edit
partial kernel log

The AMDGPU kernel driver randomly crashes the GPU, usually under load, with Radeon VII hardware. The GPU hang is relatively hard to hit, as it usually takes 5 to 7 days before it crashes.

After a hang the driver attempts to reset the GPU, but sometimes the reset fails and the system stays largely unresponsive. You can still access it over the network, and it still reacts in some way to keyboard events, but the display stays dead. The hang also seems to bring the PCIe bus down to 1.0 mode, and it stays that way until reboot.

There's an upstream bug open that may have something to do with it:
https://gitlab.freedesktop.org/drm/amd/-/issues/716

That particular GPU works fine on a Windows machine.

openSUSE Leap 15.2, kernel 5.3.18-lp152.57-default #1 SMP Fri Dec 4 07:27:58 UTC 2020 (7be5551)

--
You are receiving this mail because:
You are the assignee for the bug.
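The report above says the PCIe link appears to drop to 1.0 mode after a failed reset. One way to confirm this could be to read the `current_link_speed` and `max_link_speed` attributes that the kernel exposes for each PCI device under sysfs. The following is a minimal Python sketch under that assumption; the function names are hypothetical, and the device directory has to be found first (for example with `lspci -D`). A reading taken while the GPU is idle may be lower than the maximum for power-saving reasons, so it is probably best compared before and after a hang under similar load.

```python
from pathlib import Path

# PCIe 1.x signals at 2.5 GT/s per lane; sysfs reports this as "2.5 GT/s"
# (newer kernels append " PCIe" to the string).
PCIE_GEN1_PREFIX = "2.5 GT/s"


def read_link_speed(device_dir):
    """Return (current, maximum) link speed strings for one PCI device.

    device_dir is the sysfs directory of the device, e.g. a path of the
    form /sys/bus/pci/devices/<domain:bus:dev.fn> (address is system-specific).
    """
    dev = Path(device_dir)
    current = (dev / "current_link_speed").read_text().strip()
    maximum = (dev / "max_link_speed").read_text().strip()
    return current, maximum


def is_gen1(speed):
    """True if the reported speed is the PCIe 1.x rate."""
    return speed.startswith(PCIE_GEN1_PREFIX)


def is_downgraded(current, maximum):
    """True if the link currently runs below its advertised maximum."""
    return current != maximum
```

Calling `read_link_speed()` on the GPU's sysfs directory right after a hang and passing the first value to `is_gen1()` would show whether the link is stuck at the 1.x rate described in the report.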
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c2
--- Comment #2 from Iakov Karpov
> It's some GPU hang that leads to the real kernel crash.... which happened on others sometimes, too. Unfortunately there is no fix for this and likely not for Leap 15.2 kernel.
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.
Kernel 5.3.18-100.g3524980 of Kernel:SLE15-SP3 won't boot on this machine (it gets stuck right after the bootloader, with not even a single line on screen after "loading initrd"). Testing with Kernel:stable may require some time.
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c4
--- Comment #4 from Iakov Karpov
> It's some GPU hang that leads to the real kernel crash.... which happened on others sometimes, too. Unfortunately there is no fix for this and likely not for Leap 15.2 kernel.
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.
I've been testing kernel 5.10.6-3.g183dcff-default of Kernel:stable for almost 14 days now without a single crash.

(In reply to Takashi Iwai from comment #3)
> That's bad. Do you have the secure boot enabled? If so, disable it when you test a kernel from OBS repo that is other than the official release.
I'm on kernel 5.3.18-107.g0b709ea-default of Kernel:SLE15-SP3 now, and it works for me. I didn't change anything about secure boot, though; I don't think I had it enabled. I'll report back in another 2 weeks if it doesn't crash sooner.
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c5
--- Comment #5 from Iakov Karpov
> It's some GPU hang that leads to the real kernel crash.... which happened on others sometimes, too. Unfortunately there is no fix for this and likely not for Leap 15.2 kernel.
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.
It crashed on the 12th day with 5.3.18-107.g0b709ea-default (Kernel:SLE15-SP3).
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c6
--- Comment #6 from Iakov Karpov
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c10
--- Comment #10 from Iakov Karpov
> Still not resolved in upstream according to the reports. Might be worked around by disabling the dynamic power management of the GPU or by the GPU frequency throttling manipulation.
> Iakov, by any chance, would the latest kernel from Leap 15.4 or the latest kernel from OBS Kernel:stable:Backport work better for you? Leap 15.2 is not supported anymore, Leap 15.3 is probably not better if I read your feedback correctly. Leap 15.4 will be based on v5.14 kernel.
I'm currently using Leap 15.3 with kernel 5.15.13 of Kernel:stable:Backport. It's better, but it still crashes sometimes. With 5.16.x kernels my machine was crashing every few minutes, but I'm not sure the GPU is the cause there. I was not able to recover any crash logs, so there is no bug report on that.
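The workaround suggested in the quoted text (disabling dynamic power management, or manipulating GPU frequency throttling) is not spelled out in this thread. As an illustration only: the amdgpu driver has a `dpm` module parameter (`amdgpu.dpm=0` on the kernel command line) and a per-device sysfs attribute `power_dpm_force_performance_level`. The Python sketch below pins that attribute to a fixed level. The function name is hypothetical, the device directory (typically of the form `/sys/class/drm/card0/device`) is system-specific, writing to it requires root, and nothing in this thread confirms that either setting avoids the hang.

```python
from pathlib import Path

# Levels accepted by amdgpu's power_dpm_force_performance_level attribute.
VALID_LEVELS = {
    "auto", "low", "high", "manual",
    "profile_standard", "profile_min_sclk",
    "profile_min_mclk", "profile_peak",
}


def set_performance_level(device_dir, level):
    """Write a DPM performance level and return the previous one.

    device_dir is the sysfs device directory of the GPU (an assumption
    here: something like /sys/class/drm/card0/device).
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported performance level: {level!r}")
    node = Path(device_dir) / "power_dpm_force_performance_level"
    previous = node.read_text().strip()
    node.write_text(level + "\n")
    return previous
```

Keeping the returned previous level makes it straightforward to restore the original setting (usually `auto`) after a test run.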
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c11
Miroslav Beneš
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c12
--- Comment #12 from Takashi Iwai