http://bugzilla.opensuse.org/show_bug.cgi?id=1180742
Bug ID: 1180742
Summary: [amdgpu]An AMD Vega series GPU randomly crashes
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.2
Hardware: x86-64
OS: openSUSE Leap 15.2
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: srid@rkmail.ru
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---
Created attachment 844970
  --> http://bugzilla.opensuse.org/attachment.cgi?id=844970&action=edit
partial kernel log
The amdgpu kernel driver randomly hangs the GPU, usually under load, on Radeon VII hardware. The hang is relatively hard to hit: it usually takes 5 to 7 days before a crash. After a hang the driver attempts to reset the GPU, but sometimes the reset fails and the system stays partly unresponsive. It can still be reached over the network and there is some reaction to keyboard events, but the display stays dead. The hang also seems to drop the PCIe bus to 1.0 mode, where it stays until reboot.
There's an upstream bug open that may be related: https://gitlab.freedesktop.org/drm/amd/-/issues/716
This particular GPU works fine in a Windows machine.
openSUSE Leap 15.2, kernel 5.3.18-lp152.57-default #1 SMP Fri Dec 4 07:27:58 UTC 2020 (7be5551)
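One way to check the PCIe downgrade described above is to compare the link speed the kernel currently reports for the GPU with the maximum the device supports. The following is a minimal Python sketch, not a verified diagnostic. It assumes the GPU is reachable as `/sys/class/drm/card0/device` (the card index is an assumption) and that the kernel exposes the standard `current_link_speed` and `max_link_speed` sysfs attributes. All function names are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_link_speed(text: str) -> Optional[float]:
    """Parse a sysfs link-speed string such as '8.0 GT/s PCIe' into GT/s.

    Returns None for anything that does not look like '<number> GT/s ...',
    for example 'Unknown speed'.
    """
    parts = text.strip().split()
    if len(parts) < 2 or parts[1] != "GT/s":
        return None
    try:
        return float(parts[0])
    except ValueError:
        return None


def read_link_speeds(device_dir: Path) -> Tuple[Optional[float], Optional[float]]:
    """Return (current, maximum) link speed in GT/s; None where unreadable."""
    speeds = []
    for name in ("current_link_speed", "max_link_speed"):
        try:
            speeds.append(parse_link_speed((device_dir / name).read_text()))
        except OSError:
            speeds.append(None)
    return speeds[0], speeds[1]


def is_downgraded(current: Optional[float], maximum: Optional[float]) -> bool:
    """True only when both speeds are known and the current one is lower."""
    if current is None or maximum is None:
        return False
    return current < maximum
```

A call such as `is_downgraded(*read_link_speeds(Path("/sys/class/drm/card0/device")))` would flag a link stuck at 2.5 GT/s (PCIe 1.0) on a faster slot. The driver may also lower the link speed at idle as part of normal power management, and the relevant link may belong to an upstream bridge rather than the GPU function itself, so a positive result is only a hint.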
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c2
--- Comment #2 from Iakov Karpov srid@rkmail.ru ---
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on others sometimes, too. Unfortunately there is no fix for this and likely not for Leap 15.2 kernel.
>
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.
Kernel 5.3.18-100.g3524980 from Kernel:SLE15-SP3 won't boot on this machine: it gets stuck right after the bootloader, with not a single line on screen after "loading initrd".
Testing with Kernel:stable may require some time.
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c4
--- Comment #4 from Iakov Karpov srid@rkmail.ru ---
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on others sometimes, too. Unfortunately there is no fix for this and likely not for Leap 15.2 kernel.
>
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.
I've been testing kernel 5.10.6-3.g183dcff-default from Kernel:stable for almost 14 days now without a single crash.
(In reply to Takashi Iwai from comment #3)
> That's bad. Do you have the secure boot enabled? If so, disable it when you test a kernel from OBS repo that is other than the official release.
I'm on kernel 5.3.18-107.g0b709ea-default from Kernel:SLE15-SP3 now, and it works for me. I didn't change anything about secure boot, though; I don't think I had it enabled. I'll report back in another 2 weeks, or sooner if it crashes.
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c5
Iakov Karpov srid@rkmail.ru changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(srid@rkmail.ru)   |
--- Comment #5 from Iakov Karpov srid@rkmail.ru ---
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on others sometimes, too. Unfortunately there is no fix for this and likely not for Leap 15.2 kernel.
>
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS Kernel:SLE15-SP3? The latter contains the backport of DRM stack up to 5.9.x.
It crashed on the 12th day with 5.3.18-107.g0b709ea-default (Kernel:SLE15-SP3).
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c6
--- Comment #6 from Iakov Karpov srid@rkmail.ru ---
Created attachment 845864
  --> http://bugzilla.opensuse.org/attachment.cgi?id=845864&action=edit
Partial kernel log of 5.3.18-107.g0b709ea-default
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c10
--- Comment #10 from Iakov Karpov srid@rkmail.ru ---
(In reply to Miroslav Beneš from comment #9)
> Still not resolved in upstream according to the reports. Might be worked around by disabling the dynamic power management of the GPU or by the GPU frequency throttling manipulation.
>
> Iakov, by any chance, would the latest kernel from Leap 15.4 or the latest kernel from OBS Kernel:stable:Backport work better for you? Leap 15.2 is not supported anymore, Leap 15.3 is probably not better if I read your feedback correctly. Leap 15.4 will be based on v5.14 kernel.
I'm currently using Leap 15.3 with kernel 5.15.13 from Kernel:stable:Backport. It's better, but it still crashes sometimes. With 5.16.x kernels my system crashes every few minutes, but I'm not sure the GPU is the cause there. I was not able to recover any crash logs, so I have not filed a bug report for that.
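The frequency workaround mentioned in comment #9 can be approximated by pinning the amdgpu performance level through sysfs instead of leaving it on `auto`. The following is a minimal Python sketch. It assumes the GPU is `card0` (an assumption) and that the driver exposes the `power_dpm_force_performance_level` attribute. The helper names are hypothetical, writing the attribute requires root, and whether a pinned level avoids the hang on a Radeon VII has not been tested here. Disabling dynamic power management entirely would be a driver option set at boot time and is not shown.

```python
from pathlib import Path

ATTRIBUTE = "power_dpm_force_performance_level"

# Subset of the values the attribute accepts; the driver also knows
# several profile_* levels that are left out of this sketch.
KNOWN_LEVELS = ("auto", "low", "high", "manual")


def read_performance_level(device_dir: Path) -> str:
    """Return the currently forced performance level, e.g. 'auto'."""
    return (device_dir / ATTRIBUTE).read_text().strip()


def set_performance_level(device_dir: Path, level: str) -> str:
    """Write a new performance level and return the previous one.

    The previous value is returned so the caller can restore it later.
    """
    if level not in KNOWN_LEVELS:
        raise ValueError(f"unsupported level {level!r}, expected one of {KNOWN_LEVELS}")
    previous = read_performance_level(device_dir)
    (device_dir / ATTRIBUTE).write_text(level + "\n")
    return previous
```

For example, `set_performance_level(Path("/sys/class/drm/card0/device"), "high")` would pin the clocks at their highest state until the value is written back to `auto` or the machine is rebooted.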
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c11
Miroslav Beneš mbenes@suse.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |patrik.jakobsson@suse.com,
                   |                            |tzimmermann@suse.com
--- Comment #11 from Miroslav Beneš mbenes@suse.com ---
Thanks for the feedback. I'll leave the bug open and will occasionally monitor it.
CCing Patrik and Thomas so that they are aware, but I am not sure if we can do anything here besides waiting for upstream.
http://bugzilla.opensuse.org/show_bug.cgi?id=1180742#c12
--- Comment #12 from Takashi Iwai tiwai@suse.com ---
One thing that might be worth trying is updating kernel-firmware-amdgpu from the OBS Kernel:stable:Backport repo (if not done yet).
kernel-bugs@lists.opensuse.org