[kernel-bugs] [Bug 1178318] New: RX 5600M shader clocks stuck at 300MHz with kernel 5.9.1
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318

            Bug ID: 1178318
           Summary: RX 5600M shader clocks stuck at 300MHz with kernel 5.9.1
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: x86-64
                OS: openSUSE Tumbleweed
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: aaron.zakhrov@gmail.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created attachment 843188
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843188&action=edit
Unigine Heaven with GALLIUM_HUD=shader-clock and DRI_PRIME=1 set

With kernel 5.9.1-default on my Dell G5 15 SE running openSUSE Tumbleweed, the shader clock of my discrete Radeon RX 5600M is stuck at 300MHz. The relevant upstream bug reports are:
https://gitlab.freedesktop.org/drm/amd/-/issues/1272
https://gitlab.freedesktop.org/drm/amd/-/issues/1252

This happens regardless of GPU load: glxgears, vkcube and Unigine Heaven all show the same behavior, both plugged in and on battery. The clock is not being misreported either; I get very slow graphics performance on the discrete GPU when using kernel 5.9.1.

-- 
You are receiving this mail because:
You are the assignee for the bug.
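The report reads the shader clock through GALLIUM_HUD. A cross-check that does not depend on Mesa is amdgpu's pp_dpm_sclk sysfs file, which lists the shader clock (sclk) DPM levels one per line and marks the active level with an asterisk. The sketch below parses that format. The card1 path is an assumption: the card index of the discrete GPU differs between systems.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Assumed location; the card index of the discrete GPU varies per system.
SCLK_PATH = Path("/sys/class/drm/card1/device/pp_dpm_sclk")

# One DPM level per line, e.g. "1: 800Mhz *"; the '*' marks the active level.
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_pp_dpm_sclk(text: str) -> Tuple[List[int], Optional[int]]:
    """Return (clock of each level in MHz, active clock in MHz or None)."""
    levels: List[int] = []
    active: Optional[int] = None
    for line in text.splitlines():
        m = LEVEL_RE.match(line)
        if not m:
            continue  # ignore blank or unrecognized lines
        mhz = int(m.group(2))
        levels.append(mhz)
        if m.group(3):
            active = mhz
    return levels, active


if __name__ == "__main__":
    if SCLK_PATH.exists():
        levels, active = parse_pp_dpm_sclk(SCLK_PATH.read_text())
        print(f"sclk levels: {levels} MHz, active: {active} MHz")
    else:
        print(f"{SCLK_PATH} not found; adjust the card index")
```

Sampling this file repeatedly while Unigine Heaven runs should show whether the active level ever leaves the lowest entry.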
Aaron Dominick <aaron.zakhrov@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P2 - High
                 CC|                            |aaron.zakhrov@gmail.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c1

--- Comment #1 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
As far as I can tell, this seems to be a configuration mismatch in how the RPM is built. The 5.9.1 kernel compiled with make localmodconfig runs the discrete GPU at the clock speeds AMD specifies (800-1700MHz, with a maximum of 1750MHz when the high power profile is forced).
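Comment #1 does not say how the high power profile was forced. One common way, assumed here and not necessarily what the reporter did, is amdgpu's power_dpm_force_performance_level sysfs knob, which accepts values such as auto, low and high. The sketch writes that knob in a given device directory and returns the previous value so it can be restored; the helper name force_perf_level is hypothetical.

```python
from pathlib import Path

# Subset of the levels amdgpu accepts for power_dpm_force_performance_level;
# the driver also understands "manual" and several profile_* values.
ALLOWED_LEVELS = {"auto", "low", "high"}


def force_perf_level(device_dir: Path, level: str) -> str:
    """Write `level` to device_dir/power_dpm_force_performance_level.

    Returns the previous level so the caller can restore it later.
    On a real system this needs root privileges.
    """
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    knob = device_dir / "power_dpm_force_performance_level"
    previous = knob.read_text().strip()
    knob.write_text(level)
    return previous
```

For example, calling force_perf_level(Path("/sys/class/drm/card1/device"), "high") as root (the card index is again an assumption), then calling it with the returned value restores the previous setting.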
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c2

--- Comment #2 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843211
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843211&action=edit
GLXGears
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c4

--- Comment #4 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
(In reply to Takashi Iwai from comment #3)
> (In reply to Aaron Dominick from comment #1)
> > As far as I can tell, this seems to be a configuration mismatch in how the RPM is built. The 5.9.1 kernel compiled with make localmodconfig runs the discrete GPU at the clock speeds AMD specifies (800-1700MHz, with a maximum of 1750MHz when the high power profile is forced).
>
> What exactly did you do with localmodconfig? Did you rebuild the kernel based on the TW 5.9.1 kernel config, or on the config carried over from the older 5.8.x?
>
> In any case, please give the kernel config and dmesg output of both the good and the bad 5.9.1 kernels, so that we can compare them in detail.

I built it based on the Tumbleweed 5.9.1 kernel. Attached are the Tumbleweed default kernel configuration and boot log.
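Comparing the good and bad configs can be done symbol by symbol. The kernel tree ships scripts/diffconfig for this; the Python sketch below is a standalone stand-in, assuming both files use the standard Kconfig .config format (CONFIG_X=value lines and "# CONFIG_X is not set" comments).

```python
import re
import sys
from typing import Dict, Optional, Tuple

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text: str) -> Dict[str, str]:
    """Map each CONFIG_ symbol to its value; '# ... is not set' maps to 'n'."""
    opts: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = SET_RE.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def diff_configs(old: str, new: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {symbol: (old_value, new_value)} for every symbol that differs.

    A symbol missing from one file is reported as None on that side.
    """
    a, b = parse_config(old), parse_config(new)
    return {
        sym: (a.get(sym), b.get(sym))
        for sym in sorted(a.keys() | b.keys())
        if a.get(sym) != b.get(sym)
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_old, open(sys.argv[2]) as f_new:
        for sym, (o, n) in diff_configs(f_old.read(), f_new.read()).items():
            print(f"{sym}: {o} -> {n}")
```

Run as `python3 diffcfg.py bad.config good.config` (the script name is a placeholder); filtering the output for DRM_AMDGPU, ACPI and PM symbols would narrow the search.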
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c5

--- Comment #5 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843290
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843290&action=edit
boot log
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c6

--- Comment #6 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843291
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843291&action=edit
Default 5.9.1 configuration
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c7

--- Comment #7 from Takashi Iwai <tiwai@suse.com> ---
Thanks. Are the above from your kernel? Could you reinstall the original openSUSE kernel, reproduce the issue, and attach the dmesg output from there?

For the next build, I recommend changing CONFIG_LOCALVERSION to a different suffix so that you can install your kernel and the original kernel side by side.
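The kernel tree's scripts/config helper can set the suffix from the shell (`scripts/config --set-str LOCALVERSION -custom`). The Python sketch below performs the same text edit on a .config file; the "-custom" suffix and the helper name set_localversion are placeholders.

```python
import re

# Matches the LOCALVERSION line itself, but not CONFIG_LOCALVERSION_AUTO.
LOCALVERSION_RE = re.compile(
    r"^(?:CONFIG_LOCALVERSION=.*|# CONFIG_LOCALVERSION is not set)$", re.MULTILINE
)


def set_localversion(config_text: str, suffix: str) -> str:
    """Return config_text with CONFIG_LOCALVERSION set to `suffix`.

    Replaces an existing LOCALVERSION line, or appends one if none exists.
    """
    new_line = f'CONFIG_LOCALVERSION="{suffix}"'
    if LOCALVERSION_RE.search(config_text):
        # A function replacement keeps backslashes in `suffix` from being interpreted.
        return LOCALVERSION_RE.sub(lambda _m: new_line, config_text, count=1)
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + new_line + "\n"
```

After writing the result back to .config, running `make olddefconfig` lets Kconfig re-validate the file. The suffix becomes part of the kernel release string, so the custom build's /lib/modules directory no longer collides with the distribution kernel's.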
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c8

--- Comment #8 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
(In reply to Takashi Iwai from comment #7)
> Thanks. Are the above from your kernel? Could you reinstall the original openSUSE kernel, reproduce the issue, and attach the dmesg output from there?
>
> For the next build, I recommend changing CONFIG_LOCALVERSION to a different suffix so that you can install your kernel and the original kernel side by side.

That is from the original openSUSE kernel (5.9.1-default) that came with the kernel-default RPM package.
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c9

--- Comment #9 from Takashi Iwai <tiwai@suse.com> ---
OK, then those are the results from the failing kernel, right? Now we need the results from the working state, i.e. from your own kernel.
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c10

--- Comment #10 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843300
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843300&action=edit
Boot log with 5.9.2-localmodconfig
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c11

--- Comment #11 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843301
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843301&action=edit
config with localmodconfig
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c12

--- Comment #12 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
(In reply to Takashi Iwai from comment #9)
> OK, then those are the results from the failing kernel, right? Now we need the results from the working state, i.e. from your own kernel.

I've attached the files for the working state. The kernel compiled with localmodconfig does not show the clock issue, but it only includes the modules that were loaded when the configuration was generated.
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c13

--- Comment #13 from Takashi Iwai <tiwai@suse.com> ---
Was it really the pure result of make localmodconfig? I ask because the logs show more differences: e.g. the EDAC messages are gone on your kernel, and some RCU-related messages are gone as well. The elantech i2c support might also be missing.

Please double-check whether the same modules are loaded on both kernels.
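Checking whether the same modules are loaded amounts to a set difference over the first column of two saved lsmod outputs. One caveat: lsmod lists only loadable modules, so a driver built into one kernel (=y) will not appear there even though it is present. The helper names in this sketch are hypothetical.

```python
import sys
from typing import Set, Tuple


def lsmod_modules(text: str) -> Set[str]:
    """Return the module names listed in `lsmod` output."""
    names: Set[str] = set()
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":
            continue  # skip blank lines and the "Module Size Used by" header
        names.add(fields[0])
    return names


def compare_lsmod(bad: str, good: str) -> Tuple[Set[str], Set[str]]:
    """Return (modules only in the bad kernel, modules only in the good kernel)."""
    a, b = lsmod_modules(bad), lsmod_modules(good)
    return a - b, b - a


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_bad, open(sys.argv[2]) as f_good:
        only_bad, only_good = compare_lsmod(f_bad.read(), f_good.read())
    print("only in bad kernel: ", " ".join(sorted(only_bad)))
    print("only in good kernel:", " ".join(sorted(only_good)))
```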
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c14

--- Comment #14 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843307
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843307&action=edit
Bad kernel lsmod

This is the lsmod output from kernel 5.9.1-1-default, the default openSUSE kernel.
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c15

--- Comment #15 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843308
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843308&action=edit
Good kernel lsmod

This is from kernel 5.9.2-localmodconfig, which I built with make localmodconfig.
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c16

--- Comment #16 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
(In reply to Takashi Iwai from comment #13)
> Was it really the pure result of make localmodconfig? I ask because the logs show more differences: e.g. the EDAC messages are gone on your kernel, and some RCU-related messages are gone as well. The elantech i2c support might also be missing.
>
> Please double-check whether the same modules are loaded on both kernels.

It seems to be a conflict between openSUSE's kernel configuration and the ACPI-D3Cold patch mentioned here: https://gitlab.freedesktop.org/drm/amd/-/issues/1252

When I manually built a 5.9-rc7 kernel with that patch applied, I saw the same behavior. On kernel 5.8.x, before the patch was applied, the discrete GPU's shader clock would drop from 800MHz to 300-400MHz under Unigine Heaven or similarly intensive workloads, while lighter workloads like vkcube and glxgears stayed at 800MHz.

I am not sure which downstream patch causes this, but the issue seems to go away on kernel 5.10-rc2 built with olddefconfig. I can share those logs and config files as well.
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c17

--- Comment #17 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843312
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843312&action=edit
5.10-rc2 bootlog
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c18

--- Comment #18 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Created attachment 843313
  --> http://bugzilla.opensuse.org/attachment.cgi?id=843313&action=edit
5.10-rc2 config
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c19

--- Comment #19 from Takashi Iwai <tiwai@suse.com> ---
There is a 5.10-rc kernel package in the OBS Kernel:HEAD repo. Could you check with it, too?
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c20

--- Comment #20 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
(In reply to Takashi Iwai from comment #19)
> There is a 5.10-rc kernel package in the OBS Kernel:HEAD repo. Could you check with it, too?

The kernel from OBS Kernel:HEAD works fine.
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c21

--- Comment #21 from Takashi Iwai <tiwai@suse.com> ---
Thanks for the confirmation. So it's indeed fixed in the 5.10 tree, which implies that the issue isn't about the config; the real root cause was something else that could be "fixed" properly :) The question is what the fix is...
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c22

--- Comment #22 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
(In reply to Takashi Iwai from comment #21)
> Thanks for the confirmation. So it's indeed fixed in the 5.10 tree, which implies that the issue isn't about the config; the real root cause was something else that could be "fixed" properly :)
>
> The question is what the fix is...

I am not entirely sure myself, but it seems to be related to either the ACPI or the AMDGPU power management code. On kernel 5.8.15 (which I don't have anymore, unfortunately) the shader clocks would run between 300MHz and 800MHz depending on GPU load. On 5.9.x kernels the clock runs at a flat 300MHz, and on kernel 5.10 it runs at the proper "game" clock speed between 800MHz and 1750MHz.

I think it is one of the additional SUSE patches. The changelog for 5.10 says they have been removed, and two of those relate to AMDGPU. Maybe that is what fixed it? Either that or one of the new config options. The downstream patches should not have affected my manually built kernel from kernel.org.

In any case, I think we can close this bug since it seems to be fixed in 5.10.
http://bugzilla.opensuse.org/show_bug.cgi?id=1178318#c23

Aaron Dominick <aaron.zakhrov@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #23 from Aaron Dominick <aaron.zakhrov@gmail.com> ---
Can confirm that all shader clock speed problems are fixed in the 5.10 series. I'm still not sure why 5.9.x behaves the way it does.