Bug ID | 1190854 |
---|---|
Summary | amdgpu fails to probe Radeon Pro WX3200 |
Classification | openSUSE |
Product | openSUSE Tumbleweed |
Version | Current |
Hardware | x86-64 |
OS | openSUSE Tumbleweed |
Status | NEW |
Severity | Major |
Priority | P5 - None |
Component | Kernel |
Assignee | kernel-bugs@opensuse.org |
Reporter | grasland@lal.in2p3.fr |
QA Contact | qa-bugs@suse.de |
Found By | --- |
Blocker | --- |
Created attachment 852729 [details]
dmesg log, with ANSI color codes
On a Dell Precision 3650 equipped with a Radeon Pro WX3200, something very
wrong happens during the amdgpu initialization process (or perhaps before, see
below), which makes the card unusable on Linux. Thus, I have to fall back to
the CPU's built-in graphics for the time being on this machine.
The GPU is brand new and works on both the UEFI setup screen and Windows, which
suggests to me that the hardware is not completely broken. But of course, this
does not fully exclude the possibility of a hardware issue that would only be
triggered by the Linux initialization process.
You will find attached the dmesg log, with and without ANSI color codes, as I
personally find dmesg's color output much easier to parse visually.
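For anyone who only has the colored attachment at hand, the color codes can be stripped after the fact. Below is a minimal sketch, assuming the log only contains standard ANSI CSI escape sequences; the helper name strip_ansi and the sample string are made up for illustration and are not taken from the attachment.

```python
import re

# Matches ANSI CSI sequences such as "\x1b[32m" (set color) or "\x1b[0m" (reset):
# ESC, "[", optional parameter bytes, optional intermediate bytes, one final byte.
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Return text with ANSI CSI escape sequences removed."""
    return ANSI_CSI.sub("", text)


if __name__ == "__main__":
    # Synthetic example line; the real colors used by dmesg may differ.
    sample = "\x1b[32m[    3.484654] \x1b[0m\x1b[33m[drm]\x1b[0m amdgpu kernel modesetting enabled."
    print(strip_ansi(sample))
```

Brackets that belong to the log itself (timestamps, "[drm]") are left alone, because the pattern only matches a bracket that directly follows an ESC byte.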
---
Let's have a closer look at the amdgpu logs, since that's what I'm directly
interested in here.
First we get the standard stuff, nothing special here as far as I can see. The
CRAT table bit does look suspicious, but it also happens during the init of my
home Radeon RX 560 that otherwise works fine under Linux, so this does not seem
to be a fatal issue...
[ 3.484654] [drm] amdgpu kernel modesetting enabled.
[ 3.484700] amdgpu: CRAT table not found
[ 3.484701] amdgpu: Virtual CRAT table created for CPU
[ 3.484707] amdgpu: Topology: Add CPU node
...then we get a power management failure, which may or may not be related to
the problem at hand...
[ 3.484764] amdgpu 0000:01:00.0: can't change power state from D3hot to D0 (config space inaccessible)
...then more "normal" stuff, in the sense of stuff that looks close to what I
see in the init logs of amdgpu on my RX 560...
[ 3.484888] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x6981 0x1028:0x2B0D 0x10).
[ 3.484890] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 3.484895] [drm] register mmio base: 0x6E900000
[ 3.484896] [drm] register mmio size: 262144
[ 3.484898] [drm] add ip block number 0 <vi_common>
[ 3.484899] [drm] add ip block number 1 <gmc_v8_0>
[ 3.484899] [drm] add ip block number 2 <tonga_ih>
[ 3.484900] [drm] add ip block number 3 <gfx_v8_0>
[ 3.484900] [drm] add ip block number 4 <sdma_v3_0>
[ 3.484901] [drm] add ip block number 5 <powerplay>
[ 3.484901] [drm] add ip block number 6 <dm>
[ 3.484902] [drm] add ip block number 7 <uvd_v6_0>
[ 3.484902] [drm] add ip block number 8 <vce_v3_0>
[ 3.484913] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from VFCT
[ 3.484914] amdgpu: ATOM BIOS: 113-D0155200-100
[ 3.484932] [drm] vm size is 256 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
...however, after this the driver tries to set up MMIO to communicate with the
device, and here is where I think things go very wrong.
[ 3.485705] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x6100000000-0x61001fffff 64bit pref]
[ 3.485708] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x6000000000-0x60ffffffff 64bit pref]
[ 3.485709] [drm:amdgpu_device_resize_fb_bar [amdgpu]] *ERROR* Problem resizing BAR0 (-16).
[ 3.485811] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x6000000000-0x60ffffffff 64bit pref]
[ 3.485831] amdgpu 0000:01:00.0: BAR 0: error updating (0x00000c != 0xffffffff)
[ 3.485847] amdgpu 0000:01:00.0: BAR 0: error updating (high 0x000060 != 0xffffffff)
[ 3.485863] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x6100000000-0x61001fffff 64bit pref]
[ 3.485864] amdgpu 0000:01:00.0: BAR 2: error updating (0x00000c != 0xffffffff)
[ 3.485880] amdgpu 0000:01:00.0: BAR 2: error updating (high 0x000061 != 0xffffffff)
[ 3.485901] amdgpu 0000:01:00.0: amdgpu: VRAM: 4294967295M 0x000000FFFF000000 - 0x001000FFFEEFFFFF (4294967295M used)
[ 3.485903] amdgpu 0000:01:00.0: amdgpu: GART: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[ 3.485905] [drm] Detected VRAM RAM=4294967295M, BAR=4096M
[ 3.485906] [drm] RAM width 64bits UNKNOWN
[ 3.485913] [drm] amdgpu: 4294967295M of VRAM memory ready
[ 3.485914] [drm] amdgpu: 48026M of GTT memory ready.
[ 3.485916] [drm] GART: num cpu pages 65536, num gpu pages 65536
Failing to set up PCI BARs sounds bad, and concluding from this failed setup
process that the GPU has 4 PiB of VRAM sounds even worse. I wouldn't be
surprised if this were the point where things became truly hopeless.
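The odd numbers in this excerpt all share one pattern. The BAR readbacks are 0xffffffff, and the reported VRAM size of 4294967295M is the same 32-bit all-ones value interpreted as a count of MiB. Reads from a PCI device that does not respond usually come back as all ones, which would be consistent with the earlier "config space inaccessible" message, although the log alone does not prove that this is the cause. The sketch below only does the arithmetic; the helper names is_all_ones and mib_to_pib are made up for illustration.

```python
MIB = 2**20
PIB = 2**50


def is_all_ones(value: int, width_bits: int = 32) -> bool:
    """True if value has every bit set for the given register width."""
    return value == (1 << width_bits) - 1


def mib_to_pib(mib: int) -> float:
    """Convert a size in MiB to PiB."""
    return mib * MIB / PIB


if __name__ == "__main__":
    # Values copied from the log excerpt above.
    readbacks = {
        "BAR 0 readback": 0xFFFFFFFF,
        "BAR 0 expected": 0x0000000C,
        "VRAM size (MiB)": 4294967295,
    }
    for label, value in readbacks.items():
        print(f"{label}: {value:#010x} all-ones={is_all_ones(value)}")
    print(f"Reported VRAM: {mib_to_pib(4294967295):.2f} PiB")
```

So the "4 PiB" figure is probably not a real measurement at all, just what the driver computes when the size register reads back as all ones.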
But amdgpu is not the kind of program that will stop cleanly when the first
error occurs, so it will first wait for an event that doesn't want to happen...
[ 3.600399] amdgpu 0000:01:00.0: amdgpu: Wait for MC idle timedout !
[...]
[ 3.714578] amdgpu 0000:01:00.0: amdgpu: Wait for MC idle timedout !
...then emits a few more messages that also appear during the successful
initialization process of my home RX 560, and which I therefore consider
"normal"...
[ 3.715166] [drm] PCIE GART of 256M enabled (table at 0x000000FFFF900000).
[ 3.716221] [drm] Chained IB support enabled!
[ 3.721721] amdgpu: hwmgr_sw_init smu backed is polaris10_smu
...and only then does it die brutally.
[ 3.726521] amdgpu:
last message was failed ret is 65535
[ 3.726522] amdgpu:
failed to send message 100 ret is 65535
[ 3.726525] amdgpu: SMC address must be 4 byte aligned.
[ 3.726525] amdgpu: [AVFS][Polaris10_SetupGfxLvlStruct] Problems copying VRConfig value over to SMC
[ 3.726526] amdgpu: [AVFS][Polaris10_AVFSEventMgr] Could not Copy Graphics Level table over to SMU
[ 3.726565] amdgpu:
last message was failed ret is 65535
[ 3.726566] amdgpu:
failed to send message 252 ret is 65535
[ 3.726566] amdgpu:
last message was failed ret is 65535
[ 3.726567] amdgpu:
failed to send message 253 ret is 65535
[ 3.726569] amdgpu:
last message was failed ret is 65535
[ 3.726570] amdgpu:
failed to send message 250 ret is 65535
[ 3.726571] amdgpu:
last message was failed ret is 65535
[ 3.726571] amdgpu:
failed to send message 251 ret is 65535
[ 3.726572] amdgpu:
last message was failed ret is 65535
[ 3.726573] amdgpu:
failed to send message 254 ret is 65535
[ 3.861852] [drm] Timeout wait for RLC serdes 0,0
[...]
[ 3.975981] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 3.976112] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -110
[ 3.976241] amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[ 3.976242] amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[ 3.976244] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
[ 3.977567] amdgpu: probe of 0000:01:00.0 failed with error -110
[ 3.977650] [drm] amdgpu: ttm finalized
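The negative numbers in these messages are Linux errno values: -16 from the BAR resize is EBUSY, and the -110 that finally fails the probe is ETIMEDOUT. The "ret is 65535" values are 0xFFFF, the 16-bit version of the all-ones pattern. Here is a small sketch that decodes them; the table is hard-coded with just the two codes seen in this log so that it does not depend on the platform running it, and decode_kernel_error is a made-up helper name.

```python
# Linux errno values for the two codes that appear in this log
# (EBUSY = 16, ETIMEDOUT = 110 in the kernel's generic errno headers).
LINUX_ERRNO = {
    16: ("EBUSY", "Device or resource busy"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_kernel_error(code: int) -> str:
    """Describe a (usually negative) kernel return code using the small table above."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this small table"))
    return f"{code}: {name} ({text})"


if __name__ == "__main__":
    for code in (-16, -110):
        print(decode_kernel_error(code))
    # The SMU "ret" value is all ones in 16 bits.
    print(f"65535 == 0xFFFF: {65535 == 0xFFFF}")
```

Read this way, the final failure is a timeout waiting on the hardware rather than a distinct error in the gfx block itself, which fits the idea that the device stopped answering much earlier.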
---
So, to summarize, it seems something goes wrong at the MMIO setup stage, which
may or may not be related to an earlier power management failure. My question
is, how can I investigate this further?
Here are some trial-and-error fixes I have already tried, to no avail:
- Upgrade the UEFI firmware. It's now at the latest release, and it did not
change a thing.
- Attempt to disable ASPM in case PCIe power management could be the issue (I
found reports of an ASPM-related nvidia driver crash that started with the same
ACPI error message at the beginning). Does not change anything here, rolled
back.
- Attempt to downgrade to an earlier amdgpu firmware, as this fixed a
similar-looking crash on Arch. Did not help here, went back to the current AMD
firmware.
- Attempt to set the AMD GPU as the primary GPU in UEFI. This does not fix
amdgpu initialization, and it breaks i915 initialization, which is definitely
worse. Rolled back.
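One further check that might help narrow this down is whether the card's PCI config space is readable at all once the system is up, independently of amdgpu. The following is a minimal sketch, assuming the standard sysfs layout where /sys/bus/pci/devices/<address>/config exposes the raw config space; the helper names are made up, and the address and expected IDs (0x1002:0x6981) are the ones from the log above.

```python
import struct
from pathlib import Path


def read_ids(header: bytes) -> tuple[int, int]:
    """Return (vendor_id, device_id) from the first 4 bytes of PCI config space."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes of config space")
    return struct.unpack_from("<HH", header, 0)


def config_space_accessible(header: bytes) -> bool:
    """A vendor ID of 0xFFFF means the device did not answer the read."""
    vendor, _device = read_ids(header)
    return vendor != 0xFFFF


if __name__ == "__main__":
    # Device address taken from the dmesg log; adjust for other machines.
    config = Path("/sys/bus/pci/devices/0000:01:00.0/config")
    if config.exists():
        try:
            header = config.read_bytes()[:64]
            vendor, device = read_ids(header)
            print(f"vendor={vendor:#06x} device={device:#06x} "
                  f"accessible={config_space_accessible(header)}")
        except (OSError, ValueError) as exc:
            print(f"could not read {config}: {exc}")
    else:
        print(f"{config} not found on this machine")
```

If this prints 0xffff for the vendor ID before amdgpu is even loaded (for example with the module blacklisted), the problem is more likely in the platform's PCIe or power handling than in the driver.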