[Bug 1190854] New: amdgpu fails to probe Radeon Pro WX3200

24 Sep 2021

      http://bugzilla.opensuse.org/show_bug.cgi?id=1190854

            Bug ID: 1190854
           Summary: amdgpu fails to probe Radeon Pro WX3200
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: x86-64
                OS: openSUSE Tumbleweed
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: grasland@lal.in2p3.fr
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created attachment 852729
  --> http://bugzilla.opensuse.org/attachment.cgi?id=852729&action=edit
dmesg log, with ANSI color codes

On a Dell Precision 3650 equipped with a Radeon Pro WX3200, something very
wrong happens during the amdgpu initialization process (or perhaps before, see
below), which makes the card unusable on Linux. Thus, I have to fall back to
the CPU's built-in graphics for the time being on this machine.

The GPU is brand new and works on both the UEFI setup screen and Windows, which
suggests to me that the hardware is not completely broken. But of course, this
does not fully exclude the possibility of a hardware issue that would only be
triggered by the Linux initialization process.

You will find attached the dmesg log, with and without ANSI color codes, as I
personally find dmesg's color output much easier to visually parse.

---

Let's have a closer look at the amdgpu logs, since that's what I'm directly
interested in here.

First we get the standard stuff, nothing special here as far as I can see. The
CRAT table bit does look suspicious, but it also happens during the init of my
home Radeon RX 560, which otherwise works fine under Linux, so this does not
seem to be a fatal issue...

[    3.484654] [drm] amdgpu kernel modesetting enabled.
[    3.484700] amdgpu: CRAT table not found
[    3.484701] amdgpu: Virtual CRAT table created for CPU
[    3.484707] amdgpu: Topology: Add CPU node

...then we get a power management failure, which may or may not be related to
the problem at hand...

[    3.484764] amdgpu 0000:01:00.0: can't change power state from D3hot to D0 (config space inaccessible)

...then more "normal" stuff, in the sense of stuff that looks close to what I
see in the init logs of amdgpu on my RX 560...

[    3.484888] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x6981 0x1028:0x2B0D 0x10).
[    3.484890] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    3.484895] [drm] register mmio base: 0x6E900000
[    3.484896] [drm] register mmio size: 262144
[    3.484898] [drm] add ip block number 0 <vi_common>
[    3.484899] [drm] add ip block number 1 <gmc_v8_0>
[    3.484899] [drm] add ip block number 2 <tonga_ih>
[    3.484900] [drm] add ip block number 3 <gfx_v8_0>
[    3.484900] [drm] add ip block number 4 <sdma_v3_0>
[    3.484901] [drm] add ip block number 5 <powerplay>
[    3.484901] [drm] add ip block number 6 <dm>
[    3.484902] [drm] add ip block number 7 <uvd_v6_0>
[    3.484902] [drm] add ip block number 8 <vce_v3_0>
[    3.484913] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from VFCT
[    3.484914] amdgpu: ATOM BIOS: 113-D0155200-100
[    3.484932] [drm] vm size is 256 GB, 2 levels, block size is 10-bit, fragment size is 9-bit

...however, after this the driver tries to set up MMIO to communicate with the
device, and here is where I think things go very wrong.

[    3.485705] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x6100000000-0x61001fffff 64bit pref]
[    3.485708] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x6000000000-0x60ffffffff 64bit pref]
[    3.485709] [drm:amdgpu_device_resize_fb_bar [amdgpu]] *ERROR* Problem resizing BAR0 (-16).
[    3.485811] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x6000000000-0x60ffffffff 64bit pref]
[    3.485831] amdgpu 0000:01:00.0: BAR 0: error updating (0x00000c != 0xffffffff)
[    3.485847] amdgpu 0000:01:00.0: BAR 0: error updating (high 0x000060 != 0xffffffff)
[    3.485863] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x6100000000-0x61001fffff 64bit pref]
[    3.485864] amdgpu 0000:01:00.0: BAR 2: error updating (0x00000c != 0xffffffff)
[    3.485880] amdgpu 0000:01:00.0: BAR 2: error updating (high 0x000061 != 0xffffffff)
[    3.485901] amdgpu 0000:01:00.0: amdgpu: VRAM: 4294967295M 0x000000FFFF000000 - 0x001000FFFEEFFFFF (4294967295M used)
[    3.485903] amdgpu 0000:01:00.0: amdgpu: GART: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[    3.485905] [drm] Detected VRAM RAM=4294967295M, BAR=4096M
[    3.485906] [drm] RAM width 64bits UNKNOWN
[    3.485913] [drm] amdgpu: 4294967295M of VRAM memory ready
[    3.485914] [drm] amdgpu: 48026M of GTT memory ready.
[    3.485916] [drm] GART: num cpu pages 65536, num gpu pages 65536

Failing to set up PCI BARs sounds bad, and concluding from this failed setup
process that the GPU has 4 PiB of VRAM sounds even worse. I wouldn't be
surprised if this were the point where things became truly hopeless.
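
The 4294967295M figure is suspicious in itself: it is exactly 0xFFFFFFFF, the
same all-ones pattern that the BAR readbacks returned. A short Python sketch
of the arithmetic follows. The variable names are mine, and treating the VRAM
size as an all-ones register read is my interpretation of the log, not
something the driver states.

```python
# Values copied from the dmesg excerpt above.
reported_vram_mib = 4294967295   # from "VRAM: 4294967295M"
bar_readback = 0xFFFFFFFF        # from "error updating (0x00000c != 0xffffffff)"

# The reported size matches the all-ones 32-bit pattern exactly.
is_all_ones = reported_vram_mib == bar_readback

# Convert MiB to PiB: 1 PiB = 2**30 MiB.
reported_vram_pib = reported_vram_mib / 2**30

print(is_all_ones)
print(round(reported_vram_pib, 3))
```

If this reading is right, the driver never obtained a real VRAM size and
simply took an all-ones read at face value.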

But amdgpu is not the kind of program that will stop cleanly when the first
error occurs, so it will first wait for an event that doesn't want to happen...

[    3.600399] amdgpu 0000:01:00.0: amdgpu: Wait for MC idle timedout !
[...]
[    3.714578] amdgpu 0000:01:00.0: amdgpu: Wait for MC idle timedout !

...then emits a few more messages that also appear during the successful
initialization process of my home RX 560, which I therefore consider
"normal"...

[    3.715166] [drm] PCIE GART of 256M enabled (table at 0x000000FFFF900000).
[    3.716221] [drm] Chained IB support enabled!
[    3.721721] amdgpu: hwmgr_sw_init smu backed is polaris10_smu

...and only then does it die brutally.

[    3.726521] amdgpu:
                last message was failed ret is 65535
[    3.726522] amdgpu:
                failed to send message 100 ret is 65535
[    3.726525] amdgpu: SMC address must be 4 byte aligned.
[    3.726525] amdgpu: [AVFS][Polaris10_SetupGfxLvlStruct] Problems copying VRConfig value over to SMC
[    3.726526] amdgpu: [AVFS][Polaris10_AVFSEventMgr] Could not Copy Graphics Level table over to SMU
[    3.726565] amdgpu:
                last message was failed ret is 65535
[    3.726566] amdgpu:
                failed to send message 252 ret is 65535
[    3.726566] amdgpu:
                last message was failed ret is 65535
[    3.726567] amdgpu:
                failed to send message 253 ret is 65535
[    3.726569] amdgpu:
                last message was failed ret is 65535
[    3.726570] amdgpu:
                failed to send message 250 ret is 65535
[    3.726571] amdgpu:
                last message was failed ret is 65535
[    3.726571] amdgpu:
                failed to send message 251 ret is 65535
[    3.726572] amdgpu:
                last message was failed ret is 65535
[    3.726573] amdgpu:
                failed to send message 254 ret is 65535
[    3.861852] [drm] Timeout wait for RLC serdes 0,0
[...]
[    3.975981] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[    3.976112] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -110
[    3.976241] amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[    3.976242] amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[    3.976244] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
[    3.977567] amdgpu: probe of 0000:01:00.0 failed with error -110
[    3.977650] [drm] amdgpu: ttm finalized
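
For reference, the negative numbers in these messages are Linux errno values.
Below is a minimal Python sketch that decodes the two codes seen in this log.
The table and the helper name `decode_errors` are my own; the table covers
only the codes that appear here, with values as defined in the Linux errno
headers.

```python
import re

# Subset of Linux errno values; only the codes seen in this log are listed.
LINUX_ERRNO = {16: "EBUSY", 110: "ETIMEDOUT"}

def decode_errors(line):
    """Return symbolic names for negative error codes found in a log line."""
    # Match "-<digits>" only when the minus sign is not glued to a preceding
    # word character, so address ranges such as 0x...-0x... are ignored.
    codes = re.findall(r"(?<!\w)-(\d+)\b", line)
    return [LINUX_ERRNO.get(int(code), "unknown") for code in codes]

print(decode_errors("*ERROR* Problem resizing BAR0 (-16)."))
print(decode_errors("amdgpu: probe of 0000:01:00.0 failed with error -110"))
```

So the BAR resize was refused because a resource was busy, and the final
probe failure is a timeout, which fits a device that never answers.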

---

So, to summarize, it seems something goes wrong at the MMIO setup stage, which
may or may not be related to an earlier power management failure. My question
is, how can I investigate this further?

Here is what I have already tried by trial and error, to no avail:

- Upgrade the UEFI firmware. It's now at the latest release, and it did not
change a thing.
- Attempt to disable ASPM in case PCIe power management could be the issue (I
found reports of an ASPM-related nvidia driver crash that started with the same
ACPI error message at the beginning). Does not change anything here, rolled
back.
- Attempt to downgrade to an earlier amdgpu firmware, as this fixed a
similar-looking crash on Arch. Did not help here, went back to the current AMD
firmware.
- Attempt to set the AMD GPU as the primary GPU in UEFI. This does not fix
amdgpu initialization, and it breaks i915 initialization, which is definitely
worse. Rolled back.
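
One further check that could be scripted is whether the card's PCI
configuration space is readable at all after boot. The sketch below reads the
vendor and device IDs from the standard sysfs `config` file for the address
seen in the log (0000:01:00.0). The helper names are mine. An all-ones result
would match the "config space inaccessible" message, but that is an
assumption about what the check would show, not something I have measured.

```python
from pathlib import Path

# PCI address taken from the dmesg excerpt; sysfs exposes config space here.
CONFIG_PATH = Path("/sys/bus/pci/devices/0000:01:00.0/config")

def parse_ids(header: bytes):
    """Return (vendor_id, device_id) from the first 4 bytes of config space."""
    vendor = int.from_bytes(header[0:2], "little")
    device = int.from_bytes(header[2:4], "little")
    return vendor, device

def looks_inaccessible(header: bytes) -> bool:
    """All-ones IDs usually mean the device did not answer the read."""
    return parse_ids(header) == (0xFFFF, 0xFFFF)

if __name__ == "__main__" and CONFIG_PATH.exists():
    header = CONFIG_PATH.read_bytes()[:4]
    vendor, device = parse_ids(header)
    print(f"vendor=0x{vendor:04x} device=0x{device:04x}")
    print("inaccessible" if looks_inaccessible(header) else "readable")
```

On a healthy card, this should print the 0x1002:0x6981 pair that the driver
reported during modesetting initialization.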

-- 
You are receiving this mail because:
You are on the CC list for the bug.

bugzilla_noreply＠suse.com