[Bug 1190854] New: amdgpu fails to probe Radeon Pro WX3200
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854

Bug ID: 1190854
Summary: amdgpu fails to probe Radeon Pro WX3200
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: openSUSE Tumbleweed
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: grasland@lal.in2p3.fr
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 852729
--> http://bugzilla.opensuse.org/attachment.cgi?id=852729&action=edit
dmesg log, with ANSI color codes

On a Dell Precision 3650 equipped with a Radeon Pro WX3200, something goes very wrong during the amdgpu initialization process (or perhaps before, see below), which makes the card unusable on Linux. For the time being, I have to fall back to the CPU's built-in graphics on this machine.

The GPU is brand new and works in both the UEFI setup screen and Windows, which suggests to me that the hardware is not completely broken. Of course, this does not fully exclude a hardware issue that is only triggered by the Linux initialization process.

You will find attached the dmesg log, with and without ANSI color codes, as I personally find dmesg's color output much easier to parse visually.

---

Let's have a closer look at the amdgpu logs, since that's what I'm directly interested in here.

First we get the standard stuff, nothing special here as far as I can see. The CRAT table bit does look suspicious, but it also happens during the init of my home Radeon RX 560, which otherwise works fine under Linux, so this does not seem to be a fatal issue...

[ 3.484654] [drm] amdgpu kernel modesetting enabled.
[ 3.484700] amdgpu: CRAT table not found
[ 3.484701] amdgpu: Virtual CRAT table created for CPU
[ 3.484707] amdgpu: Topology: Add CPU node

...then we get a power management failure, which may or may not be related to the problem at hand...

[ 3.484764] amdgpu 0000:01:00.0: can't change power state from D3hot to D0 (config space inaccessible)

...then more "normal" stuff, in the sense of stuff that looks close to what I see in the init logs of amdgpu on my RX 560...

[ 3.484888] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x6981 0x1028:0x2B0D 0x10).
[ 3.484890] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 3.484895] [drm] register mmio base: 0x6E900000
[ 3.484896] [drm] register mmio size: 262144
[ 3.484898] [drm] add ip block number 0 <vi_common>
[ 3.484899] [drm] add ip block number 1 <gmc_v8_0>
[ 3.484899] [drm] add ip block number 2 <tonga_ih>
[ 3.484900] [drm] add ip block number 3 <gfx_v8_0>
[ 3.484900] [drm] add ip block number 4 <sdma_v3_0>
[ 3.484901] [drm] add ip block number 5 <powerplay>
[ 3.484901] [drm] add ip block number 6 <dm>
[ 3.484902] [drm] add ip block number 7 <uvd_v6_0>
[ 3.484902] [drm] add ip block number 8 <vce_v3_0>
[ 3.484913] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from VFCT
[ 3.484914] amdgpu: ATOM BIOS: 113-D0155200-100
[ 3.484932] [drm] vm size is 256 GB, 2 levels, block size is 10-bit, fragment size is 9-bit

...however, after this the driver tries to set up MMIO to communicate with the device, and here is where I think things go very wrong.

[ 3.485705] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x6100000000-0x61001fffff 64bit pref]
[ 3.485708] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x6000000000-0x60ffffffff 64bit pref]
[ 3.485709] [drm:amdgpu_device_resize_fb_bar [amdgpu]] *ERROR* Problem resizing BAR0 (-16).
[ 3.485811] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x6000000000-0x60ffffffff 64bit pref]
[ 3.485831] amdgpu 0000:01:00.0: BAR 0: error updating (0x00000c != 0xffffffff)
[ 3.485847] amdgpu 0000:01:00.0: BAR 0: error updating (high 0x000060 != 0xffffffff)
[ 3.485863] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x6100000000-0x61001fffff 64bit pref]
[ 3.485864] amdgpu 0000:01:00.0: BAR 2: error updating (0x00000c != 0xffffffff)
[ 3.485880] amdgpu 0000:01:00.0: BAR 2: error updating (high 0x000061 != 0xffffffff)
[ 3.485901] amdgpu 0000:01:00.0: amdgpu: VRAM: 4294967295M 0x000000FFFF000000 - 0x001000FFFEEFFFFF (4294967295M used)
[ 3.485903] amdgpu 0000:01:00.0: amdgpu: GART: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[ 3.485905] [drm] Detected VRAM RAM=4294967295M, BAR=4096M
[ 3.485906] [drm] RAM width 64bits UNKNOWN
[ 3.485913] [drm] amdgpu: 4294967295M of VRAM memory ready
[ 3.485914] [drm] amdgpu: 48026M of GTT memory ready.
[ 3.485916] [drm] GART: num cpu pages 65536, num gpu pages 65536

Failing to set up PCI BARs sounds bad, and concluding from this failed setup process that the GPU has 4 PiB of VRAM sounds even worse. I wouldn't be surprised if this were the point where things became hopeless. But amdgpu is not the kind of program that stops cleanly when the first error occurs, so it first waits for an event that never happens...

[ 3.600399] amdgpu 0000:01:00.0: amdgpu: Wait for MC idle timedout !
[...]
[ 3.714578] amdgpu 0000:01:00.0: amdgpu: Wait for MC idle timedout !

...then emits a few more messages that also appear during the successful initialization of my home RX 560, which I therefore consider "normal"...

[ 3.715166] [drm] PCIE GART of 256M enabled (table at 0x000000FFFF900000).
[ 3.716221] [drm] Chained IB support enabled!
[ 3.721721] amdgpu: hwmgr_sw_init smu backed is polaris10_smu

...and only then it dies brutally.

[ 3.726521] amdgpu:
last message was failed ret is 65535
[ 3.726522] amdgpu:
failed to send message 100 ret is 65535
[ 3.726525] amdgpu: SMC address must be 4 byte aligned.
[ 3.726525] amdgpu: [AVFS][Polaris10_SetupGfxLvlStruct] Problems copying VRConfig value over to SMC
[ 3.726526] amdgpu: [AVFS][Polaris10_AVFSEventMgr] Could not Copy Graphics Level table over to SMU
[ 3.726565] amdgpu:
last message was failed ret is 65535
[ 3.726566] amdgpu:
failed to send message 252 ret is 65535
[ 3.726566] amdgpu:
last message was failed ret is 65535
[ 3.726567] amdgpu:
failed to send message 253 ret is 65535
[ 3.726569] amdgpu:
last message was failed ret is 65535
[ 3.726570] amdgpu:
failed to send message 250 ret is 65535
[ 3.726571] amdgpu:
last message was failed ret is 65535
[ 3.726571] amdgpu:
failed to send message 251 ret is 65535
[ 3.726572] amdgpu:
last message was failed ret is 65535
[ 3.726573] amdgpu:
failed to send message 254 ret is 65535
[ 3.861852] [drm] Timeout wait for RLC serdes 0,0
[...]
[ 3.975981] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 3.976112] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -110
[ 3.976241] amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[ 3.976242] amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[ 3.976244] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
[ 3.977567] amdgpu: probe of 0000:01:00.0 failed with error -110
[ 3.977650] [drm] amdgpu: ttm finalized

---

So, to summarize, it seems something goes wrong at the MMIO setup stage, which may or may not be related to an earlier power management failure.

My question is, how can I investigate this further? Here is what I have already tried by trial and error, to no avail:

- Upgraded the UEFI firmware. It is now at the latest release, and this did not change a thing.
- Attempted to disable ASPM in case PCIe power management could be the issue (I found reports of an ASPM-related nvidia driver crash that started with the same ACPI error message at the beginning). This did not change anything here, so I rolled it back.
- Attempted to downgrade to an earlier amdgpu firmware, as this fixed a similar-looking crash on Arch. It did not help here, so I went back to the current AMD firmware.
- Attempted to set the AMD GPU as the primary GPU in UEFI. This does not fix amdgpu initialization, and it breaks i915 initialization, which is definitely worse. Rolled back.

--
You are receiving this mail because:
You are on the CC list for the bug.
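The values the driver reads back in the excerpts above are mostly all-ones patterns (0xffffffff for the BAR registers, 65535 = 0xffff for the SMC replies, 4294967295M = 0xffffffff MiB for the VRAM size). Reads from a PCI device that does not respond typically return all ones, so this is consistent with the "config space inaccessible" message. The following is only a minimal Python sketch of how such lines could be picked out of a dmesg dump, not a diagnosis tool: the helper names are hypothetical, and the errno table hard-codes the generic Linux values EBUSY = 16 and ETIMEDOUT = 110, which is an assumption about the platform.

```python
import re

# Generic Linux errno values, hard-coded so the sketch does not depend on
# the platform it runs on (assumption: asm-generic numbering).
LINUX_ERRNO = {16: "EBUSY", 110: "ETIMEDOUT"}

# 16-bit and 32-bit all-ones patterns, as seen in the log excerpts.
ALL_ONES = {0xFFFF, 0xFFFFFFFF}


def all_ones_values(line):
    """Return every number in a log line that is a 16- or 32-bit all-ones value."""
    found = []
    for token in re.findall(r"0x[0-9a-fA-F]+|\d+", line):
        value = int(token, 16) if token.lower().startswith("0x") else int(token)
        if value in ALL_ONES:
            found.append(value)
    return found


def decode_errno(code):
    """Map a kernel-style negative error code to its symbolic name."""
    return LINUX_ERRNO.get(abs(code), "unknown")


def mib_to_pib(mib):
    """Convert a size in MiB to PiB (1 PiB = 2**30 MiB)."""
    return mib / 2**30


if __name__ == "__main__":
    print(all_ones_values("BAR 0: error updating (0x00000c != 0xffffffff)"))
    print(decode_errno(-16), decode_errno(-110))
    print(mib_to_pib(4294967295))
```

With these definitions, the reported 4294967295M of VRAM works out to just under 4 PiB, and the two error codes in the log read as "resource busy" (BAR resize) and "timed out" (ring test and probe).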
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c1

--- Comment #1 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
Created attachment 852730
--> http://bugzilla.opensuse.org/attachment.cgi?id=852730&action=edit
dmesg log, without ANSI color codes
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c2

--- Comment #2 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
Cross-checking the dmesg logs before the amdgpu failure, I also found this, which sounds possibly related, although this is getting beyond my level of PCI jargon proficiency, so I can't analyse it in depth.

[ 1.053284] pci 0000:01:00.0: can't claim BAR 6 [mem 0xfffe0000-0xffffffff pref]: no compatible bridge window
[ 1.053289] pci 0000:00:15.0: BAR 0: assigned [mem 0x4010000000-0x4010000fff 64bit]
[ 1.053365] pci 0000:00:15.1: BAR 0: assigned [mem 0x4010001000-0x4010001fff 64bit]
[ 1.053379] pci 0000:00:1f.5: BAR 0: assigned [mem 0x6ea20000-0x6ea20fff]
[ 1.053389] pci 0000:01:00.0: BAR 6: assigned [mem 0x6e960000-0x6e97ffff pref]

Here it seems that BAR 6 of the AMD GPU tries to get one memory-mapped location, and ends up in another due to a "bridge window" issue. Not sure if the amdgpu driver is meant to tolerate this well.

---

Also, even earlier in the log, there are some ACPI errors that come from the PCI Root Bridge [PC00], which may or may not be related to the power management errors that I see later on:

[ 0.913731] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PC00.PGON.PBGE], AE_NOT_FOUND (20210604/psargs-330)
[ 0.913743] ACPI Error: Aborting method \_SB.PC00.PGON due to previous error (AE_NOT_FOUND) (20210604/psparse-529)
[ 0.913749] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due to previous error (AE_NOT_FOUND) (20210604/psparse-529)
[...]
[ 1.001774] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PC00.PGOF.PBGE], AE_NOT_FOUND (20210604/psargs-330)
[ 1.001784] ACPI Error: Aborting method \_SB.PC00.PGOF due to previous error (AE_NOT_FOUND) (20210604/psparse-529)
[ 1.001790] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._OFF due to previous error (AE_NOT_FOUND) (20210604/psparse-529)

Hope this helps.
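One way to read the "BAR N: ... [mem START-END]" lines quoted in this thread is to compute the size of each window. In Linux's PCI resource numbering, index 6 is normally the expansion ROM rather than one of the six standard BARs. The sketch below is a minimal Python illustration with a hypothetical helper name; it only does the arithmetic on addresses copied from the logs and makes no claim about what the driver tolerates.

```python
import re

# Matches the "[mem 0xSTART-0xEND" part of a kernel PCI resource message.
MEM_RANGE = re.compile(r"\[mem 0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)")


def bar_size(line):
    """Return the size in bytes of the memory range in a log line, or None."""
    match = MEM_RANGE.search(line)
    if match is None:
        return None
    start, end = (int(part, 16) for part in match.groups())
    return end - start + 1  # both bounds are inclusive


if __name__ == "__main__":
    lines = [
        "can't claim BAR 6 [mem 0xfffe0000-0xffffffff pref]: no compatible bridge window",
        "BAR 6: assigned [mem 0x6e960000-0x6e97ffff pref]",
        "BAR 0: assigned [mem 0x6000000000-0x60ffffffff 64bit pref]",
        "BAR 2: assigned [mem 0x6100000000-0x61001fffff 64bit pref]",
    ]
    for line in lines:
        print(f"{bar_size(line) // 1024:>10} KiB  {line}")
```

Under this reading, the range the firmware left in BAR 6 and the range the kernel assigned instead are both 128 KiB, so the kernel moved the window without changing its size. BAR 0 is 4 GiB, which matches the "BAR=4096M" reported by amdgpu, and BAR 2 is 2 MiB.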
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c4

--- Comment #4 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
Created attachment 852841
--> http://bugzilla.opensuse.org/attachment.cgi?id=852841&action=edit

Nothing changes, perhaps due to kernel bug https://bugzilla.kernel.org/show_bug.cgi?id=81431 ?
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c6

--- Comment #6 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
Created attachment 853412
--> http://bugzilla.opensuse.org/attachment.cgi?id=853412&action=edit
dmesg log with 32-bit mappings forced and resizable BAR disabled

I found two firmware settings related to device memory mappings:

1. One that inhibits mapping 64-bit integrated devices above 4 GB.
2. One that disables resizable BAR support.

Strangely enough, these two firmware settings seem to be interdependent:

- If I try to force 32-bit mappings without disabling resizable BAR, then nothing changes in dmesg, and on my next trip to the UEFI setup screen I see that the firmware has silently switched the 32-bit mapping setting back off.
- If I try to disable resizable BAR support without forcing 32-bit mappings, then I still see messages about BAR resizing in dmesg, suggesting that the change was ineffective. However, resizable BARs are still marked as disabled on my next trip to the UEFI setup screen.
- If I force 32-bit mappings AND disable resizable BARs, then I get the enclosed log. amdgpu initialization still fails, and GPU VRAM detection is still broken, but all the logs about BARs that were previously present are gone, suggesting that at least this did something.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c7

--- Comment #7 from Takashi Iwai <tiwai@suse.com> ---
With 32-bit mappings forced and resizable BAR disabled, could you try to downgrade the AMDGPU firmware files and retest? The old firmware files can be retrieved from the linux-firmware git tree.
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

And, when you update the firmware, don't forget to regenerate the initrd.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c8

--- Comment #8 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
Created attachment 853509
--> http://bugzilla.opensuse.org/attachment.cgi?id=853509&action=edit
dmesg logs with various firmwares

I assume you meant https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git , and this is what I used.

- This particular card needs the firmware file polaris12_k_mc.bin to boot, which was introduced by linux-firmware commit 6cca1381f328e7df55ae8bb8ac515b945d35f9f5 from 2018-12-03. Before that commit, amdgpu complains about the missing file and refuses to start.
- The first functional firmware produces the same dmesg logs as the current firmware (timestamps aside).
- I tested a couple of intermediate linux-firmware commits that touched amdgpu/polaris*, aiming for version consistency between the polaris10 and polaris12 firmware files (since the amdgpu logs mention both polaris10 and polaris12), and the dmesg output remained identical.

So the firmware does not seem to be the culprit here...
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c9

--- Comment #9 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
FWIW, I keep wondering about this power management error near the start of the amdgpu initialization logs:

[ 4.101984] amdgpu 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)

I don't know enough about PCI and ACPI power management to tell how serious this is, but intuitively "config space inaccessible" sounds bad... and if I search the web for "can't change power state from D3cold to D0 (config space inaccessible)", I find various interesting-looking discussion threads, several of which concern hardware initialization problems at boot.

Unfortunately, most of these threads are about hardware other than AMD GPUs, and they contain more questions than answers.
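One possible way to cross-check the "config space inaccessible" message from user space is to look at the first bytes of the device's PCI configuration space, which Linux exposes as the binary sysfs file /sys/bus/pci/devices/<address>/config. The first two 16-bit little-endian fields are the vendor and device IDs, and an all-ones vendor ID usually means the device is not answering. The following Python sketch assumes that reading this file on the affected machine would show the same all-ones pattern, which was not verified in this thread; the function names are hypothetical.

```python
import struct
from pathlib import Path


def parse_ids(config_bytes):
    """Return (vendor_id, device_id) from the start of a PCI config space dump."""
    vendor, device = struct.unpack_from("<HH", config_bytes, 0)
    return vendor, device


def config_space_accessible(config_bytes):
    """Heuristic: a vendor ID of 0xffff means the reads returned all ones."""
    vendor, _ = parse_ids(config_bytes)
    return vendor != 0xFFFF


def read_config(address="0000:01:00.0"):
    """Read the raw config space of a device through sysfs (Linux only)."""
    return Path(f"/sys/bus/pci/devices/{address}/config").read_bytes()


if __name__ == "__main__":
    data = read_config()
    vendor, device = parse_ids(data)
    print(f"vendor=0x{vendor:04x} device=0x{device:04x} "
          f"accessible={config_space_accessible(data)}")
```

On a responding card matching the log above, the IDs would be expected to read 0x1002 and 0x6981, the values amdgpu printed in its "initializing kernel modesetting" line.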
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c10

Hadrien Grasland <grasland@lal.in2p3.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
This problem was fixed by the Linux 5.15 update! The BAR allocation error remains, but the power management error has gone away, suggesting that the latter was the culprit after all.

All previously configured firmware hacks (disabling resizable BARs, forcing mapping to lower 32-bit addresses) can be removed while keeping the graphics configuration working as expected.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c11

--- Comment #11 from Takashi Iwai <tiwai@suse.com> ---
Good to hear. Could you check whether the kernel in OBS Kernel:SLE15-SP4 repo still shows the problem?
http://download.opensuse.org/repositories/Kernel:/stable/standard/
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c12

--- Comment #12 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
I think I'll be able to try next week.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c13

--- Comment #13 from Hadrien Grasland <grasland@lal.in2p3.fr> ---
The AMD GPU does initialize correctly with the default kernel from repo Kernel:SLE15-SP4 (5.14.21-8.g9b37a45-default).
http://bugzilla.opensuse.org/show_bug.cgi?id=1190854#c14

--- Comment #14 from Takashi Iwai <tiwai@suse.com> ---
Great, thanks for confirmation!