http://bugzilla.opensuse.org/show_bug.cgi?id=1190854

Bug ID: 1190854
Summary: amdgpu fails to probe Radeon Pro WX3200
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: openSUSE Tumbleweed
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: grasland@lal.in2p3.fr
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 852729
  --> http://bugzilla.opensuse.org/attachment.cgi?id=852729&action=edit
dmesg log, with ANSI color codes

On a Dell Precision 3650 equipped with a Radeon Pro WX3200, something goes very wrong during the amdgpu initialization process (or perhaps before, see below), which makes the card unusable on Linux. For the time being, I have to fall back to the CPU's built-in graphics on this machine.

The GPU is brand new and works both in the UEFI setup screen and on Windows, which suggests to me that the hardware is not completely broken. Of course, this does not fully exclude a hardware issue that is only triggered by the Linux initialization process.

The dmesg log is attached both with and without ANSI color codes, as I personally find dmesg's color output much easier to parse visually.

---

Let's have a closer look at the amdgpu logs, since that's what I'm directly interested in here.

First we get the standard stuff, nothing special here as far as I can see. The CRAT table bit does look suspicious, but it also appears during the init of my home Radeon RX 560, which otherwise works fine under Linux, so this does not seem to be a fatal issue...
[ 3.484654] [drm] amdgpu kernel modesetting enabled.
[ 3.484700] amdgpu: CRAT table not found
[ 3.484701] amdgpu: Virtual CRAT table created for CPU
[ 3.484707] amdgpu: Topology: Add CPU node

...then we get a power management failure, which may or may not be related to the problem at hand...
[ 3.484764] amdgpu 0000:01:00.0: can't change power state from D3hot to D0 (config space inaccessible)

...then more "normal" output, in the sense that it looks close to what I see in the amdgpu init logs on my RX 560...
[ 3.484888] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x6981 0x1028:0x2B0D 0x10).
[ 3.484890] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 3.484895] [drm] register mmio base: 0x6E900000
[ 3.484896] [drm] register mmio size: 262144
[ 3.484898] [drm] add ip block number 0 <vi_common>
[ 3.484899] [drm] add ip block number 1 <gmc_v8_0>
[ 3.484899] [drm] add ip block number 2 <tonga_ih>
[ 3.484900] [drm] add ip block number 3 <gfx_v8_0>
[ 3.484900] [drm] add ip block number 4 <sdma_v3_0>
[ 3.484901] [drm] add ip block number 5 <powerplay>
[ 3.484901] [drm] add ip block number 6 <dm>
[ 3.484902] [drm] add ip block number 7 <uvd_v6_0>
[ 3.484902] [drm] add ip block number 8 <vce_v3_0>
[ 3.484913] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from VFCT
[ 3.484914] amdgpu: ATOM BIOS: 113-D0155200-100
[ 3.484932] [drm] vm size is 256 GB, 2 levels, block size is 10-bit, fragment size is 9-bit

...however, after this the driver tries to set up MMIO to communicate with the device, and this is where I think things go very wrong.
[ 3.485705] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x6100000000-0x61001fffff 64bit pref]
[ 3.485708] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x6000000000-0x60ffffffff 64bit pref]
[ 3.485709] [drm:amdgpu_device_resize_fb_bar [amdgpu]] *ERROR* Problem resizing BAR0 (-16).
[ 3.485811] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x6000000000-0x60ffffffff 64bit pref]
[ 3.485831] amdgpu 0000:01:00.0: BAR 0: error updating (0x00000c != 0xffffffff)
[ 3.485847] amdgpu 0000:01:00.0: BAR 0: error updating (high 0x000060 != 0xffffffff)
[ 3.485863] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x6100000000-0x61001fffff 64bit pref]
[ 3.485864] amdgpu 0000:01:00.0: BAR 2: error updating (0x00000c != 0xffffffff)
[ 3.485880] amdgpu 0000:01:00.0: BAR 2: error updating (high 0x000061 != 0xffffffff)
[ 3.485901] amdgpu 0000:01:00.0: amdgpu: VRAM: 4294967295M 0x000000FFFF000000 - 0x001000FFFEEFFFFF (4294967295M used)
[ 3.485903] amdgpu 0000:01:00.0: amdgpu: GART: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[ 3.485905] [drm] Detected VRAM RAM=4294967295M, BAR=4096M
[ 3.485906] [drm] RAM width 64bits UNKNOWN
[ 3.485913] [drm] amdgpu: 4294967295M of VRAM memory ready
[ 3.485914] [drm] amdgpu: 48026M of GTT memory ready.
[ 3.485916] [drm] GART: num cpu pages 65536, num gpu pages 65536

Failing to set up the PCI BARs sounds bad, and concluding from this failed setup that the GPU has 4 PiB of VRAM sounds even worse. I wouldn't be surprised if this were the point where things became hopeless. But amdgpu does not stop cleanly at the first error, so it first waits for an event that never happens...
[ 3.600399] amdgpu 0000:01:00.0: amdgpu: Wait for MC idle timedout !
[...]
[ 3.714578] amdgpu 0000:01:00.0: amdgpu: Wait for MC idle timedout !

...then emits a few more messages that also appear during the successful initialization of my home RX 560, and which I therefore consider "normal"...
[ 3.715166] [drm] PCIE GART of 256M enabled (table at 0x000000FFFF900000).
[ 3.716221] [drm] Chained IB support enabled!
[ 3.721721] amdgpu: hwmgr_sw_init smu backed is polaris10_smu

...and only then does it die brutally.
[ 3.726521] amdgpu:
 last message was failed ret is 65535
[ 3.726522] amdgpu:
 failed to send message 100 ret is 65535
[ 3.726525] amdgpu: SMC address must be 4 byte aligned.
[ 3.726525] amdgpu: [AVFS][Polaris10_SetupGfxLvlStruct] Problems copying VRConfig value over to SMC
[ 3.726526] amdgpu: [AVFS][Polaris10_AVFSEventMgr] Could not Copy Graphics Level table over to SMU
[ 3.726565] amdgpu:
 last message was failed ret is 65535
[ 3.726566] amdgpu:
 failed to send message 252 ret is 65535
[ 3.726566] amdgpu:
 last message was failed ret is 65535
[ 3.726567] amdgpu:
 failed to send message 253 ret is 65535
[ 3.726569] amdgpu:
 last message was failed ret is 65535
[ 3.726570] amdgpu:
 failed to send message 250 ret is 65535
[ 3.726571] amdgpu:
 last message was failed ret is 65535
[ 3.726571] amdgpu:
 failed to send message 251 ret is 65535
[ 3.726572] amdgpu:
 last message was failed ret is 65535
[ 3.726573] amdgpu:
 failed to send message 254 ret is 65535
[ 3.861852] [drm] Timeout wait for RLC serdes 0,0
[...]
[ 3.975981] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 3.976112] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -110
[ 3.976241] amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[ 3.976242] amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[ 3.976244] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
[ 3.977567] amdgpu: probe of 0000:01:00.0 failed with error -110
[ 3.977650] [drm] amdgpu: ttm finalized

---

So, to summarize, it seems something goes wrong at the MMIO setup stage, which may or may not be related to an earlier power management failure. My question is, how can I investigate this further?

Here is what I have already tried by trial and error, to no avail:

- Upgrade the UEFI firmware. It is now at the latest release, and this did not change a thing.
- Attempt to disable ASPM in case PCIe power management is the issue (I found reports of an ASPM-related nvidia driver crash that started with the same ACPI error message at the beginning). This did not change anything here, so I rolled it back.
- Attempt to downgrade to an earlier amdgpu firmware, as this fixed a similar-looking crash on Arch. It did not help here, so I went back to the current AMD firmware.
- Attempt to set the AMD GPU as the primary GPU in UEFI. This does not fix amdgpu initialization, and it breaks i915 initialization, which is definitely worse. Rolled back.
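
The numeric values in the log above can be decoded with a short script. This is only a minimal sketch: it assumes Linux errno numbering (16 = EBUSY, 110 = ETIMEDOUT), and the helper names are hypothetical, not part of any amdgpu tooling. It shows that the reported VRAM size (4294967295M) and the SMC return code (65535) are all-ones bit patterns. Reads from a PCI device that does not respond typically return all ones, so this looks consistent with the earlier "config space inaccessible" message, although it does not prove that this is the cause.

```python
# Linux errno values that appear in the log (assumption: Linux numbering).
LINUX_ERRNO = {16: "EBUSY", 110: "ETIMEDOUT"}


def errno_name(ret):
    """Map a negative kernel-style return code to its errno symbol."""
    return LINUX_ERRNO.get(-ret, "unknown")


def is_all_ones(value, bits):
    """Return True if value is the all-ones pattern for the given bit width."""
    return value == (1 << bits) - 1


def mib_to_pib(mib):
    """Convert a size in MiB to PiB (1 PiB = 2**30 MiB)."""
    return mib / 2**30


if __name__ == "__main__":
    # Error codes from "Problem resizing BAR0 (-16)" and "failed with error -110".
    for ret in (-16, -110):
        print(ret, errno_name(ret))

    # "VRAM: 4294967295M" is 0xFFFFFFFF, "ret is 65535" is 0xFFFF.
    print("VRAM size is all-ones (32 bit):", is_all_ones(4294967295, 32))
    print("SMC ret is all-ones (16 bit):", is_all_ones(65535, 16))

    # 0xFFFFFFFF MiB is just under 4 PiB, hence the bogus VRAM size.
    print("VRAM size in PiB:", mib_to_pib(4294967295))
```

Read this way, -16 (EBUSY) is the BAR resize failure and -110 (ETIMEDOUT) is the gfx ring test timeout that finally aborts the probe.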