[Bug 855501] New: [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD) <ensonic> [drm:r600_resume] *ERROR* r600 startup failed on resume
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c0

Summary: [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD) <ensonic> [drm:r600_resume] *ERROR* r600 startup failed on resume
Classification: openSUSE
Product: openSUSE 13.1
Version: Final
Platform: x86-64
OS/Version: openSUSE 13.1
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: michelschaffers@gmail.com
QAContact: qa-bugs@suse.de
CC: jeffm@suse.com, ensonic@sonicpulse.de, bpetkov@suse.com
Depends on: 841365
Found By: ---
Blocker: Yes

Boot fails as described in bug#841365 - kernel crashes - login screen never shows up - Fix does not work in my case.

Here follows the last lines of /var/log/messages:

2013-12-14T09:25:13.220867+01:00 linux-etuk kdm[823]: Quitting Plymouth with transition
2013-12-14T09:25:13.409008+01:00 linux-etuk kdm[823]: Is Plymouth still running? no
2013-12-14T09:25:22.993756+01:00 linux-etuk kernel: [ 34.168239] radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
2013-12-14T09:25:22.993777+01:00 linux-etuk kernel: [ 34.168248] radeon 0000:01:05.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000001)
2013-12-14T09:25:22.994083+01:00 linux-etuk kernel: [ 34.169298] radeon 0000:01:05.0: Saved 121 dwords of commands on ring 0.
2013-12-14T09:25:22.994094+01:00 linux-etuk kernel: [ 34.169305] radeon 0000:01:05.0: GPU softreset: 0x00000108
2013-12-14T09:25:22.994097+01:00 linux-etuk kernel: [ 34.169308] radeon 0000:01:05.0: R_008010_GRBM_STATUS = 0xA0003030
2013-12-14T09:25:22.994100+01:00 linux-etuk kernel: [ 34.169310] radeon 0000:01:05.0: R_008014_GRBM_STATUS2 = 0x00000003
2013-12-14T09:25:22.994103+01:00 linux-etuk kernel: [ 34.169317] radeon 0000:01:05.0: R_000E50_SRBM_STATUS = 0x20023040
2013-12-14T09:25:22.994106+01:00 linux-etuk kernel: [ 34.169321] radeon 0000:01:05.0: R_008674_CP_STALLED_STAT1 = 0x00000000
2013-12-14T09:25:22.994108+01:00 linux-etuk kernel: [ 34.169325] radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000000
2013-12-14T09:25:22.994110+01:00 linux-etuk kernel: [ 34.169328] radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x00000800
2013-12-14T09:25:22.994113+01:00 linux-etuk kernel: [ 34.169332] radeon 0000:01:05.0: R_008680_CP_STAT = 0x800000C1
2013-12-14T09:25:22.994115+01:00 linux-etuk kernel: [ 34.169336] radeon 0000:01:05.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
2013-12-14T09:25:23.190740+01:00 linux-etuk kernel: [ 34.365519] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00004001
2013-12-14T09:25:23.190753+01:00 linux-etuk kernel: [ 34.365572] radeon 0000:01:05.0: SRBM_SOFT_RESET=0x00000500
2013-12-14T09:25:23.192754+01:00 linux-etuk kernel: [ 34.367674] radeon 0000:01:05.0: R_008010_GRBM_STATUS = 0xA0003030
2013-12-14T09:25:23.192766+01:00 linux-etuk kernel: [ 34.367676] radeon 0000:01:05.0: R_008014_GRBM_STATUS2 = 0x00000003
2013-12-14T09:25:23.192769+01:00 linux-etuk kernel: [ 34.367677] radeon 0000:01:05.0: R_000E50_SRBM_STATUS = 0x2002B040
2013-12-14T09:25:23.192772+01:00 linux-etuk kernel: [ 34.367679] radeon 0000:01:05.0: R_008674_CP_STALLED_STAT1 = 0x00000000
2013-12-14T09:25:23.192774+01:00 linux-etuk kernel: [ 34.367680] radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000000
2013-12-14T09:25:23.192777+01:00 linux-etuk kernel: [ 34.367682] radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x00000000
2013-12-14T09:25:23.192779+01:00 linux-etuk kernel: [ 34.367683] radeon 0000:01:05.0: R_008680_CP_STAT = 0x80100000
2013-12-14T09:25:23.192781+01:00 linux-etuk kernel: [ 34.367684] radeon 0000:01:05.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
2013-12-14T09:25:23.192783+01:00 linux-etuk kernel: [ 34.367687] radeon 0000:01:05.0: GPU reset succeeded, trying to resume
2013-12-14T09:25:23.328743+01:00 linux-etuk kernel: [ 34.503813] [drm] PCIE GART of 512M enabled (table at 0x00000000C0040000).
2013-12-14T09:25:23.329744+01:00 linux-etuk kernel: [ 34.503864] radeon 0000:01:05.0: WB enabled
2013-12-14T09:25:23.329763+01:00 linux-etuk kernel: [ 34.503867] radeon 0000:01:05.0: fence driver on ring 0 use gpu addr 0x00000000a0000c00 and cpu addr 0xffff88021156ac00
2013-12-14T09:25:23.329767+01:00 linux-etuk kernel: [ 34.503868] radeon 0000:01:05.0: fence driver on ring 3 use gpu addr 0x00000000a0000c0c and cpu addr 0xffff88021156ac0c
2013-12-14T09:25:23.329770+01:00 linux-etuk kernel: [ 34.504080] radeon 0000:01:05.0: setting latency timer to 64
2013-12-14T09:25:23.497739+01:00 linux-etuk kernel: [ 34.672517] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
2013-12-14T09:25:23.497746+01:00 linux-etuk kernel: [ 34.672519] [drm:r600_resume] *ERROR* r600 startup failed on resume

Here follows output of "lspci -nn | grep VGA":
01:05.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RS780L [Radeon 3000] [1002:9616]

Here follows output of "cat /sys/devices/virtual/dmi/id/board_name":
M5A78L-M/USB3

I could observe the issue with following kernels:
kernel /boot/vmlinuz-3.13.0-rc3-3.g2bf5161-desktop
kernel /boot/vmlinuz-3.11.10-3.g137a69e-desktop
kernel /boot/vmlinuz-3.11.6-4-desktop
kernel /boot/vmlinuz-3.7.10-1.1-desktop
kernel /boot/vmlinuz-3.7.10-1.16-desktop

Pressing the esc key during the boot, so as to display the messages in place of the suse picture, helps sometimes but not that often.

Removing "quiet" in the boot command line works around the issue

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
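The log above contains two register dumps, one before and one after the GPU soft reset. As a rough aid for reading them, the following Python sketch lists the registers whose value differs between the first and last dump. The names `REG_RE`, `SAMPLE_LOG`, `register_history` and `changed_registers` are made up for this illustration, and the sample is only an excerpt of the values quoted in the report; the sketch does not interpret what the individual bits mean.

```python
import re

# Matches register dump entries such as "R_008680_CP_STAT = 0x800000C1".
REG_RE = re.compile(r"(R_[0-9A-F]{6}_[A-Z0-9_]+)\s*=\s*(0x[0-9A-Fa-f]+)")

# Excerpt of the values quoted in the report (before and after the soft reset).
SAMPLE_LOG = """\
radeon 0000:01:05.0: R_008010_GRBM_STATUS = 0xA0003030
radeon 0000:01:05.0: R_000E50_SRBM_STATUS = 0x20023040
radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x00000800
radeon 0000:01:05.0: R_008680_CP_STAT = 0x800000C1
radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00004001
radeon 0000:01:05.0: R_008010_GRBM_STATUS = 0xA0003030
radeon 0000:01:05.0: R_000E50_SRBM_STATUS = 0x2002B040
radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x00000000
radeon 0000:01:05.0: R_008680_CP_STAT = 0x80100000
"""


def register_history(log_text):
    """Collect every value seen for each register, in log order."""
    history = {}
    for name, value in REG_RE.findall(log_text):
        history.setdefault(name, []).append(int(value, 16))
    return history


def changed_registers(log_text):
    """Return {register: (first, last)} for registers dumped more than once with different values."""
    changed = {}
    for name, values in register_history(log_text).items():
        if len(values) >= 2 and values[0] != values[-1]:
            changed[name] = (values[0], values[-1])
    return changed


if __name__ == "__main__":
    for name, (before, after) in changed_registers(SAMPLE_LOG).items():
        print(f"{name}: {before:#010x} -> {after:#010x}")
```

Fed the full /var/log/messages text instead of the excerpt, the same function would work unchanged, since it only looks for the `R_xxxxxx_NAME = 0x...` pattern.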
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c1
--- Comment #1 from Borislav Petkov
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c2
--- Comment #2 from Michel Schaffers
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c3
--- Comment #3 from Borislav Petkov
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c4
--- Comment #4 from Michel Schaffers
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c5
--- Comment #5 from Michel Schaffers
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c6
--- Comment #6 from Borislav Petkov
Me: should I try the firmware from https://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/?
Yes. Simply copy the radeon/ directory to /lib/firmware/. You might want to stash away the one you have there now just for the test.
You: Also, try the latest upstream kernel packages here http://kernel.opensuse.org/packages/vanilla *with* the latest radeon microcode you've already updated in the previous step.
Me: upgraded to kernel version Desktop -- openSUSE - 3.13.0-rc3-4.g39ea148: ok for 4 boots out of 4; could not find anything suspicious in /var/log/messages
Me: added "quiet" back in the boot command line (that did not work around the problem, in contrast to what I wrote in my original description of the issue): ok: for almost all boots (1 I am not sure of)
Ok, so it looks like upstream has been fixed in the meantime - the question is, which is the fix. :)
You: Also, is this an upgrade to openSUSE 13.1 from an older version? If so, can you try a clean install? I had a couple of bug reports happening because of stale leftovers from a previous installation after an upgrade.
Me: it is an upgrade from the previous openSUSE version. To be honest, I am not that confident my system is stable now! So I think I will do a clean install. But I first need to back up everything, and then find the time for this...
Sounds like a plan.
You: Also, please upload full dmesg.
Me: please find uploaded a dmesg of a successful boot (hard to get, as the issue is that boot fails most of the time!): it is a dmesg from before doing the different actions you suggested in the previous comment.
Thanks. Nothing out of the ordinary there, AFAICT.
You: Has radeon ever worked on this box? If so, please upload full dmesg from a working kernel too.
Me: it never worked; it is a brand new motherboard.
Oh, so the successful dmesg is when you remove "quiet" from the command line?
Me: as a wrap up, I have two questions left: 1) should I upgrade the firmware from linux-firmware.git?
Sure, although the 13.1 package should have the latest. I should probably check. Yeah, just replace the radeon/ dir temporarily as a test, as I said above.
2) is it recommended and/or worthwhile to do a clean install?
Right, if you need to reinstall your machine anyway, then it certainly wouldn't hurt. Otherwise, wait a bit longer with this until we've tried all other options. HTH.
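The firmware test suggested in the comment above (stash the current radeon/ directory, then copy the one from linux-firmware.git into /lib/firmware/) could be scripted roughly as follows. This is a minimal sketch, not a procedure from the bug report: the function name `install_test_firmware`, the `.orig` backup suffix and the parameter names are assumptions chosen for illustration, and on a real system it would need to run as root.

```python
import shutil
from pathlib import Path


def install_test_firmware(new_radeon_dir, firmware_root="/lib/firmware", backup_suffix=".orig"):
    """Stash the existing radeon/ firmware directory and copy a new one in its place.

    Returns the path of the stashed directory, or None if there was nothing to stash.
    The backup suffix is an arbitrary choice for this sketch.
    """
    target = Path(firmware_root) / "radeon"
    backup = target.with_name(target.name + backup_suffix)
    stashed = None

    if target.exists():
        if backup.exists():
            # Never clobber an earlier stash: it may be the only copy of the original.
            raise FileExistsError(f"{backup} already exists; refusing to overwrite the stash")
        target.rename(backup)
        stashed = backup

    # copytree requires that the destination does not exist, which holds after the rename.
    shutil.copytree(Path(new_radeon_dir), target)
    return stashed
```

To undo the test, remove the copied radeon/ directory and rename the stashed one back to its original name.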
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c7
--- Comment #7 from Michel Schaffers
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c8
--- Comment #8 from Michel Schaffers
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c
Takashi Iwai
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c9
--- Comment #9 from Michel Schaffers
https://bugzilla.novell.com/show_bug.cgi?id=855501
https://bugzilla.novell.com/show_bug.cgi?id=855501#c10
--- Comment #10 from Borislav Petkov
[ 9.139065] [Hardware Error]: MC4 Error (node 0): Watchdog timeout due to lack of progress.
[ 9.139073] [Hardware Error]: Error Status: System Fatal error.
[ 9.139076] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|UE|MiscV|PCC|AddrV|-|-]: 0xfe00000000070f0f
[ 9.139081] [Hardware Error]: MC4_ADDR: 0x00000000d003afc0
[ 9.139083] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)
Here it is: a transaction to/from a device on your machine times out. This is reported by the machine check exception above. If I had to guess, this is probably related to the GPU.

Now, please go and check whether there's a BIOS update for your motherboard. If there is, try to upgrade it but make sure you don't brick your motherboard while doing that :)

Also, upload a *full* dmesg here after booting your machine with

"debug log_buf_len=16M systemd.log_target=null ignore_loglevel"

Thanks.
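For readers who want to cross-check the MC4_STATUS value quoted in this comment against the flags the kernel printed, here is a small Python sketch. It decodes only the architectural x86 machine-check status bits 63 to 57 and the 16-bit MCA error code; the two trailing "-" fields in the kernel line are vendor-specific and are deliberately not decoded here. The names `MCA_STATUS_BITS`, `decode_mca_status` and `mca_error_code` are invented for this illustration. Note that the kernel prints the UC bit as "UE".

```python
# Architectural MCi_STATUS flag bits (bit position, name).
MCA_STATUS_BITS = [
    (63, "Val"),    # register contains a valid error
    (62, "Over"),   # an earlier error was overwritten (overflow)
    (61, "UC"),     # uncorrected error (printed as "UE" by the kernel)
    (60, "En"),     # error reporting was enabled
    (59, "MiscV"),  # MCi_MISC holds valid information
    (58, "AddrV"),  # MCi_ADDR holds a valid address
    (57, "PCC"),    # processor context corrupt
]


def decode_mca_status(status):
    """Return the names of the architectural flag bits set in an MCi_STATUS value."""
    return [name for bit, name in MCA_STATUS_BITS if (status >> bit) & 1]


def mca_error_code(status):
    """Return the MCA error code held in bits 15:0 of MCi_STATUS."""
    return status & 0xFFFF


if __name__ == "__main__":
    status = 0xFE00000000070F0F  # value from the log in this comment
    print(decode_mca_status(status))
    print(hex(mca_error_code(status)))
```

All five flags shown in the kernel line (Over, UE, MiscV, PCC, AddrV) correspond to set bits in this value, along with Val and En, which the kernel line does not print.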
http://bugzilla.novell.com/show_bug.cgi?id=855501
Jiri Slaby