[Bug 954783] New: DRI_PRIME=1 radeon kernel module crash
http://bugzilla.opensuse.org/show_bug.cgi?id=954783

Bug ID: 954783
Summary: DRI_PRIME=1 radeon kernel module crash
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 42.1
Hardware: x86-64
OS: openSUSE 42.1
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: lathanderjk@gmail.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

HW: HP ZBook 14, AMD M4100 + Intel HD 4400

The kernel crashes randomly when DRI_PRIME=1 is used to access the dedicated GPU. This bug does not occur on openSUSE 13.2 (a working PRIME configuration there requires the repository http://download.opensuse.org/repositories/X11:/XOrg/openSUSE_13.2/).

This is probably the same bug as https://bugs.freedesktop.org/show_bug.cgi?id=92258

--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c9
--- Comment #9 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c10
--- Comment #10 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c11
--- Comment #11 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c12
--- Comment #12 from Takashi Iwai
I also use TLP/TLP-rdw; could that be relevant for this bug?
I have little idea about this stuff, but it's not the official Leap package, right? Try to keep the system as clean as possible. I ran glmatrix 10x in parallel with DRI_PRIME=1 for an hour, but it didn't hit any issue.
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c13
--- Comment #13 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c14
--- Comment #14 from Takashi Iwai
TLP is from the openSUSE-Leap-42.1-Oss repository,
Ah, I didn't know it was in the official repo. But yes, please try without it.
VLC + VLC-codecs and the necessary dependencies are the only things from another repository.
I don't think this is an issue as long as you can reproduce the crash without VLC itself.
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c15
--- Comment #15 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c16
--- Comment #16 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c17
--- Comment #17 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c18
--- Comment #18 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c19
--- Comment #19 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c20
--- Comment #20 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c21
--- Comment #21 from Takashi Iwai
echo c > /proc/sysrq-trigger immediately resets the computer, but the folder is still empty.
Didn't it switch to the kdump kernel? Check "systemctl status kdump".
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c22
--- Comment #22 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c23
--- Comment #23 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c24
--- Comment #24 from Takashi Iwai
I have two options: Kdump Low Memory (72-2698) is set to 72 MB and Kdump High Memory (0-5358) to 49 MB.
What is the proper value?
Try doubling them.
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c25
--- Comment #25 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c26
--- Comment #26 from Takashi Iwai
With 144/98 and a reboot, GLmatrix freezes the system but /var/crash is still empty.
First test with sysrq-trigger that kdump actually works. If it doesn't, either the reserved memory is still too small, or something is missing in the setup. Does /proc/cmdline show the proper crashkernel=xxx option? Also, does the journal show that kdump is loaded?
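The checks asked for here can be gathered into a small script. This is only a minimal sketch, not an official kdump procedure: the function names has_crashkernel and kdump_report are made up for illustration, and it assumes a systemd-based system that exposes /sys/kernel/kexec_crash_loaded.

```shell
#!/bin/sh
# has_crashkernel CMDLINE: succeed if the given kernel command line
# reserves memory for a crash kernel (hypothetical helper name).
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# kdump_report: print the state of the pieces kdump depends on
# (hypothetical helper name).
kdump_report() {
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= is on the command line"
    else
        echo "crashkernel= is missing: no memory is reserved"
    fi
    # 1 means a crash kernel has been loaded via kexec
    echo "crash kernel loaded: $(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)"
    # Prints "active" when the kdump service is running
    systemctl is-active kdump
}
```

After sourcing the file, run kdump_report as root. Only when all three checks look right is it worth forcing a test crash with echo c > /proc/sysrq-trigger.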
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c27
--- Comment #27 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c28
--- Comment #28 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c29
--- Comment #29 from Takashi Iwai
With sysrq-trigger it works fine.
OK, so this implies that the crash isn't a normal kernel logic error; the system goes straight into a stall or reset. Too bad. Then the only way to watch is via serial console. Or, with some luck, we might be able to catch the log via netconsole. Basically, without a crash log, we can't even assume that it's the same as the bug you pointed to (https://bugs.freedesktop.org/show_bug.cgi?id=92258). So, for now, the most important thing is to get any kernel log showing the crash. Another thing to check would be to test a newer kernel (e.g. the FACTORY 4.3 kernel) and/or newer X packages (available in the OBS X11:XOrg repo).
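For the netconsole route, a minimal sketch follows. It relies on the module parameter format from the kernel documentation, netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The helper name netconsole_param and all addresses and the interface name are placeholders, not values from this bug.

```shell
#!/bin/sh
# netconsole_param SRC_IP DEV TGT_IP [TGT_MAC]
# Build the value of the netconsole= module parameter (hypothetical
# helper). Ports are fixed to the defaults, 6665 (source) and 6666
# (target). An empty TGT_MAC makes the kernel use broadcast.
netconsole_param() {
    printf '6665@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "${4:-}"
}

# On the crashing machine (as root); addresses are placeholders:
#   dmesg -n 8
#   modprobe netconsole netconsole="$(netconsole_param 192.168.1.10 eth0 192.168.1.20)"
# On the receiving machine, listen for UDP packets; the option syntax
# differs between netcat variants:
#   nc -u -l 6666 | tee netconsole.log
```

Raising the console log level with dmesg -n 8 first matters because netconsole only forwards messages that pass the console log level.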
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c30
--- Comment #30 from Takashi Iwai
But when I use an external monitor via DP (DELL U2410), DRI_PRIME=1 glmatrix crashes every time after a couple of minutes (3-10 minutes, glmatrix 10x in parallel). On the internal LCD it sometimes works fine for a long time and sometimes also crashes after a few minutes.
BTW, this is interesting. Because DRI_PRIME=1 is only for 3D rendering, basically it doesn't matter which output is used. The only difference is the window size of glmatrix. More memory usage means more chance of a crash, it seems. You can try lowering the display resolution via xrandr, say, to 1024x768 on the DP monitor, and run glmatrix there. This may reduce the crash probability, if my guess is correct.

Also, before testing any newer versions, you can rather test the old one that worked before. Namely, install the openSUSE 13.2 kernel on top of the Leap system. Install it like

    rpm -ivh kernel-default.rpm --nodeps --oldpackage

then it should still be bootable. Confirm that this doesn't lead to any crash. If it works stably, it's really a kernel regression.
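The reduced-resolution test could be scripted roughly as follows. This is a sketch under assumptions: run_copies is a made-up helper, DP1 is a placeholder output name, and glmatrix may need its full path since xscreensaver hacks are often installed outside PATH.

```shell
#!/bin/sh
# run_copies COUNT CMD [ARGS...]: start COUNT copies of CMD in the
# background and wait until all of them have exited (hypothetical helper).
run_copies() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &
        i=$((i + 1))
    done
    wait
}

# Example (DP1 is a placeholder; list the real output names with
# `xrandr -q`):
#   xrandr --output DP1 --mode 1024x768
#   run_copies 10 env DRI_PRIME=1 glmatrix
```

Passing DRI_PRIME=1 through env keeps the variable scoped to the test programs instead of leaking it into the calling shell.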
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c31
--- Comment #31 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c32
--- Comment #32 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c33
--- Comment #33 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c34
--- Comment #34 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c35
--- Comment #35 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c36
--- Comment #36 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c37
--- Comment #37 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c38
--- Comment #38 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c39
--- Comment #39 from Takashi Iwai
I found nothing relevant in the log, actually nothing at all at that time, and it never happened again in a couple of days of testing; it is probably another bug, not related to radeon, or it needs special circumstances.
But I set up netconsole (with my old mini 2133 running Debian) to try to capture the old DRI_PRIME=1 bug.
4.1.12-1, immediately after I use DRI_PRIME=1, shows:
[ 187.561982] [drm:radeon_pm_late_init [radeon]] *ERROR* failed to create device file for dpm state
[ 187.562556] [drm:radeon_pm_late_init [radeon]] *ERROR* failed to create device file for dpm state
[ 187.563102] [drm:radeon_pm_late_init [radeon]] *ERROR* failed to create device file for power profile
[ 187.563643] [drm:radeon_pm_late_init [radeon]] *ERROR* failed to create device file for power method
These are bugs that have been fixed in 4.1.13.
[ 187.678751] snd_hda_intel 0000:03:00.1: CORB reset timeout#2, CORBRP = 65535
This is usually a non-fatal issue. It has been checked only since recent kernels.
and nothing more until the crash.
4.1.13-5 shows this every time DRI_PRIME=1 is in use:
[ 378.149829] snd_hda_intel 0000:03:00.1: CORB reset timeout#2, CORBRP = 65535
This is the same as above, thus not the cause. So, we haven't caught the cause yet...
3.16.7-29
Nothing shows up via netconsole when I use DRI_PRIME=1, and it works absolutely stably.
The old kernel had no such check in the HD-audio driver.
With 4.1.12 it crashes every time after a couple of minutes (max 7 min), but 4.1.13-5 crashed after 32 min / 2 min / 11 min with DRI_PRIME=1 in use (tested three times).
Does it happen under high CPU load, or in idle state, too? I wonder whether it's a software failure or whether it's triggered by hardware (by some change in the kernel, etc.). Also, if it's about graphics, you may boot with the drm.debug=0x0e option. Then it'll show more verbose logs regarding graphics.
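To see what drm.debug=0x0e actually selects, the mask can be decoded bit by bit. The sketch below assumes the category bits used by kernels of this era (0x01 core, 0x02 driver, 0x04 KMS, 0x08 PRIME); newer kernels define further bits that are not covered here, and drm_debug_flags is a made-up helper name.

```shell
#!/bin/sh
# drm_debug_flags MASK: name the message categories selected by a
# drm.debug bitmask (hypothetical helper; only the four lowest bits
# are decoded).
drm_debug_flags() {
    mask=$(($1))
    out=""
    [ $((mask & 0x01)) -ne 0 ] && out="$out core"
    [ $((mask & 0x02)) -ne 0 ] && out="$out driver"
    [ $((mask & 0x04)) -ne 0 ] && out="$out kms"
    [ $((mask & 0x08)) -ne 0 ] && out="$out prime"
    echo "${out# }"
}

# drm_debug_flags 0x0e   prints: driver kms prime
#
# The current value can be read from sysfs, and on kernels where the
# parameter is writable it can be changed without a reboot (as root):
#   cat /sys/module/drm/parameters/debug
#   echo 0x0e > /sys/module/drm/parameters/debug
```

So 0x0e enables driver, modesetting and PRIME messages while leaving out the very chatty core category.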
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c40
--- Comment #40 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c41
--- Comment #41 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c42
--- Comment #42 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c43
--- Comment #43 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c44
--- Comment #44 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c45
--- Comment #45 from Takashi Iwai
I installed Arch to test multiple old LTS kernels (3.14, 3.18, ...), but DRI_PRIME=1 works stably there even with 4.1.13; after a day of testing nothing happened (gameplay, glmatrix, ...), with only slightly worse performance than openSUSE Leap but still much better than the IGP.
Could you try kernel-vanilla 4.1.13? It might be that some of our backports broke something.
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c46
--- Comment #46 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c47
--- Comment #47 from Takashi Iwai
For comparison, Arch Linux with DRI_PRIME=1 on kernel 4.1.15 (drm.debug=0x0e, dmesg -n 8) looks like this.
Does the Arch kernel work, or does it show the same crash, too?
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c48
--- Comment #48 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c49
--- Comment #49 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c50
--- Comment #50 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c51
--- Comment #51 from Takashi Iwai
I tested Arch first with 4.1.13, which works stably, then 4.2.7, and I am currently running 4.3.3; all are stable. I have Leap on another drive; I tried, but Leap has the issue with 4.2 and 4.3 too.
Then it makes me wonder whether it's rather a user-space difference (e.g. X driver or configuration), especially if the same issue happened with the 4.1.13 kernel-vanilla.
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c52
--- Comment #52 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c53
--- Comment #53 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c54
--- Comment #54 from Takashi Iwai
The X configuration looks similar, DRI2 with glamor, on both Arch and Leap. (Arch has a somewhat worse framerate, by 20-30%, so something must be different.) But as I wrote, I tested Leap with the latest X/Mesa etc. with the same result. When I have time I will try how PRIME works with another distribution like Xubuntu/Debian.
You can copy the kernel and module files between systems, e.g. copy the Arch kernel (/boot/vmlinuz-*, System.map-*) and modules (/lib/modules/*) into the Leap system. Recreate the initrd (run mkinitrd once), run "grub2-mkconfig -o /boot/grub2/grub.cfg", and you'll be able to boot it. Vice versa, copying the openSUSE kernel to the Arch system should work similarly. If the Arch kernel works on Leap, we can try to find the exact difference.
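The copy step might look like the sketch below. It is an illustration under assumptions: copy_kernel is a made-up helper, /mnt/arch is a placeholder mount point for the other system's root, and System.map is left out because not every distribution ships it in /boot.

```shell
#!/bin/sh
# copy_kernel SRC_ROOT DEST_ROOT: copy kernel images and module trees
# from one mounted root filesystem into another (hypothetical helper).
copy_kernel() {
    src=$1
    dest=$2
    mkdir -p "$dest/boot" "$dest/lib/modules"
    # Kernel images: /boot/vmlinuz-*
    cp -a "$src"/boot/vmlinuz-* "$dest/boot/"
    # Module trees: everything below /lib/modules
    cp -a "$src"/lib/modules/. "$dest/lib/modules/"
}

# Example, run as root on Leap with the Arch root mounted at /mnt/arch
# (a placeholder mount point):
#   copy_kernel /mnt/arch /
#   mkinitrd
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

Copying the whole module tree matters because the kernel only loads modules from the directory matching its own release string.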
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c55
--- Comment #55 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c56
--- Comment #56 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c57
--- Comment #57 from Takashi Iwai
Both /lib64/modules and /lib/modules/? (On Arch both have the same size, 147 MB.)
On SUSE, it's /lib/modules/*
Where is System.map located?
Usually in /boot, with the same suffix as vmlinuz and initrd. (In reply to Jozef Kovac from comment #56)
Same folder, /boot/; on openSUSE there are many more files (new installation on an external HDD, only the OS and all updates).
You need only vmlinuz and initrd for booting. The other files are relevant for development or other purposes.
System.map is present, but not on Arch; what does that mean?
It might work without it; maybe dracut doesn't require it, unlike the old mkinitrd script. Just give it a try.
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c58
--- Comment #58 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c59
--- Comment #59 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c60
--- Comment #60 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c61
--- Comment #61 from Takashi Iwai
But on Arch only initramfs is present; should I still rename initramfs to initrd?
[polo@polo boot]$ dir
grub                               intel-ucode.img      syslinux
initramfs-4.1-x86_64-fallback.img  linux41-x86_64.kver  vmlinuz-4.1-x86_64
initramfs-4.1-x86_64.img           linux43-x86_64.kver  vmlinuz-4.3-x86_64
initramfs-4.3-x86_64-fallback.img  lost+found
initramfs-4.3-x86_64.img           memtest86+
The filename doesn't matter for the kernel itself. But it matters for GRUB, which passes the init ramdisk to the kernel. And openSUSE's GRUB configuration assumes the initrd to be /boot/initrd-xxx.
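The renaming can be derived mechanically from the kernel file name. The helper below is a small sketch that assumes the vmlinuz-<suffix> / initrd-<suffix> pairing described in this thread; initrd_for is a made-up name, and whether an initramfs built on Arch boots cleanly on Leap is not something this snippet can guarantee.

```shell
#!/bin/sh
# initrd_for KERNEL_FILE: print the initrd file name expected next to a
# kernel image under the vmlinuz-<suffix> / initrd-<suffix> convention
# (hypothetical helper).
initrd_for() {
    name=${1##*/}                     # strip any directory part
    echo "initrd-${name#vmlinuz-}"    # swap the vmlinuz- prefix
}

# Example with the Arch files from the listing (run as root in /boot):
#   cp initramfs-4.1-x86_64.img "$(initrd_for vmlinuz-4.1-x86_64)"
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

Copying rather than moving keeps the original initramfs in place, so the Arch-style file name stays available.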
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c62
--- Comment #62 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c63
--- Comment #63 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c64
--- Comment #64 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c65
--- Comment #65 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c66
--- Comment #66 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c67
Vladislav Kamenev
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c68
--- Comment #68 from Vladislav Kamenev
Vladislav Kamenev
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c69
--- Comment #69 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c70
--- Comment #70 from Jozef Kovac
http://bugzilla.opensuse.org/show_bug.cgi?id=954783#c71
--- Comment #71 from Vladislav Kamenev
3.18.30 on Leap works without problems.
3.19.0 and 3.19.8 crash (after a few minutes with DRI_PRIME=1; the same test on the IGP is stable).
Could you say whether "3.18.30 on Leap" is similar to the one I can get from kernel.org or not? I'm not actually using openSUSE.