[Bug 944659] New: After installing Leap M2, 9 out of 10 times, the system hangs with black screen, just after the bootloader hands over
http://bugzilla.opensuse.org/show_bug.cgi?id=944659
Bug ID: 944659
Summary: After installing Leap M2, 9 out of 10 times, the system hangs with black screen, just after the bootloader hands over
Classification: openSUSE
Product: openSUSE Distribution
Version: 42.1 Milestone 2
Hardware: Other
OS: Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: milan.zimmermann@gmail.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36
Build Identifier:
After installing Leap M2, the system does not start *most of the time*. Just after the bootloader messages go away, there is some activity on the keyboard, where either the FNLOCK or the NUMLOCK LED flashes, and after that the system hangs with a black screen (it never switches to what I would describe as the alt-ctrl-F7 screen).
This happens 9 out of 10 times, whether I just reset or cold reboot the system. I made numerous changes in the BIOS after this occurred to try to resolve it (including, of course, setting the settings to Save), but the behaviour is still the same.
As I noticed the keyboard (FNLOCK or NUMLOCK) activity after the bootloader screen goes away, I tried several keyboards, with the same result. I also noticed that *sometimes* after the bootloader message goes away, FNLOCK goes ON; if I am fast enough to switch it back OFF, the system will start.
This is AMD with Radeon:
home-server:~ # /sbin/lspci -nnk | grep -i vga -A2
01:05.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RS880 [Radeon HD 4290] [1002:9714]
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:d000]
        Kernel driver in use: radeon
I suspect some sort of driver bug or timing issue, but that is speculation.
Reproducible: Always
Steps to Reproduce:
1. Reboot the system
2. Bootloader shows its message
3. Black Screen
Actual Results: The system hangs just after the bootloader messages go away. This happens 9 out of 10 times. The times when the system starts, it behaves normally.
Expected Results: The system switches to the F7 display (as if I hit Alt-Ctrl-F7) and shows the login screen.
I described more details in the Details section. As I mentioned, this is a Radeon on-chip card:
home-server:~ # /sbin/lspci -nnk | grep -i vga -A2
01:05.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RS880 [Radeon HD 4290] [1002:9714]
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:d000]
        Kernel driver in use: radeon
--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c1
--- Comment #1 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c2
Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c3
--- Comment #3 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c4
--- Comment #4 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c5
--- Comment #5 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c6
--- Comment #6 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c7
--- Comment #7 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c8
--- Comment #8 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c9
--- Comment #9 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c10
--- Comment #10 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c11
--- Comment #11 from Takashi Iwai
> 2. What about the rescue system on DVD? Is it only the installed system that hangs? : Not sure I understand - I have only one OS installed; this is on an SSD with 2 partitions, /dev/sda1 for the OS (no further partitioning) and /dev/sda2 for swap. I have not tried to boot into the rescue on the USB - do you want me to try?
Yes, I meant the rescue boot item from the installation DVD.
(In reply to milan zimmermann from comment #10)
> I also have an update here: Let me look into two things:
> 1. When I said that I changed DEFAULT_APPEND in sysconfig->bootloader and removed the "quiet" option - I did that, it is still gone.
Changing /etc/sysconfig/bootloader alone is not enough; you'll have to refresh the grub configuration, too. This can be done via the YaST bootloader dialog. Or it'll be easier to edit /etc/default/grub instead (edit $GRUB_CMDLINE_LINUX_DEFAULT), then update the real grub config via
    /usr/sbin/grub2-mkconfig -o /boot/grub2/grub.cfg
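The /etc/default/grub route described above could be scripted roughly as follows. This is only a sketch: strip_quiet is a made-up helper name, it assumes GNU sed, and it assumes "quiet" appears as a separate word on the GRUB_CMDLINE_LINUX_DEFAULT line.

```shell
#!/bin/sh
# strip_quiet is a hypothetical helper name used only for this sketch.
# It removes the "quiet" option from the GRUB_CMDLINE_LINUX_DEFAULT line
# of the file given as $1 (normally /etc/default/grub). Requires GNU sed.
strip_quiet() {
    sed -i '/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/\bquiet\b *//' "$1"
}

# Typical use, as root (the second command is the one quoted above):
#   strip_quiet /etc/default/grub
#   /usr/sbin/grub2-mkconfig -o /boot/grub2/grub.cfg
```

Only the GRUB_CMDLINE_LINUX_DEFAULT line is touched; the grub.cfg that is actually read at boot changes only once grub2-mkconfig has been run.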
> But it appears this did not actually affect the bootloader. The reason I think that is that when you pointed out the 'e' option, I tried it, removed "quiet", and the system did boot (one try only, so this is not conclusive), but most importantly, the boot process looked different: it does show all the messages (which removing "quiet" from the DEFAULT_APPEND did not).
Right, the purpose of removing the quiet option is to see the kernel log at the hang.
> 2. I noticed my SECURE_BOOT is set to "yes" - I plan to experiment with setting it to "no".
Oh, that's an interesting point. Yes, please investigate it, too. Thanks.
If you find out that the hang happens far before the kernel starts showing many messages, you might try passing the dis_ucode_ldr boot option. It's just to be sure, though.
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c12
--- Comment #12 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c13
--- Comment #13 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c14
--- Comment #14 from Takashi Iwai
> How can I deliver the messages to this report? I took a camera picture of the messages, but the result is too big for an attachment. Can I email it to your email (if so, what is it)? Or is there a log this stuff goes to?
It's no problem to attach a picture on Bugzilla. Use an attachment. The size doesn't matter unless it's over 100MB or so :)
> I see no clear errors there, except (this may be a message) "radeon: ... registered panic notifier". The last message is "systemd-journald ... Received request to flush runtime journal from PID 1"
Maybe your first test with nomodeset wasn't performed properly? Could you retry with the nomodeset boot option? This will result in lower (or no) graphics with VESA fb, but it should be working at least.
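One way to confirm that a boot option such as nomodeset really reached the running kernel is to look at /proc/cmdline after the boot. A minimal sketch, where has_boot_option is a made-up helper name and the optional second argument exists only so the check can be tried on a sample file:

```shell
#!/bin/sh
# has_boot_option is a hypothetical helper name used only for this sketch.
# $1: option to look for, e.g. nomodeset or acpi=off
# $2: file holding the kernel command line (defaults to /proc/cmdline)
# Succeeds (exit status 0) only if the option appears as a whole word.
has_boot_option() {
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Typical use after a reboot:
#   has_boot_option nomodeset && echo "nomodeset is active"
```

The option is matched as a whole word, so asking for "acpi" does not match "acpi=off".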
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c15
--- Comment #15 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c16
--- Comment #16 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c17
--- Comment #17 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c18
--- Comment #18 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c19
--- Comment #19 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c20
--- Comment #20 from Takashi Iwai
> I tested with nomodeset, making sure to run mkconfig and that it appears in /boot/grub2/grub.cfg. It did not help; the system failed to boot 5 times in 6 tries. In the successful boot, the screen resolution was lower and the display narrower and stretched, all indications that nomodeset did kick in.
> But interestingly, in dmesg with nomodeset, there is an error:
> [ 3.167331] pata_jmicron 0000:05:00.1: enabling device (0000 -> 0001)
> [ 3.167740] [drm] VGACON disable radeon kernel modesetting.
> [ 3.167758] [drm:radeon_init [radeon]] *ERROR* No UMS support in radeon module!
This is OK, the expected result.
> I am attaching the full dmesg.
> Overall, from some 1000 boots or so I did during the last week, the only reliably working setting was when I boot using USB and, in F5 (Kernel), set "No ACPI". But setting acpi=off in the bootloader does not have the same effect, as I noted.
acpi=off supposedly disables some devices indirectly, so this might help avoid the bad point.
> Not sure where to take it next, but thanks very much for your help so far.
> BTW, would you have some idea how to find what actual setting is set when I select "No ACPI" in the USB boot in F5 (Kernel)?
You can take a look at /proc/cmdline.
Judging from the boot screen you attached in comment 16, this doesn't seem like a crash of any driver. Now that I've read through the kernel log, the possible hit after the last dying message is acpi-cpufreq. Could you try to blacklist it, e.g. by adding the following line to /etc/modprobe.d/99-local.conf?
    blacklist acpi-cpufreq
Then reboot and retest. Check the dmesg output to verify whether acpi-cpufreq was still loaded. If a message with "acpi-cpufreq" appears, the blacklist didn't work -- as a temporary test, just remove the module from the /lib/modules/$VERSION/kernel/drivers/cpufreq directory.
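The blacklist test suggested above could be scripted along these lines. This is only a sketch: blacklist_module and module_in_log are made-up helper names, and the string searched for in the kernel log is simply the module name as written in the comment.

```shell
#!/bin/sh
# blacklist_module and module_in_log are hypothetical helper names.

# Append "blacklist <module>" to a modprobe config file unless that exact
# line is already there. $1: module name, $2: config file path.
blacklist_module() {
    grep -qxF -- "blacklist $1" "$2" 2>/dev/null || printf 'blacklist %s\n' "$1" >> "$2"
}

# Succeed if the kernel log read from stdin mentions the module name $1.
module_in_log() {
    grep -qF -- "$1"
}

# Typical use, as root:
#   blacklist_module acpi-cpufreq /etc/modprobe.d/99-local.conf
# and after the reboot:
#   dmesg | module_in_log acpi-cpufreq && echo "blacklist did not take effect"
```

Running blacklist_module twice leaves a single line in the file, so the step can be repeated safely between test boots.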
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c21
--- Comment #21 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c22
--- Comment #22 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c23
--- Comment #23 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c24
--- Comment #24 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c25
--- Comment #25 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c26
--- Comment #26 from Takashi Iwai
> Thanks for following up.
> The issue is the same after update to 42.1 Beta.
> After many long searches I came to a hard-to-prove (and of course quite possibly incorrect) conclusion that this has something to do with the audio support in the kernel; there are dmesg lines like
> snd_hda_codec_hdmi hdaudioC1D0: HDMI ATI/AMD: no speaker allocation for ELD
> just at the time the system normally hangs.
This is an utterly harmless message, normally seen when plugging in a monitor without a speaker, and it must be just a coincidence that it is the last one seen. I can say that because I am the upstream maintainer of the sound subsystem :)
What we really need is to figure out whether this is really a kernel hang and, if yes, what kind of hang.
Since you don't get any kernel Oops or panic message, it doesn't look like a normal kernel hang due to a kernel bug, but either a hardware hang (a hardware defect or a hang caused by a driver bug) or some bad task blocking the whole system. As acpi=off seems to cure it, the odds favor the former.
If so, it's tough to figure out. You need to start from a minimal system that works reliably, by disabling as many hardware components as possible, then enable them piece by piece until it hits the issue again.
Or you can try older kernels. For example, the 3.11.x kernel in openSUSE-13.1, 3.12.x for SLE12, 3.16.x for openSUSE-13.2. If any older kernel works, we may try bisection to spot the regression.
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c27
--- Comment #27 from milan zimmermann
> (In reply to milan zimmermann from comment #25)
> > Thanks for following up.
> > The issue is the same after update to 42.1 Beta.
> > After many long searches I came to a hard-to-prove (and of course quite possibly incorrect) conclusion that this has something to do with the audio support in the kernel; there are dmesg lines like
> > snd_hda_codec_hdmi hdaudioC1D0: HDMI ATI/AMD: no speaker allocation for ELD
> > just at the time the system normally hangs.
> This is an utterly harmless message, normally seen when plugging in a monitor without a speaker, and it must be just a coincidence that it is the last one seen. I can say that because I am the upstream maintainer of the sound subsystem :)
Great, thanks, I will not push in that direction.
> What we really need is to figure out whether this is really a kernel hang and, if yes, what kind of hang.
> Since you don't get any kernel Oops or panic message, it doesn't look like a normal kernel hang due to a kernel bug, but either a hardware hang (a hardware defect or a hang caused by a driver bug) or some bad task blocking the whole system. As acpi=off seems to cure it, the odds favor the former.
Regarding acpi=off: to be precise, booting the "rescue system" with acpi=off works 100% of the time. But if I set acpi=off in my hard disk boot, there is no difference and it mostly hangs.
> If so, it's tough to figure out. You need to start from a minimal system that works reliably, by disabling as many hardware components as possible, then enable them piece by piece until it hits the issue again.
As best I can tell, I did everything I can think of. I have disabled devices in the BIOS. I have pulled every plug, USB and otherwise, including the monitor, and when I plug the monitor back in, the system shows the hang message.
> Or you can try older kernels. For example, the 3.11.x kernel in openSUSE-13.1, 3.12.x for SLE12, 3.16.x for openSUSE-13.2. If any older kernel works, we may try bisection to spot the regression.
I think trying 13.2 is worth it. I do not want to go to 13.1: I use btrfs and am not sure it is supported there, but if it is, I can try that. Would you have any advice on how to add the 13.2 repo so that the kernel is forced to be taken from it? Thanks
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c28
--- Comment #28 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c29
--- Comment #29 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c30
--- Comment #30 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c31
--- Comment #31 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c32
--- Comment #32 from milan zimmermann
Just added http://download.opensuse.org/repositories/home:/tiwai:/kernel:/3.17/standard... as a repo, will switch the kernel and test it.
This kernel (3.17.6-1.g12b7bf1-desktop) booted 5 times out of 5. That is probably enough testing for this one, but I will try a few more boots tomorrow. Dmesg attached. I will test 3.18 and 3.19 next, tomorrow after I get some work done (it is way after midnight here), and will report here.
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c33
--- Comment #33 from milan zimmermann
milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c34
--- Comment #34 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c35
--- Comment #35 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c36
--- Comment #36 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c37
--- Comment #37 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c38
--- Comment #38 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c39
--- Comment #39 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c40
--- Comment #40 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c41
--- Comment #41 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c42
--- Comment #42 from milan zimmermann
http://bugzilla.opensuse.org/show_bug.cgi?id=944659#c43
Stephan Kulow
Ludwig Nussel