[Bug 935086] New: System hangs on resume from hibernation
http://bugzilla.opensuse.org/show_bug.cgi?id=935086

Bug ID: 935086
Summary: System hangs on resume from hibernation
Classification: openSUSE
Product: openSUSE Distribution
Version: 13.2
Hardware: Other
OS: openSUSE 13.2
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: novell-ugeuder@sneakemail.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 638210
--> http://bugzilla.opensuse.org/attachment.cgi?id=638210&action=edit
screen image when resume from hibernate hangs

During resuming from hibernation to disk my system hangs. I just installed the new dracut (patch openSUSE-2015-427) but the problem persists.

I am aware that there are a couple of similar bug reports already, e.g. https://bugzilla.opensuse.org/show_bug.cgi?id=917221 but I'm not sure this is the same issue.

My hang has been 100% repeatable since 13.2 came out. (Actually the system has been reinstalled once since then, and the problem is exactly the same.)

When I started to debug the problem and added splash=0 rd.break=pre-mount to the kernel command line, the problem went away.

- With default kernel args: every resume hangs
- With additional splash=0: every resume hangs
- With additional splash=0 rd.break=pre-mount: every resume succeeds

So my guess is this could be a timing issue. Entering the rd shell and waiting until I press Ctrl-D makes the system slow enough that it works.

The last console message before hanging is (see attached screen image)

[ OK ] Reached target Remote File Systems

The problem must be somewhere after that. But when I tried to debug where it is, the problem went away.

splash=0 rd.break=pre-mount is kind of acceptable for me, but still that's not the way things should work.

My setup: /boot is ext2; the rest of the disk is LUKS encrypted, LVM, rootfs is btrfs, /home is xfs.

Some other report mentions that encrypted disks don't work, some problem with crypttab. However, this has never been my problem. The disk password is always asked and opening the encryption seems to succeed. The hang occurs somewhat later.

I guess the problem is difficult to understand/fix from the information provided. So all hints to debug it further / provide more info are welcome.

--
You are receiving this mail because:
You are on the CC list for the bug.
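For anyone repeating the comparison, the three command lines in the report differ only in splash=0 and rd.break=pre-mount. A minimal sketch for checking which of these parameters the running kernel was booted with, assuming a POSIX shell; the helper name cmdline_has is made up for illustration and is not part of dracut or systemd:

```shell
#!/bin/sh
# cmdline_has CMDLINE NAME  (hypothetical helper)
# Succeeds if NAME occurs in CMDLINE as a bare word ("quiet")
# or in NAME=value form ("rd.break=pre-mount").
cmdline_has() {
    for word in $1; do            # unquoted on purpose: one parameter per word
        case "$word" in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}

# Report the parameters discussed in this bug for the running kernel.
if [ -r /proc/cmdline ]; then
    cmdline=$(cat /proc/cmdline)
    for name in splash rd.break quiet; do
        if cmdline_has "$cmdline" "$name"; then
            echo "$name: present"
        else
            echo "$name: absent"
        fi
    done
fi
```

Passing the command line as a string rather than reading /proc/cmdline inside the function makes it easy to try the helper against the three variants listed in the report.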
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c1
--- Comment #1 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c2
Bernhard Wiedemann
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c3
--- Comment #3 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c4
--- Comment #4 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c5
--- Comment #5 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c8
Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c9
--- Comment #9 from Takashi Iwai
> Created attachment 638634 [details]
> screen picture when resume from hibernate hangs + SysRq
>
> I use systemctl hibernate to hibernate the system. (Originally I used the KDE desktop to call hibernate from the GUI. The hang already existed there with the same symptoms. Recently I have switched to the i3 window manager and I call systemctl hibernate from the command line.)
>
> When the system hangs in resume, SysRq just prints the headlines, but no information (see attached screen shot). What does that mean?

Is the quiet boot option removed? Also, increase the log level via alt-sysrq-8 or 9 beforehand.

> While experimenting with SysRq I noticed that "SysRq i" (SIGKILL to all) makes the resume complete (tried twice, worked twice). The system seemed functional after that, but I did not dare to really use it, because I'm not sure what might be in an inconsistent state after the killing. What can we learn from that? I guess it means that the hang was still in initramfs, so killing all initramfs processes made it resume the real root. But I don't understand the details of how the real root could come up when everything is killed.

The kernel had already started the resume, as its prompt shows. But I wonder how the remote file system message appears *after* it. So it looks as if two things are running concurrently and conflicting.

> (After the resume had completed, SysRq showed the complete information, not just the headlines.)
>
> I also tried "echo disk | sudo tee /sys/power/state". In this case the system does not hang when resuming. What do we learn from that?

What if you pass the resumedelay=10 boot option? This will delay the resume by 10 seconds after it is kicked off.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c10
--- Comment #10 from Uwe Geuder
> Is the quiet boot option removed?

There is no quiet option on my kernel line. It was the first thing I removed when starting this debugging exercise and I have not put it back since.

> Also, increase the log level via alt-sysrq-8 or 9 beforehand.

The log level seems to have no effect on SysRq-L, SysRq-P, and SysRq-T. When the system hangs, no information is shown even if the log level is 8 or 9. When the system works, full information is shown even if the log level is 0.

During the boot more messages are shown if the log level is 9. But there is no additional message shortly before the system hangs at "Reached target Remote File system" as shown in the screen pictures before.

> What if you pass the resumedelay=10 boot option? This will delay the resume by 10 seconds after it is kicked off.

I changed to resumedelay=30 to be absolutely sure I don't miss any effect of the option. This option does not seem to work as intended. The delay happens every time (also on fresh boots), even before the disk password is asked. Because the snapshot is inside the encrypted volume, I don't think the system can already know at this point that it should resume.

After entering the disk password the parameter seems to have no further effect in any of the 3 cases:

1.) systemctl hibernate: resume hangs forever
2.) echo disk > /sys/power/state: no additional wait of 30 seconds when the system is resuming
3.) Alt-SysRq-i when the system hangs: no additional wait when the system is resuming
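Case 2 writes "disk" to /sys/power/state directly, bypassing whatever systemctl hibernate runs in between. A small sketch for checking that the kernel advertises hibernation before trying that path; the function name supports_state and its optional file argument are assumptions, added so the helper can be tried against a sample file:

```shell
#!/bin/sh
# supports_state STATE [FILE]  (hypothetical helper)
# Succeeds if STATE (e.g. "disk" or "mem") is listed in FILE,
# which defaults to /sys/power/state.
supports_state() {
    file=${2:-/sys/power/state}
    [ -r "$file" ] || return 1
    for s in $(cat "$file"); do
        [ "$s" = "$1" ] && return 0
    done
    return 1
}

if supports_state disk; then
    echo "kernel hibernation available"
    # To hibernate through the kernel interface, as root:
    #   echo disk > /sys/power/state
else
    echo "kernel hibernation not advertised"
fi
```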
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c11
--- Comment #11 from Takashi Iwai
> (In reply to Takashi Iwai from comment #9)
> > Is the quiet boot option removed?
>
> There is no quiet option on my kernel line. It was the first thing I removed when starting this debugging exercise and I have not put it back since.
>
> > Also, increase the log level via alt-sysrq-8 or 9 beforehand.
>
> The log level seems to have no effect on SysRq-L, SysRq-P, and SysRq-T. When the system hangs, no information is shown even if the log level is 8 or 9. When the system works, full information is shown even if the log level is 0.
>
> During the boot more messages are shown if the log level is 9. But there is no additional message shortly before the system hangs at "Reached target Remote File system" as shown in the screen pictures before.
>
> > What if you pass the resumedelay=10 boot option? This will delay the resume by 10 seconds after it is kicked off.
>
> I changed to resumedelay=30 to be absolutely sure I don't miss any effect of the option. This option does not seem to work as intended. The delay happens every time (also on fresh boots), even before the disk password is asked. Because the snapshot is inside the encrypted volume, I don't think the system can already know at this point that it should resume.

Yeah, it's no help, unfortunately. The real resume is triggered in the dracut 95resume module, which bypasses the resumedelay option.

As a blind shot: could you try to uninstall the package "suspend"? This is a user-space suspend and it often behaves badly with openSUSE 13.2 and later.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c12
--- Comment #12 from Uwe Geuder
> As a blind shot: could you try to uninstall the package "suspend"? This is a user-space suspend and it often behaves badly with openSUSE 13.2 and later.

Yes, after removing the "suspend" package the system resumes without hanging. Thanks for that tip. (The bug report https://bugzilla.opensuse.org/show_bug.cgi?id=917221 suggests that uninstalling pm-utils would help, but that was not the case for me.)

s2disk progress is no longer displayed (well, it's part of the suspend package, so that's not a surprise). Instead the same progress messages are shown as when writing directly into /sys/power/state.

After that I removed my debugging support and went back to the kernel command line containing "splash=silent quiet" as it was initially after installation (instead of just "splash=0" used during the debugging). Resume still works.

There is only one cosmetic issue: the Plymouth screen is not displayed when the system hibernates; instead console messages are visible. I know it works in 13.1. I don't remember whether it has ever worked in 13.2, because I have used "splash=0" for too long. Personally the console messages don't disturb me.

But from a distro point of view, are we happy with the solution of uninstalling the suspend package and having no plymouth screen during hibernate? Could uninstalling the suspend package break some setups other than mine?

(As said, I'm not 100% sure whether uninstalling suspend made plymouth during suspend go away or whether it's an unrelated issue. But I need to stop debugging for now and do some "real" work...)
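A quick way to see which of the packages discussed here are installed, as a sketch that assumes an RPM-based system such as openSUSE. check_pkgs is a made-up helper name, and rpm -q is used only to query, not to remove anything:

```shell
#!/bin/sh
# check_pkgs PKG...  (hypothetical helper)
# Prints "installed" or "not installed" for each package name.
check_pkgs() {
    if ! command -v rpm >/dev/null 2>&1; then
        echo "rpm not found; cannot query packages"
        return 0
    fi
    for pkg in "$@"; do
        if rpm -q "$pkg" >/dev/null 2>&1; then
            echo "$pkg: installed"
        else
            echo "$pkg: not installed"
        fi
    done
}

check_pkgs suspend pm-utils
```

Removal itself would go through the package manager, for example "zypper remove suspend" as root.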
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c13
Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c14
--- Comment #14 from Uwe Geuder
> Could you try one more test?

No problem, I'm glad to get this solved.

> Please reinstall suspend and pm-utils but change SLEEP_MODULE to "kernel" in pm-utils default. Does this make resume working again?

Just to clarify: removing pm-utils was something I tried without success in a previous installation of 13.2. For the whole lifetime of this report pm-utils has always been installed.

I re-installed the suspend package and created a configuration file like this:

$ cat /etc/pm/config.d/sleepmodule.config
SLEEP_MODULE="kernel"

Resume works without hanging. The suspend hooks are executed as shown in the attached log file. E.g. grubonce suppresses the usual grub menu when waking up from hibernation. That suppression was not there while the suspend package was uninstalled.

However, I notice the following issues:

1. I made 5 hibernate cycles. In one of the 5 the machine crashed and rebooted during the resume. Obviously at the reboot no valid snapshot was found anymore, so a fresh boot occurred. No traces were left in the logs of what happened. So it probably happened shortly before, during, or shortly after switching to the real root file system, and no information could be stored to the filesystem. I don't have time now to make more reliable statistics on how often that really happens. But while using the rd.break=pre-mount work-around it has not happened during ~3 months, some 50-100 resumes.

2. There is no plymouth screen during hibernation. Instead there is flickering and console messages are visible. As said before, not a showstopper for me, but a regression from 13.1.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c15
Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c16
--- Comment #16 from Uwe Geuder
> OK, then this is some wrong dracut and suspend setup. I suspect 95resume has an issue with suspend package. Maybe this is a dup of bug 925873 and bug 905424.

I think both bugs are completely different judging from the symptoms observed. Also, for me rd.break=pre-mount makes the resume complete, but the reporter of 925873 writes that it still hangs (in a different location than mine).

But what all 3 bug reports probably have in common: there is some nasty competition between pm-utils, suspend, systemd, and the kernel hibernate functionality. (I do not use suspend to RAM, but I guess it's the same there.) So as a distro it might be useful to remove some package(s) and make sure the remaining packages co-operate nicely. The other reports had some comments about what seems particularly old/unmaintained. I have no information to add at this moment.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c17
--- Comment #17 from Takashi Iwai
> (In reply to Takashi Iwai from comment #13)
> > Could you try one more test?
>
> No problem, I'm glad to get this solved.
>
> > Please reinstall suspend and pm-utils but change SLEEP_MODULE to "kernel" in pm-utils default. Does this make resume working again?
>
> Just to clarify: removing pm-utils was something I tried without success in a previous installation of 13.2. For the whole lifetime of this report pm-utils has always been installed.
>
> I re-installed the suspend package and created a configuration file like this:
>
> $ cat /etc/pm/config.d/sleepmodule.config
> SLEEP_MODULE="kernel"
>
> Resume works without hanging. The suspend hooks are executed as shown in the attached log file. E.g. grubonce suppresses the usual grub menu when waking up from hibernation. That suppression was not there while the suspend package was uninstalled.
>
> However, I notice the following issues:
>
> 1. I made 5 hibernate cycles. In one of the 5 the machine crashed and rebooted during the resume. Obviously at the reboot no valid snapshot was found anymore, so a fresh boot occurred. No traces were left in the logs of what happened. So it probably happened shortly before, during, or shortly after switching to the real root file system, and no information could be stored to the filesystem. I don't have time now to make more reliable statistics on how often that really happens. But while using the rd.break=pre-mount work-around it has not happened during ~3 months, some 50-100 resumes.

Hmm, I can't immediately think of a relation with pm-utils. (Actually I have not figured out why user-suspend got broken but kernel-suspend works.) It all looks like a side effect of a racy resume procedure to me, so I won't be surprised if some instability remains even without pm-utils.

> 2. There is no plymouth screen during hibernation. Instead there is flickering and console messages are visible. As said before, not a showstopper for me, but a regression from 13.1.

This is a known drawback of kernel-suspend, IIRC.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c18
--- Comment #18 from Uwe Geuder
> All things look like a side effect of racy resume procedure to me, so I

Yes, same here.

I went back to the default configuration and added

ps -ef >/dev/console
sleep 15

at the beginning of /usr/lib/dracut/modules.d/95resume/resume.sh (the initramfs needs to be built with dracut --add debug, because by default the ps binary is not in initramfs).

I would have expected the sleep 15 to prevent the hang, as rd.break=pre-mount does. But it does not; resume hangs in the "old" location.

I attach the screen picture of the ps output, but it's probably more entertaining than helpful for understanding what is going on.

Need to stop debugging now. Maybe I can add even more debugging in a few days.
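The change to resume.sh described in this comment can be sketched as follows. The helper writes a patched copy instead of editing the file in place and keeps the first line (assumed to be the #! line) on top. prepend_debug is a hypothetical name, not a dracut tool, and the --force flag in the commented example is an assumption added so that the existing initramfs image gets overwritten:

```shell
#!/bin/sh
# prepend_debug SRC DST  (hypothetical helper)
# Copies SRC to DST, inserting the two debug lines right after the
# first line, which is assumed to be the interpreter line.
prepend_debug() {
    {
        head -n 1 "$1"
        printf '%s\n' 'ps -ef >/dev/console' 'sleep 15'
        tail -n +2 "$1"
    } >"$2"
}

# Example (path taken from this report; run as root):
#   prepend_debug /usr/lib/dracut/modules.d/95resume/resume.sh /tmp/resume.sh
#   cp /tmp/resume.sh /usr/lib/dracut/modules.d/95resume/resume.sh
#   dracut --add debug --force
```

Working on a copy first makes it easy to inspect the result and to restore the original script after the experiment.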
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c19
--- Comment #19 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c20
--- Comment #20 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c21
--- Comment #21 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c22
--- Comment #22 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c23
--- Comment #23 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c24
--- Comment #24 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c25
Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c26
--- Comment #26 from Takashi Iwai
> Could you add "noresume" boot option (while keeping "resume=xxx" option) and retest for a few times whether you still get the unexpected reboot with SLEEP_METHOD="kernel"?

I meant SLEEP_MODULE, of course.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c27
--- Comment #27 from Uwe Geuder
> > Also, increase the log level via alt-sysrq-8 or 9 beforehand.
>
> The log level seems to have no effect on SysRq-L, SysRq-P, and SysRq-T. When the system hangs, no information is shown even if the log level is 8 or 9. When the system works, full information is shown even if the log level is 0.
>
> During the boot more messages are shown if the log level is 9. But there is no additional message shortly before the system hangs at "Reached target Remote File system" as shown in the screen pictures before.

This was incorrect. On this keyboard I need to use the keypad digits to make it work: Alt-SysRq-KP_8 instead of Alt-SysRq-8, which I used before.

So the corrected information is: the kernel seems to be fully alive every time the system hangs. Obviously the log level was too low before.

Unfortunately the task list information does not fit into the console scrollback buffer by far, so I first need to find a way to increase the scrollback buffer.

From the CPU state I can see that 3 of my cores are always idle and the 4th core is handling the SysRq. So this might look like a deadlock in user space.
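While the system is still responsive, the same SysRq functions can be triggered without the keyboard by writing the key character to /proc/sysrq-trigger, which sidesteps the question of which digit keys the keyboard sends. A minimal sketch; sysrq_send and its optional second argument are hypothetical, added so the helper can be pointed at a scratch file. It does not help once the resume has already hung in the initramfs with no shell available:

```shell
#!/bin/sh
# sysrq_send KEY [TRIGGER]  (hypothetical helper)
# Writes a single SysRq key to TRIGGER (default /proc/sysrq-trigger).
# Only the keys used in this report are accepted:
#   0-9  console log level
#   l    backtrace of all active CPUs
#   p    current registers
#   t    task list
#   i    SIGKILL to all processes except init
sysrq_send() {
    case "$1" in
        [0-9]|l|p|t|i) ;;
        *) echo "unsupported SysRq key: $1" >&2; return 1 ;;
    esac
    printf '%s' "$1" >"${2:-/proc/sysrq-trigger}"
}

# Example, as root: raise the log level, then dump the task list.
#   sysrq_send 9 && sysrq_send t
```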
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c28
--- Comment #28 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c29
--- Comment #29 from Uwe Geuder
From the picture we see that after reporting the compression ratio there will be 2 lines of size information for the compressed image.

When the hang occurs, only 0 or 1 lines of compressed image statistics are written and then the system hangs (as shown in previous attachments 639392 and 639393).

If you look at the code starting from line 669 in load.c you see that it does nothing besides printf() http://paste.opensuse.org/45471188

That really would look like printf() is hanging, but I cannot believe that. On the one hand it could vaguely explain why rd.break=pre-mount makes a difference: the console can be in a different state after having been in an interactive shell just before. But in the normal setup stdout of the resume process is not the console at all, it's the socket to journald. And the hang has been reproducible with the socket and with the console. So no, hanging in printf() makes no sense to me.

My printf debugging occurs even later, so we are not even close to the place where the kernel switches to the structures restored from the snapshot. The hang seems to occur while the resume process runs in user space or does trivial printf() at most.

Of course getting a call stack of the resume process might be helpful.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c30
--- Comment #30 from Uwe Geuder
> Hm, then uswsuspend might be also broken with the recent kernel. This method isn't used by many distros, so little tested, I'm afraid.
> ...
> Maybe it's better to concentrate on stabilization of kernel hibernation.

According to the kernel documentation, kernel resume is not supported from an LVM2 partition at all. So it should not work for me at all. I find this a bit hard to believe; maybe the code has been improved but the documentation has not been updated.

I have asked the suspend maintainers, let's see whether we get an answer. http://marc.info/?l=linux-pm&m=143544300618472&w=2
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c31
--- Comment #31 from Uwe Geuder
> BTW, while looking at the dmesg output on my machine, I noticed that the system triggers hibernate-resume twice: once the kernel itself and once by dracut.

Thanks for yet another idea. I need to investigate that later. The needinfo flag is still active.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c32
--- Comment #32 from Uwe Geuder
> BTW, while looking at the dmesg output on my machine, I noticed that the system triggers hibernate-resume twice: once the kernel itself and once by dracut.
>
> Could you add "noresume" boot option (while keeping "resume=xxx" option) and retest for a few times whether you still get the unexpected reboot with SLEEP_METHOD="kernel"?
>
> Also does this have any influence with SLEEP_METHOD="uswsusp"?

Ah, I did not even know that the kernel can apparently resume directly from its command line without any help from the initramfs. Well, I have used disk encryption longer than hibernate, so this does not apply to me.

Yes, the first resume failure has always been in the kernel log. I never understood where it comes from. That's also the place where resumedelay=10 takes effect. But of course with LUKS encryption and LVM that resume has never succeeded for me and never will. I don't think the failed attempt should confuse the kernel; the device just does not exist. If it got confused there and malfunctioned later, that would be a bad bug.

Anyway, I tried the "noresume" parameter (only with uswsusp so far). It looks like dracut also respects this parameter and skips the whole resume script. So not a good idea, at least not with uswsusp. It will do a fresh boot and mount the filesystems after a hibernate, which can always mean data loss.

Not sure about kernel mode. One resume action is in the same 95resume script. If that is the only one, it will not resume either. But I need to test.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c33
--- Comment #33 from Uwe Geuder
> (In reply to Takashi Iwai from comment #25)
> > Could you add "noresume" boot option (while keeping "resume=xxx" option) and retest for a few times whether you still get the unexpected reboot with SLEEP_METHOD="kernel"?
>
> I meant SLEEP_MODULE, of course.

If noresume is on the kernel command line and SLEEP_MODULE="kernel", the system does not even hibernate (systemctl hibernate does nothing visible).

At least /usr/lib/pm-utils/sleep.d/99Zgrub seems to check for the noresume option. I have not studied in detail what happens if there is a match.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c34
--- Comment #34 from Takashi Iwai
> (In reply to Takashi Iwai from comment #26)
> > (In reply to Takashi Iwai from comment #25)
> > > Could you add "noresume" boot option (while keeping "resume=xxx" option) and retest for a few times whether you still get the unexpected reboot with SLEEP_METHOD="kernel"?
> >
> > I meant SLEEP_MODULE, of course.
>
> If noresume is on the kernel command line and SLEEP_MODULE="kernel", the system does not even hibernate (systemctl hibernate does nothing visible).
>
> At least /usr/lib/pm-utils/sleep.d/99Zgrub seems to check for the noresume option. I have not studied in detail what happens if there is a match.

Thanks, but I don't think it's worth testing further in this way. We're going to remove the suspend and pm-utils packages as an update fix even for openSUSE 13.2 in the end.

So, please test again without the suspend and pm-utils packages, and confirm that it works stably enough.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c35
--- Comment #35 from Uwe Geuder
> So, please test again without suspend and pm-utils packages, and confirm that it works stably enough.

OK, packages removed and kernel command line restored. One test was successful, but I will test for a couple of days in normal usage (and hopefully be able to add a couple of extra hibernations) and then report how it went.
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c36
--- Comment #36 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c37
--- Comment #37 from Uwe Geuder
http://bugzilla.opensuse.org/show_bug.cgi?id=935086
http://bugzilla.opensuse.org/show_bug.cgi?id=935086#c38
Takashi Iwai