[Bug 235197] New: HP nx6325: hibernate / suspend to disk hangs during suspend
https://bugzilla.novell.com/show_bug.cgi?id=235197

           Summary: HP nx6325: hibernate / suspend to disk hangs during suspend
           Product: openSUSE 10.2
           Version: Final
          Platform: x86-64
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: bk@novell.com
        ReportedBy: bk@novell.com
         QAContact: qa@suse.de
                CC: jplack@novell.com, bodo.bauer@novell.com, pavel@novell.com, stefan.fent@novell.com, trenn@novell.com, seife@novell.com

I'm debugging hibernate / suspend to disk on the HP nx6325 (ATI SB400 chipset). The cases below were tested with "powersave -U" from the GUI or the Hibernate trigger from the GNOME Task bar.

- without additional kernel options, it hangs early on suspend
- switching to the text console, /etc/pm/hooks/50modules hangs in "modprobe -r button"
-> Added a check to 50modules to skip unloading button.ko, continued testing with this change:
- with "noapic", it resumes once from the Desktop, but the 2nd attempt hangs on suspend
- with "noapic maxcpus=0" (and with unloading button.ko disabled) things look better so far...

side note: with "maxcpus=0", without "noapic", the machine is slow in booting and eventually hangs during boot. The same happens on the Acer Ferrari 1000; its suspend problem is tracked in bug 228344.

Needs more testing and fixing of the issues found, for example the broken unload of button.ko. A temporary workaround for clients can also be to list the ACPI modules to load in /etc/sysconfig/powersave/common like this:

ACPI_MODULES="ac battery processor"

Fix for rmmod button.ko during suspend wanted/needed.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=235197

seife@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fseidel@novell.com, hmacht@novell.com

------- Comment #1 from seife@novell.com 2007-01-15 10:48 MST -------
(In reply to comment #0)
I'm debugging hibernate / suspend to disk on the HP nx6325 (ATI SB400 chipset), the cases below are tested with "powersave -U" from the GUI or the Hibernate trigger from the GNOME Task bar.
- without additional kernel options, it hangs on suspend early - switching to the text console, /etc/pm/hooks/50modules hangs in "modprobe -r button" -> Added a check to 50modules to skip unloading button.ko,
you could also edit /etc/pm/config:SUSPEND_MODULES :-) This might be Thomas' famous "ACPI is fscked up on HP" bug.
continued testing with this change:
- with "noapic", it resumes once from the Desktop, but 2nd hangs on suspend - with "noapic maxcpus=0" (and with unloading button.ko disabled) things do look better so far...
side note: with "maxcpus=0", without "noapic", the machine is slow in booting and eventually hangs during boot. The same happens on the Acer Ferrari 1000; its suspend problem is tracked in bug 228344.
Needs more testing and fixing of the issues found, for example the broken unload of button.ko. A temporary workaround for clients can also be to list the ACPI modules to load in /etc/sysconfig/powersave/common
/etc/sysconfig/powersave is no longer used for suspend settings, pm-utils is used now.
like this:
ACPI_MODULES="ac battery processor"
Fix for rmmod button.ko during suspend wanted/needed.
edit /etc/pm/config. But if "modprobe -r button" does not work, it is a kernel problem IMO :-)
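The check added to 50modules is not shown in this report. A minimal sketch of such a filter, assuming the hook handles the modules to unload as a plain list of names; the function name filter_modules and its skip-list argument are hypothetical and not part of pm-utils:

```shell
# filter_modules "SKIP LIST" MODULE...
# Prints the given module names, minus any that appear in the skip list.
# Hypothetical helper for illustration; not taken from the real 50modules hook.
filter_modules() {
    skip=" $1 "
    shift
    out=""
    for mod in "$@"; do
        case "$skip" in
            *" $mod "*) ;;                      # in the skip list: leave it loaded
            *) out="${out:+$out }$mod" ;;       # otherwise keep it as an unload candidate
        esac
    done
    echo "$out"
}
```

Called as filter_modules "button" ac battery button processor, it prints "ac battery processor", so button would never reach "modprobe -r".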
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #2 from trenn@novell.com 2007-01-15 12:22 MST -------
Which kernel version shows the button rmmod problem?
On HP machines the ACPI BIOS could be broken, which could "possibly" affect the button unload.

The fix is to build psmouse as a module and unload it on shutdown, or to apply this patch (copied and pasted). Both need one reboot with a fixed kernel before things work again:

Index: linux-2.6.18-SL102_BRANCH/drivers/input/serio/serio.c
===================================================================
--- linux-2.6.18-SL102_BRANCH.orig/drivers/input/serio/serio.c
+++ linux-2.6.18-SL102_BRANCH/drivers/input/serio/serio.c
@@ -772,6 +772,7 @@ static struct bus_type serio_bus = {
 	.name = "serio",
 	.probe = serio_driver_probe,
 	.remove = serio_driver_remove,
+	.shutdown = serio_driver_remove,
 };
https://bugzilla.novell.com/show_bug.cgi?id=235197

bk@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

------- Comment #3 from bk@novell.com 2007-01-18 10:11 MST -------
The kernel which I tested is a 10.2 kernel; I'll supply the exact version.

I'll test with the latest kernel from factory or the latest upstream, and maybe check for tg3 suspend/resume patches or applicable git trees, before looking into these issues much deeper.

Yes, one could expect the HP ACPI BIOS to be a possible source of the problems. There are updates for the initial version, and a German Linux user page for this laptop describes some of the breakage in its BIOS and provides some DSDT fixes for it, as well as a completely rewritten DSDT:

http://thinksilicon.redprohosting.de/index.php?page=39

One of the workarounds described on that page is the boot option acpi_os_name="Microsoft Windows NT", which makes the BIOS think that it's running under Windows NT.

Adding ".shutdown = serio_driver_remove," to serio.c apparently helped, but turning psmouse.c into a module in addition to this change made suspend worse for me, so I recompiled with psmouse compiled in.

Because I identified the tg3 driver as the source of the suspend hang after writing the image to disk, I also added it to UNLOAD_MODULES_BEFORE_SUSPEND2DISK in /etc/sysconfig/powersave/sleep on the HP nx6325. Since then, the system seems to suspend/resume without kernel problems, and I started hitting userspace errors instead, like KDE not switching input focus when clicking on windows and GNOME apps crashing on timeouts from DBUS, while the system otherwise kept running.

I'll verify whether there were other changes of significance and possibly check that on a vanilla installation.
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #4 from seife@novell.com 2007-01-18 11:04 MST -------
Just a note...

(In reply to comment #3)
writing the image to disk, I also added it in /etc/sysconfig/powersave/sleep to UNLOAD_MODULES_BEFORE_SUSPEND2DISK on the HP nx6325 and since, the system
On 10.2 this is a NOOP. Use /etc/pm/config.
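As a sketch of the setting referred to here, assuming that SUSPEND_MODULES in /etc/pm/config (the variable named in comment #1) takes a space-separated list of modules to unload before suspend, the tg3 workaround from comment #3 might be written as:

```shell
# /etc/pm/config
# Assumption: modules listed here are unloaded before suspend.
# tg3 is the driver comment #3 unloaded as a workaround.
SUSPEND_MODULES="tg3"
```

This only carries the workaround over to pm-utils; it does not address the driver's suspend/resume handling itself.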
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #5 from pavel@novell.com 2007-01-18 17:12 MST -------
Thanks for the great work you are doing. Note that the "right" solution for the tg3 problem is not unloading it, but fixing its suspend/resume methods (and pushing that upstream). I'll understand if you do not have time to do that... but it would be nice.
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #6 from bk@novell.com 2007-01-19 13:19 MST -------
Thank you very much for your plaudit! I want to give it back to you too.

It seems tg3 isn't the real cause, only a rather reliable trigger. I now saw that sometimes suspend even works with tg3 loaded, but I cannot reproduce these occasions.

BUT... I can now trigger the hang also without tg3:

I have enabled many dev_dbg() debug printks in drivers/power/resume.c by defining DEBUG at the top of it, and I now trigger the hang reliably without tg3 when I set the console loglevel to 8 so that all of them are printed on the framebuffer console.

Reducing the console loglevel to 7 makes suspend work again, unless I load tg3 and run "ifconfig eth0 up" so that tg3_suspend and tg3_resume actually do something.

Looking at tg3_set_power_state(), which (judging from the logged messages) is called even two times below tg3_resume (I assume needlessly), I saw many udelay()s, and even a few msleep()s and a few msleep_interruptible()s elsewhere in tg3.c. Subjectively, I then felt that tg3_resume even took a noticeable time between the printk on entry and the printk on exit.

I then reduced the delays, removed some of the udelay() calls, turned some msleep() calls into udelay() calls, and removed one of the two calls to tg3_set_power_state() inside tg3_resume. Now the suspend does not hang on the first try after boot, and rcnetwork start starts the network and lets me ping a remote host (which even works). But then the driver/card gets into a bad state: I saw an ifdown script hanging, and the next suspend attempt hangs, so I apparently did a bit too much.

When suspend hangs in my testing, it always hangs after this message:

ata2: SATA link down (SStatus 0 SControl 310)

At first I didn't take much note of this message because it is also printed during the working suspend cycles.
However, it's clear to me now that it matters what follows it. When suspend works, that message is directly followed by:

ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 210)

and two more messages until sda is brought back and the data is written to swap.

When suspend does not work, I get with console loglevel 9:

sd 0:0:0:0 resuming        (that's one of the dev_dbg printks in resume.c)
PM: writing image.
swsusp:free swap pages: 2879160

After 6 minutes, I get:

sd 0:0:0:0 timing out on command, waited 360s
sd 0:0:0:0 SCSI error: return code = 0x00000028
end_request: I/O error, dev sda, sector 210756814
Read-error on swap-device (8:0:210756822)
Saving image data pages (117788 pages) ... 4%

and these repeat at intervals of 6 minutes.

I captured some screenshots of the various hangs with my digicam. Scrolling up one page, I also see:

sata_sil 0000:00:12.0 resuming

So that is called at least; maybe further debug should also go into that area?

AFAICS, it seems that the disk's SATA link never comes up again on resume when tg3 is loaded and its eth0 is up, or when I log many debug messages. On the contrary, it is reported to be down directly before the disk should come back to life. So maybe we hit a SATA timeout which is not recovered?

BTW (I guess it's unrelated): For these last tests, I tried using a kernel which is compiled with lockdep checking, and suspending it triggers some printks of:

lockdep: not fixing up alternatives.

It's logged after "CPU 1 is now offline" and after "Enabling non-boot CPUs ..."
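The pattern described in this comment (a "SATA link down" message that is never followed by "SATA link up") can be checked for in a saved console log. A rough sketch, assuming the log has been captured or transcribed to a text file; the function name sata_link_state is made up for illustration, it matches only on the message text quoted above, and it ignores which ata port a message belongs to:

```shell
# sata_link_state LOGFILE
# Prints "up", "down" or "none" according to the last
# "SATA link ..." message found in LOGFILE.
sata_link_state() {
    awk '
        /SATA link down/ { state = "down" }
        /SATA link up/   { state = "up" }
        END {
            if (state == "up") {
                print "up"
            } else if (state == "down") {
                print "down"      # link never reported up again: matches the hang
            } else {
                print "none"      # no SATA link messages in this log
            }
        }
    ' "$1"
}
```

A result of "down" would match the failing cycles described here; "up" would match the working ones.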
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #7 from trenn@novell.com 2007-01-20 04:08 MST -------
I possibly see the same. On my nc6400, suspend worked with SLES10, but no longer does with the current SLES10-SP1 kernel. I think I also saw sd timeout messages....

I have not tried that much yet (init=/bin/bash also did not work); then I removed my ACPI suspend patches, but still no go. Next I will try to remove the libata SP1 patches or whatever came in.

Could it be something that found its way into mainline and got backported to our SP1 kernel? In this case I hope to be able to identify the bad one soon...
https://bugzilla.novell.com/show_bug.cgi?id=235197

pavel@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |teheo@novell.com

------- Comment #8 from pavel@novell.com 2007-01-20 13:34 MST -------
Tejun, maybe we should be more conservative with SP1 changes? I mean, powersaving is nice, but perhaps it should get mainline testing, first? Or at least testing in suse10.3?

(If you want some help with upstreaming those patches, I guess I have some interest in that...)
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #9 from seife@novell.com 2007-01-20 16:00 MST -------
Bernd, take a look at bug #235475 and the patches there. They are already in the SP1 kernel CVS.
https://bugzilla.novell.com/show_bug.cgi?id=235197

teheo@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |bk@novell.com

------- Comment #10 from teheo@novell.com 2007-01-20 19:35 MST -------
The decision to backport mainline libata was made before I joined the company. So, I guess I'm immune from the blame game. ;-)

I agree that it's a big change, but I also agree with the decision because I don't think there would otherwise be a way to implement the pending FATE entries and fix bugs. There were just a lot of things which were "can't do now, will be done in SP1", including ahci NCQ and suspend/resume support.

The backported EH has been in mainline for quite some time now, but interestingly there haven't been any suspend-related bug reports.

Or, if you're talking about ahci link PS: yep, it's new, but a completely separate thing which shouldn't affect anything when disabled. There is a related bug (#236331). I'll follow up on that and make sure it actually stays out of the way.

Bernd, please give the latest CVS kernel a shot. The problem seems similar. Also, do you have a way to capture the console log, such as a digital camera? It would help a lot if you could post the full dmesg.

(PS. I'll need some help. Dunno when I'll work on it tho. I'll cc you when I make some progress. Thanks.)
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #11 from teheo@novell.com 2007-01-20 19:52 MST -------
Created an attachment (id=114066)
 --> (https://bugzilla.novell.com/attachment.cgi?id=114066&action=view)
pm-debug

If the current CVS kernel doesn't work, please apply the attached patch and report what the kernel says.
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #12 from bk@novell.com 2007-01-23 10:46 MST -------
Tejun, following your last comment, I tried the latest SuSE Kernel-of-the-day from

http://ftp.suse.com/pub/projects/kernel/kotd/x86_64/HEAD/

which always contains kernel source and binary builds from our kernel CVS, and I have seen suspend/resume working with sata_sil while tg3 was loaded and active.

So the SATA driver problem has apparently disappeared with your fixes in our CVS.

The psmouse issue also seems to have disappeared; the fix from Thomas Renninger is apparently included and working.

What remains for the HP nx6325 are other problems, meaning that I had to build a kernel myself with a minimum config to work around these remaining issues:

- rmmod button also hangs with our CVS kernel
- the pre-built kernel rpm (kernel-default.rpm) hangs early on suspend (the source of this problem is unknown and has to be tracked down)
https://bugzilla.novell.com/show_bug.cgi?id=235197

teheo@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
      Info Provider|bk@novell.com               |
         Resolution|                            |FIXED

------- Comment #13 from teheo@novell.com 2007-01-23 18:35 MST -------
Thanks, Bernhard. I'm closing this bug. Please open new bug report for the remaining problems.
https://bugzilla.novell.com/show_bug.cgi?id=235197

stefan.fent@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

------- Comment #14 from stefan.fent@novell.com 2007-01-24 03:36 MST -------
Hmm according to Comment #12, this bug is far from being fixed - only parts of it are fixed. Reopening.
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #17 from bk@novell.com 2007-01-24 11:17 MST -------
Update for the status:

The psmouse fix from comment #2 is not included in the CVS kernel, because it's not yet certain whether it is the right and complete fix (e.g. to be merged upstream, it also needs to work in other configurations, specifically when psmouse is compiled as a module and unloaded before resume).

A few suspend/resume cycles didn't reproduce the remaining psmouse hang.

To verify that the issue is still there without the fix, I did a few more tests:

I did 3 suspend/resume cycles with the default binary kernel rpm from Monday's CVS, but the 3rd attempt (having booted up to runlevel 3 before this attempt) hung on resume. The next suspend, started from a fresh reboot to runlevel 3, also hung on resume.

The next reboot to runlevel 2 suspended once but hung on the second cycle.

And after a resume with X started, the X server didn't show a mouse but otherwise worked. Starting X again, it didn't find the configured Synaptics pointer, and its radeon driver also failed to initialize the graphics hardware.

Besides rmmod button, rmmod thermal also hangs; I'll likely hand that part over to Thomas Renninger.
https://bugzilla.novell.com/show_bug.cgi?id=235197

bodo.bauer@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|bodo.bauer@novell.com       |

------- Comment #18 from bodo.bauer@novell.com 2007-01-25 08:39 MST -------
I'm leaving Novell. If TPM assistance is needed, please ask Joachim Plack (AMD related issues) or Oliver Ries (general x86_64/i386) for assistance.
https://bugzilla.novell.com/show_bug.cgi?id=235197

behlert@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|                            |241095
              nThis|                            |
https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #19 from bk@novell.com 2007-02-28 10:02 MST -------
A version of the psmouse shutdown fix which Thomas posted here in Comment #2 is now included in our latest Kernel for 10.3, but instead of serio_driver_remove, it calls serio_driver_remove, which calls serio_shutdown, which calls psmouse_cleanup, which does this, AFAICS:

/*
 * Some boxes, such as HP nx7400, get terribly confused if mouse
 * is not fully enabled before suspending/shutting down.
 */
ps2_command(&psmouse->ps2dev, NULL, PSMOUSE_CMD_ENABLE);

If this is right for the nx6325 as well (which can be assumed), this will work, but a final test on the same machine would be good to verify it.

This is the 10.3/KOTD changelog entry after which psmouse should not break things on suspend:

* Wed Feb 21 2007 - trenn@suse.de
- patches.fixes/psmouse-fiddle-with-reset.patch: psmouse - properly reset mouse on shutdown/suspend.
- patches.fixes/serio-cleanup-to-bus_2.patch: i8042 - let serio bus suspend ports. Fix suspend to ram (246948).

One needs to do a number of repeated suspend/resume cycles in X with apps running in order to verify that the issue is fixed/not fixed.
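Repeated cycles like these can be scripted. A minimal sketch, assuming the suspend command blocks until the machine has resumed and returns non-zero on failure (neither is confirmed in this report); run_cycles is a hypothetical helper, and "powersave -U" is the trigger named in comment #0:

```shell
# run_cycles COUNT CMD [ARGS...]
# Runs CMD up to COUNT times and stops at the first failing run.
# Hypothetical helper; CMD is whatever triggers one suspend/resume cycle.
run_cycles() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        if ! "$@"; then
            echo "cycle $i failed"
            return 1
        fi
        echo "cycle $i ok"
        i=$((i + 1))
    done
}
```

It might be invoked as run_cycles 10 powersave -U. A hard hang stops the loop without any message, so the last "cycle N ok" line shows how far the run got.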
https://bugzilla.novell.com/show_bug.cgi?id=235197#c20

--- Comment #20 from Thomas Scholz <tscholz@suse.de> 2007-09-18 08:10:20 MST ---
Created an attachment (id=173083)
 --> (https://bugzilla.novell.com/attachment.cgi?id=173083)
photo of the kernel panic
https://bugzilla.novell.com/show_bug.cgi?id=235197

User trenn@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=235197#c22

Thomas Renninger <trenn@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #22 from Thomas Renninger <trenn@novell.com> 2008-08-29 03:37:26 MDT ---
10.2..., this one is so old.
Rafael, the suspend maintainer, AFAIK has such a machine. So I very much expect this bug is fixed in later distributions already.