[Bug 235197] HP nx6325: hibernate / suspend to disk hangs during suspend

19 Jan 2007

      https://bugzilla.novell.com/show_bug.cgi?id=235197

------- Comment #6 from bk@novell.com  2007-01-19 13:19 MST -------
Thank you very much for the praise! I'd like to return the compliment.

It seems tg3 isn't the real cause, only a rather reliable trigger.
I have now seen suspend occasionally work even with tg3 loaded,
but I cannot reproduce those occasions.

BUT... I can trigger the hang now also without tg3:

I have enabled many dev_dbg() debug printks in drivers/power/resume.c by
defining DEBUG at the top of it, and I can now trigger the hang reliably
without tg3 by setting the console loglevel to 8 so that all of them are
printed on the framebuffer console. Reducing the console loglevel to 7 makes
suspend work again, unless I load tg3 and run "ifconfig eth0 up" so that
tg3_suspend and tg3_resume actually do something.
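
As a rough illustration of why loglevel 8 versus 7 makes the difference: the
kernel prints a message on the console only when its severity number is lower
than console_loglevel, and dev_dbg() messages are logged at KERN_DEBUG (7).
Below is a minimal Python sketch of that rule. The helper names and the sample
string "8 4 1 7" are made up for illustration; the four-field layout is that
of /proc/sys/kernel/printk.

```python
KERN_DEBUG = 7  # severity of dev_dbg() messages when DEBUG is defined

FIELD_NAMES = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Parse the four integers found in /proc/sys/kernel/printk."""
    fields = [int(value) for value in text.split()]
    if len(fields) != len(FIELD_NAMES):
        raise ValueError("expected 4 fields, got %d" % len(fields))
    return dict(zip(FIELD_NAMES, fields))


def reaches_console(message_level, console_loglevel):
    """A message is shown on the console only if its level is
    numerically lower than the console loglevel."""
    return message_level < console_loglevel


if __name__ == "__main__":
    # Hypothetical file content; on a real system read /proc/sys/kernel/printk.
    settings = parse_printk("8 4 1 7")
    level = settings["console_loglevel"]
    print("debug messages on console:", reaches_console(KERN_DEBUG, level))
```

With console_loglevel 8 the debug printks reach the framebuffer console; with 7
they are still logged but no longer printed there.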

Looking at tg3_set_power_state(), which is called twice below tg3_resume
(judging from the logged messages, and I assume needlessly), I saw many
udelay() calls, and even a few msleep() calls and a few
msleep_interruptible() calls somewhere else in tg3.c.

Subjectively, tg3_resume also seemed to take a noticeable time
between the printk on entry and the printk on exit.

I then shortened some of the udelay() calls and removed others,
turned some msleep() calls into udelay() calls, and removed one
of the two calls to tg3_set_power_state() inside tg3_resume. Now
suspend does not hang on the first try after boot, and rcnetwork start
brings up the network and lets me ping a remote host (which even works).
But then the driver/card gets into a bad state: I saw an ifdown
script hanging, and the next suspend attempt hangs as well,
so I apparently did a bit too much.

When suspend hangs in my testing, it always hangs after this message:

ata2: SATA link down (SStatus 0 SControl 310)

At first I didn't take much note of this message because it is also
printed during the working suspend cycles. However, it's clear to me now
that what matters is what follows it:

When suspend works, that message is directly followed by
"ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 210)" and two more
messages until sda is brought back and the data is written to swap.

When suspend does not work, I get with console loglevel 9:

sd 0:0:0:0 resuming (that's one of the dev_dbg printks in resume.c)
PM: writing image.
swsusp:free swap pages: 2879160

After 6 minutes, I get:

sd 0:0:0:0 timing out on command, waited 360s
sd 0:0:0:0 SCSI error: return code = 0x00000028 
end_request: I/O error, dev sda, sector 210756814
Read-error on swap-device (8:0:210756822)
Saving image data pages (117788 pages) ...   4%

and these messages repeat at 6-minute intervals.
I captured some screen-shots of the various hangs with my digicam.
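
The difference between the two cases can be checked mechanically on a captured
console log. The following is only a sketch: the function name and the regular
expressions are my own, and the patterns are derived solely from the messages
quoted above, so other kernels may word them differently.

```python
import re

LINK_DOWN = re.compile(r"ata\d+: SATA link down")
LINK_UP = re.compile(r"ata\d+: SATA link up")


def link_recovered(lines):
    """Return True if a 'SATA link up' message follows the last
    'SATA link down' message, False if none does, and None if the
    log contains no 'SATA link down' message at all."""
    recovered = None
    for line in lines:
        if LINK_DOWN.search(line):
            recovered = False  # start waiting for a link-up message
        elif recovered is False and LINK_UP.search(line):
            recovered = True
    return recovered


if __name__ == "__main__":
    hanging = [
        "ata2: SATA link down (SStatus 0 SControl 310)",
        "sd 0:0:0:0 resuming",
        "PM: writing image.",
        "sd 0:0:0:0 timing out on command, waited 360s",
    ]
    print("link recovered:", link_recovered(hanging))
```

Feeding it the lines of a working cycle should report that the link recovered,
while the hanging cycle quoted above should not.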

Scrolling up one page, I also see:

sata_sil 0000:00:12.0 resuming

So that is called at least; maybe further debugging should also go into that area?

AFAICS, it seems that the disk's SATA link never comes up again on resume
when tg3 is loaded and its eth0 is up, or when I log many debug messages.

On the contrary, it is reported to be down directly before the disk
should come back to life. So maybe we do hit a SATA timeout which
is not recovered?

BTW (I guess it's unrelated):
For these last tests, I tried using a kernel which is compiled with
lockdep checking, and suspending it triggers some printks of:

lockdep: not fixing up alternatives.

It's logged after "CPU 1 is now offline" and after "Enabling non-boot CPUs ..."
