[Bug 612794] New: process hanging in D state
http://bugzilla.novell.com/show_bug.cgi?id=612794

http://bugzilla.novell.com/show_bug.cgi?id=612794#c0

           Summary: process hanging in D state
    Classification: openSUSE
           Product: openSUSE 11.3
           Version: Factory
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: meissner@novell.com
         QAContact: qa@suse.de
          Found By: Development
           Blocker: ---

On my ppc64 machine with autobuild running I get "lsof" from seccheck entering "D" state.

smells a bit NFS related. or uname related.

I will attach sysrq-t output

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c1

--- Comment #1 from Marcus Meissner <meissner@novell.com> 2010-06-09 09:45:24 UTC ---
Created an attachment (id=368051)
 --> (http://bugzilla.novell.com/attachment.cgi?id=368051)
sysrq-t.log

sysrq-t output
http://bugzilla.novell.com/show_bug.cgi?id=612794#c2

--- Comment #2 from Marcus Meissner <meissner@novell.com> 2010-06-09 09:52:11 UTC ---
lsof currently has process 24325 open (autobuild)

ls -la /proc/27555/fd
insgesamt 0
dr-x------ 2 root root  0  9. Jun 11:45 .
dr-xr-xr-x 7 root root  0  9. Jun 11:37 ..
lr-x------ 1 root root 64  9. Jun 11:45 0 -> /dev/null
l-wx------ 1 root root 64  9. Jun 11:45 1 -> pipe:[5639109]
l-wx------ 1 root root 64  9. Jun 11:45 2 -> pipe:[5638514]
lr-x------ 1 root root 64  9. Jun 11:45 3 -> /proc
lr-x------ 1 root root 64  9. Jun 11:45 4 -> /proc/24325/fd

I can't really see what the autobuild script has open :(
http://bugzilla.novell.com/show_bug.cgi?id=612794#c3

Marcus Meissner <meissner@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P2 - High
                 CC|                            |ro@novell.com
           Severity|Normal                      |Major

--- Comment #3 from Marcus Meissner <meissner@novell.com> 2010-06-16 14:50:37 UTC ---
also seen by Rudi
http://bugzilla.novell.com/show_bug.cgi?id=612794#c4

--- Comment #4 from Marcus Meissner <meissner@novell.com> 2010-06-17 14:30:22 UTC ---
Created an attachment (id=369785)
 --> (http://bugzilla.novell.com/attachment.cgi?id=369785)
syrq-t.log

hung yet again some hours after reboot.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c5

Marcus Meissner <meissner@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2 - High                   |P1 - Urgent
                 CC|                            |dmueller@novell.com
           Severity|Major                       |Critical

--- Comment #5 from Marcus Meissner <meissner@novell.com> 2010-06-22 10:59:26 UTC ---
also seen by Dirk I think.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c6

--- Comment #6 from Dirk Mueller <dmueller@novell.com> 2010-06-22 23:32:10 CEST ---
For me "sync" also gets stuck (can't really see where, strace does not do anything when attaching to the process, same for gdb, it just hangs), which means that "reboot" without "-n" also hangs.

It frequently happens over the weekend when autobuild was performing a lot of build jobs (Friday evening rebuild).

How can I debug this? It is really getting annoying to reset the machine every Monday.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c10

--- Comment #10 from Jeff Mahoney <jeffm@novell.com> 2010-06-22 23:08:50 UTC ---
(In reply to comment #6)
> for me also "sync" gets stuck (can't really see where, strace does not do
> anything when attaching to the process, same for gdb, it just hangs).

Yep. If a process is in D state, it can't be attached to via ptrace, which is what both strace and gdb use.
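Since ptrace-based tools are of no use here, one alternative is to read the task state straight from /proc. The following is only a minimal sketch, assuming a Linux /proc layout in which the third field of /proc/<pid>/stat is the state letter; the function names `parse_stat` and `uninterruptible_tasks` are hypothetical and not taken from any tool mentioned in this bug.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen].strip())
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

This only reports which tasks are stuck, not where in the kernel they are waiting; for that, the sysrq-t traces attached to this bug remain the more informative source.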
http://bugzilla.novell.com/show_bug.cgi?id=612794#c11

Neil Brown <nfbrown@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
       InfoProvider|                            |meissner@novell.com

--- Comment #11 from Neil Brown <nfbrown@novell.com> 2010-06-22 23:28:24 UTC ---
autobuild wants to write to a file, is first flushing writes that someone else made, and is waiting in writeback_single_inode for some other thread to finish syncing the inode.

uname is closing a file while exiting and so (as this is NFS) is syncing it out and is blocked in nfs_writepages (from writeback_single_inode - so uname is what autobuild is waiting for) trying to get a lock on a page.

flush-0:23 wants to flush data and is also waiting for uname to finish nfs_writepages.

lsof is performing a 'stat', which needs to flush writes so that it can be sure the 'size' is correct, so it is in nfs_writepages waiting for uname to finish so it can have a turn.

So the central problem is uname trying to lock a page. My guess is that the problem file is a log file - connected to stdout on uname.

The second trace shows a similar pattern, though 'w' is the central blocker.

I note that 'grape' is running 2.6.34-rc6-7-ppc64. Is that right? An -rc for 11.3? Maybe it has changed since 2 weeks ago when the problem was reported.

It looks a bit like the bug fixed by commit a6305ddb080fb483ca41ca56cacb6f96089f0c8e, which is in -rc6. Do we know exactly what kernel was running on grape at the time?
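The reasoning in comment #11 amounts to following "waits for" edges until reaching a task that is blocked on a resource rather than on another task. The sketch below is purely illustrative: the edges in `WAITS_ON` are transcribed by hand from the comment, not parsed from the sysrq-t log, and the helper names `central_blocker` and `blocker_counts` are hypothetical.

```python
from collections import Counter

# Wait-for edges as described in comment #11: task -> what it is waiting on.
WAITS_ON = {
    "autobuild": "uname",
    "flush-0:23": "uname",
    "lsof": "uname",
    "uname": "page lock",  # a resource, not another task
}


def central_blocker(task, waits_on):
    """Follow the chain from `task` to the task that waits on a non-task."""
    seen = set()
    while waits_on.get(task) in waits_on:
        if task in seen:
            raise ValueError("cycle in wait-for graph at %r" % task)
        seen.add(task)
        task = waits_on[task]
    return task


def blocker_counts(waits_on):
    """Count how many tasks ultimately end up behind each central blocker."""
    return Counter(central_blocker(task, waits_on) for task in waits_on)


if __name__ == "__main__":
    for blocker, count in blocker_counts(WAITS_ON).items():
        print(blocker, "is at the end of the chain for", count, "task(s)")
```

With the edges above, every chain ends at uname, which matches the conclusion that uname trying to lock a page is the central problem in the first trace.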
http://bugzilla.novell.com/show_bug.cgi?id=612794#c12

--- Comment #12 from Marcus Meissner <meissner@novell.com> 2010-06-24 15:17:50 UTC ---
I have just booted 2.6.34-9.

flushd was in "D" already.

I then typed "sync", which entered "D" state, and large parts of the usual suspects immediately followed.

echo "t" > sysrq-trigger output follows.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c13

Marcus Meissner <meissner@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
       InfoProvider|meissner@novell.com         |

--- Comment #13 from Marcus Meissner <meissner@novell.com> 2010-06-24 15:19:50 UTC ---
Created an attachment (id=371535)
 --> (http://bugzilla.novell.com/attachment.cgi?id=371535)
2.6.34-9-diskwait.log

2.6.34-9 diskwait log.

rpm -q --changelog kernel-ppc64-2.6.34-9.10.ppc |less
* Mit Jun 02 2010 bphilips@suse.de
- patches.drivers/e1000e-entropy-source.patch: Reintroduce IRQF_SHARED to fix non-MSI case (bnc#610362).
http://bugzilla.novell.com/show_bug.cgi?id=612794#c14

Neil Brown <nfbrown@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #14 from Neil Brown <nfbrown@novell.com> 2010-06-24 21:52:06 UTC ---
Thanks. It looks like the same problem - lots of processes waiting on writeback or occasionally the page lock.

Maybe the commit I mentioned above didn't quite fix the problem. I'll go exploring.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c15

Neil Brown <nfbrown@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
       InfoProvider|                            |meissner@novell.com

--- Comment #15 from Neil Brown <nfbrown@novell.com> 2010-06-29 00:39:33 UTC ---
Problem appears to be fixed upstream by commit 0522f6adedd2736cbca3c0e16ca51df668993eee

The description is an exact fit of the symptom.

I have added this patch to git for openSUSE-11.3.

Please test and confirm.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c16

--- Comment #16 from Marcus Meissner <meissner@novell.com> 2010-07-01 12:10:21 UTC ---
Current openSUSE-11.3 branch kernel built and installed on my machine, rebooted this morning.

After 4 hours: no diskwait processes yet... will keep you updated.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c17

--- Comment #17 from Dirk Mueller <dmueller@novell.com> 2010-07-01 18:44:21 CEST ---
also installed kernel with this fix now on x86_64 (after I had another lockup today)
http://bugzilla.novell.com/show_bug.cgi?id=612794#c18

Marcus Meissner <meissner@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
       InfoProvider|meissner@novell.com         |

--- Comment #18 from Marcus Meissner <meissner@novell.com> 2010-07-05 12:30:45 UTC ---
my machine has run over the weekend, full autobuilding etc. no hangs anymore.

-> fixed I would say
http://bugzilla.novell.com/show_bug.cgi?id=612794#c19

Neil Brown <nfbrown@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #19 from Neil Brown <nfbrown@novell.com> 2010-07-06 00:46:08 UTC ---
Thanks. Patch is in git for 11.3 and is already upstream, so resolving as FIXED.
http://bugzilla.novell.com/show_bug.cgi?id=612794#c20

--- Comment #20 from Bernhard Wiedemann <bwiedemann@suse.com> ---
This is an autogenerated message for OBS integration:
This bug (612794) was mentioned in
https://build.opensuse.org/request/show/42266 Factory / kernel-source
https://build.opensuse.org/request/show/42378 11.3:Test / kernel-source