[Bug 819677] New: nbd sometimes disconnects and announces an I/O error
https://bugzilla.novell.com/show_bug.cgi?id=819677

https://bugzilla.novell.com/show_bug.cgi?id=819677#c0

Summary: nbd sometimes disconnects and announces an I/O error
Classification: openSUSE
Product: openSUSE 12.3
Version: Final
Platform: x86-64
OS/Version: openSUSE 12.3
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: Yarny@public-files.de
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created an attachment (id=538994)
--> (http://bugzilla.novell.com/attachment.cgi?id=538994)
Output of dmesg after the bug occurred

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0

nbd devices sometimes disconnect during operation, i.e., the application using the device reports an I/O error and dmesg is full of errors. Further attempts to access the device also fail.

Reproducible: Sometimes

Steps to Reproduce:
Connect two machines via IP.

On machine #1:
$ dd if=/dev/null of=/tmp/empty_big_file bs=1G seek=1
$ nbd-server 61000 /tmp/empty_big_file -r

On machine #2:
$ modprobe nbd
$ nbd-client $IP_OF_MACHINE_1 61000 /dev/nbd0
$ sha1sum /dev/nbd0

Actual Results:
After a while sha1sum stops with:

sha1sum: /dev/nbd0: Input/output error

Also, dmesg has lots of related errors (see attachment).

The bug does not always occur. After some experimentation, I suspect that it depends on the size of the block device. Even with an endless loop of sha1sum runs, I couldn't reproduce the error with a 128MiB nbd; maybe the reason for this is a cache on the client which prevents repeated accesses to the server for small devices.

The machines used in my tests were VirtualBox VMs with 768MiB RAM (hosted on an openSUSE 12.2 machine).

The bug doesn't appear with openSUSE 12.2 (on the client side).

I stumbled over this bug while testing PXE booting with kiwi. Here, the bug appears very reliably (unfortunately).

During my experiments, I had the impression that the occurrence of the bug depends on the speed of the server machine: if the server is busy with something else and responds very slowly to the client's requests, every now and then the PXE image boots successfully.

During my research, I found the following bug reports which appear to be related to this issue:
<URL:https://bugzilla.redhat.com/948718>
<URL:https://lkml.org/lkml/2012/8/9/508>

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=819677

https://bugzilla.novell.com/show_bug.cgi?id=819677#c1

--- Comment #1 from Yarny Yarny <Yarny@public-files.de> 2013-05-17 22:40:30 UTC ---
> The bug doesn't appear with openSUSE 12.2 (on the client side).

I stand corrected. The bug also occurs on an up-to-date openSUSE 12.2 installation. However, I still fail to trigger it with an installation from the DVD (without updates).

I played a bit with patches from the oss-update repo (12.2), and it turned out that the bug occurs with openSUSE-2012-700 installed. This patch is a kernel update which, among other things, contains this non-security fix:

- nbd: clear waiting_queue on shutdown (bnc#778630).

Therefore I suspect that this fix causes the erroneous behaviour.
https://bugzilla.novell.com/show_bug.cgi?id=819677

https://bugzilla.novell.com/show_bug.cgi?id=819677#c2

Paul Clements <paul.clements@steeleye.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |paul.clements@steeleye.com

--- Comment #2 from Paul Clements <paul.clements@steeleye.com> 2013-05-21 14:41:05 UTC ---
(In reply to comment #1)
> I played a bit with patches from oss-update repo (12.2) and it turned out that the bug occurs with openSUSE-2012-700 installed. This patch is a kernel update which, besides other things, contains this non-security fix:
> - nbd: clear waiting_queue on shutdown (bnc#778630).
> Therefore I suspect that this fix causes the erroneous behaviour.

I don't think that patch is causing any bad behavior. However, I think I do know what is causing this. The sudden disconnect problem has been discussed on nbd-general:

http://www.mail-archive.com/nbd-general@lists.sourceforge.net/msg01328.html

There is a kernel patch attached here:

http://www.mail-archive.com/nbd-general@lists.sourceforge.net/msg01336.html

If you could try that and confirm that you're seeing the same, SIGCHLD interrupting wait_event_interruptible?

If that is the problem, the ultimate fix is to mask SIGCHLD in nbd-client. Latest nbd-client from github does this. That fix is described here:

http://www.mail-archive.com/nbd-general@lists.sourceforge.net/msg01374.html
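[Editorial illustration, not part of the original comment.] The idea of masking SIGCHLD in a client process can be sketched in C as below. This is a minimal sketch and not the actual nbd-client change, which may differ in detail; the function names block_sigchld and sigchld_is_blocked are hypothetical. The only library calls used are the standard POSIX sigemptyset, sigaddset, sigismember and sigprocmask.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: add SIGCHLD to the calling thread's blocked set.
 * While blocked, the signal stays pending instead of being delivered, so
 * it should not be able to interrupt an interruptible wait in the kernel.
 * Returns 0 on success, -1 on error. */
int block_sigchld(void)
{
    sigset_t set;

    if (sigemptyset(&set) != 0 || sigaddset(&set, SIGCHLD) != 0)
        return -1;
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Hypothetical helper: query the current mask without changing it.
 * Returns 1 if SIGCHLD is blocked, 0 if it is not, -1 on error. */
int sigchld_is_blocked(void)
{
    sigset_t current;

    /* With a NULL new set, sigprocmask only reports the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;
    return sigismember(&current, SIGCHLD);
}
```

In a client, a call like block_sigchld() would presumably have to happen before the process enters the long-running ioctl that services the device, since the mask only protects waits that start after it is set.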
https://bugzilla.novell.com/show_bug.cgi?id=819677

https://bugzilla.novell.com/show_bug.cgi?id=819677#c3

Yarny Yarny <Yarny@public-files.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|Kernel                      |Basesystem
         AssignedTo|kernel-maintainers@forge.pr |bnc-team-screening@forge.pr
                   |ovo.novell.com              |ovo.novell.com

--- Comment #3 from Yarny Yarny <Yarny@public-files.de> 2013-05-26 19:24:26 UTC ---
Hi Paul, thanks for your comments.

> If you could try that and confirm that you're seeing the same, SIGCHLD interrupting wait_event_interruptible?

I'm seeing this:

got ERESTARTSYS: sig:17 code:262145 sender:646

According to "ps axjf", 646 is "[nbd-client] <defunct>", which is a child of the nbd-client that established the connection.

Your patch also fixes (or works around) the problem, i.e., the nbd device stays alive after this message occurs.

> If that is the problem, the ultimate fix is to mask SIGCHLD in nbd-client.

Umm ok, then I change the "Component" of this bug from "kernel" to "basesystem".
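[Editorial illustration, not part of the original comment.] The numbers in the "got ERESTARTSYS" message can be read as follows: on x86 Linux, signal 17 is SIGCHLD, and 262145 is 0x40001. The sketch below assumes that the debug patch prints the kernel-internal si_code, in which kernels of that era kept a class tag in the upper 16 bits (4 for child-status signals) and the user-visible code in the lower 16 bits. That layout is an assumption about the patch output, and the function names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stdio.h>

/* Assumed layout (see lead-in): upper 16 bits = class tag,
 * lower 16 bits = the code user space would see in si_code. */
int si_code_class(int code) { return (code >> 16) & 0xffff; }
int si_code_low(int code)   { return code & 0xffff; }

/* Returns 1 if the (signal, code) pair reads as "a child process exited"
 * under the assumed layout, 0 otherwise. */
int looks_like_child_exit(int sig, int code)
{
    return sig == SIGCHLD
        && si_code_class(code) == 4          /* assumed child-status class */
        && si_code_low(code) == CLD_EXITED;  /* child terminated normally */
}

/* Print the decoded fields for a logged pair such as sig:17 code:262145. */
void print_decoded(int sig, int code)
{
    printf("sig=%d class=%d low=%d child_exit=%d\n",
           sig, si_code_class(code), si_code_low(code),
           looks_like_child_exit(sig, code));
}
```

If that reading is right, the logged values describe a SIGCHLD sent because a child exited, which agrees with sender 646 showing up as a defunct nbd-client child in the process list.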
https://bugzilla.novell.com/show_bug.cgi?id=819677

https://bugzilla.novell.com/show_bug.cgi?id=819677#c

FeiXiang Zhang <fxzhang@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|bnc-team-screening@forge.pr |ms@suse.com
                   |ovo.novell.com              |
https://bugzilla.novell.com/show_bug.cgi?id=819677

https://bugzilla.novell.com/show_bug.cgi?id=819677#c4

--- Comment #4 from Yarny Yarny <Yarny@public-files.de> 2013-09-08 22:02:50 UTC ---
I'm happy to report that I cannot reproduce this bug with the current Factory version (nbd-client 3.3).