[Bug 934202] New: NFS: nfs4_handle_delegation_recall_error: unhandled error -13.
http://bugzilla.suse.com/show_bug.cgi?id=934202 Bug ID: 934202 Summary: NFS: nfs4_handle_delegation_recall_error: unhandled error -13. Classification: openSUSE Product: openSUSE Distribution Version: 13.2 Hardware: Other OS: Other Status: NEW Severity: Normal Priority: P5 - None Component: Other Assignee: nfbrown@suse.com Reporter: tchatzimichos@suse.com QA Contact: qa-bugs@suse.de CC: mcaj@suse.com, mpluskal@suse.com, ncutler@suse.com Found By: --- Blocker: --- This bug was reported first in the internal SUSE ticketing system. I am copy-pasting all the current comments from that ticket to this one: **** Nathan Cutler ****: Ever since the migration from login.suse.cz to nfs.suse.cz that took place over the weekend 2015-05-30 and 2015-05-31, my workstation's NFS home (pantograf.suse.cz) in unstable. The problem manifests approx. once every 3-4 hours of usage. Randomly, for no apparent reason. First, I/O to /home partition starts to hang. Load average starts to rise and a mystery process "10.100.2.89-man" appears in "top" listing. This process is unkillable even with SIGKILL. Running programs continue to work as long as they do not attempt I/O to the home partition, then they freeze up. The only fix I know of is to reboot the machine. Referring to the log, the problem appears to be apparmor-related: 2015-06-03T10:52:33.390867+02:00 pantograf kernel: [ 1066.488685] audit: type=1400 audit(1433321553.389:49): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5183 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.390885+02:00 pantograf kernel: [ 1066.488720] nfs: RPC call returned error 13 2015-06-03T10:52:33.391849+02:00 pantograf kernel: [ 1066.489192] audit: type=1400 audit(1433321553.390:50): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5183 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.391856+02:00 pantograf kernel: [ 1066.489204] nfs: RPC call returned error 13 2015-06-03T10:52:33.414853+02:00 pantograf kernel: [ 1066.512310] audit: type=1400 audit(1433321553.413:51): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5178 comm="lessopen.sh" laddr=10.100.12.15 lport=830 fad dr=10.100.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.414863+02:00 pantograf kernel: [ 1066.512326] nfs: RPC call returned error 13 2015-06-03T10:52:33.415875+02:00 pantograf kernel: [ 1066.513919] audit: type=1400 audit(1433321553.414:52): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="lessopen.sh" laddr=10.100.12.15 lport=830 fad dr=10.100.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.415888+02:00 pantograf kernel: [ 1066.513935] nfs: RPC call returned error 13 2015-06-03T10:52:33.416857+02:00 pantograf kernel: [ 1066.514853] audit: type=1400 audit(1433321553.415:53): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.416869+02:00 pantograf kernel: [ 1066.514868] nfs: RPC call returned error 13 2015-06-03T10:52:33.417857+02:00 pantograf kernel: [ 1066.515346] audit: type=1400 audit(1433321553.416:54): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.417867+02:00 pantograf kernel: [ 1066.515358] nfs: RPC call returned error 13 Please advise. Update: I disabled AppArmor so those messages are gone But the I/O lockups continue. After examining the log closer, I believe this is the relevant message: 2015-06-03T16:52:33.508329+02:00 pantograf kernel: [13191.933894] NFS: nfs4_handle_delegation_recall_error: unhandled error -13. **** Neil Brown ****: I'm pretty sure that I've seen exactly this bug already. I just need to remember which one... I don't think this is related to kerberos at all. It is purely an NFSv4 state management issue. It is a bit like 831029, but maybe not exactly the same. I'll have a more thorough look after the weekend. Exactly which kernel do you have on your client? NeilBrown **** Nathan Cutler ****: Linux pantograf 3.16.7-21-desktop #1 SMP PREEMPT Tue Apr 14 07:11:37 UTC 2015 (93c1539) x86_64 x86_64 x86_64 GNU/Linux **** Theo Chatzimichos ****: Just for the record, Martin Caj's PC runs KDE with the same kernel version, and it has mounted the NFS /home since Friday. I just checked its dmesg and didn't see such error messages reported. The machine is now on low load though since Martin is away **** Nathan Cutler ****: It only happens if you actually use the desktop. I can leave KDE running all night, all day and as long as I do not use the machine, or use it only via ssh from home, nothing happens. The I/O freezes only happen when I am sitting physically at the machine, clicking in Thunderbird, Pidgin, Firefox, etc. When Martin gets back I will work with him to reproduce it. Nathan **** Martin Pluskal ****: Hi I have also encountered those issues (on current TW with kernel 4.1.0-rc6-2.g8f2a4d1-desktop) I have captured dmesg after issuing "rpcdebug -m nfs -s all" - see http://users.suse.cz/~mpluskal/bugs/nfs_issues/dmesg.txt.xz - perhaps it contains something useful. Kind regards Martin **** Neil Brown ****: I've had a bit more of a look at this and I think it is a new problem. There have been quite a few different bugs which involve the nn.nn.nn.nn-man thread spinning, but this doesn't seem to fit the pattern of any of them. So maybe it would be best to open a bug on bugzilla. The "unhandled error -13." is an NFS4ERR_ACCES error. When the client tried to open some file, the Netapp gave it a delegation. It presumably then opened the file again without telling the server, on the strength of that delegation. At some point the delegation needed to be returned so the client requested explicit opens from the server. One of those failed with "EACCESS" meaning the user on the client didn't seem to have permission. That is very odd. The state management thread then goes into a spin. This suggests that it was the state management thread which decided that the delegation needed to be returned. Maybe it got some sort of state error. I cannot easily find a path through the code that would make it spin, but it is fairly complex code. One thing that might help is to capture the content of /proc/PID/stack for the threads that is using all the CPU time. Capture that a dozen times or so in quick succession. Or maybe capture hundreds and discard the duplicates. The other thing that could be useful is a tcpdunp trace of NFS (port 2049) traffic between client and server when the problem occurs. So if you can reliably trigger it, that should be easy to get. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 --- Comment #1 from Martin Pluskal <mpluskal@suse.com> --- I have captured "tcpdump port 2049 -w nfs_dump.log" - http://users.suse.cz/~mpluskal/bugs/934202/nfs_dump.log.xz when issue occured: # dmesg ... [33386.454585] Netfilter messages via NETLINK v0.30. [48056.618081] NFS: v4 server nfs.suse.cz returned a bad sequence-id error! [48056.643065] NFS: nfs4_handle_delegation_recall_error: unhandled error -10026. [48096.219741] NFS: v4 server nfs.suse.cz returned a bad sequence-id error! -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 --- Comment #2 from Neil Brown <nfbrown@suse.com> --- Thanks for the trace. Can you tell me how it corresponds to the symptom? Did you start the trace before there was a problem or only after? The trace seems to run for about 4 hours (14585 seconds). from 07:27:30 to 11:30:35, though I'm not sure what timezone. When in that time did the problem occur? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 --- Comment #3 from Martin Pluskal <mpluskal@suse.com> --- (In reply to Neil F Brown from comment #2)
Thanks for the trace. Can you tell me how it corresponds to the symptom? Sure, error from dmesg occured before end of thare Did you start the trace before there was a problem or only after? The trace seems to run for about 4 hours (14585 seconds). from 07:27:30 to 11:30:35, though I'm not sure what timezone. When in that time did the problem occur?
Sure, error from dmesg occurred before end of trace (few minutes), trace was obviously captured before and during time issue occurred, do you assume that capturing trace after issue would be somehow useful? Timezone is CEST. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 Neil Brown <nfbrown@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CONFIRMED --- Comment #4 from Neil Brown <nfbrown@suse.com> --- No, capturing much after the error (more than a few minutes) wouldn't be likely to help. I just need to be sure that I know what I'm looking at - I don't like to assume. I didn't realise at first, but this seems to be very different symptoms than the original problem description. The error message is certainly different and I strong suspect the NN.NN.NN.NN-man process didn't go into a spin - is that correct? There are a few odd things in the trace, but I think the first odd thing comes from the client. In frame 67293 the client tries to close a file which it opened in frame 28511- the file is called "places.sqlite". Then immediately after a successful response from the server (well... 600usec later), the client sends another close request for the same file. The server accepts this request (which is a little odd, but I'm not certain it is wrong) but the next sequenced request (OPEN CLOSE LOCK LOCKU are all 'sequenced' requests and have sequence numbers) gets an error: BAD_SEQID. This is the CLOSE in frame 67385. This close is correctly sequenced so the server shouldn't complain, but presumably it got confused by the earlier double close. I cannot see how the Linux client could be sending a double close like that. I guess I need more data.... Could you (or anyone who can reproduce this) please collect both the tcpdump trace and nfs debugging with "rpcdebug -m nfs -s all". Then when a problem occurs, provide: - the tcpdump log - the kernel logs - any notes on what you noticed, including whether any processes were spinning. - if processes were spinning, a few copies of /proc/PID/stack for those processes. - exact kernel version. Hopefully I will be able to pull all that together. Thanks. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c5 --- Comment #5 from Martin Pluskal <mpluskal@suse.com> --- (In reply to Neil F Brown from comment #4)
Could you (or anyone who can reproduce this) please collect both the tcpdump trace and nfs debugging with "rpcdebug -m nfs -s all". Then when a problem occurs, provide: - the tcpdump log # sudo tcpdump port 2049 -w nfs_dump_new.log http://users.suse.cz/~mpluskal/bugs/934202/nfs_dump_new.log.xz - the kernel logs # journalctl --dmesg > dmesg_new.txt http://users.suse.cz/~mpluskal/bugs/934202/dmesg_new.txt.xz - any notes on what you noticed, including whether any processes were spinning. Issue occured sometime during weekend so I just noticed error in dmesg, there should be minimal activity on nfs mount. - if processes were spinning, a few copies of /proc/PID/stack for those processes. - exact kernel version. # uname -r 4.0.4-3-desktop
Hope it helps -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c6 --- Comment #6 from Neil Brown <nfbrown@suse.com> --- Yes, it does help - thanks. In frame 15164 the client opens a file called "agent-medusa" for read. The server signal success and returns a "read delegation" as well (15165). The client then reads from the file and closes it, still holding the delegation. The client then checks if "agent-medusa-ssh" exists, which it doesn't. So the client wants to hard-link the agent-medusa file to the agent-medusa-ssh name. It knows that doing that will recall the delegation so it pro-actively starts returning the delegation. This involves looking through it's list of open files to see if one was opened, without telling the server, on the strength of the delegation. It seems that it finds one, because (frame 15172) it opens the file again using the delegation, but opens it write-only!! It shouldn't have a write-only file from a read-only delegation. This causes the error, as it should. The LINK then happens, the server recalls the delegation as expected, and the LINK completes. So now I have to find out why the client thinks it has a write-only open on that file.... -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c7 --- Comment #7 from Neil Brown <nfbrown@suse.com> --- Unfortunately I cannot find any clue as to why NFS is trying to reclaim a write-only open. If it had one, the sequence of requests would have been quite different. Do you have any idea what program could be creating the ".ssh/agent-medusa-ssh" file? Presumably some KDE password agent or some such.... More traces might help but I don't know. It is very much a mystery. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c9 Neil Brown <nfbrown@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Flags| |needinfo?(mpluskal@suse.com | |) --- Comment #9 from Neil Brown <nfbrown@suse.com> --- Thanks, there are extra hints in there - particularly the fact that it tries to open with O_WRONLY|_O_CREAT|O_EXCL first. I suspect that is where the strange O_WRONLY open comes from - it shouldn't still be around, but somehow it is. Can you try removing the agent-medusa file, enabling NFS debugging (rpcdebug -m nfs -s all), start a tcpdump capture and an strace capture, and then log in (or whatever triggered that last trace). I want to see exactly how the Netapp responds to the O_CREAT|O_EXCL sequence and whatever follows when it succeeds. I'm guessing that something happens in that case which confuses the client, and then subsequent attempts are handled badly because of that confusion. Thanks. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c12 Neil Brown <nfbrown@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|CONFIRMED |IN_PROGRESS --- Comment #12 from Neil Brown <nfbrown@suse.com> --- Thanks. I can see what is happening now. http://permalink.gmane.org/gmane.linux.nfs/71566 This doesn't necessarily explain the "returned a bad sequence-id error!" error, or the manager thread spinning, but it is a step in the right direction. If I can put together a patch that seems credible I'll ask you to test. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c13 --- Comment #13 from Neil Brown <nfbrown@suse.com> --- Created attachment 638635 --> http://bugzilla.suse.com/attachment.cgi?id=638635&action=edit Patch that should be worth testing. I think this patch is worth testing. I'm not sure how best to build a kernel RPM (I hardly ever do that), but I'll try to sort something out tomorrow unless someone jumps in and does that for me :-) -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c14 Neil Brown <nfbrown@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #638635|0 |1 is obsolete| | --- Comment #14 from Neil Brown <nfbrown@suse.com> --- Created attachment 638854 --> http://bugzilla.suse.com/attachment.cgi?id=638854&action=edit New patch, better patch. After discussing with upstream, I've discard that patch and have a new one. I've tested this by hacking the Linux NFS server to give out the same sort of delegations as the Netapp. I can reproduce the error in the title, and this patch removes it. I have committed to 'master' and 'opensuse-13.2', so it should appear in kotd and future updates. If you could test a latest kotd in a day of so that would be great. Once that is confirmed I would like this patch to go into SLES12. This *doesn't* address the other error that has been reported:
NFS: nfs4_handle_delegation_recall_error: unhandled error -10026.
and I don't think it addresses the state manager process spinning either. I need more details for those. If they can be reproduced, the logs (like the excellent ones which led to this patch) would help a lot. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c15 --- Comment #15 from Neil Brown <nfbrown@suse.com> --- If you haven't had a chance to find a kernel with the fix yet: http://milos.suse.de/kerneltest/suse/3fdbd1e/ has an x86_64 3.16 based kernel, and http://milos.suse.de/kerneltest/suse/bab8328/ has a 4.1 based kernel, both with that fix. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c16 --- Comment #16 from Martin Pluskal <mpluskal@suse.com> --- Warning do not appear in dmesg with recent kernels (4.1* containing http://kernel.opensuse.org/cgit/kernel/commit/?id=e30b876cd96fb14ce4a94d6fbf... ). As for "state manager process spinning" I am not sure if I am able to help as I am not using nfs home (instead having it mounted as /mnt/home) and using it mainly as backup/easy way to publish stuff on users.suse.cz - and I have not encountered issues which mcaj described - I mean I would like to help but some reliable way to trigger issue would help. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c17 --- Comment #17 from Nathan Cutler <ncutler@suse.com> ---
As for "state manager process spinning" I am not sure if I am able to help as I am not using nfs home (instead having it mounted as /mnt/home) and using it mainly as backup/easy way to publish stuff on users.suse.cz - and I have not encountered issues which mcaj described - I mean I would like to help but some reliable way to trigger issue would help.
I am the one who encountered the spinning state manager process and associated lockups while on NFS home. Since this was on my primary workstation and the lockups were troublesome (to say the least), I had to migrate to local home to be able to work. The bug was easily reproducible (in fact, unavoidable) on my setup, which was: * openSUSE 13.2 * NFS home against nfs.suse.cz (Netapp) * KDE4 * Thunderbird, Pidgin, Firefox I never did narrow down the trigger, but I *suspect* it was clicking on URLs in Thunderbird and Pidgin, so they open in Firefox. At some point (1-4 hours of normal usage after reboot) I would click on a link only to find that the system had become unresponsive because writes to /home were not completing. When this happened, a process called 10.100.2.89-man was visible in 'top' and the load average would rise due to the "spinning". The following message would appear in /var/log/messages:
NFS: nfs4_handle_delegation_recall_error: unhandled error -13.
(note, not -10026) Martin Caj also has openSUSE 13.2, NFS home, and KDE4, but uses a different MUA and a different IRC client, and AFAIK he could not reproduce -- hence my suspicion that the issue is related to Thunderbird+Pidging+Firefox interaction. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c18 --- Comment #18 from Nathan Cutler <ncutler@suse.com> --- Hi Neil: How should I go about installing the patched kernel? Here is what I have tried so far: http://paste2.org/JZwwhB0s I use VirtualBox, and it is upset that it can't find an appropriate version of the virtualbox-host-kmp-default package to go with the kernel. Nathan -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c19 --- Comment #19 from Neil Brown <nfbrown@suse.com> --- (In reply to Nathan Cutler from comment #18)
I use VirtualBox, and it is upset that it can't find an appropriate version of the virtualbox-host-kmp-default package to go with the kernel.
That's unfortunate. I don't know how to build kmps. The only think I can think of is to wait for the new kernel to get into factory - howfully that will automatically include the right kmps. Maybe you could find matching kernel in: http://download.opensuse.org/repositories/Kernel:/openSUSE-13.2/standard/x86... and virtualbox kmp in: http://download.opensuse.org/repositories/Virtualization/openSUSE_13.2/x86_6... -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202 http://bugzilla.suse.com/show_bug.cgi?id=934202#c27 Neil Brown <nfbrown@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|IN_PROGRESS |RESOLVED Resolution|--- |FIXED --- Comment #27 from Neil Brown <nfbrown@suse.com> --- BTW it turns out that this patch has seriously broken. It is fixed by 24311f884189d42d40354a6f38ca218eb9aeb811 upstream. I've added this patch to the 13.2 kernel. see also bug#942178 -- You are receiving this mail because: You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@novell.com