[Bug 934202] New: NFS: nfs4_handle_delegation_recall_error: unhandled error -13.
http://bugzilla.suse.com/show_bug.cgi?id=934202 Bug ID: 934202 Summary: NFS: nfs4_handle_delegation_recall_error: unhandled error -13. Classification: openSUSE Product: openSUSE Distribution Version: 13.2 Hardware: Other OS: Other Status: NEW Severity: Normal Priority: P5 - None Component: Other Assignee: nfbrown@suse.com Reporter: tchatzimichos@suse.com QA Contact: qa-bugs@suse.de CC: mcaj@suse.com, mpluskal@suse.com, ncutler@suse.com Found By: --- Blocker: --- This bug was reported first in the internal SUSE ticketing system. I am copy-pasting all the current comments from that ticket to this one: **** Nathan Cutler ****: Ever since the migration from login.suse.cz to nfs.suse.cz that took place over the weekend 2015-05-30 and 2015-05-31, my workstation's NFS home (pantograf.suse.cz) in unstable. The problem manifests approx. once every 3-4 hours of usage. Randomly, for no apparent reason. First, I/O to /home partition starts to hang. Load average starts to rise and a mystery process "10.100.2.89-man" appears in "top" listing. This process is unkillable even with SIGKILL. Running programs continue to work as long as they do not attempt I/O to the home partition, then they freeze up. The only fix I know of is to reboot the machine. Referring to the log, the problem appears to be apparmor-related: 2015-06-03T10:52:33.390867+02:00 pantograf kernel: [ 1066.488685] audit: type=1400 audit(1433321553.389:49): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5183 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.390885+02:00 pantograf kernel: [ 1066.488720] nfs: RPC call returned error 13 2015-06-03T10:52:33.391849+02:00 pantograf kernel: [ 1066.489192] audit: type=1400 audit(1433321553.390:50): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5183 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.391856+02:00 pantograf kernel: [ 1066.489204] nfs: RPC call returned error 13 2015-06-03T10:52:33.414853+02:00 pantograf kernel: [ 1066.512310] audit: type=1400 audit(1433321553.413:51): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5178 comm="lessopen.sh" laddr=10.100.12.15 lport=830 fad dr=10.100.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.414863+02:00 pantograf kernel: [ 1066.512326] nfs: RPC call returned error 13 2015-06-03T10:52:33.415875+02:00 pantograf kernel: [ 1066.513919] audit: type=1400 audit(1433321553.414:52): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="lessopen.sh" laddr=10.100.12.15 lport=830 fad dr=10.100.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.415888+02:00 pantograf kernel: [ 1066.513935] nfs: RPC call returned error 13 2015-06-03T10:52:33.416857+02:00 pantograf kernel: [ 1066.514853] audit: type=1400 audit(1433321553.415:53): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.416869+02:00 pantograf kernel: [ 1066.514868] nfs: RPC call returned error 13 2015-06-03T10:52:33.417857+02:00 pantograf kernel: [ 1066.515346] audit: type=1400 audit(1433321553.416:54): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.417867+02:00 pantograf kernel: [ 1066.515358] nfs: RPC call returned error 13 Please advise. Update: I disabled AppArmor so those messages are gone But the I/O lockups continue. After examining the log closer, I believe this is the relevant message: 2015-06-03T16:52:33.508329+02:00 pantograf kernel: [13191.933894] NFS: nfs4_handle_delegation_recall_error: unhandled error -13. **** Neil Brown ****: I'm pretty sure that I've seen exactly this bug already. I just need to remember which one... I don't think this is related to kerberos at all. It is purely an NFSv4 state management issue. It is a bit like 831029, but maybe not exactly the same. I'll have a more thorough look after the weekend. Exactly which kernel do you have on your client? NeilBrown **** Nathan Cutler ****: Linux pantograf 3.16.7-21-desktop #1 SMP PREEMPT Tue Apr 14 07:11:37 UTC 2015 (93c1539) x86_64 x86_64 x86_64 GNU/Linux **** Theo Chatzimichos ****: Just for the record, Martin Caj's PC runs KDE with the same kernel version, and it has mounted the NFS /home since Friday. I just checked its dmesg and didn't see such error messages reported. The machine is now on low load though since Martin is away **** Nathan Cutler ****: It only happens if you actually use the desktop. I can leave KDE running all night, all day and as long as I do not use the machine, or use it only via ssh from home, nothing happens. The I/O freezes only happen when I am sitting physically at the machine, clicking in Thunderbird, Pidgin, Firefox, etc. When Martin gets back I will work with him to reproduce it. Nathan **** Martin Pluskal ****: Hi I have also encountered those issues (on current TW with kernel 4.1.0-rc6-2.g8f2a4d1-desktop) I have captured dmesg after issuing "rpcdebug -m nfs -s all" - see http://users.suse.cz/~mpluskal/bugs/nfs_issues/dmesg.txt.xz - perhaps it contains something useful. Kind regards Martin **** Neil Brown ****: I've had a bit more of a look at this and I think it is a new problem. There have been quite a few different bugs which involve the nn.nn.nn.nn-man thread spinning, but this doesn't seem to fit the pattern of any of them. So maybe it would be best to open a bug on bugzilla. The "unhandled error -13." is an NFS4ERR_ACCES error. When the client tried to open some file, the Netapp gave it a delegation. It presumably then opened the file again without telling the server, on the strength of that delegation. At some point the delegation needed to be returned so the client requested explicit opens from the server. One of those failed with "EACCESS" meaning the user on the client didn't seem to have permission. That is very odd. The state management thread then goes into a spin. This suggests that it was the state management thread which decided that the delegation needed to be returned. Maybe it got some sort of state error. I cannot easily find a path through the code that would make it spin, but it is fairly complex code. One thing that might help is to capture the content of /proc/PID/stack for the threads that is using all the CPU time. Capture that a dozen times or so in quick succession. Or maybe capture hundreds and discard the duplicates. The other thing that could be useful is a tcpdunp trace of NFS (port 2049) traffic between client and server when the problem occurs. So if you can reliably trigger it, that should be easy to get. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202
--- Comment #1 from Martin Pluskal
http://bugzilla.suse.com/show_bug.cgi?id=934202
--- Comment #2 from Neil Brown
http://bugzilla.suse.com/show_bug.cgi?id=934202
--- Comment #3 from Martin Pluskal
Thanks for the trace. Can you tell me how it corresponds to the symptom? Sure, error from dmesg occured before end of thare Did you start the trace before there was a problem or only after? The trace seems to run for about 4 hours (14585 seconds). from 07:27:30 to 11:30:35, though I'm not sure what timezone. When in that time did the problem occur?
Sure, error from dmesg occurred before end of trace (few minutes), trace was obviously captured before and during time issue occurred, do you assume that capturing trace after issue would be somehow useful? Timezone is CEST. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202
Neil Brown
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c5
--- Comment #5 from Martin Pluskal
Could you (or anyone who can reproduce this) please collect both the tcpdump trace and nfs debugging with "rpcdebug -m nfs -s all". Then when a problem occurs, provide: - the tcpdump log # sudo tcpdump port 2049 -w nfs_dump_new.log http://users.suse.cz/~mpluskal/bugs/934202/nfs_dump_new.log.xz - the kernel logs # journalctl --dmesg > dmesg_new.txt http://users.suse.cz/~mpluskal/bugs/934202/dmesg_new.txt.xz - any notes on what you noticed, including whether any processes were spinning. Issue occured sometime during weekend so I just noticed error in dmesg, there should be minimal activity on nfs mount. - if processes were spinning, a few copies of /proc/PID/stack for those processes. - exact kernel version. # uname -r 4.0.4-3-desktop
Hope it helps -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c6
--- Comment #6 from Neil Brown
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c7
--- Comment #7 from Neil Brown
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c9
Neil Brown
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c12
Neil Brown
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c13
--- Comment #13 from Neil Brown
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c14
Neil Brown
NFS: nfs4_handle_delegation_recall_error: unhandled error -10026.
and I don't think it addresses the state manager process spinning either. I need more details for those. If they can be reproduced, the logs (like the excellent ones which led to this patch) would help a lot. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c15
--- Comment #15 from Neil Brown
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c16
--- Comment #16 from Martin Pluskal
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c17
--- Comment #17 from Nathan Cutler
As for "state manager process spinning" I am not sure if I am able to help as I am not using nfs home (instead having it mounted as /mnt/home) and using it mainly as backup/easy way to publish stuff on users.suse.cz - and I have not encountered issues which mcaj described - I mean I would like to help but some reliable way to trigger issue would help.
I am the one who encountered the spinning state manager process and associated lockups while on NFS home. Since this was on my primary workstation and the lockups were troublesome (to say the least), I had to migrate to local home to be able to work. The bug was easily reproducible (in fact, unavoidable) on my setup, which was: * openSUSE 13.2 * NFS home against nfs.suse.cz (Netapp) * KDE4 * Thunderbird, Pidgin, Firefox I never did narrow down the trigger, but I *suspect* it was clicking on URLs in Thunderbird and Pidgin, so they open in Firefox. At some point (1-4 hours of normal usage after reboot) I would click on a link only to find that the system had become unresponsive because writes to /home were not completing. When this happened, a process called 10.100.2.89-man was visible in 'top' and the load average would rise due to the "spinning". The following message would appear in /var/log/messages:
NFS: nfs4_handle_delegation_recall_error: unhandled error -13.
(note, not -10026) Martin Caj also has openSUSE 13.2, NFS home, and KDE4, but uses a different MUA and a different IRC client, and AFAIK he could not reproduce -- hence my suspicion that the issue is related to Thunderbird+Pidging+Firefox interaction. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c18
--- Comment #18 from Nathan Cutler
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c19
--- Comment #19 from Neil Brown
I use VirtualBox, and it is upset that it can't find an appropriate version of the virtualbox-host-kmp-default package to go with the kernel.
That's unfortunate. I don't know how to build kmps. The only think I can think of is to wait for the new kernel to get into factory - howfully that will automatically include the right kmps. Maybe you could find matching kernel in: http://download.opensuse.org/repositories/Kernel:/openSUSE-13.2/standard/x86... and virtualbox kmp in: http://download.opensuse.org/repositories/Virtualization/openSUSE_13.2/x86_6... -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=934202
http://bugzilla.suse.com/show_bug.cgi?id=934202#c27
Neil Brown
participants (1)
-
bugzilla_noreply@novell.com