http://bugzilla.suse.com/show_bug.cgi?id=934202 Bug ID: 934202 Summary: NFS: nfs4_handle_delegation_recall_error: unhandled error -13. Classification: openSUSE Product: openSUSE Distribution Version: 13.2 Hardware: Other OS: Other Status: NEW Severity: Normal Priority: P5 - None Component: Other Assignee: nfbrown@suse.com Reporter: tchatzimichos@suse.com QA Contact: qa-bugs@suse.de CC: mcaj@suse.com, mpluskal@suse.com, ncutler@suse.com Found By: --- Blocker: --- This bug was reported first in the internal SUSE ticketing system. I am copy-pasting all the current comments from that ticket to this one: **** Nathan Cutler ****: Ever since the migration from login.suse.cz to nfs.suse.cz that took place over the weekend 2015-05-30 and 2015-05-31, my workstation's NFS home (pantograf.suse.cz) in unstable. The problem manifests approx. once every 3-4 hours of usage. Randomly, for no apparent reason. First, I/O to /home partition starts to hang. Load average starts to rise and a mystery process "10.100.2.89-man" appears in "top" listing. This process is unkillable even with SIGKILL. Running programs continue to work as long as they do not attempt I/O to the home partition, then they freeze up. The only fix I know of is to reboot the machine. Referring to the log, the problem appears to be apparmor-related: 2015-06-03T10:52:33.390867+02:00 pantograf kernel: [ 1066.488685] audit: type=1400 audit(1433321553.389:49): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5183 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.390885+02:00 pantograf kernel: [ 1066.488720] nfs: RPC call returned error 13 2015-06-03T10:52:33.391849+02:00 pantograf kernel: [ 1066.489192] audit: type=1400 audit(1433321553.390:50): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5183 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.391856+02:00 pantograf kernel: [ 1066.489204] nfs: RPC call returned error 13 2015-06-03T10:52:33.414853+02:00 pantograf kernel: [ 1066.512310] audit: type=1400 audit(1433321553.413:51): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5178 comm="lessopen.sh" laddr=10.100.12.15 lport=830 fad dr=10.100.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.414863+02:00 pantograf kernel: [ 1066.512326] nfs: RPC call returned error 13 2015-06-03T10:52:33.415875+02:00 pantograf kernel: [ 1066.513919] audit: type=1400 audit(1433321553.414:52): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="lessopen.sh" laddr=10.100.12.15 lport=830 fad dr=10.100.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.415888+02:00 pantograf kernel: [ 1066.513935] nfs: RPC call returned error 13 2015-06-03T10:52:33.416857+02:00 pantograf kernel: [ 1066.514853] audit: type=1400 audit(1433321553.415:53): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.416869+02:00 pantograf kernel: [ 1066.514868] nfs: RPC call returned error 13 2015-06-03T10:52:33.417857+02:00 pantograf kernel: [ 1066.515346] audit: type=1400 audit(1433321553.416:54): apparmor=" DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185 comm="file" laddr=10.100.12.15 lport=830 faddr=10.1 00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6 2015-06-03T10:52:33.417867+02:00 pantograf kernel: [ 1066.515358] nfs: RPC call returned error 13 Please advise. Update: I disabled AppArmor so those messages are gone But the I/O lockups continue. After examining the log closer, I believe this is the relevant message: 2015-06-03T16:52:33.508329+02:00 pantograf kernel: [13191.933894] NFS: nfs4_handle_delegation_recall_error: unhandled error -13. **** Neil Brown ****: I'm pretty sure that I've seen exactly this bug already. I just need to remember which one... I don't think this is related to kerberos at all. It is purely an NFSv4 state management issue. It is a bit like 831029, but maybe not exactly the same. I'll have a more thorough look after the weekend. Exactly which kernel do you have on your client? NeilBrown **** Nathan Cutler ****: Linux pantograf 3.16.7-21-desktop #1 SMP PREEMPT Tue Apr 14 07:11:37 UTC 2015 (93c1539) x86_64 x86_64 x86_64 GNU/Linux **** Theo Chatzimichos ****: Just for the record, Martin Caj's PC runs KDE with the same kernel version, and it has mounted the NFS /home since Friday. I just checked its dmesg and didn't see such error messages reported. The machine is now on low load though since Martin is away **** Nathan Cutler ****: It only happens if you actually use the desktop. I can leave KDE running all night, all day and as long as I do not use the machine, or use it only via ssh from home, nothing happens. The I/O freezes only happen when I am sitting physically at the machine, clicking in Thunderbird, Pidgin, Firefox, etc. When Martin gets back I will work with him to reproduce it. Nathan **** Martin Pluskal ****: Hi I have also encountered those issues (on current TW with kernel 4.1.0-rc6-2.g8f2a4d1-desktop) I have captured dmesg after issuing "rpcdebug -m nfs -s all" - see http://users.suse.cz/~mpluskal/bugs/nfs_issues/dmesg.txt.xz - perhaps it contains something useful. Kind regards Martin **** Neil Brown ****: I've had a bit more of a look at this and I think it is a new problem. There have been quite a few different bugs which involve the nn.nn.nn.nn-man thread spinning, but this doesn't seem to fit the pattern of any of them. So maybe it would be best to open a bug on bugzilla. The "unhandled error -13." is an NFS4ERR_ACCES error. When the client tried to open some file, the Netapp gave it a delegation. It presumably then opened the file again without telling the server, on the strength of that delegation. At some point the delegation needed to be returned so the client requested explicit opens from the server. One of those failed with "EACCESS" meaning the user on the client didn't seem to have permission. That is very odd. The state management thread then goes into a spin. This suggests that it was the state management thread which decided that the delegation needed to be returned. Maybe it got some sort of state error. I cannot easily find a path through the code that would make it spin, but it is fairly complex code. One thing that might help is to capture the content of /proc/PID/stack for the threads that is using all the CPU time. Capture that a dozen times or so in quick succession. Or maybe capture hundreds and discard the duplicates. The other thing that could be useful is a tcpdunp trace of NFS (port 2049) traffic between client and server when the problem occurs. So if you can reliably trigger it, that should be easy to get. -- You are receiving this mail because: You are on the CC list for the bug.