This bug was reported first in the internal SUSE ticketing system. I am
copy-pasting all the current comments from that ticket to this one:

**** Nathan Cutler ****:

Ever since the migration from login.suse.cz to nfs.suse.cz that took
place over the weekend 2015-05-30 and 2015-05-31, my workstation's NFS
home (pantograf.suse.cz) in unstable.

The problem manifests approx. once every 3-4 hours of usage. Randomly,
for no apparent reason. First, I/O to /home partition starts to hang.
Load average starts to rise and a mystery process "10.100.2.89-man"
appears in "top" listing. This process is unkillable even with SIGKILL.
Running programs continue to work as long as they do not attempt I/O to
the home partition, then they freeze up. The only fix I know of is to
reboot the machine.

Referring to the log, the problem appears to be apparmor-related:

2015-06-03T10:52:33.390867+02:00 pantograf kernel: [ 1066.488685] audit:
type=1400 audit(1433321553.389:49): apparmor="
DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5183
comm="file" laddr=10.100.12.15 lport=830 faddr=10.1
00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6
2015-06-03T10:52:33.390885+02:00 pantograf kernel: [ 1066.488720] nfs:
RPC call returned error 13
2015-06-03T10:52:33.391849+02:00 pantograf kernel: [ 1066.489192] audit:
type=1400 audit(1433321553.390:50): apparmor="
DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5183
comm="file" laddr=10.100.12.15 lport=830 faddr=10.1
00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6
2015-06-03T10:52:33.391856+02:00 pantograf kernel: [ 1066.489204] nfs:
RPC call returned error 13
2015-06-03T10:52:33.414853+02:00 pantograf kernel: [ 1066.512310] audit:
type=1400 audit(1433321553.413:51): apparmor="
DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5178
comm="lessopen.sh" laddr=10.100.12.15 lport=830 fad
dr=10.100.2.89 fport=2049 family="inet" sock_type="stream" protocol=6
2015-06-03T10:52:33.414863+02:00 pantograf kernel: [ 1066.512326] nfs:
RPC call returned error 13
2015-06-03T10:52:33.415875+02:00 pantograf kernel: [ 1066.513919] audit:
type=1400 audit(1433321553.414:52): apparmor="
DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185
comm="lessopen.sh" laddr=10.100.12.15 lport=830 fad
dr=10.100.2.89 fport=2049 family="inet" sock_type="stream" protocol=6
2015-06-03T10:52:33.415888+02:00 pantograf kernel: [ 1066.513935] nfs:
RPC call returned error 13
2015-06-03T10:52:33.416857+02:00 pantograf kernel: [ 1066.514853] audit:
type=1400 audit(1433321553.415:53): apparmor="
DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185
comm="file" laddr=10.100.12.15 lport=830 faddr=10.1
00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6
2015-06-03T10:52:33.416869+02:00 pantograf kernel: [ 1066.514868] nfs:
RPC call returned error 13
2015-06-03T10:52:33.417857+02:00 pantograf kernel: [ 1066.515346] audit:
type=1400 audit(1433321553.416:54): apparmor="
DENIED" operation="sendmsg" profile="/usr/bin/lessopen.sh" pid=5185
comm="file" laddr=10.100.12.15 lport=830 faddr=10.1
00.2.89 fport=2049 family="inet" sock_type="stream" protocol=6
2015-06-03T10:52:33.417867+02:00 pantograf kernel: [ 1066.515358] nfs:
RPC call returned error 13

Please advise.

Update: I disabled AppArmor so those messages are gone

But the I/O lockups continue. After examining the log closer, I believe this is
the relevant message:

2015-06-03T16:52:33.508329+02:00 pantograf kernel: [13191.933894] NFS:
nfs4_handle_delegation_recall_error: unhandled error -13. 


**** Neil Brown ****:

I'm pretty sure that I've seen exactly this bug already.  I just need to
remember which one...

I don't think this is related to kerberos at all.  It is purely an NFSv4
state management issue.

It is a bit like 831029, but maybe not exactly the same.

I'll have a more thorough look after the weekend.  Exactly which kernel do
you have on your client?

NeilBrown

**** Nathan Cutler ****:

Linux pantograf 3.16.7-21-desktop #1 SMP PREEMPT Tue Apr 14 07:11:37 UTC 2015
(93c1539) x86_64 x86_64 x86_64 GNU/Linux


**** Theo Chatzimichos ****:

Just for the record, Martin Caj's PC runs KDE with the same kernel version, and
it has mounted the NFS /home since Friday. I just checked its dmesg and didn't
see such error messages reported. The machine is now on low load though since
Martin is away 


**** Nathan Cutler ****:

It only happens if you actually use the desktop. I can leave KDE running
all night, all day and as long as I do not use the machine, or use it
only via ssh from home, nothing happens. The I/O freezes only happen
when I am sitting physically at the machine, clicking in Thunderbird,
Pidgin, Firefox, etc.

When Martin gets back I will work with him to reproduce it.

Nathan


**** Martin Pluskal ****:

Hi
I have also encountered those issues (on current TW with kernel
4.1.0-rc6-2.g8f2a4d1-desktop) I have captured dmesg after issuing
"rpcdebug -m nfs -s all" - see
http://users.suse.cz/~mpluskal/bugs/nfs_issues/dmesg.txt.xz - perhaps it
contains something useful.

Kind regards

Martin


**** Neil Brown ****:

I've had a bit more of a look at this and I think it is a new problem.

There have been quite a few different bugs which involve the nn.nn.nn.nn-man
thread spinning, but this doesn't seem to fit the pattern of any of them.

So maybe it would be best to open a bug on bugzilla.

The "unhandled error -13." is an NFS4ERR_ACCES error.

When the client tried to open some file, the Netapp gave it a delegation.  It
presumably then opened the file again without telling the server, on the
strength of that delegation.
At some point the delegation needed to be  returned so the client requested
explicit opens from the server.  One of those failed with "EACCESS" meaning the
user on
the client didn't seem to have permission.  That is very odd.

The state management thread then goes into a spin.  This suggests that it was
the state management thread which decided that the delegation needed to be
returned.
Maybe it got some sort of state error.  I cannot easily find a path through the
code that would make it spin, but it is fairly complex code.

One thing that might help is to capture the content of /proc/PID/stack for the
threads that is using all the CPU time.  Capture that a dozen times or so in
quick succession.
Or maybe capture hundreds and discard the duplicates.

The other thing that could be useful is a tcpdunp trace of NFS (port 2049)
traffic between client and server when the problem occurs.  So if you can
reliably trigger it, that should
be easy to get.

Bug ID	934202
Summary	NFS: nfs4_handle_delegation_recall_error: unhandled error -13.
Classification	openSUSE
Product	openSUSE Distribution
Version	13.2
Hardware	Other
OS	Other
Status	NEW
Severity	Normal
Priority	P5 - None
Component	Other
Assignee	nfbrown@suse.com
Reporter	tchatzimichos@suse.com
QA Contact	qa-bugs@suse.de
CC	mcaj@suse.com, mpluskal@suse.com, ncutler@suse.com
Found By	---
Blocker	---