Neil Brown changed bug 934202
What    | Removed | Added
Status  | NEW     | CONFIRMED

Comment # 4 on bug 934202 from
No, capturing long after the error (more than a few minutes) is unlikely to
help.  I just need to be sure that I know what I'm looking at - I don't like
to assume.

I didn't realise at first, but this seems to be a very different set of
symptoms from the original problem description.  The error message is certainly
different and I strongly suspect the NN.NN.NN.NN-man process didn't go into a
spin - is that correct?

There are a few odd things in the trace, but I think the first odd thing comes
from the client.  In frame 67293 the client tries to close a file which it
opened in frame 28511 - the file is called "places.sqlite".
Then immediately after a successful response from the server (well... 600usec
later), the client sends another close request for the same file.
The server accepts this request (which is a little odd, but I'm not certain it
is wrong) but the next sequenced request (OPEN, CLOSE, LOCK and LOCKU are all
'sequenced' requests and carry sequence numbers) gets an error: BAD_SEQID.  This
is the CLOSE in frame 67385.
This close is correctly sequenced so the server shouldn't complain, but
presumably it got confused by the earlier double close.
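
To make the sequencing argument concrete, here is a toy model of the usual
NFSv4.0 rule for sequenced requests: the next seqid is accepted, a repeat of the
last seqid is treated as a retransmission, and anything else gets BAD_SEQID.
This is only a sketch; the class name, the seqid values and the assumption that
the duplicate close reused the same seqid are all hypothetical (the trace
summary above does not state them), and seqid wrap-around is ignored.

```python
class SeqidChecker:
    """Toy per-owner sequence check for OPEN/CLOSE/LOCK/LOCKU (hypothetical model)."""

    def __init__(self, last_seqid):
        # Seqid of the last sequenced request the server processed for this owner.
        self.last_seqid = last_seqid

    def check(self, seqid):
        if seqid == self.last_seqid + 1:
            # The next request in sequence: accept it and advance.
            self.last_seqid = seqid
            return "OK"
        if seqid == self.last_seqid:
            # Same seqid as the last request: treated as a retransmission.
            return "REPLAY"
        # Anything else is out of sequence.
        return "BAD_SEQID"


# Hypothetical seqid values, chosen only for illustration.
server = SeqidChecker(last_seqid=10)
results = [
    server.check(11),  # first CLOSE
    server.check(11),  # duplicate CLOSE, assumed to reuse the same seqid
    server.check(12),  # next correctly sequenced CLOSE
]
```

Under this simple rule the last request would be accepted, which is why a
BAD_SEQID at that point suggests the server's state was disturbed by the
duplicate close.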

I cannot see how the Linux client could be sending a double close like that.

I guess I need more data....

Could you (or anyone who can reproduce this) please collect both the tcpdump
trace and NFS debug output, enabled with "rpcdebug -m nfs -s all".  Then when
a problem occurs, provide:
- the tcpdump log
- the kernel logs
- any notes on what you noticed, including whether any processes were spinning
- if processes were spinning, a few copies of /proc/PID/stack for those
  processes
- the exact kernel version
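
The collection steps above could be scripted roughly as follows.  This is a
sketch under assumptions, not a tested procedure: the function names, the
capture file name, the "any" interface and the "port 2049" filter (the usual
NFS port) are my choices for illustration; only the rpcdebug command is taken
from the request above.  Reading /proc/PID/stack normally needs root.

```python
import platform
import time
from pathlib import Path


def capture_commands(pcap_path="nfs-trace.pcap", interface="any"):
    """Return the commands to start before trying to reproduce the problem."""
    return [
        # Capture whole packets on the (assumed) NFS port and write them to a file.
        ["tcpdump", "-i", interface, "-s", "0", "-w", pcap_path, "port", "2049"],
        # Enable all NFS client debug flags, as quoted in the request above.
        ["rpcdebug", "-m", "nfs", "-s", "all"],
    ]


def sample_stacks(pid, copies=3, interval=1.0, proc_root="/proc"):
    """Read /proc/PID/stack a few times for a process that appears to be spinning."""
    stack_file = Path(proc_root) / str(pid) / "stack"
    samples = []
    for i in range(copies):
        samples.append(stack_file.read_text())
        if i + 1 < copies:
            # Pause between samples so they show the process at different moments.
            time.sleep(interval)
    return samples


def kernel_version():
    """Return the running kernel release, the same string `uname -r` prints on Linux."""
    return platform.release()
```

The command lists can be handed to subprocess (for example subprocess.Popen for
the long-running tcpdump), and the kernel logs can then be saved with dmesg
once the problem has occurred.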

Hopefully I will be able to pull all that together.

Thanks.

