What | Removed | Added |
---|---|---|
Status | NEW | CONFIRMED |
No, capturing much after the error (more than a few minutes) wouldn't be likely to help. I just need to be sure that I know what I'm looking at - I don't like to assume.

I didn't realise at first, but these seem to be very different symptoms from the original problem description. The error message is certainly different and I strongly suspect the NN.NN.NN.NN-man process didn't go into a spin - is that correct?

There are a few odd things in the trace, but I think the first odd thing comes from the client. In frame 67293 the client tries to close a file which it opened in frame 28511 - the file is called "places.sqlite". Then immediately after a successful response from the server (well... 600 usec later), the client sends another close request for the same file. The server accepts this request (which is a little odd, but I'm not certain it is wrong) but the next sequenced request (OPEN, CLOSE, LOCK and LOCKU are all 'sequenced' requests and have sequence numbers) gets an error: BAD_SEQID. This is the CLOSE in frame 67385. This close is correctly sequenced so the server shouldn't complain, but presumably it got confused by the earlier double close.

I cannot see how the Linux client could be sending a double close like that. I guess I need more data....

Could you (or anyone who can reproduce this) please collect both the tcpdump trace and nfs debugging with "rpcdebug -m nfs -s all". Then when a problem occurs, provide:

- the tcpdump log
- the kernel logs
- any notes on what you noticed, including whether any processes were spinning
- if processes were spinning, a few copies of /proc/PID/stack for those processes
- exact kernel version

Hopefully I will be able to pull all that together.

Thanks.
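
For anyone following the BAD_SEQID reasoning above, here is a rough sketch of the sequencing rule as I understand it for NFSv4.0 sequenced operations: a request whose seqid is one greater than the last accepted one is new, a request repeating the last seqid is treated as a replay and gets the cached reply, and anything else gets BAD_SEQID. This is a toy model, not the server's code. The `OpenOwner` class, the `handle_sequenced` function and the seqid values (10, 11, 12, 20) are all made up for illustration; the actual seqids in the trace are not quoted in this comment, and seqid wraparound is ignored.

```python
from dataclasses import dataclass

NFS4_OK = "NFS4_OK"
NFS4ERR_BAD_SEQID = "NFS4ERR_BAD_SEQID"


@dataclass
class OpenOwner:
    """Hypothetical, simplified server-side view of one open-owner."""
    last_seqid: int            # seqid of the last sequenced request accepted
    last_reply: str = NFS4_OK  # cached reply, handed back again on a replay


def handle_sequenced(owner: OpenOwner, seqid: int) -> str:
    """Apply the simplified seqid rule to one OPEN/CLOSE/LOCK/LOCKU request."""
    if seqid == owner.last_seqid + 1:
        # Next request in sequence: accept it and advance the counter.
        owner.last_seqid = seqid
        owner.last_reply = NFS4_OK
        return owner.last_reply
    if seqid == owner.last_seqid:
        # Same seqid as the last request: a replay, counter does not move.
        return owner.last_reply
    # Neither the next seqid nor a replay of the last one.
    return NFS4ERR_BAD_SEQID


def demo() -> list:
    owner = OpenOwner(last_seqid=10)   # assumed starting point
    return [
        handle_sequenced(owner, 11),   # first CLOSE
        handle_sequenced(owner, 11),   # duplicate CLOSE, seen as a replay
        handle_sequenced(owner, 12),   # next correctly sequenced request
        handle_sequenced(owner, 20),   # out-of-sequence request
    ]


if __name__ == "__main__":
    print(demo())
```

Under this model a duplicate request that repeats the previous seqid does not advance the counter, so the next correctly sequenced request is still accepted. That is why the BAD_SEQID on a correctly sequenced CLOSE looks wrong and why the kernel logs are needed to see what each side believed the seqid to be.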