[Bug 231845] New: procmail cannot deliver to NFS, loops with "Reiterating kernel-lock"
https://bugzilla.novell.com/show_bug.cgi?id=231845

Summary: procmail cannot deliver to NFS, loops with "Reiterating kernel-lock"
Product: openSUSE 10.2
Version: Final
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Network
AssignedTo: werner@novell.com
ReportedBy: mvidner@novell.com
QAContact: qa@suse.de
CC: okir@novell.com

For a long time I have used procmail (called from fetchmail) to sort mail into mailboxes on my NFS home dir. With the upgrade of my workstation to oS-10.2, this stopped working. The procmail process endlessly tries to deliver one message and (with VERBOSE=on) the log says "Reiterating kernel-lock".

strace shows that it tries to lock the mailbox using lockfile, fcntl and flock successively. The first two methods succeed, the last one fails (see attachment).

http://nfs.sourceforge.net/#faq_d10 seems to explain this behavior: the fcntl lock succeeds, and the subsequent flock (emulated via fcntl) therefore cannot succeed. But then it should have stopped working long before 10.2, with 2.6.12.

Apparently there is some detection during procmail builds that enables all three locking methods (see procmail -v) even though we seem to have a patch to disable some of them.

I have recompiled procmail with -DNO_flock_LOCK and now it works for me again. I don't know whether the kernel behavior exposed by procmail is correct or not, so I suggest using -DNO_flock_LOCK and then reassigning to kernel.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #1 from mvidner@novell.com 2007-01-04 11:17 MST -------
Created an attachment (id=111546)
 --> (https://bugzilla.novell.com/attachment.cgi?id=111546&action=view)
procmail-strace
https://bugzilla.novell.com/show_bug.cgi?id=231845

mvidner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Normal                      |Major

------- Comment #2 from mvidner@novell.com 2007-01-04 11:21 MST -------
My setup is relatively uncommon (most other people use IMAP), and there is a workaround of using a global LOCKFILE=/non/nfs and deleting the trailing colons from recipe headers, but I still think this qualifies for an online update.
https://bugzilla.novell.com/show_bug.cgi?id=231845

werner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |okir@novell.com

------- Comment #3 from werner@novell.com 2007-01-08 04:21 MST -------
There was absolutely no change in procmail, nor in how procmail is built. I'd like to hear from some kernel experts why the locking is broken with the current kernel.
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #4 from werner@novell.com 2007-01-08 05:27 MST -------
Martin? Please try out what happens if you use LOCKINGTEST=100 within the main Makefile. This should lead to the same result as specifying -DNO_flock_LOCK on the compiler's command line.
https://bugzilla.novell.com/show_bug.cgi?id=231845

okir@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nfbrown@novell.com
             Status|NEEDINFO                    |NEW
      Info Provider|okir@novell.com             |

------- Comment #5 from okir@novell.com 2007-01-08 05:59 MST -------
Could be a kernel change, indeed.

The curious thing in the trace from comment #1 is that procmail tries to use both POSIX and BSD locking. The fcntl(SETLK) succeeds, but flock(LOCK_EX|LOCK_NB) returns EAGAIN.

Most of the time, this kind of happens to work with the current flock-over-NFS implementation, because POSIX locks use the process PID in the NLM request, whereas for flock locks we identify the holder using the task group ID. So when TGID == PID, things kind of work:

 - fcntl(F_SETLK) is translated into NLM call with owner == PID
 - Server establishes lock
 - flock(LOCK_EX) is translated into NLM call with owner == TGID, which happens to be the same as PID
 - Server does nothing
 - procmail does its stuff
 - flock(LOCK_UN) is translated into NLM call with owner == TGID
 - Server removes lock
 - fcntl(F_SETLK) is translated into NLM call with owner == PID
 - Server does nothing, as the lock is already gone.

(Note the problem during unlock - but that's not the issue here.)

The most obvious reason for flock to fail in this way that I can think of would be that procmail has magically become multithreaded, and now the PID of the thread trying to place the fcntl call is different from the TGID.

Is procmail linked against libpthread in 10.2?

Neil, do you remember when flock-over-NFS was added to the kernel?
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #6 from werner@novell.com 2007-01-08 06:20 MST -------
The procmail binary does not know about libpthread. From the strace it seems that after using

  fcntl(<fd>, F_SETLKW, &{flck.type=F_WRLCK,...})

the flock(2) system call fails with -EAGAIN, which leads procmail to execute

  fcntl(<fd>, F_SETLKW, &{flck.type=F_UNLCK,...})

and retry using fcntl(2) followed by flock(2), as done in fdlock() in procmail-3.22/src/locking.c.
https://bugzilla.novell.com/show_bug.cgi?id=231845

werner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |aj@novell.com

------- Comment #7 from werner@novell.com 2007-01-08 06:28 MST -------
Andreas? I'd like to get a SWAMPID for this bugzilla for an update of procmail on 10.2.
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #8 from okir@novell.com 2007-01-08 06:57 MST -------
Re comment #6 - this is what I said :-)

First, it does a POSIX lock:
  fcntl64(7, F_SETLKW64, {type=F_WRLCK, }) = 0

Then it tries a BSD lock:
  flock(7, LOCK_EX|LOCK_NB) = -1 EAGAIN

Then it gives up and drops the POSIX lock:
  fcntl64(7, F_SETLK64, {type=F_UNLCK, ...}) = 0

Martin, as you can reproduce this problem, can you please check /proc/pid/status of the hung procmail process and check whether TGID matches the PID or not? If it matches, we may have a different kind of kernel problem here which needs investigating.

I think it's too early to call for an update of procmail yet. Let's first find out what the issue really is.
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #9 from richard.bos@xs4all.nl 2007-01-08 13:47 MST -------
If it is nfs related, please have a look at:
https://bugzilla.novell.com/show_bug.cgi?id=231234
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #10 from nfbrown@novell.com 2007-01-08 17:53 MST -------
Olaf: I'm confused by your references to TGID. As far as I can see, the 'owner' for flock locks is the address of the 'struct file' (see nfs_flock in fs/nfs/file.c). Where does the TGID come in?

This would imply that getting both an flock and a fcntl lock on the same file over NFS would not have worked since 2.6.12, when flock-over-NFS was introduced.

I assume the same version of NFS is being used with 10.2?

A tcpdump of NFS and LOCK traffic while procmail tries to lock might help.
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #11 from okir@novell.com 2007-01-09 00:54 MST -------
Sorry, I was blabbing incoherently. I was reading the code the wrong way round. Please disregard my comments above.

Yes, a tcpdump would be helpful.
https://bugzilla.novell.com/show_bug.cgi?id=231845

aj@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Info Provider|aj@novell.com               |mvidner@novell.com

------- Comment #12 from aj@novell.com 2007-01-09 08:31 MST -------
Martin, can you provide such a tcpdump, please?
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #13 from ro@novell.com 2007-01-23 06:54 MST -------
ping
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #14 from werner@novell.com 2007-01-23 07:00 MST -------
ditto
https://bugzilla.novell.com/show_bug.cgi?id=231845

mvidner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|mvidner@novell.com          |

------- Comment #15 from mvidner@novell.com 2007-01-31 09:50 MST -------
Created an attachment (id=116610)
 --> (https://bugzilla.novell.com/attachment.cgi?id=116610&action=view)
procmail-231845.tgz

I did a test run first with the buggy version from 10.2 and then one with -DNO_flock_LOCK. I used "tcpdump -u -s0 'port 2049 || port 32770 || port 53107'" where the numbers correspond to services nfs and nlockmgr from rpcinfo -p nfs.suse.cz.

Tarball contents:
pink-elephant.procmailrc: testing rc file
pink-elephant-log: procmail log, was also on NFS
pink-elephant.mail: a testing message
pink-elephant-dump-bad: tcpdump of the unsuccessful attempt
pink-elephant-dump-ok: tcpdump of the successful attempt
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #16 from werner@novell.com 2007-02-01 03:49 MST -------
IMHO procmail isn't buggy, but the kernel is.
https://bugzilla.novell.com/show_bug.cgi?id=231845

werner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |werner@novell.com
         AssignedTo|werner@novell.com           |kernel-maintainers@forge.provo.novell.com
           Severity|Major                       |Critical
          Component|Network                     |Kernel

------- Comment #17 from werner@novell.com 2007-04-26 07:54 MST -------
This one seems to have been missed by the network kernel people. Please (re)enable both lock mechanisms, even if used in a mixed but ordered way.
https://bugzilla.novell.com/show_bug.cgi?id=231845

jeffm@novell.com changed:

           What    |Removed                                   |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-maintainers@forge.provo.novell.com |nfbrown@novell.com
https://bugzilla.novell.com/show_bug.cgi?id=231845

nfbrown@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

------- Comment #18 from nfbrown@novell.com 2007-04-29 19:43 MST -------
Unfortunately the tcpdump is useless - it is always best to use "-w filename" to create a binary dump, and use "-s 0" to capture full packets.

It doesn't really matter though - I managed to do some testing myself and think I can fix it. I have posted a discussion of the problem and a possible solution to the nfs community mailing list to get feedback from people who are more familiar with the code.

The patch below fixes the symptom and I think it is safe, but it is the safety that I want feedback on.

Stay tuned...

diff .prev/fs/lockd/clntproc.c ./fs/lockd/clntproc.c
--- .prev/fs/lockd/clntproc.c 2007-04-30 10:41:01.000000000 +1000
+++ ./fs/lockd/clntproc.c 2007-04-30 11:10:08.000000000 +1000
@@ -134,7 +134,7 @@ static void nlmclnt_setlockargs(struct n
  lock->oh.len = snprintf(req->a_owner, sizeof(req->a_owner), "%u@%s",
      (unsigned int)fl->fl_u.nfs_fl.owner->pid,
      utsname()->nodename);
- lock->svid = fl->fl_u.nfs_fl.owner->pid;
+ lock->svid = 42;
  lock->fl.fl_start = fl->fl_start;
  lock->fl.fl_end = fl->fl_end;
  lock->fl.fl_type = fl->fl_type;
https://bugzilla.novell.com/show_bug.cgi?id=231845

nfbrown@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |werner@novell.com

------- Comment #19 from nfbrown@novell.com 2007-04-30 00:52 MST -------
Initial feedback is not positive. In fact, that patch is completely broken, and fixing it would be a very substantial effort and probably would not be accepted upstream.

So while I agree that the NFS client is doing the wrong thing, I think that by a long way the easiest fix to the overall problem is to compile procmail with LOCKINGTEST=100.

I notice that this is what is used in the latest version, though 10.2 and 10.0 (at least) have LOCKINGTEST=101, which will cause this problem.

So, much as I would like to fix the kernel, I think this problem has to be resolved either by waiting for 10.3, or by getting LOCKINGTEST=100 back-ported to 10.2.

Werner?
https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #20 from werner@novell.com 2007-04-30 02:28 MST -------
I have no problem backporting this workaround; nevertheless, the kernel should be fixed to accept mixed locks. Btw: is there a test which could be done to determine whether both locking schemes can be used together?
https://bugzilla.novell.com/show_bug.cgi?id=231845
Dr. Werner Fink
https://bugzilla.novell.com/show_bug.cgi?id=231845#c21
Frank Steiner
https://bugzilla.novell.com/show_bug.cgi?id=231845#c22
Dr. Werner Fink
https://bugzilla.novell.com/show_bug.cgi?id=231845#c23
--- Comment #23 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=231845#c24
--- Comment #24 from Frank Steiner
I would be very surprised if this bug affects SLES10-SP1 but not SLES10-GA (at least if it is a kernel bug).
That's for sure. We still have a SuSE 10.1 server running (and as far as I can see, the kernels for 10.1 and SLES10 GA are more or less identical?), and there we still have the original procmail from 10.1 (both procmail-3.22-56 on 10.1 and SLES10), and it works fine with the locks. On the SLES10 server it fails with the original SLES10 procmail after upgrading to SP1. I recompiled procmail now with LOCKINGTEST=100 to make it work.

By your explanation it sounds reasonable that the locking test fails, so I guess it would be a good idea to release an updated procmail for SLES10 and SuSE 10.1/2.
https://bugzilla.novell.com/show_bug.cgi?id=231845#c25
--- Comment #25 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=231845#c26
Dr. Werner Fink
https://bugzilla.novell.com/show_bug.cgi?id=231845#c28
Anja Stock
https://bugzilla.novell.com/show_bug.cgi?id=231845#c29
--- Comment #29 from Anja Stock
https://bugzilla.novell.com/show_bug.cgi?id=231845#c30
Neil Brown