[Bug 231845] New: procmail cannot deliver to NFS, loops with "Reiterating kernel-lock"
https://bugzilla.novell.com/show_bug.cgi?id=231845

Summary: procmail cannot deliver to NFS, loops with "Reiterating kernel-lock"
Product: openSUSE 10.2
Version: Final
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Network
AssignedTo: werner@novell.com
ReportedBy: mvidner@novell.com
QAContact: qa@suse.de
CC: okir@novell.com

For a long time I have used procmail (called from fetchmail) to sort mail into mailboxes in my NFS home dir. After upgrading my workstation to openSUSE 10.2, this stopped working. The procmail process endlessly tries to deliver one message and (with VERBOSE=on) the log says "Reiterating kernel-lock".

strace shows that it tries to lock the mailbox using lockfile, fcntl and flock successively. The first two methods succeed, the last one fails (see attachment). http://nfs.sourceforge.net/#faq_d10 seems to explain this behavior: the fcntl lock succeeds, and therefore the subsequent flock (emulated via fcntl) cannot succeed. But then it should have stopped working long before 10.2, with 2.6.12.

Apparently there is some detection during procmail builds that enables all three locking methods (see procmail -v) even though we seem to have a patch to disable some of them. I have recompiled procmail with -DNO_flock_LOCK and now it works for me again.

I don't know whether the kernel behavior exposed by procmail is correct or not, so I suggest using -DNO_flock_LOCK and then reassigning to kernel.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #1 from mvidner@novell.com 2007-01-04 11:17 MST -------
Created an attachment (id=111546)
 --> (https://bugzilla.novell.com/attachment.cgi?id=111546&action=view)
procmail-strace

https://bugzilla.novell.com/show_bug.cgi?id=231845

mvidner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Normal                      |Major

------- Comment #2 from mvidner@novell.com 2007-01-04 11:21 MST -------
My setup is relatively uncommon (most other people use IMAP), and there is a workaround of using a global LOCKFILE=/non/nfs and deleting the trailing colons from recipe headers, but I still think this qualifies for an online update.

https://bugzilla.novell.com/show_bug.cgi?id=231845

werner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |okir@novell.com

------- Comment #3 from werner@novell.com 2007-01-08 04:21 MST -------
There was absolutely no change in procmail, nor in how procmail is built. I'd like to hear from some kernel experts why the locking is broken with the current kernel.

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #4 from werner@novell.com 2007-01-08 05:27 MST -------
Martin? Please try out what happens if you use LOCKINGTEST=100 within the main Makefile. This should lead to the same result as specifying -DNO_flock_LOCK on the compiler's command line.

https://bugzilla.novell.com/show_bug.cgi?id=231845

okir@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nfbrown@novell.com
             Status|NEEDINFO                    |NEW
      Info Provider|okir@novell.com             |

------- Comment #5 from okir@novell.com 2007-01-08 05:59 MST -------
Could be a kernel change, indeed.

The curious thing in the trace from comment #1 is that procmail tries to use both posix and bsd locking. The fcntl(SETLK) succeeds, but flock(LOCK_EX|LOCK_NB) returns EAGAIN.

Most of the time, this kind of happens to work with the current flock-over-nfs implementation. Posix locks use the process pid in the NLM request, whereas for flock locks, we identify the holder using the task group ID. So when TGID == PID, things kind of work:

 - fcntl(F_SETLK) is translated into NLM call with owner == PID
 - Server establishes lock
 - flock(LOCK_EX) is translated into NLM call with owner == TGID, which happens to be the same as PID
 - Server does nothing
 - procmail does its stuff
 - flock(LOCK_UN) is translated into NLM call with owner == TGID
 - Server removes lock
 - fcntl(F_SETLK) is translated into NLM call with owner == PID
 - Server does nothing, as the lock is already gone.

(Note the problem during unlock - but that's not the issue here).

The most obvious reason for fcntl to fail in this way that I can think of would be that procmail has magically become multithreaded, and now the PID of the thread trying to place the fcntl call is different from the TGID.

Is procmail linked against libpthread in 10.2?

Neil, do you remember when flock-over-NFS was added to the kernel?

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #6 from werner@novell.com 2007-01-08 06:20 MST -------
The procmail binary does not know about libpthread. From the strace it seems that after using

  fcntl(<fd>, F_SETLKW, &{flck.type=F_WRLCK,...})

the flock(2) system call fails with -EAGAIN, which leads procmail to execute

  fcntl(<fd>, F_SETLKW, &{flck.type=F_UNLCK,...})

and retry using fcntl(2) followed by flock(2), as done in fdlock() in procmail-3.22/src/locking.c.
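
The sequence described in comment #6 can be reproduced outside procmail with a small sketch like the one below. It is not procmail's fdlock(); the function name try_both_locks and its return convention are made up for illustration, and it uses the non-blocking F_SETLK instead of F_SETLKW so that it can never hang. It takes a POSIX write lock, then tries a non-blocking BSD lock on the same descriptor, and reports whether the second lock was refused.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper, not procmail code.
 * Returns 0 if both locks were taken, 1 if the POSIX lock succeeded but
 * flock() was refused with EWOULDBLOCK/EAGAIN, and -1 on any other error. */
int try_both_locks(const char *path)
{
    struct flock fl = { 0 };
    int fd, result = 0;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Step 1: POSIX write lock over the whole file (l_len == 0 means
     * "from l_start to end of file"). */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    if (fcntl(fd, F_SETLK, &fl) < 0) {
        close(fd);
        return -1;
    }

    /* Step 2: BSD lock on the same descriptor, non-blocking as in the trace. */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0)
        result = (errno == EWOULDBLOCK) ? 1 : -1;
    else
        flock(fd, LOCK_UN);

    /* Step 3: drop the POSIX lock again. */
    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);

    close(fd);
    return result;
}
```

Calling try_both_locks() on a file in a local directory and then on a file in the NFS home directory should show whether the two filesystems treat the lock pair differently; a return value of 1 corresponds to the EAGAIN seen in the strace.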

https://bugzilla.novell.com/show_bug.cgi?id=231845

werner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |aj@novell.com

------- Comment #7 from werner@novell.com 2007-01-08 06:28 MST -------
Andreas? I'd like to get a SWAMPID for this bugzilla for an update of procmail on 10.2.

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #8 from okir@novell.com 2007-01-08 06:57 MST -------
Re comment #6 - this is what I said :-)

First, it does a POSIX lock:

  fcntl64(7, F_SETLKW64, {type=F_WRLCK, }) = 0

Then it tries a BSD lock:

  flock(7, LOCK_EX|LOCK_NB) = -1 EAGAIN

Then it gives up and drops the POSIX lock:

  fcntl64(7, F_SETLK64, {type=F_UNLCK, ...}) = 0

Martin, as you can reproduce this problem, can you please check /proc/pid/status of the hung procmail process and check whether TGID matches the PID or not? If it matches, we may have a different kind of kernel problem here which needs investigating.

I think it's too early to call for an update of procmail yet. Let's first find out what the issue really is.
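
The /proc check requested in comment #8 can also be done programmatically. The following is only a sketch: tgid_matches_pid is a hypothetical helper name, and it assumes the Linux /proc/<pid>/status layout with "Tgid:" and "Pid:" lines. Note that the TGID theory behind this check is withdrawn in comment #11, so the result is of diagnostic interest only.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), for callers checking their own process */

/* Hypothetical helper: returns 1 if Tgid == Pid in /proc/<pid>/status,
 * 0 if they differ, and -1 if the file cannot be read or parsed. */
int tgid_matches_pid(pid_t pid)
{
    char path[64], line[256];
    int tgid = -1, task_pid = -1;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        /* The literal prefix in each format only matches lines that
         * start with it, so "PPid:" and "TracerPid:" are skipped. */
        if (sscanf(line, "Tgid: %d", &tgid) == 1)
            continue;
        sscanf(line, "Pid: %d", &task_pid);
    }
    fclose(fp);

    if (tgid < 0 || task_pid < 0)
        return -1;
    return tgid == task_pid;
}
```

For the hung procmail process one would pass its PID; passing getpid() checks the calling process itself.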

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #9 from richard.bos@xs4all.nl 2007-01-08 13:47 MST -------
If it is nfs related, please have a look at: https://bugzilla.novell.com/show_bug.cgi?id=231234

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #10 from nfbrown@novell.com 2007-01-08 17:53 MST -------
Olaf: I'm confused by your references to TGID. As far as I can see, the 'owner' for flock locks is the address of the 'struct file' (see nfs_flock in fs/nfs/file.c). Where does the TGID come in?

This would imply that getting both an flock and a fcntl lock on the same file over NFS would not have worked since 2.6.12 when flock-over-nfs was introduced.

I assume the same version of NFS is being used with 10.2??

A tcpdump of NFS and LOCK traffic while procmail tries to lock might help.

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #11 from okir@novell.com 2007-01-09 00:54 MST -------
Sorry, I was blabbing incoherently. I was reading the code the wrong way round. Please disregard my comments above.

Yes, a tcpdump would be helpful.

https://bugzilla.novell.com/show_bug.cgi?id=231845

aj@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Info Provider|aj@novell.com               |mvidner@novell.com

------- Comment #12 from aj@novell.com 2007-01-09 08:31 MST -------
Martin, can you provide such a tcpdump, please?

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #13 from ro@novell.com 2007-01-23 06:54 MST -------
ping

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #14 from werner@novell.com 2007-01-23 07:00 MST -------
ditto

https://bugzilla.novell.com/show_bug.cgi?id=231845

mvidner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|mvidner@novell.com          |

------- Comment #15 from mvidner@novell.com 2007-01-31 09:50 MST -------
Created an attachment (id=116610)
 --> (https://bugzilla.novell.com/attachment.cgi?id=116610&action=view)
procmail-231845.tgz

I did a test run first with the buggy version from 10.2 and then one with -DNO_flock_LOCK. I used "tcpdump -u -s0 'port 2049 || port 32770 || port 53107'" where the numbers correspond to services nfs and nlockmgr from rpcinfo -p nfs.suse.cz.

Tarball contents:
pink-elephant.procmailrc: testing rc file
pink-elephant-log: procmail log, was also on NFS
pink-elephant.mail: a testing message
pink-elephant-dump-bad: tcpdump of the unsuccessful attempt
pink-elephant-dump-ok: tcpdump of the successful attempt

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #16 from werner@novell.com 2007-02-01 03:49 MST -------
IMHO procmail isn't buggy, but the kernel is.

https://bugzilla.novell.com/show_bug.cgi?id=231845

werner@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |werner@novell.com
         AssignedTo|werner@novell.com           |kernel-maintainers@forge.provo.novell.com
           Severity|Major                       |Critical
          Component|Network                     |Kernel

------- Comment #17 from werner@novell.com 2007-04-26 07:54 MST -------
This one seems to have been missed by the network kernel people. Please (re)enable both lock mechanisms even if used in a mixed but ordered way.

https://bugzilla.novell.com/show_bug.cgi?id=231845

jeffm@novell.com changed:

           What    |Removed                                   |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-maintainers@forge.provo.novell.com |nfbrown@novell.com

https://bugzilla.novell.com/show_bug.cgi?id=231845

nfbrown@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

------- Comment #18 from nfbrown@novell.com 2007-04-29 19:43 MST -------
Unfortunately the tcpdump is useless - it is always best to use "-w filename" to create a binary dump, and use "-s 0" to capture full packets.

It doesn't really matter though - I managed to do some testing myself and think I can fix it. I have posted a discussion of the problem and a possible solution to the nfs community mailing list to get feedback from people who are more familiar with the code.

The patch below fixes the symptom and I think it is safe, but it is the safety that I want feedback on.

Stay tuned...

diff .prev/fs/lockd/clntproc.c ./fs/lockd/clntproc.c
--- .prev/fs/lockd/clntproc.c	2007-04-30 10:41:01.000000000 +1000
+++ ./fs/lockd/clntproc.c	2007-04-30 11:10:08.000000000 +1000
@@ -134,7 +134,7 @@ static void nlmclnt_setlockargs(struct n
 	lock->oh.len = snprintf(req->a_owner, sizeof(req->a_owner), "%u@%s",
 				(unsigned int)fl->fl_u.nfs_fl.owner->pid,
 				utsname()->nodename);
-	lock->svid = fl->fl_u.nfs_fl.owner->pid;
+	lock->svid = 42;
 	lock->fl.fl_start = fl->fl_start;
 	lock->fl.fl_end = fl->fl_end;
 	lock->fl.fl_type = fl->fl_type;

https://bugzilla.novell.com/show_bug.cgi?id=231845

nfbrown@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |werner@novell.com

------- Comment #19 from nfbrown@novell.com 2007-04-30 00:52 MST -------
Initial feedback is not positive. In fact, that patch is completely broken, and fixing it would be a very substantial effort and probably would not be accepted upstream.

So while I agree that the NFS client is doing the wrong thing, I think that by a long way the easiest fix to the overall problem is to compile procmail with LOCKINGTEST=100.

I notice that this is what is used in the latest version, though 10.2 and 10.0 (at least) have LOCKINGTEST=101, which will cause this problem.

So, much as I would like to fix the kernel, I think this problem has to be resolved either by waiting for 10.3, or by getting LOCKINGTEST=100 back-ported to 10.2.

Werner?

https://bugzilla.novell.com/show_bug.cgi?id=231845

------- Comment #20 from werner@novell.com 2007-04-30 02:28 MST -------
I have no problem backporting this workaround; nevertheless, the kernel should be fixed to accept mixed locks.

Btw: is there a test which could be done to determine whether both locking schemes can be used together?

https://bugzilla.novell.com/show_bug.cgi?id=231845

Dr. Werner Fink <werner@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|werner@novell.com           |

https://bugzilla.novell.com/show_bug.cgi?id=231845#c21

Frank Steiner <steiner-reg@bio.ifi.lmu.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |steiner-reg@bio.ifi.lmu.de

--- Comment #21 from Frank Steiner <steiner-reg@bio.ifi.lmu.de> 2007-07-04 08:43:53 MST ---
This bug has now been introduced in SLES10 SP1. With SLES10, procmail and locking worked fine; after upgrading the NFS server and the NFS client (running procmail) to SLES10 SP1, the same problem with procmail and NFS locking shows up.

There is no update for procmail itself in SP1, so the bug must have been caused by some kernel patch between 2.6.16.27-0.9 (latest SLES10) and 2.6.16.46-0.14 (latest SLES10 SP1).

Hope that helps...

https://bugzilla.novell.com/show_bug.cgi?id=231845#c22

Dr. Werner Fink <werner@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |nfbrown@novell.com

--- Comment #22 from Dr. Werner Fink <werner@novell.com> 2007-07-04 08:54:12 MST ---
Is there any kernel wizard around who knows which patch causes the breakage? As this also occurs in 10.2, it could also be a backport from a newer kernel that introduced this bug.

https://bugzilla.novell.com/show_bug.cgi?id=231845#c23

--- Comment #23 from Neil Brown <nfbrown@novell.com> 2007-07-08 19:59:58 MST ---
I would be very surprised if this bug affects SLES10-SP1 but not SLES10-GA (at least if it is a kernel bug).

The problem is that since 2.6.12, the Linux NFS client simulates "flock" locks using "fcntl" locks on the server. This works fine *unless* a client tries to take both an "flock" and an "fcntl lock" on the same file at the same time and expects them both to succeed. This cannot work with the current Linux NFS implementation as the server will reject the second lock as it sees a conflict (for local filesystems, the locks are in separate address spaces, and no conflict is seen).

So this will affect any kernel since 2.6.12 (which includes SLES10) running with any program that tries to get both a flock lock and an fcntl lock, which procmail does when compiled with LOCKINGTEST=101, but not when compiled with LOCKINGTEST=100.

The best way to fix this problem for any given release is to change procmail to use LOCKINGTEST=100.

To answer comment #20, you simply need to test if you can lock a file both ways at once. Something like:

#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
	int fd = open("afile", O_RDWR|O_CREAT, 0600);
	if (fd < 0) {
		perror("Cannot open afile");
		exit(2);
	}
	/* BSD lock first ... */
	if (flock(fd, LOCK_EX) < 0) {
		perror("Cannot get flock at all!!");
		exit(2);
	}
	/* ... then a non-blocking POSIX lock on the same file. */
	if (lockf(fd, F_TLOCK, 0) < 0) {
		perror("Cannot get lockf while holding flock");
		exit(1);
	}
	printf("Successfully got both locks\n");
	exit(0);
}

Tested and seems to work. When run in an NFS directory I get:

  Cannot get lockf while holding flock: Resource temporarily unavailable

https://bugzilla.novell.com/show_bug.cgi?id=231845#c24

--- Comment #24 from Frank Steiner <steiner-reg@bio.ifi.lmu.de> 2007-07-09 01:37:18 MST ---
(In reply to comment #23 from Neil F Brown)
> I would be very surprised if this bug affects SLES10-SP1 but not SLES10-GA (at least if it is a kernel bug).

That's for sure. We still have a SuSE 10.1 server running (and as far as I can see, the kernels for 10.1 and SLES10 GA are more or less identical?), and there we still have the original procmail from 10.1 (both procmail-3.22-56 on 10.1 and SLES10), and it works fine with the locks. On the SLES10 server it fails with the original SLES10 procmail after upgrading to SP1.

I have now recompiled procmail with LOCKINGTEST=100 to make it work. Given your explanation, it sounds reasonable that the locking test fails, so I guess it would be a good idea to release an updated procmail for SLES10 and SuSE 10.1/2.

https://bugzilla.novell.com/show_bug.cgi?id=231845#c25

--- Comment #25 from Neil Brown <nfbrown@novell.com> 2007-07-12 22:18:42 MST ---
I figured out why there wasn't a problem on SLES10-GA. There was a bug in the flock handling (fixed for bug #182783) that made the fcntl lock and the flock look the same to the NFS server. That was fixed in March this year, so the fix was in -SP1 but not -GA.

https://bugzilla.novell.com/show_bug.cgi?id=231845#c26

Dr. Werner Fink <werner@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Info Provider|nfbrown@novell.com          |ast@novell.com

--- Comment #26 from Dr. Werner Fink <werner@novell.com> 2007-07-27 09:34:42 MST ---
I've submitted a changed procmail for SLES10-SP1 (now) and openSUSE 10.2 (the latter since January 8th, 2007) ... for this a SWAMPID is required.

https://bugzilla.novell.com/show_bug.cgi?id=231845#c28

Anja Stock <ast@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |NTS_Public

--- Comment #28 from Anja Stock <ast@novell.com> 2007-08-01 04:25:35 MST ---
setting keyword

https://bugzilla.novell.com/show_bug.cgi?id=231845#c29

--- Comment #29 from Anja Stock <ast@novell.com> 2007-08-29 03:26:55 MST ---
released

https://bugzilla.novell.com/show_bug.cgi?id=231845#c30

Neil Brown <nfbrown@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #30 from Neil Brown <nfbrown@novell.com> 2007-11-29 18:00:03 MST ---
Resolving as FIXED as it is not only fixed but released.