[Bug 467161] New: lam-mpi segfaults if nscd is not running; EINPROGRESS not handled in libc somewhere
https://bugzilla.novell.com/show_bug.cgi?id=467161

           Summary: lam-mpi segfaults if nscd is not running; EINPROGRESS not
                    handled in libc somewhere
    Classification: openSUSE
           Product: openSUSE 11.2
           Version: unspecified
          Platform: x86-64
        OS/Version: SuSE Other
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Basesystem
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: harbaugh@ncifcrf.gov
         QAContact: qa@suse.de
          Found By: ---

User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14

When nscd is not running, even a simple LAM-MPI 'hello world' program
segfaults within the call 'getpwuid(getuid())', which should not happen.
strace shows that the segfault happens after receiving an EINPROGRESS return
from ldap communication. After restarting nscd, the LAM-MPI 'hello world'
program runs without error. The problem appears not to be in LAM-MPI itself,
but in libc somewhere.

Reproducible: Always

Steps to Reproduce:
1. stop nscd if it is running ('/etc/init.d/nscd stop')
2. set coredump rlimit to unlimited ('ulimit -c unlimited')
3. lamboot -v
4. hcc -o lam_hello lam_hello.c -lmpi
5. mpirun -np 1 ./lam_hello
   step 5 produces core dump
6. mpirun -np 1 strace ./lam_hello > lam_hello.out 2>&1
   step 6 captures strace output
7. lamhalt

Actual Results:
coredump

$ gdb ./lam_hello core
. . .
Loaded symbols for /lib64/libnss_dns.so.2
Core was generated by `./lam_hello'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
#1 0x00007f723ff86e38 in realloc () from /lib64/libc.so.6
#2 0x00007f723e4133a9 in CRYPTO_realloc () from /usr/lib64/libcrypto.so.0.9.8
#3 0x00007f723e46cb92 in lh_insert () from /usr/lib64/libcrypto.so.0.9.8
#4 0x00007f723e4700ea in ?? () from /usr/lib64/libcrypto.so.0.9.8
#5 0x00007f723e4702d7 in ?? () from /usr/lib64/libcrypto.so.0.9.8
#6 0x00007f723e46f73c in ERR_load_ERR_strings () from /usr/lib64/libcrypto.so.0.9.8
#7 0x00007f723e470339 in ERR_load_crypto_strings () from /usr/lib64/libcrypto.so.0.9.8
#8 0x00007f723e75c2f9 in SSL_load_error_strings () from /usr/lib64/libssl.so.0.9.8
#9 0x00007f723f6bc03c in ldap_pvt_tls_init () from /usr/lib64/libldap-2.4.so.2
#10 0x00007f723f6bc951 in ldap_int_tls_start () from /usr/lib64/libldap-2.4.so.2
#11 0x00007f723f8d4580 in ?? () from /lib64/libnss_ldap.so.2
#12 0x00007f723f8d4d14 in ?? () from /lib64/libnss_ldap.so.2
#13 0x00007f723f8d553e in ?? () from /lib64/libnss_ldap.so.2
#14 0x00007f723f8d5bcf in ?? () from /lib64/libnss_ldap.so.2
#15 0x00007f723f8d61b9 in _nss_ldap_getpwuid_r () from /lib64/libnss_ldap.so.2
#16 0x00007f723fd08ab8 in ?? () from /lib64/libnss_compat.so.2
#17 0x00007f723fd08cad in ?? () from /lib64/libnss_compat.so.2
#18 0x00007f723fd09040 in _nss_compat_getpwuid_r () from /lib64/libnss_compat.so.2
#19 0x00007f723ffaecfc in getpwuid_r () from /lib64/libc.so.6
#20 0x00007f723ffae55f in getpwuid () from /lib64/libc.so.6
#21 0x00007f72408a8f13 in lam_tmpdir_init_opt () from /usr/lib64/liblam.so.0
#22 0x00007f72408b27d8 in _cio_init () from /usr/lib64/liblam.so.0
#23 0x00007f72408b3099 in _cipc_init () from /usr/lib64/liblam.so.0
#24 0x00007f72408b3b62 in kinit () from /usr/lib64/liblam.so.0
#25 0x00007f72408b38ab in kenter () from /usr/lib64/liblam.so.0
#26 0x00007f7240d308da in lam_linit () from /usr/lib64/libmpi.so.0
#27 0x00007f7240d32870 in lam_mpi_init () from /usr/lib64/libmpi.so.0
#28 0x00007f7240d2bc63 in MPI_Init () from /usr/lib64/libmpi.so.0
#29 0x0000000000400898 in main ()
(gdb) quit
Quitting: You can't do that without a process to debug.

Expected Results:
after restarting nscd, 'hello world' runs:

$ sudo /etc/init.d/nscd start
Starting Name Service Cache Daemon    done
$ mpirun -np 1 ./lam_hello
From process: 0 out of 1, Hello World!
here is the tail of the strace output from step 6 of the 'steps to reproduce', showing communication with the ldap server

$ tail -30 lam_hello.out
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.85")}, 16) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(43102), sin_addr=inet_addr("129.43.63.154")}, [16]) = 0
close(4) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.86")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP}], 1, 30000) = 1 ([{fd=4, revents=POLLOUT}])
getpeername(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.86")}, [16]) = 0
fcntl(4, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(4, F_SETFL, O_RDWR) = 0
write(4, "0\35\2\1\1w\30\200\0261.3.6.1.4.1.1466.20037", 31) = 31
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 30000) = 1 ([{fd=4, revents=POLLIN}])
read(4, "0\f\2\1\1x\7\n", 8) = 8
read(4, "\1\0\4\0\4\0", 6) = 6
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

Again, here is the stack trace of the core dump

Program terminated with signal 11, Segmentation fault.
#0 0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
#1 0x00007f723ff86e38 in realloc () from /lib64/libc.so.6
#2 0x00007f723e4133a9 in CRYPTO_realloc () from /usr/lib64/libcrypto.so.0.9.8
#3 0x00007f723e46cb92 in lh_insert () from /usr/lib64/libcrypto.so.0.9.8
#4 0x00007f723e4700ea in ?? () from /usr/lib64/libcrypto.so.0.9.8
#5 0x00007f723e4702d7 in ?? () from /usr/lib64/libcrypto.so.0.9.8
#6 0x00007f723e46f73c in ERR_load_ERR_strings () from /usr/lib64/libcrypto.so.0.9.8
#7 0x00007f723e470339 in ERR_load_crypto_strings () from /usr/lib64/libcrypto.so.0.9.8
#8 0x00007f723e75c2f9 in SSL_load_error_strings () from /usr/lib64/libssl.so.0.9.8
#9 0x00007f723f6bc03c in ldap_pvt_tls_init () from /usr/lib64/libldap-2.4.so.2
#10 0x00007f723f6bc951 in ldap_int_tls_start () from /usr/lib64/libldap-2.4.so.2
#11 0x00007f723f8d4580 in ?? () from /lib64/libnss_ldap.so.2
#12 0x00007f723f8d4d14 in ?? () from /lib64/libnss_ldap.so.2
#13 0x00007f723f8d553e in ?? () from /lib64/libnss_ldap.so.2
#14 0x00007f723f8d5bcf in ?? () from /lib64/libnss_ldap.so.2
#15 0x00007f723f8d61b9 in _nss_ldap_getpwuid_r () from /lib64/libnss_ldap.so.2
#16 0x00007f723fd08ab8 in ?? () from /lib64/libnss_compat.so.2
#17 0x00007f723fd08cad in ?? () from /lib64/libnss_compat.so.2
#18 0x00007f723fd09040 in _nss_compat_getpwuid_r () from /lib64/libnss_compat.so.2
#19 0x00007f723ffaecfc in getpwuid_r () from /lib64/libc.so.6
#20 0x00007f723ffae55f in getpwuid () from /lib64/libc.so.6
#21 0x00007f72408a8f13 in lam_tmpdir_init_opt () from /usr/lib64/liblam.so.0
#22 0x00007f72408b27d8 in _cio_init () from /usr/lib64/liblam.so.0
#23 0x00007f72408b3099 in _cipc_init () from /usr/lib64/liblam.so.0
#24 0x00007f72408b3b62 in kinit () from /usr/lib64/liblam.so.0
#25 0x00007f72408b38ab in kenter () from /usr/lib64/liblam.so.0
#26 0x00007f7240d308da in lam_linit () from /usr/lib64/libmpi.so.0
#27 0x00007f7240d32870 in lam_mpi_init () from /usr/lib64/libmpi.so.0
#28 0x00007f7240d2bc63 in MPI_Init () from /usr/lib64/libmpi.so.0
#29 0x0000000000400898 in main ()
(gdb) quit

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=467161

User harbaugh@ncifcrf.gov added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c1

Toni Harbaugh-Blackford <harbaugh@ncifcrf.gov> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |harbaugh@ncifcrf.gov

--- Comment #1 from Toni Harbaugh-Blackford <harbaugh@ncifcrf.gov> 2009-01-17 07:28:11 MST ---
This happens in SLES 11 RC1 also, so hopefully we can get it patched before SLES 11 is GA?

Thanks,
Toni
https://bugzilla.novell.com/show_bug.cgi?id=467161

Toni Harbaugh-Blackford <harbaugh@ncifcrf.gov> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P1 - Urgent
https://bugzilla.novell.com/show_bug.cgi?id=467161

Cyril Hrubis <chrubis@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|bnc-team-screening@forge.pr |pbaudis@novell.com
                   |ovo.novell.com              |
https://bugzilla.novell.com/show_bug.cgi?id=467161

User pbaudis@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c2

Petr Baudis <pbaudis@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pbaudis@novell.com
         AssignedTo|pbaudis@novell.com          |rhafer@novell.com

--- Comment #2 from Petr Baudis <pbaudis@novell.com> 2009-02-13 04:07:32 MST ---
nss_ldap -> Ralf
https://bugzilla.novell.com/show_bug.cgi?id=467161

User milisav.radmanic@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c4

--- Comment #4 from Milisav Radmanic <milisav.radmanic@novell.com> 2009-03-02 06:22:53 MST ---
(In reply to comment #1)
> This happens in SLES 11 RC1 also, so hopefully we can get it patched before SLES 11 is GA?
>
> Thanks, Toni

How did you test this for SLES 11 RC1? There is no maintained lam-package available for SLES 11. Furthermore the code branch for openSUSE 11.2 isn't even in Alpha state, yet. And on OpenSUSE 11.1 (where the lam-package is available) the incidence can't be reproduced as described.

regards
Milisav
https://bugzilla.novell.com/show_bug.cgi?id=467161

User harbaugh@ncifcrf.gov added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c5

--- Comment #5 from Toni Harbaugh-Blackford <harbaugh@ncifcrf.gov> 2009-03-02 06:37:00 MST ---
I used openmpi on SLES instead of lam, with the same results

I disabled ldap and used 'plain' /etc/passwd, with the same results.
https://bugzilla.novell.com/show_bug.cgi?id=467161

User harbaugh@ncifcrf.gov added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c6

--- Comment #6 from Toni Harbaugh-Blackford <harbaugh@ncifcrf.gov> 2009-03-02 06:37:51 MST ---
I have not tested SLES 11 since RC1.
https://bugzilla.novell.com/show_bug.cgi?id=467161

User milisav.radmanic@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c7

--- Comment #7 from Milisav Radmanic <milisav.radmanic@novell.com> 2009-03-02 07:21:04 MST ---
(In reply to comment #5)
> I used openmpi on SLES instead of lam, with the same results
>
> I disabled ldap and used 'plain' /etc/passwd, with the same results.

How do you use mpirun without lam? Can you please describe how to now reproduce the error? If I use mpicc to compile a hello world example like this:

/*
 * Sample hello world MPI program for testing MPI.
 */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Get some info about MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Print out the canonical "hello world" message */
    printf("Hello, world! I am %d of %d\n", rank, size);

    /* All done */
    MPI_Finalize();
    return 0;
}

I can run it with mpirun without the error above.

Thanks
Milisav
https://bugzilla.novell.com/show_bug.cgi?id=467161

User harbaugh@ncifcrf.gov added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c8

--- Comment #8 from Toni Harbaugh-Blackford <harbaugh@ncifcrf.gov> 2009-03-02 07:39:18 MST ---
steps to reproduce with openmpi

1) make sure openmpi binaries are in $PATH and libs are in $LD_LIBRARY_PATH
2) /etc/init.d/nscd stop
3) mpicc -o hello hello.c
4) mpirun -np 1 ./hello

You *must* stop nscd; if nscd is running the program will work. If nscd is not running the program will segfault in getpwuid_r()

Again, I have not tested this for SLES 11 RC4, just RC1
https://bugzilla.novell.com/show_bug.cgi?id=467161

User jjolly@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c10

--- Comment #10 from John Jolly <jjolly@novell.com> 2009-03-16 07:24:27 MST ---
This seems to be a problem with getpwuid, but only with the OpenMPI build. I am trying to track down the problem with the build right now.

This seems to be fixed in the 1.3 OpenMPI, but as this problem is in SLES10SP2, I am unable to update to a new version.
https://bugzilla.novell.com/show_bug.cgi?id=467161

John Jolly <jjolly@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED