https://bugzilla.novell.com/show_bug.cgi?id=467161

Summary: lam-mpi segfaults if nscd is not running; EINPROGRESS not handled in libc somewhere
Classification: openSUSE
Product: openSUSE 11.2
Version: unspecified
Platform: x86-64
OS/Version: SuSE Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Basesystem
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: harbaugh@ncifcrf.gov
QAContact: qa@suse.de
Found By: ---

User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14

When nscd is not running, even a simple LAM-MPI 'hello world' program segfaults within the call 'getpwuid(getuid())', which should not happen. strace shows that the segfault happens after an EINPROGRESS return during communication with the LDAP server. After restarting nscd, the LAM-MPI 'hello world' program runs without error. The problem appears not to be in LAM-MPI itself, but in libc somewhere.

Reproducible: Always

Steps to Reproduce:
1. stop nscd if it is running ('/etc/init.d/nscd stop')
2. set coredump rlimit to unlimited ('ulimit -c unlimited')
3. lamboot -v
4. hcc -o lam_hello lam_hello.c -lmpi
5. mpirun -np 1 ./lam_hello
   step 5 produces core dump
6. mpirun -np 1 strace ./lam_hello > lam_hello.out 2>&1
   step 6 captures strace output
7. lamhalt

Actual Results:
coredump

$ gdb ./lam_hello core
. . .
Loaded symbols for /lib64/libnss_dns.so.2
Core was generated by `./lam_hello'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
#1  0x00007f723ff86e38 in realloc () from /lib64/libc.so.6
#2  0x00007f723e4133a9 in CRYPTO_realloc () from /usr/lib64/libcrypto.so.0.9.8
#3  0x00007f723e46cb92 in lh_insert () from /usr/lib64/libcrypto.so.0.9.8
#4  0x00007f723e4700ea in ?? () from /usr/lib64/libcrypto.so.0.9.8
#5  0x00007f723e4702d7 in ?? () from /usr/lib64/libcrypto.so.0.9.8
#6  0x00007f723e46f73c in ERR_load_ERR_strings () from /usr/lib64/libcrypto.so.0.9.8
#7  0x00007f723e470339 in ERR_load_crypto_strings () from /usr/lib64/libcrypto.so.0.9.8
#8  0x00007f723e75c2f9 in SSL_load_error_strings () from /usr/lib64/libssl.so.0.9.8
#9  0x00007f723f6bc03c in ldap_pvt_tls_init () from /usr/lib64/libldap-2.4.so.2
#10 0x00007f723f6bc951 in ldap_int_tls_start () from /usr/lib64/libldap-2.4.so.2
#11 0x00007f723f8d4580 in ?? () from /lib64/libnss_ldap.so.2
#12 0x00007f723f8d4d14 in ?? () from /lib64/libnss_ldap.so.2
#13 0x00007f723f8d553e in ?? () from /lib64/libnss_ldap.so.2
#14 0x00007f723f8d5bcf in ?? () from /lib64/libnss_ldap.so.2
#15 0x00007f723f8d61b9 in _nss_ldap_getpwuid_r () from /lib64/libnss_ldap.so.2
#16 0x00007f723fd08ab8 in ?? () from /lib64/libnss_compat.so.2
#17 0x00007f723fd08cad in ?? () from /lib64/libnss_compat.so.2
#18 0x00007f723fd09040 in _nss_compat_getpwuid_r () from /lib64/libnss_compat.so.2
#19 0x00007f723ffaecfc in getpwuid_r () from /lib64/libc.so.6
#20 0x00007f723ffae55f in getpwuid () from /lib64/libc.so.6
#21 0x00007f72408a8f13 in lam_tmpdir_init_opt () from /usr/lib64/liblam.so.0
#22 0x00007f72408b27d8 in _cio_init () from /usr/lib64/liblam.so.0
#23 0x00007f72408b3099 in _cipc_init () from /usr/lib64/liblam.so.0
#24 0x00007f72408b3b62 in kinit () from /usr/lib64/liblam.so.0
#25 0x00007f72408b38ab in kenter () from /usr/lib64/liblam.so.0
#26 0x00007f7240d308da in lam_linit () from /usr/lib64/libmpi.so.0
#27 0x00007f7240d32870 in lam_mpi_init () from /usr/lib64/libmpi.so.0
#28 0x00007f7240d2bc63 in MPI_Init () from /usr/lib64/libmpi.so.0
#29 0x0000000000400898 in main ()
(gdb) quit
Quitting: You can't do that without a process to debug.

Expected Results:
after restarting nscd, 'hello world' runs:

$ sudo /etc/init.d/nscd start
Starting Name Service Cache Daemon done
$ mpirun -np 1 ./lam_hello
From process: 0 out of 1, Hello World!
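
Since the backtrace puts the crash underneath getpwuid() (frame #20), a stand-alone check without MPI might help narrow things down. The following is only a minimal sketch under that assumption: the helper name lookup_current_user is hypothetical and is not part of LAM-MPI or of the original test program, and whether it reproduces the segfault with nscd stopped has not been verified.

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not taken from LAM-MPI): performs the same lookup
 * that the backtrace shows lam_tmpdir_init_opt() making, i.e.
 * getpwuid(getuid()), which goes through the NSS modules configured in
 * /etc/nsswitch.conf.
 * Returns 0 and prints the user name if an entry was found,
 * -1 if the lookup returned no entry. */
int lookup_current_user(void)
{
    uid_t uid = getuid();
    struct passwd *pw = getpwuid(uid);

    if (pw == NULL) {
        fprintf(stderr, "getpwuid(%ld) returned no entry\n", (long)uid);
        return -1;
    }
    printf("uid %ld -> %s\n", (long)pw->pw_uid, pw->pw_name);
    return 0;
}
```

Calling lookup_current_user() from a small main(), once with nscd stopped and once with it running, would show whether the lookup alone is enough to trigger the problem.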
Here is the tail of the strace output from step 6 of the 'steps to reproduce', showing communication with the ldap server:

$ tail -30 lam_hello.out
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.85")}, 16) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(43102), sin_addr=inet_addr("129.43.63.154")}, [16]) = 0
close(4) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.86")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP}], 1, 30000) = 1 ([{fd=4, revents=POLLOUT}])
getpeername(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.86")}, [16]) = 0
fcntl(4, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(4, F_SETFL, O_RDWR) = 0
write(4, "0\35\2\1\1w\30\200\0261.3.6.1.4.1.1466.20037", 31) = 31
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 30000) = 1 ([{fd=4, revents=POLLIN}])
read(4, "0\f\2\1\1x\7\n", 8) = 8
read(4, "\1\0\4\0\4\0", 6) = 6
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

Again, the stack trace of the core dump is identical to the gdb backtrace in the original report above: signal 11 in libc below realloc(), reached from getpwuid() via _nss_ldap_getpwuid_r() and ldap_int_tls_start() (frames #0 through #29).
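
For reference, the connect()/poll() sequence in the strace excerpt above follows the usual pattern for a non-blocking connect with a timeout: EINPROGRESS from connect() is not a failure by itself, and the outcome is collected after poll() reports the socket writable. The sketch below illustrates that conventional pattern only; it is not the code used by libldap or libc, the names connect_with_timeout and connect_ipv4 are hypothetical, and unlike the traced code (which calls getpeername()) it reads the result via SO_ERROR.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: connect fd to addr, waiting at most timeout_ms.
 * Returns 0 on success, -1 on failure with errno set.
 * EINTR from poll() is reported as a failure in this sketch. */
int connect_with_timeout(int fd, const struct sockaddr *addr,
                         socklen_t len, int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    int rc = connect(fd, addr, len);
    if (rc < 0 && errno == EINPROGRESS) {
        /* The connection attempt continues in the background. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        rc = poll(&pfd, 1, timeout_ms);
        if (rc == 0) {                 /* nothing happened in time */
            errno = ETIMEDOUT;
            rc = -1;
        } else if (rc > 0) {           /* attempt finished: fetch its result */
            int err = 0;
            socklen_t errlen = sizeof err;
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0) {
                rc = -1;
            } else if (err != 0) {
                errno = err;
                rc = -1;
            } else {
                rc = 0;
            }
        }
        /* rc < 0 here means poll() itself failed; errno is already set. */
    }

    int saved = errno;
    fcntl(fd, F_SETFL, flags);         /* restore the original mode, as the trace does */
    errno = saved;
    return rc;
}

/* Hypothetical convenience wrapper: open a TCP socket and connect it to
 * ip:port. Returns the connected descriptor, or -1 on failure. */
int connect_ipv4(const char *ip, unsigned short port, int timeout_ms)
{
    struct sockaddr_in sa = {0};
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (connect_with_timeout(fd, (struct sockaddr *)&sa, sizeof sa,
                             timeout_ms) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

In the trace, poll() returns POLLOUT and the subsequent write() and read() calls succeed, so the sequence matches this pattern up to the point of the SIGSEGV.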