[Bug 467161] New: lam-mpi segfaults if nscd is not running; EINPROGRESS not handled in libc somewhere
https://bugzilla.novell.com/show_bug.cgi?id=467161

           Summary: lam-mpi segfaults if nscd is not running; EINPROGRESS not
                    handled in libc somewhere
    Classification: openSUSE
           Product: openSUSE 11.2
           Version: unspecified
          Platform: x86-64
        OS/Version: SuSE Other
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Basesystem
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: harbaugh@ncifcrf.gov
         QAContact: qa@suse.de
          Found By: ---

User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14

When nscd is not running, even a simple LAM-MPI 'hello world' program
segfaults within the call 'getpwuid(getuid())', which should not happen.
strace shows that the segfault happens after receiving an EINPROGRESS return
from ldap communication. After restarting nscd, the LAM-MPI 'hello world'
program runs without error.

The problem appears not to be in LAM-MPI itself, but in libc somewhere.

Reproducible: Always

Steps to Reproduce:
1. stop nscd if it is running ('/etc/init.d/nscd stop')
2. set coredump rlimit to unlimited ('ulimit -c unlimited')
3. lamboot -v
4. hcc -o lam_hello lam_hello.c -lmpi
5. mpirun -np 1 ./lam_hello
   step 5 produces core dump
6. mpirun -np 1 strace ./lam_hello > lam_hello.out 2>&1
   step 6 captures strace output
7. lamhalt

Actual Results:
coredump

$ gdb ./lam_hello core
. . .
Loaded symbols for /lib64/libnss_dns.so.2
Core was generated by `./lam_hello'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
#1  0x00007f723ff86e38 in realloc () from /lib64/libc.so.6
#2  0x00007f723e4133a9 in CRYPTO_realloc () from /usr/lib64/libcrypto.so.0.9.8
#3  0x00007f723e46cb92 in lh_insert () from /usr/lib64/libcrypto.so.0.9.8
#4  0x00007f723e4700ea in ?? () from /usr/lib64/libcrypto.so.0.9.8
#5  0x00007f723e4702d7 in ?? () from /usr/lib64/libcrypto.so.0.9.8
#6  0x00007f723e46f73c in ERR_load_ERR_strings () from /usr/lib64/libcrypto.so.0.9.8
#7  0x00007f723e470339 in ERR_load_crypto_strings () from /usr/lib64/libcrypto.so.0.9.8
#8  0x00007f723e75c2f9 in SSL_load_error_strings () from /usr/lib64/libssl.so.0.9.8
#9  0x00007f723f6bc03c in ldap_pvt_tls_init () from /usr/lib64/libldap-2.4.so.2
#10 0x00007f723f6bc951 in ldap_int_tls_start () from /usr/lib64/libldap-2.4.so.2
#11 0x00007f723f8d4580 in ?? () from /lib64/libnss_ldap.so.2
#12 0x00007f723f8d4d14 in ?? () from /lib64/libnss_ldap.so.2
#13 0x00007f723f8d553e in ?? () from /lib64/libnss_ldap.so.2
#14 0x00007f723f8d5bcf in ?? () from /lib64/libnss_ldap.so.2
#15 0x00007f723f8d61b9 in _nss_ldap_getpwuid_r () from /lib64/libnss_ldap.so.2
#16 0x00007f723fd08ab8 in ?? () from /lib64/libnss_compat.so.2
#17 0x00007f723fd08cad in ?? () from /lib64/libnss_compat.so.2
#18 0x00007f723fd09040 in _nss_compat_getpwuid_r () from /lib64/libnss_compat.so.2
#19 0x00007f723ffaecfc in getpwuid_r () from /lib64/libc.so.6
#20 0x00007f723ffae55f in getpwuid () from /lib64/libc.so.6
#21 0x00007f72408a8f13 in lam_tmpdir_init_opt () from /usr/lib64/liblam.so.0
#22 0x00007f72408b27d8 in _cio_init () from /usr/lib64/liblam.so.0
#23 0x00007f72408b3099 in _cipc_init () from /usr/lib64/liblam.so.0
#24 0x00007f72408b3b62 in kinit () from /usr/lib64/liblam.so.0
#25 0x00007f72408b38ab in kenter () from /usr/lib64/liblam.so.0
#26 0x00007f7240d308da in lam_linit () from /usr/lib64/libmpi.so.0
#27 0x00007f7240d32870 in lam_mpi_init () from /usr/lib64/libmpi.so.0
#28 0x00007f7240d2bc63 in MPI_Init () from /usr/lib64/libmpi.so.0
#29 0x0000000000400898 in main ()
(gdb) quit
Quitting: You can't do that without a process to debug.

Expected Results:
after restarting nscd, 'hello world' runs:

$ sudo /etc/init.d/nscd start
Starting Name Service Cache Daemon                                   done
$ mpirun -np 1 ./lam_hello
From process: 0 out of 1, Hello World!
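
The report places the crash inside 'getpwuid(getuid())', which the backtrace shows being reached from lam_tmpdir_init_opt(). As a rough sketch, not part of the original report, the same lookup can be exercised without MPI to see whether LAM is needed at all to trigger it. The helper names lookup_uid and report_current_user are made up for illustration.

```c
/* Hypothetical standalone check, not from the original report: performs the
 * same getpwuid() lookup that appears in the backtrace, but without MPI. */
#include <assert.h>
#include <errno.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 if an entry was found (login name copied into name),
 * 1 if getpwuid() returned NULL with errno still 0,
 * and -1 if it returned NULL with errno set. */
int lookup_uid(uid_t uid, char *name, size_t name_len)
{
    struct passwd *pw;

    assert(name != NULL && name_len > 0);

    errno = 0;              /* cleared so a NULL return can be classified */
    pw = getpwuid(uid);     /* resolved through NSS (files, compat, ldap, ...) */
    if (pw == NULL)
        return errno == 0 ? 1 : -1;

    snprintf(name, name_len, "%s", pw->pw_name);
    return 0;
}

/* Prints the outcome for the calling user and returns lookup_uid()'s result. */
int report_current_user(void)
{
    char name[256] = "";
    int rc = lookup_uid(getuid(), name, sizeof name);

    if (rc == 0)
        printf("uid %ld -> %s\n", (long)getuid(), name);
    else if (rc == 1)
        printf("uid %ld: no passwd entry\n", (long)getuid());
    else
        perror("getpwuid");
    return rc;
}
```

A main() that simply returns report_current_user() is enough to run this with nscd stopped and again with nscd started. If the standalone lookup also crashes, that would point away from LAM-MPI; if it does not, the state of the MPI process at the time of the call probably matters.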
here is the tail of the strace output from step 6 of the 'steps to reproduce',
showing communication with the ldap server

$ tail -30 lam_hello.out
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.85")}, 16) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(43102), sin_addr=inet_addr("129.43.63.154")}, [16]) = 0
close(4) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.86")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP}], 1, 30000) = 1 ([{fd=4, revents=POLLOUT}])
getpeername(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.86")}, [16]) = 0
fcntl(4, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(4, F_SETFL, O_RDWR) = 0
write(4, "0\35\2\1\1w\30\200\0261.3.6.1.4.1.1466.20037", 31) = 31
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 30000) = 1 ([{fd=4, revents=POLLIN}])
read(4, "0\f\2\1\1x\7\n", 8) = 8
read(4, "\1\0\4\0\4\0", 6) = 6
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

Again, here is the stack trace of the core dump

Program terminated with signal 11, Segmentation fault.
#0  0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
#1  0x00007f723ff86e38 in realloc () from /lib64/libc.so.6
#2  0x00007f723e4133a9 in CRYPTO_realloc () from /usr/lib64/libcrypto.so.0.9.8
#3  0x00007f723e46cb92 in lh_insert () from /usr/lib64/libcrypto.so.0.9.8
#4  0x00007f723e4700ea in ?? () from /usr/lib64/libcrypto.so.0.9.8
#5  0x00007f723e4702d7 in ?? () from /usr/lib64/libcrypto.so.0.9.8
#6  0x00007f723e46f73c in ERR_load_ERR_strings () from /usr/lib64/libcrypto.so.0.9.8
#7  0x00007f723e470339 in ERR_load_crypto_strings () from /usr/lib64/libcrypto.so.0.9.8
#8  0x00007f723e75c2f9 in SSL_load_error_strings () from /usr/lib64/libssl.so.0.9.8
#9  0x00007f723f6bc03c in ldap_pvt_tls_init () from /usr/lib64/libldap-2.4.so.2
#10 0x00007f723f6bc951 in ldap_int_tls_start () from /usr/lib64/libldap-2.4.so.2
#11 0x00007f723f8d4580 in ?? () from /lib64/libnss_ldap.so.2
#12 0x00007f723f8d4d14 in ?? () from /lib64/libnss_ldap.so.2
#13 0x00007f723f8d553e in ?? () from /lib64/libnss_ldap.so.2
#14 0x00007f723f8d5bcf in ?? () from /lib64/libnss_ldap.so.2
#15 0x00007f723f8d61b9 in _nss_ldap_getpwuid_r () from /lib64/libnss_ldap.so.2
#16 0x00007f723fd08ab8 in ?? () from /lib64/libnss_compat.so.2
#17 0x00007f723fd08cad in ?? () from /lib64/libnss_compat.so.2
#18 0x00007f723fd09040 in _nss_compat_getpwuid_r () from /lib64/libnss_compat.so.2
#19 0x00007f723ffaecfc in getpwuid_r () from /lib64/libc.so.6
#20 0x00007f723ffae55f in getpwuid () from /lib64/libc.so.6
#21 0x00007f72408a8f13 in lam_tmpdir_init_opt () from /usr/lib64/liblam.so.0
#22 0x00007f72408b27d8 in _cio_init () from /usr/lib64/liblam.so.0
#23 0x00007f72408b3099 in _cipc_init () from /usr/lib64/liblam.so.0
#24 0x00007f72408b3b62 in kinit () from /usr/lib64/liblam.so.0
#25 0x00007f72408b38ab in kenter () from /usr/lib64/liblam.so.0
#26 0x00007f7240d308da in lam_linit () from /usr/lib64/libmpi.so.0
#27 0x00007f7240d32870 in lam_mpi_init () from /usr/lib64/libmpi.so.0
#28 0x00007f7240d2bc63 in MPI_Init () from /usr/lib64/libmpi.so.0
#29 0x0000000000400898 in main ()
(gdb) quit

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
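
In the strace tail, connect() returns EINPROGRESS on a socket that was just switched to O_NONBLOCK, poll() then reports POLLOUT, and getpeername() succeeds before the socket is set back to blocking mode. For a non-blocking socket, that return value is expected and is not by itself an error. The following is a minimal sketch of that usual pattern, for comparison with the trace. It is an illustration only, not code from libldap or libc; the name connect_with_timeout is hypothetical, and it checks SO_ERROR where the traced code used getpeername().

```c
/* Sketch of the usual non-blocking connect pattern; illustration only,
 * not taken from libldap or libc. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>   /* struct sockaddr_in, for callers using IPv4 */
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Connects to addr, waiting at most timeout_ms for completion.
 * Returns a connected socket in blocking mode, or -1 on failure. */
int connect_with_timeout(const struct sockaddr *addr, socklen_t addr_len,
                         int timeout_ms)
{
    int fd, flags, err = 0;
    socklen_t err_len = sizeof err;
    struct pollfd pfd;

    assert(addr != NULL);

    fd = socket(addr->sa_family, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    if (connect(fd, addr, addr_len) < 0) {
        if (errno != EINPROGRESS)       /* anything else is a real failure */
            goto fail;

        /* EINPROGRESS: the handshake is still running; wait until the
         * socket becomes writable or the timeout expires. */
        pfd.fd = fd;
        pfd.events = POLLOUT;
        pfd.revents = 0;
        if (poll(&pfd, 1, timeout_ms) != 1)
            goto fail;                  /* timeout or poll() error */

        /* Writable does not imply connected: read the final status. */
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &err_len) < 0 ||
            err != 0)
            goto fail;
    }

    if (fcntl(fd, F_SETFL, flags) < 0)  /* restore the original flags */
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

Seen this way, the trace suggests the connection itself completed: the write() and both read() calls on fd 4 succeed after the EINPROGRESS, and the backtrace ends in realloc() under SSL_load_error_strings() rather than in any socket call.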
User harbaugh@ncifcrf.gov added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c1
Toni Harbaugh-Blackford
Toni Harbaugh-Blackford
Cyril Hrubis
User pbaudis@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c2
Petr Baudis
User milisav.radmanic@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c4
--- Comment #4 from Milisav Radmanic
> This happens in SLES 11 RC1 also, so hopefully we can get it patched before
> SLES 11 is GA?
>
> Thanks, Toni
How did you test this for SLES 11 RC1? There is no maintained lam package
available for SLES 11. Furthermore, the code branch for openSUSE 11.2 isn't
even in Alpha state yet. And on openSUSE 11.1 (where the lam package is
available), the problem can't be reproduced as described.

regards
Milisav
User harbaugh@ncifcrf.gov added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c5
--- Comment #5 from Toni Harbaugh-Blackford
User harbaugh@ncifcrf.gov added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c6
--- Comment #6 from Toni Harbaugh-Blackford
User milisav.radmanic@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c7
--- Comment #7 from Milisav Radmanic
> I used openmpi on SLES instead of lam, with the same results
> I disabled ldap and used 'plain' /etc/passwd, with the same results.

How do you use mpirun without lam? Can you please describe how to reproduce
the error now? If I use mpicc to compile a hello world example like this:
/*
* Sample hello world MPI program for testing MPI.
*/
#include
User harbaugh@ncifcrf.gov added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c8
--- Comment #8 from Toni Harbaugh-Blackford
User jjolly@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=467161#c10
--- Comment #10 from John Jolly
John Jolly