We are having a problem with nodes crashing on a cluster of 32 dual-Opteron
compute nodes, each with 5GB of memory, plus Xeon management and storage
nodes. We have Myrinet, the shared disk is managed with LVM, and we are
running SUSE-8 Enterprise. Typical error messages are attached at the end of
this note.

So far the problem has occurred when compiling GCC 3.4.3, infrequently when
building large (10GB+) files, and when running a certain memory-intensive,
non-distributed C++ program.

It has now happened on at least 4 different nodes.

****************************** NODE-002 ******************************

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Northbridge Machine Check exception b60000010005001b 0

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Uncorrectable condition

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Unrecoverable condition

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: NB status: unrecoverable

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Error uncorrected

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Address: 00000000051f0000

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: CPU 1: Machine Check Exception: 0000000000000000

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Kernel panic: Unable to continue

------------------------------------------------------------------------
Steven Naron, PMP, Executive Consultant
Public Sector Architecture, IBM Global Services
Voice, page or fax (301)803-6852