New subject: [SLE] Dual Opteron SuSE enterprise cluster -- nodes failing with "Northbridge Machine Check exception"

8 Dec 2004

      We are having a problem with nodes crashing on a cluster of 32 dual
Opteron compute nodes, each with 5GB of memory, plus Xeon management and
storage nodes.  The interconnect is Myrinet, the shared disk is set up
under LVM, and we are running SUSE-8 Enterprise.  Typical error messages
are attached at the end of this note.

So far the crashes have occurred while compiling GCC 3.4.3, occasionally
while building large (10GB+) files, and while running a certain
memory-intensive, non-distributed C++ program.  At least 4 different nodes
have been affected.

****************************** NODE-002 ******************************

Message from syslogd@node002 at Thu Nov  4 17:24:37 2004 ... node002
kernel: Northbridge Machine Check exception b60000010005001b 0

Message from syslogd@node002 at Thu Nov  4 17:24:37 2004 ... node002
kernel: Uncorrectable condition

Message from syslogd@node002 at Thu Nov  4 17:24:37 2004 ... node002
kernel: Unrecoverable condition

Message from syslogd@node002 at Thu Nov  4 17:24:37 2004 ... node002
kernel: NB status: unrecoverable

Message from syslogd@node002 at Thu Nov  4 17:24:37 2004 ... node002
kernel: Error uncorrected

Message from syslogd@node002 at Thu Nov  4 17:24:37 2004 ... node002
kernel: Address: 00000000051f0000

Message from syslogd@node002 at Thu Nov  4 17:24:37 2004 ... node002
kernel: CPU 1: Machine Check Exception: 0000000000000000

Message from syslogd@node002 at Thu Nov  4 17:24:37 2004 ... node002
kernel: Kernel panic: Unable to continue
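
For reference, the status word in the first log line can be unpacked by
hand.  The following is only a rough sketch: it assumes the value is a
standard 64-bit machine check status word with the usual flag bits in
positions 63 down to 57, and that the northbridge keeps an extended error
code in bits 19:16 (that last field is an assumption and should be checked
against the processor documentation).  The function name decode_mc_status
is made up for illustration.

```python
# Flag bits of a 64-bit machine check status word (bit position -> name).
FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the address register is valid
    57: "PCC",    # processor context may be corrupt
}

def decode_mc_status(status):
    """Split a status word into its set flags and error-code fields."""
    set_flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": set_flags,
        "error_code": status & 0xFFFF,
        # Assumption: extended error code lives in bits 19:16.
        "ext_error_code": (status >> 16) & 0xF,
    }

# Status value copied from the node002 log above.
result = decode_mc_status(0xb60000010005001b)
print(result)
```

With the value from node002 this reports the flags VAL, UC, EN, ADDRV and
PCC, which appears consistent with the "Error uncorrected", "Address:" and
"unrecoverable" lines the kernel printed.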
------------------------------------------------------------------------
Steven Naron, PMP, Executive Consultant
Public Sector Architecture, IBM Global Services
Voice, page or fax (301)803-6852
