On Wednesday 08 December 2004 1:13 pm, Steven Naron wrote:
We are having a problem with nodes crashing on a cluster with 32 dual Opteron compute nodes with 5GB memory each and Xeon management and storage nodes. We have Myrinet, the shared disk is formatted with LVM, and we are using SUSE-8 Enterprise. Typical error messages are attached at the end of this note.
So far the problem seems to occur when compiling GCC 3.4.3, infrequently when building large (10GB+) files, and when running a certain memory-intensive, non-distributed C++ program. It has now happened on at least 4 different nodes.
****************************** NODE-002 *****************************************************
Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ... node002 kernel: Northbridge Machine Check exception b60000010005001b 0
Hi Steven,

Although I've been away from IBM for 14 years, I'll venture that the phrase "Machine Check" is a very strong indication of a hardware problem. The text also narrows the scope to the Northbridge, one of the two major 'bridge' chipsets on the system boards. Your multi-node cluster is a perfect environment for component swapping as a troubleshooting and problem-determination technique: move a suspect part from a failing node into a healthy one and see whether the error follows it.

Please let us know how this comes out.

PeterB
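To make the swapping exercise easier to track, here is a minimal Python sketch that pulls the node name and the 64-bit value out of a syslog line like the one quoted above and decodes its top flag bits. It assumes the hex value is the raw machine-check status register and uses the architectural x86 layout for bits 63 to 57 (VAL, OVER, UC, EN, MISCV, ADDRV, PCC). The function name `decode` and the regular expression are illustrative only, and the meaning of the trailing number is not assumed.

```python
import re

# Architectural flag bits of an x86 machine-check status register.
# Assumption: the hex value printed by the kernel is that raw register.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Hypothetical pattern matching the syslog line quoted in this thread.
LINE_RE = re.compile(
    r"(\S+) kernel: Northbridge Machine Check exception ([0-9a-f]{16}) (\d+)"
)

def decode(line):
    """Return node, status, set flag names and the low 16-bit error code."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    node, status_hex, trailing = m.groups()
    status = int(status_hex, 16)
    # Collect the names of the flag bits that are set, highest bit first.
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "node": node,
        "status": status,
        "flags": flags,
        "error_code": status & 0xFFFF,  # low 16 bits of the status word
        "trailing": int(trailing),      # printed after the status; meaning not assumed
    }

line = "node002 kernel: Northbridge Machine Check exception b60000010005001b 0"
info = decode(line)
print(info["node"], info["flags"], hex(info["error_code"]))
# prints: node002 ['VAL', 'UC', 'EN', 'ADDRV', 'PCC'] 0x1b
```

Running this over the collected syslogs and counting events per node before and after each swap would show whether the errors stay with a node or move with a part. The model-specific bits of the status word would need the processor vendor's documentation to interpret.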