Dual Opteron SuSE enterprise cluster -- nodes failing with "Northbridge Machine Check exception"
We are having a problem with nodes crashing on a cluster with 32 dual Opteron compute nodes with 5GB memory each and Xeon management and storage nodes. We have Myrinet, the shared disk is formatted with LVM, and we are using SUSE-8 Enterprise. Typical error messages are attached at the end of this note.

The problem seems to happen so far when: compiling GCC 3.4.3 and infrequently when building large (10GB+) files and when running a certain C++ non-distributed memory intensive program. It has happened now on at least 4 different nodes.

****************************** NODE-002 *****************************************************
Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Northbridge Machine Check exception b60000010005001b 0

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Uncorrectable condition

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Unrecoverable condition

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: NB status: unrecoverable

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Error uncorrected

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Address: 00000000051f0000

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: CPU 1: Machine Check Exception: 0000000000000000

Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ...
node002 kernel: Kernel panic: Unable to continue

------------------------------------------------------------------------
Steven Naron, PMP, Executive Consultant
Public Sector Architecture, IBM Global Services
Voice, page or fax (301)803-6852
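[Editor's note] The 64-bit value printed after "Northbridge Machine Check exception" is a machine-check status word, and splitting it into fields can help when comparing crashes across nodes. The following is only a rough sketch, not a diagnosis: it assumes the value follows the standard x86 MCi_STATUS layout (flag bits 63 to 57, error code in the low 16 bits, model-specific code in bits 31 to 16). The function name decode_mc_status and the dictionary keys are made up for this illustration, and the meaning of the model-specific code has to be looked up in AMD's documentation for the processor in question.

```python
# Sketch: split an x86 machine-check status word into its architectural fields.
# Assumes the standard MCi_STATUS layout; names below are illustrative only.

# (bit position, short name) for the architectural flag bits, highest first.
FLAG_BITS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting was enabled for this error
    (59, "MISCV"),  # companion MISC register is valid
    (58, "ADDRV"),  # companion ADDR register is valid
    (57, "PCC"),    # processor context may be corrupt
]


def decode_mc_status(status):
    """Return the set flags and the two 16-bit code fields of a status word."""
    flags = [name for bit, name in FLAG_BITS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15..0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31..16
    }


if __name__ == "__main__":
    # Value taken from the node002 log above.
    result = decode_mc_status(0xB60000010005001B)
    print(result["flags"])
    print(hex(result["mca_error_code"]), hex(result["model_specific_code"]))
```

For the node002 value this reports the flags VAL, UC, EN, ADDRV and PCC, an error code of 0x1b and a model-specific code of 0x5. The UC, PCC and ADDRV flags appear consistent with the "Error uncorrected", "Unrecoverable condition" and "Address:" lines in the log, but which kernel message corresponds to which bit is not confirmed here.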
On Wednesday 08 December 2004 1:13 pm, Steven Naron wrote:
We are having a problem with nodes crashing on a cluster with 32 dual Opteron compute nodes with 5GB memory each and Xeon management and storage nodes. We have Myrinet, the shared disk is formatted with LVM, and we are using SUSE-8 Enterprise. Typical error messages are attached at the end of this note.
The problem seems to happen so far when: compiling GCC 3.4.3 and infrequently when building large (10GB+) files and when running a certain C++ non-distributed memory intensive program. It has happened now on at least 4 different nodes.
****************************** NODE-002 *****************************************************
Message from syslogd@node002 at Thu Nov 4 17:24:37 2004 ... node002 kernel: Northbridge Machine Check exception b60000010005001b 0
Hi Steven,

Altho I've been away from IBM for 14 yrs, I'll venture that the phrase "Machine Check" is a very strong indication of a hardware problem. The text also narrows down the scope to one of the two major 'bridge' chipsets of the system boards.

Your setup as a multi-node array is a perfect environment for component swapping as a trouble-shooting, problem determination technique.

Please let us know how this comes out.

PeterB
On Wednesday 08 Dec 2004 21:56 pm, Peter B Van Campen wrote:
Hi Steven ,
Altho I've been away from IBM for 14 yrs, I'll venture that the phrase "Machine Check" is a very strong indication of a hardware problem.
The *same* fault on four separate machines at the same time?

Dylan
--
"I see your Schwartz is as big as mine" -Dark Helmet
On Wed December 8 2004 5:09 pm, Dylan wrote:
On Wednesday 08 Dec 2004 21:56 pm, Peter B Van Campen wrote:
Hi Steven ,
Altho I've been away from IBM for 14 yrs, I'll venture that the phrase "Machine Check" is a very strong indication of a hardware problem.
The *same* fault on four separate machines at the same time?
Absolutely! If a chipset is bad or "marginal," it's very possible. Not that long ago, I tried getting 9.1 to run on all 10 of 10 Gateways. I was hesitant to even try it as I KNOW most of their systems are less than spec. I was right.......6 of 10 had the SAME problem - a chipset problem. EVEN XP didn't like them very much.

Fred
--
"As Internet technology itself vaults into new areas, so too does the Microsoft monopoly and its tried-and-true bag of tricks." -US Senator Orrin Hatch, (R) Utah
Fred, Steve,

On Thursday 09 December 2004 06:32, Fred A. Miller wrote:
On Wed December 8 2004 5:09 pm, Dylan wrote:
On Wednesday 08 Dec 2004 21:56 pm, Peter B Van Campen wrote:
Hi Steven ,
Altho I've been away from IBM for 14 yrs, I'll venture that the phrase "Machine Check" is a very strong indication of a hardware problem.
"Machine Check" here is an Intel / Pentium thing, not an IBM thing.
The *same* fault on four separate machines at the same time?
Absolutely! If a chipset is bad or "marginal," it's very possible. Not that long ago, I tried getting 9.1 to run on all 10 of 10 Gateways. I was hesitant to even try it as I KNOW most of their systems are less than spec. I was right.......6 of 10 had the SAME problem - a chipset problem. EVEN XP didn't like them very much.
Occam's Razor, dude! It's more likely there was some network activity, perhaps a malformed (possibly even malicious) broadcast packet that triggered a kernel bug / vulnerability. There was just a kernel update to fix a vulnerability of this sort:

E.g., from "SUSE Security Summary Report SUSE-SR:2004:003" from the 7th of this month:

-==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==-
- kernel
  Several problems have been found in the Linux 2.4 and 2.6 kernels:
  ...
  - Several overflow checks in the smbfs handling of both Linux 2.4 and 2.6 were found missing by Stefan Esser. This is tracked by the Mitre CVE Id CAN-2004-0883.
-==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==-
Fred
Randall Schulz
participants (5)
- Dylan
- Fred A. Miller
- Peter B Van Campen
- Randall R Schulz
- Steven Naron