LSI Megaraid 320-2x corruption

I am running SuSE 9.3 on a dual Opteron 246 system (Tyan Thunder K8SR (S2881)).
This machine has been running fine for several weeks using the on board SCSI controllers & SW RAID. We installed the LSI MegaRAID 320-2X controller with the TBBU03 battery backup and reinstalled SuSE 9.3.
The installation did not go flawlessly:
- Towards the end of the NFS installation, copying stopped. I switched consoles and noticed that I couldn't ping any other addresses in our network. I brought the ethernet interface down then back up and the installation resumed.
- After the install, both megaraid and megaraid_mbox modules were loaded. I removed megaraid from INITRD_MODULES and ran mkinitrd.
At this point we considered the installation suspect, but pressed on with testing.
Now the corruption:
One user was compiling code, another user was loading a database. After some time, g++ started getting internal compiler errors in cc1plus. I compared the checksum of this program with another install and they were different. I reinstalled gcc, verified checksums and everything worked for a while. Then internal compiler errors and bogus checksum again.
I did not see anything alarming in the log files. So I booted into rescue mode and ran reiserfsck --check without any problems. Flashed the controller with the latest firmware and booted again.
This time I set up a 'make clean; make' loop and watched it for about two hours without any problems. Then I started creating 4GB files with dd and deleting them and within 15 minutes another internal compiler error.
Questions:
- Does anyone run a similar setup without problems?
- Has anyone seen similar problems?
- Can anyone provide me some direction in tracking down this problem?
Thanks,
-K

----- Original Message -----
From: "Kelly Burkhart" <kelly@tradebotsystems.com>
To: <suse-amd64@suse.com>
Sent: Tuesday, August 09, 2005 1:50 PM
Subject: [suse-amd64] LSI Megaraid 320-2x corruption
I am running SuSE 9.3 on a dual Opteron 246 system (Tyan Thunder K8SR (S2881)).
This machine has been running fine for several weeks using the on board SCSI controllers & SW RAID. We installed the LSI MegaRAID 320-2X controller with the TBBU03 battery backup and reinstalled SuSE 9.3.
The installation did not go flawlessly:
- Towards the end of the NFS installation, copying stopped. I switched consoles and noticed that I couldn't ping any other addresses in our network. I brought the ethernet interface down then back up and the installation resumed.
- After the install, both megaraid and megaraid_mbox modules were loaded. I removed megaraid from INITRD_MODULES and ran mkinitrd.
At this point we considered the installation suspect, but pressed on with testing.
Now the corruption:
One user was compiling code, another user was loading a database. After some time, g++ started getting internal compiler errors in cc1plus. I compared the checksum of this program with another install and they were different. I reinstalled gcc, verified checksums and everything worked for a while. Then internal compiler errors and bogus checksum again.
I did not see anything alarming in the log files. So I booted into rescue mode and ran reiserfsck --check without any problems. Flashed the controller with the latest firmware and booted again.
This time I set up a 'make clean; make' loop and watched it for about two hours without any problems. Then I started creating 4GB files with dd and deleting them and within 15 minutes another internal compiler error.
Questions:
- Does anyone run a similar setup without problems?
- Has anyone seen similar problems?
- Can anyone provide me some direction in tracking down this problem?
I assume you are using some type of RAID configuration. Are you running the megaraid monitoring software? If so any errors in the megaserv.log file? I am not sure how doing a compile has anything to do with file corruption. This sounds more like a memory problem to me. Have you done a complete memory test with memtest? That would be the first thing I would do.
Also is the card in the middle slot or the left outside slot?
Brad Dameron
Systems Administrator
SeaTab Software
www.seatab.com

On Tue, 2005-08-09 at 14:45 -0700, Brad Dameron wrote:
I assume you are using some type of RAID configuration. Are you running the megaraid monitoring software? If so any errors in the megaserv.log file? I am not sure how doing a compile has anything to do with file corruption. This sounds more like a memory problem to me. Have you done a complete memory test with memtest? That would be the first thing I would do.
A memtest is ongoing.
I am not running the megaraid monitoring software. The linux downloads from the lsi web site are:
ut_gam_linux_6.02-21.zip
ut_linux_megarc_1.11.zip
ut_linux_mgr_5.20.zip
The megarc and mgr do not seem to be monitoring software. The gam zip contains a bunch of RPMs which I assumed were intended for some other variety of Linux so I didn't bother installing. To what monitoring software are you referring?
Also is the card in the middle slot or the left outside slot?
The server has a 1U chassis; only one PCI slot is available to us. I don't know which one is used but can check later today. What effect could that have?
-K

----- Original Message -----
From: "Kelly Burkhart" <kelly@tradebotsystems.com>
To: "Brad Dameron" <brad@seatab.com>
Cc: <suse-amd64@suse.com>
Sent: Wednesday, August 10, 2005 6:22 AM
Subject: Re: [suse-amd64] LSI Megaraid 320-2x corruption
On Tue, 2005-08-09 at 14:45 -0700, Brad Dameron wrote:
I assume you are using some type of RAID configuration. Are you running the megaraid monitoring software? If so any errors in the megaserv.log file? I am not sure how doing a compile has anything to do with file corruption. This sounds more like a memory problem to me. Have you done a complete memory test with memtest? That would be the first thing I would do.
A memtest is ongoing.
I am not running the megaraid monitoring software. The linux downloads from the lsi web site are:
ut_gam_linux_6.02-21.zip ut_linux_megarc_1.11.zip ut_linux_mgr_5.20.zip
The megarc and mgr do not seem to be monitoring software. The gam zip contains a bunch of RPMs which I assumed were intended for some other variety of Linux so I didn't bother installing. To what monitoring software are you referring?
Also is the card in the middle slot or the left outside slot?
The server has a 1U chassis; only one PCI slot is available to us. I don't know which one is used but can check later today. What effect could that have?
-K
I posted a HOW-TO on the monitoring software about a week ago. Just look at the list archives for this month with my name on it.
Brad Dameron
Systems Administrator
SeaTab Software
www.seatab.com

Brad Dameron wrote:
I posted a HOW-TO on the monitoring software about a week ago.
Just look at the list archives for this month with my name on it.
I have a megaraid adapter in my opteron and would be interested in using the monitoring software. I looked around for your HOW-TO but could not find it. Which archive is it in? At first I thought it might be in the X86_64 archive but I don't see it. Or maybe I'm looking in the wrong place.
Thanks.
Mark

First off, get this package from LSI.
http://www.lsilogic.com/downloads/license.do?id=2000&did=7776&pid=2411
It is the driver package for SuSE 9.1. In that driver package is a directory called Utilities/MegaMON. In MegaMON is a file named lsi_v35.tgz. Do a tar -xzvf lsi_v35.tgz. This is the monitor app you will need. Then do a ./install -Suse. This will install the binaries and install the startup script of raidmon in /etc/init.d and set itself to start on the runlevels 2,3,4,5.
I have modified my raidmon file to include some other items. Below is mine:
------------------------------------------------------------
#!/bin/sh
#
# chkconfig: 2345 20 80
# description: RAIDMon is a daemon that monitors the RAID subsystem
#              And generates e-mail to root
# processname: MegaServ.
# source function library
. /lib/lsb/init-functions
case "$1" in
  start)
        megadevice="megadev0"
        rm -f /dev/$megadevice 2>/dev/null
        megamajor=`cat /proc/devices|gawk '/megadev/{print$1}' `
        mknod /dev/$megadevice c $megamajor 0 2>/dev/null
        # New check: 10-31-01: Does node exist
        if [ ! -c /dev/$megadevice ]
        then
                echo " Character Device Node /dev/$megadevice does not exist. Raid Monitor could not be started "
                exit 1
        fi
        echo -n 'Starting RAID Monitor:'
        startproc /usr/sbin/MegaCtrl -start > /dev/null
        sleep 1 ; MegaCtrl -disMail
        touch /var/lock/subsys/raidmon
        MegaCtrl -enChkCon
        # check consistency on a Saturday at 01:00 every 4 weeks
        MegaCtrl -cons -h01 -w4 -d6
        echo
        ;;
  stop)
        echo -n 'Stopping RAID Monitor:'
        startproc /usr/sbin/MegaCtrl -stop
        megadevice="megadev0"
        rm -f /dev/$megadevice 2>/dev/null
        rm -f /var/lock/subsys/raidmon 2>/dev/null
        echo
        ;;
  restart|reload)
        $0 stop
        $0 start
        ;;
  *)
        echo "RAID Monitor is not Started/Stopped"
        echo "Usage: raidmon {start|stop|restart}"
        exit 1
esac
exit 0
-----------------------------------------------------------------------
Then just start the raidmon. The above changes tell my system to do a consistency check once a month at 1:00am. This makes sure my RAID5 is working like it should. Also note there will be a log file at /var/log/megaserv.log.
I use this setup on about a dozen production machines. It works perfectly. This will work on 32bit and 64bit machines.
Brad Dameron
Systems Administrator
SeaTab Software
www.seatab.com
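The start branch of the script above reads the driver's major number out of /proc/devices before calling mknod. As a rough sketch of just that lookup step, the helper below does the same thing with an explicit check for the "driver not loaded" case. The function name find_major and its optional second argument (a file to read instead of /proc/devices, handy for trying it out) are illustrative assumptions and not part of the LSI package.

```shell
#!/bin/sh
# Sketch only: look up a character-device major number by driver name.
# find_major is a hypothetical helper, not something shipped by LSI.
find_major() {
    name=$1
    table=${2:-/proc/devices}   # defaults to the real kernel table
    # Lines in /proc/devices look like "<major> <name>". Like the gawk
    # pattern in the script above, match any name containing the string,
    # but stop at the first hit so at most one number is printed.
    awk -v n="$name" 'index($2, n) { print $1; exit }' "$table"
}

# Possible use inside an init script (commented out so that sourcing
# this file has no side effects):
# major=$(find_major megadev)
# if [ -z "$major" ]; then
#     echo "megadev not listed in /proc/devices; is the driver loaded?"
#     exit 1
# fi
```

Checking for an empty result before mknod gives a clearer message than waiting for the later "-c /dev/$megadevice" test to fail.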

On Tue, 09 Aug 2005 15:50:29 -0500 Kelly Burkhart <kelly@tradebotsystems.com> wrote:
This time I set up a 'make clean; make' loop and watched it for about two hours without any problems. Then I started creating 4GB files with dd and deleting them and within 15 minutes another internal compiler error.
gcc failing like this is usually caused by hardware problems (e.g. bad DIMMs). gcc is a very good memory tester.
If you have "memory remapping" enabled in the BIOS, turn it off; also make sure to run the latest BIOS. If that doesn't help it's likely broken hardware.
-Andi

On Tue, 2005-08-09 at 23:53 +0200, Andi Kleen wrote:
On Tue, 09 Aug 2005 15:50:29 -0500 Kelly Burkhart <kelly@tradebotsystems.com> wrote:
This time I set up a 'make clean; make' loop and watched it for about two hours without any problems. Then I started creating 4GB files with dd and deleting them and within 15 minutes another internal compiler error.
gcc failing like this is usually caused by hardware problems (e.g. bad DIMMs). gcc is a very good memory tester.
If you have "memory remapping" enabled in the BIOS turn it off, also make sure to run the latest BIOS. If that doesn't help it's likely broken hardware.
I'll start a memcheck, check memory mapping and BIOS version.
In the meantime, is it likely that a memory problem would corrupt files which are not accessed? In the log below, nothing else is happening on this machine at the time. In a previous run of this test, the same three files were corrupted.
-K
tradebot@server96:~/bigfile> md5sum --check gcc-md5.txt
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/cc1: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/cc1plus: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/collect2: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtbegin.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtbeginS.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtbeginT.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtend.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtendS.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/jc1: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/jvgenmain: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/libgcc.a: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/libgcc_eh.a: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/specs: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/SYSCALLS.c.X: OK
tradebot@server96:~/bigfile> while true; do rm bigfile.dd ; dd if=/dev/zero of=bigfile.dd bs=8k count=500000; md5sum --check gcc-md5.txt; sleep 20; done
500000+0 records in
500000+0 records out
4096000000 bytes (4.1 GB) copied, 43.8051 seconds, 93.5 MB/s
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/cc1: FAILED
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/cc1plus: FAILED
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/collect2: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtbegin.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtbeginS.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtbeginT.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtend.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/crtendS.o: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/jc1: FAILED
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/jvgenmain: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/libgcc.a: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/libgcc_eh.a: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/specs: OK
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/SYSCALLS.c.X: OK
md5sum: WARNING: 3 of 14 computed checksums did NOT match
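Since the same three files are reported as failing across runs, it can help to keep only the FAILED paths from each run and compare the lists. The following is a minimal sketch of that filtering step, assuming GNU md5sum and a manifest such as the gcc-md5.txt used above. The function name failed_files is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print just the paths that "md5sum --check" reports as FAILED.
# failed_files is a hypothetical helper name.
failed_files() {
    manifest=$1
    # md5sum prints "<path>: OK" or "<path>: FAILED" on stdout and its
    # summary warning on stderr. LC_ALL=C keeps the status words
    # untranslated so the sed pattern matches.
    LC_ALL=C md5sum --check "$manifest" 2>/dev/null | sed -n 's/: FAILED$//p'
}

# Possible use (commented out so sourcing this file does nothing):
# failed_files gcc-md5.txt > run1.txt
# ... repeat the dd test ...
# failed_files gcc-md5.txt > run2.txt
# diff run1.txt run2.txt && echo "same files failed in both runs"
```

Files that md5sum cannot open at all are reported with a different message and are deliberately not matched here.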

On Wed, 2005-08-10 at 08:05 -0500, Kelly Burkhart wrote:
On Tue, 2005-08-09 at 23:53 +0200, Andi Kleen wrote:
On Tue, 09 Aug 2005 15:50:29 -0500 Kelly Burkhart <kelly@tradebotsystems.com> wrote:
This time I set up a 'make clean; make' loop and watched it for about two hours without any problems. Then I started creating 4GB files with dd and deleting them and within 15 minutes another internal compiler error.
gcc failing like this is usually caused by hardware problems (e.g. bad DIMMs). gcc is a very good memory tester.
If you have "memory remapping" enabled in the BIOS turn it off, also make sure to run the latest BIOS. If that doesn't help it's likely broken hardware.
I'll start a memcheck, check memory mapping and BIOS version.
memcheck showed no issues. BIOS version has been updated to the most recent. BIOS MTRR Mapping setting has been changed from 'Discrete' to 'Continuous'.
The problem persists. All I have to do to produce it is:
- Use yast to force update of gcc files
- Run dd to create a 4G file
After running dd, two or three of the gcc files will fail 'md5sum --check'.
-K
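The reproduction steps above (write a large file with dd, then re-check the gcc checksums) can be wrapped into one function so that a loop stops at the first mismatch instead of scrolling past it. This is only a sketch under stated assumptions: the function name stress_and_verify and its three parameters are illustrative, and the manifest is assumed to have been created beforehand with md5sum right after the yast reinstall.

```shell
#!/bin/sh
# Sketch only: write a scratch file with dd, then re-verify a checksum
# manifest. Returns 0 if every listed file still matches, non-zero if not.
# stress_and_verify is a hypothetical helper name.
stress_and_verify() {
    manifest=$1   # e.g. gcc-md5.txt, created earlier with: md5sum FILES > gcc-md5.txt
    scratch=$2    # throwaway file on the volume behind the suspect controller
    blocks=$3     # number of 8k blocks; 500000 matches the ~4 GB file in the thread
    rm -f "$scratch"
    dd if=/dev/zero of="$scratch" bs=8k count="$blocks" 2>/dev/null || return 2
    # md5sum --check exits non-zero if any file fails verification.
    md5sum --check "$manifest"
}

# Possible use (commented out: it writes about 4 GB per pass):
# while stress_and_verify gcc-md5.txt bigfile.dd 500000; do sleep 20; done
# echo "checksum mismatch detected at $(date)"
```

Stopping at the first failing pass leaves the system in the state where the corruption was just observed, which may make it easier to inspect logs or controller status at that moment.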

On Tue, 2005-08-09 at 15:50 -0500, Kelly Burkhart wrote:
I am running SuSE 9.3 on a dual Opteron 246 system (Tyan Thunder K8SR (S2881)).
This machine has been running fine for several weeks using the on board SCSI controllers & SW RAID. We installed the LSI MegaRAID 320-2X controller with the TBBU03 battery backup and reinstalled SuSE 9.3.
RMA'd the controller; the replacement works.
-K
participants (4)
- Andi Kleen
- Brad Dameron
- Kelly Burkhart
- Mark Horton