-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

We're running SLES 8 on a Tyan 2800 BIOS 2.01l with 2x Opteron 244, 6GB RAM.

Linux db64 2.4.19-SMP #1 SMP Mon Mar 31 23:48:08 UTC 2003 x86_64 unknown

Last time I tried to upgrade the kernel the new kernel didn't like something about the zero channel raid and wouldn't mount the root disk. :(, same with the SLES Service pack 2, so we started from scratch install, no extra patches.

we've got a couple of issues...

1) Linux won't boot with the DRAM ECC enabled in the BIOS.. so we're running w/o ECC.

2) I ran a "memory eater" script, that ate up all the memory, the machine didn't swap.. it crashed... spit out some errors...

below is the script, top output right after the crash, and the error messages. Any ideas? bad ram?

Kris.

- --- snip ---
export PARALLEL=1
#!/bin/bash2
#
# memtest.sh
#
# Shell script to help isolate memory failures under linux
#
# Author: Doug Ledford + contributors
#
# (C) Copyright 2000-2002 Doug Ledford; Red Hat, Inc.
# This shell script is released under the terms of the GNU General
# Public License Version 2, June 1991. If you do not have a copy
# of the GNU General Public License Version 2, then one may be
# retrieved from http://people.redhat.com/dledford/GPL.html
#
# Note, this needs bash2 for the wait command support.

# This is where we will run the tests at
if [ -z "$TEST_DIR" ]; then
  TEST_DIR=/usr/zap/tmp
fi

# The location of the linux kernel source file we will be using
if [ -z "$SOURCE_FILE" ]; then
  SOURCE_FILE=$TEST_DIR/linux.tar.gz
fi

if [ ! -f "$SOURCE_FILE" ]; then
  echo "Missing source file $SOURCE_FILE"
  exit 1
fi

# How many passes to run of this test, higher numbers are better
if [ -z "$NR_PASSES" ]; then
  NR_PASSES=40
fi

# Guess how many megs the unpacked archive is
if [ -z "$MEG_PER_COPY" ]; then
  MEG_PER_COPY=$(ls -l $SOURCE_FILE | awk '{print int($5/1024/1024) * 4}')
fi

# How many trees do we have to unpack in order to make our trees be larger
# than physical RAM? If we don't unpack more data than memory can hold
# before we start to run the diff program on the trees then we won't
# actually flush the data to disk and force the system to reread the data
# from disk. Instead, the system will do everything in RAM. That doesn't
# work (as far as the memory test is concerned). It's the simultaneous
# unpacking of data in memory and the read/writes to hard disk via DMA that
# breaks the memory subsystem in most cases. Doing everything in RAM without
# causing disk I/O will pass bad memory far more often than when you add
# in the disk I/O.
if [ -z "$NR_SIMULTANEOUS" ]; then
  NR_SIMULTANEOUS=$(free | awk -v meg_per_copy=$MEG_PER_COPY 'NR == 2 {print int($2*1.5/1024/meg_per_copy + (($2/1024)%meg_per_copy >= (meg_per_copy/2)) + (($2/1024/32) < 1))}')
fi

# Should we unpack/diff the $NR_SIMULTANEOUS trees in series or in parallel?
if [ ! -z "$PARALLEL" ]; then
  PARALLEL="yes"
else
  PARALLEL="no"
fi

if [ ! -z "$JUST_INFO" ]; then
  echo "TEST_DIR: $TEST_DIR"
  echo "SOURCE_FILE: $SOURCE_FILE"
  echo "NR_PASSES: $NR_PASSES"
  echo "MEG_PER_COPY: $MEG_PER_COPY"
  echo "NR_SIMULTANEOUS: $NR_SIMULTANEOUS"
  echo "PARALLEL: $PARALLEL"
  echo
  exit
fi

cd $TEST_DIR

# Remove any possible left over directories from a cancelled previous run
rm -fr linux linux.orig linux.pass.*

# Unpack the one copy of the source tree that we will be comparing against
tar -xzf $SOURCE_FILE
mv linux linux.orig

i=0
while [ "$i" -lt "$NR_PASSES" ]; do
  j=0
  while [ "$j" -lt "$NR_SIMULTANEOUS" ]; do
    if [ $PARALLEL = "yes" ]; then
      (mkdir $j; tar -xzf $SOURCE_FILE -C $j; mv $j/linux linux.pass.$j; rmdir $j) &
    else
      tar -xzf $SOURCE_FILE
      mv linux linux.pass.$j
    fi
    j=`expr $j + 1`
  done
  wait
  j=0
  while [ "$j" -lt "$NR_SIMULTANEOUS" ]; do
    if [ $PARALLEL = "yes" ]; then
      (mkdir $j; tar -xzf $SOURCE_FILE -C $j; mv $j/linux linux.pass.$j; rmdir $j) &
    else
      tar -xzf $SOURCE_FILE
      mv linux linux.pass.$j
    fi
    j=`expr $j + 1`
  done
  wait
  j=0
  while [ "$j" -lt "$NR_SIMULTANEOUS" ]; do
    if [ $PARALLEL = "yes" ]; then
      (diff -U 3 -rN linux.orig linux.pass.$j; rm -fr linux.pass.$j) &
    else
      diff -U 3 -rN linux.orig linux.pass.$j
      rm -fr linux.pass.$j
    fi
    j=`expr $j + 1`
  done
  wait
  i=`expr $i + 1`
done

# Clean up after ourselves
rm -fr linux linux.orig linux.pass.*

  8:50pm  up  3:06,  2 users,  load average: 54.12, 51.66, 33.69
207 processes: 206 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:  0.1% user,  0.0% system,  0.0% nice, 99.4% idle
CPU1 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
Mem:  5806792K av, 5791816K used,   14976K free,       0K shrd,  437508K buff
Swap: 1052248K av,       0K used, 1052248K free                 4640676K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1738 zappos    15   0  1260 1260   844 R     0.1  0.0   0:05 top
    1 root      15   0   240  240   188 S     0.0  0.0   0:04 init
    2 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU0
    3 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00 ksoftirqd_CPU0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00 ksoftirqd_CPU1
    7 root      15   0     0    0     0 SW    0.0  0.0   0:00 kswapd
    8 root      25   0     0    0     0 SW    0.0  0.0   0:00 bdflush
    9 root      15   0     0    0     0 SW    0.0  0.0   0:00 kupdated
   10 root      25   0     0    0     0 SW    0.0  0.0   0:00 kinoded
   12 root      25   0     0    0     0 SW    0.0  0.0   0:00 mdrecoveryd
   16 root      15   0     0    0     0 SW    0.0  0.0   0:00 kreiserfsd
   73 root       0 -20     0    0     0 SW<   0.0  0.0   0:00 lvm-mpd
  420 root      15   0   660  660   524 S     0.0  0.0   0:00 syslogd
  423 root      15   0  1376 1376   444 S     0.0  0.0   0:00 klogd
  459 root      24   0     0    0     0 SW    0.0  0.0   0:00 khubd
  612 bin       25   0   460  460   360 S     0.0  0.0   0:00 portmap
  634 root      23   0  2860 2860  1488 S     0.0  0.0   0:00 snmpd
  678 root      15   0  1904 1904  1736 S     0.0  0.0   0:00 sshd
  876 root      15   0  1828 1828  1388 S     0.0  0.0   0:00 master
  885 postfix   15   0  2020 2020  1532 S     0.0  0.0   0:00 qmgr
  900 at        16   0   608  608   488 S     0.0  0.0   0:00 atd
  915 root      15   0   696  696   552 S     0.0  0.0   0:00 cron
 1012 root      15   0   800  800   620 S     0.0  0.0   0:00 nscd
 1013 root      15   0   800  800   620 S     0.0  0.0   0:00 nscd
 1014 root      15   0   800  800   620 S     0.0  0.0   0:00 nscd
 1015 root      15   0   800  800   620 S     0.0  0.0   0:00 nscd
 1016 root      15   0   800  800   620 S     0.0  0.0   0:00 nscd
 1017 root      15   0   800  800   620 S     0.0  0.0   0:00 nscd
 1018 root      15   0   800  800   620 S     0.0  0.0   0:00 nscd
 1031 root      20   0   512  512   428 S     0.0  0.0   0:00 mingetty
 1032 root      20   0   512  512   428 S     0.0  0.0   0:00 mingetty
 1033 root      21   0   512  512   428 S     0.0  0.0   0:00 mingetty
 1034 root      20   0   512  512   428 S     0.0  0.0   0:00 mingetty
 1035 root      21   0   512  512   428 S     0.0  0.0   0:00 mingetty
 1036 root      21   0   512  512   428 S     0.0  0.0   0:00 mingetty
 1320 postfix   15   0  1724 1724  1312 S     0.0  0.0   0:00 pickup
 1366 root      15   0  2548 2548  2368 S     0.0  0.0   0:00 sshd
 1368 zappos    15   0  2652 2652  2424 S     0.0  0.0   0:00 sshd

Message from syslogd@db64 at Thu Oct 16 20:49:20 2003 ...
db64 kernel: MCG_STATUS: unrecoverable
memtest.bash: line 108: 1627 Segmentation fault tar -xzf $SOURCE_FILE -C $j

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: Northbridge Machine Check exception b40000000005001b 0

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: Uncorrectable condition

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: Unrecoverable condition

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: Error uncorrected

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: Address: 0000000009470000

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: MCE at EIP ffffffffa001412e ESP 10025659d98

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: CPU 0: Machine Check Exception: 0000000000000000

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: Kernel panic: Unable to continue

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: Kernel BUG at journal:3092

Message from syslogd@db64 at Thu Oct 16 20:49:22 2003 ...
db64 kernel: invalid operand: 0000

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQE/j25WuBLfyXibQuYRAlCmAJwK7J81+donOr3xnJwW5EUfiDSZmACePLHC
jGzmj8K//nK7Fi7275meFLs=
=Dx+U
-----END PGP SIGNATURE-----
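For anyone wondering how many trees the script above tries to unpack at once, the NR_SIMULTANEOUS formula can be tried in isolation. This is only a sketch: the function name nr_simultaneous is made up here, and the MEG_PER_COPY value of 120 is an assumed figure (a roughly 30 MB tarball times 4), since the post does not give the tarball size. The RAM figure is the "Mem: ... av" value from the top output.

```shell
#!/bin/sh
# Sketch of the NR_SIMULTANEOUS estimate from memtest.sh, pulled out into a
# function so it can be tried with fixed numbers instead of live `free` output.
#   $1 = total RAM in KB (column 2 of the "Mem:" row of `free`)
#   $2 = MEG_PER_COPY (assumed value; the post does not state it)
nr_simultaneous() {
    awk -v total_kb="$1" -v meg_per_copy="$2" 'BEGIN {
        total_mb = total_kb / 1024
        # 1.5x RAM worth of trees, rounded up when the remainder is at least
        # half a copy, and never zero on machines with under 32 MB.
        print int(total_mb * 1.5 / meg_per_copy + ((total_mb % meg_per_copy) >= (meg_per_copy / 2)) + ((total_mb / 32) < 1))
    }'
}

# RAM size reported by top in this thread, with the assumed 120 MB per copy.
nr_simultaneous 5806792 120
```

With those inputs the estimate comes out at 70 trees, which under PARALLEL=1 would mean around 70 tar processes per unpack pass.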
Kristoffer Ongbongan writes:
We're running SLES 8 on a Tyan 2800 BIOS 2.01l with 2x Opteron 244, 6GB RAM. Linux db64 2.4.19-SMP #1 SMP Mon Mar 31 23:48:08 UTC 2003 x86_64 unknown
Last time I tried to upgrade the kernel the new kernel didn't like something about the zero channel raid and wouldn't mount the root disk.
Which RAID?
:(, same with the SLES Service pack 2, so we started from scratch install, no extra patches.
we've got a couple of issues...
1) Linux won't boot with the DRAM ECC enabled in the BIOS.. so we're running w/o ECC.
This looks like a hardware problem. Or do you have any information that this is Linux' fault?
2) I ran a "memory eater" script, that ate up all the memory, the machine didn't swap.. it crashed... spit out some errors...
below is the script, top output right after the crash, and the error messages. Any ideas? bad ram?
The original SLES8 kernel had some bugs in the MCE code, this got only fixed for SP2. So, the error message while pointing to bad memory might still be wrong. I would still try to get SP2 running. I'm also uploading our current kernel, you might want to try that one. It's the -127 kernel at ftp.suse.com/pub/people/aj/Kernel-AMD64/ Andreas
Andreas -- Andreas Jaeger, aj@suse.de, http://www.suse.de/~aj SuSE Linux AG, Deutschherrnstr. 15-19, 90429 Nürnberg, Germany GPG fingerprint = 93A3 365E CE47 B889 DF7F FED1 389A 563C C272 A126
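Since the kernel's own wording of the error may not be trustworthy here, the raw Northbridge status word b40000000005001b quoted above can be split by hand. The sketch below assumes the standard x86 machine-check status layout (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 31:16 model-specific code, bits 15:0 MCA error code). The function name decode_mc_status is made up for illustration, and what the model-specific field means on this Northbridge has to be looked up in AMD's documentation; it is not decoded here.

```shell
#!/bin/sh
# Sketch: split a 64-bit machine-check status word into its flag bits.
# The word is handled as two 32-bit halves so the shell's signed 64-bit
# arithmetic never sees a value with the top bit set.
decode_mc_status() (
    hex=$1
    if [ "${#hex}" -ne 16 ]; then
        echo "expected 16 hex digits" >&2
        exit 1
    fi
    hi=$(( 0x$(printf '%s' "$hex" | cut -c1-8) ))    # bits 63..32
    lo=$(( 0x$(printf '%s' "$hex" | cut -c9-16) ))   # bits 31..0
    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
        $(( (hi >> 31) & 1 )) $(( (hi >> 30) & 1 )) $(( (hi >> 29) & 1 )) \
        $(( (hi >> 28) & 1 )) $(( (hi >> 27) & 1 )) $(( (hi >> 26) & 1 )) \
        $(( (hi >> 25) & 1 ))
    printf 'model_specific=0x%04x mca_code=0x%04x\n' \
        $(( (lo >> 16) & 0xffff )) $(( lo & 0xffff ))
)

# Status word from the "Northbridge Machine Check exception" line.
decode_mc_status b40000000005001b
```

Under that layout the word has VAL, UC, EN and ADDRV set and PCC clear, with a model-specific code of 0x0005 and an MCA error code of 0x001b.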
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andreas Jaeger wrote:
|>Last time I tried to upgrade the kernel the new kernel didn't like
|>something about the zero channel raid and wouldn't mount the root
|>disk.
|
| Which RAID?

the LSI zero channel (320-0) version 1Z19. We built one volume with 6x 70GB drives. 3 separate RAID 1's spanned together to build 1 volume of 210GB. so the / partition is on the RAID.

I'll try the upgrade again tomorrow since the machine is down for the count anyways with the RAM thing and try to record where it dies again.

Just looked at our other opteron, same setup, went into swap without a problem.. I'm thinking it's bad RAM or something. 2 servers, identical hardware purchased 3 months apart showed same results.

|>:(, same with the SLES Service pack 2, so we started from scratch
|>install, no extra patches.
|>
|>we've got a couple of issues...
|>
|>1) Linux won't boot with the DRAM ECC enabled in the BIOS.. so we're
|>running w/o ECC.
|
| This looks like a hardware problem. Or do you have any information
| that this is Linux' fault?

I'm going to pull the RAM and stick it into an Intel serverboard and run the intel diags on it this week sometime. I hope it's bad RAM. Sure would love to get those 2GB dimms, anyone with experience running the tyan 2880 with 12GB on SLES 8?

Kris.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQE/j6CruBLfyXibQuYRAvW0AJ99orf6o+gHhQ2shT3fARStVv8dyACghyUi
U6hjj9MaFWXGB2DsUFIov1s=
=5oym
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

ECC issues aside, I'd like to get this box running..

I installed the k_numa kernel per advice of suse business support... and this is what I got..

scsi0: Found a MegaRAID controller at 0x1b000, IRQ: 26
scsi0: Enabling 64 bit support
megaraid: [1Z26:G112] detected 1 logical drives
megaraid: supports extended CDBs.
megaraid: channel[1] is raid.
megaraid: channel[2] is raid.
scsi0: LSI Logic MegaRAID 1Z26 254 commands 16 targs 5 chans 7 luns
scsi0: scanning virtual channel 0 for logical drives.
  Vendor: MegaRAID  Model: LD0 RAID1 10018R  Rev: 1Z26
  Type:   Direct Access                      ANSI SCSI revision: 02
scsi0: scanning virtual channel 1 for logical drives.
scsi0: scanning virtual channel 2 for logical drives.
scsi0: scanning virtual channel 3 for logical drives.
scsi0: scanning physical channel 0 for devices.
scsi0: scanning physical channel 1 for devices.
sd: allocated major 8
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: 430116864 512-byte hdwr sectors (220220 MB)
Partition check:
  sda: sda1 sda2 sda3
Loading module reiserfs ...
Using /lib/modules/2.4.19-NUMA/kernel/fs/reiserfs/reiserfs.o
sh-2021: reiserfs_read_super: can not find reiserfs on sd(8,1)
Kernel panic: VFS: Unable to mount root fs on 08:01

anyone know how I could boot this?

Andreas Jaeger wrote:
|>2) I ran a "memory eater" script, that ate up all the memory, the
|>machine didn't swap.. it crashed... spit out some errors...
|>
|>below is the script, top output right after the crash, and the error
|>messages. Any ideas? bad ram?
|
| The original SLES8 kernel had some bugs in the MCE code, this got only
| fixed for SP2. So, the error message while pointing to bad memory
| might still be wrong. I would still try to get SP2 running. I'm
| also uploading our current kernel, you might want to try that one.
|
| It's the -127 kernel at ftp.suse.com/pub/people/aj/Kernel-AMD64/
|
| Andreas
|
|>Kris.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQE/j+XvuBLfyXibQuYRApmIAJ4kQmhDw5OyKSIJwpXyNv6tvl8jZACeO9vK
2Jcs80KdT+fELRhv5K+bi+Q=
=+EL7
-----END PGP SIGNATURE-----
I am using the regular 2.4.21-smp kernel from SuSE on our dual-Opteron with the onboard LSI controller (mpt driver) and multiple RAID arrays driven by either megaraid.o or megaraid2.o (we generally stick with megaraid2.o).

At one point the array was too large(?) to use reiserfs (it failed during file system creation). We changed the array type, which lowered the size, and cobbled some arrays together using LVM. These have reiserfs on them and it is working fine. The arrays are all on an LSI 320-4x in a 64bit slot on a Tyan S2880 w/ 4GB RAM.

Have you tried calling LSI support?

Santiago
-----Original Message----- From: Kristoffer Ongbongan [mailto:kris@vfrogs.com] Sent: Friday, October 17, 2003 5:52 AM To: Andreas Jaeger Cc: suse-amd64@suse.com Subject: Re: [suse-amd64] It didn't swap?
Kristoffer Ongbongan writes:
ECC issues aside, I'd like to get this box running..
I installed the k_numa kernel per advice of suse business support... and
Who told you that?
this is what I got..
Which kernel is this? If it's the SP2 one, please use either the megaraid2 module, or use the latest one from the maintenance web. The SP2 one was broken. Andreas
Andreas -- Andreas Jaeger, aj@suse.de, http://www.suse.de/~aj SuSE Linux AG, Deutschherrnstr. 15-19, 90429 Nürnberg, Germany GPG fingerprint = 93A3 365E CE47 B889 DF7F FED1 389A 563C C272 A126
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Friday 17 October 2003 06:47 am, Andreas Jaeger wrote:
Kristoffer Ongbongan writes:
ECC issues aside, I'd like to get this box running..
I installed the k_numa kernel per advice of suse business support... and
Who told you that?
Michael Krapp, ticket # 20031017430000641
this is what I got..
Which kernel is this? If it's the SP2 one, please use either the megaraid2 module, or use the latest one from the maintenance web. The SP2 one was broken.
Forgive me, I'm relatively new to Linux, been working with BSD & Sun this whole time.. what do I need to pass to grub to use the megaraid2 instead of the megaraid module?

/boot/vmlinuz root=/dev/sda1 vga=normal

I've also downloaded your -127 kernel, once I get it booted up, I can install it.
Andreas
- -- Kristoffer Ongbongan kris@vfrogs.com Venture Frogs Incubator (415) 345-6210 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (FreeBSD) iD8DBQE/j/0XuBLfyXibQuYRAubBAKCx8zjvZtZip7EWi6jFv3oSIflt4QCdEh1w 2/vQtk4+cyOKedVdAikbGqs= =Hc8N -----END PGP SIGNATURE-----
Kris Ongbongan
On Friday 17 October 2003 06:47 am, Andreas Jaeger wrote:
Kristoffer Ongbongan writes:
ECC issues aside, I'd like to get this box running..
I installed the k_numa kernel per advice of suse business support... and
Who told you that?
Michael Krapp, ticket # 20031017430000641
Thanks, I'll talk with him.
this is what I got..
Which kernel is this? If it's the SP2 one, please use either the megaraid2 module, or use the latest one from the maintenance web. The SP2 one was broken.
Forgive me, I'm relatively new to Linux, been working with BSD & Sun this whole time.. what do I need to pass to grub to use the megaraid2 instead of the megaraid module?
/boot/vmlinuz root=/dev/sda1 vga=normal
I've also downloaded your -127 kernel, once I get it booted up, I can install it.
You need to change this in /etc/sysconfig/kernel and then regenerate the initial ramdisk with /sbin/mkinitrd.

Andreas
--
Andreas Jaeger, aj@suse.de, http://www.suse.de/~aj
SuSE Linux AG, Deutschherrnstr. 15-19, 90429 Nürnberg, Germany
GPG fingerprint = 93A3 365E CE47 B889 DF7F FED1 389A 563C C272 A126
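The change in /etc/sysconfig/kernel amounts to editing the INITRD_MODULES line, which lists the modules packed into the initial ramdisk. A minimal sketch of that edit follows; the helper name swap_megaraid and the sample module list are invented for illustration, and the real line on this machine may well contain other modules.

```shell
#!/bin/sh
# Sketch: print a copy of a sysconfig file with "megaraid" replaced by
# "megaraid2" on the INITRD_MODULES line. The file itself is not modified.
swap_megaraid() {
    # Match megaraid only as a whole word delimited by a quote or a space,
    # so an existing megaraid2 entry is left alone.
    sed -e '/^INITRD_MODULES=/ s/\([" ]\)megaraid\([" ]\)/\1megaraid2\2/' "$1"
}

# Demonstration on a throwaway file with a hypothetical module list.
sample=$(mktemp)
printf 'INITRD_MODULES="megaraid reiserfs"\n' > "$sample"
swap_megaraid "$sample"
rm -f "$sample"
```

On the real system the result would be reviewed, written back to /etc/sysconfig/kernel as root, and followed by /sbin/mkinitrd so the new ramdisk actually contains megaraid2.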
1) Linux won't boot with the DRAM ECC enabled in the BIOS.. so we're running w/o ECC.
2) I ran a "memory eater" script, that ate up all the memory, the machine didn't swap.. it crashed... spit out some errors...
Some of the boards are rather picky on what DIMMs they accept in which slots etc. I would recommend to first get a BIOS upgrade from Tyan (maybe you have an early BIOS that programs the memory controller not well enough) and if that doesn't help try replacing the DIMMs or putting them into different slots. This is usually a BIOS/hardware problem; Linux is not involved in how the memory is configured and doesn't know anything about ECC, DIMMs or slots. -Andi
participants (5): Andi Kleen, Andreas Jaeger, Kris Ongbongan, Kristoffer Ongbongan, Santiago Flores