Switching p655's and a p690 to SuSe?
Hi folks.
We have 7 p655's with 8 CPU's and 1 p690 with 32 CPU's. These machines are linked with a 1 gigabit, low-latency switch for MPI computations using a switch from IBM called an "SP2 switch", AKA a "Colony switch".
We're currently running AIX 5.1. IBM is telling us that AIX 5.3 will never support our SP2 switch, so our upgrade path seems to have disappeared shortly after purchasing the equipment.
Is there any possibility of SuSe for pSeries supporting this hardware effectively, especially the SP2 switch (now or in the future)? Would we have to toss the SP2 switch and go with (much higher latency) gigabit ethernet, or myrinet or something for our MPI computations?
How's the SMP support on pSeries? Is it up to handling 32 CPU's well yet?
One of the reasons the guy with the grant money for this compute cluster wanted AIX, was for the AIX compilers - so I need to ask: Will SuSe for pSeries be able to use xlc/xlC/xlf/xlf90/xlf95 and such?
Also, is loadleveler available for SuSe on pSeries? That's the queuing system we're using now, and it's also the queuing system many of the freely redistributable climatology programs we need are written to work with...
Thanks!
Hi all,
I've been trying to install SUSE SLES 9 on a RS6000 H50. With the default kernel everything works ok. My system has 4 604e CPU's, so I wanted to try the SMP kernel. When I boot with yaboot it gives me a default catch error:
boot: Please wait, loading kernel...
Unexpected Firmware Error: DEFAULT CATCH!, code=fff00300 at %SRR0: 00c194fc %SRR1: 00003030
ok
I did a display of the registers:
.registers
Client's Fix Pt Regs:
00 00000000 01bbf200 01bbf800 ffffffff 00000d00 f0000000 01b19b40 00000000
08 00000003 00c1aa58 00000014 00000010 ba0a7000 00000000 00000000 00000000
10 00000000 01c00000 00220000 00c29b60 00220000 02002000 00001000 00000000
18 00c00c00 00c18000 00c01000 00c017e0 00c04000 902001e4 00c007fc 00c00400
Special Regs:
%IV: 00000300 %SRR0: 00c194fc %SRR1: 00003030
%CR: 24004024 %LR: 00c1ab34 %CTR: 00c4ca78 %XER: 20000000
%DAR: 00c017e0 %DSISR: 42000000 %SDR1: 01be0000
Is there anybody who has a clue why this happens? Any ideas on why this is, or tips on how to debug this are very welcome. I'm a novice linux user, but eager to learn :)
Erik Janssen.
--
No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.338 / Virus Database: 267.9.7/60 - Release Date: 28-7-2005
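For readers who want to interpret the register dump above, here is a minimal Python sketch that decodes the trap from the %IV, %DSISR and %DAR values Erik posted. The vector offsets and DSISR bit meanings follow the standard 32-bit PowerPC architecture definitions; the function name `decode_trap` is made up for this illustration, and the reading of `code=fff00300` as "exception vector 0x300" is an assumption based on the matching %IV value. It only says what kind of exception was raised, not why the firmware raised it.

```python
# Standard PowerPC exception vector offsets (subset).
VECTORS = {
    0x100: "system reset",
    0x200: "machine check",
    0x300: "data storage interrupt (DSI)",
    0x400: "instruction storage interrupt (ISI)",
    0x500: "external interrupt",
    0x600: "alignment",
    0x700: "program",
}

# DSISR bits that describe the cause of a DSI (32-bit PowerPC).
DSISR_FLAGS = [
    (0x40000000, "no translation found for the address"),
    (0x08000000, "access not permitted by page protection"),
    (0x02000000, "access was a store"),
]


def decode_trap(iv, dsisr, dar):
    """Return human-readable lines describing the trap (hypothetical helper)."""
    lines = ["vector 0x%04x: %s" % (iv, VECTORS.get(iv, "unknown"))]
    if iv == 0x300:
        # DAR and DSISR are only meaningful for data storage interrupts here.
        lines.append("faulting data address (DAR): 0x%08x" % dar)
        for mask, meaning in DSISR_FLAGS:
            if dsisr & mask:
                lines.append("DSISR 0x%08x: %s" % (mask, meaning))
    return lines


if __name__ == "__main__":
    # Values copied from the .registers output in the post.
    for line in decode_trap(0x00000300, 0x42000000, 0x00C017E0):
        print(line)
```

With the posted values this reports a data storage interrupt caused by a store to an address (00c017e0) for which no translation was found, which fits Olaf's later remark that the failure lies outside yaboot.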
On Sun, Jul 31, Erik Janssen wrote:
Hi all,
I've been trying to install SUSE SLES 9 on a RS6000 H50. With the default kernel everything works ok. My system has 4 604e CPU's, so I wanted to try the SMP kernel. When I boot with yaboot it gives me a default catch error:
boot: Please wait, loading kernel...
Unexpected Firmware Error: DEFAULT CATCH!, code=fff00300 at %SRR0: 00c194fc %SRR1: 00003030 ok
This is outside of yaboot, likely inside the firmware. What firmware version do you have? It's displayed when the SMS is active.
Olaf,
<System Information> Serial Number 440152A Firmware Level WIL04197
This is the latest firmware available for this machine 7026-H50. See also: http://techsupport.services.ibm.com/server/mdownload2/7026H50F.html
or for a complete list: http://techsupport.services.ibm.com/server/mdownload/
Erik.
On Wed, Aug 03, Erik Janssen wrote:
Olaf,
<System Information> Serial Number 440152A Firmware Level WIL04197
This is the latest firmware available for this machine 7026-H50. See also:
Ok, can you try the /lib/lilo/chrp/yaboot.debug binary? It will produce lots of output; having a serial console will help to capture it all.
ok,
did a boot from installation cdrom 1, at yaboot did:
install32 noinitrd root=/dev/sda1
After startup I logged in as root and did:
dd if=/lib/lilo/chrp/yaboot.chrp.debug of=/dev/sda1 bs=512
After this, reboot. After the reboot, just start linux from yaboot. After all kinds of other debug messages, the 'read_disk_block' messages start and end in the default catch:
<all other messages> 3123752960
read_disk_block - Reading 4096 bytes, starting at block 564422, disk offset 3123757056
read_disk_block - Reading 4096 bytes, starting at block 564423, disk offset 3123761152
read_disk_block - Reading 4096 bytes, starting at block 564424, disk offset 3123765248
read_disk_block - Reading 4096 bytes, starting at block 564425, disk offset 3123769344
read_disk_block - Reading 4096 bytes, starting at block 564426, disk offset 3123773440
read_disk_block - Reading 4096 bytes, starting at block 564427, disk offset 3123777536
read_disk_block - Reading 4096 bytes, starting at block 564428, disk offset 3123781632
read_disk_block - Reading 4096 bytes, starting at block 564429, disk offset 3123785728
read_disk_block - Unexpected Firmware Error: DEFAULT CATCH!, code=fff00300 at %SRR0: 00c194fc %SRR1: 00003030
ok
0 >
Seems like some lack of memory / storage... Any ideas?
Erik.
Damn, that should of course be:
install32 noinitrd root=/dev/sda3
Sorry.
Erik.
Erik Janssen wrote:
ok,
did a boot from installation cdrom 1, at yaboot did:
install32 noinitrd root=/dev/sda1
On Wed, Aug 03, Erik Janssen wrote:
read_disk_block - Reading 4096 bytes, starting at block 564429, disk offset 3123785728
read_disk_block - Unexpected Firmware Error: DEFAULT CATCH!, code=fff00300 at %SRR0: 00c194fc %SRR1: 00003030
ok
0 >
Seems like some lack of memory / storage...
It dies in a printf: "read_disk_block" and "Reading.." are 2 separate calls. Likely a firmware bug. Can you paste the fdisk -l output?
linux:~ # fdisk -l

Disk /dev/sda: 4512 MB, 4512701440 bytes
139 heads, 62 sectors/track, 1022 cylinders
Units = cylinders of 8618 * 512 = 4412416 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1           1        4278   41  PPC PReP Boot
/dev/sda2               5         184      775620   82  Linux swap
/dev/sda3             185        1022     3610942   83  Linux

There is a valid AIX label on this disk.
Unfortunately Linux cannot handle these disks at the moment. Nevertheless some advice:
1. fdisk will destroy its contents on write.
2. Be sure that this disk is NOT a still vital part of a volume group. (Otherwise you may erase the other disks as well, if unmirrored.)
3. Before deleting this physical volume be sure to remove the disk logically from your AIX machine. (Otherwise you become an AIXpert).

Disk /dev/sdb: 4512 MB, 4512701440 bytes
256 heads, 63 sectors/track, 546 cylinders
Units = cylinders of 16128 * 512 = 8257536 bytes

   Device Boot      Start         End      Blocks   Id  System

Disk /dev/sdc: 4512 MB, 4512701440 bytes
139 heads, 62 sectors/track, 1022 cylinders
Units = cylinders of 8618 * 512 = 4412416 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1        1022     4403767   fd  Linux raid autodetect

Disk /dev/sdd: 4512 MB, 4512701440 bytes
139 heads, 62 sectors/track, 1022 cylinders
Units = cylinders of 8618 * 512 = 4412416 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1        1022     4403767   fd  Linux raid autodetect
linux:~ #
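The "disk offset" values in the yaboot debug log can be cross-checked against the fdisk geometry for /dev/sda. The Python sketch below is only a sanity check under two assumptions that are not stated in the thread: that /dev/sda3 begins exactly at the start of cylinder 185, and that the block numbers in the log are 4096-byte blocks counted from the start of that partition. The function names are made up for this illustration.

```python
SECTOR_BYTES = 512
SECTORS_PER_CYLINDER = 139 * 62   # heads * sectors/track = 8618, as fdisk reports
FS_BLOCK_BYTES = 4096             # "Reading 4096 bytes" in the yaboot log
SDA3_START_CYLINDER = 185         # from the fdisk -l output


def partition_start_bytes(start_cylinder):
    """Byte offset of a partition that begins on a cylinder boundary (assumption)."""
    return (start_cylinder - 1) * SECTORS_PER_CYLINDER * SECTOR_BYTES


def disk_offset(block, start_cylinder=SDA3_START_CYLINDER):
    """Absolute disk offset of a partition-relative 4096-byte block."""
    return partition_start_bytes(start_cylinder) + block * FS_BLOCK_BYTES


if __name__ == "__main__":
    # (block, offset) pairs copied from the debug log: first and last block shown.
    for block, logged in [(564422, 3123757056), (564429, 3123785728)]:
        computed = disk_offset(block)
        print(block, computed, "matches log" if computed == logged else "differs")
```

Both computed offsets agree with the logged ones, so the reads are hitting /dev/sda3 (the root partition) roughly 2.9 GiB into the 4.5 GB disk at the moment the firmware traps.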
On Wed, Aug 03, Erik Janssen wrote:
linux:~ # fdisk -l
Disk /dev/sda: 4512 MB, 4512701440 bytes
139 heads, 62 sectors/track, 1022 cylinders
Units = cylinders of 8618 * 512 = 4412416 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1           1        4278   41  PPC PReP Boot
/dev/sda2               5         184      775620   82  Linux swap
/dev/sda3             185        1022     3610942   83  Linux
There should be enough room for a FAT partition. Please add the global option 'force_fat' to your lilo.conf. Remove the '-s 8' option from /sbin/lilo and see if loading the kernel from FAT works better.
grep -n mkfs /sbin/lilo
615: mkfs.msdos -s 8 $OPTION_BOOT || exit 2
Olaf,
When I use force_fat and run lilo it says:
ERROR: !!! unknown option force_fat !!!
I must admit, it is the first time I have run lilo myself, and I now see that it is lilo that gives the error message from yast when I do the installation. I've just ignored the error, and for the normal (non-SMP) kernel it never gave me problems...
Lilo gives a warning about shrinking the PReP partition to prevent firmware confusion. Not sure if this is the default or could be a reason for the boot problems. This is the lilo output after removing the '-s 8':
linux:~ # lilo
running on chrp
Boot target is /dev/sda
Warning: Shrinking PReP boot partition /dev/sda1, avoiding firmware confusion
Installing /lib/lilo/chrp/yaboot.chrp onto /dev/sda1
Converted /etc/lilo.conf to /etc/yaboot.conf
setting open firmware variable boot-device to '/pci@fef00000/scsi@c/sd@8,0'
No common Block found
No common Block found
old boot-file (contains addition OF boot args for kernel, but breaks yaboot):
No common Block found
No common Block found
linux:~ #
Erik.
Hi Olaf and other listeners,
Success!!! After installing SP2 (which I could download after activating my support key, duhh...) and the SMP kernel, it boots without a problem!
I've changed the /sbin/lilo back to the default ('mkfs.msdos -s 8 $OPTION_BOOT || exit 2') and rerun lilo:
linux:/etc # lilo
running on chrp
Boot target is /dev/sda
Installing /lib/lilo/chrp/yaboot.chrp onto /dev/sda1
Converted /etc/lilo.conf to /etc/yaboot.conf
Prepending '/pci@fef00000/scsi@c/sd@8,0' to open firmware variable boot-device
No common Block found
No common Block found
No common Block found
Warning: old boot-file (contains addition OF boot args for kernel, but breaks yaboot):
No common Block found
No common Block found
linux:/etc #
So after installing SP2 the following message is gone:
Warning: Shrinking PReP boot partition /dev/sda1, avoiding firmware confusion
Not sure what patch changed that, but tell the programmer who wrote it: it works!
Just as a little proof, my /proc/cpuinfo:
linux:/ # more /proc/cpuinfo
processor : 0
cpu : 604ev
clock : 332MHz
revision : 1.0 (pvr 000a 0100)
bogomips : 330.75
processor : 1
cpu : 604ev
clock : 332MHz
revision : 1.0 (pvr 000a 0100)
bogomips : 330.75
processor : 2
cpu : 604ev
clock : 332MHz
revision : 1.0 (pvr 000a 0100)
bogomips : 330.75
processor : 3
cpu : 604ev
clock : 332MHz
revision : 1.0 (pvr 000a 0100)
bogomips : 330.75
total bogomips : 1323.00
machine : CHRP IBM,7026-H50
Ok, now I'm ready to install the hercules mainframe emulator and try and get it to use the four cpu's (which started this quest.)
Thanks for the great support. It sure gave me a crash course in linux.
Erik Janssen.
Olaf Hering wrote:
On Thu, Aug 04, Erik Janssen wrote:
Olaf,
When I use force_fat and run lilo it says: ERROR: !!! unknown option force_fat !!!
You need SP2 for that. Also, the SP2 kernel can (hopefully) write nvram on your system.
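A quick way to confirm that the SMP kernel really brought up all four CPUs is to count the "processor" stanzas in /proc/cpuinfo. The Python sketch below does that for text in the format Erik posted; the helper name `summarize_cpuinfo` and the embedded sample are illustrative only, and the field names ("processor", "bogomips") are taken from this post and differ on other architectures.

```python
# One per-CPU stanza in the format shown in the post (values copied from it).
STANZA = (
    "processor : {n}\n"
    "cpu : 604ev\n"
    "clock : 332MHz\n"
    "revision : 1.0 (pvr 000a 0100)\n"
    "bogomips : 330.75\n"
)
SAMPLE = "".join(STANZA.format(n=n) for n in range(4)) + (
    "total bogomips : 1323.00\n"
    "machine : CHRP IBM,7026-H50\n"
)


def summarize_cpuinfo(text):
    """Return (cpu_count, summed_bogomips) for PowerPC-style cpuinfo text."""
    count = 0
    bogomips = 0.0
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a "key : value" shape
        key = key.strip()
        if key == "processor":
            count += 1
        elif key == "bogomips":  # per-CPU value; "total bogomips" is a different key
            bogomips += float(value)
    return count, bogomips


if __name__ == "__main__":
    cpus, total = summarize_cpuinfo(SAMPLE)
    print("%d CPUs, %.2f bogomips in total" % (cpus, total))
```

On a live system the same function can be fed the real file, for example `summarize_cpuinfo(open("/proc/cpuinfo").read())`. For the posted output the per-CPU values sum to the reported "total bogomips" of 1323.00.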
Sorry, forgot to mention:
ROM Level ag030603
ROS Level ag030603
also latest...
Erik.
On Fri, Jul 29, Dan Stromberg wrote:
We're currently running AIX 5.1. IBM is telling us that AIX 5.3 will never support our SP2 switch, so our upgrade path seems to have disappeared shortly after purchasing the equipment.
I don't know if the kernel will boot on the SP2 switch. Did you already try it?
How's the SMP support on pSeries? Is it up to handling 32 CPU's well yet?
It appears to work well with > 32 cpus in SLES9.
One of the reasons the guy with the grant money for this compute cluster wanted AIX, was for the AIX compilers - so I need to ask: Will SuSe for pSeries be able to use xlc/xlC/xlf/xlf90/xlf95 and such?
There are xlc products available, from IBM.
On Tue, 2005-08-02 at 23:23 +0200, Olaf Hering wrote:
On Fri, Jul 29, Dan Stromberg wrote:
We're currently running AIX 5.1. IBM is telling us that AIX 5.3 will never support our SP2 switch, so our upgrade path seems to have disappeared shortly after purchasing the equipment.
I don't know if the kernel will boot on the SP2 switch. Did you already try it?
No, we haven't tried it. The only pSeries boxes I have access to are in full-time production mode. I'd've guessed that SuSe would boot, but would ignore the SP2 switch. Is there some reason why the kernel shouldn't boot on a system with an SP2 switch?
How's the SMP support on pSeries? Is it up to handling 32 CPU's well yet?
It appears to work well with > 32 cpus in SLES9.
Cool. :)
One of the reasons the guy with the grant money for this compute cluster wanted AIX, was for the AIX compilers - so I need to ask: Will SuSe for pSeries be able to use xlc/xlC/xlf/xlf90/xlf95 and such?
There are xlc products available, from IBM.
Great.
On Tue, Aug 02, Dan Stromberg wrote:
Is there some reason why the kernel shouldn't boot on a system with an SP2 switch?
http://penguinppc.org/ppc64/machines.php lists it as not supported. Perhaps you could give it a try and see how it fails. I don't know anything about it: whether it has a serial console, how the kernel has to be loaded, etc. Loading the CD1/install file via network is likely the way to go.
participants (3)
- Dan Stromberg
- Erik Janssen
- Olaf Hering