HP DL 380 system with SuSE Pro 9.3 freezes up
Hi All,
I have an HP ProLiant DL 380 G4 server running SuSE Pro 9.3, kernel version 2.6.11.4-21.10. The server has all six drive bays filled with SCSI disks and has a Compaq Smart Array 6i RAID controller built in.
The disks with SCSI IDs 0 and 1 (72 GB each) form a RAID 1 (mirrored) volume, which has the following partitions:
/ (ext3)
/boot (ext3)
/usr/local (ext3)
swap (swap)
The remaining 4 disks (2, 3, 4 and 5, 300 GB each) form a RAID 0 volume that holds the /home partition.
Randomly the server freezes up. The network card will respond to pings, but none of the login services work (SSH, rsh etc.). The RiLO connectivity also works to the extent that I am able to log in to the RiLO card remotely, but I cannot log in to the OS itself. Actually, I cannot type anything into the console at all.
What I have invariably noticed is that the system is completely busy with SCSI disks 0 and 1, for some reason. The disk activity is 100% on this RAID volume. I don't see anything in the logs (which I can only check after a manual reboot, as there is no way to get into the system while the problem exists), and I have no clue how to debug this. This has happened about 5 times within the last 6 months. I have enabled a cron job which collects data from top, ps, w, iostat and mpstat, but nothing strange there either.
Has anyone seen this before? It would be a great help if you can share your thoughts and ideas about how to resolve this.
TIA,
Prakash
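For readers who want to reproduce the kind of cron-driven collection described above, here is a minimal Python sketch. It is not the poster's actual job: the thread names only the tools (top, ps, w, iostat, mpstat), so the flags in COMMANDS, the function names and the log path are all assumptions. Each command's output is flushed and synced to disk as soon as it is written, so that the last snapshot before a freeze has a better chance of surviving the reboot.

```python
import os
import subprocess
import time

# Assumed command lines: the thread names only the tools, not their flags.
COMMANDS = [
    ["top", "-b", "-n", "1"],
    ["ps", "auxww"],
    ["w"],
    ["iostat", "-x", "1", "2"],
    ["mpstat", "1", "1"],
]


def run_command(argv, timeout=30):
    """Run one command and return its output, or a note on why it failed."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
            universal_newlines=True,
        )
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "[failed: {}]\n".format(exc)


def write_snapshot(log_path, commands=COMMANDS, runner=run_command):
    """Append one timestamped snapshot of every command to log_path."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write("===== {} =====\n".format(stamp))
        for argv in commands:
            log.write("--- {} ---\n".format(" ".join(argv)))
            log.write(runner(argv))
            # Push each section to disk right away; a freeze may follow.
            log.flush()
            os.fsync(log.fileno())
```

A cron entry would call write_snapshot() with a path of your choosing. Since the busy volume in this report is the one holding /, it may be worth pointing the log at the other RAID volume or at a remote host, though whether that helps depends on what is actually hanging.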
On Tuesday 14 March 2006 14:01, Prakash Velayutham wrote:
Has anyone seen this before? It would be a great help if you can share your thoughts and ideas about how to resolve this.
Hi Prakash,
Just out of curiosity, why have you decided to run a non-certified Linux on a nice 'box' like that?
I mean, don't get me wrong... SUSE retail Linux is great. In fact, I run 9.2, 9.3 and 10.0 on my and family and friend's and even on client's 'home brew' systems; even small servers. But, if this were *my* problem and anything close to 365 x 24 availability were important, I'd seriously consider paying the extra money and installing SLES on that system. There *is* a certified SLES version for that hardware, isn't there?
regards,
Carl
On Tuesday 14 March 2006 3:04 pm, Carl Hartung wrote:
On Tuesday 14 March 2006 14:01, Prakash Velayutham wrote:
Has anyone seen this before? It would be a great help if you can share your thoughts and ideas about how to resolve this.
Hi Prakash,
Just out of curiosity, why have you decided to run a non-certified Linux on a nice 'box' like that?
I mean, don't get me wrong... SUSE retail Linux is great. In fact, I run 9.2, 9.3 and 10.0 on my and family and friend's and even on client's 'home brew' systems; even small servers. But, if this were *my* problem and anything close to 365 x 24 availability were important, I'd seriously consider paying the extra money and installing SLES on that system. There *is* a certified SLES version for that hardware, isn't there?
Carl,
You are correct. Here is the HP Linux Certification web site: http://www.hp.com/go/linuxcertification
Click on ProLiant servers and the Novell tab. You will find that the DL380 is fully supported by both SuSE and HP for both SLES 9 and Novell Open Enterprise Server. Note that the DL380 is available as a 32-bit x86 system and as a 64-bit system with either AMD-64 or Intel EM64T chips.
--
Jerry Feldman
Carl Hartung wrote:
On Tuesday 14 March 2006 14:01, Prakash Velayutham wrote:
Has anyone seen this before? It would be a great help if you can share your thoughts and ideas about how to resolve this.
Hi Prakash,
Just out of curiosity, why have you decided to run a non-certified Linux on a nice 'box' like that?
I mean, don't get me wrong... SUSE retail Linux is great. In fact, I run 9.2, 9.3 and 10.0 on my and family and friend's and even on client's 'home brew' systems; even small servers. But, if this were *my* problem and anything close to 365 x 24 availability were important, I'd seriously consider paying the extra money and installing SLES on that system. There *is* a certified SLES version for that hardware, isn't there?
regards,
Carl
Thanks Carl. Myself and my team just did not feel that we needed to go the Enterprise way. We really do not care for 24 X 7 availability, but it would be nice to know what went wrong, when it does go down. We will take a look at it again if we have to.
Prakash
On Tuesday 14 March 2006 15:44, Prakash Velayutham wrote:
Thanks Carl. Myself and my team just did not feel that we needed to go the Enterprise way. We really do not care for 24 X 7 availability, but it would be nice to know what went wrong, when it does go down. We will take a look at it again if we have to.
I *was* going to recommend, as a start, making certain the memory has been fully stress tested and confirming that the power supply isn't sometimes being overworked. These are typical underlying causes of random seizes like the one you're describing. But I then looked at the specs for that system and deduced that those are probably less likely issues... unless, of course, you got your hands on a 'bare bones' configuration and upgraded it yourself using less comprehensively tested components. If this is the case, you need to do some basic tests of all the parts you've added to rule out incompatible, marginal or even failing hardware.
Don't forget: DOA memory and hard disks aren't really all that uncommon; field failures can happen as late as 90 days out, once parts have finally been 'landed' and stressed by real world applications.
In the same vein, regarding the two '100% busy' disks: Anything unusual there? Have you run any diagnostics on them? Have you thought about swapping out the data cables? Are there diagnostics you can run on the RAID controller?
Carl
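On the '100% busy' question, one way to put a number on it between freezes is to sample /proc/diskstats twice and compare the "time spent doing I/O" counter, which is how iostat derives its %util column. The Python sketch below is illustrative only: the function names are made up, and it assumes the usual /proc/diskstats layout in which the tenth statistic after the device name is milliseconds spent doing I/O. It says nothing about the cause of the hang.

```python
def parse_diskstats(text):
    """Map device name -> milliseconds spent doing I/O, from diskstats text."""
    busy_ms = {}
    for line in text.splitlines():
        fields = line.split()
        # Whole-disk lines carry at least 11 counters after major, minor
        # and name. Shorter lines (partitions on older kernels) have no
        # busy-time counter, so they are skipped.
        if len(fields) < 14:
            continue
        busy_ms[fields[2]] = int(fields[12])
    return busy_ms


def utilization(before, after, interval_s):
    """Percent of the interval each device was busy between two samples."""
    interval_ms = interval_s * 1000.0
    return {
        name: 100.0 * (after[name] - before[name]) / interval_ms
        for name in after
        if name in before
    }


def read_busy_ms(path="/proc/diskstats"):
    """Take one sample from the running kernel."""
    with open(path) as stats:
        return parse_diskstats(stats.read())
```

Typical use would be to call read_busy_ms(), sleep for a known interval, call it again, and pass both samples to utilization(). A device that stays near 100 with few completed requests would point towards requests stuck in the controller or driver rather than a genuinely heavy workload, but that is only a hint to guide further diagnostics.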
Thanks Carl. Myself and my team just did not feel that we needed to go the Enterprise way. We really do not care for 24 X 7 availability, but it would be nice to know what went wrong, when it does go down. We will take a look at it again if we have to.
I won't be of any help, just wanted to let you know that I have SuSE Pro 9.0 and 9.3 running on several DL servers (both G3 and G4, and also on HP Lxr, Lc, Lh) and never had a problem. We will for sure use the Enterprise version when we move stuff that really "needs" it, like Oracle Applications.
Maybe a firmware upgrade? Actually, I had the kind of problem you describe, but with virtual machines (ESX Server), which does not seem to be your case anyway.
Regards,
Gaël
On Tue, 2006-03-14 at 14:01 -0500, Prakash Velayutham wrote:
Randomly the server freezes up. Network card will respond to pings, but none of the login services would work (like SSH, rsh etc.). Also the RiLO connectivity would work to the extent that I am able to login to the RiLO card remotely, but I cannot login to the OS itself. Actually I cannot type anything into the console at all.
Hi,
I also had strange behaviour on an HP DL380-G4. The system was installed with 9.2, although it was not used for anything. I tried to install 10.0 (official media), 10.1-B1 and B3, all of which failed completely: the installation froze without any clear indication of what or why. The media that failed were checked before installation and no problems were detected. The same media were used afterwards to install a DL360 and a DL-320 satisfactorily. I feared bad mem/cpu, but all tests ran without error.
Installation of 10.1-B6 went without a hitch, and I am now trying to do some XEN tests on it.
Hans.
--
pgp-id: 926EBB12
pgp-fingerprint: BE97 1CBF FAC4 236C 4A73 F76E EDFC D032 926E BB12
Registered linux user: 75761 (http://counter.li.org)
Hans Witvliet wrote:
I had also strange behaviour on a HP DL380-G4. System was installed with 9.2, although it was not used for anything. Tried to install 10.0 (official media) 10.1-B1 and B3 which failed completely: frozen during installation without clear indication what or why.
I've got one or two reports open with SUSE/Novell regarding installation on a Proliant box. An outdated PL1600 that I use purely for testing - beta3 had problems with hardware detection (I had to use hwprobe=-braille), and now beta6 locks up solidly in the first boot after installation completes.
Installation of 10.1-B6 went without a hitch. And trying to do some XEN-tests on it.
I've also installed B6 on a ML570 - no problems whatsoever.
/Per Jessen, Zürich
participants (6)
- Carl Hartung
- Gaël Lams
- Hans Witvliet
- Jerry Feldman
- Per Jessen
- Prakash Velayutham