RE: [SLE] Problems with initrd after mkinitrd
-----Original Message----- From: Carl Hartung [mailto:suselinux@cehartung.com]
[...]
On Saturday 24 December 2005 14:34, Patrick Freeman wrote: <snippage>
...But now with up to 40 machines all getting the same treatment, between 30 and 60% will hang on initrd after I run mkinitrd.
Does this mean 12 have exhibited this behavior (30% of 40) or 24 (60% of 40)?
It means that each time we have run mkinitrd over the past year, somewhere between 30% and 60% of the machines have failed, with a maximum sample set of 40 machines at a time. The 60% figure is more typical of the smaller sample sets (around 5 units); 30% is probably the more realistic overall rate.
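For reference, a minimal sketch of a post-mkinitrd sanity check that could be scripted across a batch of machines before they are rebooted. The image path and module-name substring are assumptions, and the cpio listing assumes a 2.6-era gzipped-cpio initrd (an older ext2-style image would need to be loop-mounted instead):

    #!/bin/bash
    # Confirm the initrd was just rebuilt and that the (assumed) HighPoint
    # module actually ended up inside it.
    INITRD=/boot/initrd        # assumed path; adjust to the real image name
    MODULE=hpt                 # assumed substring of the HighPoint module name

    ls -l "$INITRD"            # timestamp should match the mkinitrd run
    if zcat "$INITRD" | cpio -it 2>/dev/null | grep -qi "$MODULE"; then
        echo "OK: HighPoint module present in $INITRD"
    else
        echo "WARNING: no HighPoint module found in $INITRD"
    fi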
If I were going to investigate the drive subsystems, I would:
- verify the quality of the power supplies and the data cables.
Done.
- check for poorly designed cases not allowing proper ventilation.
Heat may be a contributor, at least on some machines, but I believe they normally operate in spec (and I haven't found a temperature correlation; a quick SMART-based check is sketched after this list). Do you think a one-time heat problem could cause this symptom to show up later?
- ensure all drives have the same *current* firmware installed.
The drive microcode is up to date (I can't remember offhand *what* version that is), and the drives come from two revision sets -- the problem does not correlate with one or the other. (The sketch after this list also reads each drive's reported firmware revision.)
- ditto the mainboards and add-in controllers
Done.
- plan on a realistic 2% to 3% field failure rate in the first 30 to 90 days. (no OEM can afford to burn-in and stress-test every single drive)
These machines have all passed our burn-in process, and infant mortality is not a factor -- in fact, the symptom does not seem to be related to any kind of drive failure (as judged by whether the drives fail later).
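For reference, a hedged sketch of the per-drive check referred to above (drive temperature and firmware revision in one pass). Device names are examples, output field names vary by drive and smartctl version, and this assumes smartmontools is installed:

    #!/bin/bash
    # Print model, firmware revision and current temperature for each drive,
    # so the values can be correlated against the machines that hang.
    for dev in /dev/sda /dev/hda; do       # adjust to the actual boot drives
        [ -b "$dev" ] || continue
        echo "== $dev =="
        smartctl -i "$dev" | egrep -i 'model|firmware'
        smartctl -A "$dev" | grep -i temperature   # attribute 194 on many drives
    done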
FYI: I used to sell between $1 and $2 million in drives a year into the RISC/UNIX market. I've never purchased a WD drive, nor have I strayed from Seagate. I've never regretted that decision.
Thanks for that input -- I don't really have a choice in this matter. Given our company's position, we believe in the new enterprise-focused products Western Digital is producing (the Raptor and Raid Edition lines), and we are agnostic about which particular drives we boot from. Truthfully, apart from the expected bumps with new products, we have seen few issues with the Raptors and Raid Editions, and WD has responded extremely well on those fronts. On the 2.5" drive, they have recommended that we not use it in that application. But, again, since I have seen this problem with some of the 3.5" drives (WD400JB, 40GB), I am inclined to think it is some kind of process-related thing or a bug in *some* software.
Finally, I think you're going to have a real hard time pointing fingers back at WD, since you haven't clearly ruled out the HighPoint driver that you're compiling.
This is interesting (the part about the HighPoint driver)... my problem is that I don't understand how I can boot with this very same driver on one machine that is *identical* to another -- they were originally imaged from the same machine, and all the same steps were applied to the whole set -- and yet hang on the other one. Do you think the driver might have some kind of race condition that would cause it to break in one initrd and not the other, without showing up as an intermittent problem on each machine? I would be happy to hear that, in the sense that it would push me in one direction over the other, but I can't see it -- your experience may be better than mine here, however.
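For what it's worth, given two "identical" machines where one boots and one hangs, the two initrd images can be compared directly; a sketch, with placeholder filenames and the same gzipped-cpio assumption as above:

    # Copy /boot/initrd from the good and the bad machine to one host first.
    md5sum initrd.good initrd.bad           # identical sums -> identical images

    # If the sums differ, diff the archive contents to see what changed
    # (module binaries, linuxrc, module load order, and so on):
    zcat initrd.good | cpio -it 2>/dev/null | sort > good.lst
    zcat initrd.bad  | cpio -it 2>/dev/null | sort > bad.lst
    diff -u good.lst bad.lst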
HTH & regards,
Carl -- thank you -- it has helped me reinforce my thinking. Patrick.
On Monday 26 December 2005 04:48, Patrick Freeman wrote: <snip>
... I am inclined to think it is some kind of process-related thing or a bug in *some* software.
At this point, I concur.
This is interesting (the part about the HighPoint driver)... my problem is that I don't understand how I can boot with this very same driver on one machine that is *identical* to another -- they were originally imaged from the same machine, and all the same steps were applied to the whole set -- and yet hang on the other one.
"Identical" is a matter of granularity. At some "magnification" differences begin to appear. Clearly, these systems cannot actually be "identical" or they'd all fall over or boot correctly. Maybe the boot failures are just a symptom of a problem being introduced at write time. If the process somehow pushes the drives and/or subsystems right up to their performance limit, could it be that some are drifting into the margin and writing flaky images? Could the driver be contributing to this (is it running on the system you're using to write the images?) regards, - Carl
On Monday 2005-12-26 at 12:16 -0500, Carl Hartung wrote:
Maybe the boot failures are just a symptom of a problem being introduced at write time. If the process somehow pushes the drives and/or subsystems right up to their performance limit, could it be that some are drifting into the margin and writing flaky images?
I would think the image (on some of them, at least) is written in a place on the disk that the boot loader has difficulty reading from. Remember that at that point it has to use the BIOS services; the kernel is not running yet.

--
Cheers,
Carlos Robinson
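To test that idea, one could check where the initrd actually sits on disk relative to what the BIOS can address (the classic trouble spots are the 1024-cylinder / ~8 GB boundary on older BIOSes, or ~137 GB without 48-bit LBA). A sketch with assumed paths, for an ext2/ext3 /boot:

    fdisk -lu /dev/sda          # where does the partition holding /boot end?
    filefrag -v /boot/initrd    # physical blocks occupied by the image

If the image lands beyond the BIOS-addressable region on the machines that hang, the usual workaround is a small /boot partition at the very start of the disk.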
participants (3)
- Carl Hartung
- Carlos E. R.
- Patrick Freeman