RE: [SLE] Problems with initrd after mkinitrd
-----Original Message----- From: Carl Hartung [mailto:suselinux@cehartung.com]
[...]
On Saturday 24 December 2005 14:34, Patrick Freeman wrote: <snippage>
...But now with up to 40 machines all getting the same treatment, between 30 and 60% will hang on initrd after I run mkinitrd.
Does this mean 12 have exhibited this behavior (30% of 40) or 24 (60% of 40)?
It means that each time we have run mkinitrd over the past year, somewhere between 30% and 60% of the machines have failed, with a maximum sample set of 40 machines at a time. The 60% figure is more typical of the smaller sample sets (around 5 units); 30% is probably the more realistic overall rate.
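For reference, a minimal sketch of a post-mkinitrd sanity check that could be scripted across a batch of machines before they are rebooted. The image path and module-name substring are assumptions, and the cpio listing assumes a 2.6-era gzipped-cpio initrd (an older ext2-style image would need to be loop-mounted instead):

    #!/bin/bash
    # Confirm the initrd was just rebuilt and that the (assumed) HighPoint
    # module actually ended up inside it.
    INITRD=/boot/initrd        # assumed path; adjust to the real image name
    MODULE=hpt                 # assumed substring of the HighPoint module name

    ls -l "$INITRD"            # timestamp should match the mkinitrd run
    if zcat "$INITRD" | cpio -it 2>/dev/null | grep -qi "$MODULE"; then
        echo "OK: HighPoint module present in $INITRD"
    else
        echo "WARNING: no HighPoint module found in $INITRD"
    fi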
If I were going to investigate the drive subsystems, I would:
- verify the quality of the power supplies and the data cables.
Done.
- check for poorly designed cases not allowing proper ventilation.
Heat may be a contributor, at least on some machines, but I believe they normally operate in spec (and I haven't found a temperature correlation; a quick SMART-based check is sketched after this list). Do you think a one-time heat problem could cause this symptom to show up later?
- ensure all drives have the same *current* firmware installed.
The drive microcode is up to date (I can't remember offhand *what* version that is), and the drives come from two revision sets -- the problem does not correlate with one or the other. (The sketch after this list also reads each drive's reported firmware revision.)
- ditto the mainboards and add-in controllers
Done.
- plan on a realistic 2% to 3% field failure rate in the first 30 to 90 days. (no OEM can afford to burn-in and stress-test every single drive)
These machines have all passed our burn-in process, and infant mortality is not a factor -- in fact, the symptom does not seem to be related to any kind of drive failure (as judged by whether the drives fail later).
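For reference, a hedged sketch of the per-drive check referred to above (drive temperature and firmware revision in one pass). Device names are examples, output field names vary by drive and smartctl version, and this assumes smartmontools is installed:

    #!/bin/bash
    # Print model, firmware revision and current temperature for each drive,
    # so the values can be correlated against the machines that hang.
    for dev in /dev/sda /dev/hda; do       # adjust to the actual boot drives
        [ -b "$dev" ] || continue
        echo "== $dev =="
        smartctl -i "$dev" | egrep -i 'model|firmware'
        smartctl -A "$dev" | grep -i temperature   # attribute 194 on many drives
    done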
FYI: I used to sell between $1 and $2 million in drives a year into the RISC/UNIX market. I've never purchased a WD drive, nor have I strayed from Seagate. I've never regretted that decision.
Thanks for that input -- I don't really have a choice in this matter. Given our company's position, we believe in the new enterprise-focused products Western Digital is producing (the Raptor and Raid Edition lines), and we are agnostic about which particular drives we boot from. Truthfully, apart from the expected bumps with new products, we have seen few issues with the Raptors and Raid Editions, and WD has responded extremely well on those fronts. On the 2.5" drive, they have recommended that we not use it in that application. But, again, since I have seen this problem with some of the 3.5" drives (WD400JB, 40GB), I am inclined to think it is some kind of process-related thing or a bug in *some* software.
Finally, I think you're going to have a real hard time pointing fingers back at WD, since you haven't clearly ruled out the HighPoint driver that you're compiling.
This is interesting (the part about the HighPoint driver)... my problem is that I don't understand how I can boot with this very same driver on one machine that is *identical* to another -- they were originally imaged from the same machine, and all the same steps were applied to the whole set -- and yet hang on the other one. Do you think the driver might have some kind of race condition that would cause it to break in one initrd and not the other, without showing up as an intermittent problem on each machine? I would be happy to hear that, in the sense that it would push me in one direction over the other, but I can't see it -- your experience may be better than mine here, however.
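For what it's worth, given two "identical" machines where one boots and one hangs, the two initrd images can be compared directly; a sketch, with placeholder filenames and the same gzipped-cpio assumption as above:

    # Copy /boot/initrd from the good and the bad machine to one host first.
    md5sum initrd.good initrd.bad           # identical sums -> identical images

    # If the sums differ, diff the archive contents to see what changed
    # (module binaries, linuxrc, module load order, and so on):
    zcat initrd.good | cpio -it 2>/dev/null | sort > good.lst
    zcat initrd.bad  | cpio -it 2>/dev/null | sort > bad.lst
    diff -u good.lst bad.lst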
HTH & regards,
Carl -- thank you -- it has helped me reinforce my thinking. Patrick.
On Monday 26 December 2005 04:48, Patrick Freeman wrote: <snip>
... I am inclined to think it is some kind of process-related thing or a bug in *some* software.
At this point, I concur.
This is interesting (the part about the HighPoint driver)... my problem is that I don't understand how I can boot with this very same driver on one machine that is *identical* to another -- they were originally imaged from the same machine, and all the same steps were applied to the whole set -- and yet hang on the other one.
"Identical" is a matter of granularity. At some "magnification" differences begin to appear. Clearly, these systems cannot actually be "identical" or they'd all fall over or boot correctly. Maybe the boot failures are just a symptom of a problem being introduced at write time. If the process somehow pushes the drives and/or subsystems right up to their performance limit, could it be that some are drifting into the margin and writing flaky images? Could the driver be contributing to this (is it running on the system you're using to write the images?) regards, - Carl
On Monday 2005-12-26 at 12:16 -0500, Carl Hartung wrote:
Maybe the boot failures are just a symptom of a problem being introduced at write time. If the process somehow pushes the drives and/or subsystems right up to their performance limit, could it be that some are drifting into the margin and writing flaky images?
I would think the image (on some of them, at least) is written in a place on the disk that the boot loader has difficulty reading from. Remember that at that point it has to use the BIOS services; the kernel is not running yet.

--
Cheers,
Carlos Robinson
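To test that idea, one could check where the initrd actually sits on disk relative to what the BIOS can address (the classic trouble spots are the 1024-cylinder / ~8 GB boundary on older BIOSes, or ~137 GB without 48-bit LBA). A sketch with assumed paths, for an ext2/ext3 /boot:

    fdisk -lu /dev/sda          # where does the partition holding /boot end?
    filefrag -v /boot/initrd    # physical blocks occupied by the image

If the image lands beyond the BIOS-addressable region on the machines that hang, the usual workaround is a small /boot partition at the very start of the disk.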
participants (3)
- Carl Hartung
- Carlos E. R.
- Patrick Freeman