RE: [SLE] Problems with initrd after mkinitrd
Carl, Thanks again -- I just wanted to fill you in on the information that was missing, but have not yet finished my experiment. Thanks, again -- see below.
-----Original Message-----
<snip>
I have to acknowledge this problem exist(s/ed) on older hardware, but it is also important to point out that modern drives and BIOSes set to use LBA (which Grub and Linux 'speak') cleanly eliminate this specific limitation. AIUI, under LBA, the current "interesting" addressing barrier is 137GB.
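For reference, the 137GB figure is what 28-bit LBA gives with 512-byte sectors. Both numbers are assumptions here, since the thread states only the resulting limit. A quick check in Python:

```python
SECTOR_BYTES = 512  # assumption: traditional 512-byte sectors


def lba_limit_bytes(address_bits: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Largest capacity addressable with an LBA field of the given width."""
    return (1 << address_bits) * sector_bytes


limit28 = lba_limit_bytes(28)  # assumption: the barrier meant is 28-bit LBA
print(f"28-bit LBA limit: {limit28} bytes (~{limit28 / 10**9:.1f} GB)")

# A 40GB drive, the size discussed in this thread, sits far below that limit.
print("40GB drive fits:", 40 * 10**9 < limit28)
```

Under those assumptions the limit comes out at roughly 137.4 decimal gigabytes, which matches the figure quoted above.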
(BTW, thanks for the input, Terry) Well, interestingly, all of the machines were booting at cylinder 2082 prior to the mkinitrd problem -- because I wasn't concerned with it (see the board info below). My previous practice had been to use a 100 MB /boot partition, but I was doing much different things then -- and on older equipment.
And Patrick,
I've studied the entire thread again, top to bottom, to see if we have overlooked something, and I'll share these thoughts:
1. We don't know the manufacturer or age of the machines we're discussing... no mainboard or CPU models/numbers etc... the only 'clue' is that you've seen this problem "to some degree" under SuSE 9.1. The ages and make/model of these systems are important clues.
All machines:
-- AIC RMC2E2 chassis with hotpluggable SATA backplane.
-- Intel SE7520BD2 motherboards with 6GB Registered ECC DDR RAM.
-- Dual 2.8GHz (Low Voltage) Xeons
-- Ages range from ~1 year to ~2 months.
2. Is the 2.5" to 3.5" "adapter" just a mechanical carrier, or is it also involved in the (data) cabling? This adapter (coupled with WD's specific recommendation that you *not* use this drive in your application) *and* the inclusion of the highpoint driver are big enough 'atypical' factors that they deserve special attention. You need to be certain that these have been properly ruled out and aren't involved/contributing/confusing matters.
The adapter is passive. My main problem with a hardware based resolution, is that things work 100% of the time with installation and probably something like 95-98% of the time with imaging (the only failures being the grub hangs which are usually corrected by grub re-install -- which I think are actually the same symptom (retrospectively) except it is hanging loading the kernel and before printing to the screen??).
3. We should not overlook this one important clue: "... requires re-installing grub which sometimes produces a hang on loading stage1.5."
All things being equal, I think Carlos' hypothesis is likely correct. I'll be looking forward to your "report", so have fun and good luck!
<snip> Thanks -- my initial results: I resized reiserfs on three machines that were hanging (to move swap to the inside of the drive) and created an ext2-formatted partition between cylinders 1 and 900, which, predictably, solved the problem. I am now waiting to allocate 20 machines (10 control, 10 experimental) on which to perform initrd updates and then reboot. The control group will have a single / partition and the experimental group will have the /boot partition as described above. I'll keep you all updated -- thanks again. Patrick
Hi Patrick, More fodder to keep you up at nights ;-) On Tuesday 27 December 2005 13:16, Patrick Freeman wrote: <snip>
(BTW, thanks for the input, Terry) Well, interestingly, all of the machines were booting at cylinder 2082 prior to the mkinitrd problem --
Each cloned drive should theoretically have /boot and initrd "land" in the same physical location when the image is written, correct? Yet, sometimes and for unknown reasons, the clone doesn't completely "take" and you have to reinstall Grub to make the system bootable. Hmm... interesting clue... Please tell me you meant "sector" as in LBA and *not* "cylinder" as in CHS addressing? <shiver!><Ugh!> (Pardon my flashbacks...!)
All machines:
-- AIC RMC2E2 chassis with hotpluggable SATA backplane.
-- Intel SE7520BD2 motherboards with 6GB Registered ECC DDR RAM.
-- Dual 2.8GHz (Low Voltage) Xeons
-- Ages range from ~1 year to ~2 months.
Your 40GB drives, mainboards and the BIOS are all contemporary enough to support LBA, meaning the physical location of /boot and initrd is not an issue... unless, of course, you accepted "Auto" in the BIOS' IDE drive setup and the default just happens to be CHS (required by M$haft.) Under that scenario, there *is* a possibility that the BIOS could be calculating (and using) a geometry for the 'transplants' that doesn't perfectly match the imaged partition table.
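To make the geometry-mismatch scenario concrete, here is a minimal Python sketch of the standard CHS/LBA conversion. The two geometries (255 and 240 heads, 63 sectors per track) and the sample address are illustrative assumptions, not values taken from Patrick's machines.

```python
from typing import NamedTuple, Tuple


class Geometry(NamedTuple):
    heads: int              # heads per cylinder
    sectors_per_track: int  # sectors per track


def chs_to_lba(c: int, h: int, s: int, geo: Geometry) -> int:
    """Standard CHS -> LBA conversion; CHS sector numbers start at 1."""
    return (c * geo.heads + h) * geo.sectors_per_track + (s - 1)


def lba_to_chs(lba: int, geo: Geometry) -> Tuple[int, int, int]:
    """Inverse conversion under the same geometry."""
    c, rem = divmod(lba, geo.heads * geo.sectors_per_track)
    h, s0 = divmod(rem, geo.sectors_per_track)
    return c, h, s0 + 1


# Assumed geometries: one used when the image was made, one computed by the
# BIOS of the machine the drive is transplanted into.
imaging_geo = Geometry(heads=255, sectors_per_track=63)
target_geo = Geometry(heads=240, sectors_per_track=63)

chs = (100, 5, 10)  # arbitrary sample address
lba_written = chs_to_lba(*chs, imaging_geo)
lba_read = chs_to_lba(*chs, target_geo)
print("sector intended:", lba_written)
print("sector actually read:", lba_read)
print("off by", lba_written - lba_read, "sectors")
```

If the two geometries agree, both conversions return the same sector. When they differ, a CHS address stored in the partition table resolves to a different physical sector, which is the kind of mismatch described above.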
The adapter is passive. My main problem with a hardware based resolution, is that things work 100% of the time with installation
Under this scenario, the drive starts its service life out being prepped and used in situ under the same BIOS. Have any of *these* systems experienced boot failures following mkinitrd?
and probably something like 95-98% of the time with imaging (the only failures being the grub hangs which are usually corrected by grub re-install -- which I think are actually the same symptom (retrospectively)
What is really missing in all of these discussions are the results of meaningful forensics on all of the drives that are failing to boot. I think this approach would bear fruit more quickly than a twenty system lab experiment (maybe it isn't as much *fun* but that's another topic...) regards, - Carl
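As one possible starting point for that kind of forensics, the partition table in sector 0 can be decoded and compared between a drive that boots and one that does not. This is a sketch that assumes a classic MBR layout (four 16-byte entries at offset 446, 0x55AA signature at offset 510). The function and class names are made up for illustration.

```python
import struct
from typing import List, NamedTuple, Tuple


class PartitionEntry(NamedTuple):
    bootable: bool
    type_id: int
    chs_start: Tuple[int, int, int]  # (cylinder, head, sector)
    chs_end: Tuple[int, int, int]
    lba_start: int
    sector_count: int


def _unpack_chs(raw: bytes) -> Tuple[int, int, int]:
    """Decode the packed 3-byte CHS field of an MBR partition entry."""
    head, sec_cyl, cyl_low = raw
    sector = sec_cyl & 0x3F                         # low 6 bits
    cylinder = ((sec_cyl & 0xC0) << 2) | cyl_low    # 10 bits in total
    return cylinder, head, sector


def parse_mbr(sector0: bytes) -> List[PartitionEntry]:
    """Return the non-empty primary partition entries of an MBR sector."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("not an MBR: missing 0x55AA signature")
    entries = []
    for i in range(4):
        raw = sector0[446 + 16 * i: 446 + 16 * (i + 1)]
        status, chs_s, ptype, chs_e, lba, count = struct.unpack("<B3sB3sII", raw)
        if ptype == 0:  # unused slot
            continue
        entries.append(PartitionEntry(status == 0x80, ptype,
                                      _unpack_chs(chs_s), _unpack_chs(chs_e),
                                      lba, count))
    return entries

# Usage idea (needs read access to the raw device; the path is hypothetical):
#   with open("/dev/sdX", "rb") as dev:
#       for entry in parse_mbr(dev.read(512)):
#           print(entry)
```

Comparing the stored CHS fields against the LBA fields on a failing clone and on a drive installed from scratch would show whether the two disagree about where a partition starts.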
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 The Tuesday 2005-12-27 at 19:18 -0500, Carl Hartung wrote:
More fodder to keep you up at nights ;-)
On Tuesday 27 December 2005 13:16, Patrick Freeman wrote: <snip>
(BTW, thanks for the input, Terry) Well, interestingly, all of the machines were booting at cylinder 2082 prior to the mkinitrd problem --
Each cloned drive should theoretically have /boot and initrd "land" in the same physical location when the image is written, correct? Yet, sometimes and for unknown reasons, the clone doesn't completely "take" and you have to reinstall Grub to make the system bootable. Hmm... interesting clue...
Please tell me you meant "sector" as in LBA and *not* "cylinder" as in CHS addressing? <shiver!><Ugh!> (Pardon my flashbacks...!)
No, it is track, ie, cylinder. That doesn't imply that the disks are using CHS addressing, but only that he is partitioning using track numbers, as is usual in fdisk.
Also, I gather he is talking of mkinitrd images, ie, ramdisk images, not the disk cloning images. It is something different. The systems crash sometimes after changing the initrd image.
I suggested that there could be a problem, in his case, when the image was placed above some track number, perhaps 1024, and he is testing if that hypothesis is correct by creating separate /boot partitions at low track numbers.
The basis for my idea is that grub and lilo have to use bios calls at first to read the disk, and the bios might have problems with those disks. It's a wild, educated guess ;-) Let's wait and see! I'm curious :-)
- -- Cheers, Carlos Robinson
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (GNU/Linux) Comment: Made with pgp4pine 1.76 iD8DBQFDseratTMYHG2NR9URAgFEAJ0Sfkiq3BWjKZS2lvekpZ2KV1l7EgCgg+s5 cWwWDPGpe5p7JuiPyeGShzg= =5X4x -----END PGP SIGNATURE-----
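The "track 1024" idea can be put into numbers. The classic BIOS CHS interface has a 10-bit cylinder field, so it reaches cylinders 0 to 1023. The sketch below assumes the common translated geometry of 255 heads and 63 sectors per track with 512-byte sectors. The thread does not state the geometry these drives actually report, so treat the byte offsets as illustrative.

```python
HEADS = 255               # assumption: common translated geometry
SECTORS_PER_TRACK = 63    # assumption
SECTOR_BYTES = 512        # assumption
BIOS_CHS_CYLINDERS = 1024  # 10-bit cylinder field: cylinders 0..1023

CYLINDER_BYTES = HEADS * SECTORS_PER_TRACK * SECTOR_BYTES


def cylinder_start_offset(cylinder: int) -> int:
    """Byte offset at which the given cylinder begins."""
    return cylinder * CYLINDER_BYTES


def reachable_via_chs(cylinder: int) -> bool:
    """True if a BIOS limited to CHS calls could address this cylinder."""
    return cylinder < BIOS_CHS_CYLINDERS


for cyl in (900, 1024, 2082):
    gb = cylinder_start_offset(cyl) / 10**9
    print(f"cylinder {cyl}: starts at ~{gb:.1f} GB, "
          f"CHS-reachable: {reachable_via_chs(cyl)}")
```

Under these assumptions the boundary sits at about 8.4GB. A /boot partition ending at cylinder 900 lies entirely below it, while files around cylinder 2082 lie well above it, so the two groups in Patrick's experiment fall on opposite sides of the limit being tested.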
On Tuesday 27 December 2005 20:31, Carlos E. R. wrote:
No, it is track, ie, cylinder. That doesn't imply that the disks are using CHS addressing, but only that he is partitioning using track numbers, as is usual in fdisk.
Hi Carlos, As I'm sure you're aware ;-) roughly a minute after you posted this, Patrick wrote that he meant "sector 2082" and not "cylinder." (I'm not convinced, however, that /he's/ convinced, so I'm afraid we'll both have to stay tuned...) Also, did you know that fdisk can be used in units of 'cylinders' *or* 'sectors'... your choice! man fdisk ;-) I don't know that Patrick was actually /using/ fdisk, only that sectors are a natural metric for partitioning drives that use LBA.
Also, I gather he is talking of mkinitrd images, ie, ramdisk images, not the disk cloning images. It is something different. The systems crash sometimes after changing the initrd image.
In retrospect, I can see how my choice of words might have confused you. Let me clarify my understanding:
Patrick says he is experiencing two 'flavors' of boot failure:
- One, less frequent, is the failure of some of the cloned drives to boot immediately after they've been created and installed. He is presently overcoming this 'flavor' of boot failure by reinstalling Grub.
- The second 'flavor,' which is occurring more frequently, is a failure to boot after modifying the normally running system and creating a new initrd (with mkinitrd.) IOW, these cloned drives have /not/ failed to boot, initially, and the systems have been running normally for some time. Then, after installing updates, he's run mkinitrd and the systems suddenly fail to boot. He is presently overcoming this 'flavor' by tarring up the drive contents, repartitioning the drive and restoring the contents.
In both cases, the boot failures are occurring *only* on the drives that have been cloned. He is not experiencing either type of boot failure in those systems where the drives have been installed raw and the systems have been built and upgraded from scratch.
Is the scope of his problem (and my comprehension of it) now clearer?
I suggested that there could be a problem, in his case, when the image was placed above some track number, perhaps 1024, and he is testing if that hypothesis is correct by creating separate /boot partitions at low track numbers.
But he is building new systems using contemporary components that support Logical Block Addressing. As I understand it, these systems should have no difficulty booting from any location on a 40GB disk.
The basis for my idea is that grub and lilo have to use bios calls at first to read the disk, and the bios might have problems with those disks. It's a wild, educated guess ;-)
I agree that the BIOS limitation you're alluding to is still a common problem, but I think it only concerns older hardware than what Patrick is dealing with. I am increasingly confident that the error lies somewhere in the realm of drive address calculations and/or translations. That is /my/ educated guess. I am still thinking about ways to test the drives for proof, though. Any ideas? regards, - Carl
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 The Wednesday 2005-12-28 at 02:24 -0500, Carl Hartung wrote:
On Tuesday 27 December 2005 20:31, Carlos E. R. wrote:
No, it is track, ie, cylinder. That doesn't imply that the disks are using CHS addressing, but only that he is partitioning using track numbers, as is usual in fdisk.
Hi Carlos,
As I'm sure you're aware ;-) roughly a minute after you posted this, Patrick wrote that he meant "sector 2082" and not "cylinder." (I'm not convinced, however, that /he's/ convinced, so I'm afraid we'll both have to stay tuned...)
Yes, I saw it as I was going offline. I connect by modem, so I would have had to reconnect to answer it - and anyway, later he said cylinders again ;-)
Also, did you know that fdisk can be used in units of 'cylinders' *or* 'sectors'... your choice! man fdisk ;-)
Yeah, I know... I'm old fashioned in this. Cylinder numbers are easier to handle, smaller, and some software still complains if the partitions end at the middle of a track, even if "track" no longer has a true physical meaning. A disk can have three real heads and report a dozen or a hundred! It just makes life easier ;-)
I don't know that Patrick was actually /using/ fdisk, only that sectors are a natural metric for partitioning drives that use LBA.
True.
Also, I gather he is talking of mkinitrd images, ie, ramdisk images, not the disk cloning images. It is something different. The systems crash sometimes after changing the initrd image.
In retrospect, I can see how my choice of words might have confused you. Let me clarify my understanding:
Yes, it clarifies the situation a lot, I hadn't noticed the first type of failure you mention.
Patrick says he is experiencing two 'flavors' of boot failure:
- One, less frequent, is the failure of some of the cloned drives to boot immediately after they've been created and installed. He is presently overcoming this 'flavor' of boot failure by reinstalling Grub.
- The second 'flavor,' which is occurring more frequently, is a failure to boot after modifying the normally running system and creating a new initrd (with mkinitrd.) IOW, these cloned drives have /not/ failed to boot, initially, and the systems have been running normally for some time. Then, after installing updates, he's run mkinitrd and the systems suddenly fail to boot. He is presently overcoming this 'flavor' by tarring up the drive contents, repartitioning the drive and restoring the contents.
In both cases, the boot failures are occurring *only* on the drives that have been cloned. He is not experiencing either type of boot failure in those systems where the drives have been installed raw and the systems have been built and upgraded from scratch.
I think he said later that this was not completely true. However, the word "clone" gives me ideas for the first type of problem. How were they cloned? The thing is, modern HDs are not "born equal", even if they are of the same model. The number of sectors varies - so says Seagate, I read it. It is due to the fact that some sectors fail after manufacture and are just mapped out, or moved to the space reserved for bad blocks, so the user never knows. Now, how does that affect cloning? If the cloning software attempts to create an identical image in the same positions, couldn't some sectors fall where they shouldn't? Just a wild idea. Some imaging software, like Ghost, that allows resizing the cloned image would overcome this problem: it doesn't create an exact image. Other methods, like dd done on the device, would perhaps not - then I might be wrong.
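One cheap way to probe the cloning idea is to compare sector counts before an image is written. The sketch below is only an illustration: the helper names are invented, 512-byte sectors are assumed, and in-memory buffers stand in for the image and the target drive. On a real system the same functions could be handed files or devices opened in binary mode.

```python
import io
import os
from typing import BinaryIO, Tuple


def stream_size(stream: BinaryIO) -> int:
    """Size in bytes of a seekable binary stream; restores the position."""
    pos = stream.tell()
    size = stream.seek(0, os.SEEK_END)
    stream.seek(pos)
    return size


def clone_fits(image: BinaryIO, target: BinaryIO,
               sector_bytes: int = 512) -> Tuple[bool, int]:
    """Return (fits, spare_sectors) for writing image onto target."""
    image_sectors = stream_size(image) // sector_bytes
    target_sectors = stream_size(target) // sector_bytes
    return image_sectors <= target_sectors, target_sectors - image_sectors


# Stand-ins: a 1000-sector image and a target that is 10 sectors smaller.
image = io.BytesIO(bytes(512 * 1000))
target = io.BytesIO(bytes(512 * 990))
fits, spare = clone_fits(image, target)
print("image fits on target:", fits, "| spare sectors:", spare)
```

A negative spare count would mean a raw, dd-style copy gets truncated on that target, which is one concrete way an "identical" clone could end up incomplete.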
Is the scope of his problem (and my comprehension of it) now clearer?
Yeap :-)
I suggested that there could be a problem, in his case, when the image was placed above some track number, perhaps 1024, and he is testing if that hypothesis is correct by creating separate /boot partitions at low track numbers.
But he is building new systems using contemporary components that support Logical Block Addressing. As I understand it, these systems should have no difficulty booting from any location on a 40GB disk.
Correct. That's the theory. ;-) For example, in this system of 2000 vintage I installed a new 160 GB HD. I can boot old msdos 6.0 from it. But grub has problems with it! Even more, if I place the swap partition on it, suspend/awake is not reliable. I know the bios is old, but then, msdos does boot, and it shouldn't.
The basis for my idea is that grub and lilo have to use bios calls at first to read the disk, and the bios might have problems with those disks. It's a wild, educated guess ;-)
I agree that the BIOS limitation you're alluding to is still a common problem, but I think it only concerns older hardware than what Patrick is dealing with. I am increasingly confident that the error lies somewhere in the realm of drive address calculations and/or translations.
Yes, that should be the case, however... I wonder.
That is /my/ educated guess. I am still thinking about ways to test the drives for proof, though. Any ideas?
Mmmm... no... - -- Cheers, Carlos Robinson -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (GNU/Linux) Comment: Made with pgp4pine 1.76 iD8DBQFDsoSztTMYHG2NR9URAqpzAJ4o1XLDnUptx5JFn3cVviBZtL8YHACfZHzt SEmSA3jwJ755+LgKKFc4uQk= =S3qF -----END PGP SIGNATURE-----
participants (3): Carl Hartung, Carlos E. R., Patrick Freeman