-----Original Message-----
From: Carl Hartung [mailto:suselinux@cehartung.com]
<snip>
As I'm sure you're aware ;-) roughly a minute after you posted this, Patrick wrote that he meant "sector 2082" and not "cylinder." (I'm not convinced, however, that /he's/ convinced, so I'm afraid we'll both have to stay tuned...)
Ok, so I looked -- that is cylinder 2082 (sorry about being a dolt, but I have a tendency to *fuzz* up that which I don't absolutely need to know after I've checked it out -- all I really knew was that there should be no issues with BIOS access to the disk). I appreciate your gentle nudges, however, instead of calling me an idiot (which might have been warranted here). I am very interested in what you said in another post about C,H,S being defaulted in the BIOS; I am going to look through the Intel docs on this to see if there is any reference. [After looking: this document: ftp://download.intel.com/support/motherboards/server/se7520bd2/sb/se7520bd2_server_board_tps_r23.pdf indicates on page 62 that LBA is the default in the BIOS for devices that support it.] <snip>
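(For the drive side of the same question, a minimal check, assuming hdparm is available and the disk is /dev/hda as elsewhere in this thread:

    # Ask the drive itself which addressing modes it supports;
    # look for "LBA" (and "LBA48" on drives over 137 GB) in the
    # capabilities section of the identify data.
    hdparm -I /dev/hda | grep -i lba

If the drive reports LBA and the BIOS defaults to it, the classic cylinder-limit problems shouldn't apply.)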
I don't know that Patrick was actually /using/ fdisk, only that sectors are a natural metric for partitioning drives that use LBA.
Yes, I was using fdisk, but I had long since memorized the number for scripts or hand editing, and hadn't looked at it in a while -- here is an 'fdisk -l /dev/hda' for the curious:

Disk /dev/hda: 40.0 GB, 40007761920 bytes
16 heads, 63 sectors/track, 77520 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1               1        2081     1048792+  82  Linux swap / Solaris
/dev/hda2   *        2082       77520    38021256   83  Linux
<snip>
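(A quick sanity check on those numbers, using only arithmetic from the fdisk output above -- cylinder numbers are 1-based:

    start of /dev/hda2 = (2082 - 1) cylinders * 1008 sectors/cylinder
                       = 2,097,648 sectors
                       = 2,097,648 * 512 bytes ~= 1.07 GB into the disk

So the boot/root partition begins only about 1 GB in, well within reach of any LBA-capable BIOS -- though of course the initrd itself can live anywhere inside that 38 GB partition.)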
- One, less frequent, is the failure of some of the cloned drives to boot immediately after they've been created and installed. He is presently overcoming this 'flavor' of boot failure by reinstalling Grub.
Correct.
- The second 'flavor,' which is occurring more frequently, is a failure to boot after modifying the normally running system and creating a new initrd (with mkinitrd). IOW, these cloned drives have /not/ failed to boot initially, and the systems have been running normally for some time. Then, after installing updates, he's run mkinitrd and the systems suddenly fail to boot. He is presently overcoming this 'flavor' by tarring up the drive contents, repartitioning the drive and restoring the contents.
Correct.
In both cases, the boot failures are occurring *only* on the drives that have been cloned. He is not experiencing either type of boot failure in those systems where the drives have been installed raw and the systems have been built and upgraded from scratch.
I'm sorry -- I haven't been clear enough. The second flavor occurs on *all* drives after updates (where *all* means the sample set can be of either type, fresh-install or imaged, and produces the same roughly 30% hang rate after mkinitrd).

I've been thinking out loud in these posts, and in the process have been a bit confusing -- I apologize. Some posts made me realize I hadn't thought about one thing or another, so I went back and looked. I think my earlier response was to the suggestion that there might be some kind of problem with the BIOS accessing the drive: I assumed that fresh installs would write near the front of the disk first and that subsequent updates would potentially get written elsewhere. Then I remembered that the partition started at cylinder 2082 (but confused cylinder and sector) and continued to be perplexed.

So my state at that point was: I thought the description of the problem (BIOS not able to read the part of the disk where the initrd resides) fit the behavior I was seeing very well, even though I *knew* that that particular problem *shouldn't* affect me. I thus came to the same hypothesis Carlos did: that some potentially *unknown* BIOS problem causing part of the disk to be inaccessible might be the issue.

The main behavior is intermittent, and it could be hardware related (BTW, WD recommended we not use this drive because it was designed for a desktop, not server, duty cycle). Subsequent updates on drives that hung previously sometimes hang and sometimes do not. In my swamped-ness, I have looked now and again at *what* I was doing and could find no *reason*. The most recent hang was due to an updated xfs driver -- a minor modification of the source that we made in house and compiled on a different machine. The module loads fine on a running system, and in this case was updated in the initrd without issue on 2 out of 5 machines. <snip>
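(A minimal way to check whether the module actually made it into a given image intact -- a sketch only, since the exact initrd format depends on the mkinitrd in use; SUSE images of this era may be either a gzipped cpio archive or a gzipped ext2 image:

    # If the initrd is a gzipped cpio archive (2.6-era initramfs):
    zcat /boot/initrd | cpio -itv | grep xfs

    # If it is a gzipped ext2 image instead, loop-mount it:
    zcat /boot/initrd > /tmp/initrd.img
    mount -o loop /tmp/initrd.img /mnt
    find /mnt -name 'xfs*'

Comparing the embedded copy of the module between a machine that hung and one that didn't might show whether mkinitrd wrote it out differently.)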
But he is building new systems using contemporary components that support Logical Block Addressing. As I understand it, these systems should have no difficulty booting from any location on a 40GB disk.
Exactly why I hadn't considered BIOS to be at issue.
I agree that the BIOS limitation you're alluding to is still a common problem, but I think it only concerns hardware older than what Patrick is dealing with. I am increasingly confident that the error lies somewhere in the realm of drive address calculations and/or translations.
What you state here (about calculations/translations), could explain the problem quite nicely (and actually, I think, is a great fit for what Carlos is saying).
That is /my/ educated guess. I am still thinking about ways to test the drives for proof, though. Any ideas?
I have the used blocks ( -D ) output from debugreiserfs from both a working and a hung system. I think it has enough information to tell what part of the filesystem the initrd is in -- I am not sure if it indicates the LBA sectors. If it does, I will be able to glean that in a *very* tedious process. But this still wouldn't answer your question, I think, since it wouldn't tell us if the BIOS can properly address and read those blocks. So, I am curious, too (but I am also very grateful for the help I have already received and do not want to appear greedy). <snip>

Thanks,
Patrick
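(Two possible shortcuts here, offered as untested suggestions. First, filefrag from e2fsprogs uses the FIBMAP ioctl and may work on reiserfs; it reports filesystem-relative block numbers, which become absolute LBA sectors once you add the partition's start sector:

    filefrag -v /boot/initrd
    # absolute LBA = partition start sector + block * (blocksize / 512)
    # for /dev/hda2 above, the partition starts at sector 2,097,648

Second, GRUB legacy reads files through the same BIOS INT 13h interface used at boot time, so reading the initrd from the GRUB command line exercises exactly the path that fails. Its testload command reads a file several different ways and compares the results:

    grub> root (hd0,1)
    grub> testload /boot/initrd

If testload succeeds on a hung system's drive, BIOS addressing is probably not the culprit.)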
The Wednesday 2005-12-28 at 04:50 -0500, Patrick Freeman wrote:
I have the used blocks ( -D ) output from debugreiserfs from both a working and a hung system. I think it has enough information to tell what part of the filesystem the initrd is in -- I am not sure if it indicates the LBA sectors. If it does, I will be able to glean that in a *very* tedious process. But this still wouldn't answer your question, I think, since it wouldn't tell us if the BIOS can properly address and read those blocks. So, I am curious, too (but I am also very grateful for the help I have already received and do not want to appear greedy).
I suppose it should give a partition-relative sector number, or something similar, but I have to say I haven't read the man page. Well, I had a brief look at it ;-)

--
Cheers,
Carlos Robinson
On Wednesday 28 December 2005 04:50, Patrick Freeman wrote:
Ok, so I looked -- that is cylinder 2082 (sorry about being a dolt, but I have a tendency to *fuzz* up that which I don't absolutely need to know after I've checked it out
Not a problem, Patrick... I tend to priority-focus on minute details, too.
I am very interested in what you said in another post about C,H,S being defaulted in the BIOS...
Just to clarify what you've paraphrased here: I actually only proposed it as a possibility worth investigating. From the spec sheet: "4MB Flash ROM with AMI* BIOS, Multiboot BBS (BIOS Boot Specification) [with] IDE drive auto-configure." I've been tripped up by these built-in 'auto' IDE configuration utilities before, specifically in the area of CHS<>LBA address translations. If it is suspect, it deserves looking at, if only to rule it out.
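(One low-effort way to see whether a translation mismatch is even plausible -- hypothetical commands for this particular setup:

    # Geometry the kernel ended up with for the disk:
    sfdisk -g /dev/hda

    # The partition table in sectors rather than cylinders, which
    # sidesteps CHS translation ambiguity entirely:
    fdisk -lu /dev/hda

If the sector-based start/end values are self-consistent but the cylinder-based ones disagree with what the BIOS setup screen shows, translation deserves a closer look.)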
I'm sorry -- I haven't been clear enough. The second flavor occurs on *all* drives after updates (where *all* means that the sample set can be of either type (fresh-install or imaged)
This contradicts facts that I believed we'd already established. It is a proverbial "monkey wrench" that fundamentally changes the equation. :-/

I *thought* the purpose of dividing up your test systems into "cloned" vs. "native installed" was to compare those susceptible to the 'flavor 2' failures against the "healthy." Now, there is no "healthy!" If *every* system, native installed and cloned, is susceptible to the post-mkinitrd boot failures, the fact that a drive started out as a clone might be *exacerbating* the problem somehow, but the cloning itself *can't* be the root cause or even a prerequisite. This has two ramifications:

- It brings back to life the possibility that these drives (or the entire IDE subsystems, for that matter) have an inherent but unidentified susceptibility. IOW, all of the hardware-related possibilities are back on the table unless and until specifically tested, vigorously, and ruled out.

- It also greatly increases the likelihood that the software you're compiling and installing... or the process you're using to install it... is at fault. The only obvious nexus I can see is your locally compiled driver. It *is* tied into the *storage* subsystem (the point of failure), isn't it?

Look, Patrick, I know it seems confusing when you can make the same changes on many systems and have some succumb and others not, but my previous point concerning "magnification" comes into play... these systems cannot be exactly identical, or they'd all fall over or all run. From that perspective, studying the differences between 'flavor 1' and 'flavor 2' boot failures is evasive because it leaves the major questions unanswered...

Is it possible for you to just rip out the locally compiled driver, substitute pieces of the storage subsystem as needed, and run some trials? If the boot failures disappear, you've at least isolated the problem. That is the first step in identifying and fixing it. Alternatively, you could dual-purpose these trials as preliminary work towards migrating to less problematic hardware... hopefully to components that won't need the custom driver. (A quick consistency check on the driver is sketched below.)
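(As a cheap first pass before ripping anything out, it may be worth confirming the locally built module is byte-identical everywhere -- the path below is illustrative, not known from this thread:

    # Run on each machine, hung and healthy, and compare:
    md5sum /lib/modules/`uname -r`/kernel/fs/xfs/xfs.ko

Identical checksums would rule out a corrupted or mismatched copy of the driver, leaving the install/mkinitrd process itself as the suspect.)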
I have the used blocks ( -D ) output from debugreiserfs from both a working and a hung system. ...
I think comparing a healthy 'native' drive to one drive of each failing type would provide the most fruitful forensic data. Time to recruit the assistance of a real filesystems expert, maybe even a hard drive engineer...
... But this still wouldn't answer your question, I think, since it wouldn't tell us if the BIOS can properly address and read those blocks.
In my mind, the likelihood that a BIOS setting or limitation is at fault has greatly diminished. It would be nice to rule it out, but that is easy enough to do with the right knowledge... :-) See my comment, above.

OK, Patrick, that's the extent of my brain capacity on this problem at this time. I'll keep abreast of your progress by following this thread, but unless I see some additional and definitive clues, or maybe some real test results, there isn't much more that I can add. Have fun and good luck!

regards,
- Carl