On Thursday 14 October 2004 06:35 pm, Jerome Lyles wrote:
On Thursday 14 October 2004 04:53 pm, Patrick Shanahan wrote:
* Jerome Lyles <susemail@hawaii.rr.com> [10-14-04 21:39]:
Tried it, no change. Unless you have another idea I think I'll wipe this partition and load a backup from June.
I'm wiped out. ////// :^(
gud luk,
Thanks anyway, it was a good try. Jerome
The Problem:

    fsck failed. Please repair manually and reboot. Please note that the root
    file system is currently mounted read-only. To remount it read-write:

    # mount -n -o remount,rw /

    CONTROL-D will exit from this shell and REBOOT the system.
    (none):~#

So I bought a new hard drive, loaded 9.1 from the DVD, did the 'online update' and rebuilt my home directory from backup. Everything worked perfectly, except rebooting during the installation: I was dumped into a grub shell, so I booted the system from the install DVD and finished the installation. After that everything was working perfectly. Rebooting was no problem.

Then I decided to do what got me into trouble once before: I did a system update using YaST. After the system update I wasn't able to boot because 'fsck failed' again. I rebooted into rescue mode and ran reiserfsck: 'no corruptions found'.

So I thought maybe I should do the clean install, the online update and the system update before restoring my home directory from backup this time. But before I did that I asked myself once again: if the filesystem checks out clean, why is fsck failing? It's not because the file system is corrupt, so what could it be?

So I did this: gg:"fsck failed" and got 1710 hits. So I checked them out. It turns out this problem has been around on the web since 1996 and has occurred on different architectures and flavors of *nix. Here are a few comments:

"I've encounter this several times. Sometimes fsck will "fix" it; sometimes it won't. If it won't, the only solution is reformat the bad partition. Be sure to do a badblock check, the one that actually writes and reads to every block on the partition. It's very slow, but thorough. It is important to take the time to do this. Bad blocks can wreak havoc with your filesystem. After doing all that, reinstall what was on the bad partition, if possible. (Fortunately, you can get your data off.)"

"So I do exactly as it says and it does the same thing next reboot."

"I also tried changing the disk names in fstab... like put them all to the letters they should be. i.e hdg5 to hdc5 but to no avail. It just did the same thing."

"After rebooting the machine to 1.0 fsck failed during boot (of course) So I killed the shell went multi user and proceeded to recompile fsck from the 1.0 sources. Everything *seems* back to normal now."

"i downloaded the root-fs from several mirrors .. always the same .. so i don´t think that the file was corrupt or something like that .."

"Subject ext3 errors

Hi all,

I have installed a new server, and created ext3 (data=ordered) filesystems on RAID1 partitions. I finished the installation a few weeks ago, and today I tried to install in in the place of the old one. I have worked a few hours on the server, mainly in the /etc directory, and copied a few gig's to the /home dir, which was on a separate partition than root. After that I have rebooted, to check that everything going up OK. Well, nothing was OK. The fsck failed with many "Freeing blocks not in datazone" and "Journal aborting" errors, and I came up with a read-only root partition. I manually e2fsck'd it, saying yes to all questions, and rebooted again. Next time it was the same story: fsck always failed, found many errors, the lost+found growed... AND, the real interesting story: all of my modifications in the /etc directory, which I made at least an hour or more, was lost! Just as if nothing has been written to the disk from the journal for at least an hour.
Also, there were some corrupted files, and also, some obvious fs errors after an fsck, for example, accessing /etc/apache-ssl/apache.conf resulted some errors like this:

    Jun 6 21:56:25 zeratul kernel: attempt to access beyond end of device
    Jun 6 21:56:25 zeratul kernel: 09:01: rw=0, want=1085624632, limit=20972736

The end of the story: I booted from CD, copied the whole root partition's contents off-disk, reformatted the RAID1 partition with mke2fs -j, copied back the contents, and it seems that now everything is good (so far). I don't think that the problem is in hardware, because of the RAID1, and also, these are brand-new 80G Maxtor SCSI drives. What can cause an error like this? This is a linux 2.4.26 on a PIV-2666, debian testing dist.

Thanks,
Ferenc Engard"

So...what's happening here? I did some more digging and found the source of the problem behavior in /etc/init.d/boot.rootfsck. This is the relevant code:

    #
    # do fsck and start sulogin, if it fails.
    #
    FSCK_RETURN=0
    # on first startup of a system with a lvm root device lvm /dev entries
    # may not exist at this time, so skip fsck in this case
    ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`
    ROOTFS_BLKDEV=${ROOTFS_BLKDEV%% }
    ROOTFS_BLKDEV=${ROOTFS_BLKDEV##* }
    DES_OK=0
    if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV" ; then
        DES_OK=1
    fi
    if test ! -f /fastboot -a -z "$fastboot" -a $DES_OK -eq 1 ; then
        FSCK_FORCE=""
        test -f /forcefsck && FSCK_FORCE="-f"

And this is the heart of the problem:

    if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"

As I understand it, this is three separate tests joined by -a, which means "and":

    -n "$ROOTFS_BLKDEV"        the string is not empty
    "$ROOTFS_BLKDEV" != "/"    the string is not "/"
    -b "$ROOTFS_BLKDEV"        the name exists and is a block device

All three have to succeed before DES_OK is set to 1. One or more of these tests may be failing, and I suspect the script then ends up treating that as a file system failure. But is it one? Reiserfsck says everything is fine.
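
To find out which of the three conditions is the one that fails, they can be run one at a time instead of in a single test. This is only a minimal sketch: explain_blkdev_test is a made-up helper name, not something from boot.rootfsck, and /dev/null is used only as an example of a name that is not a block device.

```shell
#!/bin/sh
# Hypothetical helper (not part of boot.rootfsck): evaluate the three
# conditions from the script separately and report the first one that fails.
explain_blkdev_test() {
    dev="$1"
    if ! test -n "$dev"; then
        echo "empty"              # -n failed: the string is empty
    elif ! test "$dev" != "/"; then
        echo "is-slash"           # the string is exactly "/"
    elif ! test -b "$dev"; then
        echo "not-block-device"   # -b failed: not an existing block device
    else
        echo "ok"                 # all three conditions hold
    fi
}

explain_blkdev_test ""           # prints: empty
explain_blkdev_test "/"          # prints: is-slash
explain_blkdev_test "/dev/null"  # prints: not-block-device
```

Calling it with the real value of ROOTFS_BLKDEV in place of the examples would show whether the string is empty, is "/", or names something that is not a block device.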
Now "ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`" runs fsck with -N, which tells it to only show what it would do and not to do an actual filesystem check (-T just suppresses the title line). I edited out '2>/dev/null' but I didn't see the error output on screen or in /var/log/boot.msg. I believe there is an error in the process fsck uses to check the metadata, or an error in the metadata itself; an error that reiserfsck does not recognize as an error.

I ran

    echo $ROOTFS_BLKDEV

and

    test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"

but only blank lines were returned.

Unanswered questions:

What are the differences between the values these three assignments produce?

    ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`
    ROOTFS_BLKDEV=${ROOTFS_BLKDEV%% }
    ROOTFS_BLKDEV=${ROOTFS_BLKDEV##* }

Is there a way to see the values they represent?

What do these modifiers do: %% and ##* ?

Is there a way to see the value that

    test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"

represents?

Does 'DES_OK' represent the metadata? What is the metadata?

Thanks,
Jerome
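
On the %% and ##* question: going by the shell's parameter expansion rules, %% strips the longest match of a pattern from the end of a value, and ## strips the longest match from the beginning. Below is a minimal sketch that can be tried outside the boot script. The sample string is made up to stand in for the output of fsck -T -N /, and the device name /dev/hda2 and the trailing space are assumptions, not taken from this system.

```shell
#!/bin/sh
# Made-up sample standing in for the output of `fsck -T -N /`.
# The device name and the trailing space are assumptions for illustration.
sample="[/sbin/fsck.reiserfs (1) -- /] fsck.reiserfs /dev/hda2 "

# %% with a single space as the pattern: remove a space from the end.
step1=${sample%% }

# ## with "* " as the pattern: remove everything up to and including
# the last space, which leaves only the last word.
step2=${step1##* }

echo "step1=[$step1]"
echo "step2=[$step2]"   # prints: step2=[/dev/hda2]
```

In the same way, a test has no output of its own, only an exit status, so running echo $? right after the test command shows 0 when it succeeded and non-zero when it failed.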