Re: [SLE] Booting Problem

8 Nov 2004

      On Thursday 14 October 2004 06:35 pm, Jerome Lyles wrote:
...
On Thursday 14 October 2004 04:53 pm, Patrick Shanahan wrote:
...
* Jerome Lyles <susemail@hawaii.rr.com> [10-14-04 21:39]:
...
Tried it, no change.  Unless you have another idea I think I'll wipe
this partition and load a backup from June.
I'm wiped out.  //////   :^(
gud luk,
Thanks  anyway, it was a good try.
Jerome
The Problem:
fsck failed.  Please repair manually and reboot.  Please note
that the root file system is currently mounted read-only.  To
remount it read-write:

   # mount -n -o remount,rw /

CONTROL-D will exit from this shell and REBOOT the system.

(none):~#

So I bought a new hard drive, loaded 9.1 from the DVD, did the 'online 
update' and rebuilt my home directory from backup.  Everything worked 
perfectly, except rebooting during the installation.  I was dumped into a 
grub shell, so I booted the system from the install DVD and finished the 
installation.  Everything was working perfectly.  Rebooting was no problem.

 Then I decided to do what got me into trouble once before: I did a system 
update using YaST.  After the system update I wasn't able to boot because 
'fsck failed' again.  I rebooted into rescue mode and ran reiserfsck: 'no 
corruptions found'.  So I thought maybe I should do the clean install, the 
online update and the system update before restoring my home directory from 
backup this time.  But before I did that I asked myself once again: if the 
filesystem checks out clean, why is fsck failing?  It's not because the file 
system is corrupt, so what could it be?

So I did this: gg:"fsck failed" and got 1710 hits.  So I checked them out.
It turns out this problem has been around on the web since 1996 and has 
occurred on different architectures and flavors of *nix.

Here are a few comments:

"I've encounter this several times. Sometimes fsck will "fix" it; 
 sometimes it won't. If it won't, the only solution is reformat the bad 
 partition. Be sure to do a badblock check, the one that actually 
 writes and reads to every block on the partition. It's very slow, but 
 thorough. It is important to take the time to do this. Bad blocks can 
 wreak havoc with your filesystem. After doing all that, reinstall what 
 was on the bad partition, if possible. (Fortunately, you can get your 
 data off.)" 

"So I do exactly as it says and it does the same thing next reboot."

"I also tried changing the disk names in fstab... like put them all to the 
letters they should be. i.e hdg5 to hdc5 but to no avail. It just did the 
same thing."

"After rebooting the machine to 1.0 fsck failed during boot (of course)
So I killed the shell went multi user and proceeded to recompile fsck from
the 1.0 sources. Everything *seems* back to normal now."

"i downloaded the root-fs from several mirrors .. always the same .. so
i don´t think that the file was corrupt or something like that .."

"Subject: ext3 errors

Hi all,

I have installed a new server, and created ext3 (data=ordered)
filesystems on RAID1 partitions. I finished the installation a few weeks
ago, and today I tried to install in in the place of the old one.

I have worked a few hours on the server, mainly in the /etc directory,
and copied a few gig's to the /home dir, which was on a separate
partition than root. After that I have rebooted, to check that
everything going up OK. Well, nothing was OK.

The fsck failed with many "Freeing blocks not in datazone" and "Journal
aborting" errors, and I came up with a read-only root partition. I
manually e2fsck'd it, saying yes to all questions, and rebooted again.
Next time it was the same story: fsck always failed, found many errors,
the lost+found growed... 

AND, the real interesting story: all of my modifications in the /etc
directory, which I made at least an hour or more, was lost! Just as if
nothing has been written to the disk from the journal for at least an
hour. Also, there were some corrupted files, and also, some obvious fs
errors after an fsck, for example, accessing /etc/apache-ssl/apache.conf
resulted some errors like this:

Jun  6 21:56:25 zeratul kernel: attempt to access beyond end of device
Jun  6 21:56:25 zeratul kernel: 09:01: rw=0, want=1085624632,
limit=20972736

The end of the story: I booted from CD, copied the whole root
partition's contents off-disk, reformatted the RAID1 partition with
mke2fs -j, copied back the contents, and it seems that now everything is
good (so far). I don't think that the problem is in hardware, because of
the RAID1, and also, these are brand-new 80G Maxtor SCSI drives.

What can cause an error like this? This is a linux 2.4.26 on a PIV-2666,
debian testing dist.

Thanks,
Ferenc Engard"

So...what's happening here?  I did some more digging and found the source of 
the problem behavior in /etc/init.d/boot.rootfsck.  

This is the relevant code:

	# do fsck and start sulogin, if it fails.
	#
	FSCK_RETURN=0
	# on first startup of a system with a lvm root device lvm /dev entries
	# may not exist at this time, so skip fsck in this case
	ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`
	ROOTFS_BLKDEV=${ROOTFS_BLKDEV%% }
	ROOTFS_BLKDEV=${ROOTFS_BLKDEV##* }
	DES_OK=0
	if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV" ; then
	    DES_OK=1
	fi

	if test ! -f /fastboot -a -z "$fastboot" -a $DES_OK -eq 1 ; then
	    FSCK_FORCE=""
	    test -f /forcefsck && FSCK_FORCE="-f"
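
A reading note on the excerpt above: DES_OK looks like a plain flag rather
than any kind of metadata, and the second "if" only enters the fsck branch
when that flag is 1.  A minimal sketch of that gate follows; the function
name should_run_fsck and its arguments are made up for illustration, and the
file check takes a path argument so it can be tried without touching
/fastboot.

```shell
#!/bin/sh
# Hypothetical helper mirroring the gate in the excerpt:
#   test ! -f /fastboot -a -z "$fastboot" -a $DES_OK -eq 1
# Arguments (all illustrative):
#   $1 = path of the fastboot marker file
#   $2 = value of the "fastboot" boot option (empty when not given)
#   $3 = DES_OK flag (0 or 1)
should_run_fsck() {
    if [ ! -f "$1" ] && [ -z "$2" ] && [ "$3" -eq 1 ]; then
        echo "run fsck"
    else
        echo "skip fsck"
    fi
}

should_run_fsck /nonexistent/fastboot "" 1   # prints "run fsck"
should_run_fsck /nonexistent/fastboot "" 0   # prints "skip fsck"
```

If this reading is right, a root device that fails the DES_OK test would lead
to fsck being skipped, which matches the "so skip fsck in this case" comment
in the script.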

And this is the heart of the problem:

if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"

As I understand it this is a three-way test:

 The -a option means "and", so all three parts must be true: 
-n "$ROOTFS_BLKDEV" (the value is not empty), "$ROOTFS_BLKDEV" != "/" (the 
value is not just "/"), and -b "$ROOTFS_BLKDEV" (the value names a block 
device).
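
For reference, test(1) treats -a as a logical AND, so the line can be read as
three separate checks.  A minimal sketch that reports each check on its own
follows; the function name explain_checks is made up for illustration.

```shell
#!/bin/sh
# explain_checks VALUE
# Runs the three checks from the boot script one at a time and says which
# of them hold for VALUE.
explain_checks() {
    dev=$1
    if [ -n "$dev" ]; then
        echo "-n (not empty): true"
    else
        echo "-n (not empty): false"
    fi
    if [ "$dev" != "/" ]; then
        echo "!= / (not just a slash): true"
    else
        echo "!= / (not just a slash): false"
    fi
    if [ -b "$dev" ]; then
        echo "-b (is a block device): true"
    else
        echo "-b (is a block device): false"
    fi
}

# "/" is not empty, but it equals "/" and is a directory, not a block device.
explain_checks "/"
```

Calling it with the real root device name (whatever that is on the machine in
question) should show which of the three checks, if any, is the one failing.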

One or more of these comparisons is failing and the code assumes it means a 
file system failure.  But does it?  Reiserfsck says everything is fine.

Now "ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`" tells fsck to check something 
(some metadata) but not to do an actual filesystem check. I edited out 
'2>/dev/null' but I didn't see the error output on screen or 
in /var/log/boot.msg.
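
One way to make sure nothing is lost is to capture stderr together with
stdout and print the exit status as well.  This is only a sketch: run_verbose
is a made-up helper name, and the command at the bottom is a stand-in so the
example can be tried on any machine.

```shell
#!/bin/sh
# run_verbose COMMAND [ARGS...]
# Runs the command, keeps stderr together with stdout (2>&1 instead of
# 2>/dev/null) and reports the exit status.
run_verbose() {
    if out=$("$@" 2>&1); then
        status=0
    else
        status=$?
    fi
    printf 'exit status: %s\n' "$status"
    printf 'output: [%s]\n' "$out"
}

# Stand-in command.  On the affected machine the call of interest would be:
#   run_verbose fsck -T -N /
run_verbose sh -c 'echo to-stdout; echo to-stderr >&2; exit 3'
```

The brackets around the output make an empty result or a stray trailing space
easy to spot.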

I believe there is an error in the process fsck uses to check the meta data or 
an error in the meta data itself; an error that reiserfsck does not recognize 
as an error. 

I ran echo $ROOTFS_BLKDEV and  
 test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"
but only blank lines were returned.
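
The blank lines may have a simple explanation: test prints nothing and only
sets an exit status, and ROOTFS_BLKDEV is assigned inside the boot script, so
in an interactive shell it is probably unset.  A small sketch under that
assumption, with show_result as a made-up helper name:

```shell
#!/bin/sh
# show_result VALUE
# Runs the test line from the boot script with VALUE in ROOTFS_BLKDEV and
# turns the exit status into a visible message.
show_result() {
    ROOTFS_BLKDEV=$1
    if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"; then
        echo "test is true"
    else
        echo "test is false"
    fi
}

# An empty value stands in for the unset variable of an interactive shell.
show_result ""     # prints "test is false"
show_result "/"    # prints "test is false"
```

At a prompt, running the test line and then "echo $?" shows the same result
as a number: 0 for true, non-zero for false.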

Unanswered questions:

What is the difference between these three assignments:
ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`
ROOTFS_BLKDEV=${ROOTFS_BLKDEV%% }
ROOTFS_BLKDEV=${ROOTFS_BLKDEV##* }

Is there a way to see the values they represent?

What do these modifiers do: %% and ##*
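
For what it is worth, %% and ## are standard shell parameter expansions: 
${var%%pattern} removes the longest matching suffix and ${var##pattern} 
removes the longest matching prefix.  The sketch below walks a sample string
through both steps.  The sample text is an assumption about what 
"fsck -T -N /" might print (including the device name), not output taken from
the affected machine, and last_word is a made-up function name.

```shell
#!/bin/sh
# last_word TEXT
# Repeats the two expansions from the boot script and prints the result.
last_word() {
    value=$1
    value=${value%% }    # drop one trailing space, if there is one
    value=${value##* }   # drop everything up to and including the last space
    echo "$value"
}

# Assumed sample of fsck -T -N output, with a trailing space.
sample='[/sbin/fsck.reiserfs (1) -- /] fsck.reiserfs /dev/hda3 '

echo "before: [$sample]"
echo "after:  [$(last_word "$sample")]"   # prints "after:  [/dev/hda3]"
```

Echoing the variable between square brackets after each assignment is also a
simple way to see the intermediate values.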

Is there a way to see the values " test -n "$ROOTFS_BLKDEV" -a 
"$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"" represent?

Does 'DES_OK' represent the meta data?  What is the meta data?
Thanks,
Jerome