On Thursday 14 October 2004 06:35 pm, Jerome Lyles wrote:
On Thursday 14 October 2004 04:53 pm, Patrick Shanahan wrote:
* Jerome Lyles [10-14-04 21:39]:
Tried it, no change. Unless you have another idea I think I'll wipe this partition and load a backup from June.
I'm wiped out. ////// :^(
gud luk,
Thanks anyway, it was a good try. Jerome
The Problem:

fsck failed. Please repair manually and reboot. Please note that the root
file system is currently mounted read-only. To remount it read-write:

  # mount -n -o remount,rw /

CONTROL-D will exit from this shell and REBOOT the system.
(none):~#

So I bought a new hard drive, loaded 9.1 from the DVD, did the 'online update' and rebuilt my home directory from backup. Everything worked perfectly, except rebooting during the installation: I was dumped into a grub shell, so I booted the system from the install DVD and finished the installation. Everything was working perfectly. Rebooting, no problem.

Then I decided to do what got me into trouble once before: I did a system update using YaST. After the system update I wasn't able to boot because 'fsck failed' again. I rebooted into rescue mode and ran reiserfsck: 'no corruptions found'.

So I thought maybe I should do the clean install and the online update and the system update before I restored my home directory from backup this time. But before I did that I asked myself once again: if the filesystem checks out clean, why is fsck failing? It's not because the file system is corrupt, so what could it be?

So I did this: gg:"fsck failed" and got 1710 hits. So I checked them out. It turns out this problem has been around on the web since 1996 and has occurred on different architectures and flavors of *nix. Here are a few comments:

"I've encountered this several times. Sometimes fsck will "fix" it; sometimes it won't. If it won't, the only solution is to reformat the bad partition. Be sure to do a badblock check, the one that actually writes and reads to every block on the partition. It's very slow, but thorough. It is important to take the time to do this. Bad blocks can wreak havoc with your filesystem. After doing all that, reinstall what was on the bad partition, if possible. (Fortunately, you can get your data off.)"

"So I do exactly as it says and it does the same thing next reboot."

"I also tried changing the disk names in fstab... like put them all to the letters they should be. i.e hdg5 to hdc5 but to no avail. It just did the same thing."

"After rebooting the machine to 1.0 fsck failed during boot (of course) So I killed the shell went multi user and proceeded to recompile fsck from the 1.0 sources. Everything *seems* back to normal now."

"i downloaded the root-fs from several mirrors .. always the same .. so i don't think that the file was corrupt or something like that .."

"Subject ext3 errors

Hi all, I have installed a new server, and created ext3 (data=ordered) filesystems on RAID1 partitions. I finished the installation a few weeks ago, and today I tried to install it in the place of the old one. I have worked a few hours on the server, mainly in the /etc directory, and copied a few gig's to the /home dir, which was on a separate partition than root. After that I have rebooted, to check that everything going up OK. Well, nothing was OK. The fsck failed with many "Freeing blocks not in datazone" and "Journal aborting" errors, and I came up with a read-only root partition. I manually e2fsck'd it, saying yes to all questions, and rebooted again. Next time it was the same story: fsck always failed, found many errors, the lost+found growed... AND, the real interesting story: all of my modifications in the /etc directory, which I made at least an hour or more, was lost! Just as if nothing has been written to the disk from the journal for at least an hour.
Also, there were some corrupted files, and also, some obvious fs errors after an fsck, for example, accessing /etc/apache-ssl/apache.conf resulted in some errors like this:

  Jun 6 21:56:25 zeratul kernel: attempt to access beyond end of device
  Jun 6 21:56:25 zeratul kernel: 09:01: rw=0, want=1085624632, limit=20972736

The end of the story: I booted from CD, copied the whole root partition's contents off-disk, reformatted the RAID1 partition with mke2fs -j, copied back the contents, and it seems that now everything is good (so far). I don't think that the problem is in hardware, because of the RAID1, and also, these are brand-new 80G Maxtor SCSI drives. What can cause an error like this? This is a linux 2.4.26 on a PIV-2666, debian testing dist.

Thanks, Ferenc Engard"

So...what's happening here? I did some more digging and found the source of the problem behavior in /etc/init.d/boot.rootfsck.

This is the relevant code:

  #
  # do fsck and start sulogin, if it fails.
  #
  FSCK_RETURN=0
  # on first startup of a system with a lvm root device lvm /dev entries
  # may not exist at this time, so skip fsck in this case
  ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`
  ROOTFS_BLKDEV=${ROOTFS_BLKDEV%% }
  ROOTFS_BLKDEV=${ROOTFS_BLKDEV##* }
  DES_OK=0
  if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV" ; then
      DES_OK=1
  fi
  if test ! -f /fastboot -a -z "$fastboot" -a $DES_OK -eq 1 ; then
      FSCK_FORCE=""
      test -f /forcefsck && FSCK_FORCE="-f"

And this is the heart of the problem:

  if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"

As I understand it this is a three-way comparison:

The -a option means "-n "$ROOTFS_BLKDEV" " equals "$ROOTFS_BLKDEV" and "/" equals "-a -b "$ROOTFS_BLKDEV""

And then the "-n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"" comparison makes three.

One or more of these comparisons is failing and the code assumes it means a file system failure. But does it? Reiserfsck says everything is fine.
Now "ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`" tells fsck to check something (some metadata) but not to do an actual filesystem check. I edited out '2>/dev/null' but I didn't see the error output on screen or in /var/log/boot.msg. I believe there is an error in the process fsck uses to check the metadata or an error in the metadata itself; an error that reiserfsck does not recognize as an error.

I ran

  echo $ROOTFS_BLKDEV

and

  test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"

but only blank lines were returned.

Unanswered questions:

What are the differences these three names represent:

  ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`
  ROOTFS_BLKDEV=${ROOTFS_BLKDEV%% }
  ROOTFS_BLKDEV=${ROOTFS_BLKDEV##* }

Is there a way to see the values they represent?

What do these modifiers do: %%;##*

Is there a way to see the values " test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"" represent?

Does 'DES_OK' represent the metadata? What is the metadata?

Thanks, Jerome
On Sunday 07 November 2004 22:23, Jerome Lyles wrote:
So...what's happening here? I did some more digging and found the source of the problem behavior in /etc/init.d/boot.rootfsck.
This is the relevant code:
  #
  # do fsck and start sulogin, if it fails.
  #
  FSCK_RETURN=0
  # on first startup of a system with a lvm root device lvm /dev entries
  # may not exist at this time, so skip fsck in this case
  ROOTFS_BLKDEV=`fsck -T -N / 2>/dev/null`
  ROOTFS_BLKDEV=${ROOTFS_BLKDEV%% }
  ROOTFS_BLKDEV=${ROOTFS_BLKDEV##* }
  DES_OK=0
  if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV" ; then
      DES_OK=1
  fi
  if test ! -f /fastboot -a -z "$fastboot" -a $DES_OK -eq 1 ; then
      FSCK_FORCE=""
      test -f /forcefsck && FSCK_FORCE="-f"
And this is the heart of the problem:
if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"
As I understand it this is a three way comparison:
The -a option means "-n "$ROOTFS_BLKDEV" " equals "$ROOTFS_BLKDEV" and "/" equals "-a -b "$ROOTFS_BLKDEV""
And then the "-n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"" comparison makes three.
What it means is "if $ROOTFS_BLKDEV is non-empty, and if it isn't "/", and if it is a valid block device". Each -a is a logical AND joining the three separate tests.

If you run "fsck -T -N" in the rescue system, what is the output?
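To see which of the three conditions is the one that fails, the combined test can be split into its parts. This is only a sketch: check_rootfs_blkdev is a hypothetical helper written for illustration and is not part of boot.rootfsck.

```shell
# Hypothetical helper (not part of boot.rootfsck): run the three parts of
# the combined test one at a time and report the first one that fails.
check_rootfs_blkdev() {
    dev=$1
    if ! test -n "$dev"; then
        echo "empty value"
        return 1
    fi
    if ! test "$dev" != "/"; then
        echo "value is /"
        return 1
    fi
    if ! test -b "$dev"; then
        echo "not a block device: $dev"
        return 1
    fi
    echo "ok: $dev"
}

# Example calls; each prints one line and does not touch any device.
check_rootfs_blkdev "" || true
check_rootfs_blkdev "/" || true
```

Note that test itself never prints anything. It only sets an exit status, which would explain why running the combined test by hand shows nothing on screen.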
On Monday 08 November 2004 03:34 am, Anders Johansson wrote:
On Sunday 07 November 2004 22:23, Jerome Lyles wrote:
So...what's happening here? I did some more digging and found the source of the problem behavior in /etc/init.d/boot.rootfsck.
And this is the heart of the problem:
if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"
As I understand it this is a three way comparison:
The -a option means "-n "$ROOTFS_BLKDEV" " equals "$ROOTFS_BLKDEV" and "/" equals "-a -b "$ROOTFS_BLKDEV""
And then the "-n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"" comparison makes three.
What it means is "if $ROOTFS_BLKDEV is non-empty, and if it isn't "/", and if it is a valid block device"
If you run "fsck -T -N" in the rescue system, what is the output?
Rescue:~# fsck -T -N /dev/hda2
[/sbin/fsck.reiserfs(1) -- /dev/hda2] fsck.reiserfs /dev/hda2

Jerome
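For reference, the two expansions in the quoted script ("%% " and "##* ") can be tried by hand on text like the output above, echoing the variable after each step. This is a sketch under assumptions: the sample string is copied from the rescue-system output for /dev/hda2, and the trailing space is added here purely to show what the first expansion does. The real output of "fsck -T -N /" at boot time may differ.

```shell
# Sample text taken from the rescue-system output quoted in this thread.
# The trailing space is an assumption added to show what "%% " removes.
sample='[/sbin/fsck.reiserfs(1) -- /dev/hda2] fsck.reiserfs /dev/hda2 '

ROOTFS_BLKDEV=$sample
ROOTFS_BLKDEV=${ROOTFS_BLKDEV%% }    # strip one trailing space, if present
echo "after %% : [$ROOTFS_BLKDEV]"
ROOTFS_BLKDEV=${ROOTFS_BLKDEV##* }   # strip everything up to the last space
echo "after ##*: [$ROOTFS_BLKDEV]"
```

With this sample the final value is /dev/hda2, that is, the last word of the fsck output. Echoing between the assignments is one way to see what each of the three lines contributes.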
On Monday 08 November 2004 22:10, Jerome Lyles wrote:
On Monday 08 November 2004 03:34 am, Anders Johansson wrote:
On Sunday 07 November 2004 22:23, Jerome Lyles wrote:
So...what's happening here? I did some more digging and found the source of the problem behavior in /etc/init.d/boot.rootfsck.
And this is the heart of the problem:
if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"
As I understand it this is a three way comparison:
The -a option means "-n "$ROOTFS_BLKDEV" " equals "$ROOTFS_BLKDEV" and "/" equals "-a -b "$ROOTFS_BLKDEV""
And then the "-n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"" comparison makes three.
What it means is "if $ROOTFS_BLKDEV is non-empty, and if it isn't "/", and if it is a valid block device"
If you run "fsck -T -N" in the rescue system, what is the output?
Rescue:~# fsck -T -N /dev/hda2
[/sbin/fsck.reiserfs(1) -- /dev/hda2] fsck.reiserfs /dev/hda2
OK, then I'd be greatly interested in which error code from reiserfsck it is complaining about. Could you edit /etc/init.d/boot.rootfsck and, in the section where it prints that error message, change the first empty echo to

  echo $FSCK_RETURN

then reboot, let it fail, and tell us what error code gets printed?

According to the logic the error code must be 4 or greater, and according to "man reiserfsck" that means it must be 4, 6, 8 or 16. If a plain reiserfsck run doesn't show any errors, it almost has to be either 8 or 16, but it would be nice to know what it's complaining about.
On Monday 08 November 2004 11:29 am, Anders Johansson wrote:
On Monday 08 November 2004 22:10, Jerome Lyles wrote:
On Monday 08 November 2004 03:34 am, Anders Johansson wrote:
On Sunday 07 November 2004 22:23, Jerome Lyles wrote:
So...what's happening here? I did some more digging and found the source of the problem behavior in /etc/init.d/boot.rootfsck.
And this is the heart of the problem:
if test -n "$ROOTFS_BLKDEV" -a "$ROOTFS_BLKDEV" != "/" -a -b "$ROOTFS_BLKDEV"
If you run "fsck -T -N" in the rescue system, what is the output?
Rescue:~# fsck -T -N /dev/hda2
[/sbin/fsck.reiserfs(1) -- /dev/hda2] fsck.reiserfs /dev/hda2
ok, I'd be greatly interested in what the error code from reiserfsck is that it complains about then. Could you edit /etc/init.d/boot.rootfsck and in the section where it prints that error message, change the first empty echo to
echo $FSCK_RETURN
then reboot, let it fail and tell us what the error code is that gets printed. According to the logic the error code must be 4 or greater, and according to "man reiserfsck" that means it must be 4, 6, 8 or 16, but if a plain reiserfsck run doesn't show any errors, it almost has to be either 8 or 16, but it would be nice to know what it's complaining about
  fi
  echo $FSCK_RETURN

Returns nothing except an empty line. Is this a case where fsck actually returns nothing? If it is, it's a case that wasn't anticipated and could be the source of the problem.

I don't know if this means anything, but after I tried it two or three times I added another echo $FSCK_RETURN below the first one and rebooted. As far as I can tell only one empty line was returned.

Jerome
On Tuesday 09 November 2004 03:24, Jerome Lyles wrote:
fi echo $FSCK_RETURN
Returns nothing except an empty line. Is this a case where FSCK actually returns nothing. If it is, it's a case that wasn't anticipated and could be the source of the problem.
I don't know if this means anything but after I tried it two or three times I added another echo $FSCK_RETURN below the first one and rebooted. As far as I can tell only one empty line was returned. Jerome
Hmmm, and you're sure you didn't misspell it as FCSK or something? I don't see anything obvious that would cause the variable to lose its value. Try, in place of that echo, pasting in this:

  [ $FSCK_RETURN -eq 4 ] && echo "fatal errors left uncorrected"
  [ $FSCK_RETURN -eq 6 ] && echo "fixable errors left uncorrected"
  [ $FSCK_RETURN -eq 8 ] && echo "operational error"
  [ $FSCK_RETURN -eq 16 ] && echo "syntax error"

Note the spaces inside the brackets; they need to be there. It's [<space>$ and 4<space>], and so on.
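A variant of the same check that also reports an empty or unexpected value might look like the following. It is a sketch only: describe_fsck_return is a hypothetical name, and the messages are the ones used in this thread for codes 4, 6, 8 and 16. Quoting the variable matters because an unquoted empty $FSCK_RETURN turns the test into "[ -eq 4 ]", which is a test error rather than a comparison.

```shell
# Hypothetical helper: translate the saved fsck return code into text,
# including the case where the variable is empty or unset.
describe_fsck_return() {
    case "${1:-}" in
        "") echo "FSCK_RETURN is empty or unset" ;;
        4)  echo "fatal errors left uncorrected" ;;
        6)  echo "fixable errors left uncorrected" ;;
        8)  echo "operational error" ;;
        16) echo "syntax error" ;;
        *)  echo "other return code: $1" ;;
    esac
}

# Quoted, with an empty default, so an unset variable is reported
# instead of breaking the test.
describe_fsck_return "${FSCK_RETURN:-}"
```

Because every branch prints something, a completely silent boot would suggest that this part of the script is not the one being executed.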
On Monday 08 November 2004 04:48 pm, Anders Johansson wrote:
On Tuesday 09 November 2004 03:24, Jerome Lyles wrote:
fi echo $FSCK_RETURN
Returns nothing except an empty line. Is this a case where FSCK actually returns nothing. If it is, it's a case that wasn't anticipated and could be the source of the problem. Jerome
Hmmm, and you're sure you didn't misspell it as FCSK or something?
I don't see anything obvious that would cause the variable to lose its value. Try, in place of that echo, pasting in this
  [ $FSCK_RETURN -eq 4 ] && echo "fatal errors left uncorrected"
  [ $FSCK_RETURN -eq 6 ] && echo "fixable errors left uncorrected"
  [ $FSCK_RETURN -eq 8 ] && echo "operational error"
  [ $FSCK_RETURN -eq 16 ] && echo "syntax error"
Note the spaces in the brackets, they need to be there, it's [<space>$ and 4<space>], and so on
No output again. Hard to believe? Welcome to my world. Here's the relevant part of the boot.rootfsck file:

  elif test $FSCK_RETURN -gt 3; then
      # if appropriate, switch bootsplash to verbose
      # mode to make text messages visible.
      test -f /proc/splash && echo "verbose" > /proc/splash
      # Stop blogd since we reboot after sulogin
      test -x /sbin/blogd && killproc -QUIT /sbin/blogd
      if test -x /etc/init.d/kbd ; then
          /etc/init.d/kbd start
      fi
      [ $FSCK_RETURN -eq 4 ] && echo "fatal errors left uncorrected"
      [ $FSCK_RETURN -eq 6 ] && echo "fixable errors left uncorrected"
      [ $FSCK_RETURN -eq 8 ] && echo "operational error"
      [ $FSCK_RETURN -eq 16 ] && echo "syntax error"
      echo $FSCK_RETURN
      echo "fsck failed. Please repair manually and reboot. The root"
      echo "file system is currently mounted read-only. To remount it"
      echo "read-write do:"
      echo
      echo " bash# mount -n -o remount,rw /"
      echo
      echo "Attention: Only CONTROL-D will reboot the system in this"
      echo "maintanance mode. shutdown or reboot will not work."
      echo $FSCK_RETURN
      PS1="(repair filesystem) # "
      export PS1
      /sbin/sulogin /dev/console

Jerome
On Tuesday 09 November 2004 05:59, Jerome Lyles wrote:
No output again. Hard to believe? Welcome to my world. Here's the relevant part of the boot.rootfsck file:

<snip>

  echo "fsck failed. Please repair manually and reboot. The root"
  echo "file system is currently mounted read-only. To remount it"
  echo "read-write do:"
  echo
  echo " bash# mount -n -o remount,rw /"
  echo
  echo "Attention: Only CONTROL-D will reboot the system in this"
  echo "maintanance mode. shutdown or reboot will not work."
  echo $FSCK_RETURN
But that is the error message that gets printed, right? OK, for the moment I'm baffled.

Could you make a similar change in /etc/init.d/boot.localfs? Perhaps that's where it fails. It just occurred to me that you will get the same error message regardless of which file system fails the fsck.
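Since both scripts print the same text, one way to tell them apart while debugging is to put the script name and the return code into the message. A minimal sketch, assuming a hypothetical helper named report_fsck_failure that would be pasted into each script next to the existing echo lines:

```shell
# Hypothetical debugging helper: name the script and the return code in
# the failure message, so identical messages can be told apart.
report_fsck_failure() {
    script_name=$1
    return_code=$2
    echo "fsck failed in ${script_name} (return code: ${return_code:-empty})"
}

# ${0##*/} is the name of the running script without its directory.
report_fsck_failure "${0##*/}" "${FSCK_RETURN:-}"
```

Run from boot.localfs with a return code of 8, this would print "fsck failed in boot.localfs (return code: 8)".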
On Tuesday 09 November 2004 04:11 am, Anders Johansson wrote:
On Tuesday 09 November 2004 05:59, Jerome Lyles wrote:
No output again. Hard to believe? Welcome to my world. Here's the relevant part of the boot.rootfsck file:
<snip>
  echo "fsck failed. Please repair manually and reboot. The root"
  echo "file system is currently mounted read-only. To remount it"
  echo "read-write do:"
  echo
  echo " bash# mount -n -o remount,rw /"
  echo
  echo "Attention: Only CONTROL-D will reboot the system in this"
  echo "maintanance mode. shutdown or reboot will not work."
  echo $FSCK_RETURN
But that is the error message that gets printed, right? OK, for the moment I'm baffled
Yes it is.
Could you make a similar change in /etc/init.d/boot.localfs, perhaps that's where it fails?! It just occurred to me that you will get the same error message regardless of which file system fails the fsck.
As soon as I get home from work. Thanks, Jerome
On Tuesday 09 November 2004 04:11 am, Anders Johansson wrote:
On Tuesday 09 November 2004 05:59, Jerome Lyles wrote:
No output again. Hard to believe? Welcome to my world. Here's the relevant part of the boot.rootfsck file:
<snip>
echo "fsck failed. Please repair manually and reboot. The root" echo "file system is currently mounted read-only. To remount it"
But that is the error message that gets printed, right? OK, for the moment I'm baffled
Could you make a similar change in /etc/init.d/boot.localfs, perhaps that's where it fails?! It just occurred to me that you will get the same error message regardless of which file system fails the fsck.
Found it! Operational error (8) on one of the other file systems. Jerome
On Thursday 11 November 2004 23:30, Jerome Lyles wrote:
On Tuesday 09 November 2004 04:11 am, Anders Johansson wrote:
On Tuesday 09 November 2004 05:59, Jerome Lyles wrote:
No output again. Hard to believe? Welcome to my world. Here's the relevant part of the boot.rootfsck file:
<snip>
echo "fsck failed. Please repair manually and reboot. The root" echo "file system is currently mounted read-only. To remount it"
But that is the error message that gets printed, right? OK, for the moment I'm baffled
Could you make a similar change in /etc/init.d/boot.localfs, perhaps that's where it fails?! It just occurred to me that you will get the same error message regardless of which file system fails the fsck.
Found it! Operational error (8) on one of the other file systems.
Ok, I've had a look at the source of reiserfsck, and error code 8 could mean any one of a dozen error conditions. If you run fsck manually on the other partitions, do you perhaps get a more verbose error message?
On Thursday 11 November 2004 02:58 pm, Anders Johansson wrote:
On Thursday 11 November 2004 23:30, Jerome Lyles wrote:
On Tuesday 09 November 2004 04:11 am, Anders Johansson wrote:
On Tuesday 09 November 2004 05:59, Jerome Lyles wrote:
No output again. Hard to believe? Welcome to my world. Here's the relevant part of the boot.rootfsck file:
Could you make a similar change in /etc/init.d/boot.localfs, perhaps that's where it fails?! It just occurred to me that you will get the same error message regardless of which file system fails the fsck.
Found it! Operational error (8) on one of the other file systems.
Ok, I've had a look at the source of reiserfsck, and error code 8 could mean any one of a dozen error conditions. If you run fsck manually on the other partitions, do you perhaps get a more verbose error message?
There were two problems. On the second hard drive, hdb5 was always being recognized by fsck as a swap partition (which it originally was), even though I kept reformatting it as other partition types. So finally I deleted it and rebooted, only to hit the same problem. So I checked my external FireWire drive. fsck failed on /dev/sda1; it said there was no such device, so I deleted the entry in /etc/fstab, rebooted, and I am now back in business!

Thank you Anders and Patrick, Bruce, C.Richard, Sid, peter, ptilopteri and Carl! Success at last! I couldn't have done it without you!

Jerome
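A stale /etc/fstab entry like the one for /dev/sda1 can be looked for before rebooting. The following is a sketch under assumptions: list_missing_fsck_devices is a hypothetical helper, it only looks at entries whose first field starts with /dev/, and it treats a non-zero sixth field (the fsck pass number) as "checked at boot".

```shell
# Hypothetical helper: list fstab entries that are marked for fsck
# (non-zero pass number) but whose device node does not exist.
list_missing_fsck_devices() {
    fstab=${1:-/etc/fstab}
    while read -r device mountpoint fstype options dump pass rest; do
        case "$device" in
            ""|\#*) continue ;;     # skip blank lines and comments
        esac
        case "$device" in
            /dev/*) ;;              # only plain device paths are checked
            *) continue ;;
        esac
        # Skip entries with pass number 0, missing, or non-numeric.
        [ "${pass:-0}" -gt 0 ] 2>/dev/null || continue
        [ -b "$device" ] || echo "missing: $device ($mountpoint)"
    done < "$fstab"
}

# Check the system fstab if it is readable.
if [ -r /etc/fstab ]; then
    list_missing_fsck_devices /etc/fstab
fi
```

An external drive that is not always plugged in would show up here as a missing device, which matches the "no such device" failure described above.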
participants (2)
- Anders Johansson
- Jerome Lyles