Re: [SLE] Preventing fs errors -with the e2fsck command? and SMART, e2fsck confusion

15 Jan 2005

      Carlos E. R. wrote:
...
The Thursday 2005-01-13 at 15:14 +0200, Hylton Conacher (ZR1HPC) wrote:
Sorry. I've rearranged the order of the Q+A and also added
substantially to it to ease my understanding. I am trying to find some
kind of link/reason/relationship between using fsck, e2fsck and
smart if fsck and smart are enabled.
...
My understanding is that although SMART will check the physical disk,
e2fsck will check that the data is able to be written to the physical
hdd, and of course according to the fs. So even though I have
SMART enabled, there might still be a case where data cannot be written
to the hdd, resulting in a failed fsck on that partition on bootup.
fsck tests the partition logically, not physically. It can also run a
bad block check (on some filesystem types), but is certainly not as complete
in that respect as smart.
So the fsck cmd at boot checks the logical fs and assumes the physical
structure to be OK.
If smartctl is enabled and run on the HDD, it checks the physical
characteristics, i.e. for bad blocks, and keeps a list. It does not however
keep a list of them that the fs can access. If smartctl has been run and
a block has been marked as bad, the smartctl HDD utility can remap the
bad block so that data written to the disk lands on a good block.
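
A note on where that list lives: as far as I know, the remapping is done by
the drive's own firmware, and smartctl only reports on it. The number of
sectors the drive has already remapped shows up as the attribute
Reallocated_Sector_Ct in the output of "smartctl -A". A minimal sketch for
pulling that number out, assuming the usual ten-column attribute table with
the raw value in the last column (the function name is made up for this
example, and /dev/hda is only a placeholder):

```shell
#!/bin/sh
# Print the raw Reallocated_Sector_Ct value from `smartctl -A` output on stdin.
# Assumption: standard attribute table layout, raw value in column 10.
# Returns non-zero if the attribute is not present in the input.
reallocated_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10; found = 1 }
         END { if (!found) exit 1 }'
}

# Typical use (as root; /dev/hda is a placeholder device name):
#   smartctl -A /dev/hda | reallocated_count
```

A raw value that keeps growing between checks is usually taken as a sign that
the drive is wearing out, whatever the filesystem on top of it thinks.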

When the fs is checked at boot time, if it should find an error on the fs
caused by the physical HDD (and as a result of it not being able to read
the smartctl bad blocks list), it assumes that fsck should be run
manually, i.e. e2fsck with -c /dev/hdx to check for bad blocks. This then
creates a list of bad blocks the fs cannot use.
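
For reference, the manual check described above could look roughly like the
sketch below. It assumes the partition is unmounted, and it only prints the
commands unless DRY_RUN=0 is set. The helper names and /dev/hdXN are
placeholders for this example; the -f, -c and -b options are the ones
documented in the e2fsck and dumpe2fs man pages.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Scan an UNMOUNTED ext2/ext3 partition for bad blocks, then list the
# blocks the filesystem has recorded as bad.
check_bad_blocks() {
    dev="$1"
    run e2fsck -f -c "$dev"   # -f: check even if clean; -c: read-only badblocks scan
    run dumpe2fs -b "$dev"    # -b: print the blocks marked bad in the filesystem
}

check_bad_blocks /dev/hdXN    # placeholder, replace with a real unmounted partition
```

The list printed by dumpe2fs is the filesystem's own list. To my knowledge it
is kept independently of whatever the drive has remapped internally.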

In effect though we have two lists of bad blocks or does the fsck
program via e2fsck -c take the smartctl list and add/append to it?

What is the relationship between smartctl, fsck, and e2fsck(being a type
of fsck for ext3 fs)?

[snip]
...
Ok, it goes like this: 1) There is a physical error in the media, meaning
that what the kernel tries to write is not the same as what it reads back. 
2) "Something" marks the block(s) containing that error as bad.
Could I assume the 'Something' to be either smartctl or e2fsck?
...
...
IF SMART doesn't check for bad blocks,
OK, sorry it does check for them but doesn't repair them. What would
repair them, as neither e2fsck nor mke2fs seems to offer a repair
option?
...
...
then in theory fsck should check
for bad blocks
By default, it does not.
Because it assumes that the physical disk is OK. Personally I think the
fsck should have the e2fsck -c option added into boot. I'll have to
investigate if there is a kernel wishlist.
...
Because there is no need, and because it is terribly time consuming, a
matter of several hours. You are too paranoid about them, I think :-)
:) Paranoid I might be, but a stable and safe HDD space to put data
on is something that computer folk (read: I) have dreamt of for years, i.e.
a self-healing system. Heck, it would almost negate the purpose of
backups, saving companies fortunes.
...
I would like to run the e2fsck command to prevent the failure of
partition checking by fsck on bootup, as from reading the man page on fsck
it does not seem up to working on an ext3 fs.
Then just force a check during boot, by creating the file "/forcefsck". An
ext3 partition will be checked to the needed level, not more. Doing a
bad block check every time is overkill, and will not really protect your
data.
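
Creating that flag file is a one-liner. A small sketch, assuming the SuSE boot
scripts behave as described in this thread. The function name and the
optional root-directory argument are made up for the example, so that it can
be tried against a scratch directory first:

```shell
#!/bin/sh
# Create the "forcefsck" flag file in the given root directory (default: /).
# The optional argument is a hypothetical convenience for testing.
request_boot_fsck() {
    root="${1:-/}"
    touch "${root%/}/forcefsck"
}

# Typical use (as root), followed by a reboot:
#   request_boot_fsck
```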
/forcefsck is a partial solution as fsck itself doesn't check bad
blocks, only e2fsck does. The next thing is to find out where and what
syntax to use to get e2fsck to run at boot, if the standard boot fsck fails.
It may be overkill but it provides folk with a little more data safety.
...
...
That '/forcefsck' option is a little strong
Why?
The standard fsck normally runs at boot anyway, it is the e2fsck -c
option that has to be called on if the standard fsck fails. It would be
a good idea to '/force2fsck -c' if and only if the normal fsck failed.
...
/forcefsck just tells the '/etc/init.d/boot.localfs' that you want to check the
filesystems regardless of whether it is needed or not. It doesn't do
anything "drastic".
Mmmm, had a look at the /etc...bootlocalfs file and see that it mentions
that if the error code is 1 then all is OK, if 2 the fsck failed and the
message 'fsck has failed. Please run fsck manually' is displayed.
What I am saying is that on returning an error 2 the kernel should
automatically run e2fsck -c.
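
For what it is worth, the fsck man page documents the exit status as a sum of
bits rather than a simple pass/fail number. A minimal sketch that decodes it,
using the meanings from fsck(8). The function name is hypothetical, and how
the boot script itself reacts to each value is not shown here:

```shell
#!/bin/sh
# Decode an fsck exit status into the conditions listed in fsck(8).
# The status is a bitwise OR of the values below, so several can apply at once.
decode_fsck_status() {
    status="$1"
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    for pair in \
        "1:errors corrected" \
        "2:system should be rebooted" \
        "4:errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:canceled by user request" \
        "128:shared library error"
    do
        bit="${pair%%:*}"     # number before the colon
        text="${pair#*:}"     # description after the colon
        if [ $(( status & bit )) -ne 0 ]; then
            echo "$text"
        fi
    done
    return 0
}

# Typical use right after a check:
#   fsck /dev/hdXN; decode_fsck_status $?
```

Read this way, a status of 1 means errors were found and fixed, and it is a
status with the 4 bit set that says errors were left uncorrected.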
...
...
but see the next paragraph
for my suggestion. Why have the 'bad block' option and why will it not
protect my data? Surely it will make sure that data is not lost because
the block has been marked as bad and therefore the data will be written
to a good block?
It only detects what new sectors are bad at that moment. What if the error
develops later, while the system is running? In fact, while running the HD 
is more vulnerable, because the heads are not parked, but flying at a very 
small distance from the HD surface.
So running e2fsck -c regularly might prevent writing to a bad block that
developed since your last e2fsck -c?
...
...
...
For a somewhat more complete check, boot from the rescue CD and test from
there.
I was thinking more along the lines of possibly aliasing the boot fsck
to e2fsck and having it run e2fsck each time the fsck is supposed to run
on a partition set with the tune2fs cmd ie every 3rd mount or 15 days etc.
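
The tune2fs schedule mentioned here could be set roughly as follows. This is
only a sketch: it prints the command unless DRY_RUN=0 is set, and the helper
names and /dev/hdXN are placeholders. The -c (maximum mount count) and -i
(interval between checks) options are the ones from the tune2fs man page.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Ask for a check after a number of mounts or a time interval,
# whichever comes first.
set_check_schedule() {
    dev="$1"
    mounts="$2"      # e.g. 3   = every 3rd mount
    interval="$3"    # e.g. 15d = every 15 days
    run tune2fs -c "$mounts" -i "$interval" "$dev"
}

set_check_schedule /dev/hdXN 3 15d   # placeholder device
```

Note that this schedules the ordinary logical check only. It does not add a
bad block scan to it.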
fsck calls e2fsck for you.
How so? fsck doesn't check for bad blocks, so you end up having to run
e2fsck if the boot fsck fails.
...
Playing with that is dangerous, because at some 
time you may have a differently formatted partition and apply the wrong 
program.
And not being a programmer, do not worry, I'll stay well away.
...
Look: SuSE people are quite expert and wise, and they have designed those 
scripts with a lot of thought and care. You really do not have to modify 
them.
If you are worried about bad blocks, do: 1) configure smartd to run
tests periodically in the background. 2) Keep your backups current. 3) If
you really need it, use raid setups.
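
For point 1, the periodic scheduling itself is normally left to the SMART
daemon's configuration, which is not shown here. The manual equivalents with
smartctl look roughly like the sketch below, which only prints the commands
unless DRY_RUN=0 is set. The helper names and /dev/hdX are placeholders; the
options are taken from the smartctl man page.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Enable SMART, start a long self-test and look at the results.
smart_routine() {
    dev="$1"
    run smartctl -s on "$dev"         # enable SMART on the drive
    run smartctl -t long "$dev"       # start an extended self-test in the background
    run smartctl -l selftest "$dev"   # self-test log (meaningful once the test has finished)
    run smartctl -H "$dev"            # overall health assessment
}

smart_routine /dev/hdX                # placeholder device
```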
OK.

To summarize:
A) HDD has not had smartctl enabled
1) When fsck is run it scans the fs on the disk and makes sure it is
    logically correct.
2) Data is written to the HDD
2.1) Part of the data written lands in a bad block not marked as bad
3) If fsck is run now it will determine that there is an error on the fs
    and instruct you to run it manually, presumably so that the bad block
    check can be run via e2fsck -c.
4) e2fsck is run, finds the bad block and marks it as such.
4.1) The data on that bad block is lost

in my case:
B) /home HDD and backup HDD have had smartctl enabled
1) When fsck is run it scans the fs on the disk and makes sure it is
    logically correct.
2) Data is written to the HDD
2.1) Part of the data written lands in a bad block not marked as bad
3) Not knowing that some of the data has landed on a bad sector I run a
    smartctl scan
3.1) It finds the error and marks the bad block
4) If fsck is run now it will determine that there is an error on the fs
    and instruct you to run it manually, presumably so that the bad
    block list of the fs can be updated via e2fsck -c.
5) e2fsck is run, finds the bad block and marks it as such.
5.1) The data on that bad block is lost

So how do fsck, e2fsck and smart integrate? Sorry if I've left
out/assumed steps.

-- 
The Little Helper
========================================================================
Hylton Conacher - Linux user # 229959 at http://counter.li.org
Currently using SuSE 9.0 Professional with KDE 3.1
Licenced Windows user
========================================================================