Re: [SLE] Preventing fs errors -with the e2fsck command? and SMART, e2fsck confusion

21 Jan 2005

On Thursday 2005-01-20 at 18:42 +0200, Hylton Conacher (ZR1HPC) wrote:
...
...
...
If smartctl is enabled and run on the HDD, it checks the physical
characteristics ie  for bad blocks and keeps a list.
It does far more than that, but it does not exactly keep a list: it
keeps a log. This log is not a file; it resides somewhere on the HD
itself (there is a method to retrieve it).
In this case a log of all the bad blocks, amongst other things.
No, it is a log of each error together with the internal command stack
leading to the error. Here, look at a sample (one of my disks had errors
two years ago):

Error 325 occurred at disk power-on lifetime: 4275 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 64 9c 16 70 51  Error: UNC at LBA = 0x0170169c = 24123036

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  40 d0 00 00 15 70 51 00       3.514  READ VERIFY SECTOR(S)
  40 d0 00 00 14 70 51 00       3.476  READ VERIFY SECTOR(S)
  40 d0 00 00 13 70 51 00       7.384  READ VERIFY SECTOR(S)
  40 d0 00 00 12 70 51 00       3.537  READ VERIFY SECTOR(S)
  40 d0 00 00 11 70 51 00       3.499  READ VERIFY SECTOR(S)

Error 324 occurred at disk power-on lifetime: 4274 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 64 9c 16 70 51  Error: UNC at LBA = 0x0170169c = 24123036

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  40 d0 00 00 15 70 51 00       3.473  READ VERIFY SECTOR(S)
  40 d0 00 00 14 70 51 00       3.441  READ VERIFY SECTOR(S)
  40 d0 00 00 13 70 51 00       7.341  READ VERIFY SECTOR(S)
  40 d0 00 00 12 70 51 00       3.476  READ VERIFY SECTOR(S)
  40 d0 00 00 11 70 51 00       3.474  READ VERIFY SECTOR(S)

See? It is _not_ a list of bad sectors.
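
For the curious, the LBA that smartctl prints can be rebuilt from the register dump. This is a minimal sketch, assuming the usual 28-bit LBA layout (SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble of DH bits 24-27); the function name decode_lba28 is made up for illustration.

```python
def decode_lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA.

    Assumed layout: SN = LBA bits 0-7, CL = bits 8-15, CH = bits 16-23,
    low nibble of DH = bits 24-27 (the high nibble carries mode/drive flags).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the "Error 325" entry above.
lba = decode_lba28(sn=0x9C, cl=0x16, ch=0x70, dh=0x51)
print(f"0x{lba:08x} = {lba}")  # prints "0x0170169c = 24123036"
```

The result matches the "UNC at LBA" value in the log, so each entry points at one failing sector address plus the commands that led to it.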
...
...
You can, for example, simply try to write to the bad sector. The disk notes
that it can't, and at that moment remaps the sector. This is transparent to
any filesystem or operating system you use.
And smartctl does it.
No, not smartctl, but the HD's own firmware, independently. smartctl is
just an interface to it - after it has happened.
...
...
Or you can take down the computer, reboot from the rescue CD, and then use
the tools there to repair the filesystem: not really repair, but use a
filesystem check utility that can locate and remap bad blocks.
Or could you schedule frequent e2fsck -c scans of all the partitions
other than /, which would have to be checked via a boot/rescue disk?
Er... yes, I think fsck can do it, but... don't do it frequently. What's
more, you cannot schedule it, because it has to be a manual operation:
the partition has to be unmounted first, and that will fail if it is in
use. As I said, for scheduled maintenance, use the SMART daemon (smartd),
and run the badblocks check manually, _after_ you know there is an error.
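
The "must be unmounted first" condition is easy to check before starting a manual scan. A minimal sketch, assuming text in the format of /proc/mounts (device name in the first field); the function name is_mounted and the sample mount table are made up for illustration.

```python
def is_mounted(device: str, mounts_text: str) -> bool:
    """Return True if `device` is the source field of any mount entry."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            return True
    return False


# Hypothetical sample in /proc/mounts format, not taken from a real system.
SAMPLE = """\
/dev/hdb1 / ext3 rw 0 0
/dev/hdb3 /home ext3 rw 0 0
"""
print(is_mounted("/dev/hdb3", SAMPLE))  # True
print(is_mounted("/dev/hdb5", SAMPLE))  # False
```

On a live system the text would come from open("/proc/mounts").read(). The check is deliberately simple: a partition mounted under another name (a label or a symlink) would not be caught.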
...
...
Convinced? Both are the _same_ program.
OK, I believe, I believe! :)
:-)
...
When fsck calls fsck.ext3, it does not run the program, ie e2fsck -c,
but it allows the fs to be checked according to the ext3 rules. If it
finds an error it says check manually?
Actually, fsck directly calls fsck.ext3 for ext3 partitions, fsck.reiserfs
for reiser partitions, etc. Notice that fsck is a small program, 18484
bytes. Here, have a look at the sizes:

-rwxr-xr-x  1 root root  18484 Apr  6  2004 /sbin/fsck*
-rwxr-xr-x  1 root root  10580 May 27  2004 /sbin/fsck.cramfs*
-rwxr-xr-x  3 root root 131344 Apr  6  2004 /sbin/fsck.ext2*
-rwxr-xr-x  3 root root 131344 Apr  6  2004 /sbin/fsck.ext3*
-rwxr-xr-x  2 root root 419926 Apr  6  2004 /sbin/fsck.jfs*
-rwxr-xr-x  1 root root  22356 May 27  2004 /sbin/fsck.minix*
lrwxrwxrwx  1 root root      7 Aug 15 14:01 /sbin/fsck.msdos -> dosfsck*
-rwxr-xr-x  2 root root 275884 Apr  6  2004 /sbin/fsck.reiserfs*
lrwxrwxrwx  1 root root      7 Aug 15 14:01 /sbin/fsck.vfat -> dosfsck*
-rwxr-xr-x  1 root root   4475 Apr  6  2004 /sbin/fsck.xfs*

Most of the fsck.whatever programs are much bigger than fsck (fsck.cramfs
and fsck.xfs are the exceptions): they are the real workhorses. fsck on
its own does nearly nothing, except call the others.
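
The dispatch idea can be sketched in a few lines. This is only an illustration of the naming scheme seen in the listing above, not the real fsck source; the function name helper_for is made up.

```python
def helper_for(fstype: str, sbin: str = "/sbin") -> str:
    """Build the path of the filesystem-specific checker for `fstype`."""
    return f"{sbin}/fsck.{fstype}"


print(helper_for("ext3"))      # /sbin/fsck.ext3
print(helper_for("reiserfs"))  # /sbin/fsck.reiserfs
```

The front end only has to work out the filesystem type and hand the device to the matching helper; all the checking logic lives in the helper.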

Notice also that fsck.ext2 is the same as fsck.ext3: same size, same
date, and a link count of 3, which suggests they are hard links to one
program.
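
Whether two names really are the same file can be verified by comparing inodes instead of sizes. A minimal sketch using only os.stat; the function names same_file and link_count are made up for illustration.

```python
import os


def same_file(path_a: str, path_b: str) -> bool:
    """True if both names resolve to one inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def link_count(path: str) -> int:
    """Number of directory entries pointing at the file (ls shows this too)."""
    return os.stat(path).st_nlink
```

For example, same_file("/sbin/fsck.ext2", "/sbin/fsck.ext3") would answer the question directly on a system laid out like the listing above. Note that os.stat follows symlinks, so two symlinks to one program also count as the same file.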
...
...
To see the list of errors SMART logs, do:
smartctl --log=error /dev/hdb |less

...
Great, no errors here. But then you already knew that. :)
No errors? What is this, then:
...
...
SMART Error Log Version: 1
     ATA Error Count: 325 (device log contains only the most recent five errors)
........................^^^

So, this disk did have serious errors, once upon a time. I'm still using
it. It cured itself, with some help from me. The manufacturer had taken
measures in advance to handle some I/O errors; that is what it is
designed for.
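
Since the count is easy to overlook in the output, a script can pull it out. A minimal sketch, assuming the header line has the form shown above; the function name ata_error_count is made up, and the sample text is the two lines quoted in this message.

```python
import re


def ata_error_count(log_text: str):
    """Return the number after 'ATA Error Count:', or None if absent."""
    match = re.search(r"ATA Error Count:\s*(\d+)", log_text)
    return int(match.group(1)) if match else None


# Lines copied from the smartctl output quoted above.
SAMPLE = """SMART Error Log Version: 1
     ATA Error Count: 325 (device log contains only the most recent five errors)
"""
print(ata_error_count(SAMPLE))  # 325
```

A result of None only means the header line was not found in the text; it is not proof that the drive never logged an error.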
...
...
If there were errors, the sector (LBA sector, not ext3 sector) would be
logged there. But only the first one.
And... it is not the first time I've told you: I did last October:
Alright, alright, so I asked it before. Unfortunately I couldn't
reference my message store as the dual boot HDD my Linux was on died
just after our little episode. :(
Ah... so you are worried about HD dying on you. I see.
...
I also only just found out how to query the archives via Google.
It was a pity but this explanation was a little better ;)
I'll ask it again in 6 months time for all the green newbies too.
:)
Just kidding :)
X-)
...
To FINALLY summarize:
smartctl looks after the physical HDD and keeps its own log of errors it
has found. It does not interface with any other program.
Right.

But it does way more than that: it also runs some manufacturer tests.
Look at the output of "smartctl -a /dev/hdb|less" on your drive, or more
to the point, "--attributes". It is a table of parameters that try to
determine the health status of the drive. For example, if the
Raw_Read_Error_Rate crosses its failure threshold, the output of the
command:

	nimrodel:~ # smartctl --health /dev/hdb
	smartctl version 5.30 Copyright (C) 2002-4 Bruce Allen
	Home page is http://smartmontools.sourceforge.net/

	=== START OF READ SMART DATA SECTION ===
	SMART overall-health self-assessment test result: PASSED

would say directly that my drive was about to fail soon. That's why I say
that the daemon is _important_.
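
The same verdict can be read by a script. A minimal sketch, assuming the result line has the form shown in the output above; the function name overall_health is made up, and the sample is the line from this message.

```python
import re


def overall_health(text: str):
    """Return the word after the overall-health result label, or None."""
    match = re.search(
        r"overall-health self-assessment test result:\s*(\S+)", text
    )
    return match.group(1) if match else None


# Line copied from the smartctl --health output above.
SAMPLE = "SMART overall-health self-assessment test result: PASSED\n"
print(overall_health(SAMPLE))  # PASSED
```

To use it for real, the text would be the output of "smartctl --health /dev/hdb" run as root. Anything other than PASSED deserves a closer look, which is essentially what the daemon automates.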

You see, there are more things that can go bad than badblocks. For 
example, the spindle could have excessive wear, and wobble. The motor may 
overheat. The controller could have bad chips. The head could have a very 
bad crash landing... tons of things.

Bad blocks on their own are not critical. The first drive I had (a 32 MB
unit on a full length ISA card) came with a paper label with a list of six
bad sectors, handwritten by the manufacturer at their final QA test. You
were supposed to low level format the disk yourself, and enter the list.
Nowadays that is no longer necessary: HDs come with space reserved by the
manufacturer, so that bad sectors which develop during the drive's life
can be remapped there automatically.
...
fsck looks after the logical fs structure and uses the fs specific fsck
option to check that the structure of the fs is per standard. If it
finds a physical or logical error, it requires manual intervention, i.e.
running the e2fsck -c option to confirm the fs is logically correct
and also to scan for errors on the fs caused by the hardware, i.e. bad
blocks.
Yes! :-)

Well, almost O:-)

The "-c" option is not usually needed. A logical error can appear without
there being any bad block at all, and may need manual intervention. The
thing is, fsck is run automatically at boot time, with the options
selected so that it is safe to leave it in automatic mode, or with little
intervention. It is the script written by SuSE that decides not to
continue booting the system and requests that you run fsck manually -
because, for example, the cause can be a mistake in the /etc/fstab file
(it happened to me once).
...
Thanks a TON Carlos.
Welcome :-)

-- 
Cheers,
       Carlos Robinson

Carlos E. R.