On Thursday 2005-01-20 at 18:42 +0200, Hylton Conacher (ZR1HPC) wrote:
If smartctl is enabled and run on the HDD, it checks the physical characteristics ie for bad blocks and keeps a list.
It does far more than that, but it does not keep a list, not exactly: it keeps a log. This log is not a file, but resides somewhere on the HD (there is a method to retrieve it). In this case a log of all the bad blocks, amongst other things.
No, it is a log of each error together with the internal command stack leading to the error. Here, look at a sample (one of my disks had errors two years ago):

  Error 325 occurred at disk power-on lifetime: 4275 hours
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 64 9c 16 70 51  Error: UNC at LBA = 0x0170169c = 24123036

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
    -- -- -- -- -- -- -- --   ---------  --------------------
    40 d0 00 00 15 70 51 00       3.514  READ VERIFY SECTOR(S)
    40 d0 00 00 14 70 51 00       3.476  READ VERIFY SECTOR(S)
    40 d0 00 00 13 70 51 00       7.384  READ VERIFY SECTOR(S)
    40 d0 00 00 12 70 51 00       3.537  READ VERIFY SECTOR(S)
    40 d0 00 00 11 70 51 00       3.499  READ VERIFY SECTOR(S)

  Error 324 occurred at disk power-on lifetime: 4274 hours
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 64 9c 16 70 51  Error: UNC at LBA = 0x0170169c = 24123036

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
    -- -- -- -- -- -- -- --   ---------  --------------------
    40 d0 00 00 15 70 51 00       3.473  READ VERIFY SECTOR(S)
    40 d0 00 00 14 70 51 00       3.441  READ VERIFY SECTOR(S)
    40 d0 00 00 13 70 51 00       7.341  READ VERIFY SECTOR(S)
    40 d0 00 00 12 70 51 00       3.476  READ VERIFY SECTOR(S)
    40 d0 00 00 11 70 51 00       3.474  READ VERIFY SECTOR(S)

See? It is _not_ a list of bad sectors.
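For the curious: the LBA that smartctl prints can be rebuilt by hand from the register dump in the log above. The following is only a small sketch, assuming the usual 28-bit LBA layout (SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low four bits of DH hold bits 24-27). The function name lba28_from_regs is made up for this illustration.

```shell
#!/bin/sh
# lba28_from_regs SN CL CH DH
# Rebuild a 28-bit LBA from the hex register values shown in the error log.
# Assumption: LBA28 addressing, with LBA bits 24-27 in the low nibble of DH.
lba28_from_regs() {
    sn=$1 cl=$2 ch=$3 dh=$4
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Registers from error 325 in the log: SN=9c CL=16 CH=70 DH=51
lba28_from_regs 9c 16 70 51
```

With the registers of error 325 this gives the same decimal LBA that the log reports, 24123036.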
You can, for example, simply try to write to the bad sector. The disk notes that it can't, and at that moment remaps the sector. This is transparent to any filesystem or operating system you use. And smartctl does it.
No, not smartctl: the HD's own firmware (its "BIOS") does it, independently. smartctl is just an interface to it, after it has happened.
Or you can take down the computer, reboot from the rescue CD, and then use the tools there to repair the filesystem: not really repair, but use a filesystem check utility that can locate and remap bad blocks. Or could you schedule frequent e2fsck -c scans of all the partitions other than /, which would have to be checked via a boot/rescue disk?
Er... yes, I think fsck can do it, but... don't do it frequently. What's more, you cannot schedule it, because it has to be a manual operation: the partition has to be unmounted first, and that will fail if it is in use. As I said, for scheduled maintenance, use the smartmontools daemon (smartd), and run the badblocks check manually, _after_ you know there is an error.
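To make the "must be unmounted first" point concrete, here is a rough sketch of a guard that refuses to go near a mounted partition. It only prints the e2fsck command instead of running it. The function names and the partition /dev/hdb2 in the usage note are invented for this example, and the mounts file is a parameter so the logic can be tried on a copy instead of the live /proc/mounts.

```shell
#!/bin/sh
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE appears in the first column of the mounts table.
# MOUNTS_FILE defaults to /proc/mounts (a Linux assumption).
is_mounted() {
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}

# check_for_badblocks DEVICE [MOUNTS_FILE]
# Dry run: prints the command a person would then run by hand.
check_for_badblocks() {
    dev=$1
    mounts=${2:-/proc/mounts}
    if is_mounted "$dev" "$mounts"; then
        echo "refusing: $dev is mounted, unmount it first" >&2
        return 1
    fi
    echo "e2fsck -c $dev"
}
```

Something like `check_for_badblocks /dev/hdb2` would then either complain or print the command to run. It is deliberately a dry run, because this is the kind of check you want to start yourself, once SMART has told you there is a problem.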
Convinced? Both are the _same_ program. OK, I believe, I believe! :)
:-)
When fsck calls fsck.ext3, it does not run the program, ie e2fsck -c, but it allows the fs to be checked according to the ext3 rules. If it finds an error it says check manually?
Actually, fsck directly calls fsck.ext3 for ext3 partitions, fsck.reiserfs for reiser partitions, etc. Notice that fsck is a small program, 18484 bytes. Here, have a look at the sizes:

  -rwxr-xr-x  1 root root  18484 Apr  6  2004 /sbin/fsck*
  -rwxr-xr-x  1 root root  10580 May 27  2004 /sbin/fsck.cramfs*
  -rwxr-xr-x  3 root root 131344 Apr  6  2004 /sbin/fsck.ext2*
  -rwxr-xr-x  3 root root 131344 Apr  6  2004 /sbin/fsck.ext3*
  -rwxr-xr-x  2 root root 419926 Apr  6  2004 /sbin/fsck.jfs*
  -rwxr-xr-x  1 root root  22356 May 27  2004 /sbin/fsck.minix*
  lrwxrwxrwx  1 root root      7 Aug 15 14:01 /sbin/fsck.msdos -> dosfsck*
  -rwxr-xr-x  2 root root 275884 Apr  6  2004 /sbin/fsck.reiserfs*
  lrwxrwxrwx  1 root root      7 Aug 15 14:01 /sbin/fsck.vfat -> dosfsck*
  -rwxr-xr-x  1 root root   4475 Apr  6  2004 /sbin/fsck.xfs*

Most of the fsck.whatever programs are much bigger than fsck; they are the real workhorses. fsck on its own does nearly nothing, except call the others. Notice also that fsck.ext2 and fsck.ext3 are the same file (same size, link count of 3).
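A small sketch of both points, in case it helps: how the helper name follows from the filesystem type, and how to confirm that two names are really one file. The function names are invented here. The /sbin prefix is simply taken from the listing (the real fsck has its own way of locating helpers), and the inode comparison assumes both names live on the same filesystem.

```shell
#!/bin/sh
# fsck_helper_for FSTYPE
# Name of the type-specific checker, following the fsck.<type> pattern
# seen in the listing. The /sbin location is taken from that listing.
fsck_helper_for() {
    printf '/sbin/fsck.%s\n' "$1"
}

# same_file A B
# Succeeds if both names resolve to the same inode (hard links, or a
# symlink and its target). Assumes both are on the same filesystem.
same_file() {
    a=$(ls -Ldi "$1" | awk '{ print $1 }')
    b=$(ls -Ldi "$2" | awk '{ print $1 }')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

fsck_helper_for ext3
```

On a system laid out like the listing, `same_file /sbin/fsck.ext2 /sbin/fsck.ext3 && echo same` should print "same".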
To see the list of errors SMART logs, do:
smartctl --log=error /dev/hdb |less
Great, no errors here. But then you already knew that. :)
No errors? What is this, then:
  SMART Error Log Version: 1
  ATA Error Count: 325 (device log contains only the most recent five errors)
  .................^^^

So, this disk did have serious errors, once upon a time. I'm still using it. It cured itself, with some help from me. The manufacturer had taken measures in advance to handle some I/O errors; that's what it is designed for.
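If you only want that one number, it can be pulled out of the output with a filter. This is a minimal sketch that relies on the "ATA Error Count:" line looking exactly as it does in the output quoted above. The function name ata_error_count is made up, and it prints nothing if that line is absent.

```shell
#!/bin/sh
# ata_error_count
# Reads "smartctl --log=error" output on stdin and prints the total number
# of errors the drive has counted. Prints nothing if the line is missing.
ata_error_count() {
    sed -n 's/^ATA Error Count: \([0-9][0-9]*\).*/\1/p'
}

# Sample lines taken from the output quoted in this thread.
printf '%s\n' \
    'SMART Error Log Version: 1' \
    'ATA Error Count: 325 (device log contains only the most recent five errors)' |
    ata_error_count
```

On a real system the input would come from `smartctl --log=error /dev/hdb | ata_error_count`.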
If there were errors, the sector (LBA sector, not ext3 sector) would be logged there. But only the first one.
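On "LBA sector, not ext3 sector": the LBA counts 512-byte sectors from the start of the whole disk, while the filesystem counts its own blocks from the start of the partition. A rough sketch of the conversion follows, assuming 512-byte sectors and a block size that is a multiple of 512. The partition start (63) and block size (4096) used below are made-up values for illustration, not taken from any real disk in this thread.

```shell
#!/bin/sh
# fs_block_from_lba LBA PART_START BLOCK_SIZE
# LBA and PART_START are in 512-byte sectors, BLOCK_SIZE is in bytes.
# Prints the filesystem block (counted from the partition start) that
# contains the given disk sector.
fs_block_from_lba() {
    lba=$1 start=$2 bsize=$3
    echo $(( ($lba - $start) / ($bsize / 512) ))
}

# LBA from the error log; start sector and block size are hypothetical.
fs_block_from_lba 24123036 63 4096
```

With those assumed values the bad sector would land in filesystem block 3015371. The real start sector and block size have to be read from the partition table and from the filesystem itself.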
And... it is not the first time I have told you: I did so last October.

Alright, alright, so I asked it before. Unfortunately I couldn't reference my message store as the dual boot HDD my Linux was on died just after our little episode. :(
Ah... so you are worried about HD dying on you. I see.
I also only just found out how to query the archives via Google.
It was a pity but this explanation was a little better ;) I'll ask it again in 6 months time for all the green newbies too.
:) Just kidding :)
X-)
To FINALLY summarize: smartctl looks after the physical HDD and keeps its own log of errors it has found. It does not interface with any other program.
Right. But it does way more than that: it runs some manufacturer tests. Look at the output of "smartctl -a /dev/hdb|less" on your drive, or more to the point, "--attributes". It is a table of parameters that try to determine the health status of the drive. For example, if the Raw_Read_Error_Rate crosses its failure threshold, the output of the command:

  nimrodel:~ # smartctl --health /dev/hdb
  smartctl version 5.30 Copyright (C) 2002-4 Bruce Allen
  Home page is http://smartmontools.sourceforge.net/

  === START OF READ SMART DATA SECTION ===
  SMART overall-health self-assessment test result: PASSED

would directly say that my drive was about to fail, soon. That's why I say that the daemon is _important_.

You see, there are more things that can go bad than bad blocks. For example, the spindle could have excessive wear, and wobble. The motor may overheat. The controller could have bad chips. The head could have a very bad crash landing... tons of things.

Bad blocks on their own are not critical. The first drive I had (a 32 MB unit on a full-length ISA card) came with a paper label with a list of six bad sectors, handwritten by the manufacturer at their final QA test. You were supposed to low-level format the disk yourself, and enter the list. Nowadays that is no longer necessary: HDs come with spare space reserved by the manufacturer, and the bad sectors that develop during the drive's life are automatically remapped there.
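The health verdict is easy to test for in a script. Below is a small sketch that keys on the "self-assessment test result" line exactly as it appears in the output above. The function name health_passed is invented, and anything other than PASSED (including a missing line) is treated as trouble.

```shell
#!/bin/sh
# health_passed
# Reads "smartctl --health" output on stdin. Succeeds only if the overall
# self-assessment line says PASSED.
health_passed() {
    grep -q '^SMART overall-health self-assessment test result: PASSED'
}

# Sample lines taken from the output quoted in this thread.
sample='=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED'

if printf '%s\n' "$sample" | health_passed; then
    echo "drive reports healthy"
else
    echo "drive reports trouble (or no health line found)"
fi
```

In real use the input would be `smartctl --health /dev/hdb | health_passed`. The daemon does this sort of watching for you, which is the reason to run it.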
fsck looks after the logical fs structure and uses the fs-specific fsck program to check that the structure of the fs is per standard. If it finds an error, physical or logical, it requires manual intervention, i.e. running e2fsck with the -c option to confirm the fs is logically correct and also to scan for errors on the fs caused by the hardware, i.e. bad blocks.
Yes! :-) Well, almost O:-) The "-c" option is not usually needed. A logical error can appear without there being any bad block at all, and may need manual intervention. The thing is, fsck is run automatically at boot time, with the options selected so that it is safe to leave it in automatic mode, or with little intervention. It is the script written by SuSE that decides not to continue booting the system and requests that you run fsck manually, because the cause can be, for example, a mistaken /etc/fstab file (it happened to me once).
Thanks a TON Carlos.
Welcome :-) -- Cheers, Carlos Robinson