On Saturday 2005-01-15 at 15:41 +0200, Hylton Conacher (ZR1HPC) wrote:
So the fsck cmd at boot checks the logical fs and assumes the physical structure to be OK.
Yes.
If smartctl is enabled and run on the HDD, it checks the physical characteristics ie for bad blocks and keeps a list.
It does far more than that, but it does not keep a list, not exactly: it keeps a log. This log is not a file, but resides somewhere on the HD (there is a method to retrieve it).
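For example, something like this retrieves it (the device name is only an example):

  smartctl -l error /dev/hda      # same as --log=error; prints the drive's own error log
  smartctl -l selftest /dev/hda   # prints the log of the recent self-tests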
It does not however keep a list of them that the fs can access. If smartctl has been run and a block has been marked as bad,
It does not mark a block as bad, because it knows nothing about the structure of the filesystem. Remember, it is the HD hardware and its firmware, independent of the main CPU or operating system, that does the testing. The program smartctl just starts the process; it doesn't actually test anything. The HD routine just takes note that there are sectors with read errors, logs that fact, and later smartctl tells you.
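For example, these commands just tell the drive to start a test and then return; the drive does the testing by itself (device name is only an example):

  smartctl -t short /dev/hdb   # ask the drive to run a short offline self-test
  smartctl -t long /dev/hdb    # extended self-test of the whole surface; takes much longer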
the smartctl HDD utility can remap the bad block so that data written to the disk lands on a good block.
No, it doesn't: it is your job to take appropriate action. You can, for example, simply try to write to the bad sector. The disk notes that it can't, and at that moment remaps the sector. This is transparent to any filesystem or operating system you use. Or you can take down the computer, reboot from the rescue CD, and then use the tools there to repair the filesystem: not really repair, but use a filesystem check utility that can locate and remap bad blocks, for example.
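A sketch of that first method, writing to the bad sector (the LBA number and device are pure examples; this overwrites the sector's contents, so only do it on a sector you know is bad and whose owning file, if any, you have identified):

  # write one 512-byte sector of zeros at LBA 24123036; if the sector is bad,
  # the drive notices at write time and remaps it transparently
  dd if=/dev/zero of=/dev/hdb bs=512 count=1 seek=24123036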
When the fs is checked at boot time, if it should find an error on the fs caused by the physical HDD (and as a result of it not being able to read the smartctl bad blocks list), it assumes that fsck should be run manually, i.e. e2fsck with -c /dev/hdx to check for bad blocks. This then creates a list of bad blocks the fs cannot use.
In effect, though, we have two lists of bad blocks, or does the fsck program, via e2fsck -c, take the smartctl list and add/append to it?
No, no. smartctl does not generate any list.
What is the relationship between smartctl, fsck, and e2fsck(being a type of fsck for ext3 fs)?
smartctl is independent. fsck is a wrapper that calls the actual program needed to check each partition type with its own different checker program. In effect, fsck calls fsck.ext3, which is the same as e2fsck, but with a different name. Look and believe:

  -rwxr-xr-x  3 root root 131344 Apr  6  2004 /sbin/fsck.ext3*
  -rwxr-xr-x  3 root root 131344 Apr  6  2004 /sbin/e2fsck*

  nimrodel:~ # cmp /sbin/e2fsck /sbin/fsck.ext3
  nimrodel:~ #

Convinced? Both are the _same_ program.

To see the list of errors SMART logs, do:

  smartctl --log=error /dev/hdb |less

  SMART Error Log Version: 1
  ATA Error Count: 325 (device log contains only the most recent five errors)
  ...
  Error 325 occurred at disk power-on lifetime: 4275 hours
    When the command that caused the error occurred, the device was active or idle.
    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 64 9c 16 70 51  Error: UNC at LBA = 0x0170169c = 24123036

There you see the LBA address of the error I had _years_ ago; this is not usable for fsck.

With this other command, I see the log of the last 20 tests:

  smartctl --log=selftest /dev/hdb |less

  smartctl version 5.30 Copyright (C) 2002-4 Bruce Allen
  Home page is http://smartmontools.sourceforge.net/

  === START OF READ SMART DATA SECTION ===
  SMART Self-test log structure revision number 1
  Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
  # 1  Short offline       Completed without error       00%       8099         -
  # 2  Short offline       Completed without error       00%       8092         -
  # 3  Short offline       Completed without error       00%       8092         -
  # 4  Short offline       Completed without error       00%       8085         -
  # 5  Short offline       Completed without error       00%       8085         -
  # 6  Short offline       Completed without error       00%       8072         -
  # 7  Short offline       Completed without error       00%       8072         -
  # 8  Short offline       Completed without error       00%       8058         -
  # 9  Short offline       Completed without error       00%       8057         -
  #10  Short offline       Completed without error       00%       8046         -
  #11  Short offline       Completed without error       00%       8045         -
  #12  Short offline       Completed without error       00%       8043         -
  #13  Short offline       Completed without error       00%       8038         -
  #14  Extended offline    Completed without error       00%       8038         -

If there were errors, the sector (LBA sector, not ext3 sector) would be logged there. But only the first one.

And... it is not the first time I tell you: I did last October:

  Date: Mon, 18 Oct 2004 13:07:25 +0200 (CEST)
  From: Carlos E. R.
  To: SLE <suse-linux-e
  Subject: Re: [SLE] Re: {SLE} SMART HDD technology was Re: [SLE] e2fsck command
[snip]
Ok, it goes like this:

1) There is a physical error in the media, meaning that what the kernel tries to write is not the same as what it reads back.
2) "Something" marks the block(s) containing that error as bad.

Could I assume the 'Something' to be either smartctl or e2fsck?
smartctl no. e2fsck, yes. Also, the kernel "could" do it on the fly.
If SMART doesn't check for bad blocks... OK, sorry, it does check for them but doesn't repair them. What would repair them, as neither e2fsck nor mke2fs seems to offer a repair option?
Bad blocks can not be "repaired": they are bad, and remain so for ever. They can be "remapped", depending on the filesystem: FAT, ext2, reiserfs partially (since SuSE 9.0 or 9.1). And e2fsck can do that (with -c).
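For example (the partition name is only an example, and it must be unmounted first):

  e2fsck -c /dev/hdb1    # runs badblocks in read-only mode and adds what it finds to the bad block inode
  e2fsck -cc /dev/hdb1   # non-destructive read-write test; much slower, but catches more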
then in theory fsck should check for bad blocks
By default, it does not, because it assumes that the physical disk is OK.

Personally I think the fsck should have the e2fsck -c option added into boot. I'll have to investigate if there is a kernel wishlist.
No, I don't. Nobody will add it, and certainly not to the kernel, as it is the boot.localfs script that does the testing, not the kernel. Nobody will want that "feature", because then boot time will be measured in HOURS!
Because there is no need, and because it is terribly time consuming, a matter of several hours. You are too paranoid about them, I think :-)

:) Paranoid I might be, but a stable and safe HDD space to put data on is something that computer folk (read: I) have dreamt of for years, i.e. a self-healing system. Heck, it would almost negate the purpose of backups, saving companies fortunes.
You can never ensure that a system will never fail. You can just make it less probable, and limit the damage. My home disks are now about 8000 hours old (of real usage). One of them developed a few bad blocks a long time ago (more than a year or two). I solved the incident manually, and that is the last I heard of it for a very long time. And that doesn't mean that I'm going to worry to the extreme of having to wait several _hours_ after power on before the system comes on line, because it is testing itself for I/O errors on all the HD surfaces! Do you go to your doctor every day for a full checkup before getting up? Because that is what you are asking.
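(If you are curious, that age can be read from SMART itself; the device name is only an example:)

  smartctl -A /dev/hdb | grep -i power_on   # attribute 9, Power_On_Hours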
That '/forcefsck' option is a little strong
Why? The standard fsck normally runs at boot anyway; it is the e2fsck -c option that has to be called if the standard fsck fails. It would be a good idea to '/force2fsck -c' if and only if the normal fsck failed.
/forcefsck just tells '/etc/init.d/boot.localfs' that you want to check the filesystems regardless of whether it is needed or not. It doesn't do anything "drastic".

Mmmm, I had a look at the /etc...bootlocalfs file and see that it mentions that if the error code is 1 then all is OK; if 2, the fsck failed and the message 'fsck has failed. Please run fsck manually' is displayed.
Right.
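For example, to request that check on the next boot (the boot scripts normally remove the flag again afterwards):

  touch /forcefsck
  reboot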
what I am saying is that on returning an error 2 the kernel should automatically run e2fsck -c.
No. The error can be anything! It is you who must decide what to do. It can be a false alarm, a misconfiguration... more automated action and you may ruin your HD completely :-|
but see the next paragraph for my suggestion. Why have the 'bad block' option and why will it not protect my data? Surely it will make sure that data is not lost because the block has been marked as bad and therefore the data will be written to a good block?
It only detects what new sectors are bad at that moment. What if the error develops later, while the system is running? In fact, while running, the HD is more vulnerable, because the heads are not parked, but flying at a very small distance from the HD surface.

So running e2fsck -c regularly might prevent writing to a bad block that developed since your last e2fsck -c?
Yes. So would smartctl, and that doesn't hold up the system.

No! If the HD detects an error when it is going to _write_ to a sector, that sector is remapped then and there, without telling anybody. Certainly the kernel knows nothing (only a time delay). An fsck run later would know nothing about it; the sector would show as correct (because it has been remapped somewhere else transparently).
For a somewhat more complete check, boot from the rescue CD and test from there.
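For example, from the rescue system, with the partition not mounted (partition name is only an example):

  e2fsck -f -c /dev/hdb1   # -f forces a full check even if the fs looks clean, -c adds the bad block scan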
I was thinking more along the lines of possibly aliasing the boot fsck to e2fsck, and having it run e2fsck each time fsck is supposed to run on a partition, as set with the tune2fs cmd, i.e. every 3rd mount or 15 days etc.
fsck calls e2fsck for you.

How so? fsck doesn't check for bad blocks; you end up having to run e2fsck if the boot fsck fails?
Read again the man page. fsck passes options it doesn't know how to handle to e2fsck.
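The mount count / interval schedule you mention is set with tune2fs; for example (partition name is only an example):

  tune2fs -c 3 -i 15d /dev/hdb1                         # check every 3rd mount or every 15 days, whichever comes first
  tune2fs -l /dev/hdb1 | grep -Ei 'mount count|check'   # show the current settings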
To summarize:

A) HDD has not had smartctl enabled
1) When fsck is run it scans the fs on the disk and makes sure it is logically correct.
2) Data is written to the HDD
2.1) Part of the data written lands in a bad block not marked as bad
3) If fsck is run now it will determine that there is an error on the fs and instruct you to run it manually, presumably so that the bad block check can be run via e2fsck -c.
Not by default. There is no logical or structure error.
4) e2fsck is run, finds the bad block and marks it as such.
4.1) The data on that bad block is lost
Correction. If the error is noticed at write time, the HD itself takes action (if it is reasonably modern). If the kernel gets to know, it "could" take action as well (even old MsDOS was designed for this case). If the error develops after writing, it is detected when someone tries to read it again. Remember that fsck and derivatives do not check for read errors.
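A read-only surface scan will find those, without touching the filesystem at all; for example (device name is only an example, and on a big disk it takes hours):

  badblocks -sv /dev/hdb   # read-only test of the whole disk; -s shows progress, -v is verbose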
In my case:

B) /home HDD and backup HDD have had smartctl enabled
1) When fsck is run it scans the fs on the disk and makes sure it is logically correct.
2) Data is written to the HDD
2.1) Part of the data written lands in a bad block not marked as bad
As I said above, no.
3) Not knowing that some of the data has landed on a bad sector, I run a smartctl scan
3.1) It finds the error and marks the bad block
No, it is just reported. You could configure the daemon to email you, or sound your beeper. Whatever. :-)
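For example, something like this line in /etc/smartd.conf (the device, address and schedule are only examples, and the -s scheduling directive needs a reasonably recent smartmontools):

  # -a: monitor everything, -m: mail warnings to root, -s: short self-test every day at 02:00
  /dev/hdb -a -m root -s S/../.././02

Then restart the smartd daemon so it rereads the file.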
3) If fsck is run now it will determine that there is an error on the fs and instruct you to run it manually, presumably so that the bad block list of the fs can be updated via e2fsck -c.
No, it doesn't. It will be detected only if the error falls in a sector belonging to the structure, like a directory, for the simple reason that it doesn't read everything, unless you give it the -c option.
4) e2fsck is run, finds the bad block and marks it as such.
No. Only if you run it manually with the right option.
4.1) The data on that bad block is lost
Maybe yes, maybe not. It may just mark a sector as bad, or it may try to read it first and remap it, leaving the file usable. Old chkdsk for MsDOS did, so why not fsck? Of course, as the sector has a read error, the contents of the sector could be wrong. That's one more reason to run it manually, so that you check the file against a backup if you can.
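If you need to know which file owns a bad block, so that you can restore it from a backup, debugfs can map a filesystem block to an inode and the inode to a name; for example (the block, inode and partition numbers are only examples):

  debugfs -R "icheck 123456" /dev/hdb1   # which inode uses block 123456?
  debugfs -R "ncheck 654321" /dev/hdb1   # which path belongs to inode 654321?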
So how do fsck, e2fsck and SMART integrate? Sorry if I've left out/assumed steps.
Ough... see somewhere above. fsck just calls e2fsck as soon as it notices it is an ext2/3 partition. fsck itself does nothing (have a look at its size), except detecting the type of partition and passing control to whatever program is appropriate.

--
Cheers,
       Carlos Robinson