On 03/04/2015 04:41 PM, Stanislav Brabec wrote:
Hello.
Many years ago we decided that smartd will not enable self tests by default.
Back then, some discs had firmware problems, and a self test sometimes caused delays or even freezes. (In those days, smartd was even disabled by default for the same reason.) A possible shortening of drive lifetime by self tests was also discussed.
Over the years, hardware changed a lot. Only two HDD manufacturers remain, and no firmware crashes/freezes/delays caused by S.M.A.R.T. were reported for nearly 10 years.
With the rising density of discs, the importance of regular checks rises as well. Detecting weak data early can even prevent data loss before it occurs.
That is why I think it is time to re-evaluate the old decision and consider enabling regular self tests by default.
If we don't run any self tests, there is no guarantee (i.e. the specification does not require it) that S.M.A.R.T. attributes are monitored at all. However, it seems that most drives perform an Offline Self Test every 4 hours (see the "Automatic Offline Testing" capability) and update core parameters.
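Whether a particular drive actually has Automatic Offline Testing enabled can be checked with smartctl. A minimal sketch, assuming smartmontools is installed and /dev/sda stands in for the drive in question:

```sh
# Show drive capabilities, including whether "Auto Offline Data
# Collection" is supported and currently enabled:
smartctl -c /dev/sda

# Enable it explicitly if the firmware shipped with it turned off:
smartctl -o on /dev/sda
```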
smartd is able to predict some types of failures related to degraded operation, such as failing mechanical parts (permanent seek errors, a head crash, head degradation) or failing electronics.
But it is not capable of predicting a failing disc surface.
Monitoring of vital S.M.A.R.T. data has been enabled in openSUSE since 2010; it is done twice an hour. No crash triggered by smartd on faulty firmware has been reported since 2005.
I propose to keep the vital S.M.A.R.T. data check at this default frequency. I also propose not to run Offline tests from smartd, as most (all?) discs do them every 4 hours anyway, and to run Short Self Tests instead.
Short Self Test: The Short Self Test verifies the status of the hardware functions. It takes several minutes.
I propose to run Short Self Test once a day.
Benefits: There is no guarantee that the Offline Self Test covers all tests of the Short Self Test, or that the Offline Self Test is even run regularly by the firmware. Depending on the firmware implementation, a Short Self Test may be required to predict some failures of core HDD functions (i.e. total HDD failures).
If the firmware does the same work during the Offline Self Test and a Short Self Test is run once per day, the daily test count merely rises by one. If the Short Self Test does something more, it could give a better prediction.
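To see what a Short Self Test reports on a given drive, it can be run by hand before any default is changed. A sketch, with /dev/sda as an example device:

```sh
# Start a Short Self Test (completes in a few minutes, runs in
# the background on the drive itself):
smartctl -t short /dev/sda

# Afterwards, inspect the self-test log and the overall verdict:
smartctl -l selftest /dev/sda
smartctl -H /dev/sda
```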
Long Self Test: The Long Self Test (almost certainly) performs a full surface scan. It typically takes several hours, sometimes many.
If it finds weak but still readable data, it silently relocates them (you only see a change in the S.M.A.R.T. statistics). If it finds an unreadable sector, it retries reading it for some time; if that fails, S.M.A.R.T. changes the overall status to FAILED and the error is reported to the user.
Benefits: Running the Long Self Test can prevent data loss in files that are never accessed, and if a loss does happen, it is detected early and reported.
I propose to run Long Self Test once a month.
Risks: There is a risk that the Long Self Test slows down I/O operations due to inferior firmware. But if the firmware is written in a smart way, any read or write request should pause the self test, which is then resumed after the drive has been idle for some time. Assuming well-written firmware, it should cause minimal delays.
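The whole proposal (daily short test, monthly long test) fits in a single smartd self-test schedule directive. A sketch of what the shipped smartd.conf could look like (the times of day are arbitrary examples):

```conf
# /etc/smartd.conf sketch -- not the current default.
#   -a : monitor all S.M.A.R.T. attributes
#   -s : self-test schedule regex in the form T/MM/DD/d/HH
#        S/../.././02  -> Short Self Test every day at 02:00
#        L/../01/./04  -> Long Self Test on the 1st of each month at 04:00
DEVICESCAN -a -s (S/../.././02|L/../01/./04)
```

smartd starts a scheduled test only if no other self test is already running on that drive, which keeps the two schedules from colliding.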
Self Tests resume after reboot.
Note that a disc in FAILED status due to an unreadable sector can still be "healed" by writing data to the failed place. Writing to the failed sector stops the read retrying (the data are then no longer lost, merely overwritten). An immediate relocation is performed, the pending unreadable sector count is decreased, and when it reaches zero, the next self test returns PASS.
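The "healing" flow above can be done by hand. A sketch, assuming /dev/sda and using a placeholder LBA; the hdparm write step destroys whatever was stored in that sector:

```sh
# How many sectors are currently pending relocation?
smartctl -A /dev/sda | grep -i pending

# The self-test log reports the LBA of the first failing sector:
smartctl -l selftest /dev/sda

# Overwrite the bad sector to force relocation. DESTROYS the
# sector's contents; 12345 below is a placeholder LBA:
hdparm --write-sector 12345 --yes-i-know-what-i-am-doing /dev/sda
```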
It would be nice if YaST had a similar module.

--
Cheers!
Roman