New subject: [opensuse-factory] Re: smartmontools: Proposal to enable regular HDD Self Tests

4 Mar 2015

Hello.

Many years ago we decided that smartd will not enable self tests by default.

Back then, there were discs with firmware problems, and self tests 
sometimes caused delays or even freezes. (In those days, smartd was even 
disabled by default for the same reason.) A possible shortening of disc 
lifetime by self tests was also discussed.

Over the years, hardware changed a lot. Only two HDD manufacturers 
remain, and no firmware crashes/freezes/delays caused by S.M.A.R.T. were 
reported for nearly 10 years.

With the rising density of discs, the importance of regular checks rises 
as well. Detecting weak data early can prevent data loss before it 
occurs.

That is why I think it is time to re-evaluate the old decision and 
consider enabling regular Self Tests by default.

Nowadays, without running any self tests: If we don't run any self 
tests, there is no guarantee (i.e. the specification does not require 
it) that S.M.A.R.T. is monitored at all. However, it seems that most 
drives perform an Offline Self Test every 4 hours (see item "Automatic 
Offline Testing") and update core parameters.

smartd is able to predict some types of failures that are related to 
inferior operation, and possibly failing mechanical parts (permanent 
seek errors, crash-landing of a head, degradation of a head) or failing 
electronics.

But it is not able to predict a failing disc surface.

Monitoring of vital data has been enabled in openSUSE since 2010. It is 
done twice an hour. No crash triggered by smartd on faulty firmware has 
been reported since 2005.

I propose to keep the check frequency of vital S.M.A.R.T. data at this 
default. I also propose that smartd not run Offline tests, as most 
(all?) discs run them on their own every 4 hours, and that it run Short 
Self Tests instead.

Short Self Test: The Short Self Test verifies the status of the 
hardware functions. It takes several minutes.

I propose to run Short Self Test once a day.

Benefits: There is no guarantee that the Offline Self Test covers all 
tests of the Short Self Test, or that the Offline Self Test is even 
called regularly by the firmware. Depending on the firmware 
implementation, the Short Self Test may be required to predict some 
failures of core functions of the HDD (i.e. total HDD failures).

If the firmware does the same during the Offline Self Test (every 4 
hours, i.e. 6 times a day) and the Short Self Test is enabled once per 
day, the number of tests rises from 6 daily to 7. If the Short Self Test 
does something more, it could give a better prediction.

Long Self Test: The Long Self Test (almost certainly) performs a full 
surface scan. It typically takes several to many hours.

If it finds weak but still readable data, it silently relocates them 
(you only see a change in the S.M.A.R.T. statistics). If it finds an 
unreadable sector, it retries reading it for some time, and if that 
fails, S.M.A.R.T. changes the overall status to FAILED and the error is 
reported to the user.
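
As an illustration of how the overall status could be picked up by a 
script, here is a minimal sketch. It assumes the text report contains a 
line of the form "SMART overall-health self-assessment test result: ..." 
(as printed by `smartctl -H` for ATA discs); the helper name 
overall_health() and the sample text are hypothetical and not part of 
smartmontools.

```python
import re
from typing import Optional

# Assumption: the health line looks like the one smartctl -H prints for ATA discs.
HEALTH_RE = re.compile(r"overall-health self-assessment test result:\s*(\w+)")


def overall_health(report: str) -> Optional[str]:
    """Return the status word (e.g. 'PASSED' or 'FAILED'), or None if absent."""
    match = HEALTH_RE.search(report)
    return match.group(1) if match else None


# Illustrative sample text, not captured from a real disc.
SAMPLE = "SMART overall-health self-assessment test result: FAILED!\n"

if __name__ == "__main__":
    print(overall_health(SAMPLE))
```

On a real system the report text would have to come from running 
smartctl against a device; that step is left out here on purpose.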

Benefits: Running the Long Self Test can prevent data loss in files that 
are not accessed, and if data loss does happen, it is detected early and 
reported.

I propose to run Long Self Test once a month.
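
The daily Short and monthly Long schedule could be expressed with the 
smartd.conf "-s" directive, which matches a regular expression against 
the 12-character string T/MM/DD/d/HH. The sketch below imitates that 
matching in Python to show which hours would trigger a test. The hours 
(02 and 03), the choice of the 1st of the month and the helper names are 
my assumptions, and Python's "re" module only approximates the extended 
regular expressions smartd uses.

```python
import re
from datetime import datetime

# Assumed schedule: Short test every day at 02:00,
# Long test on the 1st of each month at 03:00.
SCHEDULE = r"(S/../.././02|L/../01/./03)"


def schedule_key(test_type: str, when: datetime) -> str:
    """Build the T/MM/DD/d/HH string (d: 1 = Monday ... 7 = Sunday)."""
    return f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"


def is_due(test_type: str, when: datetime, schedule: str = SCHEDULE) -> bool:
    """True if all 12 characters of the key match the schedule expression."""
    return re.fullmatch(schedule, schedule_key(test_type, when)) is not None


def smartd_conf_line(schedule: str = SCHEDULE) -> str:
    """Compose a possible smartd.conf line for all detected devices."""
    return f"DEVICESCAN -a -s {schedule}"


if __name__ == "__main__":
    print(smartd_conf_line())
    print(is_due("S", datetime(2015, 3, 4, 2, 30)))  # daily Short test hour
    print(is_due("L", datetime(2015, 3, 1, 3, 0)))   # monthly Long test hour
```

Putting the two tests into different hours keeps them from being 
requested at the same time.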

Risks: There is a risk that the Long Self Test slows down I/O 
operations on discs with inferior firmware. But if the firmware is 
written in a smart way, any read or write request should pause the Self 
Test, and the test should resume after the disc has been idle for some 
time. With well-written firmware, it should cause minimal delays.

Self Tests resume after reboot.

Note that a disc in status FAILED due to an unreadable sector can still 
be "healed" by writing data to the failed location. Writing data there 
stops the read retries (the old data are then not lost, but 
overwritten). The sector is relocated immediately, the pending 
unreadable sector count is decreased, and once it reaches zero, the next 
self test returns PASS.
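
To make the "healing" step concrete, here is a minimal sketch of 
overwriting one logical sector with zeros. It deliberately works on a 
scratch image file, not on a real disc. The sector size of 512 bytes, 
the LBA value and the function names are assumptions for illustration 
only; on a real disc the LBA would have to be taken from the self-test 
log, and whatever the sector contained is destroyed.

```python
import os
import tempfile


def overwrite_sector(path: str, lba: int, sector_size: int = 512) -> int:
    """Overwrite one logical sector with zeros; return the byte offset written."""
    offset = lba * sector_size
    with open(path, "r+b") as target:
        target.seek(offset)
        target.write(bytes(sector_size))  # zero-filled replacement data
        target.flush()
        os.fsync(target.fileno())         # make sure the write reaches the medium
    return offset


def demo() -> bytes:
    """Exercise overwrite_sector() on a temporary 4096-byte image file."""
    with tempfile.NamedTemporaryFile(delete=False) as image:
        image.write(b"\xff" * 4096)
        name = image.name
    try:
        overwrite_sector(name, lba=3)     # hypothetical failed sector
        with open(name, "rb") as image:
            return image.read()
    finally:
        os.unlink(name)


if __name__ == "__main__":
    data = demo()
    print(data[1536:2048] == bytes(512))
```

Only the one sector is touched; the rest of the image stays as it was.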

-- 
Best Regards / S pozdravem,

Stanislav Brabec
software developer
---------------------------------------------------------------------
SUSE LINUX, s. r. o.                          e-mail: sbrabec@suse.cz
Lihovarská 1060/12                            tel: +49 911 7405384547
190 00 Praha 9                                 fax:  +420 284 084 001
Czech Republic                                    http://www.suse.cz/
PGP: 830B 40D5 9E05 35D8 5E27 6FA3 717C 209F A04F CD76