Re: [opensuse-kernel] Is this a bug or my disk is going to fail?

21 Jul 2014

      On Mon, Jul 21, 2014 at 10:18 AM, Marguerite Su <i@marguerite.su> wrote:
...
Hi,
A few days ago suddenly I can't mount /root because:
2014-07-17T03:01:17.500184+08:00 linux kernel: [16528.736353] ata1.00: failed command: WRITE FPDMA QUEUED
2014-07-17T03:01:17.841313+08:00 linux kernel: [16529.077363] end_request: I/O error, dev sda, sector 59764644
After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 84 ef 8f e3  Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       2847
But:
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
Google says it is the hardware problem, but after a reinstall, my
computer still works...and actually the disk is less than 2 years
old..
Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.
Can anyone help judge this? Do I have to replace the disk?
Marguerite

It is certainly not a bug, but it may also not be time to replace the drive.

You gave us too little data to even make an educated guess.

Use "smartctl -a /dev/sda" to see all the smart output and post it here.
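If you would rather pull the numbers out of that output with a script than read it by eye, something like the following could work. It is only a sketch: it assumes the usual ten-column ATA attribute table that smartctl prints, the names ROW, parse_attributes, read_attribute_table and SAMPLE are made up for this example, and the sample rows are the two attribute lines quoted above.

```python
import re
import subprocess

# One row of the ATA attribute table printed by `smartctl -A` / `smartctl -a`.
# Assumption: the usual ten-column layout; only the leading digits of
# RAW_VALUE are captured, so vendor-specific trailing text is ignored.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]{4})\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: {"id", "value", "worst", "thresh", "raw"}}."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group("name")] = {
                key: int(m.group(key))
                for key in ("id", "value", "worst", "thresh", "raw")
            }
    return attrs


def read_attribute_table(device="/dev/sda"):
    """Run `smartctl -A` (usually needs root) and return its text output.

    smartctl's exit status is a bitmask, so a non-zero status is not
    treated as a failure here.
    """
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return result.stdout


# The two attribute rows quoted in the original message.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       2847
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for name, fields in parse_attributes(SAMPLE).items():
        print(name, fields["raw"])
```

On a real system, parse_attributes(read_attribute_table()) would give the same kind of dictionary for the whole table.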

A couple of comments:

- Disks are cheap, and businesses often replace a disk the first time
a read error is reported all the way up the stack to userspace.
You've just had one of those.

The SMART data you are reporting is very manufacturer-specific, and
even disk-model-specific.  In general you have to record it
periodically and look for changes to see if anything bad is going on.
That said, I don't do that myself and neither do many other people.

- A raw read error may just mean that ECC correction had to be
invoked, so who cares.  My current laptop is reporting 226 million raw
read errors.  It also says that exactly the same number of ECC
correction attempts, 226 million, succeeded.

- My laptop is also reporting zero reallocations and zero
pending reallocations.  I'm not planning to replace it.  You've had
only 2847 raw read errors, so that number by itself looks small and
insignificant.

- My "reported Uncorrectable" raw value is zero, so not even a single
retry has been required yet.  That is surprising because a physical
bump of the laptop while it is reading data can cause an uncorrectable
error that a retry would likely solve.

- In case you don't know, many / most (but not all) disks only
re-allocate sectors on write.  Thus if you read a sector and the data
is not readable and the ECC correction also fails, many drives will
mark that sector as pending re-allocation.  It will stay in that state
until that specific sector is written with replacement data.  After
all, why should the drive re-allocate the sector if it doesn't know
what to put in the new sector?

- Thus if you want to know how many bad media sectors you currently
have, that should be reflected in the "pending reallocation" data.

- If you want to know how many bad sectors you used to have that
have since been fixed by reallocation, look at the re-allocated data.
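To make the "record it periodically and look for changes" idea concrete, here is a rough sketch that compares two snapshots of raw values and reports only the counters tied to bad sectors. The attribute names in WATCHED are ones smartctl commonly prints, but your drive may use different names or not report them at all. The function names and the numbers in the two example snapshots are invented for illustration, apart from the 2847 taken from the original message.

```python
# Attribute names that commonly track bad-sector state in smartctl output.
# Assumption: the drive exposes these names; many models differ.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
)


def media_changes(old, new, watched=WATCHED):
    """Return {name: (old_raw, new_raw)} for watched counters that changed."""
    changes = {}
    for name in watched:
        if name in old and name in new and old[name] != new[name]:
            changes[name] = (old[name], new[name])
    return changes


def describe(snapshot):
    """Summarise current and past bad sectors from one snapshot of raw values."""
    pending = snapshot.get("Current_Pending_Sector", 0)
    reallocated = snapshot.get("Reallocated_Sector_Ct", 0)
    return (
        f"{pending} sector(s) pending reallocation, "
        f"{reallocated} already reallocated"
    )


# Hypothetical snapshots: attribute name -> raw value.
last_week = {
    "Raw_Read_Error_Rate": 2800,
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 0,
}
today = {
    "Raw_Read_Error_Rate": 2847,
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 8,
}

if __name__ == "__main__":
    print(media_changes(last_week, today))
    print(describe(today))
```

Raw_Read_Error_Rate is deliberately left out of WATCHED, since a growing raw read error count on its own may just mean ECC is doing its job.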

As I understand it, some drives will monitor those "pending
re-allocation" sectors and if you ever successfully read data from
them, they will go ahead and allocate a new sector and write the good
data there.  I know for certain there have been reports of read-only
actions triggering re-allocations.  I don't know on which makes/models
of drives this has been observed.

Note that the drive itself will NOT proactively monitor the sectors
for issues, nor will it periodically check pending bad sectors to see
if it can get valid data.

Thus some RAID setups have the ability to perform background media
scans and, if a media error is reported, get the valid data from an
alternate drive and write it back to the drive reporting the bad
sector.  NetApp (among others) calls this scrubbing the array.
(https://library.netapp.com/ecmdocs/ECMP1196912/html/GUID-81F8BEA3-ADC1-4790-...)
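The scrubbing idea can be shown with a toy model. This is not how NetApp or any real RAID implementation does it; ToyDisk and scrub are made-up names, the "disks" are just Python lists, and a write is assumed to clear the bad sector the way a re-allocate-on-write drive would.

```python
class ToyDisk:
    """In-memory stand-in for a drive: bad sectors are unreadable until rewritten."""

    def __init__(self, name, sectors, bad=()):
        self.name = name
        self.sectors = list(sectors)
        self.bad = set(bad)

    def read(self, lba):
        if lba in self.bad:
            raise IOError(f"{self.name}: media error at LBA {lba}")
        return self.sectors[lba]

    def write(self, lba, data):
        self.sectors[lba] = data
        # Models the drive re-allocating the sector when it is written.
        self.bad.discard(lba)


def scrub(mirror_a, mirror_b):
    """Read every sector of a two-way mirror; repair unreadable ones from the peer.

    Returns a list of (disk_name, lba) repairs.  If both copies of a sector
    are unreadable, the IOError from the peer read propagates to the caller.
    """
    repaired = []
    for lba in range(len(mirror_a.sectors)):
        for disk, peer in ((mirror_a, mirror_b), (mirror_b, mirror_a)):
            try:
                disk.read(lba)
            except IOError:
                disk.write(lba, peer.read(lba))
                repaired.append((disk.name, lba))
    return repaired


if __name__ == "__main__":
    a = ToyDisk("a", [b"x", b"y", b"z"], bad={1})
    b = ToyDisk("b", [b"x", b"y", b"z"])
    print(scrub(a, b))
```

The point of the model is only that the repair is driven by the array, which knows where the good copy is, and not by the drive itself.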

Greg

Greg Freemyer