Re: [opensuse] smartctl: is this HD really OK?

15 Mar 2015

On March 15, 2015 5:50:58 AM EDT, Felix Miata wrote:
...
I have a multiboot system that has long delays at selecting Grub menu items. The Linux-bzImage messages appear, and should be shortly followed by the initrd message, but that's where the delay occurs. It can be as long as 3 minutes. It only happens with some installations and not others, all of which are normally being started from the same Grub on sda3, but can be chainloaded to, which doesn't eliminate the delays. Even after redoing Grub setup for 13.2, chainloading to its Grub triggers a reboot.
This HD spent a bit over 3 years in a STB, after which I replaced it, which I did because the STB was stalling inexplicably. After the replacement, I used the WD utility to evaluate the device. It found bad sectors, and claimed it corrected them, that the device passed.
I umounted all partitions except / while booted to TW, and ran e2fsck -f on each installations /. All passed, except the last tried (Kubuntu), which resulted in "error reading block 34307 (attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 10863. Ignore error<y>?" This repeated for blocks 34308, 34309, 590029, 590030, 688204 and several more, followed by directory corruption for inodes 68166, 98879 & yada.
Most of these partitions are ext4. 11.4, 12.1 & 12.2 are ext3.
Since it "passed" it has spent at least 20 hours in a PC, after I cloned from its original HD to the "passed" HD and switched them. NAICT, smartctl thinks this HD is OK, but is it really? I can't think of any other explanation for boot delays like this except for trouble reading sectors containing kernels or initrds - installing a new kernel/initrd can switch an installation from suffering the delays to one not, and vice versa.
I asked about this on the linux-ide mailing list 2 weeks ago, but got no feedback there: http://marc.info/?l=linux-ide&m=142517892627960&w=2
The problem has me frustrated in that I cannot remember whether it started before or after the HD switch. If it started before, obviously trouble is not with the "passed" HD.
This is a test system, so it doesn't really have anything "important" on it, other than the time invested to reconfigure on account of device IDs being different (e.g. /boot/grub/device.map).
# smartctl -t long /dev/sda
(more than 2 hours later)
# smartctl -H /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
# smartctl -l selftest /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27071         252915937
# 2  Extended offline    Completed: read failure       90%     27070         252915955
# 3  Conveyance offline  Completed without error       00%     27051         -
# 4  Extended offline    Completed: read failure       90%     27046         460886821
# 5  Short offline       Completed without error       00%         3         -
Read failures at roughly 125 GB and 230 GB.
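
As a rough check of those positions, the LBA values in the self-test log can be converted to byte offsets. The sketch below assumes 512-byte logical sectors, which the output above does not state, and uses a hypothetical helper name, lba_to_gb. It reports whole decimal gigabytes (10^9 bytes).

```shell
# Hypothetical helper: convert an LBA to a whole-GB offset (10^9 bytes).
# Assumes 512-byte logical sectors; change sector_size if the drive differs.
lba_to_gb() {
    sector_size=512
    echo $(( $1 * sector_size / 1000000000 ))
}

lba_to_gb 252915937   # LBA_of_first_error from test #1, prints 129
lba_to_gb 460886821   # LBA_of_first_error from test #4, prints 235
```

Under that assumption the failures sit near 129 GB and 235 GB, in line with the rough figures above.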
...
# smartctl -A /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3210
  3 Spin_Up_Time            0x0027   156   153   021    Pre-fail  Always       -       3191
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       869
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27074
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       862
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       869
194 Temperature_Celsius     0x0022   112   073   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   195   173   000    Old_age   Always       -       306
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
smartctl -x data:
http://fm.no-ip.com/Tmp/Hardware/Disk/smart-wdAVGP3200AVVS.txt
The tests are finding media errors.  The pending bad sector count (attribute 197, Current_Pending_Sector) is 306.
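
To watch that number without reading the whole table, something like the following could pull it out of the smartctl -A output. It is only a sketch: pending_sectors is a hypothetical helper name, and it assumes the raw value is the last field on the line, as it is in the output above.

```shell
# Hypothetical helper: print the raw value of attribute 197
# (Current_Pending_Sector) from `smartctl -A` output read on stdin.
# Assumes the raw value is the last whitespace-separated field.
pending_sectors() {
    awk '$1 == 197 { print $NF }'
}

# Usage on a live system (needs root):
#   smartctl -A /dev/sda | pending_sectors
```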

For a lot of drives actual sector reallocation only happens on write, so those bad sectors will stay bad until you write to them.

Personally I'd toss that drive.  If you want to salvage it, boot a live CD / DVD and run a data-destroying sequence like:

dd if=/dev/zero of=/dev/sda conv=noerror bs=4k

That should force all pending sector reallocations to happen. (Of course, it wipes your drive in the process.) Then read the whole drive back:

dd if=/dev/sda of=/dev/null conv=noerror bs=4k

Consider that a diagnostic; it will report any failed reads in its output.  If that second dd reports any errors, think very hard about tossing the drive.
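
If the dd output is long, the error lines can be counted rather than read by eye. This is a sketch that assumes GNU dd's English message wording ("error reading"), so run dd with LC_ALL=C; count_dd_read_errors is a hypothetical helper name. With bs=4k each message corresponds to one failed 4 KiB read, not necessarily one sector.

```shell
# Hypothetical helper: count read-error messages in dd's stderr, read on stdin.
# Assumes GNU dd's English wording ("error reading").
count_dd_read_errors() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'error reading' || true
}

# Usage (read-only, but needs root):
#   LC_ALL=C dd if=/dev/sda of=/dev/null conv=noerror bs=4k 2>&1 | count_dd_read_errors
```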

If you are still thinking about keeping it, then use shred to exercise the disk even more.

shred --verbose -n1 /dev/sda

That will write a single pass of pseudorandom data to the drive.

Then repeat the above dd read of the entire drive.

If the dd reports any read errors, repeat the shred/dd pair until you get at least one totally clean run.
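
The shred/dd cycle could be scripted roughly as below. Treat it as a sketch, not a tested recovery tool: scrub_until_clean and its max_passes limit are hypothetical additions, it assumes GNU dd's English "error reading" message, and it DESTROYS everything on the target.

```shell
# Hypothetical helper: repeat the shred/dd pair until one read pass is clean.
# WARNING: destroys all data on the target.
# $1 = target device, $2 = assumed safety limit on passes (default 5).
scrub_until_clean() {
    target=$1
    max_passes=${2:-5}
    pass=1
    while [ "$pass" -le "$max_passes" ]; do
        # One pass of pseudorandom data over the whole target.
        shred --verbose -n1 "$target" || return 2
        # Full read-back; count dd's read-error messages.
        errors=$(LC_ALL=C dd if="$target" of=/dev/null conv=noerror bs=4k 2>&1 \
                 | grep -c 'error reading' || true)
        if [ "$errors" -eq 0 ]; then
            echo "pass $pass: clean read"
            return 0
        fi
        echo "pass $pass: $errors read errors"
        pass=$((pass + 1))
    done
    return 1   # never got a clean run within max_passes
}

# Usage from a live system (needs root): scrub_until_clean /dev/sda
```

A non-zero return after max_passes means no clean run was achieved, which is a strong hint to toss the drive.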

Greg

greg.freemyer＠gmail.com