New subject: [opensuse] blocksize, 4k, CHS */255/63 (was: smartctl: is this HD really OK?)

15 Mar 2015

I have a multiboot system with long delays after selecting Grub menu items.
The Linux-bzImage messages appear and should be followed shortly by the initrd
message, but that is where the delay occurs. It can be as long as 3 minutes.
It happens with only some installations. All of them are normally started
from the same Grub on sda3, but each can also be chainloaded to, which
doesn't eliminate the delays. Even after redoing Grub setup for 13.2,
chainloading to its Grub triggers a reboot.

This HD spent a bit over 3 years in a STB. I replaced it because the STB was
stalling inexplicably. After the replacement, I used the WD utility to
evaluate the device. It found bad sectors, claimed it corrected them, and
reported that the device passed.

I umounted all partitions except / while booted to TW, and ran e2fsck -f on
each installation's /. All passed, except the last tried (Kubuntu), which
resulted in "error reading block 34307 (attempt to read block from filesystem
resulted in short read) while reading indirect blocks of inode 10863. Ignore
error<y>?" This repeated for blocks 34308, 34309, 590029, 590030, 688204 and
several more, followed by directory corruption for inodes 68166, 98879 & yada.
Most of these partitions are ext4. 11.4, 12.1 & 12.2 are ext3.
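
To check whether the filesystem blocks e2fsck complained about line up with
sectors the drive itself reports as unreadable, the block numbers can be
converted to absolute LBAs. What follows is only a minimal sketch. It assumes
4096-byte filesystem blocks, 512-byte logical sectors, and a hypothetical
partition start of sector 2048. None of these values appear in this post; the
real ones would have to come from something like tune2fs -l and fdisk -l.

```python
def fs_block_to_lba(block, part_start_lba, fs_block_size=4096, sector_size=512):
    """Return the (first, last) absolute LBA covered by one filesystem block.

    part_start_lba, fs_block_size and sector_size are assumptions to be
    replaced with the values of the real partition and filesystem.
    """
    sectors_per_block = fs_block_size // sector_size
    first = part_start_lba + block * sectors_per_block
    last = first + sectors_per_block - 1
    return first, last


# Blocks reported by e2fsck above; 2048 is a hypothetical partition start.
HYPOTHETICAL_PART_START = 2048
for blk in (34307, 34308, 34309, 590029, 590030, 688204):
    print(blk, fs_block_to_lba(blk, HYPOTHETICAL_PART_START))
```

If the resulting ranges contain any LBA from the self-test log, the e2fsck
errors and the drive's own read failures are probably the same bad spots.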

Since it "passed", it has spent at least 20 hours in a PC: I cloned the PC's
original HD to the "passed" HD and swapped them. NAICT, smartctl thinks
this HD is OK, but is it really? I can't think of any other explanation for
boot delays like this except for trouble reading sectors containing kernels
or initrds: installing a new kernel/initrd can turn an installation that
suffers the delays into one that does not, and vice versa.

I asked about this on the linux-ide mailing list 2 weeks ago, but got no
feedback there: http://marc.info/?l=linux-ide&m=142517892627960&w=2

The problem has me frustrated in that I cannot remember whether it started
before or after the HD switch. If it started before, obviously the trouble is
not with the "passed" HD.

This is a test system, so it doesn't really have anything "important" on it,
other than the time invested to reconfigure on account of device IDs being
different (e.g. /boot/grub/device.map).

# smartctl -t long /dev/sda

(more than 2 hours later)

# smartctl -H /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

# smartctl -l selftest /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27071         252915937
# 2  Extended offline    Completed: read failure       90%     27070         252915955
# 3  Conveyance offline  Completed without error       00%     27051         -
# 4  Extended offline    Completed: read failure       90%     27046         460886821
# 5  Short offline       Completed without error       00%         3         -
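
The failing LBAs can be pulled out of a self-test log like the one above for
comparison with other data. This is a minimal sketch that assumes the column
layout shown here (LBA_of_first_error as the last field, "-" when the test
found no error). The function name is made up for illustration, and other
smartctl versions or drives may format the log differently.

```python
SELFTEST_LOG = """\
# 1  Extended offline    Completed: read failure       90%     27071         252915937
# 2  Extended offline    Completed: read failure       90%     27070         252915955
# 3  Conveyance offline  Completed without error       00%     27051         -
# 4  Extended offline    Completed: read failure       90%     27046         460886821
# 5  Short offline       Completed without error       00%         3         -
"""


def failing_lbas(log_text):
    """Collect LBA_of_first_error from every self-test row that has one."""
    lbas = []
    for line in log_text.splitlines():
        if not line.startswith("#"):
            continue  # skip headers and blank lines
        # Split from the right: remaining%, lifetime hours, LBA.
        fields = line.rsplit(None, 3)
        if len(fields) == 4 and fields[3].isdigit():
            lbas.append(int(fields[3]))
    return lbas


print(failing_lbas(SELFTEST_LOG))
```

Each extended test stops at the first unreadable sector it meets, so the list
shows only where each run gave up, not every bad sector on the disk.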

# smartctl -A /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3210
  3 Spin_Up_Time            0x0027   156   153   021    Pre-fail  Always       -       3191
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       869
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27074
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       862
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       869
194 Temperature_Celsius     0x0022   112   073   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   195   173   000    Old_age   Always       -       306
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
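
The overall-health line and the attribute table can tell different stories,
so it may help to look at the raw sector counters directly. This is a minimal
sketch that assumes the ten-column layout shown above. Watching attributes 5,
196, 197 and 198 is an assumption about which counters matter for bad
sectors, not a rule taken from smartctl, and the function name is made up for
illustration.

```python
ATTRIBUTES = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   195   173   000    Old_age   Always       -       306
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

# Assumed set of sector-related attribute IDs to watch.
WATCHED_IDS = {5, 196, 197, 198}


def nonzero_watched(table_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # not an attribute row
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


print(nonzero_watched(ATTRIBUTES))  # {'Current_Pending_Sector': 306}
```

With the values above, only Current_Pending_Sector is nonzero, which matches
the read failures in the self-test log even though the overall assessment says
PASSED.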

smartctl -x data:
http://fm.no-ip.com/Tmp/Hardware/Disk/smart-wdAVGP3200AVVS.txt
-- 
"The wise are known for their understanding, and pleasant
words are persuasive." Proverbs 16:21 (New Living Translation)

 Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!

Felix Miata  ***  http://fm.no-ip.com/