On March 15, 2015 5:50:58 AM EDT, Felix Miata wrote:
I have a multiboot system that has long delays after selecting Grub menu items. The Linux-bzImage messages appear, and should be shortly followed by the initrd message, but that's where the delay occurs. It can be as long as 3 minutes. It only happens with some installations and not others. All of them are normally started from the same Grub on sda3, but can also be chainloaded to, which doesn't eliminate the delays. Even after redoing Grub setup for 13.2, chainloading to its Grub triggers a reboot.
This HD spent a bit over 3 years in a STB, after which I replaced it, which I did because the STB was stalling inexplicably. After the replacement, I used the WD utility to evaluate the device. It found bad sectors, and claimed that it corrected them and that the device passed.
I umounted all partitions except / while booted to TW, and ran e2fsck -f on each installation's /. All passed, except the last tried (Kubuntu), which resulted in "error reading block 34307 (attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 10863. Ignore error<y>?" This repeated for blocks 34308, 34309, 590029, 590030, 688204 and several more, followed by directory corruption for inodes 68166, 98879 & yada. Most of these partitions are ext4. 11.4, 12.1 & 12.2 are ext3.
Since it "passed" it has spent at least 20 hours in a PC, after I cloned from its original HD to the "passed" HD and switched them. NAICT, smartctl thinks this HD is OK, but is it really? I can't think of any other explanation for boot delays like this except for trouble reading sectors containing kernels or initrds - installing a new kernel/initrd can switch an installation from suffering the delays to one not, and vice versa.
I asked about this on the linux-ide mailing list 2 weeks ago, but got no feedback there: http://marc.info/?l=linux-ide&m=142517892627960&w=2
The problem has me frustrated in that I cannot remember whether it started before or after the HD switch. If it started before, obviously trouble is not with the "passed" HD.
This is a test system, so it doesn't really have anything "important" on it, other than the time invested to reconfigure on account of device IDs being different (e.g. /boot/grub/device.map).
# smartctl -t long /dev/sda
(more than 2 hours later)
# smartctl -H /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
# smartctl -l selftest /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure     90%          27071       252915937
# 2  Extended offline    Completed: read failure     90%          27070       252915955
# 3  Conveyance offline  Completed without error     00%          27051       -
# 4  Extended offline    Completed: read failure     90%          27046       460886821
# 5  Short offline       Completed without error     00%              3       -
Read failures at roughly 125 GB and 230 GB.
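The rough positions quoted above come from multiplying each LBA_of_first_error by the sector size. A minimal sketch of that arithmetic, assuming 512-byte logical sectors (the thread does not state the sector size) and decimal gigabytes; `lba_to_gb` is a hypothetical helper name, not part of any tool:

```shell
# Convert a logical block address (LBA) to a whole number of decimal GB.
# Assumption: 512-byte logical sectors unless a second argument is given.
lba_to_gb() {
    lba=$1
    sector_size=${2:-512}
    # bytes = LBA * sector size; integer division truncates to whole GB
    echo $(( lba * sector_size / 1000000000 ))
}
```

With the LBAs from the self-test log, `lba_to_gb 252915937` prints 129 and `lba_to_gb 460886821` prints 235, which is in line with the rough figures given.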
# smartctl -A /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3210
  3 Spin_Up_Time            0x0027   156   153   021    Pre-fail  Always       -       3191
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       869
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27074
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       862
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       869
194 Temperature_Celsius     0x0022   112   073   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   195   173   000    Old_age   Always       -       306
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
smartctl -x data: http://fm.no-ip.com/Tmp/Hardware/Disk/smart-wdAVGP3200AVVS.txt
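The attribute worth watching in the table above is 197 (Current_Pending_Sector). A small sketch for pulling its raw value out of `smartctl -A` output, assuming the ten-column attribute layout shown in this thread; `smart_raw_value` is a hypothetical helper name:

```shell
# Print the RAW_VALUE column of one SMART attribute.
# Reads `smartctl -A` output on stdin; $1 is the attribute ID (e.g. 197).
# Assumption: attribute rows have at least ten whitespace-separated
# columns with RAW_VALUE starting in the tenth.
smart_raw_value() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}
```

Used as `smartctl -A /dev/sda | smart_raw_value 197`, this would print 306 for the output quoted here. Re-running it after a write pass is one way to see whether the pending count has dropped.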
The tests are finding media errors. The pending bad sector count is 306. On a lot of drives, actual sector reallocation only happens on write, so those bad sectors will stay bad until you write to them.

Personally I'd toss that drive. If you want to salvage it, boot a live CD/DVD and run a data-destroying sequence like:

dd if=/dev/zero of=/dev/sda conv=noerror bs=4k

That should force all pending sector reallocations to happen (of course, it wipes your drive in the process).

dd if=/dev/sda of=/dev/null conv=noerror bs=4k

Consider that a diagnostic; it will report the number of failed sectors in the output. If that reports any errors on the second dd, think very hard about tossing the drive.

If you are still thinking about keeping it, then use shred to exercise the disk even more:

shred --verbose -n1 /dev/sda

That will write a single pass of pseudo-random data to the drive. Then repeat the above dd read of the entire drive. If the dd reports any read errors, repeat the shred/dd pair until you get at least one totally clean run.

Greg
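The shred/dd loop described above could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes GNU dd, whose diagnostics contain the text "error reading" in the C locale, and it counts those lines instead of trusting dd's exit status, since dd may still exit 0 when conv=noerror lets it skip past a bad block. The function names and the round limit are hypothetical. `shred_until_clean` destroys all data on the target.

```shell
# Read-only pass over a device or file; prints how many read errors
# dd logged. Assumption: GNU dd messages in the C locale.
count_read_errors() {
    # dd sends its diagnostics and summary to stderr; capture only that
    log=$(LC_ALL=C dd if="$1" of=/dev/null bs=4k conv=noerror 2>&1)
    printf '%s\n' "$log" | awk '/error reading/ { n++ } END { print n + 0 }'
}

# DESTRUCTIVE: one shred pass, then a full read, repeated until a read
# pass logs no errors or max_rounds (hypothetical limit, default 3) is hit.
# Returns 0 on a clean pass, 1 if no pass was clean, 2 if shred failed.
shred_until_clean() {
    dev=$1
    max_rounds=${2:-3}
    round=1
    while [ "$round" -le "$max_rounds" ]; do
        shred --verbose -n1 "$dev" || return 2
        errors=$(count_read_errors "$dev")
        echo "round $round: $errors read error(s)"
        [ "$errors" -eq 0 ] && return 0
        round=$(( round + 1 ))
    done
    return 1
}
```

Nothing runs until a function is called, for example `shred_until_clean /dev/sda 5` from a live system. A return status of 1 means no clean pass was reached within the limit, which by the reasoning above is a strong hint to retire the drive.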