[opensuse-kernel] Is this a bug or my disk is going to fail?
Hi,
A few days ago I suddenly could no longer mount /root because:
2014-07-17T03:01:17.500184+08:00 linux kernel: [16528.736353] ata1.00: failed command: WRITE FPDMA QUEUED
2014-07-17T03:01:17.841313+08:00 linux kernel: [16529.077363] end_request: I/O error, dev sda, sector 59764644
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3  Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       2847
But:
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
Google says it is a hardware problem, but after a reinstall my computer still works, and the disk is less than 2 years old:
Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
Can anyone help judge this? Do I have to replace the disk?
Marguerite
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
I don't have a "here is the answer" answer to your question, but let me say that I have had HDDs fail on me in less than 24 hours, others fail anywhere between 24 hours and 3 years, and some that are still working fine after 15 years.
"Youse buys your ticket and youse take your chances".
You don't mention what brand of HDD you have, but the manufacturer will usually have an app that checks the drive for possible failures.
And, of course, if you have 'smartmontools' installed, it will do the same.
BC
--
Using openSUSE 13.1, KDE 4.13.3 & kernel 3.15.6-1 on a system with- AMD FX 8-core 3.6/4.2GHz processor 16GB PC14900/1866MHz Quad Channel RAM Gigabyte AMD3+ m/board; Gigabyte nVidia GTX660 GPU
From the printouts above, smartctl is available, and a long disk surface test should be run soon. That will test the disk surfaces, the read/write heads, and the circuitry that handles reading and writing. The parts that get minimal testing are the interface with the disk controller, the cable to the controller, and the adapter in your computer. My sense is that the error message arises from communication between the adapter and the computer.
The manufacturer's app will likely test the entire system. If this system is a desktop, an easy thing to do is to replace the cable. On a laptop, that is generally not possible. In any case, I would be sure to have backups and a spare drive.
Larry
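As a rough sketch of how the long surface test mentioned above could be started, the following Python helper only builds the smartctl command lines and prints them; it does not touch the disk. The device path /dev/sda is taken from the kernel log in the first message. The helper name selftest_commands and the optional selective test over the 8 sectors reported in the error are assumptions made for illustration, not something proposed in the thread.

```python
def selftest_commands(device, lba_range=None):
    """Return smartctl argument lists: start a self-test, then show the log.

    Without lba_range an extended ("long") test is requested. With an
    (first, last) tuple a selective test over just that LBA span is
    requested instead, since starting a second test would abort the first.
    """
    if lba_range is None:
        start = ["smartctl", "-t", "long", device]
    else:
        first, last = lba_range
        start = ["smartctl", "-t", f"select,{first}-{last}", device]
    # The self-test log is where the result shows up once the test finishes.
    show_log = ["smartctl", "-l", "selftest", device]
    return [start, show_log]


if __name__ == "__main__":
    # Full surface test, as suggested in the message above.
    for cmd in selftest_commands("/dev/sda"):
        print(" ".join(cmd))
    # Illustrative alternative: only the 8 sectors named in the UNC error.
    for cmd in selftest_commands("/dev/sda", (59764612, 59764619)):
        print(" ".join(cmd))
```

The commands need root privileges when run by hand, and the test runs inside the drive, so the log has to be checked again after the test has had time to finish.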
On Mon, Jul 21, 2014 at 12:02 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
If this system is a desktop, an easy thing to do is to replace the cable. On a laptop, that is generally not possible. In any case, I would be sure to have backups and a spare drive.
I agree with Larry: my first diagnostic step when I see errors being reported is to replace the SATA cable. If that fixes the problem, just toss the old cable. If it doesn't, that is when I start diving into the "smartctl -a" data.
Greg
--
Greg Freemyer
Hi, Greg,
Unfortunately it's my laptop...
And here's smartctl information:
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-11-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint M8 (AF)
Device Model:     SAMSUNG HN-M101MBB
Serial Number:    S2R8J9BB808817
LU WWN Device Id: 5 0024e9 205e8f61c
Firmware Version: 2AR10001
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Tue Jul 22 02:09:58 2014 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (13320) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 222) minutes.
SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       2850
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   089   089   025    Pre-fail  Always       -       3466
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2161
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12867
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       135
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2242
181 Program_Fail_Cnt_Total  0x0022   080   080   000    Old_age   Always       -       452013413
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       820
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   047   000    Old_age   Always       -       36 (Min/Max 17/53)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       463
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       135
225 Load_Cycle_Count        0x0032   032   032   000    Old_age   Always       -       691031
SMART Error Log Version: 1
Warning: ATA error count 19579 inconsistent with error log pointer 5
ATA Error Count: 19579 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3  Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 08 84 ef 8f e3 00      00:00:00.203  READ DMA
ef 10 02 00 00 00 a0 00      00:00:00.203  SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 e0 00      00:00:00.203  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 a0 00      00:00:00.203  IDENTIFY DEVICE
ef 02 00 00 00 00 a0 00      00:00:00.203  SET FEATURES [Enable write cache]
Error 19578 occurred at disk power-on lifetime: 12725 hours (530 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3  Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 08 84 ef 8f e3 00      00:00:00.233  READ DMA
c8 00 08 84 ef 93 e0 00      00:00:00.233  READ DMA
c8 00 08 ac ef 93 e0 00      00:00:00.233  READ DMA
c8 00 08 cc ef 93 e0 00      00:00:00.233  READ DMA
c8 00 08 44 fa 93 e0 00      00:00:00.233  READ DMA
Error 19577 occurred at disk power-on lifetime: 12680 hours (528 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3  Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 08 84 ef 8f e3 00      00:00:35.602  READ DMA
25 00 10 15 ed 7a e0 00      00:00:35.602  READ DMA EXT
25 00 10 f5 9c 76 e0 00      00:00:35.602  READ DMA EXT
25 00 10 05 c7 7a e0 00      00:00:35.602  READ DMA EXT
25 00 10 f5 99 76 e0 00      00:00:35.602  READ DMA EXT
Error 19576 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3  Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 08 84 ef 8f e3 00      00:00:29.969  READ DMA
c8 00 50 d4 d9 6f e0 00      00:00:29.969  READ DMA
c8 00 30 3c e0 6f e0 00      00:00:29.969  READ DMA
c8 00 88 b4 df 6f e0 00      00:00:29.969  READ DMA
c8 00 e8 14 d7 6f e0 00      00:00:29.969  READ DMA
Error 19575 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 c8 84 ef 8f e3  Error: UNC 200 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 f8 54 ef 8f e3 00      00:00:29.968  READ DMA
c8 00 08 4c ef 8f e3 00      00:00:29.968  READ DMA
c8 00 08 44 ef 8f e3 00      00:00:29.968  READ DMA
c8 00 20 44 ca 6f e0 00      00:00:29.968  READ DMA
c8 00 b0 b4 be ac e3 00      00:00:29.968  READ DMA
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
At first I didn't treat it as a hardware problem. I treated it as a software problem, because I couldn't believe that my disk was all of a sudden producing "end_request" errors for about 10 sectors within a very narrow time window (just a few minutes) while I can still use my openSUSE without any major problem (a hard shutdown results in a slow boot, suspend to RAM/disk still looks fine, and daily operation is good too).
Marguerite
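To make the attribute table above easier to scan, here is a minimal Python sketch that splits "smartctl -A" style rows into fields and picks out the reallocation- and CRC-related attributes. The choice of IDs (5 and 195 to 199) and the function names parse_attribute and watched_raw_values are assumptions for illustration; the sample rows are copied from the output above.

```python
# IDs treated as interesting here: reallocated/pending/uncorrectable sectors,
# ECC recoveries, and interface CRC errors. This selection is an assumption.
WATCHED = {5, 195, 196, 197, 198, 199}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       2850
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   064   047   000    Old_age   Always       -       36 (Min/Max 17/53)
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
""".splitlines()


def parse_attribute(line):
    """Split one attribute row into a dict; return None for headers and junk."""
    parts = line.split(None, 9)  # the raw value may itself contain spaces
    if len(parts) < 10 or not parts[0].isdigit():
        return None
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "raw": parts[9],
    }


def watched_raw_values(lines):
    """Return (id, name, leading raw number) for every watched attribute."""
    rows = [row for row in map(parse_attribute, lines) if row]
    return [
        (row["id"], row["name"], int(row["raw"].split()[0]))
        for row in rows
        if row["id"] in WATCHED
    ]


if __name__ == "__main__":
    for attr_id, name, raw in watched_raw_values(SAMPLE):
        print(f"{attr_id:3d} {name:<24} raw={raw}")
```

With the rows from this drive, every watched raw value comes out as zero, which is why the attribute table alone does not point at a failing disk and the error log has to be read as well.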
I'm guessing you have one or more bad sectors among the 8 sectors from 59764612 to 59764619. Since Linux normally issues 4KB reads, you can't tell from the normal diagnostics / errors which specific sector is bad.
You can use hdparm to read one sector at a time. "hdparm --read-sector 59764612 /dev/sda" will test that one sector. Test all 8, one at a time.
When you know which one(s) it is, you can use hdparm to force a write to the sector: "hdparm --repair-sector 59764612 /dev/sda" will trash the existing content of the sector, but it should trigger a reallocation of the sector to a good one. (hdparm refuses to do this unless you also pass --yes-i-know-what-i-am-doing.)
After you use --repair-sector, try to read the sector again with --read-sector. It should now read fine. If not, something is wrong with the drive.
Be sure not to run --repair-sector on sectors you don't know have failed.
The trouble is you have permanently lost whatever that sector used to hold. I know of no general way to get it back. After you repair the sector, you should try to figure out which file / inode / metadata the sector was associated with and try to fix things back up. If the sector is part of a file's data page, the simplest thing is just to delete the file and get a replacement copy from backup.
For the kernel team, this raises the question of whether XFS CRC metadata correction is going to be in openSUSE 13.2. If so, does the new XFS code have a way to force a write of the corrected data back to disk?
== details of my analysis ==
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 12867
relatively young at 12,867 power on hours
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
zero reallocated sectors
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 135
No idea what that is
181 Program_Fail_Cnt_Total 0x0022 080 080 000 Old_age Always - 452013413
No idea what that is
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 463
No idea what that is
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
Strange that all of those report zero if the drive is actually having problems. If the above is all you had to go by, I would say the drive is fine and it's a problem elsewhere, but it is NOT all we have. We have the list of recent discrete errors. Look for my comments in <= comment => blocks Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours) <= That is 12 power on hours ago => When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 84 ef 8f e3 00 00:00:00.203 READ DMA ef 10 02 00 00 00 a0 00 00:00:00.203 SET FEATURES [Enable SATA feature] 27 00 00 00 00 00 e0 00 00:00:00.203 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3] ec 00 00 00 00 00 a0 00 00:00:00.203 IDENTIFY DEVICE ef 02 00 00 00 00 a0 00 00:00:00.203 SET FEATURES [Enable write cache] <= so an uncorrectable read error occurred at sector 59764612, that is the drive itself reporting the bad sector. 8 sectors is one 4KB page. I don't think you can tell which specific sector failed. => Error 19578 occurred at disk power-on lifetime: 12725 hours (530 days + 5 hours) <= 32 hours before the last sector error => When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 84 ef 8f e3 00 00:00:00.233 READ DMA c8 00 08 84 ef 93 e0 00 00:00:00.233 READ DMA c8 00 08 ac ef 93 e0 00 00:00:00.233 READ DMA c8 00 08 cc ef 93 e0 00 00:00:00.233 READ DMA c8 00 08 44 fa 93 e0 00 00:00:00.233 READ DMA <= Again a read of that same 4 KB page failed Not sure which sector => Error 19577 occurred at disk power-on lifetime: 12680 hours (528 days + 8 hours) <= 2 hours before the last one => When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 84 ef 8f e3 00 00:00:35.602 READ DMA 25 00 10 15 ed 7a e0 00 00:00:35.602 READ DMA EXT 25 00 10 f5 9c 76 e0 00 00:00:35.602 READ DMA EXT 25 00 10 05 c7 7a e0 00 00:00:35.602 READ DMA EXT 25 00 10 f5 99 76 e0 00 00:00:35.602 READ DMA EXT <= It's that same 4 KB page again. => Error 19576 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 84 ef 8f e3 00 00:00:29.969 READ DMA c8 00 50 d4 d9 6f e0 00 00:00:29.969 READ DMA c8 00 30 3c e0 6f e0 00 00:00:29.969 READ DMA c8 00 88 b4 df 6f e0 00 00:00:29.969 READ DMA c8 00 e8 14 d7 6f e0 00 00:00:29.969 READ DMA <= and again, that same bad page 2 hours before the last one => Error 19575 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 c8 84 ef 8f e3 Error: UNC 200 sectors at LBA = 0x038fef84 = 59764612 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 f8 54 ef 8f e3 00 00:00:29.968 READ DMA c8 00 08 4c ef 8f e3 00 00:00:29.968 READ DMA c8 00 08 44 ef 8f e3 00 00:00:29.968 READ DMA c8 00 20 44 ca 6f e0 00 00:00:29.968 READ DMA c8 00 b0 b4 be ac e3 00 00:00:29.968 READ DMA . <= slightly different, just before the single 4KB page read failed, a 200 sector read failed (it was the same hour) => =============================================== -- Greg Freemyer On Mon, Jul 21, 2014 at 2:15 PM, Marguerite Su <i@marguerite.su> wrote:
Hi, Greg,
Unfortunately it's my laptop...
And here's smartctl information:
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-11-desktop] (SUSE RPM) Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION === Model Family: SAMSUNG SpinPoint M8 (AF) Device Model: SAMSUNG HN-M101MBB Serial Number: S2R8J9BB808817 LU WWN Device Id: 5 0024e9 205e8f61c Firmware Version: 2AR10001 User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Size: 512 bytes logical/physical Rotation Rate: 5400 rpm Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 6 SATA Version is: SATA 3.0, 3.0 Gb/s (current: 1.5 Gb/s) Local Time is: Tue Jul 22 02:09:58 2014 CST SMART support is: Available - device has SMART capability. SMART support is: Enabled
=== START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED
General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (13320) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 222) minutes. SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 2850 2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0 3 Spin_Up_Time 0x0023 089 089 025 Pre-fail Always - 3466 4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2161 5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0 8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 12867 10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 135 12 Power_Cycle_Count 0x0032 098 098 000 Old_age Always - 2242 181 Program_Fail_Cnt_Total 0x0022 080 080 000 Old_age Always - 452013413 191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 820 192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0 194 Temperature_Celsius 0x0002 064 047 000 Old_age Always - 36 (Min/Max 17/53) 195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0 196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 252 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 463 223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 135 225 Load_Cycle_Count 0x0032 032 032 000 Old_age Always - 691031
SMART Error Log Version: 1 Warning: ATA error count 19579 inconsistent with error log pointer 5
ATA Error Count: 19579 (device log contains only the most recent five errors) CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX] DH = Device/Head Register [HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours) When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 84 ef 8f e3 00 00:00:00.203 READ DMA ef 10 02 00 00 00 a0 00 00:00:00.203 SET FEATURES [Enable SATA feature] 27 00 00 00 00 00 e0 00 00:00:00.203 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3] ec 00 00 00 00 00 a0 00 00:00:00.203 IDENTIFY DEVICE ef 02 00 00 00 00 a0 00 00:00:00.203 SET FEATURES [Enable write cache]
Error 19578 occurred at disk power-on lifetime: 12725 hours (530 days + 5 hours) When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 84 ef 8f e3 00 00:00:00.233 READ DMA c8 00 08 84 ef 93 e0 00 00:00:00.233 READ DMA c8 00 08 ac ef 93 e0 00 00:00:00.233 READ DMA c8 00 08 cc ef 93 e0 00 00:00:00.233 READ DMA c8 00 08 44 fa 93 e0 00 00:00:00.233 READ DMA
Error 19577 occurred at disk power-on lifetime: 12680 hours (528 days + 8 hours) When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 84 ef 8f e3 00 00:00:35.602 READ DMA 25 00 10 15 ed 7a e0 00 00:00:35.602 READ DMA EXT 25 00 10 f5 9c 76 e0 00 00:00:35.602 READ DMA EXT 25 00 10 05 c7 7a e0 00 00:00:35.602 READ DMA EXT 25 00 10 f5 99 76 e0 00 00:00:35.602 READ DMA EXT
Error 19576 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours) When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 08 84 ef 8f e3 00 00:00:29.969 READ DMA c8 00 50 d4 d9 6f e0 00 00:00:29.969 READ DMA c8 00 30 3c e0 6f e0 00 00:00:29.969 READ DMA c8 00 88 b4 df 6f e0 00 00:00:29.969 READ DMA c8 00 e8 14 d7 6f e0 00 00:00:29.969 READ DMA
Error 19575 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours) When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 c8 84 ef 8f e3 Error: UNC 200 sectors at LBA = 0x038fef84 = 59764612
Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 f8 54 ef 8f e3 00 00:00:29.968 READ DMA c8 00 08 4c ef 8f e3 00 00:00:29.968 READ DMA c8 00 08 44 ef 8f e3 00 00:00:29.968 READ DMA c8 00 20 44 ca 6f e0 00 00:00:29.968 READ DMA c8 00 b0 b4 be ac e3 00 00:00:29.968 READ DMA
SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 0 Note: revision number not 1 implies that no selective self-test has ever been run SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Completed [00% left] (0-65535) 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
I at first didn't treat it as a hardware problem, I treated it as a software problem because I can't believe all in a sudden my disk was producting "end_request" for about 10 sectors in a very narrow time window (in just a few minutes..) while I can still use my openSUSE without any major problem (the hard shutdown will result a slow boot, the suspend to ram/disk function still looks fine, and the daily operation is good too)
Marguerite
On Tue, Jul 22, 2014 at 1:48 AM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
On Mon, Jul 21, 2014 at 12:02 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
If this system is a desktop, an easy thing to do is to replace the cable. On a laptop, that is generally not possible. In any case, I would be sure to have backups and a spare drive.
I agree with Larry: my first diagnostic step when I see errors being reported is to replace the SATA cable. If that fixes the problem, just toss the old cable.
If it doesn't, that is when I start diving into the "smartctl -a" data.
Greg
--
Greg Freemyer
On Mon 21-07-14 15:10:42, Greg Freemyer wrote:
The trouble is you have permanently lost whatever that sector used to hold. I know of no general way to get it back. After you repair the sector, you should try to figure out what file / inode / metadata the sector was associated with and try to fix things back up. If the sector is part of a file's data page, the simplest thing is just to delete the file and get a replacement copy from backup.
For the kernel team, this brings up the question of whether XFS CRC metadata correction is going to be in openSUSE 13.2. If so, does the new XFS code have a way to force a write of the corrected data back to disk?

XFS has a CRC feature and it will be part of openSUSE 13.2 (at least the kernel and xfsprogs will support it, but note that the on-disk format changes, so you'll have to create the filesystem from scratch). But XFS uses that feature only to detect issues, that is, to find out that a sector has garbage instead of real data. There is no plan (AFAIK) to use CRCs to recover anything, and in fact the CRCs used are too weak for anything like that. IMHO when you want to restore anything after a sector failure, you should be using RAID...

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
On Monday 21 July 2014 at 15:10 -0400, Greg Freemyer wrote:
The trouble is you have permanently lost whatever that sector used to hold. I know of no general way to get it back. After you repair the sector, you should try to figure out what file / inode / metadata the sector was associated with and try to fix things back up. If the sector is part of a file's data page, the simplest thing is just to delete the file and get a replacement copy from backup.
If the file in question is on the system partition, you can try "rpm -Va" to find out which file that was, and then reinstall the package in question. You should be able to find out which partition the file was on by comparing the sector number with the disk geometry as reported by fdisk or cfdisk.
--
Jean Delvare
SUSE L3 Support
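The partition lookup described above amounts to a range check against the start and end sectors that fdisk prints. A minimal Python sketch follows; the partition layout in it is entirely hypothetical (it is not Marguerite's disk) and find_partition is an illustrative name, not an existing tool.

```python
def find_partition(lba, partitions):
    """Return (name, offset_in_sectors) for the partition containing lba, or None."""
    for name, start, end in partitions:
        # fdisk reports inclusive start and end sectors for each partition.
        if start <= lba <= end:
            return name, lba - start
    return None

# HYPOTHETICAL layout: (name, start sector, end sector), as fdisk -l would list it.
layout = [
    ("sda1", 2048, 4196351),
    ("sda2", 4196352, 88082431),
    ("sda3", 88082432, 976773167),
]

# Failing LBA taken from the SMART error log in this thread.
print(find_partition(59764612, layout))  # prints: ('sda2', 55568260)
```

The offset returned alongside the partition name is what the later file lookup needs, so it is worth keeping.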
On 22.07.2014 16:40, Jean Delvare wrote:
On Monday 21 July 2014 at 15:10 -0400, Greg Freemyer wrote:
The trouble is you have permanently lost whatever that sector used to hold. I know of no general way to get it back. After you repair the sector, you should try to figure out what file / inode / metadata the sector was associated with and try to fix things back up. If the sector is part of a file's data page, the simplest thing is just to delete the file and get a replacement copy from backup.
If the file in question is on the system partition, you can try "rpm -Va" to find out which file that was, and then reinstall the package in question.
There is a nice HOWTO on the smartmontools site: http://smartmontools.sourceforge.net/badblockhowto.html I used this in the past to find out which file (on an ext3 FS) a given block number belonged to, so that I could restore it from backup when a disk started to die (the disk got replaced before it died, but I wanted to make sure to avoid overwriting the good backup with a bad file).
You should be able to find out which partition the file was on by comparing the sector number with the disk geometry as reported by fdisk or cfdisk.
That's easy, but then you need the offset into that partition, and with that offset you can use tune2fs (for the block size) and debugfs to find out the inode number / file name it belongs to. I forgot all the details, but the HOWTO linked above explains it all.
Good luck, Marguerite!
seife
--
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B
B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537
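The offset calculation mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: 512-byte logical sectors, a 4096-byte filesystem block size (the real value should be read from tune2fs -l), and a hypothetical partition start of 4196352 that is not taken from Marguerite's disk. The name fs_block is invented for this example.

```python
def fs_block(lba, part_start, block_size=4096, sector_size=512):
    """Map an absolute disk sector to a filesystem block number inside a partition."""
    # Byte offset of the sector from the start of the partition,
    # divided (rounding down) by the filesystem block size.
    return (lba - part_start) * sector_size // block_size

PART_START = 4196352  # HYPOTHETICAL start sector of the affected partition

# The SMART log reports 8 unreadable sectors starting at LBA 59764612.
first = fs_block(59764612, PART_START)
last = fs_block(59764612 + 8 - 1, PART_START)
print(first, last)  # prints: 6946032 6946033
```

With these assumed numbers the 8 bad sectors straddle two filesystem blocks, so both would need checking. On ext2/3/4 the block numbers can then be passed to the icheck command in debugfs to get an inode, and that inode to ncheck to get a path, as the HOWTO describes.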
On Mon, Jul 21, 2014 at 10:18 AM, Marguerite Su <i@marguerite.su> wrote:
Hi,
A few days ago suddenly I can't mount /root because:
2014-07-17T03:01:17.500184+08:00 linux kernel: [16528.736353] ata1.00: failed command: WRITE FPDMA QUEUED
2014-07-17T03:01:17.841313+08:00 linux kernel: [16529.077363] end_request: I/O error, dev sda, sector 59764644
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 2847
But:
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
Google says it is the hardware problem, but after a reinstall, my computer still works...and actually the disk is less than 2 years old..
Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours) When the command that caused the error occurred, the device was active or idle.
Can anyone help judge this? Do I have to replace the disk?
Marguerite
It is certainly not a bug, but it may also not be time to replace the drive. You gave us too little data to even make an educated guess. Use "smartctl -a /dev/sda" to see all the smart output and post it here.

A couple comments:

- disks are cheap, businesses often replace a disk the first time they have a read error that is reported all the way up the stack to userspace. You've just had one of those. The "smart" data you are reporting is very manufacturer and even disk model specific. In general you have to record it periodically and look for changes to see if anything bad is going on. Despite that, I don't do that myself and neither do many other people.

- a raw read error may just mean that ECC correction had to be invoked, so who cares. My current laptop is reporting 226 million raw read errors. It also says the exact same 226 million attempts at ECC correction worked.

- It is also reporting zero reallocations and zero pending reallocations. I'm not planning to replace it. You've had only 2847 raw read errors so that by itself looks like a small, insignificant number.

- My "reported Uncorrectable" raw value is zero, so not even a single retry has been required yet. That is surprising because a physical bump of the laptop while it is reading data can cause an uncorrectable error that a retry would likely solve.

- If you don't know, many / most (but not all) disks only re-allocate sectors on write. Thus if you read a sector and the data is not readable and the ECC correction also fails, many drives will mark that sector for re-allocation. It will stay in that state until that specific sector is written with replacement data. After all, why should the drive re-allocate the sector if it doesn't know what to put in the new sector.

- Thus if you want to know how many current bad media sectors you have, it should be reflected in the "pending reallocation" data.

- If you want to know how many bad sectors you used to have, but were fixed by reallocation, look at the Re-allocated data.

As I understand it, some drives will monitor those "pending re-allocation" sectors and if you ever successfully read data from them, they will go ahead and allocate a new sector and write the good data there. I absolutely know there have been reports of read only actions triggering re-allocations. I don't know on which makes/models of drives this has been observed.

Note that the drive itself will NOT proactively monitor the sectors for issues, nor will it periodically check pending bad sectors to see if it can get valid data.

Thus some raid setups have the ability to perform background media scans and if a media error is reported, get the valid data from an alternate drive and write it back to the drive reporting the bad sector. NetApp (among others) calls this scrubbing the array. (https://library.netapp.com/ecmdocs/ECMP1196912/html/GUID-81F8BEA3-ADC1-4790-...)

Greg
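Since the comments above hinge on reading the raw values of a few SMART attributes and comparing them over time, here is a minimal Python sketch that pulls raw values out of attribute rows shaped like the two quoted in this thread. It assumes the usual ten-column row layout of the smartctl attribute table. The attribute names in WATCH are common ones, but names and meanings vary by vendor, so treat that list as an assumption.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for rows shaped like the smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric attribute ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        # Rows whose raw column is not a plain integer are skipped in this sketch.
        if raw.isdigit():
            attrs[name] = int(raw)
    return attrs

# The two attribute rows quoted earlier in the thread.
sample = """\
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       2847
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
"""

# ASSUMPTION: typical attribute names; a given drive may use different ones.
WATCH = ("Reallocated_Sector_Ct", "Reallocated_Event_Count",
         "Current_Pending_Sector", "Reported_Uncorrect")

attrs = parse_smart_attributes(sample)
nonzero = {name: attrs[name] for name in WATCH if attrs.get(name, 0) > 0}
print(attrs)
print(nonzero)
```

Saving the parsed dictionary periodically and comparing successive runs is one simple way to do the "record it and look for changes" step.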
participants (7)
- Basil Chupin
- Greg Freemyer
- Jan Kara
- Jean Delvare
- Larry Finger
- Marguerite Su
- Stefan Seyfried