On Mon, Jul 21, 2014 at 10:18 AM, Marguerite Su <i@marguerite.su> wrote:
Hi,
A few days ago I suddenly couldn't mount /root because:
2014-07-17T03:01:17.500184+08:00 linux kernel: [16528.736353] ata1.00: failed command: WRITE FPDMA QUEUED
2014-07-17T03:01:17.841313+08:00 linux kernel: [16529.077363] end_request: I/O error, dev sda, sector 59764644
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3  Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 2847
But:
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
Google says it is a hardware problem, but after a reinstall my computer still works... and the disk is actually less than 2 years old.
Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
Can anyone help judge this? Do I have to replace the disk?
Marguerite
It is certainly not a bug, but it may also not be time to replace the drive. You gave us too little data to even make an educated guess. Use "smartctl -a /dev/sda" to see all the SMART output and post it here.

A couple of comments:

- Disks are cheap, and businesses often replace a disk the first time they have a read error that is reported all the way up the stack to userspace. You've just had one of those.

- The SMART data you are reporting is very manufacturer and even disk model specific. In general you have to record it periodically and look for changes to see if anything bad is going on. Despite that, I don't do that myself and neither do many other people.

- A raw read error may just mean that ECC correction had to be invoked, so who cares. My current laptop is reporting 226 million raw read errors. It also says the exact same 226 million attempts at ECC correction worked.

- It is also reporting zero reallocations and zero pending reallocations. I'm not planning to replace it. You've had only 2847 raw read errors, so that by itself looks like a small, insignificant number.

- My "Reported Uncorrectable" raw value is zero, so not even a single retry has been required yet. That is surprising, because a physical bump of the laptop while it is reading data can cause an uncorrectable error that a retry would likely solve.

- In case you don't know, many / most (but not all) disks only re-allocate sectors on write. Thus if you read a sector and the data is not readable and the ECC correction also fails, many drives will mark that sector for re-allocation. It will stay in that state until that specific sector is written with replacement data. After all, why should the drive re-allocate the sector if it doesn't know what to put in the new sector?

- Thus if you want to know how many bad media sectors you currently have, it should be reflected in the "pending reallocation" data.
- If you want to know how many bad sectors you used to have that were fixed by reallocation, look at the re-allocated data.

As I understand it, some drives will keep track of those "pending re-allocation" sectors, and if you ever successfully read data from them, they will go ahead and allocate a new sector and write the good data there. I absolutely know there have been reports of read-only actions triggering re-allocations. I don't know on which makes/models of drives this has been observed.

Note that the drive itself will NOT proactively monitor the sectors for issues, nor will it periodically check pending bad sectors to see if it can get valid data.

Thus some RAID setups have the ability to perform background media scans and, if a media error is reported, get the valid data from an alternate drive and write it back to the drive reporting the bad sector. NetApp (among others) calls this scrubbing the array. (https://library.netapp.com/ecmdocs/ECMP1196912/html/GUID-81F8BEA3-ADC1-4790-...)

Greg
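
The "record it periodically and look for changes" advice above could be sketched roughly as follows in Python. This is only an illustration, not something from the thread: the function names and the WATCHED list are made up for the example, the attribute names are the ones smartmontools commonly prints (they vary by vendor and model, so check your own "smartctl -A" output), and the parser assumes the usual ten-column attribute table.

```python
import subprocess

# Attribute names to watch. These are common smartmontools names, but they are
# an assumption here: your drive may report different or additional names.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Reported_Uncorrect",
)


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns: a numeric ID, a name, a hex FLAG,
        # VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED and RAW_VALUE.
        if len(fields) < 10:
            continue
        if not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        raw = fields[9]
        # Some raw values carry extra text; keep only plain integers.
        if raw.isdigit():
            attrs[fields[1]] = int(raw)
    return attrs


def changed_attributes(old, new, names=WATCHED):
    """Return {name: (old_raw, new_raw)} for watched attributes that changed."""
    return {
        name: (old[name], new[name])
        for name in names
        if name in old and name in new and old[name] != new[name]
    }


def read_attributes(device="/dev/sda"):
    """Run `smartctl -A` on a device (usually needs root) and parse the table.

    smartctl's exit status is a bit mask and can be non-zero even when the
    table was printed, so the return code is deliberately not checked here.
    """
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)
```

The idea would be to call read_attributes() from cron or similar, store the resulting dictionary somewhere (JSON is fine), and pass yesterday's and today's snapshots to changed_attributes(). A pending or reallocated count that keeps growing between snapshots is the kind of change worth worrying about; a large but stable raw read error number, as discussed above, probably is not.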