[opensuse-factory] a blackout messed my system

I just had a blackout, and on one of the computers the UPS did not do its job. After the electricity came back, the system hangs somewhere and seems to attempt a recovery without result. The last messages are repeated for as long as I keep the machine running. Blaming the electricity fellers is perhaps not correct :) It could of course be something else, but I have no clue. The information produced is not very clear to me. Perhaps somebody else could translate it into plain English (or Dutch :) ). The problem seems to start after the machine finds the USB camera.

-------
Product USB 2.0 PC CAMERA
Manufactor ARKMICRO
random: non blocking pool is initialized
ata 3 : lost interrupt (status 0x50)
ata 3.00: exception E mask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata 3.00: failed command : READ DMA
ata 3.00: cmd c8/00 etc etc....... tag dma 28672 in
          res 40/00 etc etc.........Emask 0x4 (time out)
ata 3.00: status {DRDY}
ata 3.00: soft resetting link
ata 3.00: configured for UDMA/130
ata 3.01: configured for UDMA/ 100
ata 3.00: device reported invalid CHS sector o
ata 3.00: EH complete
ata 3: lost interrupt etc. etc
----------

The machine affected by this problem runs KDE Tumbleweed updated (20150807) and has a 586 processor and kernel 4.1.3-2-desktop.

--
Linux User 183145 using KDE4 and LXDE on a Pentium IV, powered by openSUSE 20150807 (x86_64) Kernel: 4.1.3-2-desktop KDE Development Platform: 4.14.10
14:36pm up 0:55, 2 users, load average: 0.73, 0.50, 0.43
--
To unsubscribe, e-mail: opensuse-factory+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-factory+owner@opensuse.org

On Mon, Aug 10, 2015 at 11:46 AM, C. Brouerius van Nidek <constant@indo.net.id> wrote:
Well, apparently the hard disk failed to complete a request (timeout). This could be either a controller or an HDD issue. Try connecting the HDD to a different controller, and a different HDD to this controller.

On Monday 2015-08-10 10:46, C. Brouerius van Nidek wrote:
Your disk fails to respond. I too had an instance some years ago where a power dip ultimately caused a disk write to be interrupted, and ever since, there is a static bad zone in that region of the disk.

On Mon, Aug 10, 2015 at 5:03 AM, Jan Engelhardt <jengelh@inai.de> wrote:
Jan, a partial sector write isn't supposed to happen, but you're correct that that doesn't mean it doesn't happen. When it does happen, it causes the ECC sector metadata to disagree with the sector contents. When the disk controller reads that sector, it verifies the ECC and does its best to correct the error (assuming it is just a few bits). When that fails (as in your case), it returns a media error to the kernel and also flags the specific physical sector that is corrupt. None of that is present in the above error messages.

Greg

On Mon, Aug 10, 2015 at 2:46 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
Agreed, but there are files (like RPMs) that are long lived and rarely if ever written, so a chance write to the affected sector may never happen. hdparm has a "repair_sector" feature that can force a write to a specific sector that is known to be bad. If the error was caused by a bad write and nothing is actually wrong with the media (the physical sector on the platter), then hdparm will bring it back to life. The trouble is that it has no idea what the correct data is, so it just fills the sector with garbage.

Greg
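The hdparm step Greg describes might look roughly like the sketch below. It assumes the sector-level options listed in the hdparm manual page (--read-sector and --write-sector, for which --repair-sector is given there as an alias); the function name, /dev/sdX and the sector number are placeholders, not values from this thread. The function only prints the commands, since the write is destructive and would have to be run by hand as root.

```shell
# Dry run: print the hdparm commands for inspecting and rewriting one
# sector. Nothing is executed here; copy the lines by hand once the
# device name and sector number have been double-checked.
# Assumption: the installed hdparm supports --read-sector/--write-sector.
sector_repair_cmds() {
    dev=$1   # e.g. /dev/sdX (placeholder)
    lba=$2   # logical sector number (LBA) reported as bad
    # A read that fails with an I/O error confirms the sector is unreadable.
    echo "hdparm --read-sector $lba $dev"
    # Overwrites the sector; whatever it held before is lost for good.
    echo "hdparm --yes-i-know-what-i-am-doing --write-sector $lba $dev"
}

sector_repair_cmds /dev/sdX 12345
```

After such a write the file that owned the sector is still damaged, so it would need to be restored separately, for example by reinstalling the affected RPM.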

On 2015-08-10 21:03, Greg Freemyer wrote:
On Mon, Aug 10, 2015 at 2:46 PM, Carlos E. R. <> wrote:
Yes, of course. I was thinking of what Jan said:

J> I too had an instance some years ago where a power dip ultimately
J> caused a disk write to be interrupted, and ever since, there is a
J> static bad zone in that region of the disk.

It should be repairable, no?

--
Cheers / Saludos,
Carlos E. R. (from 13.1 x86_64 "Bottle" at Telcontar)

On Mon, Aug 10, 2015 at 5:37 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
Yes. My method would be to make a full copy of the drive to a USB drive via dd:

    dd if=/dev/sda of=image_file.dd bs=4k conv=sync,noerror

Then restore it via dd. The restore should cause any bad sectors to be remapped, assuming there are any spare sectors left to remap to.

Greg
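As a rough sketch of that image-and-restore round trip, the two dd invocations could be wrapped as below. The function names are made up for illustration, and the sketch is safe to try on ordinary files before pointing it at a real device.

```shell
# Image a source to a file. bs=4096 is the portable spelling of bs=4k.
# conv=noerror keeps going after a read error; conv=sync pads each short
# or failed block with zeros so later data stays at the right offset.
image_drive() {
    dd if="$1" of="$2" bs=4096 conv=sync,noerror 2>/dev/null
}

# Write the image back. Rewriting every block is what gives the drive
# firmware the chance to remap sectors it considers bad.
restore_drive() {
    dd if="$1" of="$2" bs=4096 2>/dev/null
}
```

On a real system this would be something like `image_drive /dev/sda /mnt/usb/image_file.dd` followed by `restore_drive /mnt/usb/image_file.dd /dev/sda`, run from a rescue system with the disk unmounted (/mnt/usb is a hypothetical mount point). Blocks that could not be read come back as zeros after the restore.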

On Tuesday 2015-08-11 04:29, Greg Freemyer wrote:
ddrescue is a much better program for this task.

> The restore should cause any bad sectors to remap, assuming there are any spare sectors to remap to.

For that, you do not have to copy the entire disk. Grab a bad sector list, and only rewrite those. Something like

    ddrescue -f /dev/sda /dev/zero logfile.txt
    ddrescue -f -F - /dev/zero /dev/sda logfile.txt
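To see which regions the fill pass would touch, the log written by the first ddrescue run can be inspected. The sketch below assumes the mapfile layout described in the GNU ddrescue manual (comment lines starting with "#", data lines of the form "pos size status", and "-" meaning bad sector); the function name is hypothetical.

```shell
# List the areas a ddrescue mapfile marks as bad sectors ("-" status).
# Prints the start position and size of each area, in hex bytes.
# The current-position line is skipped because its last field is
# never "-" (it has two fields, or three ending in a pass number).
bad_areas() {
    awk '!/^#/ && NF == 3 && $3 == "-" { print $1, $2 }' "$1"
}
```

Running `bad_areas logfile.txt` before the second ddrescue command gives a quick idea of how much of the disk is about to be overwritten with zeros.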

On August 10, 2015 11:00:38 PM EDT, Jan Engelhardt <jengelh@inai.de> wrote:
Maybe. If the drive is in such bad shape that ddrescue is needed, then I'd toss the drive, not try to get it back into operation. Also, there are "weak" sectors that succeed after multiple read attempts. The ddrescue method does nothing to help them. The full dd method lets the disk controller remap them if it wants to. As I understand it, a lot of the sectors that get remapped are of the weak variety, so doing a full dd read and write every year or two might be good preventive maintenance.

Greg
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

On 08/10/2015 10:29 PM, Greg Freemyer wrote:
> My method would be to make a full copy of the drive to a usb drive via dd.
> dd if=/dev/sda of=image_file.dd bs=4k conv=sync,noerror

Wouldn't that require a USB drive with at least the capacity of the hard drive?

On August 11, 2015 7:13:52 AM EDT, James Knott <james.knott@rogers.com> wrote:
Yes. Costco has a 4TB drive right now for $120.

Greg

On Tue, Aug 11, 2015 at 2:13 PM, James Knott <james.knott@rogers.com> wrote:
I did it over the network to another system when I had to replace a drive due to bad sectors. Then I mounted the image and rsynced the data back to the new partitions :)

On Mon, Aug 10, 2015 at 12:46 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
Whether an error is transient or persistent is determined when the sector is written. If the verify of the write succeeds, case closed. If the error is transient, there's a limit to how many errors are permitted (it varies); if it's persistent, the physical sector is taken out of use and the LBA is remapped to a reserve sector, where the data is written (and verified successfully). If there are no reserve sectors remaining, the result is a write error. Different file systems have different ways of dealing with that case, depending on whether a bad sector map is supported (recent versions of mdadm do, ext2/3/4 has long had such support, XFS and Btrfs do not).

--
Chris Murphy
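The decision path described above can be written down as a toy model. This is only an illustration of the reasoning: real drive firmware is opaque and vendor specific, and the function name, arguments and outcome labels are all hypothetical.

```shell
# Toy model of what happens to a write, following the description above.
#   $1 = yes if the verify after the write succeeded
#   $2 = yes if the failure is persistent rather than transient
#   $3 = number of reserve sectors still available
write_outcome() {
    if [ "$1" = yes ]; then
        echo ok              # verify passed, case closed
    elif [ "$2" != yes ]; then
        echo transient       # counted against the drive's error limit
    elif [ "$3" -gt 0 ]; then
        echo remapped        # LBA now points at a reserve sector
    else
        echo write-error     # no reserve sectors left; the OS sees an error
    fi
}
```

Only the last outcome ever reaches the file system, which is where a bad sector map (or the lack of one) starts to matter.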

On 2015-08-11 00:59, Chris Murphy wrote:
On Mon, Aug 10, 2015 at 12:46 PM, Carlos E. R. <> wrote:
IIRC, initially reiserfs did not, and support was added later. At the time, I think disks did no remapping, or we did not know about it, so it was a problem if the filesystem could not cope with bad sectors.

--
Cheers / Saludos,
Carlos E. R. (from 13.1 x86_64 "Bottle" at Telcontar)

On Tue, Aug 11, 2015 at 6:56 AM, Carlos E. R. <robin.listas@telefonica.net> wrote:
Yes, but since "ancient" times drive firmware has handled this internally. It does get a bit confusing with AF (Advanced Format) drives, where you must remember to write 8 logical 512-byte sectors in order to cause a write to one 4096-byte physical sector. If you only overwrite part of that sector, the drive internally tries to do a read-modify-write, and the read fails because the sector is "bad" (or at least ECC fails for it). So you sit there and write to the sector but get back read errors, and it doesn't seem to make sense until you realize you need to write a full 4096-byte sector. And bs=4096 changes the seek= value you need to use to write to that sector, because any error will be reported in 512-byte sectors, not 4096-byte sectors.

--
Chris Murphy
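The sector arithmetic is easy to get wrong, so here is a small sketch of it. It assumes physical sectors are aligned to multiples of 8 logical sectors, which is not guaranteed on every drive; the function names and the example LBA are made up, and the last function only prints the dd command rather than running it.

```shell
# Convert a 512-byte LBA from a kernel error message into the values
# needed to rewrite the whole 4096-byte physical sector containing it.
af_seek() {
    echo $(( $1 / 8 ))          # seek= value to use with bs=4096
}

af_first_lba() {
    echo $(( $1 / 8 * 8 ))      # first of the 8 logical sectors
}

# Dry run: print the dd command that would overwrite that one physical
# sector with zeros. $1 is the device (placeholder), $2 the reported LBA.
af_rewrite_cmd() {
    echo "dd if=/dev/zero of=$1 bs=4096 seek=$(af_seek "$2") count=1"
}
```

For a reported LBA of 1000005, the physical sector starts at logical sector 1000000 and the matching dd offset is seek=125000, not seek=1000005.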

On August 10, 2015 4:46:57 AM EDT, "C. Brouerius van Nidek" <constant@indo.net.id> wrote:
You have a failed read, but not a bad media read. Note that the kernel is stepping down the link speed, trying to get a decent communication path. In my experience with that error sequence, the odds are high that your SATA cable has become unreliable. You can try to reseat it. Personally, I would just replace it. They are cheap.

Greg

On Mon, Aug 10, 2015 at 5:46 AM, C. Brouerius van Nidek <constant@indo.net.id> wrote:
Please run a SMART test on the drive. I wouldn't trust that disk anymore and would replace it asap.

On 2015-08-10 20:11, Cristian Rodríguez wrote:
> Please run a SMART test on the drive. I wouldn't trust that disk anymore and would replace it asap.

A SMART test as done by smartctl runs completely inside the hard disk; it is done by its firmware, and the result is simply written to a log in the disk. It does not test the cabling, which in this case is suspect. Instead, it is possible to run the tests externally, with tools from the disk manufacturer. Seagate has a good one, called seatools. They have a version that boots from CD (or floppy?), which is available for download. It is unable to print or save a report, though. Well, it can, but it is not trivial to do.

--
Cheers / Saludos,
Carlos E. R. (from 13.1 x86_64 "Bottle" at Telcontar)
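For reference, the firmware-internal self-test described above is typically driven with two smartctl calls, sketched here as a dry run. The function name and /dev/sdX are placeholders; the commands need root and a real device, so the function only prints them.

```shell
# Dry run: print the smartctl commands for a long self-test and for
# reading the result afterwards.
smart_selftest_cmds() {
    # Starts the extended self-test inside the drive and returns at once;
    # the test itself keeps running in the drive firmware.
    echo "smartctl -t long $1"
    # Shows the self-test log that the firmware keeps on the disk.
    echo "smartctl -l selftest $1"
}
```

As noted above, a clean result from this test says nothing about the cable, since the data never leaves the drive while the test runs.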

participants (9)
- Andrei Borzenkov
- C. Brouerius van Nidek
- Carlos E. R.
- Chris Murphy
- Cristian Rodríguez
- Greg Freemyer
- greg.freemyer@gmail.com
- James Knott
- Jan Engelhardt