[Bug 427264] ata problems (exceptions?), I suspect that it sometime even corrupts data on disks
--- Comment #8 from Tejun Heo <teheo@xxxxxxxxxx> 2008-09-29 22:59:26 MDT ---

This one from comment #7 is the actual failure which is corrupting your

Sep 29 11:31:01 rychlik kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xa frozen
Sep 29 11:31:01 rychlik kernel: ata1: SError: { RecovComm PHYRdyChg CommWake DevExch }
Sep 29 11:31:01 rychlik kernel: ata1.00: cmd 35/00:10:2b:c8:9a/00:00:19:00:00/e0 tag 0 dma 8192 out
Sep 29 11:31:01 rychlik kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/10 Emask 0x14 (ATA bus error)
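
For reference, the SErr value in the log can be decoded by hand. The sketch
below is not kernel code; it assumes the standard SATA SError bit layout and
reuses the short names the kernel prints. The function name decode_serror is
made up for this illustration.

```python
# Bit position -> short name, following the SATA SError register layout.
# The names mirror the ones that appear inside "SError: { ... }" in the log.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items())
            if value & (1 << bit)]


print(decode_serror(0x4050002))
# ['RecovComm', 'PHYRdyChg', 'CommWake', 'DevExch']
```

PHYRdyChg and DevExch together mean the link saw the device go away and come
back, which fits a drive that briefly lost power.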

It's most likely a power problem. During WRITE_EXT, the voltage drops
momentarily, causing the disk to drop out briefly. In most cases you can hear
the drive perform an emergency head unload, and it shows up in the smartctl
output as an incremented load cycle count and/or emergency head unload count.
This brief power interruption causes the data held in the drive's memory
buffer to be lost, which corrupts the filesystem (the operating system has no
way to tell that the drive has lost data in its buffer).
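
The WRITE_EXT mentioned above is the command in the "cmd" line of the log
(opcode 0x35, WRITE DMA EXT). As a rough sketch, the taskfile string can be
unpacked like this. It assumes the usual libata field order
command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
and 512-byte sectors; decode_lba48_taskfile is a made-up helper name.

```python
def decode_lba48_taskfile(cmd):
    """Unpack a libata 'cmd' taskfile string for an LBA48 command."""
    command, low, high, device = cmd.split("/")
    # Low-order bytes: feature, sector count, LBA bits 0..23.
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    # High-order ("hob") bytes: feature, sector count, LBA bits 24..47.
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":"))
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    # In LBA48 a sector count of 0 means 65536 sectors.
    sectors = ((hob_nsect << 8) | nsect) or 65536
    return {
        "command": int(command, 16),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,  # assumption: 512-byte logical sectors
        "device": int(device, 16),
    }


info = decode_lba48_taskfile("35/00:10:2b:c8:9a/00:00:19:00:00/e0")
print(hex(info["command"]), hex(info["lba"]), info["sectors"], info["bytes"])
```

For the failing command this gives 16 sectors, i.e. 8192 bytes, which matches
the "dma 8192 out" in the same log line.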

The easiest way to verify the problem is to get a second power supply (any
cheap one should be enough), connect half of the drives to it, and power it up
separately. Either the problem goes away for all drives, or at least for the
ones connected to the separate PSU. You can power up a PSU without a
motherboard by following the following instructions.

Please check the SMART counters before and after such failures and post the
results, and if possible try the suggested separate PSU.
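
One way to compare the counters is to save the output of "smartctl -A" before
and after a failure and diff the raw values. The sketch below assumes the usual
ten-column attribute table and the attribute names smartctl commonly reports
(Load_Cycle_Count, Power-Off_Retract_Count, Power_Cycle_Count); names vary by
drive vendor, and the sample numbers are invented for illustration.

```python
# Attribute names to watch; adjust to what your drive actually reports.
WATCHED = ("Load_Cycle_Count", "Power-Off_Retract_Count", "Power_Cycle_Count")


def parse_attributes(text):
    """Map attribute name -> raw value from 'smartctl -A' table output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED
        # WHEN_FAILED RAW_VALUE, with a numeric ID and a hex FLAG.
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x") and fields[9].isdigit()):
            attrs[fields[1]] = int(fields[9])
    return attrs


def changed_counters(before, after, names=WATCHED):
    """Return {name: increase} for watched counters whose raw value changed."""
    b, a = parse_attributes(before), parse_attributes(after)
    return {n: a[n] - b[n] for n in names
            if n in a and n in b and a[n] != b[n]}


# Made-up sample data, not taken from the reporter's drive.
HEADER = ("ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH "
          "TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n")
ROW = "{:>3} {:<23} 0x0032   200   200   000    Old_age   Always       -       {}\n"
BEFORE = HEADER + ROW.format(12, "Power_Cycle_Count", 120) \
    + ROW.format(192, "Power-Off_Retract_Count", 35) \
    + ROW.format(193, "Load_Cycle_Count", 410)
AFTER = HEADER + ROW.format(12, "Power_Cycle_Count", 120) \
    + ROW.format(192, "Power-Off_Retract_Count", 36) \
    + ROW.format(193, "Load_Cycle_Count", 411)

print(changed_counters(BEFORE, AFTER))
```

If the load cycle or retract counters go up across a failure while the power
cycle count stays the same, that points to the drive briefly losing power.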

As for the SMART ENABLE OPERATIONS timeout, I'll ask around and report back.

