On Sunday 22 November 2009 03:57:36 and regarding:
On Saturday 21 November 2009 07:33:22 and regarding:
Dave, it looks to me like you've got hardware starting to fail. Try a live CD on the box if you can.
Humm,
OK, I'll give the live CD a go in the morning. I suspect it is something SUSE related, because if I install the drive in question as a normal hard drive in my laptop, it works great. Also, I swapped drives in my laptop to update the Arch Linux install. I attached the same USB drive to the Arch Linux box and it did just fine, no errors and no disconnects. The Arch kernel is:
26-2.6.31.6-1
Same box, same USB drive, works fine on Arch, disconnecting in 11.2. That's what has me thinking 11.2 may be to blame. I'll try the live CD and let you know. If you think of anything else, please let me know. Thanks
(live cd worked fine in the testing I did with the USB drive)
Guys,
There is a problem with 11.2 x86_64 that causes disk corruption on USB and remote filesystems when you connect to them. I have probably spent the better part of 20 hours picking through logs, running fsck and testing to try to identify the issue. I still do not understand it, but I have experienced it on two separate filesystems (the first over a USB connection, the second over ssh/sftp/fish connections to my 10.3 server). My first post about it was in this thread. My second (which I can't find right now) was a reply to John Bennett, who experienced disk corruption.
Ah hah! Found the second thread; my reply was sent Friday in "Re: [opensuse] Deleting corrupted files".
I don't know if this is related (I strongly suspect it is, but that is just my hunch), but I first noticed screwed up characters in 'mc'. The little scrollbars and lines were not the normal ASCII line drawing and extended characters, but instead were funky 'A' or 'a' characters with the little 'circles' or 'degree symbols' over them. (I have screenshots somewhere, but we have all seen these characters when encoding is fubar)
Let me know if anyone wants to see them, and I'll track them down. (Files are spread over several machines because this (what I will call BUG) required a complete wipe of my 11.2 laptop install, and a complete backup and recovery of a 500G dmraid /home partition on my 10.3 server, so not exactly a simple fsck issue.)
The corruption problem occurred when connecting from my 11.2 x86_64 install on my 11.0 laptop either to a USB-connected hard drive or via ssh, fish or sftp, and then copying files to/from my laptop or performing operations on the remote filesystems (copying, moving, deleting, etc...).
The 11.2 install on my laptop was a 'plain-Jane' install where partitions were the normal /, /home, swap in a dual-boot setup with XP. The install was done via 'upgrade' with the 11.2 DVD (md5sums confirmed and install media check passed). The 'upgrade' was in reality a fresh install that formatted / as ext4 but preserved /home as ext3. I first noticed the characters in mc, and then noticed the 'LC_** not set' messages when connecting from my laptop to the server via ssh. I understood that this was a language issue, but the language was set properly in yast -> sysconfig-editor, and in the KDE and GNOME control panels. (I have several previous posts on the LC_ issues.)
After connecting from 11.2, the USB drive I was connecting to started throwing errors and would disconnect (which started this thread originally). I can't recall whether it was a matter of hours or a day or two later when I noticed the errors in my server logs. (I connect to and manage files on my home server 10 times a day, either via ssh, sftp, smb or fish.) The disk corruption on my server began after my connection from 11.2 and the errors took the form:
<snip>
Dec 3 18:05:42 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 3 18:05:42 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec 3 18:05:42 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec 3 18:05:42 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec 3 18:05:42 nirvana kernel: ata3.00: configured for UDMA/133
Dec 3 18:05:42 nirvana kernel: ata3: EH complete
Dec 3 18:05:44 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 3 18:05:44 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec 3 18:05:44 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec 3 18:05:44 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec 3 18:05:44 nirvana kernel: ata3.00: configured for UDMA/133
Dec 3 18:05:44 nirvana kernel: ata3: EH complete
Dec 3 18:05:46 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 3 18:05:46 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec 3 18:05:46 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec 3 18:05:46 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec 3 18:06:30 nirvana kernel: ata3.00: configured for UDMA/133
Dec 3 18:06:30 nirvana kernel: ata3: EH complete
Dec 3 18:06:30 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 3 18:06:30 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec 3 18:06:30 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec 3 18:06:30 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec 3 18:06:30 nirvana kernel: ata3.00: configured for UDMA/133
Dec 3 18:06:30 nirvana kernel: ata3: EH complete
Dec 3 18:06:30 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 3 18:06:30 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec 3 18:06:30 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec 3 18:06:30 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec 3 18:06:30 nirvana kernel: ata3.00: configured for UDMA/133
Dec 3 18:06:30 nirvana kernel: ata3: EH complete
Dec 3 18:06:30 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 3 18:06:30 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec 3 18:06:30 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec 3 18:06:30 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec 3 18:06:30 nirvana kernel: ata3.00: configured for UDMA/133
Dec 3 18:06:30 nirvana kernel: sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Dec 3 18:06:30 nirvana kernel: sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Dec 3 18:06:30 nirvana kernel: Descriptor sense data with sense descriptors (in hex):
Dec 3 18:06:30 nirvana kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec 3 18:06:30 nirvana kernel: 34 8c 0c 39
Dec 3 18:06:30 nirvana kernel: sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Dec 3 18:06:30 nirvana kernel: end_request: I/O error, dev sda, sector 881593401
Dec 3 18:06:30 nirvana kernel: ata3: EH complete
Dec 3 18:06:30 nirvana kernel: sd 2:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Dec 3 18:06:30 nirvana kernel: sd 2:0:0:0: [sda] Write Protect is off
Dec 3 18:06:30 nirvana kernel: sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
Dec 3 18:06:30 nirvana kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO
<snip>
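For anyone comparing logs: the "res" line in the excerpt above can be decoded to see which sector the drive is complaining about. What follows is only a minimal sketch, not anything from the original investigation. It assumes the usual libata field order (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), and the function name `lba_from_res` is made up for illustration.

```python
import re

# Matches the 12 hex bytes that follow "res " in a libata error line.
RES_RE = re.compile(r"res ([0-9a-f]{2}(?:[/:][0-9a-f]{2}){11})")

def lba_from_res(line):
    """Return (status, error, lba) decoded from a libata 'res' line, or None.

    Assumed field order (hypothetical helper, not part of any tool):
    status/error:nsect:lbal:lbam:lbah/
    hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    m = RES_RE.search(line)
    if m is None:
        return None
    f = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    status, error = f[0], f[1]
    lbal, lbam, lbah = f[3], f[4], f[5]
    hob_lbal, hob_lbam, hob_lbah = f[8], f[9], f[10]
    # Assemble the 48-bit LBA from the low and high-order-byte registers.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return status, error, lba

line = ("Dec 3 18:06:30 nirvana kernel: res "
        "51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)")
status, error, lba = lba_from_res(line)
print(hex(status), hex(error), lba)  # 0x51 0x40 881593401
```

The decoded LBA, 881593401 (0x348c0c39), is the same sector named in the "end_request: I/O error, dev sda, sector 881593401" line and in the sense data bytes "34 8c 0c 39", so every retry in the excerpt is failing on the same spot on sda.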
I was sure it was a simple coincidence that the USB disconnects just happened and would be fixed with an update, and that I just happened to start having server /home partition corruption at that time. After all, my 10.3 server has been running error free since 10.3 was released (Oct. '07 IIRC). So it was 2 years old, but in reality it operates under a fairly light load, mostly idling (spell check is still broke sorry;-) except for backups. So I had no reason to expect a disk failure, but figured it just happened.
Then I saw John's post discussing almost identical errors, and things just seemed too curious to be coincidence. Then the threads about "Deleting corrupted files" started hitting the list.
I don't know enough about filesystems to understand how or why file corruption occurs or how different encoding/language/LC_/etc. bugs could cause or contribute to it, but I think we have a real problem with the current 11.2 x86_64 release. I know from my experience with my USB drive and my server that the errors were directly related to connections from my new 11.2 x86_64 laptop. My 11.2 install there finally experienced complete meltdown and I had to wipe and reinstall. I have configured the laptop, but haven't had time to do too much additional testing. (not to mention not wanting to spend another 4 hours repairing the 2-disk dmraid /home on my server again -- fsck of 500G takes a long time, and hitting