Thanks for providing the support files.

We have investigated on the driver side, and a firmware-related error appears to be involved. From the controller logs, I can see that commands are timing out and that a device port reset occurred, which explains the command abort/reset/hang at the driver level. We are working with the firmware group and will update you on this shortly.

Thanks & Regards,
Mahesh

-----Original Message-----
From: Bruno Friedmann [mailto:bruno@ioda-net.ch]
Sent: Thursday, September 19, 2013 11:50 AM
To: Mahesh Rajashekhara
Cc: opensuse-kernel@opensuse.org; Achim Leubner; Tony Ruiz; francois.spinosi@sigeom.ch
Subject: Re: aacraid driver crash when it shouldn't

Thanks for asking. I have attached a tar.bz2 with all the information I could grab from the computer, in the hope that you will be able to reproduce the issue.
A backup link is at https://dl.dropboxusercontent.com/u/13333867/clochette.support-logs.tar.bz2
(The controller showed the same kind of error with another motherboard, based on an AMD X6 1090.)

The exact way to reproduce is a bit complicated:

On the controller we have 8 Intel 510 SSDs (same firmware), which should be on Adaptec's supported device list, assembled in a RAID6 with a 1 MB chunk size.

On the operating system (pure openSUSE 12.3 x86_64) we have defined an LVM volume group containing the whole RAID (sde). On it we have 2 LVs (~250GB each), which are used as the primary disks for the 2 virtual machines.

openSUSE 12.3 gives you a 3.7.10 kernel. We tried the latest kernel available this weekend, 3.11.0-2.g0a1c41f-default. A newer 3.11.1 has been published since, but we have not upgraded to it yet.

We use qemu-kvm virtualisation and set up 2 VMs (Windows 2008-R2 64-bit) with the stable virtio drivers 0.1.5.2 from Fedora installed, to support paravirtualised disk and network.

The VMs are started with the following definition:

qemu 5285 18.9 25.0 13160744 8210400 ? Sl 07:28 7:33 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name peterpan -S -machine pc-0.15,accel=kvm,usb=off -cpu Westmere,+erms,+smep,+fsgsbase,+rdtscp,+rdrand,+f16c,+avx,+osxsave,+xsave,+tsc-deadline,+pcid,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 8192 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid bdc39043-721b-4bcd-9580-ff5a942f63c6 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/peterpan.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/vgvms/lvpeterpan_2008,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -drive file=/var/lib/libvirt/images/virtio-win-0.1-52.iso,if=none,id=drive-ide0-0-1,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:f0:d3:74,bus=pci.0,addr=0x3,bootindex=1 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -device vmware-svga,id=video0,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x6 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

What we discovered on Saturday is the following: if we set the disk cache to writethrough and run Oracle dumps (exp), the crash occurs more often than with cache=none.
With cache=none the VM and host ran for 3 days without major incident, but finally died last night.

What I can add is that the same hardware (controller and SSDs, with firmware 19076) ran for years without any trouble on kernel 3.1.10 (openSUSE 12.1).

On Tuesday 17 September 2013 23.27:54 Mahesh Rajashekhara wrote:
Hi,
Author: Mahesh Rajashekhara <Mahesh.Rajashekhara@XXXXX>
Date: Tue Jun 18 17:02:07 2013 +0530
SCSI: aacraid: Fix for arrays are going offline in the system. System hangs
commit c5bebd829dd95602c15f8da8cc50fa938b5e0254 upstream.
This driver patch fixes a race condition between the doorbell and the circular buffer. It addresses the SCSI command timeout issue with the Sync mode driver. As I understand it, you are using a Series 6 controller, for which the in-box driver switches to Async mode by default, so this patch has no effect in your case.
I have tried to duplicate this issue, but I have not been able to reproduce it yet.
Can you please send us the detailed steps for reproducing this issue?
OS ?
Kernel source download link ?
IO tool?
Also, can you please obtain the support archive from the controller after reboot (after the issue happened)?
The syntax is:
arcconf savesupportarchive
This stores all the logs in a log folder (the response shows the path where they are stored). Please provide all the files in that folder.
Thanks,
Mahesh
-----Original Message-----
From: Bruno Friedmann [mailto:bruno@ioda-net.ch]
Sent: Sunday, September 15, 2013 3:20 PM
To: opensuse-kernel@opensuse.org
Cc: Mahesh Rajashekhara
Subject: aacraid driver crash when it shouldn't

Hi there, I'm facing an issue with kernel 3.11 from our kernel-stable repos.
The machine is an Intel motherboard + i7 3770 with 32 GB RAM and an Adaptec 6805 adapter
(8 Intel SSD 510 series, 120 GB each, attached in a RAID6 array).
The Adaptec + SSD RAID6 bundle worked for months (with openSUSE 12.1) without any glitches.
Last night we received this kind of error:
Sep 15 02:06:52 clochette.disney.interne kernel: aacraid: Host adapter abort request (6,0,0,0)
Sep 15 02:06:52 clochette.disney.interne kernel: aacraid: Host adapter abort request (6,0,0,0)
Sep 15 02:06:52 clochette.disney.interne kernel: aacraid: Host adapter reset request. SCSI hang ?
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: [sde] Medium access timeout failure. Offlining disk!
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: Device offlined - not ready after error recovery
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: Device offlined - not ready after error recovery
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: [sde] Unhandled error code
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: [sde]
Sep 15 02:07:07 clochette.disney.interne kernel: Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: [sde] CDB:
Sep 15 02:07:07 clochette.disney.interne kernel: Write(10): 2a 00 02 b7 f4 00 00 00 20 00
Sep 15 02:07:07 clochette.disney.interne kernel: end_request: I/O error, dev sde, sector 45609984
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-4, logical block 5252224
Sep 15 02:07:07 clochette.disney.interne kernel: lost page write due to I/O error on dm-4
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-4, logical block 5252225
Sep 15 02:07:07 clochette.disney.interne kernel: lost page write due to I/O error on dm-4
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-4, logical block 5252226
Sep 15 02:07:07 clochette.disney.interne kernel: lost page write due to I/O error on dm-4
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-4, logical block 5252227
Sep 15 02:07:07 clochette.disney.interne kernel: lost page write due to I/O error on dm-4
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: [sde] Unhandled error code
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: [sde]
Sep 15 02:07:07 clochette.disney.interne kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: [sde] CDB:
Sep 15 02:07:07 clochette.disney.interne kernel: Write(10): 2a 00 31 3e f6 20 00 00 08 00
Sep 15 02:07:07 clochette.disney.interne kernel: end_request: I/O error, dev sde, sector 826209824
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-3, logical block 853188
Sep 15 02:07:07 clochette.disney.interne kernel: lost page write due to I/O error on dm-3
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: rejecting I/O to offline device
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-3, logical block 2981929
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: rejecting I/O to offline device
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: rejecting I/O to offline device
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-3, logical block 2981929
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: rejecting I/O to offline device
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-3, logical block 45451
Sep 15 02:07:07 clochette.disney.interne kernel: lost page write due to I/O error on dm-3
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: rejecting I/O to offline device
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-3, logical block 785908
Sep 15 02:07:07 clochette.disney.interne kernel: lost page write due to I/O error on dm-3
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: rejecting I/O to offline device
Sep 15 02:07:07 clochette.disney.interne kernel: Buffer I/O error on device dm-3, logical block 810442
Sep 15 02:07:07 clochette.disney.interne kernel: lost page write due to I/O error on dm-3
Sep 15 02:07:07 clochette.disney.interne kernel: sd 6:0:0:0: rejecting I/O to offline device
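As a cross-check on the log above, the two Write(10) CDBs can be decoded by hand. In a WRITE(10) command, bytes 2-5 hold the big-endian logical block address and bytes 7-8 hold the transfer length in blocks. The following is only a minimal Python sketch; the function name `decode_write10` is made up for illustration and is not part of any kernel or Adaptec tool.

```python
import struct

def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, transfer_length_in_blocks).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    # Byte 0: opcode, byte 1: flags, bytes 2-5: LBA (big-endian),
    # byte 6: group number, bytes 7-8: transfer length, byte 9: control.
    _, _, lba, _, length, _ = struct.unpack(">BBIBHB", cdb)
    return lba, length

# The two CDBs copied from the kernel log above.
for cdb in ("2a 00 02 b7 f4 00 00 00 20 00", "2a 00 31 3e f6 20 00 00 08 00"):
    lba, length = decode_write10(cdb)
    print(f"LBA {lba}, {length} blocks")
```

This prints LBA 45609984 (32 blocks) and LBA 826209824 (8 blocks), which match the sectors reported in the two `end_request` lines, so both failed commands were ordinary writes to sde that timed out.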
But after checking the kernel changelog, I found this commit in the 3.10 series, which should have fixed it?
Author: Mahesh Rajashekhara <Mahesh.Rajashekhara@XXXXX>
Date: Tue Jun 18 17:02:07 2013 +0530
SCSI: aacraid: Fix for arrays are going offline in the system. System hangs
commit c5bebd829dd95602c15f8da8cc50fa938b5e0254 upstream.
One of the customer had reported that the set of raid logical arrays will
become unavailable (I/O offline) after a long hours of IO stress test. The OS
wouldn`t be accessible afterwards and require a hard reset.
This driver patch has a fix for race condition between the doorbell and the
circular buffer. The driver is modified to do an extra read after clearing the
doorbell in case there had been a completion posted during the small timing
window.
With this fix, we ran IO stress for ~13 days. There were no IO failures.
Signed-off-by: Mahesh Rajashekhara <Mahesh.Rajashekhara@XXXX>
Signed-off-by: James Bottomley <JBottomley@XXXX>
Signed-off-by: Greg Kroah-Hartman <gregkh@XXXXX>
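To illustrate the kind of race the commit message describes, here is a toy Python model. It is not the aacraid code: the class, the function names and the single-threaded "window" hook are all invented for illustration, and the real driver works on hardware registers and interrupt context. It only shows why an extra read after clearing the doorbell picks up a completion posted during the small timing window.

```python
from collections import deque

class ToyAdapter:
    """Hypothetical model: a completion ring plus a doorbell flag."""
    def __init__(self):
        self.ring = deque()
        self.doorbell = False

    def post_completion(self, tag):
        # "Firmware" side: queue a completion and ring the doorbell.
        self.ring.append(tag)
        self.doorbell = True

def drain(adapter, done):
    # Move every queued completion to the list of finished commands.
    while adapter.ring:
        done.append(adapter.ring.popleft())

def service(adapter, done, extra_read, in_window=None):
    drain(adapter, done)
    if in_window:
        in_window()               # completion posted inside the timing window
    adapter.doorbell = False      # clearing also wipes the new notification
    if extra_read:
        drain(adapter, done)      # the extra read after clearing the doorbell

def run(extra_read):
    adapter, done = ToyAdapter(), []
    adapter.post_completion(1)
    service(adapter, done, extra_read,
            in_window=lambda: adapter.post_completion(2))
    return done, list(adapter.ring), adapter.doorbell

print(run(extra_read=False))
print(run(extra_read=True))
```

Without the extra read, command 2 stays in the ring while the doorbell is already cleared, so nothing prompts the driver to look again; in a real system that would eventually surface as a command timeout. With the extra read, both completions are handled.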
So it's hard to find what's going wrong.
Should the firmware be downgraded to the previous revision, 19076, as it was before the mainboard change and the openSUSE 12.3 installation?
On this RAID6 we have an LVM volume group containing virtual machines, which were running under a substantial load (Oracle dumps)
lasting 1.5 to 2 hours.
Thanks for any pointers or feedback.
I can provide more information if needed.
--
Bruno Friedmann
Ioda-Net Sàrl www.ioda-net.ch
openSUSE Member
GPG KEY : D5C9B751C4653227
irc: tigerfoot