[Bug 554152] New: mptscsih driver errors on high IO, server becomes unresponsive
http://bugzilla.novell.com/show_bug.cgi?id=554152

User r.ems@gmx.net added comment
http://bugzilla.novell.com/show_bug.cgi?id=554152#c000

           Summary: mptscsih driver errors on high IO, server becomes unresponsive
    Classification: openSUSE
           Product: openSUSE 11.1
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 11.1
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Kernel
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: r.ems@gmx.net
         QAContact: qa@suse.de
          Found By: ---

Created an attachment (id=326601)
 --> (http://bugzilla.novell.com/attachment.cgi?id=326601)
Extract from /var/log/messages from 1st freeze.

User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091103 SUSE/3.5.5-2.1 Firefox/3.5.5

On our server, while the nightly backups are running, the mptscsih driver throws many errors, the load on the server goes up to about 240, and the server becomes almost unresponsive. Only a reboot helps.

These backups have been running fine on the same hardware for over a year. The server is running openSuSE 11.1 with kernel-default-2.6.27.37-0.1.1.

Reproducible: Always

Steps to Reproduce:
1a. Let the nightly backup run. It uses dirvish (www.dirvish.org), which backs up with rsync and makes heavy use of hard links to previous backups. It will hang.
1b. Starting the same rsync command on the console also triggered similar errors, see 3rd attachment.

Actual Results:
The server gets a very high load. IO hangs. rsync gets into D state and cannot be killed.

Expected Results:
backup/rsync should run without errors.
Some hardware data:

# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: MATSHITA Model: DVD-ROM SR-8178 Rev: PZ16
  Type: CD-ROM ANSI SCSI revision: 05
Host: scsi8 Channel: 00 Id: 00 Lun: 00
  Vendor: IFT Model: A24U-G2421 Rev: 347G
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi9 Channel: 00 Id: 00 Lun: 00
  Vendor: Areca Model: ARC-1680-VOL#000 Rev: R001
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi9 Channel: 00 Id: 16 Lun: 00
  Vendor: Areca Model: RAID controller Rev: R001
  Type: Processor ANSI SCSI revision: 00
Host: scsi10 Channel: 00 Id: 00 Lun: 00
  Vendor: Areca Model: ARC-1220-VOL#00 Rev: R001
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi10 Channel: 00 Id: 16 Lun: 00
  Vendor: Areca Model: RAID controller Rev: R001
  Type: Processor ANSI SCSI revision: 00

# lspci
00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev b1)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 2-3 (rev b1)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 4-5 (rev b1)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev b1)
00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA Engine (rev b1)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09)
00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller (rev 09)
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E3 (rev 01)
03:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
03:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
05:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
08:00.0 RAID bus controller: Areca Technology Corp. ARC-1680 8 port PCIe/PCI-X to SAS/SATA II RAID Controller
09:00.0 PCI bridge: Intel Corporation 80333 Segment-A PCI Express-to-PCI Express Bridge
09:00.2 PCI bridge: Intel Corporation 80333 Segment-B PCI Express-to-PCI Express Bridge
0a:0e.0 RAID bus controller: Areca Technology Corp. ARC-1220 8-Port PCI-Express to SATA RAID Controller
0d:01.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=554152
User r.ems@gmx.net added comment
http://bugzilla.novell.com/show_bug.cgi?id=554152#c1
--- Comment #1 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
User r.ems@gmx.net added comment
http://bugzilla.novell.com/show_bug.cgi?id=554152#c2
--- Comment #2 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
zhu rensheng
http://bugzilla.novell.com/show_bug.cgi?id=554152
--- Comment #3 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c
Jeff Mahoney
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c4
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c5
Sathya Prakash Veerichetty
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c6
--- Comment #6 from kashyap desai
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c7
--- Comment #7 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c9
--- Comment #9 from kashyap desai
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c10
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c11
Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c12
roland kletzing
These backups have been running fine on the same hardware for over a year. The server is running openSuSE 11.1 with kernel-default-2.6.27.37-0.1.1.
so one more question may be: what changed since then to make this issue happen? if it worked for so long before, how can this be a driver issue, and why would it have gone undetected for so long? maybe one of your hard disks is flaky or starting to die, and maybe the controller/driver doesn't handle that properly?
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c13
--- Comment #13 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c14
--- Comment #14 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c15
--- Comment #15 from roland kletzing
The RAID system probably also changed, we were adding disks to the shelf, now it's full with 24 x 1 TB Hitachi HDDs. and how does the raid setup look like?
not sure what changed between .25 and .29. i don't have a link to git or whatever scm novell uses, but at least the .39 changelog tells this:

rpm -q --changelog -p kernel-source-2.6.27.39-0.2.1.src.rpm
---snipp---
* Tue Sep 22 2009 mmarek@suse.cz
Refresh patches to apply with older versions of patch(1)
- patches.drivers/alsa-post-ga-hda-stac-automic: Refresh.
- patches.drivers/cxgb3i-fix-skb-overrun: Refresh.
- patches.drivers/mpt-fusion-4.16.00.00-update: Refresh.
- patches.drivers/qla4xxx-5.01.00-k8_sles11-03-update: Refresh.
- patches.fixes/udf-faster_anchor_detection.patch: Refresh.
---snipp---
* Mon Sep 21 2009 hare@suse.de
- patches.drivers/mpt-fusion-4.00.43.00-update: Refresh.
- patches.drivers/mpt-fusion-4.16.00.00-update: Refresh.
---snipp---

so, it's likely that there were changes to mptscsi, and i (for myself) would try downgrading the kernel to see if it makes a difference.

anyway - you have an lsi scsi controller AND an areca raid controller in your system - i assume you have the backup disks attached to the areca controller? how is the areca raid controller related to mptscsih? doesn't that belong to the lsi controller?

not sure, but maybe it would be useful if you would describe your disk/raid/partition/device layout...!?
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c16
--- Comment #16 from roland kletzing
# cat /proc/scsi/mptspi/8 ioc0: LSI53C1030 C0, FwRev=01033010h, Ports=1, MaxQ=222 Is this the info you were asking for? Or do you need something else?
since i have seen error messages about firmware in your logs: if FwRev 01033010h reads "01.03.30.10", then there may be a newer firmware version for that controller - i have seen 1.03.48.00 on the Fujitsu website, for example. but let's wait for what the novell/lsi people have to tell....
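A minimal sketch of the reading suggested above, assuming the FwRev value from /proc/scsi/mptspi is simply four bytes written as two hex digits each. The function name `decode_fw_rev` is made up for illustration, and the byte-per-field layout is the assumption from the comment, not something confirmed by LSI in this thread.

```python
def decode_fw_rev(fw_rev: str) -> str:
    """Split a 32-bit hex FwRev string such as '01033010h' into dotted byte pairs.

    Assumption: each version field is one byte (two hex digits).
    """
    digits = fw_rev.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    # /proc/scsi/mptspi prints the value with a trailing 'h' for hexadecimal.
    digits = digits.rstrip("h")
    if len(digits) != 8 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a 32-bit hex value: {fw_rev!r}")
    return ".".join(digits[i:i + 2] for i in range(0, 8, 2))


if __name__ == "__main__":
    print(decode_fw_rev("01033010h"))
```

Read this way, the installed 01.03.30.10 can be compared field by field with the 1.03.48.00 mentioned for the Fujitsu download.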
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c17
--- Comment #17 from kashyap desai
(In reply to comment #11)
# cat /proc/scsi/mptspi/8 ioc0: LSI53C1030 C0, FwRev=01033010h, Ports=1, MaxQ=222 Is this the info you were asking for? Or do you need something else?
since i have seen error messages about firmware in your logs: if FwRev 01033010h reads "01.03.30.10", then there may be a newer firmware version for that controller - i have seen 1.03.48.00 on the Fujitsu website, for example.
but let`s wait what novell/lsi people have to tell.... LSI FW engineer is in touch with Hannes Reinecke and latest update from LSI FW engineer is
That version is not the latest version and I'm reviewing changes since that version. Do you know if they are running IT or IR firmware? ----
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c18
--- Comment #18 from Richard Ems
The RAID system probably also changed, we were adding disks to the shelf, now it's full with 24 x 1 TB Hitachi HDDs. and how does the raid setup look like?
The backup system is a RAID 6 array, with 23 x 1 TB HDDs, 1 HDD is set as hot spare. XFS filesystem on it.
so, it`s likely that there were changes to mptscsi and i (for myself) would try downgrading the kernel to see if it makes a difference.
This is a production system. Perhaps I can downgrade next weekend, but this depends on our production load.
anyway - you have lsi scsi controller AND areca raid controller in your system -i assume you have the backup disks attached to the areca controller?
No no. Forget the Areca controller, it has nothing to do with this errors, at least nothing that *I* know of, it would be very weird if it does. The Infortrend system used for the backups is attached to the LSI SCSI controller. On the Areca controller we have also a RAID 6 array, mounted at /home . But all the errors come from the mptscsih driver, nothing from the arcmsr driver.
how is areca raid controller related to mptscsih? doesn`t that belong to the lsi controller?
not sure, but maybe it would be useful if you would describe your disk/raid/partition/device layout...!?
So here again:

1. openSUSE 11.1 64 bit, kernel 2.6.27.39-0.2-default, on local RAID 1, 2 x 160 GB HDDs. Everything running fine here.
2. one external HDD shelf, mounted at /home, attached to the Areca 1680 controller, with one RAID 6 array on 11 x 1 TB Seagate SAS HDDs.
3. one external Infortrend A24U-G2421 system for backups, attached to the LSI SCSI controller, with one RAID 6 array on 23 x 1 TB Hitachi SATA HDDs, XFS filesystem. Here is where we are having the freezes.

c3m:~ # cat /proc/partitions
major minor  #blocks  name
   8     0 20506407936 sda
   8     1 20506407902 sda1
   8    16 12695308288 sdb
   8    17 12695308254 sdb1
   8    32   156249856 sdc
   8    33    10490413 sdc1
   8    34     2104515 sdc2
   8    35   143653230 sdc3
 253     0    41943040 dm-0
 253     1    52428800 dm-1
 253     2    10485760 dm-2
 253     3    10485760 dm-3
 253     4    10485760 dm-4
 253     5    10485760 dm-5

c3m:~ # cat /etc/fstab
/dev/disk/by-id/scsi-20004d927fffff800-part1 / ext3 acl,user_xattr 1 1
/dev/disk/by-id/dm-name-vg1-opt /opt ext3 acl,user_xattr 1 2
/dev/disk/by-id/dm-name-vg1-srv /srv ext3 acl,user_xattr 1 2
/dev/disk/by-id/dm-name-vg1-tmp /tmp ext3 acl,user_xattr 1 2
/dev/disk/by-id/dm-name-vg1-usr /usr ext3 acl,user_xattr 1 2
/dev/disk/by-id/dm-name-vg1-software /usr/local/software ext3 acl,user_xattr 1 2
/dev/disk/by-id/dm-name-vg1-var /var ext3 acl,user_xattr 1 2
### /dev/disk/by-id/scsi-20004d927fffff802-part1 /home-old xfs defaults,ro 1 2
LABEL=home2 /home xfs defaults 1 2
### /dev/disk/by-id/scsi-20004d927fffff800-part2 swap swap defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs noauto 0 0
debugfs /sys/kernel/debug debugfs noauto 0 0
usbfs /proc/bus/usb usbfs noauto 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0
/dev/disk/by-id/scsi-3600d02300070a9c30ffb202f83b35100-part1 /backup/IFT xfs defaults,rw 0 0
/backup/IFT /backup/IFT auto bind
/backup/IFT /backup/IFT auto bind,remount,ro
/dev/disk/by-id/usb-LaCie_BiggerDisk_AABF04D7133B-0:0-part1 /backup/LaCie-disc-1 xfs noauto 0 0
/dev/disk/by-id/usb-LaCie_BiggerDisk_AABF04D40115-0:0-part1 /backup/LaCie-disc-2 xfs noauto 0 0

c3m:~ # df -hl
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdc1   9.9G  732M  8.7G   8% /
udev        2.0G  140K  2.0G   1% /dev
/dev/dm-0    40G   27G   12G  70% /opt
/dev/dm-2   9.9G  184M  9.2G   2% /srv
/dev/dm-3   9.9G  161M  9.2G   2% /tmp
/dev/dm-4   9.9G  5.8G  3.7G  62% /usr
/dev/dm-1    50G   38G   13G  76% /usr/local/software
/dev/dm-5   9.9G  1.1G  8.3G  12% /var
/dev/sdb1    12T   11T  1.7T  87% /home
/dev/sda1    20T   18T  1.5T  93% /backup/IFT

This is it. More info needed?
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c19
--- Comment #19 from Richard Ems
(In reply to comment #16)
(In reply to comment #11)
# cat /proc/scsi/mptspi/8 ioc0: LSI53C1030 C0, FwRev=01033010h, Ports=1, MaxQ=222 Is this the info you were asking for? Or do you need something else?
since i have seen error messages about Firmware in your logs: if FwRew 01033010h reads "01.03.30.10", then there may be a newer firmware version for that controller - i have seen 1.03.48.00 on Fujitsu Website for example.
but let`s wait what novell/lsi people have to tell.... LSI FW engineer is in touch with Hannes Reinecke and latest update from LSI FW engineer is
That version is not the latest version and I'm reviewing changes since that version.
Do you know if they are running IT or IR firmware?
What is the difference between "IT" and "IR" firmware? Your question was to Roland, right?
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c20
--- Comment #20 from kashyap desai
(In reply to comment #17)
(In reply to comment #16)
(In reply to comment #11)
# cat /proc/scsi/mptspi/8 ioc0: LSI53C1030 C0, FwRev=01033010h, Ports=1, MaxQ=222 Is this the info you were asking for? Or do you need something else?
since i have seen error messages about Firmware in your logs: if FwRew 01033010h reads "01.03.30.10", then there may be a newer firmware version for that controller - i have seen 1.03.48.00 on Fujitsu Website for example.
but let`s wait what novell/lsi people have to tell.... LSI FW engineer is in touch with Hannes Reinecke and latest update from LSI FW engineer is
That version is not the latest version and I'm reviewing changes since that version.
Do you know if they are running IT or IR firmware?
What is the difference between "IT" and "IR" firmware?
IT = Initiator/Target and IR = Integrated RAID.
Your question was to Roland, right?
Anyway, comment #18 has the answer to my question. It is IR firmware.
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c21
--- Comment #21 from kashyap desai
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c22
--- Comment #22 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c23
--- Comment #23 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c24
--- Comment #24 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c27
--- Comment #27 from kashyap desai
Created an attachment (id=344059) --> (http://bugzilla.novell.com/attachment.cgi?id=344059) [details] Extract from /var/log/messages
A few questions for you.

1. Why does Domain Validation come so late? See:

Feb 20 16:27:42 c3m kernel: scsi target8:0:0: Beginning Domain Validation
Feb 20 16:27:43 c3m kernel: scsi target8:0:0: Ending Domain Validation
Feb 20 16:27:43 c3m kernel: scsi target8:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)
Feb 20 16:27:43 c3me PCS: Lost the local network management interface-to-UPS communication. 0x010

Can you wait for this print and then start your rsync? I need some more logs.

2. What does the log show before you start rsync?

3. Can you modify the driver and recompile it? If yes, please add __func__ and __LINE__ information in a printk wherever the DID_NO_CONNECT return value is set in mptfusion (especially mptscsih and mptspi).

4. How did you start your rsync process? (I guess the fs is already available and you just mounted it to some directory, right?) What is the log before you mount and after you mount, and what is the difference? When does this Domain Validation come?

--Kashyap
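As a possible aid for point 3, here is a small sketch that lists the source lines mentioning DID_NO_CONNECT, so that a printk with __func__ and __LINE__ can be added next to each of them. The helper name `find_did_no_connect` and the C snippet in `SAMPLE` are hypothetical and only illustrate the idea; they are not taken from the mptfusion sources.

```python
import re

# Match the identifier as a whole word, not as part of a longer name.
PATTERN = re.compile(r"\bDID_NO_CONNECT\b")


def find_did_no_connect(source: str):
    """Return (line_number, stripped_line) for every line that mentions DID_NO_CONNECT."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if PATTERN.search(line)
    ]


# Hypothetical driver fragment, made up for this example.
SAMPLE = """\
static int example_qcmd(struct scsi_cmnd *sc)
{
    if (!target_present) {
        sc->result = DID_NO_CONNECT << 16;
        return 0;
    }
    return 1;
}
"""

if __name__ == "__main__":
    for number, text in find_did_no_connect(SAMPLE):
        print(f"line {number}: {text}")
```

To use it on a real tree, the content of each driver file (for example mptscsih.c and mptspi.c) would be read and passed in as `source`.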
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c
Philip Oswald
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c28
--- Comment #28 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c
Philip Oswald
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c29
Philip Oswald
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c30
--- Comment #30 from kashyap desai
Kashyap, What else do you need? Sorry for delay in response.
The issue has a huge scope, so first we need to narrow it down. Here are some points that need to be discussed.

1. I have seen many I/O errors in the logs (see the messages below):

Nov 10 11:40:25 c3m kernel: sd 8:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 10 11:40:25 c3m kernel: end_request: I/O error, dev sda, sector 14156255378
Nov 10 11:40:25 c3m kernel: Buffer I/O error on device sda1, logical block 1769531918

-> Why are those errors coming? Is this the root cause of the issue?

My guess from the above messages: the driver is sending DID_NO_CONNECT to the mid-layer. (To make sure, I need logs that print the IOCStatus. This is what I asked for in comment #27. We can skip this for now, since I am 99% sure that the FW is sending IOCSTATUS_SCSI_DEVICE_NOT_THERE back to the driver.)

We need to know why. Is the device connected at 8:0:0:0 really shaky? If any of the drives that are part of

/dev/sda1 20T 19T 273G 99% /backup/IFT

is BAD/SHAKY, then we need to narrow the scope of the issue by some route. I don't know how far that is possible (since it is not a test machine). Do you think this can be done?

2. Why task aborts come so frequently is the second issue. See the set of prints below.

--
Nov 10 11:41:33 c3m kernel: mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID not found
Nov 10 11:41:33 c3m kernel: mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID not found
Nov 10 11:42:04 c3m kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880074355bc0)
Nov 10 11:42:04 c3m kernel: sd 8:0:0:0: [sda] CDB: Write(16): 8a 00 00 00 00 03 4b c8 b1 6a 00 00 04 00 00 00
Nov 10 11:42:15 c3m kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
--

Because of the above state, all other IOs will be blocked, since all 64 IOs (shost->queue_depth) are pending at the driver, and this leads to a 30 second delay in further IO processing. Unfortunately, for all those IOs the driver has received "mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID not found". I feel this can be a side effect of an issue that has already occurred.

**** Most importantly, I would like to understand the end behavior when the server becomes unresponsive.

a) When you hit this issue, is it possible to kill your rsync and do "sg_reset -h /dev/sda"? Is the server still unresponsive afterwards?
b) Do you think that after the issue things never come back to normal? I mean, if you shut down your rsync and restart it, what happens? How do you get out of this issue? Is it a reboot or something else?

It will really help me if I have logs from 2-3 min before the issue occurs until the issue has occurred.

Thanks,
Kashyap
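For reading the "CDB: Write(16)" line in the prints above, here is a rough sketch that pulls the logical block address and the transfer length out of the 16 command bytes. It assumes the standard WRITE(16) layout (opcode 0x8a, bytes 2-9 as a big-endian LBA, bytes 10-13 as a big-endian block count); the function name `decode_write16` is made up for illustration, and the 512-byte block size used for the size estimate is an assumption about this device.

```python
def decode_write16(cdb_hex: str):
    """Decode a WRITE(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of logical blocks to transfer.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16:
        raise ValueError("a WRITE(16) CDB must be exactly 16 bytes long")
    if cdb[0] != 0x8A:
        raise ValueError(f"unexpected opcode 0x{cdb[0]:02x}, wanted 0x8a")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13
    return lba, blocks


if __name__ == "__main__":
    lba, blocks = decode_write16("8a 00 00 00 00 03 4b c8 b1 6a 00 00 04 00 00 00")
    # Assumption: 512-byte logical blocks.
    print(f"LBA {lba} (0x{lba:x}), {blocks} blocks, {blocks * 512 // 1024} KiB")
```

The decoded LBA can then be compared with the sector numbers reported in the "end_request: I/O error" lines to see whether the aborted write falls in the same region of sda.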
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c31
Philip Oswald
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c32
Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c33
--- Comment #33 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c34
--- Comment #34 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c35
--- Comment #35 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c36
--- Comment #36 from kashyap desai
Kashyap, I will be able to continue testing today, probably also on Monday and perhaps on Tuesday.
Richard
Now it has cleared up some of my doubts. Can we switch to email and post our analysis here later? - Kashyap
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c37
--- Comment #37 from Richard Ems
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c38
--- Comment #38 from kashyap desai
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c39
--- Comment #39 from kashyap desai
http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c40
Philip Oswald