[Bug 554152] mptscsih driver errors on high IO, server becomes unresponsive

29 Mar 2010

      http://bugzilla.novell.com/show_bug.cgi?id=554152

http://bugzilla.novell.com/show_bug.cgi?id=554152#c30

--- Comment #30 from kashyap desai <kashyap.desai@lsi.com> 2010-03-29 07:49:47 UTC ---
(In reply to comment #29)
> ...
> Kashyap, What else do you need?
Sorry for the delay in responding.
The issue has a huge scope, so first we need to narrow it down.
Here are some points that need to be discussed.

1. I have seen many I/O errors in the logs (see the messages below).
"Nov 10 11:40:25 c3m kernel: sd 8:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK,SUGGEST_OK 
Nov 10 11:40:25 c3m kernel: end_request: I/O error, dev sda, sector 14156255378
Nov 10 11:40:25 c3m kernel: Buffer I/O error on device sda1, logical block
1769531918 "

-> Why are these errors occurring, and are they the root cause of this issue? My
guess from the above messages: the driver is returning "DID_NO_CONNECT" to the
mid-layer. (To be sure, I would need logs that print the IOCStatus, which is
what I asked for in comment #27. We can skip this for now, since I am 99% sure
that the FW is sending IOCSTATUS_SCSI_DEVICE_NOT_THERE back to the driver.)
We need to know why.
Is the device connected at 8:0:0:0 really shaky?
If any of the drives that are part of
/dev/sda1              20T   19T  273G  99% /backup/IFT
is bad/shaky, then we need to narrow the scope of the issue by some route. I
don't know how feasible that is (since it is not a test machine).
Do you think this can be done?
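To get a quick picture of how often each device reports a given host byte, the kernel log can be tallied with a short script. This is only a minimal sketch: the regular expression and the names `RESULT_RE`, `tally_hostbytes` and `SAMPLE_LOG` are illustrative assumptions, and it assumes each log entry sits on a single line (the messages quoted above are wrapped by the mail client).

```python
import re
from collections import Counter

# Assumed single-line form of the "Result:" message quoted above, e.g.
#   ... kernel: sd 8:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=...
RESULT_RE = re.compile(
    r"kernel: sd (?P<hctl>\d+:\d+:\d+:\d+): \[(?P<disk>\w+)\] "
    r"Result: hostbyte=(?P<hostbyte>\w+)"
)


def tally_hostbytes(lines):
    """Count (host:channel:target:lun, hostbyte) pairs found in log lines."""
    counts = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            counts[(match.group("hctl"), match.group("hostbyte"))] += 1
    return counts


# Sample lines taken from the log excerpt in this comment (unwrapped).
SAMPLE_LOG = [
    "Nov 10 11:40:25 c3m kernel: sd 8:0:0:0: [sda] Result: "
    "hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK",
    "Nov 10 11:40:25 c3m kernel: end_request: I/O error, dev sda, "
    "sector 14156255378",
]

if __name__ == "__main__":
    for (hctl, hostbyte), count in tally_hostbytes(SAMPLE_LOG).items():
        print(hctl, hostbyte, count)
```

Run against the full /var/log/messages, this would show whether DID_NO_CONNECT is confined to 8:0:0:0 or also appears on other devices.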

2. Why frequent task aborts are occurring is the second issue.
See the set of log messages below.

--
Nov 10 11:41:33 c3m kernel: mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID
not found
Nov 10 11:41:33 c3m kernel: mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID
not found
Nov 10 11:42:04 c3m kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff880074355bc0)
Nov 10 11:42:04 c3m kernel: sd 8:0:0:0: [sda] CDB: Write(16): 8a 00 00 00 00 03
4b c8 b1 6a 00 00 04 00 00 00
Nov 10 11:42:15 c3m kernel: mptscsih: ioc0: WARNING - Issuing Reset from
mptscsih_IssueTaskMgmt!!

--
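The CDB printed with the task abort can be decoded to see which part of the disk the aborted write was aimed at. The sketch below assumes the standard WRITE(16) layout (opcode 0x8a, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13); the function name `decode_write16` is illustrative only.

```python
def decode_write16(cdb_hex):
    """Return (lba, block_count) from a WRITE(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between byte pairs are accepted
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return lba, blocks


# CDB copied from the "attempting task abort" message above.
lba, blocks = decode_write16("8a 00 00 00 00 03 4b c8 b1 6a 00 00 04 00 00 00")

if __name__ == "__main__":
    print(f"LBA=0x{lba:x} ({lba}), blocks={blocks}")
```

For the CDB above this gives LBA 0x34bc8b16a with a transfer length of 1024 blocks, which is in the same region of the disk as the failing sector reported by end_request.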

Because of the above state, all other I/Os will be blocked, since all 64 I/Os
(shost->queue_depth) are pending in the driver, and this leads to a 30-second
delay in further I/O processing. Unfortunately, for all those I/Os the driver
has received
" mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID not found"

I feel this could be a side effect of an issue that had already occurred.

**** Most importantly, I would like to understand the end behavior when the
server becomes unresponsive.
a) When you hit this issue, is it possible to kill your rsync and run
"sg_reset -h /dev/sda"? Is the server still unresponsive afterwards?
b) Do things never come back to normal after the issue? I mean, if you shut
down your rsync and restart it, what happens? How do you recover from this
issue? Is it a reboot or something else?

It would really help me to have logs covering the 2-3 minutes before the issue
occurs, up to the point where it has occurred.
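One way to cut such a window out of a large syslog file is sketched below. It is only an illustration: the names `parse_stamp` and `window_before` are made up, the year is an assumption (syslog timestamps carry none), and the first "attempting task abort" message is used as a stand-in for the moment the issue occurs.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the 15-character syslog timestamp, e.g. 'Nov 10 11:42:04'."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # wrapped continuation lines carry no timestamp


def window_before(lines, marker, minutes=3, year=2009):
    """Return lines from `minutes` before the first `marker` line up to it."""
    hit = next(
        (parse_stamp(l, year) for l in lines
         if marker in l and parse_stamp(l, year) is not None),
        None,
    )
    if hit is None:
        return []
    start = hit - timedelta(minutes=minutes)
    selected = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None and start <= stamp <= hit:
            selected.append(line)
    return selected


# Sample lines taken from the excerpts in this comment (unwrapped).
SAMPLE_WINDOW_LOG = [
    "Nov 10 11:40:25 c3m kernel: end_request: I/O error, dev sda, sector 14156255378",
    "Nov 10 11:41:33 c3m kernel: mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID not found",
    "Nov 10 11:42:04 c3m kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880074355bc0)",
    "Nov 10 11:42:15 c3m kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!",
]

if __name__ == "__main__":
    for line in window_before(SAMPLE_WINDOW_LOG, "attempting task abort"):
        print(line)
```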

Thanks,
Kashyap


bugzilla_noreply＠novell.com