http://bugzilla.novell.com/show_bug.cgi?id=554152
http://bugzilla.novell.com/show_bug.cgi?id=554152#c30

--- Comment #30 from kashyap desai <kashyap.desai@lsi.com> 2010-03-29 07:49:47 UTC ---
(In reply to comment #29)
> Kashyap, What else do you need?

Sorry for the delay in responding.
The issue has a huge scope, so first we need to narrow it down. Here are some points that need to be discussed.

1. I have seen many I/O errors in the logs (see the messages below):

   Nov 10 11:40:25 c3m kernel: sd 8:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
   Nov 10 11:40:25 c3m kernel: end_request: I/O error, dev sda, sector 14156255378
   Nov 10 11:40:25 c3m kernel: Buffer I/O error on device sda1, logical block 1769531918

   -> Why are these errors occurring? Are they the root cause of this issue?

   My guess from the messages above: the driver is sending DID_NO_CONNECT to the mid-layer. (To confirm this I need logs that print the IOCStatus, which is what I asked for in comment #27. We can skip that for now, since I am 99% sure the FW is returning IOCSTATUS_SCSI_DEVICE_NOT_THERE to the driver.)

   We need to know why. Is the device connected at 8:0:0:0 really shaky? If any of the drives that make up

   /dev/sda1 20T 19T 273G 99% /backup/IFT

   is BAD/SHAKY, then we need to narrow the scope of the issue by some route. I don't know how feasible that is, since this is not a test machine. Do you think it can be done?

2. Why the frequent task aborts are occurring is the second issue. See the set of prints below.

   --
   Nov 10 11:41:33 c3m kernel: mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID not found
   Nov 10 11:41:33 c3m kernel: mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID not found
   Nov 10 11:42:04 c3m kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880074355bc0)
   Nov 10 11:42:04 c3m kernel: sd 8:0:0:0: [sda] CDB: Write(16): 8a 00 00 00 00 03 4b c8 b1 6a 00 00 04 00 00 00
   Nov 10 11:42:15 c3m kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
   --

   Because of the above state, all other IOs are blocked, since all 64 IOs (shost->queue_depth) are pending at the driver, and this leads to a 30 second delay in further IO processing. Unfortunately, for all of those IOs the driver has received "mptbase: ioc0: LogInfo(0x11010000): F/W: bug! MID not found". I feel this could be a side effect of an issue that had already occurred.

**** Most importantly, I would like to understand the end behavior when the server becomes unresponsive.

a) When you hit this issue, is it possible to kill your rsync and run "sg_reset -h /dev/sda"? Is the server still unresponsive afterwards?

b) Do you think that after the issue, things never come back to normal? I mean, if you shut down your rsync and restart it, what happens? How do you recover from this issue? Is it a reboot or something else?

It will really help me if I have logs from 2-3 minutes before the issue occurs up to the point where it occurs.

Thanks,
Kashyap
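
For reference, the CDB printed with the task abort can be decoded to see which LBA the aborted write targeted, so it can be compared with the sector in the I/O error messages. This is only a minimal sketch, assuming the standard WRITE(16) layout (opcode 0x8a, 8-byte big-endian LBA in bytes 2-9, 4-byte big-endian transfer length in bytes 10-13). The function name decode_write16 is hypothetical and is not part of any driver or tool.

```python
def decode_write16(cdb_hex: str) -> dict:
    """Decode a WRITE(16) CDB given as space-separated hex bytes.

    Assumes the standard layout: byte 0 opcode (0x8a),
    bytes 2-9 LBA, bytes 10-13 transfer length in blocks.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    return {
        "opcode": cdb[0],
        "lba": int.from_bytes(cdb[2:10], "big"),
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


# CDB copied from the "attempting task abort" log entry above.
info = decode_write16("8a 00 00 00 00 03 4b c8 b1 6a 00 00 04 00 00 00")
print("LBA:", info["lba"], "blocks:", info["blocks"])
```

The decoded LBA can then be checked against the "end_request: I/O error, dev sda, sector ..." lines to see whether the aborted commands and the failed sectors fall in the same area of the volume.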