[Bug 650593] New: [LSI CR183682] No response messages from DM path down with VDs on no preferred path

https://bugzilla.novell.com/show_bug.cgi?id=650593
https://bugzilla.novell.com/show_bug.cgi?id=650593#c0

           Summary: [LSI CR183682] No response messages from DM path down
                    with VDs on no preferred path
    Classification: openSUSE
           Product: openSUSE 11.1
           Version: Final
          Platform: x86-64
        OS/Version: SLES 11
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: chris.chavez@lsi.com
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

Created an attachment (id=398049)
 --> (http://bugzilla.novell.com/attachment.cgi?id=398049)
serial console logs and syslog

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7

There appears to be an issue at the transport layer or the lower driver layer
that keeps returning DID_TRANSPORT_FAILFAST forever, which causes DM multipath
(DMMP) to hang. Below is the analysis from the DMMP failover team:

Configuration on which I could easily reproduce this issue:
- SLES 11 SP1, kernel 2.6.32.12-0.7-default
- Broadcom 57710 NIC with inbox driver bnx2x 1.52.1-7
- inbox open-iscsi 2.0.871-0.20.3
- inbox scsi_dh_rdac 1.5.2.0-1 (with debug enabled)
- IOMonkey tool
- 50 LUNs for I/O
- two paths per controller (4 iSCSI sessions from host to SAN array)
- Dell MD36xxi controller

In the above configuration, multipathd freezes during failover and failback.
The multipathd freeze is caused by an I/O hang on the physical paths. The
snippets below should give a better understanding of the issue. This analysis
should be enough to show that this is not a device-mapper issue; further
investigation should be done on the iSCSI driver for the I/O hang. Hope this
helps.

Findings and log analysis (controller serial logs and syslog are attached):

* At the start of the test, each DM device had 2 paths per controller and all
  paths were in the active state. I/O was started on all 50 LUNs.

#multipath -ll 36842b2b0001681f0000072293b28b78c
mpathax (36842b2b0001681f0000072293b28b78c) dm-47 DELL,MD36xxi
size=4.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=6 status=active
| |- 4:0:0:48 sdbw 68:160  active ready running
| `- 6:0:0:48 sdgq 132:96  active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  |- 5:0:0:48 sdcu 70:32   active ghost running
  `- 7:0:0:48 sdgh 131:208 active ghost running

* When controller A was taken offline, the paths from controller A were, as
  expected, marked as failed by DM multipath. For the failed paths, the sg_inq
  command returned DID_TRANSPORT_FAILFAST forever; I was expecting
  DID_NO_CONNECT after the session recovery timeout.
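
[Editorial note, hedged: the session recovery timeout referred to here is the
open-iscsi replacement timeout for the session. As a diagnostic sketch only
(sysfs paths and output wording can differ between open-iscsi versions), the
effective value on the host can be checked with:

#iscsiadm -m session -P 3 | grep -i 'recovery timeout'
#cat /sys/class/iscsi_session/session*/recovery_tmo
#grep replacement_timeout /etc/iscsi/iscsid.conf

With the usual default of 120 seconds, the transition the reporter expects,
from DID_TRANSPORT_FAILFAST to DID_NO_CONNECT, should have happened long
before the multi-minute windows covered by the logs below.]
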
--> controller A offlined -> cmgrSetAltToFailed 1

10/30/10-05:41:23 (tShell1): NOTE: vdm::syncRequired(): Begin
10/30/10-05:41:57 (tShell1): NOTE: vdm::syncRequired(): Complete, elapsed time = 34 seconds
10/30/10-05:41:57 (tShell1): NOTE: CCM: takeover() setting MOS
10/30/10-05:41:57 (tShell1): WARN: Alt Ctl Reboot:

#multipath -ll 36842b2b0001681f0000072293b28b78c
Oct 30 00:57:38 | sdcu: rdac prio: inquiry command indicates error
Oct 30 00:57:38 | sdgh: rdac prio: inquiry command indicates error
mpathax (36842b2b0001681f0000072293b28b78c) dm-47 DELL,MD36xxi
size=4.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=6 status=active
| |- 4:0:0:48 sdbw 68:160  active ready running
| `- 6:0:0:48 sdgq 132:96  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 5:0:0:48 sdcu 70:32   failed faulty running
  `- 7:0:0:48 sdgh 131:208 failed faulty running

* When controller A was brought back online, I expected the controller A paths
  to go from failed back to active, but that never happened. sg_inq on those
  physical paths still reported DID_TRANSPORT_FAILFAST.

--> controller A was made online -> cmgrSetAltToOptimal

10/30/10-05:50:04 (tShell1): NOTE: releasing alt ctl from reset

#multipath -ll 36842b2b0001681f0000072293b28b78c
Oct 30 00:57:38 | sdcu: rdac prio: inquiry command indicates error
Oct 30 00:57:38 | sdgh: rdac prio: inquiry command indicates error
mpathax (36842b2b0001681f0000072293b28b78c) dm-47 DELL,MD36xxi
size=4.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=6 status=active
| |- 4:0:0:48 sdbw 68:160  active ready running
| `- 6:0:0:48 sdgq 132:96  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 5:0:0:48 sdcu 70:32   failed faulty running
  `- 7:0:0:48 sdgh 131:208 failed faulty running

@ Sat Oct 30 01:09:40 CDT 2010
sles111host:/tmp # sg_inq -p0x80 /dev/sdcu
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

@ Sat Oct 30 01:18:05 CDT 2010
sles111host:/tmp # sg_inq -p0x80 /dev/sdcu
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

@ Sat Oct 30 01:31:33 CDT 2010
sles111host:/tmp/vijay # sg_inq -p0x80 /dev/sdcu
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

#multipath -ll 36842b2b0001681f0000072293b28b78c
Oct 30 00:57:38 | sdcu: rdac prio: inquiry command indicates error
Oct 30 00:57:38 | sdgh: rdac prio: inquiry command indicates error
mpathax (36842b2b0001681f0000072293b28b78c) dm-47 DELL,MD36xxi
size=4.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=6 status=active
| |- 4:0:0:48 sdbw 68:160  active ready running
| `- 6:0:0:48 sdgq 132:96  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 5:0:0:48 sdcu 70:32   failed faulty running
  `- 7:0:0:48 sdgh 131:208 failed faulty running
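
[Editorial note, hedged: host byte 0x0f in the sg_inq output above is
DID_TRANSPORT_FAILFAST in the kernel's SCSI host status codes, which this
sg_inq apparently does not recognize, hence "is invalid". A sketch of extra
per-path checks that could help localize whether the SCSI device, the iSCSI
session, or multipathd is stuck after the controller returns; device names
follow the example above and are illustrative:

#cat /sys/block/sdcu/device/state
#iscsiadm -m session
#iscsiadm -m session -P 3 | grep -i state
#multipathd -k"show paths"

The first command reports the SCSI device state (running/offline/blocked),
the iscsiadm commands show whether the corresponding iSCSI session and
connection ever left recovery, and the last shows multipathd's own view of
the path checker results.]
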
* When controller B was then taken offline, multipathd hung. This time the I/O
  hang was seen on the physical paths of controller B, while the paths from
  controller A still reported DID_TRANSPORT_FAILFAST.

--> controller B is offlined -> cmgrSetAltToFailed 1

10/30/10-06:52:56 (tShell1): NOTE: CCM: takeover() setting MOS
10/30/10-06:52:57 (tShell1): WARN: Alt Ctl Reboot: Reboot CompID: 0x407 Reboot reason: 0x6 Reboot reason extra: 0x0

#multipath -ll
<hang>

In syslog:
Oct 30 01:56:35 sles111host kernel: [160171.336055] connection1:0: detected conn error (1011)
Oct 30 01:59:35 sles111host kernel: [160351.644139] scsi 6:0:0:31: timing out command, waited 20s
Oct 30 01:59:55 sles111host kernel: [160371.648142] scsi 6:0:0:31: timing out command, waited 20s
Sat Oct 30 02:05:59 2010 localhost IOMonkey Warning: /home/dmMOUNT21 - No response for over 600 seconds on /home/dmMOUNT21:.
Sat Oct 30 02:06:07 2010 localhost IOMonkey Warning: /home/dmMOUNT17 - No response for over 600 seconds on /home/dmMOUNT17:.
Sat Oct 30 02:06:07 2010 localhost IOMonkey Warning: /home/dmMOUNT38 - No response for over 600 seconds on /home/dmMOUNT38:.

* Even when controller B was brought back online, the paths did not recover
  from the hang, and the "No response" messages kept appearing in the syslog.

--> controller B is made online -> cmgrSetAltToOptimal

10/30/10-07:07:05 (tShell1): NOTE: releasing alt ctl from reset value

In syslog:
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT43 - No response for over 900 seconds on /home/dmMOUNT43:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT44 - No response for over 900 seconds on /home/dmMOUNT44:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT45 - No response for over 900 seconds on /home/dmMOUNT45:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT46 - No response for over 900 seconds on /home/dmMOUNT46:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT47 - No response for over 900 seconds on /home/dmMOUNT47:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT48 - No response for over 900 seconds on /home/dmMOUNT48:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT49 - No response for over 900 seconds on /home/dmMOUNT49:.
Sat Oct 30 02:11:27 2010 localhost IOMonkey Warning: /home/dmMOUNT22 - No response for over 900 seconds on /home/dmMOUNT22:.

@ Sat Oct 30 02:19:01 CDT 2010
sles111host:~ # sg_inq -p0x80 /dev/sdcu
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

sles111host:~ # sg_inq -p0x80 /dev/sdgh
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

sles111host:~ # sg_inq -p0x80 /dev/sdgq
VPD INQUIRY: Unit serial number page                     ---controller B path
<hang>

sles111host:~ # sg_inq -p0x80 /dev/sdbw
VPD INQUIRY: Unit serial number page                     ---controller B path
<hang>

#multipath -ll 36842b2b0001681f0000072293b28b78c
<hang>

Reproducible: Always

Steps to Reproduce:
1. Map 50 LUNs from the storage array to the host and run I/O to logical
   volumes on these devices (I/O was generated with IOMonkey; a rough stand-in
   is sketched after these steps).
2. Manually fail one controller through the controller management GUI or with
   the serial console command 'cmgrSetAltToFailed 1' from the alternate
   controller.
3. Verify that all volumes owned by the failed controller transfer to their
   non-preferred path.
4. multipath -ll reports the devices as active/active instead of active/failed.
5. Revive the failed controller from the controller management GUI or with the
   serial console command 'cmgrSetAltToOptimal 1' from the alternate
   controller.
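
[Editorial note, hedged: IOMonkey appears to be an in-house I/O exerciser, so
the following is only an illustrative stand-in for the sustained per-mount
load in step 1; the mount-point pattern, file name, and sizes are assumptions,
not part of the original report.

#!/bin/bash
# Illustrative load generator: keep every DM-backed mount point busy with
# direct-I/O writes and reads so all paths carry I/O during the failover test.
for m in /home/dmMOUNT*; do
    (
        while true; do
            dd if=/dev/zero of="$m/io_test.dat" bs=1M count=256 oflag=direct 2>/dev/null
            dd if="$m/io_test.dat" of=/dev/null bs=1M iflag=direct 2>/dev/null
        done
    ) &
done
wait
]
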
Actual Results:
multipath -ll hangs, sg_inq <failed device> reports DID_TRANSPORT_FAILFAST
forever, I/O attempts on these devices hang forever, and IOMonkey reports
'No Response' messages.

Expected Results:
Devices fail back and I/Os complete successfully.

This issue was also reproduced with 100 LUNs mapped and only 2 iSCSI sessions
from the host to the controller (twice as many DM logical devices, but the
same number of physical devices seen by DM). The issue only appears under
sufficient I/O stress on the host.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.