[Bug 650593] New: [LSI CR183682] No response messages from DM path down with VDs on no preferred path

https://bugzilla.novell.com/show_bug.cgi?id=650593
https://bugzilla.novell.com/show_bug.cgi?id=650593#c0

           Summary: [LSI CR183682] No response messages from DM path down
                    with VDs on no preferred path
    Classification: openSUSE
           Product: openSUSE 11.1
           Version: Final
          Platform: x86-64
        OS/Version: SLES 11
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: chris.chavez@lsi.com
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

Created an attachment (id=398049)
 --> (http://bugzilla.novell.com/attachment.cgi?id=398049)
serial console logs and syslog

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7

There appears to be an issue at the transport layer or the lower driver layer
that keeps returning DID_TRANSPORT_FAILFAST forever, which causes DM multipath
(DMMP) to hang. Below is the analysis from the DMMP failover team:

Configuration on which I could easily reproduce this issue:
- SLES 11 SP1, kernel 2.6.32.12-0.7-default
- Broadcom 57710 NIC with inbox driver bnx2x 1.52.1-7
- inbox open-iscsi 2.0.871-0.20.3
- inbox scsi_dh_rdac 1.5.2.0-1 (with debug enabled)
- IOMonkey tool
- 50 LUNs for I/O
- two paths per controller (4 iSCSI sessions from host to SAN array)
- Dell MD36xxi controller

In the above configuration, multipathd freezes during failover and failback.
The multipathd freeze is caused by an I/O hang on the physical paths. The
snippets below should give a better understanding of the issue. This analysis
should be enough to show that this is not a device-mapper issue; further
investigation should be done on the iSCSI driver for the I/O hang. Hope this
helps.

Findings and log analysis (controller serial logs and syslog are attached):

* At the start of the test, each DM device had 2 paths per controller and all
  paths were in the active state. I/O was started on all 50 LUNs.

#multipath -ll 36842b2b0001681f0000072293b28b78c
mpathax (36842b2b0001681f0000072293b28b78c) dm-47 DELL,MD36xxi
size=4.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=6 status=active
| |- 4:0:0:48 sdbw 68:160  active ready running
| `- 6:0:0:48 sdgq 132:96  active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  |- 5:0:0:48 sdcu 70:32   active ghost running
  `- 7:0:0:48 sdgh 131:208 active ghost running

* When controller A was taken offline, the paths from controller A were, as
  expected, marked as failed by DM multipath. For the failed paths, the sg_inq
  command returned DID_TRANSPORT_FAILFAST forever; I was expecting
  DID_NO_CONNECT after the session recovery timeout.
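
[Editorial note, hedged: the session recovery timeout referred to here is the
open-iscsi replacement timeout for the session. As a diagnostic sketch only
(sysfs paths and output wording can differ between open-iscsi versions), the
effective value on the host can be checked with:

#iscsiadm -m session -P 3 | grep -i 'recovery timeout'
#cat /sys/class/iscsi_session/session*/recovery_tmo
#grep replacement_timeout /etc/iscsi/iscsid.conf

With the usual default of 120 seconds, the transition the reporter expects,
from DID_TRANSPORT_FAILFAST to DID_NO_CONNECT, should have happened long
before the multi-minute windows covered by the logs below.]
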
--> controller A offlined -> cmgrSetAltToFailed 1

10/30/10-05:41:23 (tShell1): NOTE: vdm::syncRequired(): Begin
10/30/10-05:41:57 (tShell1): NOTE: vdm::syncRequired(): Complete, elapsed time = 34 seconds
10/30/10-05:41:57 (tShell1): NOTE: CCM: takeover() setting MOS
10/30/10-05:41:57 (tShell1): WARN: Alt Ctl Reboot:

#multipath -ll 36842b2b0001681f0000072293b28b78c
Oct 30 00:57:38 | sdcu: rdac prio: inquiry command indicates error
Oct 30 00:57:38 | sdgh: rdac prio: inquiry command indicates error
mpathax (36842b2b0001681f0000072293b28b78c) dm-47 DELL,MD36xxi
size=4.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=6 status=active
| |- 4:0:0:48 sdbw 68:160  active ready running
| `- 6:0:0:48 sdgq 132:96  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 5:0:0:48 sdcu 70:32   failed faulty running
  `- 7:0:0:48 sdgh 131:208 failed faulty running

* When controller A was brought back online, I expected the controller A paths
  to go from failed back to active, but that never happened. sg_inq on those
  physical paths still reported DID_TRANSPORT_FAILFAST.

--> controller A was made online -> cmgrSetAltToOptimal

10/30/10-05:50:04 (tShell1): NOTE: releasing alt ctl from reset

#multipath -ll 36842b2b0001681f0000072293b28b78c
Oct 30 00:57:38 | sdcu: rdac prio: inquiry command indicates error
Oct 30 00:57:38 | sdgh: rdac prio: inquiry command indicates error
mpathax (36842b2b0001681f0000072293b28b78c) dm-47 DELL,MD36xxi
size=4.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=6 status=active
| |- 4:0:0:48 sdbw 68:160  active ready running
| `- 6:0:0:48 sdgq 132:96  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 5:0:0:48 sdcu 70:32   failed faulty running
  `- 7:0:0:48 sdgh 131:208 failed faulty running

@ Sat Oct 30 01:09:40 CDT 2010
sles111host:/tmp # sg_inq -p0x80 /dev/sdcu
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

@ Sat Oct 30 01:18:05 CDT 2010
sles111host:/tmp # sg_inq -p0x80 /dev/sdcu
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

@ Sat Oct 30 01:31:33 CDT 2010
sles111host:/tmp/vijay # sg_inq -p0x80 /dev/sdcu
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

#multipath -ll 36842b2b0001681f0000072293b28b78c
Oct 30 00:57:38 | sdcu: rdac prio: inquiry command indicates error
Oct 30 00:57:38 | sdgh: rdac prio: inquiry command indicates error
mpathax (36842b2b0001681f0000072293b28b78c) dm-47 DELL,MD36xxi
size=4.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=6 status=active
| |- 4:0:0:48 sdbw 68:160  active ready running
| `- 6:0:0:48 sdgq 132:96  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 5:0:0:48 sdcu 70:32   failed faulty running
  `- 7:0:0:48 sdgh 131:208 failed faulty running
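
[Editorial note, hedged: host byte 0x0f in the sg_inq output above is
DID_TRANSPORT_FAILFAST in the kernel's SCSI host status codes, which this
sg_inq apparently does not recognize, hence "is invalid". A sketch of extra
per-path checks that could help localize whether the SCSI device, the iSCSI
session, or multipathd is stuck after the controller returns; device names
follow the example above and are illustrative:

#cat /sys/block/sdcu/device/state
#iscsiadm -m session
#iscsiadm -m session -P 3 | grep -i state
#multipathd -k"show paths"

The first command reports the SCSI device state (running/offline/blocked),
the iscsiadm commands show whether the corresponding iSCSI session and
connection ever left recovery, and the last shows multipathd's own view of
the path checker results.]
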
* When controller B was then taken offline, multipathd hung. This time the I/O
  hang was seen on the physical paths of controller B, while the paths from
  controller A still reported DID_TRANSPORT_FAILFAST.

--> controller B is offlined -> cmgrSetAltToFailed 1

10/30/10-06:52:56 (tShell1): NOTE: CCM: takeover() setting MOS
10/30/10-06:52:57 (tShell1): WARN: Alt Ctl Reboot: Reboot CompID: 0x407 Reboot reason: 0x6 Reboot reason extra: 0x0

#multipath -ll
<hang>

In syslog:
Oct 30 01:56:35 sles111host kernel: [160171.336055] connection1:0: detected conn error (1011)
Oct 30 01:59:35 sles111host kernel: [160351.644139] scsi 6:0:0:31: timing out command, waited 20s
Oct 30 01:59:55 sles111host kernel: [160371.648142] scsi 6:0:0:31: timing out command, waited 20s
Sat Oct 30 02:05:59 2010 localhost IOMonkey Warning: /home/dmMOUNT21 - No response for over 600 seconds on /home/dmMOUNT21:.
Sat Oct 30 02:06:07 2010 localhost IOMonkey Warning: /home/dmMOUNT17 - No response for over 600 seconds on /home/dmMOUNT17:.
Sat Oct 30 02:06:07 2010 localhost IOMonkey Warning: /home/dmMOUNT38 - No response for over 600 seconds on /home/dmMOUNT38:.

* Even when controller B was brought back online, the paths did not recover
  from the hang, and the "No response" messages kept appearing in the syslog.

--> controller B is made online -> cmgrSetAltToOptimal

10/30/10-07:07:05 (tShell1): NOTE: releasing alt ctl from reset value

In syslog:
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT43 - No response for over 900 seconds on /home/dmMOUNT43:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT44 - No response for over 900 seconds on /home/dmMOUNT44:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT45 - No response for over 900 seconds on /home/dmMOUNT45:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT46 - No response for over 900 seconds on /home/dmMOUNT46:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT47 - No response for over 900 seconds on /home/dmMOUNT47:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT48 - No response for over 900 seconds on /home/dmMOUNT48:.
Sat Oct 30 02:11:12 2010 localhost IOMonkey Warning: /home/dmMOUNT49 - No response for over 900 seconds on /home/dmMOUNT49:.
Sat Oct 30 02:11:27 2010 localhost IOMonkey Warning: /home/dmMOUNT22 - No response for over 900 seconds on /home/dmMOUNT22:.

@ Sat Oct 30 02:19:01 CDT 2010
sles111host:~ # sg_inq -p0x80 /dev/sdcu
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

sles111host:~ # sg_inq -p0x80 /dev/sdgh
VPD INQUIRY: Unit serial number page
inquiry: transport: Host_status=0x0f is invalid          ---controller A path
 Driver_status=0x00 [DRIVER_OK, SUGGEST_OK]
inquiry: failed, res=-1

sles111host:~ # sg_inq -p0x80 /dev/sdgq
VPD INQUIRY: Unit serial number page                     ---controller B path
<hang>

sles111host:~ # sg_inq -p0x80 /dev/sdbw
VPD INQUIRY: Unit serial number page                     ---controller B path
<hang>

#multipath -ll 36842b2b0001681f0000072293b28b78c
<hang>

Reproducible: Always

Steps to Reproduce:
1. Map 50 LUNs from the storage array to the host and run I/O to logical
   volumes on these devices (I/O was generated with IOMonkey; a rough stand-in
   is sketched after these steps).
2. Manually fail one controller through the controller management GUI or with
   the serial console command 'cmgrSetAltToFailed 1' from the alternate
   controller.
3. Verify that all volumes owned by the failed controller transfer to their
   non-preferred path.
4. multipath -ll reports the devices as active/active instead of active/failed.
5. Revive the failed controller from the controller management GUI or with the
   serial console command 'cmgrSetAltToOptimal 1' from the alternate
   controller.
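
[Editorial note, hedged: IOMonkey appears to be an in-house I/O exerciser, so
the following is only an illustrative stand-in for the sustained per-mount
load in step 1; the mount-point pattern, file name, and sizes are assumptions,
not part of the original report.

#!/bin/bash
# Illustrative load generator: keep every DM-backed mount point busy with
# direct-I/O writes and reads so all paths carry I/O during the failover test.
for m in /home/dmMOUNT*; do
    (
        while true; do
            dd if=/dev/zero of="$m/io_test.dat" bs=1M count=256 oflag=direct 2>/dev/null
            dd if="$m/io_test.dat" of=/dev/null bs=1M iflag=direct 2>/dev/null
        done
    ) &
done
wait
]
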
Actual Results:
multipath -ll hangs, sg_inq <failed device> reports DID_TRANSPORT_FAILFAST
forever, I/O attempts on these devices hang forever, and IOMonkey reports
'No Response' messages.

Expected Results:
Devices fail back and I/Os complete successfully.

This issue was also reproduced with 100 LUNs mapped and only 2 iSCSI sessions
from the host to the controller (twice as many DM logical devices, but the
same number of physical devices seen by DM). The issue only appears under
sufficient I/O stress on the host.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.