Mailinglist Archive: opensuse-bugs (2746 mails)

[Bug 826486] New: SLES 11.3 x64 with DMMP crashes while adding paths to map
  • From: bugzilla_noreply@xxxxxxxxxx
  • Date: Mon, 24 Jun 2013 19:58:23 +0000
  • Message-id: <bug-826486-21960@http.bugzilla.novell.com/>

https://bugzilla.novell.com/show_bug.cgi?id=826486

https://bugzilla.novell.com/show_bug.cgi?id=826486#c0


Summary: SLES 11.3 x64 with DMMP crashes while adding paths to map
Classification: openSUSE
Product: openSUSE 12.3
Version: Final
Platform: x86-64
OS/Version: SLES 11
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@xxxxxxxxxxxxxxxxxxxxxx
ReportedBy: garrett.marks@xxxxxxxxxx
QAContact: qa-bugs@xxxxxxx
Found By: ---
Blocker: ---


Created an attachment (id=545385)
--> (http://bugzilla.novell.com/attachment.cgi?id=545385)
Output of multipath -ll command from one host.

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101
Firefox/21.0

While running an automated test on two hosts running SLES 11 SP3 (GMC2), each
connected to two NetApp E-series storage arrays over InfiniBand, the SUSE hosts
have crashed and rebooted on several occasions. The hosts have crashed
independently and, so far, never at the same time. The test that triggers the
issue fails one of the two controllers on each storage array and then fails
some drives in the volume groups on the array. The script then recovers the
controllers and begins reconstructing the drives. Every occurrence observed so
far has happened while the array controllers were being brought back online.

Configuration description
Two x64 hosts, each from a different vendor and each running SLES 11 SP3
(GMC2).
HCA: One host has ConnectX and ConnectX-3 adapters from Mellanox; the other
host has a ConnectX-2 adapter.
OFED: inbox (ofed-1.5.4.1-0.11.5).
Failover: DM-MP (device-mapper-1.02.77-0.11.33) using scsi_dh_rdac.
Connection: Connected to the storage through a Mellanox InfiniBand switch.
Switch: Mellanox SX6036 with MLNX_OS SX_3.3.3500
IBstat output:
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.11.500
    Hardware version: 0
    Node GUID: 0x0002c9030038c8e0
    System image GUID: 0x0002c9030038c8e3
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 25
        LMC: 0
        SM lid: 4
        Capability mask: 0x02514868
        Port GUID: 0x0002c9030038c8e1
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02514868
        Port GUID: 0x0002c9030038c8e2
        Link layer: InfiniBand
CA 'mlx4_1'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.9.1000
    Hardware version: a0
    Node GUID: 0x0002c90300087a12
    System image GUID: 0x0002c90300087a15
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 19
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300087a13
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300087a14
        Link layer: InfiniBand
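
For anyone comparing the two hosts, the ibstat listing above can be reduced to
"which ports are actually up" with a few lines of Python. This is only a
sketch: the function names parse_ibstat and active_ports are made up for this
example, the sample text is an abbreviated copy of the output above, and the
parser assumes the "CA '...'" / "Port N:" / "key: value" layout shown there.

```python
import re

def parse_ibstat(text):
    """Return {ca_name: {port_number: {field: value}}} from ibstat-style text."""
    cas = {}
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()          # indentation is not relied upon
        if not line:
            continue
        m = re.match(r"CA '([^']+)'$", line)
        if m:
            ca = cas.setdefault(m.group(1), {})
            port = None             # CA-level fields are skipped
            continue
        m = re.match(r"Port (\d+):$", line)
        if m and ca is not None:
            port = ca.setdefault(int(m.group(1)), {})
            continue
        if port is not None and ":" in line:
            key, _, value = line.partition(":")
            port[key.strip()] = value.strip()
    return cas

def active_ports(cas):
    """List (ca, port, rate) for every port whose State is Active."""
    return [(name, num, fields.get("Rate"))
            for name, ports in cas.items()
            for num, fields in sorted(ports.items())
            if fields.get("State") == "Active"]

# Abbreviated copy of the ibstat output quoted in this report.
SAMPLE = """\
CA 'mlx4_0'
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Port 2:
State: Down
Physical state: Disabled
Rate: 10
CA 'mlx4_1'
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Port 2:
State: Down
Physical state: Polling
Rate: 10
"""

if __name__ == "__main__":
    for ca, port, rate in active_ports(parse_ibstat(SAMPLE)):
        print(f"{ca} port {port}: Active at rate {rate}")
```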


Storage
Two NetApp E-series SRP storage arrays.
LUNs: 32 volumes mapped to each host from each array (64 volumes in total
mapped to each host)


Reproducible: Sometimes

Steps to Reproduce:
1. Install SLES 11 SP3 and configure InfiniBand OFED and DMMP.
2. Map 32 LUNs to each host from each of two E-series arrays.
3. Start IO on each host to the LUNs.
4. Fail one controller from each storage array.
5. Fail a couple of drives, leaving each volume group usable.
6. Set one of the failed controllers online.
7. Wait a couple of minutes.
8. Set the second failed controller online.

Actual Results:
When the problem occurs, the host reboots and leaves a crash dump behind.

Expected Results:
The host should not crash and should run IO without any problems while the
controllers come back online.

Attached is the output of "multipath -ll" from one of the hosts at the start of
the test.

In every occurrence, the host was in the process of adding paths to device
maps. The following example, from one host's messages file, shows the last
entries before the crash, followed by the host booting back up.

Jun 22 14:19:14 wicb-catawba multipathd: sddg path added to devmap mpathai
Jun 22 14:19:14 wicb-catawba multipathd: sddk: add path (uevent)
Jun 22 14:19:14 wicb-catawba multipathd: sddk: using deprecated getuid callout
Jun 22 14:19:14 wicb-catawba multipathd: mpathbk: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac 2 1 round-robin 0 2 1 65:32 14 69:32 14 round-robin 0 1 1 71:32 9]
Jun 22 14:19:14 wicb-catawba multipathd: sddk path added to devmap mpathbk
Jun 22 14:19:14 wicb-catawba multipathd: sday: add path (uevent)
Jun 22 14:19:14 wicb-catawba multipathd: sday: using deprecated getuid callout
Jun 22 14:19:14 wicb-catawba multipathd: mpathbk: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac 2 1 round-robin 0 2 1 65:32 14 69:32 14 round-robin 0 2 1 71:32 9 67:32 9]
Jun 22 14:19:14 wicb-catawba multipathd: sday path added to devmap mpathbk
Jun 22 14:19:14 wicb-catawba multipathd: sdbb: add path (uevent)
Jun 22 14:19:14 wicb-catawba multipathd: sdbb: using deprecated getuid callout
Jun 22 14:23:00 wicb-catawba syslog-ng[1809]: syslog-ng starting up; version='2.0.9'
Jun 22 14:23:01 wicb-catawba rchal: CPU frequency scaling is not supported by your processor.
Jun 22 14:23:01 wicb-catawba rchal: boot with 'CPUFREQ=no' in to avoid this warning.
Jun 22 14:23:01 wicb-catawba rchal: Cannot load cpufreq governors - No cpufreq driver available
Jun 22 14:23:05 wicb-catawba kernel: klogd 1.4.1, log source = /proc/kmsg started.
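
The "load table" entries logged by multipathd above are device-mapper
multipath tables, and it can help to decode them when checking which paths
were in a map just before the crash. The following is a minimal sketch, not
part of multipath-tools: parse_multipath_table is a hypothetical helper, and
it assumes the usual layout of a multipath table (start, length, target name,
a counted feature list, a counted hardware-handler list, the number of path
groups and the next group to try, then for each group a selector with its
argument count, the path count, the per-path argument count, and the paths).

```python
def parse_multipath_table(table):
    """Decode a device-mapper 'multipath' table string into a dict."""
    tok = table.strip().strip("[]").split()
    start, length, target = int(tok[0]), int(tok[1]), tok[2]
    if target != "multipath":
        raise ValueError(f"not a multipath table: {target}")

    def take_counted(i):
        # Read a count at tok[i], then that many words after it.
        n = int(tok[i])
        return tok[i + 1:i + 1 + n], i + 1 + n

    features, i = take_counted(3)
    hw_handler, i = take_counted(i)
    num_groups, next_group = int(tok[i]), int(tok[i + 1])
    i += 2

    groups = []
    for _ in range(num_groups):
        selector = tok[i]
        selector_args, i = take_counted(i + 1)
        num_paths, args_per_path = int(tok[i]), int(tok[i + 1])
        i += 2
        paths = []
        for _ in range(num_paths):
            dev = tok[i]                              # major:minor
            args = tok[i + 1:i + 1 + args_per_path]   # per-path selector args
            paths.append((dev, args))
            i += 1 + args_per_path
        groups.append({"selector": selector,
                       "selector_args": selector_args,
                       "paths": paths})

    return {"start": start,
            "length_sectors": length,
            "size_bytes": length * 512,   # device-mapper uses 512-byte sectors
            "features": features,
            "hw_handler": hw_handler,
            "next_group": next_group,
            "groups": groups}

# Second mpathbk table from the messages excerpt in this report.
TABLE = ("0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac "
         "2 1 round-robin 0 2 1 65:32 14 69:32 14 "
         "round-robin 0 2 1 71:32 9 67:32 9")

if __name__ == "__main__":
    info = parse_multipath_table(TABLE)
    print("size (bytes):", info["size_bytes"])
    print("hw handler:", info["hw_handler"])
    for n, group in enumerate(info["groups"], start=1):
        devs = ", ".join(dev for dev, _ in group["paths"])
        print(f"group {n} ({group['selector']}): {devs}")
```

Read this way, the two mpathbk tables in the log differ only in the second
path group, which grows from one path (71:32) to two (71:32 and 67:32) as sday
is added.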

Here is the content of the dmesg.txt that accompanied the crash for the
occurrence shown in the host log above.

<0>[15509.707616] ------------[ cut here ]------------
<2>[15509.707625] kernel BUG at /usr/src/packages/BUILD/kernel-default-3.0.76/linux-3.0/kernel/timer.c:1037!
<0>[15509.707633] invalid opcode: 0000 [#1] SMP
<4>[15509.707640] CPU 1
<4>[15509.707642] Modules linked in: dm_round_robin edd nfs lockd fscache auth_rpcgss nfs_acl sunrpc dm_multipath af_packet rdma_ucm rdma_cm iw_cm ib_addr ib_srp scsi_transport_srp scsi_tgt ib_ipoib ib_cm ib_uverbs ib_umad iw_cxgb3 cxgb3 mlx4_en mlx4_ib ib_sa ib_mthca ib_mad ib_core mperf microcode fuse loop pciehp dm_mod ixgbe igb joydev usbhid hid ipv6 ipv6_lib dca usb_storage shpchp rtc_cmos ptp pci_hotplug pps_core mei dcdbas(X) iTCO_wdt iTCO_vendor_support mlx4_core mdio sr_mod acpi_power_meter sg acpi_pad button pcspkr cdrom ext3 jbd mbcache ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea sd_mod crc_t10dif ehci_hcd usbcore usb_common processor thermal_sys hwmon scsi_dh_emc scsi_dh_hp_sw scsi_dh_alua ahci libahci libata scsi_dh_rdac scsi_dh scsi_mod
<4>[15509.707750] Supported: Yes, External
<4>[15509.707754]
<4>[15509.707759] Pid: 15183, comm: LinuxSmash Tainted: G X 3.0.76-0.11-default #1 Dell Inc. PowerEdge R720/061P35
<4>[15509.707769] RIP: 0010:[<ffffffff8106ddd1>] [<ffffffff8106ddd1>] cascade+0x91/0xa0
<4>[15509.707784] RSP: 0000:ffff88082fc03e40 EFLAGS: 00010083
<4>[15509.707789] RAX: ffffffff81ddd680 RBX: ffff8803fabb1ae0 RCX: 00000001003a0580
<4>[15509.707794] RDX: ffff88082fc03e40 RSI: ffff8803fabb1ae0 RDI: ffff880817af0000
<4>[15509.707800] RBP: ffff880817af0000 R08: 0000000000000080 R09: ffff88082fc117f0
<4>[15509.707805] R10: 0000000000000c00 R11: ffffffff81025700 R12: ffff88082fc03e40
<4>[15509.707811] R13: 0000000000000005 R14: 0000000000000001 R15: ffffffff81a02108
<4>[15509.707817] FS: 00007f228b4d1700(0000) GS:ffff88082fc00000(0000) knlGS:0000000000000000
<4>[15509.707823] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[15509.707828] CR2: 00007f7b9d2728f0 CR3: 000000040125c000 CR4: 00000000000407e0
<4>[15509.707834] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[15509.707839] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[15509.707845] Process LinuxSmash (pid: 15183, threadinfo ffff8808034ac000, task ffff880803f402c0)
<0>[15509.707850] Stack:
<4>[15509.707853] ffff880815af3b48 ffff8803fabb1ae0 ffff880803f402c0 0000000000000000
<4>[15509.707866] ffff880817af0000 0000000000000008 ffff88082fc03ea0 ffffffff810700b4
<4>[15509.707875] ffff880817af1c20 ffff880817af1820 ffff880817af1420 ffff880817af1020
<0>[15509.707885] Call Trace:
<4>[15509.707902] [<ffffffff810700b4>] run_timer_softirq+0x1a4/0x240
<4>[15509.707915] [<ffffffff81066eaf>] __do_softirq+0xef/0x220
<4>[15509.707928] [<ffffffff814657dc>] call_softirq+0x1c/0x30
<4>[15509.707943] [<ffffffff81004445>] do_softirq+0x65/0xa0
<4>[15509.707952] [<ffffffff81066ca5>] irq_exit+0xc5/0xe0
<4>[15509.707961] [<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
<4>[15509.707971] [<ffffffff81464f73>] apic_timer_interrupt+0x13/0x20
<4>[15509.707982] [<0000000000411516>] 0x411515
<0>[15509.707986] Code: de 48 83 e0 fe 48 39 c5 75 21 48 89 d3 48 89 ef e8 a5 fe ff ff 4c 39 e3 48 8b 13 75 dd 48 83 c4 18 44 89 e8 5b 5d 41 5c 41 5d c3 <0f> 0b eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 00
<1>[15509.708039] RIP [<ffffffff8106ddd1>] cascade+0x91/0xa0
<4>[15509.708046] RSP <ffff88082fc03e40>
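
As a quick cross-check of the trace above: in an oops "Code:" line, the byte
wrapped in angle brackets is the one RIP points at. Here it is 0f followed by
0b, which is the x86 ud2 instruction, consistent with the "kernel BUG at" and
"invalid opcode" lines. The snippet below is only a sketch for pulling those
bytes out of the line; faulting_bytes is a made-up helper name, and the Code
string is copied from the dmesg output in this report.

```python
def faulting_bytes(code_line, count=2):
    """Return `count` bytes starting at the <xx>-marked byte of a Code: line."""
    tokens = code_line.split("Code:", 1)[-1].split()
    for idx, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            marked = [tok.strip("<>")] + tokens[idx + 1:idx + count]
            return bytes(int(b, 16) for b in marked)
    raise ValueError("no <xx> marker found in Code: line")

UD2 = bytes([0x0F, 0x0B])  # x86 ud2 opcode

CODE = ("Code: de 48 83 e0 fe 48 39 c5 75 21 48 89 d3 48 89 ef e8 a5 "
        "fe ff ff 4c 39 e3 48 8b 13 75 dd 48 83 c4 18 44 89 e8 5b 5d "
        "41 5c 41 5d c3 <0f> 0b eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 "
        "65 48 8b 04 25 00")

if __name__ == "__main__":
    found = faulting_bytes(CODE)
    print("bytes at RIP:", found.hex(" "))
    print("is ud2:", found == UD2)
```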
