[Bug 826486] New: SLES 11.3 x64 with DMMP crashes while adding paths to map
https://bugzilla.novell.com/show_bug.cgi?id=826486

https://bugzilla.novell.com/show_bug.cgi?id=826486#c0

Summary: SLES 11.3 x64 with DMMP crashes while adding paths to map
Classification: openSUSE
Product: openSUSE 12.3
Version: Final
Platform: x86-64
OS/Version: SLES 11
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: garrett.marks@netapp.com
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created an attachment (id=545385)
 --> (http://bugzilla.novell.com/attachment.cgi?id=545385)
Output of multipath -ll command from one host.

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0

While running an automated test on two hosts running SLES 11 SP3 (GMC2), connected to two NetApp E-series storage arrays over InfiniBand, the SUSE hosts have crashed and rebooted on several occasions. The hosts have crashed independently and, so far, never at the same time.

The test that triggers the issue fails one of the two controllers on each storage array and then fails some drives in the volume groups on the array. The script then recovers the controllers and begins reconstructing the drives. Every occurrence of the problem has been observed while the array controllers were being brought back online.

Configuration description

Two x64 hosts, each from a different vendor and each running SLES 11 SP3 (GMC2).
HCA: One host has ConnectX and ConnectX-3 adapters from Mellanox, the other host has a ConnectX-2 adapter.
OFED: inbox (ofed-1.5.4.1-0.11.5).
Failover: DM-MP (device-mapper-1.02.77-0.11.33) using scsi_dh_rdac.
Connection: Connected to the storage through a Mellanox InfiniBand switch.
Switch: Mellanox SX6036 with MLNX_OS SX_3.3.3500

IBstat output:

CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.11.500
    Hardware version: 0
    Node GUID: 0x0002c9030038c8e0
    System image GUID: 0x0002c9030038c8e3
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 25
        LMC: 0
        SM lid: 4
        Capability mask: 0x02514868
        Port GUID: 0x0002c9030038c8e1
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02514868
        Port GUID: 0x0002c9030038c8e2
        Link layer: InfiniBand
CA 'mlx4_1'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.9.1000
    Hardware version: a0
    Node GUID: 0x0002c90300087a12
    System image GUID: 0x0002c90300087a15
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 19
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300087a13
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300087a14
        Link layer: InfiniBand

Storage: Two NetApp E-series SRP storage arrays.
LUNs: 32 volumes mapped to each host from each array (64 volumes in total mapped to each host)

Reproducible: Sometimes

Steps to Reproduce:
1. Install SLES 11 SP3 and configure InfiniBand OFED and DMMP.
2. Map 32 LUNs to each host from each of two E-series arrays.
3. Start IO on each host to the LUNs.
4. Fail one controller from each storage array.
5. Fail a couple of drives, leaving each volume group usable.
6. Set one of the failed controllers online.
7. Wait a couple of minutes.
8. Set the second failed controller online.

Actual Results:
When the problem occurs, the host reboots and leaves a crash dump behind.

Expected Results:
The host should not crash and should run IO without any problems while the controllers come back online.

Attached is the output of "multipath -ll" from one of the hosts at the start of the test.
In every occurrence, the host was in the process of adding paths to device maps. The following excerpt from one host’s messages file shows the entries right before the crash, followed by the host booting back up.

Jun 22 14:19:14 wicb-catawba multipathd: sddg path added to devmap mpathai
Jun 22 14:19:14 wicb-catawba multipathd: sddk: add path (uevent)
Jun 22 14:19:14 wicb-catawba multipathd: sddk: using deprecated getuid callout
Jun 22 14:19:14 wicb-catawba multipathd: mpathbk: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac 2 1 round-robin 0 2 1 65:32 14 69:32 14 round-robin 0 1 1 71:32 9]
Jun 22 14:19:14 wicb-catawba multipathd: sddk path added to devmap mpathbk
Jun 22 14:19:14 wicb-catawba multipathd: sday: add path (uevent)
Jun 22 14:19:14 wicb-catawba multipathd: sday: using deprecated getuid callout
Jun 22 14:19:14 wicb-catawba multipathd: mpathbk: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac 2 1 round-robin 0 2 1 65:32 14 69:32 14 round-robin 0 2 1 71:32 9 67:32 9]
Jun 22 14:19:14 wicb-catawba multipathd: sday path added to devmap mpathbk
Jun 22 14:19:14 wicb-catawba multipathd: sdbb: add path (uevent)
Jun 22 14:19:14 wicb-catawba multipathd: sdbb: using deprecated getuid callout
Jun 22 14:23:00 wicb-catawba syslog-ng[1809]: syslog-ng starting up; version='2.0.9'
Jun 22 14:23:01 wicb-catawba rchal: CPU frequency scaling is not supported by your processor.
Jun 22 14:23:01 wicb-catawba rchal: boot with 'CPUFREQ=no' in to avoid this warning.
Jun 22 14:23:01 wicb-catawba rchal: Cannot load cpufreq governors - No cpufreq driver available
Jun 22 14:23:05 wicb-catawba kernel: klogd 1.4.1, log source = /proc/kmsg started.

Here is the dmesg.txt output that accompanied the crash for the occurrence shown in the host logs above.

<0>[15509.707616] ------------[ cut here ]------------
<2>[15509.707625] kernel BUG at /usr/src/packages/BUILD/kernel-default-3.0.76/linux-3.0/kernel/timer.c:1037!
<0>[15509.707633] invalid opcode: 0000 [#1] SMP
<4>[15509.707640] CPU 1
<4>[15509.707642] Modules linked in: dm_round_robin edd nfs lockd fscache auth_rpcgss nfs_acl sunrpc dm_multipath af_packet rdma_ucm rdma_cm iw_cm ib_addr ib_srp scsi_transport_srp scsi_tgt ib_ipoib ib_cm ib_uverbs ib_umad iw_cxgb3 cxgb3 mlx4_en mlx4_ib ib_sa ib_mthca ib_mad ib_core mperf microcode fuse loop pciehp dm_mod ixgbe igb joydev usbhid hid ipv6 ipv6_lib dca usb_storage shpchp rtc_cmos ptp pci_hotplug pps_core mei dcdbas(X) iTCO_wdt iTCO_vendor_support mlx4_core mdio sr_mod acpi_power_meter sg acpi_pad button pcspkr cdrom ext3 jbd mbcache ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea sd_mod crc_t10dif ehci_hcd usbcore usb_common processor thermal_sys hwmon scsi_dh_emc scsi_dh_hp_sw scsi_dh_alua ahci libahci libata scsi_dh_rdac scsi_dh scsi_mod
<4>[15509.707750] Supported: Yes, External
<4>[15509.707754]
<4>[15509.707759] Pid: 15183, comm: LinuxSmash Tainted: G X 3.0.76-0.11-default #1 Dell Inc. PowerEdge R720/061P35
<4>[15509.707769] RIP: 0010:[<ffffffff8106ddd1>] [<ffffffff8106ddd1>] cascade+0x91/0xa0
<4>[15509.707784] RSP: 0000:ffff88082fc03e40 EFLAGS: 00010083
<4>[15509.707789] RAX: ffffffff81ddd680 RBX: ffff8803fabb1ae0 RCX: 00000001003a0580
<4>[15509.707794] RDX: ffff88082fc03e40 RSI: ffff8803fabb1ae0 RDI: ffff880817af0000
<4>[15509.707800] RBP: ffff880817af0000 R08: 0000000000000080 R09: ffff88082fc117f0
<4>[15509.707805] R10: 0000000000000c00 R11: ffffffff81025700 R12: ffff88082fc03e40
<4>[15509.707811] R13: 0000000000000005 R14: 0000000000000001 R15: ffffffff81a02108
<4>[15509.707817] FS: 00007f228b4d1700(0000) GS:ffff88082fc00000(0000) knlGS:0000000000000000
<4>[15509.707823] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[15509.707828] CR2: 00007f7b9d2728f0 CR3: 000000040125c000 CR4: 00000000000407e0
<4>[15509.707834] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[15509.707839] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[15509.707845] Process LinuxSmash (pid: 15183, threadinfo ffff8808034ac000, task ffff880803f402c0)
<0>[15509.707850] Stack:
<4>[15509.707853] ffff880815af3b48 ffff8803fabb1ae0 ffff880803f402c0 0000000000000000
<4>[15509.707866] ffff880817af0000 0000000000000008 ffff88082fc03ea0 ffffffff810700b4
<4>[15509.707875] ffff880817af1c20 ffff880817af1820 ffff880817af1420 ffff880817af1020
<0>[15509.707885] Call Trace:
<4>[15509.707902] [<ffffffff810700b4>] run_timer_softirq+0x1a4/0x240
<4>[15509.707915] [<ffffffff81066eaf>] __do_softirq+0xef/0x220
<4>[15509.707928] [<ffffffff814657dc>] call_softirq+0x1c/0x30
<4>[15509.707943] [<ffffffff81004445>] do_softirq+0x65/0xa0
<4>[15509.707952] [<ffffffff81066ca5>] irq_exit+0xc5/0xe0
<4>[15509.707961] [<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
<4>[15509.707971] [<ffffffff81464f73>] apic_timer_interrupt+0x13/0x20
<4>[15509.707982] [<0000000000411516>] 0x411515
<0>[15509.707986] Code: de 48 83 e0 fe 48 39 c5 75 21 48 89 d3 48 89 ef e8 a5 fe ff ff 4c 39 e3 48 8b 13 75 dd 48 83 c4 18 44 89 e8 5b 5d 41 5c 41 5d c3 <0f> 0b eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 00
<1>[15509.708039] RIP [<ffffffff8106ddd1>] cascade+0x91/0xa0
<4>[15509.708046] RSP <ffff88082fc03e40>

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
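
For anyone reading the multipathd log in this report, the two "load table" entries for mpathbk can be compared mechanically. The sketch below is not part of the original report. It assumes the usual device-mapper multipath table layout (start, length, target name, feature words, hardware handler words, path-group count and initial group, then one selector and path list per group) and only handles the fields that appear in these two lines. The names parse_multipath_table, PathGroup, MultipathTable, BEFORE and AFTER are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PathGroup:
    selector: str                         # e.g. "round-robin"
    selector_args: List[str]
    paths: List[Tuple[str, List[str]]]    # (major:minor, per-path args)


@dataclass
class MultipathTable:
    start: int                            # in 512-byte sectors
    length: int                           # in 512-byte sectors
    features: List[str]
    hw_handler: List[str]
    initial_group: int
    groups: List[PathGroup]


def parse_multipath_table(table: str) -> MultipathTable:
    """Parse the text between [ and ] of a multipathd 'load table' entry."""
    tokens = table.split()
    pos = 0

    def take(n: int = 1) -> List[str]:
        nonlocal pos
        if pos + n > len(tokens):
            raise ValueError("table ends before all fields were read")
        chunk = tokens[pos:pos + n]
        pos += n
        return chunk

    start, length = (int(x) for x in take(2))
    (target,) = take()
    if target != "multipath":
        raise ValueError(f"not a multipath table: {target!r}")
    features = take(int(take()[0]))       # word count, then the words
    hw_handler = take(int(take()[0]))     # word count, then the words
    n_groups, initial_group = (int(x) for x in take(2))

    groups = []
    for _ in range(n_groups):
        (selector,) = take()
        selector_args = take(int(take()[0]))
        n_paths, n_path_args = (int(x) for x in take(2))
        paths = []
        for _ in range(n_paths):
            (dev,) = take()
            paths.append((dev, take(n_path_args)))
        groups.append(PathGroup(selector, selector_args, paths))
    return MultipathTable(start, length, features, hw_handler,
                          initial_group, groups)


# The two mpathbk tables copied from the log excerpt in this report.
BEFORE = ("0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac "
          "2 1 round-robin 0 2 1 65:32 14 69:32 14 round-robin 0 1 1 71:32 9")
AFTER = ("0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac "
         "2 1 round-robin 0 2 1 65:32 14 69:32 14 "
         "round-robin 0 2 1 71:32 9 67:32 9")

if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        parsed = parse_multipath_table(text)
        for number, group in enumerate(parsed.groups, start=1):
            devices = [dev for dev, _ in group.paths]
            print(label, "group", number, group.selector, devices)
```

Read this way, the only difference between the two tables is that the second path group grows from one path (71:32) to two (71:32 and 67:32), which matches the "sday path added to devmap mpathbk" entry that follows the second table load.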