[Bug 1184436] New: QLogic QLE2672 does not work in VMWare
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 Bug ID: 1184436 Summary: QLogic QLE2672 does not work in VMWare Classification: openSUSE Product: openSUSE Distribution Version: Leap 15.2 Hardware: VMWare OS: Other Status: NEW Severity: Critical Priority: P5 - None Component: Kernel Assignee: kernel-bugs@opensuse.org Reporter: sbt79@mail.ru QA Contact: qa-bugs@suse.de Found By: --- Blocker: --- Created attachment 848036 --> http://bugzilla.opensuse.org/attachment.cgi?id=848036&action=edit logs Default driver qla2xxx has WARNIG during starting system: -------------------------- [ 4.377934] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.02.00.104-k. [ 4.378500] qla2xxx [0000:00:00.0]-011c: : MSI-X vector count: 32. [ 4.378511] qla2xxx [0000:00:00.0]-001d: : Found an ISP2031 irq 21 iobase 0x00000000d70d7486. [ 4.379711] qla2xxx [0000:01:00.0]-00c6:6: MSI-X: Using 2 vectors ... [ 4.469312] qla2xxx [0000:01:00.0]-0075:6: ZIO mode 6 enabled; timer delay (200 us). [ 4.469313] qla2xxx [0000:01:00.0]-ffff:6: FC4 priority set to NVMe [ 4.471862] bochs-drm 0000:00:02.0: fb0: bochs-drmdrmfb frame buffer device [ 4.932190] pcieport 0000:00:01.3: pciehp: Failed to check link status [ 6.340323] scsi host6: qla2xxx [ 6.340385] WARNING: CPU: 1 PID: 532 at ../drivers/pci/msi.c:1303 pci_irq_get_affinity+0x3b/0x80 [ 6.340386] Modules linked in: bochs_drm drm_vram_helper ttm intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support hid_generic ghash_clmulni_intel aesni_intel drm_kms_helper qla2xxx(+) aes_x86_64 crypto_simd cryptd usbhid glue_helper nvme_fc drm nvme_fabrics joydev i2c_i801 pcspkr lpc_ich button nvme_core fb_sys_fops syscopyarea sysfillrect sysimgblt scsi_transport_fc virtio_balloon ehci_pci uhci_hcd ehci_hcd crc32c_intel serio_raw virtio_rng virtio_blk virtio_net net_failover failover qemu_fw_cfg usbcore sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua [ 6.340411] CPU: 1 PID: 532 Comm: systemd-udevd 
Not tainted 5.3.18-lp152.66-default #1 openSUSE Leap 15.2 [ 6.340412] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 [ 6.340414] RIP: 0010:pci_irq_get_affinity+0x3b/0x80 [ 6.340416] Code: 48 8b 87 c0 02 00 00 48 81 c7 c0 02 00 00 48 39 c7 74 17 85 f6 74 50 31 d2 eb 04 39 d6 74 48 48 8b 00 83 c2 01 48 39 f8 75 f1 <0f> 0b 31 c0 c3 83 e2 10 48 c7 c0 80 96 f6 a5 74 2a 48 8b 87 c0 02 [ 6.340417] RSP: 0018:ffffb78d805cf9a8 EFLAGS: 00010246 [ 6.340418] RAX: ffff9ff847f372c0 RBX: 0000000000000000 RCX: ffff9ff8b998d000 [ 6.340418] RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff9ff847f372c0 [ 6.340419] RBP: ffff9ff8b7db80b8 R08: ffff9ff8bbb32040 R09: ffff9ff847c03c00 [ 6.340420] R10: 0000000000000000 R11: 00000000000017fc R12: 0000000000000002 [ 6.340420] R13: ffff9ff847f37000 R14: 00000000ffffffff R15: ffff9ff8b7db80a8 [ 6.340422] FS: 00007fdcfd357d40(0000) GS:ffff9ff8bbb00000(0000) knlGS:0000000000000000 [ 6.340422] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6.340423] CR2: 0000564c76e853b8 CR3: 000000017b330000 CR4: 00000000001406e0 [ 6.340427] Call Trace: [ 6.340442] blk_mq_pci_map_queues+0x37/0xd0 [ 6.340449] blk_mq_alloc_tag_set+0x133/0x2f0 [ 6.340455] scsi_add_host_with_dma+0x7e/0x2f0 [ 6.340482] qla2x00_probe_one+0x1829/0x23b0 [qla2xxx] [ 6.340487] ? _cond_resched+0x15/0x40 [ 6.340490] ? kmem_cache_alloc_trace+0x189/0x270 [ 6.340492] ? create_pinctrl+0x31/0x3e0 [ 6.340496] local_pci_probe+0x42/0x90 [ 6.340499] pci_device_probe+0x10b/0x1c0 [ 6.340504] really_probe+0xef/0x430 [ 6.340506] driver_probe_device+0x110/0x120 [ 6.340508] device_driver_attach+0x4f/0x60 [ 6.340510] __driver_attach+0x51/0x130 [ 6.340512] ? device_driver_attach+0x60/0x60 [ 6.340514] bus_for_each_dev+0x76/0xc0 [ 6.340516] bus_add_driver+0x144/0x220 [ 6.340518] ? 0xffffffffc0544000 [ 6.340520] driver_register+0x5b/0xf0 [ 6.340521] ? 
0xffffffffc0544000 [ 6.340532] qla2x00_module_init+0x1a7/0x20f [qla2xxx] [ 6.340536] do_one_initcall+0x46/0x1f4 [ 6.340538] ? _cond_resched+0x15/0x40 [ 6.340539] ? kmem_cache_alloc_trace+0x189/0x270 [ 6.340544] ? do_init_module+0x22/0x22a [ 6.340545] do_init_module+0x5b/0x22a [ 6.340548] load_module+0x1d74/0x2260 [ 6.340552] ? ima_post_read_file+0xe2/0x120 [ 6.340554] ? __do_sys_finit_module+0xe9/0x110 [ 6.340555] __do_sys_finit_module+0xe9/0x110 [ 6.340558] do_syscall_64+0x65/0x1f0 [ 6.340560] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 6.340562] RIP: 0033:0x7fdcfc19c759 [ 6.340564] Code: 00 48 81 c4 80 00 00 00 89 f0 c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0f d7 2b 00 f7 d8 64 89 01 48 [ 6.340565] RSP: 002b:00007ffe47458ad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 6.340566] RAX: ffffffffffffffda RBX: 0000564c76e3f340 RCX: 00007fdcfc19c759 [ 6.340567] RDX: 0000000000000000 RSI: 00007fdcfcad787d RDI: 0000000000000011 [ 6.340567] RBP: 00007fdcfcad787d R08: 0000000000000000 R09: 0000564c76e24840 [ 6.340568] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000020000 [ 6.340569] R13: 0000564c76e707f0 R14: 0000000000000000 R15: 0000000003938700 [ 6.340570] ---[ end trace 272efd67891a6f5d ]--- [ 6.342232] qla2xxx [0000:01:00.0]-00fb:6: QLogic QLE2672 - QLE2672 QLogic 2-port 16Gb Fibre Channel Adapter. [ 6.342277] qla2xxx [0000:01:00.0]-00fc:6: ISP2031: PCIe (8.0GT/s x8) @ 0000:01:00.0 hdma+ host#=6 fw=8.08.231 (d0d5). [ 6.342902] qla2xxx [0000:00:00.0]-011c: : MSI-X vector count: 32. [ 6.342905] qla2xxx [0000:00:00.0]-001d: : Found an ISP2031 irq 22 iobase 0x000000007521d3c1. [ 6.344281] qla2xxx [0000:02:00.0]-00c6:7: MSI-X: Using 2 vectors [ 6.424202] qla2xxx [0000:02:00.0]-0075:7: ZIO mode 6 enabled; timer delay (200 us). 
[ 6.424204] qla2xxx [0000:02:00.0]-ffff:7: FC4 priority set to NVMe [ 8.292287] scsi host7: qla2xxx [ 8.294192] qla2xxx [0000:02:00.0]-00fb:7: QLogic QLE2672 - QLE2672 QLogic 2-port 16Gb Fibre Channel Adapter. [ 8.294243] qla2xxx [0000:02:00.0]-00fc:7: ISP2031: PCIe (8.0GT/s x8) @ 0000:02:00.0 hdma+ host#=7 fw=8.08.231 (d0d5). ... [ 28.550787] qla2xxx [0000:01:00.0]-8038:6: Cable is unplugged... [ 30.651367] qla2xxx [0000:02:00.0]-8038:7: Cable is unplugged... -------------------------- After FC ports finish LOGIN process then we have crash in the BLOCK subsystem: -------------------------- [ 632.401232] qla2xxx [0000:02:00.0]-2134:7: FCPort 10:00:14:52:90:00:5b:28 disc_state transition: GPDB to UPD_FCPORT - portid=000002. [ 632.401243] qla2xxx [0000:02:00.0]-20ef:7: qla2x00_update_fcport 10:00:14:52:90:00:5b:28 [ 632.401245] qla2xxx [0000:02:00.0]-2134:7: FCPort 10:00:14:52:90:00:5b:28 disc_state transition: UPD_FCPORT to UPD_FCPORT - portid=000002. [ 632.401630] qla2xxx [0000:02:00.0]-20ee:7: qla2x00_reg_remote_port 1000145290005b28. rport 00000000a21ecdd5 is tgt mode [ 632.401641] qla2xxx [0000:02:00.0]-207d:7: FCPort 10:00:14:52:90:00:5b:28 state transitioned from UNCONFIGURED to ONLINE - portid=000002. [ 632.401643] qla2xxx [0000:02:00.0]-2134:7: FCPort 10:00:14:52:90:00:5b:28 disc_state transition: UPD_FCPORT to LOGIN_COMPLETE - portid=000002. 
[ 632.401740] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 632.401750] #PF: supervisor read access in kernel mode [ 632.401753] #PF: error_code(0x0000) - not-present page [ 632.401757] PGD 0 P4D 0 [ 632.401762] Oops: 0000 [#1] SMP PTI [ 632.401768] CPU: 1 PID: 116 Comm: kworker/u4:5 Tainted: G W 5.3.18-lp152.66-default #1 openSUSE Leap 15.2 [ 632.401773] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 [ 632.401790] Workqueue: scsi_wq_7 fc_scsi_scan_rport [scsi_transport_fc] [ 632.401836] RIP: 0010:qla2xxx_queuecommand+0x175/0x3f0 [qla2xxx] [ 632.401842] Code: ba 03 30 00 00 bf 00 80 00 08 e8 56 e2 03 00 e9 68 ff ff ff 48 8b bb 20 01 00 00 e8 95 29 f7 ee 48 8b 95 b8 00 00 00 c1 e8 10 <48> 8b 14 c2 48 85 d2 0f 84 02 ff ff ff 48 83 c4 08 48 89 de 4c 89 [ 632.401850] RSP: 0000:ffff9ea240207a58 EFLAGS: 00010246 [ 632.401854] RAX: 0000000000000000 RBX: ffff91cf0f6c4220 RCX: ffff91cf0f6c4220 [ 632.401857] RDX: 0000000000000000 RSI: ffff91cf0f6c4220 RDI: ffff91cf0f6c4100 [ 632.401861] RBP: ffff91cf0f512000 R08: 0000000000000020 R09: ffff91cf0f73ead8 [ 632.401865] R10: 0000000000000000 R11: ffffe2a0c5d46880 R12: ffff91cf3872e000 [ 632.401868] R13: ffff91cf3a471c00 R14: ffff91cf3146c060 R15: ffff91cf3872e7c8 [ 632.401873] FS: 0000000000000000(0000) GS:ffff91cf3bb00000(0000) knlGS:0000000000000000 [ 632.401877] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 632.401880] CR2: 0000000000000000 CR3: 0000000171508000 CR4: 00000000001406e0 [ 632.401890] Call Trace: [ 632.401912] scsi_queue_rq+0x6f0/0xab0 [ 632.401922] blk_mq_dispatch_rq_list+0x90/0x560 [ 632.401928] ? 
blk_mq_flush_busy_ctxs+0xf3/0x110 [ 632.401934] blk_mq_sched_dispatch_requests+0x153/0x170 [ 632.401939] __blk_mq_run_hw_queue+0x2b/0xa0 [ 632.401944] __blk_mq_delay_run_hw_queue+0x100/0x150 [ 632.401949] blk_mq_run_hw_queue+0x50/0x100 [ 632.401953] blk_mq_sched_insert_request+0x10f/0x180 [ 632.401958] blk_execute_rq+0x4b/0xa0 [ 632.401964] __scsi_execute+0x113/0x250 [ 632.401968] scsi_probe_and_add_lun+0x254/0xdc0 [ 632.401974] __scsi_scan_target+0x106/0x620 [ 632.401981] ? __pm_runtime_resume+0x54/0x70 [ 632.401986] scsi_scan_target+0x105/0x110 [ 632.401993] fc_scsi_scan_rport+0xa6/0xb0 [scsi_transport_fc] [ 632.402003] process_one_work+0x1f4/0x3e0 [ 632.402008] worker_thread+0x2d/0x3e0 [ 632.402012] ? process_one_work+0x3e0/0x3e0 [ 632.402017] kthread+0x10d/0x130 [ 632.402021] ? kthread_park+0xa0/0xa0 [ 632.402028] ret_from_fork+0x35/0x40 [ 632.402033] Modules linked in: qla2xxx xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_tables x_tables bpfilter br_netfilter bridge stp llc overlay scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs rfkill intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel bochs_drm drm_vram_helper ttm aes_x86_64 drm_kms_helper crypto_simd joydev iTCO_wdt iTCO_vendor_support hid_generic drm cryptd pcspkr glue_helper usbhid lpc_ich nvme_fc nvme_fabrics i2c_i801 fb_sys_fops virtio_balloon nvme_core syscopyarea sysfillrect sysimgblt scsi_transport_fc button crc32c_intel serio_raw virtio_rng virtio_blk ehci_pci virtio_net net_failover failover uhci_hcd ehci_hcd usbcore qemu_fw_cfg sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua [last unloaded: qla2xxx] [ 632.402111] CR2: 0000000000000000 [ 632.402115] ---[ end trace a3f25d08d43276be ]--- [ 632.402129] RIP: 0010:qla2xxx_queuecommand+0x175/0x3f0 [qla2xxx] [ 632.402134] Code: ba 03 30 00 00 bf 00 80 00 08 
e8 56 e2 03 00 e9 68 ff ff ff 48 8b bb 20 01 00 00 e8 95 29 f7 ee 48 8b 95 b8 00 00 00 c1 e8 10 <48> 8b 14 c2 48 85 d2 0f 84 02 ff ff ff 48 83 c4 08 48 89 de 4c 89 [ 632.402140] RSP: 0000:ffff9ea240207a58 EFLAGS: 00010246 [ 632.402144] RAX: 0000000000000000 RBX: ffff91cf0f6c4220 RCX: ffff91cf0f6c4220 [ 632.402147] RDX: 0000000000000000 RSI: ffff91cf0f6c4220 RDI: ffff91cf0f6c4100 [ 632.402151] RBP: ffff91cf0f512000 R08: 0000000000000020 R09: ffff91cf0f73ead8 [ 632.402154] R10: 0000000000000000 R11: ffffe2a0c5d46880 R12: ffff91cf3872e000 [ 632.402158] R13: ffff91cf3a471c00 R14: ffff91cf3146c060 R15: ffff91cf3872e7c8 [ 632.402162] FS: 0000000000000000(0000) GS:ffff91cf3bb00000(0000) knlGS:0000000000000000 [ 632.402166] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 632.402169] CR2: 0000000000000000 CR3: 0000000171508000 CR4: 00000000001406e0 [ 632.705981] qla2xxx [0000:02:00.0]-400d:7: Relogin scheduled. [ 632.705994] qla2xxx [0000:02:00.0]-400e:7: Relogin end. -------------------------- If we use qla2xxx driver from Marvell (Marvell download center: 10.01.00.63.15.2-k) then everything works well and storage is available. Full logs and lspcsi is attached. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c1

Sergey Samoylenko <sbt79@mail.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Hardware|VMWare                      |x86

--- Comment #1 from Sergey Samoylenko <sbt79@mail.ru> ---
The driver is used in initiator mode.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436

Roman Bolshakov <r.bolshakov@yadro.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |r.bolshakov@yadro.com
            Summary|QLogic QLE2672 does not     |QLogic QLE2672 does not
                   |work in VMWare              |work in QEMU
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |daniel.wagner@suse.com,
                   |                            |hare@suse.com,
                   |                            |tiwai@suse.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436

Roman Bolshakov <r.bolshakov@yadro.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|QLogic QLE2672 does not     |qla2xxx crashes in
                   |work in QEMU                |scsi_queue_rq()
           Severity|Critical                    |Major
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c2

--- Comment #2 from Roman Bolshakov <r.bolshakov@yadro.com> ---
Created attachment 848042
  --> http://bugzilla.opensuse.org/attachment.cgi?id=848042&action=edit
support config from SLE15 SP2

A similar issue happens with QLE2742 on SLE15 SP2, kernel 5.3.18-24.52-default.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c3 Daniel Wagner <daniel.wagner@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |IN_PROGRESS Assignee|kernel-bugs@opensuse.org |daniel.wagner@suse.com --- Comment #3 from Daniel Wagner <daniel.wagner@suse.com> ---
WARNING: CPU: 1 PID: 532 at ../drivers/pci/msi.c:1303 pci_irq_get_affinity+0x3b/0x80
You don't see this with the out-of-box driver from Marvell? Which upstream version of the driver does the out-of-box driver correspond to? If you happen to know, I won't have to find out myself. Thanks!

It looks like the IRQ placement together with qemu isn't really working. Anyway, I need to look at the code.

BTW, do you have a crash dump? It might help to figure out what's going on.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c4

--- Comment #4 from Sergey Samoylenko <sbt79@mail.ru> ---
Yes, when we use the qla2xxx driver from the Marvell site, everything looks fine. dmesg-qla2xxx-10.01.00.63.15.2-k is attached.

We took the Marvell driver from here:
https://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/SearchByProduct....

It is 'FC Driver for SLES 15 SP2 / FC-NVMe - TAR'. The version is '10.01.00.63.15.2-k'.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c5 --- Comment #5 from Sergey Samoylenko <sbt79@mail.ru> --- Created attachment 848053 --> http://bugzilla.opensuse.org/attachment.cgi?id=848053&action=edit dmesg for qla2xxx-10.01.00.63.15.2-k from Marvell site -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c6 --- Comment #6 from Daniel Wagner <daniel.wagner@suse.com> --- Thanks for the pointers. The two versions of the driver differ a bit. So it's not easy to see if there is something missing. Also the 10.02.00.106-k driver update doesn't have anything which I could easily match to the call traces. Need to take a closer look. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c7

--- Comment #7 from Sergey Samoylenko <sbt79@mail.ru> ---
Hi Daniel,

The last working kernel version is 'kernel-default-5.3.18-lp152.57.1'. 'kernel-default-5.3.18-lp152.60.1' and later versions do not work with the QLogic QLE2672.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c8

--- Comment #8 from Roman Bolshakov <r.bolshakov@yadro.com> ---
Daniel, the WARN_ON_ONCE happens inside pci_irq_get_affinity():

    int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
                              int offset)
    {
        const struct cpumask *mask;
        unsigned int queue, cpu;

        for (queue = 0; queue < qmap->nr_queues; queue++) {
            mask = pci_irq_get_affinity(pdev, queue + offset);
            if (!mask)
                goto fallback;

            for_each_cpu(cpu, mask)
                qmap->mq_map[cpu] = qmap->queue_offset + queue;
        }

        return 0;

    fallback:
        WARN_ON_ONCE(qmap->nr_queues > 1);
        blk_mq_clear_mq_map(qmap);
        return 0;
    }

qmap is cleared afterwards:

    static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
    {
        int cpu;

        for_each_possible_cpu(cpu)
            qmap->mq_map[cpu] = 0;
    }

The number of hw queue pairs is adjusted to 0 on the scsi_host:

[ 173.696610] qla2xxx [0000:01:00.0]-00c6:6: MSI-X: Using 2 vectors
[ 173.696612] qla2xxx [0000:00:00.0]-0990: : Adjusted Max no of queues pairs: 0.
[ 173.697463] qla2xxx [0000:01:00.0]-c005:6: mqiobase=000000001ac0cfa5, max_rsp_queues=1, max_req_queues=1.
[ 173.697467] qla2xxx [0000:01:00.0]-0055:6: mqiobase=000000001ac0cfa5, max_rsp_queues=1, max_req_queues=1.
[ 173.697469] qla2xxx [0000:01:00.0]-0036:6: MSI-X: Enabled (0x0, 0x0).
[ 173.697475] qla2xxx [0000:01:00.0]-0192:6: blk/scsi-mq enabled, HW queues = 0.

And then the crash happens in qla2xxx_queuecommand():

    if (ha->mqenable) {
        uint32_t tag;
        uint16_t hwq;
        struct qla_qpair *qpair = NULL;

        tag = blk_mq_unique_tag(cmd->request);
        hwq = blk_mq_unique_tag_to_hwq(tag);
        qpair = ha->queue_pair_map[hwq];    /* <- HERE */

        if (qpair)
            return qla2xxx_mqueuecommand(host, cmd, qpair);
    }

But I think it's quite clear that qla2xxx shouldn't be in mq mode on a dual-core CPU, because the only two available interrupt vectors are used for the default response/request queue and for mailboxes.
I'm not yet sure if the earlier WARN_ON is directly related to the issue, but the queue map is apparently NULL:

crash> dev -p | grep 0000:01:00.0
...
crash> struct pci_dev.dev.driver_data ffffa0a447f2b000
  dev.driver_data = 0xffffa0a4bcef27c8,
...
crash> struct scsi_qla_host 0xffffa0a4bcef27c8
crash> struct scsi_qla_host.hw 0xffffa0a4bcef27c8
...
crash> struct qla_hw_data.queue_pair_map 0xffffa0a499ef5000
  queue_pair_map = 0x0

I've got the core file with debugging logs enabled. I'll try to upload it here.

Thanks,
Roman
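The failing lookup can be modeled in user space to make the mechanism concrete. This is only a sketch: `struct hw_model`, `struct qpair_model`, `tag_to_hwq()` and `pick_qpair()` are hypothetical stand-ins, not driver code, and the NULL guard on the map is an illustrative defensive check, not the fix discussed in this bug. The one fact taken from blk-mq is that the hardware queue index sits in the upper 16 bits of the unique tag.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver structures; only the fields
 * discussed in the comment above are modeled. */
struct qpair_model {
    int id;
};

struct hw_model {
    int mqenable;                        /* non-zero when MQ mode is on */
    struct qpair_model **queue_pair_map; /* NULL when max_qpairs == 0 */
};

/* blk-mq keeps the hardware queue index in the upper 16 bits of the
 * unique tag; the low 16 bits are the per-queue tag. */
static uint16_t tag_to_hwq(uint32_t unique_tag)
{
    return (uint16_t)(unique_tag >> 16);
}

/* Returns the queue pair for this tag, or NULL to signal "use the base
 * queue". Without the queue_pair_map check, a NULL map with mqenable
 * set is dereferenced, which matches the reported oops at address 0. */
static struct qpair_model *pick_qpair(const struct hw_model *ha,
                                      uint32_t unique_tag)
{
    if (!ha->mqenable || !ha->queue_pair_map)
        return NULL;
    return ha->queue_pair_map[tag_to_hwq(unique_tag)];
}
```

With `mqenable = 1` and `queue_pair_map = NULL`, which is the state shown by the crash session, `pick_qpair()` returns NULL instead of faulting.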
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c9

--- Comment #9 from Daniel Wagner <daniel.wagner@suse.com> ---
Thanks a lot for the debug effort. I'll start looking into it now.

I don't know if you can upload the core dump here due to the size limit. If you put it somewhere I can download it from, that should do the trick.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c10

--- Comment #10 from Roman Bolshakov <r.bolshakov@yadro.com> ---
I think qla2x00_alloc_queues() was invoked with max_qpairs == 0 as outlined in the earlier comment, so queue_pair_map was never allocated:

    if ((ql2xmqsupport || ql2xnvmeenable) && ha->max_qpairs) {
        ha->queue_pair_map = kcalloc(ha->max_qpairs, sizeof(struct qla_qpair *),
            GFP_KERNEL);
        if (!ha->queue_pair_map) {
            ql_log(ql_log_fatal, vha, 0x0180,
                "Unable to allocate memory for queue pair ptrs.\n");
            goto fail_qpair_map;
        }
    }

Likely there are two possible workarounds on the latest Leap 15.2 kernel:
1) Use qla2xxx with ql2xmqsupport=0
2) Provide more CPUs to the VM

I'll try both now.

Meanwhile, here's the core: https://drive.yadro.com/s/Z5iN9Dfg7cAdH8k

Thanks!
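The allocation path described in comment #10 can be sketched in user-space C. The names `struct alloc_model` and `alloc_queue_pair_map()` are hypothetical, `calloc()` stands in for `kcalloc()`, and the condition is written as `max_qpairs > 0` because this model uses a signed int; treat it as an illustration of the skipped allocation, not as the driver's code.

```c
#include <stdlib.h>

/* Hypothetical model of the two fields involved in the allocation. */
struct alloc_model {
    int max_qpairs;         /* 0 on the affected two-vCPU VM */
    void **queue_pair_map;  /* stays NULL if the allocation is skipped */
};

/* Mirrors the shape of the quoted condition: the map is only allocated
 * when MQ or NVMe support is requested AND there is at least one extra
 * queue pair. Returns 0 on success, -1 on allocation failure. */
static int alloc_queue_pair_map(struct alloc_model *ha,
                                int mqsupport, int nvmeenable)
{
    ha->queue_pair_map = NULL;

    if ((mqsupport || nvmeenable) && ha->max_qpairs > 0) {
        ha->queue_pair_map = calloc((size_t)ha->max_qpairs, sizeof(void *));
        if (!ha->queue_pair_map)
            return -1;
    }
    return 0;
}
```

Calling this with `max_qpairs = 0` succeeds and leaves the map NULL, so any later code that indexes the map without checking it will fault.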
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c11 --- Comment #11 from Daniel Wagner <daniel.wagner@suse.com> --- (In reply to Roman Bolshakov from comment #10)
I think qla2x00_alloc_queues() was invoked with max_qpairs == 0 as outlined
I wonder if we should set ql2xmqsupport to 0 automatically and fall back to single queue mode. Let me check what the out-of-box driver does here.
Likely there are two possible workarounds on the latest Leap 15.2 kernel: 1) Use qla2xxx with ql2xmqsupport=0 2) Provide more CPUs to the VM
I'll try both now.
Great!
Meanwhile, here's the core: https://drive.yadro.com/s/Z5iN9Dfg7cAdH8k
Thanks. Though I think you have already done the analysis :) -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c12

--- Comment #12 from Roman Bolshakov <r.bolshakov@yadro.com> ---
# modprobe qla2xxx ql2xmqsupport=0 logging=0x7fffffff

(In reply to Daniel Wagner from comment #11)
(In reply to Roman Bolshakov from comment #10)
I think qla2x00_alloc_queues() was invoked with max_qpairs == 0 as outlined
I wonder if we should set ql2xmqsupport to 0 automatically, fallback using the single queue mode. Let me check what the out of box driver does here.
Yeah, I think so. qla2xxx is likely tested mostly on fat servers that have many more CPUs, so this corner case went unnoticed. But small VMs with VFIO are great for testing the driver.
Likely there are two possible workarounds on the latest Leap 15.2 kernel: 1) Use qla2xxx with ql2xmqsupport=0 2) Provide more CPUs to the VM
I'll try both now.
Great!
1) The driver loads without the crash, although the WARNING is still there (so we may now assume the warning is not related to the crash):

# modprobe qla2xxx ql2xmqsupport=0

2) A VM with three vCPUs but default qla2xxx parameters (i.e. ql2xmqsupport=1) doesn't crash either :)
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c13

--- Comment #13 from Daniel Wagner <daniel.wagner@suse.com> ---
The only place where mqenable is set to 1 is in qla24xx_enable_msix(). Both versions of the driver have the same checks:
    /* Enable MSI-X vector for response queue update for queue 0 */
    if (IS_QLA83XX(ha) || IS_QLA27XX(ha) || IS_QLA28XX(ha)) {
        if (ha->msixbase && ha->mqiobase &&
            (ha->max_rsp_queues > 1 || ha->max_req_queues > 1 ||
             ql2xmqsupport))
            ha->mqenable = 1;
    } else if (ha->mqiobase &&
               (ha->max_rsp_queues > 1 || ha->max_req_queues > 1 ||
                ql2xmqsupport))
        ha->mqenable = 1;
mq is enabled if the HW supports it and ql2xmqsupport is set. After looking through the code I think we could try to set the number of supported hw queues. Maybe something like this would already do the trick:
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 02d632b77ec7..f91c1f0feb79 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3210,7 +3210,7 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)

     if (ha->mqenable) {
         /* number of hardware queues supported by blk/scsi-mq*/
-        host->nr_hw_queues = ha->max_qpairs;
+        host->nr_hw_queues = max(ha->max_qpairs, num_present_cpus());

         ql_dbg(ql_dbg_init, base_vha, 0x0192,
             "blk/scsi-mq enabled, HW queues = %d.\n", host->nr_hw_queues);
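As a rough illustration of why MQ mode ends up enabled on the affected VM, the quoted mqenable check can be modeled as a pure function. `struct mq_inputs` and `mq_enabled()` are hypothetical names, and the assumption that msixbase and mqiobase are both non-zero on that VM is mine (only mqiobase appears in the posted log).

```c
#include <stdbool.h>

/* Hypothetical flattened view of the values the check reads. */
struct mq_inputs {
    bool newer_isp;      /* stands in for IS_QLA83XX/27XX/28XX */
    bool msixbase;       /* non-zero MSI-X base */
    bool mqiobase;       /* non-zero MQ I/O base */
    int max_rsp_queues;
    int max_req_queues;
    bool ql2xmqsupport;  /* module parameter */
};

/* Same boolean structure as the quoted driver check: the module
 * parameter alone is enough to request MQ, even when there is only
 * one request queue and one response queue. */
static bool mq_enabled(const struct mq_inputs *in)
{
    bool wants_mq = in->max_rsp_queues > 1 ||
                    in->max_req_queues > 1 ||
                    in->ql2xmqsupport;

    if (in->newer_isp)
        return in->msixbase && in->mqiobase && wants_mq;
    return in->mqiobase && wants_mq;
}
```

With max_rsp_queues=1 and max_req_queues=1 as in the log, the result depends only on ql2xmqsupport, which is consistent with ql2xmqsupport=0 avoiding the crash.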
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c14 --- Comment #14 from Daniel Wagner <daniel.wagner@suse.com> --- On a second thought, let's do something way simpler and avoid any complex logic dependency:
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 02d632b77ec7..a5c4e949e094 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3092,6 +3092,9 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
         ha->isp_ops, ha->flash_conf_off, ha->flash_data_off,
         ha->nvram_conf_off, ha->nvram_data_off);

+    if (num_present_cpus() == 1)
+        ql2xmqsupport = 0;
+
     /* Configure PCI I/O space */
     ret = ha->isp_ops->iospace_config(ha);
     if (ret)
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c15 --- Comment #15 from Roman Bolshakov <r.bolshakov@yadro.com> --- (In reply to Daniel Wagner from comment #13)
The only place where mqenable is set to 1 is in qla24xx_enable_msix() Both version of the driver have the same checks:
    /* Enable MSI-X vector for response queue update for queue 0 */
    if (IS_QLA83XX(ha) || IS_QLA27XX(ha) || IS_QLA28XX(ha)) {
        if (ha->msixbase && ha->mqiobase &&
            (ha->max_rsp_queues > 1 || ha->max_req_queues > 1 ||
             ql2xmqsupport))
            ha->mqenable = 1;
    } else if (ha->mqiobase &&
               (ha->max_rsp_queues > 1 || ha->max_req_queues > 1 ||
                ql2xmqsupport))
        ha->mqenable = 1;
mq is enabled if the HW supports it and ql2xmqsupport is set.
After looking through the code I think we could try to set the number of supported hw queues.
I don't think it's correct to permit MQ mode with two extra queue pairs on a machine with two CPUs, because the two available interrupt vectors are already taken by the default response queue and the mailbox.

It's described well in qla83xx_iospace_config(), but qla24xx_enable_msix() and qla2x00_iospace_config() have effectively the same math and interrupt allocation:

    /*
     * By default, driver uses at least two msix vectors
     * (default & rspq)
     */
    if (ql2xmqsupport || ql2xnvmeenable) {
        /* MB interrupt uses 1 vector */
        ha->max_req_queues = ha->msix_count - 1;

        /* ATIOQ needs 1 vector. That's 1 less QPair */
        if (QLA_TGT_MODE_ENABLED())
            ha->max_req_queues--;

        ha->max_rsp_queues = ha->max_req_queues;

        /* Queue pairs is the max value minus
         * the base queue pair */
        ha->max_qpairs = ha->max_req_queues - 1;
        ql_dbg_pci(ql_dbg_init, ha->pdev, 0x00e3,
            "Max no of queues pairs: %d.\n", ha->max_qpairs);
    }

So, given that, falling back to single queue mode looks saner.
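The vector arithmetic quoted above can be worked through with a small standalone function. `struct queue_limits` and `compute_queue_limits()` are hypothetical names, and signed ints are used here purely so that the subtraction is easy to follow; the driver's actual field types may differ.

```c
/* Hypothetical container for the three values the driver derives. */
struct queue_limits {
    int max_req_queues;
    int max_rsp_queues;
    int max_qpairs;
};

/* Follows the quoted math step by step:
 *   - one vector is reserved for the mailbox interrupt,
 *   - one more is reserved for the ATIO queue in target mode,
 *   - the base queue pair is not counted as an extra queue pair. */
static struct queue_limits compute_queue_limits(int msix_count,
                                                int tgt_mode_enabled)
{
    struct queue_limits q;

    q.max_req_queues = msix_count - 1;
    if (tgt_mode_enabled)
        q.max_req_queues--;
    q.max_rsp_queues = q.max_req_queues;
    q.max_qpairs = q.max_req_queues - 1;
    return q;
}
```

For the affected VM ("MSI-X: Using 2 vectors"), `compute_queue_limits(2, 0)` gives max_req_queues = 1, max_rsp_queues = 1 and max_qpairs = 0, which matches the values in the posted log.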
Maybe something like this would already do the trick:
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 02d632b77ec7..f91c1f0feb79 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3210,7 +3210,7 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)

     if (ha->mqenable) {
         /* number of hardware queues supported by blk/scsi-mq*/
-        host->nr_hw_queues = ha->max_qpairs;
+        host->nr_hw_queues = max(ha->max_qpairs, num_present_cpus());

         ql_dbg(ql_dbg_init, base_vha, 0x0192,
             "blk/scsi-mq enabled, HW queues = %d.\n", host->nr_hw_queues);
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c16 --- Comment #16 from Roman Bolshakov <r.bolshakov@yadro.com> --- (In reply to Daniel Wagner from comment #14)
On a second thought, let's do something way simpler and avoid any complex logic dependency:
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 02d632b77ec7..a5c4e949e094 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3092,6 +3092,9 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
         ha->isp_ops, ha->flash_conf_off, ha->flash_data_off,
         ha->nvram_conf_off, ha->nvram_data_off);

+    if (num_present_cpus() == 1)
+        ql2xmqsupport = 0;
+
     /* Configure PCI I/O space */
     ret = ha->isp_ops->iospace_config(ha);
     if (ret)
Yeah, something like this, except we should also disable it on systems with two CPUs (like the VM we hit the bug on):

    if (num_present_cpus() <= 2) {
        ql2xmqsupport = 0;
    }
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c17

--- Comment #17 from Roman Bolshakov <r.bolshakov@yadro.com> ---
Another, albeit rare, case is qla2xxx in qlini_mode=dual. In that case we shouldn't enable mq mode unless the system has strictly more than three CPUs.
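The two thresholds proposed in comments #16 and #17 can be combined into one predicate. This is a sketch of the rule as stated in the thread, not a patch: `mq_mode_allowed()` is a hypothetical helper, and in the driver the CPU count would come from num_present_cpus() rather than a parameter.

```c
#include <stdbool.h>

/* Returns true when MQ mode would be permitted under the rule discussed
 * in the thread:
 *   - initiator-only: disable MQ with two or fewer CPUs (need >= 3),
 *   - qlini_mode=dual: require strictly more than three CPUs (>= 4),
 *     since target mode consumes one more vector for the ATIO queue. */
static bool mq_mode_allowed(unsigned int present_cpus, bool dual_mode)
{
    unsigned int min_cpus = dual_mode ? 4u : 3u;

    return present_cpus >= min_cpus;
}
```

For the two-vCPU VM from the report, `mq_mode_allowed(2, false)` is false, while the three-vCPU VM that did not crash yields true.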
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c18 --- Comment #18 from Daniel Wagner <daniel.wagner@suse.com> --- (In reply to Roman Bolshakov from comment #16)
Yeah, something like this except we should also disable it on system with two CPUs (like the VM we hit the bug on):
if (num_present_cpus() <= 2) { ql2xmqsupport = 0; }
Sounds reasonable.
qlini_mode=dual
I haven't thought about this, sounds again reasonable. Are you going to send a patch upstream? I think you are deeper in the topic. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c19 --- Comment #19 from Roman Bolshakov <r.bolshakov@yadro.com> --- (In reply to Daniel Wagner from comment #18)
Are you going to send a patch upstream? I think you are deeper in the topic.
Sure, I can do that, although I have to check if it works the same way with the latest mkp/scsi-fixes or mkp/scsi-queue on the VMs with small core count. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c20 --- Comment #20 from Daniel Wagner <daniel.wagner@suse.com> --- (In reply to Roman Bolshakov from comment #19)
Sure, I can do that, although I have to check if it works the same way with the latest mkp/scsi-fixes or mkp/scsi-queue on the VMs with small core count.
Upstream is very likely to behave the same. We try to stick as close as possible with upstream and add downstream patches only as last resort. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c29 --- Comment #29 from OBSbugzilla Bot <bwiedemann+obsbugzillabot@suse.com> --- This is an autogenerated message for OBS integration: This bug (1184436) was mentioned in https://build.opensuse.org/request/show/892132 15.2 / kernel-source -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1184436 http://bugzilla.opensuse.org/show_bug.cgi?id=1184436#c55 --- Comment #55 from OBSbugzilla Bot <bwiedemann+obsbugzillabot@suse.com> --- This is an autogenerated message for OBS integration: This bug (1184436) was mentioned in https://build.opensuse.org/request/show/904571 15.2 / kernel-source -- You are receiving this mail because: You are on the CC list for the bug.