[Bug 1106005] New: NVME RAID throws trace and goes into read only
http://bugzilla.suse.com/show_bug.cgi?id=1106005

Bug ID: 1106005
Summary: NVME RAID throws trace and goes into read only
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.0
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: markus.zimmermann@nethead.at
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 780786
  --> http://bugzilla.suse.com/attachment.cgi?id=780786&action=edit
Full dmesg output

I just upgraded a server from 42.3 to 15.0, and after some time the server could no longer write data because the RAID had gone read-only. The attachment has the full dmesg output, but I guess the important part is this:

[36956.643403] print_req_error: I/O error, dev nvme1n1, sector 67374104
[36956.643408] device-mapper: multipath: Failing path 259:1.
[36956.643562] WARNING: CPU: 3 PID: 570 at ../drivers/nvme/host/core.c:571 nvme_setup_cmd+0x3dd/0x430 [nvme_core]
[36956.643562] Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_REJECT tun devlink ebtable_filter ebtables fuse af_packet nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
[36956.643570] print_req_error: I/O error, dev dm-1, sector 67374104
[36956.643680] ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_physdev br_netfilter bridge stp llc xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iscsi_ibft iscsi_boot_sysfs libcrc32c iptable_filter ip_tables x_tables intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt mei_wdt iTCO_vendor_support ppdev kvm irqbypass pcspkr i2c_i801 shpchp mei_me intel_pch_thermal mei ie31200_edac wmi parport_pc parport video acpi_pad button raid1 md_mod scsi_transport_iscsi dm_service_time crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd xhci_pci glue_helper cryptd nvme xhci_hcd e1000e ptp serio_raw pps_core ahci nvme_core libahci usbcore sunrpc dm_mirror dm_region_hash dm_log
[36956.643712] sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[36956.643717] CPU: 3 PID: 570 Comm: kdmwork-254:0 Not tainted 4.12.14-lp150.12.16-default #1 openSUSE Leap 15.0
[36956.643718] Hardware name: FUJITSU /D3417-B2, BIOS V5.0.0.12 R1.8.0.SR.1 for D3417-B2x 05/16/2017
[36956.643719] task: ffff880fe0b06080 task.stack: ffffc90006f98000
[36956.643722] RIP: 0010:nvme_setup_cmd+0x3dd/0x430 [nvme_core]
[36956.643723] RSP: 0018:ffffc90006f9bca0 EFLAGS: 00010202
[36956.643724] RAX: 0000000000000000 RBX: ffffc90006f9bd38 RCX: 0000000000000000
[36956.643725] RDX: 0000000000000002 RSI: ffff880fe169c6c0 RDI: 0000000004040c18
[36956.643726] RBP: ffff880fdf03e400 R08: ffff880fe169c6c0 R09: 0000000000000008
[36956.643727] R10: ffff880fdf44cc40 R11: 0000000000000024 R12: ffff880fdf3aef00
[36956.643728] R13: 0000000000000001 R14: 0000000000000001 R15: ffff880fdf03e400
[36956.643729] FS: 0000000000000000(0000) GS:ffff88102e4c0000(0000) knlGS:0000000000000000
[36956.643730] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[36956.643731] CR2: 00007fea987f5624 CR3: 000000000200a004 CR4: 00000000003626e0
[36956.643732] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[36956.643733] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[36956.643733] Call Trace:
[36956.643738]  ? blk_mq_run_hw_queue+0x10/0x10
[36956.643741]  nvme_queue_rq+0x52/0xbc0 [nvme]
[36956.643744]  ? kmem_cache_alloc+0xea/0x510
[36956.643746]  ? __sbitmap_queue_get+0x24/0x90
[36956.643748]  ? wait_woken+0x80/0x80
[36956.643751]  ? mempool_alloc+0x55/0x160
[36956.643753]  ? blk_mq_run_hw_queue+0x10/0x10
[36956.643754]  blk_mq_dispatch_rq_list+0x194/0x300
[36956.643757]  blk_mq_sched_dispatch_requests+0x11c/0x1d0
[36956.643759]  __blk_mq_delay_run_hw_queue+0x73/0x80
[36956.643761]  blk_insert_cloned_request+0x9a/0x1e0
[36956.643766]  map_request+0xc1/0x1d0 [dm_mod]
[36956.643770]  ? kthread_create_worker_on_cpu+0x50/0x50
[36956.643773]  map_tio_request+0x12/0x30 [dm_mod]
[36956.643775]  kthread_worker_fn+0xe5/0x190
[36956.643778]  kthread+0x11a/0x130
[36956.643779]  ? kthread_create_on_node+0x40/0x40
[36956.643782]  ? do_syscall_64+0x7b/0x150
[36956.643785]  ? SyS_exit_group+0x10/0x10
[36956.643787]  ret_from_fork+0x1f/0x40
[36956.643789] Code: 5b 89 c8 83 ce 10 c1 e0 10 09 c2 83 f9 04 0f 87 b1 fe ff ff 8b 45 58 48 8b 7d 30 c1 e8 09 48 01 84 cf 20 08 00 00 e9 9a fe ff ff <0f> 0b 4c 89 c7 41 bd 0a 00 00 00 e8 f3 66 fa e0 e9 9d fc ff ff
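For reference, a minimal sketch (not part of the original report) of the commands typically used to capture the state of this kind of stack, md RAID1 on top of dm-multipath on NVMe, when the array goes read-only. /dev/nvme1n1 is taken from the trace above; smartctl (smartmontools) and nvme (nvme-cli) are assumptions about installed tools:

```
# Kernel messages around the failure (I/O errors, multipath path failures, md events)
dmesg | grep -E 'print_req_error|nvme|multipath|md/raid1'

# md RAID state
cat /proc/mdstat

# dm-multipath maps and the NVMe paths behind them
multipath -ll
dmsetup ls --tree

# Health of the NVMe namespace that reported the I/O error
smartctl -a /dev/nvme1n1          # smartmontools
nvme smart-log /dev/nvme1n1       # nvme-cli
```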
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c1
--- Comment #1 from Markus Zimmermann
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c2
--- Comment #2 from Markus Zimmermann
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c3
--- Comment #3 from Markus Zimmermann
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c4
--- Comment #4 from Markus Zimmermann
Hannes Reinecke
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c5
--- Comment #5 from Coly Li
> Interestingly enough, the problem occurred last Sunday->Monday but not the Sunday->Monday before.
> This could also be triggered by some mdadm cronjob which would run in the same timeframe... right?
Hi Markus,

Do you use multipath? (I see some multipath-related symbols show up.) I see there are md raid1 errors, but they seem to come from underlying I/O errors. So let me confirm that first.

Thanks.

Coly Li
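A minimal sketch, not from the comment itself, of how both open points could be checked: whether dm-multipath is actually active underneath the md RAID1, and whether a periodic md check/resync was running in that timeframe. All of the commands below are standard tools; nothing here is specific to this bug:

```
# Is the multipath daemon running at all?
systemctl status multipathd

# Which multipath maps exist, and which NVMe paths back them?
multipath -ll

# The whole block-device stack: nvme -> dm (multipath) -> partitions -> md
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT

# md RAID state, and whether a periodic check/resync is running right now
cat /proc/mdstat
cat /sys/block/md*/md/sync_action
```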
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c6
--- Comment #6 from Markus Zimmermann
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c7
--- Comment #7 from Coly Li
> Thanks for looking into this.
>
> - `service multipathd status` tells me that this service has been inactive for about two weeks, which is interesting since the crash on this device did not happen this weekend but the weekend before that.
>
> - `multipath -ll` gives the following output:

```
eui.00080d02001b6dd8 dm-0 NVME,THNSN5512GPU7 TOSHIBA
size=477G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 0:0:1:1 nvme0n1 259:0 active ready running
eui.00080d02001b6e33 dm-1 NVME,THNSN5512GPU7 TOSHIBA
size=477G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 1:0:1:1 nvme1n1 259:1 active ready running
```

> - `dmsetup ls --tree`

```
eui.00080d02001b6e33-part3 (254:7)
 └─eui.00080d02001b6e33 (254:1)
    └─ (259:1)
eui.00080d02001b6e33-part2 (254:6)
 └─eui.00080d02001b6e33 (254:1)
    └─ (259:1)
eui.00080d02001b6e33-part1 (254:5)
 └─eui.00080d02001b6e33 (254:1)
    └─ (259:1)
eui.00080d02001b6dd8-part3 (254:4)
 └─eui.00080d02001b6dd8 (254:0)
    └─ (259:0)
eui.00080d02001b6dd8-part2 (254:3)
 └─eui.00080d02001b6dd8 (254:0)
    └─ (259:0)
eui.00080d02001b6dd8-part1 (254:2)
 └─eui.00080d02001b6dd8 (254:0)
    └─ (259:0)
```

> Was this the information you were looking for?
It seems multipath is used here. Do you really deploy dm multipath on your system? I suspect something is wrong in dm multipath, but I am not sure.

Thanks.

Coly Li
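A possible follow-up, sketched here as an assumption rather than something suggested in the thread: if dm multipath is not actually needed for these single-path NVMe disks, it can be taken out of the picture by blacklisting the NVMe devices in /etc/multipath.conf and rebuilding the initrd. The md RAID1 would then assemble directly from the nvme partitions instead of the multipath maps, so this needs careful testing:

```
# /etc/multipath.conf -- keep multipathd away from the (single-path) NVMe namespaces
cat >> /etc/multipath.conf <<'EOF'
blacklist {
    devnode "^nvme.*"
}
EOF

# Rebuild the initrd so the blacklist also applies during early boot
dracut -f

# If nothing else needs multipathing, stop and disable the daemon
systemctl disable --now multipathd

# After a reboot, verify that the md array assembled directly from the nvme partitions
cat /proc/mdstat
```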
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c8
--- Comment #8 from Coly Li
http://bugzilla.suse.com/show_bug.cgi?id=1106005#c9
--- Comment #9 from Markus Zimmermann