[Bug 1148562] New: BTRFS - watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562 Bug ID: 1148562 Summary: BTRFS - watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153] Classification: openSUSE Product: openSUSE Tumbleweed Version: Current Hardware: aarch64 OS: Other Status: NEW Severity: Normal Priority: P5 - None Component: Kernel Assignee: kernel-maintainers@forge.provo.novell.com Reporter: guillaume.gardet@arm.com QA Contact: qa-bugs@suse.de Found By: --- Blocker: --- While installing Mate desktop in openQA, it triggers a soft lockup in BTRFS. https://openqa.opensuse.org/tests/1019162#step/consoletest_finish/2 Traces: [ 973.697661] watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153] [ 973.864753] Modules linked in: fuse scsi_transport_iscsi af_packet xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iscsi_ibft iscsi_boot_sysfs ip_tables x_tables aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce aes_arm64 sha2_ce sha256_arm64 joydev snd_intel8x0 sha1_ce snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore nls_iso8859_1 nls_cp437 vfat fat btrfs libcrc32c xor xor_neon zlib_deflate hid_generic usbhid sr_mod cdrom raid6_pq xhci_pci virtio_gpu xhci_hcd drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk virtio_scsi usbcore virtio_net net_failover failover drm virtio_mmio sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua [ 973.864843] efivarfs virtio_rng [ 973.878100] CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased) [ 973.879517] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 973.880514] pstate: 40000005 (nZcv daif -PAN -UAO) [ 973.881238] pc : __ll_sc_arch_atomic_sub_return+0x14/0x20 [ 973.882082] lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs] [ 973.883017] sp : ffff00001273b7e0 [ 973.883503] x29: ffff00001273b7e0 x28: 0000000000000000 [ 973.884277] x27: ffffb29bba1fe800 x26: 0000000000000000 [ 973.885051] x25: ffff00001273b9f0 x24: 0000000000000001 [ 973.885824] x23: ffffb29b95ba2948 x22: 0000000000000000 [ 973.886597] x21: ffffb29b95ba29f0 x20: ffffb29b95ba29e0 [ 973.887371] x19: ffff7eca6e5e5300 x18: 0000000000000000 [ 973.888144] x17: 0000000000000000 x16: 0000000000000000 [ 973.888918] x15: 0000000000000000 x14: 0000000000000000 [ 973.889691] x13: 0000000000000000 x12: 0000000000000000 [ 973.908600] x11: 0000000000000000 x10: 0000000000000040 [ 973.909374] x9 : ffffb29bb7f796d0 x8 : ffffb29bb7f796c8 [ 973.910147] x7 : 0000000000000010 x6 : ffffb29bbfbeb2e8 [ 973.910921] x5 : 0000000000000003 x4 : ffff5656ced021d0 [ 973.911700] x3 : ffff5656ceef7478 x2 : dead0000000000ff [ 973.912471] x1 : ffff7eca6e5e5334 x0 : 0000000000000001 [ 973.913246] Call trace: [ 973.913612] __ll_sc_arch_atomic_sub_return+0x14/0x20 [ 973.914382] release_extent_buffer+0xdc/0x120 [btrfs] [ 973.915143] free_extent_buffer.part.0+0xb0/0x118 [btrfs] [ 973.915954] free_extent_buffer+0x24/0x30 [btrfs] [ 973.916664] btrfs_release_path+0x4c/0xa0 [btrfs] [ 973.917374] btrfs_free_path.part.0+0x20/0x40 [btrfs] [ 973.918134] btrfs_free_path+0x24/0x30 [btrfs] [ 973.918806] get_inode_info+0xa8/0xf8 [btrfs] [ 973.919467] finish_inode_if_needed+0xe0/0x6d8 [btrfs] [ 973.920241] changed_cb+0x9c/0x410 [btrfs] [ 973.920866] btrfs_compare_trees+0x284/0x648 [btrfs] [ 973.921621] send_subvol+0x33c/0x520 [btrfs] [ 973.922270] btrfs_ioctl_send+0x8a0/0xaf0 [btrfs] [ 973.922981] btrfs_ioctl+0x199c/0x2288 [btrfs] [ 973.923640] do_vfs_ioctl+0x4b0/0x820 [ 973.924180] ksys_ioctl+0x84/0xb8 [ 973.944409] __arm64_sys_ioctl+0x28/0x38 [ 973.945959] el0_svc_common.constprop.0+0x7c/0x188 [ 973.947575] el0_svc_handler+0x34/0x90 [ 973.949006] el0_svc+0x8/0xc -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
Guillaume GARDET
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
Jeff Mahoney
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c1
--- Comment #1 from Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c2
--- Comment #2 from Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c3
--- Comment #3 from Guillaume GARDET
I assume the machine is some old spinning rust so perhaps it's triggering because in btrfs_compare_trees we do not have a cond_resched. If the issue is reliably reproducible can you try the attached patch ?
This is in openQA, so a QEMU VM. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c4
--- Comment #4 from Guillaume GARDET
(In reply to Nikolay Borisov from comment #2)
I assume the machine is some old spinning rust so perhaps it's triggering because in btrfs_compare_trees we do not have a cond_resched. If the issue is reliably reproducible can you try the attached patch ?
This is in openQA, so a QEMU VM.
And the host machine uses an SSD. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c5
--- Comment #5 from Guillaume GARDET
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c6
--- Comment #6 from Nikolay Borisov
By reducing the qemu snapshot saving time (by increasing QEMU_COMPRESS_THREADS from 3 to 5), the problem is gone.
So, it looks like a qemu problem. Not sure if we could have it in real life? @Nicolay, what do you think?
Be that is it may it doesn't preclude a "bug" in btrfs' code. Essentially the softlockup detector kicks in when it detects a particular CPU hasn't yielded for 24 seconds ( or whatever is the configured threshold). It's a fact that the code dealing with send with a parent snapshot involved doesn't contain a cond_resched so in case where large trees have to be compared the softlockup detector can kick in. For multicore systems this is not really an issue - it just means a particular CPU will be hogged for as long as tree comparison lasts. By increasing the the QEMU_COMPRESS_THREADS you make the comparison faster so the soft lock up doesn't trigger. But according to the kernel programming paradigms the softlock detector shouldn't have kicked in the first place. Given this then I consider the patch valid and needed. So I'd still like you to test it with QEMU_COMPRESS_THREADS set to 3 or even 1-2 so that we are sure even if the machine is dead-slow the code is acting appropriately. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c7
--- Comment #7 from Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c8
--- Comment #8 from Guillaume GARDET
Are you guys willing to test that specific patch? If not it has already landed in the upstream kernel so if there is no movement on this bug I will close it with 'UPSTREAM' in a week.
I cannot test easily another kernel, sorry. The patch is already in kernel 5.4-rc1: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.4-rc1&id=6af112b11a4bc1b560f60a618ac9c1dcefe9836e -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562
http://bugzilla.opensuse.org/show_bug.cgi?id=1148562#c9
Guillaume GARDET
participants (1)
-
bugzilla_noreply@novell.com