[Bug 1198972] New: btrfs send slow
http://bugzilla.opensuse.org/show_bug.cgi?id=1198972

            Bug ID: 1198972
           Summary: btrfs send slow
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 15.3
          Hardware: x86-64
                OS: openSUSE Leap 15.3
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: jon.brightwell@zpg.co.uk
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

We have a btrfs send that starts fast (~25 MiB/s) but quickly slows to a crawl (~100 KiB/s).

Example command:

/sbin/btrfs send -v "/sqldata/central/.snapshots/112/snapshot" | pv -btraT -B50m | ssh backup3.repl.x.net sudo btrfs receive "/backup/dbshard1a.prod.x.net/central/112"

At subvol /sqldata/central/.snapshots/112/snapshot
At subvol snapshot
 777MiB {----} 0:02:55 [ 130KiB/s] [4.44MiB/s]

It slows down after about 400 MiB. I gave pv a buffer (-B50m) to check whether the network was the bottleneck, but the buffer never fills. htop doesn't show any kernel threads struggling, and dstat and iostat don't show much. It's a live DB server, but there is plenty of I/O capacity and we have 5 identical servers that cap out the network.

# iostat -x 5
Linux 5.3.18-150300.59.54-default (dbshard1a.prod.x.net)  04/28/22  _x86_64_  (48 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.22    0.00    1.27    0.09    0.00   96.43

Device    r/s     w/s     rkB/s    wkB/s     rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
nvme3n1   58.24   441.63   791.34  8244.19    5.67   51.31    8.88  10.41     0.18     0.10    0.06     13.59     18.67   0.55  27.45
nvme0n1   86.05   441.56  1379.26  8244.25    5.18   51.38    5.67  10.42     0.14     0.10    0.06     16.03     18.67   0.53  28.00
nvme1n1   58.27   441.53   791.50  8245.03    5.67   51.32    8.87  10.41     0.18     0.10    0.06     13.58     18.67   0.55  27.40
nvme2n1   86.08   441.46  1379.37  8244.97    5.18   51.38    5.67  10.43     0.14     0.10    0.06     16.02     18.68   0.53  28.04
sda       11.45    14.98   712.51    79.42    0.45    0.08    3.79   0.52     4.37     0.15    0.05     62.20      5.30   1.07   2.84
sdb       11.30    14.98   699.35    79.42    0.44    0.08    3.72   0.52     4.44     0.15    0.05     61.90      5.30   1.08   2.83

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.85    0.00    3.40    0.12    0.00   92.63

Device    r/s     w/s     rkB/s    wkB/s     rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
nvme3n1   88.80   867.80  1016.00  11651.20   0.00   68.00    0.00   7.27     0.10     0.08    0.07     11.44     13.43   0.51  48.56
nvme0n1   66.00   867.40   728.80  11651.20   0.00   68.40    0.00   7.31     0.10     0.08    0.07     11.04     13.43   0.52  48.88
nvme1n1   93.00   863.00  1048.00  11732.80   0.00   72.00    0.00   7.70     0.10     0.08    0.08     11.27     13.60   0.51  48.64
nvme2n1   65.20   862.80   768.80  11732.80   0.00   72.20    0.00   7.72     0.18     0.08    0.08     11.79     13.60   0.53  48.72
sda       41.00    72.20  3498.40    445.10   0.00    0.00    0.00   0.00     0.85     0.11    0.05     85.33      6.16   1.28  14.48
sdb       36.60    72.20  2855.20    445.10   0.00    0.00    0.00   0.00     0.79     0.10    0.04     78.01      6.16   1.31  14.24

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.43    0.00    3.30    0.12    0.00   92.15

Device    r/s     w/s     rkB/s    wkB/s     rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
nvme3n1   70.80   761.20   769.60  10829.60   0.00   63.40    0.00   7.69     0.10     0.09    0.08     10.87     14.23   0.50  41.92
nvme0n1   92.20   761.00   869.60  10829.60   0.00   63.60    0.00   7.71     0.09     0.09    0.08      9.43     14.23   0.52  44.64
nvme1n1   68.80   751.20   767.20  10528.80   0.00   61.80    0.00   7.60     0.10     0.13    0.10     11.15     14.02   0.51  42.08
nvme2n1   91.00   751.00   931.20  10528.80   0.00   62.00    0.00   7.63     0.09     0.13    0.11     10.23     14.02   0.53  44.56
sda       14.00    46.60  1089.60    297.40   0.00    0.20    0.00   0.43     0.69     0.11    0.02     77.83      6.38   1.48   8.96
sdb       10.00    46.60   552.80    297.40   0.00    0.20    0.00   0.43     0.66     0.11    0.01     55.28      6.38   1.55   8.80

# dstat 5
You did not select any stats, using -cdngy by default.
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
  2   1  96   0   0|5753k   32M|   0     0 |  86B  376B|  17k   14k
  4   3  92   0   0|8436k   95M|2705k 3862k|   0     0 |  39k   29k
  5   3  93   0   0|2510k   34M|2119k 3624k| 819B    0 |  30k   22k
  3   3  94   0   0|2274k   39M|3914k 4882k|   0     0 |  37k   27k
  3   3  94   0   0|1385k   34M|2466k 3345k|   0     0 |  32k   22k

# ps -aux | grep send
root      7557 94.8  0.0  24572   172 pts/1  Rl+  11:40   8:06  /sbin/btrfs send -v /sqldata/central/.snapshots/112/snapshot

The send process does seem to be maxing out one core.

# ps -auxf | grep "btrfs"
root      1679  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-worker]
root      1680  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-worker-hi]
root      1681  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-delalloc]
root      1682  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-flush_del]
root      1683  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-cache]
root      1684  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-submit]
root      1685  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-fixup]
root      1686  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-endio]
root      1687  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-endio-met]
root      1688  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-endio-met]
root      1689  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-endio-rai]
root      1690  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-endio-rep]
root      1691  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-rmw]
root      1692  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-endio-wri]
root      1693  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-freespace]
root      1694  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-delayed-m]
root      1695  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-readahead]
root      1696  0.0  0.0      0     0 ?      I<   Mar19   0:00  \_ [btrfs-qgroup-re]
root      1716  0.0  0.0      0     0 ?      S    Mar19   0:01  \_ [btrfs-cleaner]
root      1717  1.5  0.0      0     0 ?      S    Mar19 872:22  \_ [btrfs-transacti]
root     22056  0.8  0.0      0     0 ?      I    05:23   3:28  \_ [kworker/u96:10-btrfs-endio-write]
root     28815  0.0  0.0      0     0 ?      I<   06:41   0:00  \_ [kworker/u97:2-btrfs-worker-high]
root     43020  1.5  0.0      0     0 ?      I    09:22   2:22  \_ [kworker/u96:12-btrfs-endio-meta]
root     44734  1.5  0.0      0     0 ?      I    09:41   2:05  \_ [kworker/u96:2-btrfs-endio-meta]
root     45641  0.0  0.0      0     0 ?      I<   09:51   0:00  \_ [kworker/u97:6-btrfs-worker-high]
root       578  0.0  0.0      0     0 ?      I<   10:34   0:00  \_ [kworker/u97:1-btrfs-worker-high]
root      3129  0.0  0.0      0     0 ?      I<   10:59   0:00  \_ [kworker/u97:3-btrfs-worker-high]
root      4401  4.2  0.0      0     0 ?      I    11:06   1:56  \_ [kworker/u96:8-btrfs-endio]
root      5660  0.0  0.0      0     0 ?      I<   11:20   0:00  \_ [kworker/u97:4-btrfs-worker-high]
root      6567  7.8  0.0      0     0 ?      I    11:30   1:45  \_ [kworker/u96:14-btrfs-endio-meta]
root      6570  7.6  0.0      0     0 ?      I    11:30   1:42  \_ [kworker/u96:20-btrfs-endio-meta]
root      6574  4.6  0.0      0     0 ?      I    11:30   1:01  \_ [kworker/u96:24-btrfs-endio-meta]
root      6584  8.2  0.0      0     0 ?      I    11:30   1:49  \_ [kworker/u96:25-btrfs-endio-write]
root      6679  0.0  0.0      0     0 ?      I<   11:31   0:00  \_ [kworker/u97:5-btrfs-worker-high]
root      6681  7.4  0.0      0     0 ?      I    11:31   1:34  \_ [kworker/u96:27-btrfs-endio-meta]
root      7499  1.7  0.0      0     0 ?      I    11:40   0:13  \_ [kworker/u96:1-btrfs-endio-meta]
root      7500  2.7  0.0      0     0 ?      I    11:40   0:21  \_ [kworker/u96:3-btrfs-endio]
root      8460  0.1  0.0      0     0 ?      I    11:50   0:00  \_ [kworker/u96:4-btrfs-endio-meta]
# iotop
  PID  PRIO  USER       DISK READ    DISK WRITE  SWAPIN     IO     COMMAND
 7557  be/4  root       190.55 K/s     0.00 B/s  0.00 %   0.13 %   btrfs send -v /sqldata/central/.snapshots/112/snapshot
 6570  be/4  root         9.30 K/s    18.59 K/s  0.00 %   0.01 %   [kworker/u96:20-bond0]
12922  be/4  mysql     1692.48 K/s    16.67 M/s  0.00 %   0.01 %   mariadbd --defaults-file=/etc/my.appdata.cnf
 7499  be/4  root         6.20 K/s     9.30 K/s  0.00 %   0.01 %   [kworker/u96:1-btrfs-endio-write]
22056  be/4  root         3.10 K/s    15.49 K/s  0.00 %   0.00 %   [kworker/u96:10-bond0]
 7500  be/4  root         3.10 K/s     3.10 K/s  0.00 %   0.00 %   [kworker/u96:3-btrfs-endio-write]
12761  be/4  mysql       24.79 K/s  1332.29 K/s  0.00 %   0.00 %   mariadbd --defaults-file=/etc/my.central.cnf
 6359  be/4  root         0.00 B/s     0.00 B/s  0.00 %   0.00 %   [kworker/12:1-events]
 9123  be/4  root         0.00 B/s     0.00 B/s  0.00 %   0.00 %   [kworker/19:0-events]
 3015  be/4  dd-agent     0.00 B/s     3.87 K/s  0.00 %   0.00 %   agent run -p /opt/datadog-agent/run/agent.pid
 6567  be/4  root         0.00 B/s    12.39 K/s  0.00 %   0.00 %   [kworker/u96:14-btrfs-endio-write]
 6584  be/4  root         0.00 B/s     9.30 K/s  0.00 %   0.00 %   [kworker/u96:25-btrfs-endio-write]

What diagnostics are recommended to work out why this one server is so slow and possibly hitting a bug? :) A scrub comes back clean, but we can't take the server offline to run a btrfs check. There are no errors in dmesg or the journal.
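One quick check that takes ssh and the network out of the picture is to run the send locally and throw the stream away; if the local-only stream crawls in the same way, the bottleneck is on the sending side. A minimal sketch, reusing the snapshot path from above:

    /sbin/btrfs send "/sqldata/central/.snapshots/112/snapshot" | pv -btraT > /dev/null

Since the send process is pegging one core, it is also worth sampling its kernel stack a few times to see where it is sitting (7557 is the send PID from the ps output above; it will differ per run):

    # Print the send process's kernel stack five times, two seconds apart
    for i in 1 2 3 4 5; do cat /proc/7557/stack; echo '----'; sleep 2; done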
http://bugzilla.opensuse.org/show_bug.cgi?id=1198972#c1

--- Comment #1 from Jon Brightwell <jon.brightwell@zpg.co.uk> ---
# perf top -p <pid>  (4000 Hz)

 29.23%  [kernel]  [k] read_extent_buffer
 15.74%  [kernel]  [k] trace_event_type_funcs_svcrdma_encode_read+0x7fffff3a5ebe
 12.64%  [kernel]  [k] map_private_extent_buffer
 12.32%  [kernel]  [k] nf_tables_fill_gen_info.isra.55
  9.48%  [kernel]  [k] print_fmt_xprtrdma_rxprt+0x7fffff3a5ee8
  3.97%  [kernel]  [k] __radix_tree_lookup
  2.24%  [kernel]  [k] mark_page_accessed
  1.50%  [kernel]  [k] generic_bin_search.constprop.41
  1.46%  [kernel]  [k] print_fmt_xprtrdma_rdch_event+0x7fffff3a5f0f
  1.33%  [kernel]  [k] rpcrdma_ep_disconnect
  1.08%  [kernel]  [k] btrfs_comp_cpu_keys
  0.92%  [kernel]  [k] find_extent_buffer
  0.76%  [kernel]  [k] nf_tables_getobj
  0.76%  [kernel]  [k] trace_event_type_funcs_svcrdma_wc_receive+0x7fffff3a5eb1
  0.73%  [kernel]  [k] trace_event_type_funcs_svcrdma_error_event+0x7fffff3a5ec5
  0.73%  [kernel]  [k] trace_event_type_funcs_svcrdma_dma_map_rwctx+0x7fffff3a5eab
  0.72%  [kernel]  [k] print_fmt_xprtrdma_connect_class+0x7fffff3a5ef2
  0.70%  [kernel]  [k] trace_event_type_funcs_svcrdma_post_recv+0x7fffff3a5ea8
  0.60%  [kernel]  [k] read_hpet
  0.50%  [kernel]  [k] btrfs_search_slot
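The profile above is dominated by read_extent_buffer and map_private_extent_buffer, i.e. by walking btrfs metadata in memory rather than by device I/O (the svcrdma/nf_tables entries with huge +0x7fffff... offsets look like mis-resolved symbols). A call graph would show which part of the send path is doing all that tree reading. A minimal sketch, assuming perf and Brendan Gregg's FlameGraph scripts are available and 7557 is the current send PID:

    # Capture ~60 s of call-graph samples from just the send process
    perf record -F 99 -g -p 7557 -- sleep 60
    perf report --stdio | head -60

    # Optional flamegraph rendering, assuming stackcollapse-perf.pl and
    # flamegraph.pl from the FlameGraph repository are on PATH
    perf script | stackcollapse-perf.pl | flamegraph.pl > btrfs-send.svg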
http://bugzilla.opensuse.org/show_bug.cgi?id=1198972#c2

--- Comment #2 from Jon Brightwell <jon.brightwell@zpg.co.uk> ---
Left it running overnight and it is down to 23 KiB/s...

At subvol snapshot
6.64GiB {----} 21:42:48 [23.3KiB/s] [89.1KiB/s]

 29.16%  /proc/kcore  0x7fffc0a45553  k [k] read_extent_buffer
 14.69%  /proc/kcore  0x7fffc0a3a3ea  k [k] trace_event_type_funcs_svcrdma_encode_read+0x7fffff3a5ebe
 13.01%  /proc/kcore  0x7fffc0a45e3e  k [k] map_private_extent_buffer
 12.58%  /proc/kcore  0x7fffc0a8a6eb  k [k] nf_tables_fill_gen_info.isra.55
  9.31%  /proc/kcore  0x7fffc0a3a16b  k [k] print_fmt_xprtrdma_rxprt+0x7fffff3a5ebf
  3.97%  /proc/kcore  0x7fff94f0c6ed  k [k] __radix_tree_lookup
  2.04%  /proc/kcore  0x7fff94841159  k [k] mark_page_accessed
  1.41%  /proc/kcore  0x7fffc0a3a08d  k [k] print_fmt_xprtrdma_rdch_event+0x7fffff3a5f21
  1.38%  /proc/kcore  0x7fffc0a131ee  k [k] rpcrdma_ep_disconnect
  1.36%  /proc/kcore  0x7fffc09fb9fd  k [k] generic_bin_search.constprop.41
  1.16%  /proc/kcore  0x7fff9467a567  k [k] read_hpet
  1.09%  /proc/kcore  0x7fffc09fb8ed  k [k] btrfs_comp_cpu_keys
  1.01%  /proc/kcore  0x7fffc0a44758  k [k] find_extent_buffer
  0.83%  /proc/kcore  0x7fffc0a8a64a  k [k] nf_tables_getobj
  0.79%  /proc/kcore  0x7fffc0a3a3d1  k [k] trace_event_type_funcs_svcrdma_error_event+0x7fffff3a5ec5
  0.77%  /proc/kcore  0x7fffc0a3a2e3  k [k] trace_event_type_funcs_svcrdma_wc_receive+0x7fffff3a5eb7
  0.72%  /proc/kcore  0x7fffc0a3a2f4  k [k] trace_event_type_funcs_svcrdma_post_recv+0x7fffff3a5ea8
  0.70%  /proc/kcore  0x7fffc0a3a377  k [k] trace_event_type_funcs_svcrdma_dma_map_rwctx+0x7fffff3a5eab
  0.65%  /proc/kcore  0x7fffc0a3a11e  k [k] print_fmt_xprtrdma_connect_class+0x7fffff3a5ef2

Not much has changed. Would a full flamegraph/call graph help?

The filesystem setup is 4x NVMe in btrfs RAID 10:

root@dbshard1a.prod:~ # btrfs filesystem usage /sqldata
Overall:
    Device size:                  13.97TiB
    Device allocated:              4.04TiB
    Device unallocated:            9.93TiB
    Device missing:                  0.00B
    Used:                          3.42TiB
    Free (estimated):              5.20TiB      (min: 5.20TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID10: Size:1.79TiB, Used:1.56TiB
   /dev/nvme0n1  917.00GiB
   /dev/nvme1n1  917.00GiB
   /dev/nvme2n1  917.00GiB
   /dev/nvme3n1  917.00GiB

Metadata,RAID10: Size:237.00GiB, Used:158.95GiB
   /dev/nvme0n1  118.50GiB
   /dev/nvme1n1  118.50GiB
   /dev/nvme2n1  118.50GiB
   /dev/nvme3n1  118.50GiB

System,RAID10: Size:16.00MiB, Used:272.00KiB
   /dev/nvme0n1    8.00MiB
   /dev/nvme1n1    8.00MiB
   /dev/nvme2n1    8.00MiB
   /dev/nvme3n1    8.00MiB

Unallocated:
   /dev/nvme0n1    2.48TiB
   /dev/nvme1n1    2.48TiB
   /dev/nvme2n1    2.48TiB
   /dev/nvme3n1    2.48TiB
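With roughly 159 GiB of metadata in use, a send that spends its time in read_extent_buffer is at least consistent with walking a very large number of extent and backref items; database files rewritten in place under copy-on-write tend to accumulate a very large number of small extents. A hedged way to check, with illustrative paths (the InnoDB file pattern is an assumption about the layout under the snapshot):

    # Extent counts of the larger files inside the snapshot, highest last
    filefrag /sqldata/central/.snapshots/112/snapshot/*/*.ibd 2>/dev/null | sort -t: -k2 -n | tail

    # On-disk vs shared usage summary for the whole snapshot
    btrfs filesystem du -s /sqldata/central/.snapshots/112/snapshot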
http://bugzilla.opensuse.org/show_bug.cgi?id=1198972#c3

--- Comment #3 from Jon Brightwell <jon.brightwell@zpg.co.uk> ---
I left it running whilst I went on holiday and came back to:

BTRFS_IOC_SEND returned 0                         [ 499KiB/s] [98.7KiB/s]
joining genl thread
63.9GiB { 0%} 188:39:59 [98.7KiB/s] [98.7KiB/s]
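That works out to 63.9 GiB in roughly 188 hours, i.e. under 100 KiB/s on average. Independent of the root cause, if these snapshots are taken regularly, an incremental send against the previous snapshot would normally transfer far less data. A sketch, assuming hypothetically that snapshot 111 is the previous one and already exists, received via btrfs receive, on the backup host:

    /sbin/btrfs send -p /sqldata/central/.snapshots/111/snapshot \
        /sqldata/central/.snapshots/112/snapshot \
      | pv -btraT \
      | ssh backup3.repl.x.net sudo btrfs receive /backup/dbshard1a.prod.x.net/central/112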
http://bugzilla.opensuse.org/show_bug.cgi?id=1198972#c4

Ian Christian <ian.christian@zoopla.co.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ian.christian@zoopla.co.uk

--- Comment #4 from Ian Christian <ian.christian@zoopla.co.uk> ---
Further to this, the problem can be reduced to just the send, taking the SSH connection out of scope.

Again, the first chunk of data moves quickly; 34 s in, 806 MiB has been written:

806MiB 0:00:34 [93.9KiB/s]

However, things then slow down significantly; while there are occasional bursts of speed, generally speaking it sits in the 100-200 KiB/s range. After 4 minutes, only about 200 MiB of extra data has been written:

989MiB 0:04:04 [ 144KiB/s]

Here are the details on the devices:

root@dbshard1a.prod:~ # btrfs device stats /sqldata/
[/dev/nvme0n1].write_io_errs    0
[/dev/nvme0n1].read_io_errs     0
[/dev/nvme0n1].flush_io_errs    0
[/dev/nvme0n1].corruption_errs  0
[/dev/nvme0n1].generation_errs  0
[/dev/nvme1n1].write_io_errs    0
[/dev/nvme1n1].read_io_errs     0
[/dev/nvme1n1].flush_io_errs    0
[/dev/nvme1n1].corruption_errs  0
[/dev/nvme1n1].generation_errs  0
[/dev/nvme2n1].write_io_errs    0
[/dev/nvme2n1].read_io_errs     0
[/dev/nvme2n1].flush_io_errs    0
[/dev/nvme2n1].corruption_errs  0
[/dev/nvme2n1].generation_errs  0
[/dev/nvme3n1].write_io_errs    0
[/dev/nvme3n1].read_io_errs     0
[/dev/nvme3n1].flush_io_errs    0
[/dev/nvme3n1].corruption_errs  0
[/dev/nvme3n1].generation_errs  0

root@dbshard1a.prod:~ # btrfs device usage /sqldata/
/dev/nvme0n1, ID: 1
   Device size:             3.49TiB
   Device slack:              0.00B
   Data,RAID10:           751.00GiB
   Metadata,RAID10:       187.00GiB
   System,RAID10:          32.00MiB
   Unallocated:             2.58TiB

/dev/nvme1n1, ID: 2
   Device size:             3.49TiB
   Device slack:              0.00B
   Data,RAID10:           751.00GiB
   Metadata,RAID10:       187.00GiB
   System,RAID10:          32.00MiB
   Unallocated:             2.58TiB

/dev/nvme2n1, ID: 3
   Device size:             3.49TiB
   Device slack:              0.00B
   Data,RAID10:           751.00GiB
   Metadata,RAID10:       187.00GiB
   System,RAID10:          32.00MiB
   Unallocated:             2.58TiB

/dev/nvme3n1, ID: 4
   Device size:             3.49TiB
   Device slack:              0.00B
   Data,RAID10:           751.00GiB
   Metadata,RAID10:       187.00GiB
   System,RAID10:          32.00MiB
   Unallocated:             2.58TiB

Any suggestions on how to diagnose this speed issue would be much appreciated!
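Two more checks that could narrow this down: a metadata-only send shows whether the time goes into tree walking rather than reading file data, and a syscall summary shows whether the send process is stalling on pipe writes or spending its time inside the send ioctl. A hedged sketch; the PID is whatever the currently running send reports:

    # Metadata-only stream to /dev/null; if this also crawls, the cost is in
    # enumerating metadata/backrefs, not in reading data extents
    /sbin/btrfs send --no-data /sqldata/central/.snapshots/112/snapshot | pv -btraT > /dev/null

    # Syscall summary of a running send; interrupt with Ctrl-C after ~30 s
    strace -c -p <send-pid>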