[kernel-bugs] [Bug 1175893] New: question: load insanely high
https://bugzilla.suse.com/show_bug.cgi?id=1175893

            Bug ID: 1175893
           Summary: question: load insanely high
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: aarch64
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: ro@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

obs-arm-1:~ # uname -a
Linux obs-arm-1 5.8.2-1-default #1 SMP Wed Aug 19 09:43:15 UTC 2020 (71b519a)
aarch64 aarch64 aarch64 GNU/Linux
obs-arm-1:~ # cat /proc/loadavg
848.03 848.11 847.01 8/660 35750
obs-arm-1:~ # ps awux | wc -l
526
obs-arm-1:~ # pstree -apl | wc -l
252

So I have 526 processes and a load of about 850 ... what am I missing here?

top - 13:00:18 up 1 day, 17:02, 1 user, load average: 845.34, 847.11, 846.76
Tasks: 524 total, 1 running, 523 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.8 us, 0.3 sy, 0.0 ni, 86.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 257195.5+total, 13658.55+free, 30579.12+used, 212957.8+buff/cache
MiB Swap: 1000.066 total, 956.066 free, 44.000 used. 223341.5+avail Mem

  PID USER  PR  NI    VIRT    RES   SHR S  %CPU  %MEM    TIME+ COMMAND
18257 qemu  20   0 9553800 4.102g 22944 S 113.2 1.633  1317:44 qemu-system-aar
21778 qemu  20   0 9689056 4.106g 22956 S 107.2 1.635  1328:27 qemu-system-aar
 9249 qemu  20   0 9566116 4.106g 23208 S 102.0 1.635  1318:27 qemu-system-aar
23827 qemu  20   0 9424236 4.059g 23048 S 102.0 1.616 23:30.35 qemu-system-aar
18143 qemu  20   0 9560948 4.104g 22880 S 100.7 1.634  1316:43 qemu-system-aar
22072 qemu  20   0 9465680 4.098g 23052 S 100.7 1.631  1329:10 qemu-system-aar
36915 root  20   0    9428   3852  2900 R 1.316 0.001  0:00.16 top
41766 qemu  20   0 9475952 4.099g 22728 S 0.658 1.632 41:10.28 qemu-system-aar
 6601 root  20   0   31536  23576  7452 S 0.329 0.009  0:20.33 bs_worker
    1 root  20   0  172424  13108  9568 S 0.000 0.005  3:31.34 systemd
    2 root  20   0       0      0     0 S 0.000 0.000  0:00.99 kthreadd
    3 root   0 -20       0      0     0 I 0.000 0.000  0:00.00 rcu_gp
    4 root   0 -20       0      0     0 I 0.000 0.000  0:00.00 rcu_par_gp
[... skipping the rest of zeroes ...]

8 VMs with about 24 qemu threads each, so I would not be surprised by a load
of ~200, but why are we seeing > 800 now?
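A quick way to sanity-check numbers like these (an editorial sketch, not part
of the original report; assumes standard procps tools): the kernel computes
the load average over tasks (threads) in runnable or uninterruptible state,
not over processes, so "ps awux | wc -l" undercounts what the scheduler sees.

    # count kernel tasks (threads) by state; R = runnable and
    # D = uninterruptible both contribute to the load average
    ps -eLo state= | sort | uniq -c

    # the 4th field of /proc/loadavg is runnable-tasks/total-tasks as the
    # kernel counts them (here "8/660": 660 tasks vs. 526 processes)
    cat /proc/loadavg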
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c1

--- Comment #1 from Ruediger Oertel <ro@suse.com> ---
Looks as if it were leaking processes ...

obs-arm-1:~ # uname -a
Linux obs-arm-1 5.8.2-1-default #1 SMP Wed Aug 19 09:43:15 UTC 2020 (71b519a)
aarch64 aarch64 aarch64 GNU/Linux
obs-arm-1:~ # cat /proc/loadavg
1050.90 1044.60 1034.32 45/1185 9618
obs-arm-1:~ # ps awux | wc -l
628
obs-arm-1:~ # pstree -apl | wc -l
530
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c2

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tiwai@suse.com

--- Comment #2 from Takashi Iwai <tiwai@suse.com> ---
Is it specific to this machine, or found on other platforms / archs?
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c3

--- Comment #3 from Ruediger Oertel <ro@suse.com> ---
I've seen it on all 4 ThunderX1 systems (obs-arm-[1234]).
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c4

--- Comment #4 from Ruediger Oertel <ro@suse.com> ---
Same on 5.8.4, by the way. It does not happen on the small Mustang boards;
will try on ThunderX2.
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c5

--- Comment #5 from Ruediger Oertel <ro@suse.com> ---
Can't test ThunderX2 yet because of bug#1175054 (the coresight crash; I don't
know a way of turning coresight off completely on the cmdline).
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c6

--- Comment #6 from Ruediger Oertel <ro@suse.com> ---
No issues on power9 or s390x.
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c7

Ruediger Oertel <ro@suse.com> changed:

           What    |Removed                    |Added
----------------------------------------------------------------------------
            Summary|question: load insanely    |load explosion with
                   |high                       |kernel-5.8.X on
                   |                           |cavium/thunderX1

--- Comment #7 from Ruediger Oertel <ro@suse.com> ---
No issues on the same machines with sle15-sp2 (5.3.18-lp152.36-default):

# for i in 1 2 3 4 ; do ssh obs-arm-$i cat /proc/loadavg ; done
85.19 77.67 61.55 95/1105 8237
63.98 48.96 25.18 56/1046 12135
44.91 38.70 20.71 41/1041 28733
45.64 33.35 16.82 44/777 34389
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c8

--- Comment #8 from Guillaume GARDET <guillaume.gardet@arm.com> ---
I have seen performance issues on ThunderX2 with Tumbleweed: openQA tests
(using qemu) are ~2x slower, which would match this bug report. I have not
found the root cause yet.
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c9

--- Comment #9 from Ruediger Oertel <ro@suse.com> ---
Actually I'm not 100% sure this is the same issue. What I was seeing with the
TW kernel was limited to ThunderX1, not X2, and it was just ever-increasing
load levels without the machine "suffering interactively" as I would have
expected from a 4-digit load. It felt more like the kernel's task counter not
being decremented when certain processes finish ... but I have no data to
back this up.
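That hypothesis fits how the load average is computed: roughly every 5
seconds the kernel samples active = nr_running + nr_uninterruptible and folds
it into an exponentially decaying average, so a decrement lost at wakeup
inflates every subsequent sample. A minimal editorial sketch (not from the
report; the 850 is taken from the numbers above) of the decay toward a stuck
task count:

    # simulate the 1-minute load average: load = load*e + active*(1-e),
    # sampled every 5 s, with "active" stuck at 850 leaked tasks
    awk 'BEGIN {
            e = exp(-5/60);                  # 1-min decay factor per 5 s tick
            active = 850; load = 0;
            for (t = 0; t < 720; t++)        # one simulated hour
                    load = load * e + active * (1 - e);
            printf "1-min load converges to ~%.2f\n", load   # -> ~850
    }'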
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c10

--- Comment #10 from Ruediger Oertel <ro@suse.com> ---
Looks like we have now inherited this problem in leap-15.2:

obs-arm-1:~ # uname -a
Linux obs-arm-1 5.3.18-lp152.47-default #1 SMP Thu Oct 15 16:05:25 UTC 2020
(41f7396) aarch64 aarch64 aarch64 GNU/Linux
obs-arm-1:~ # rpm -q qemu
qemu-4.2.1-lp152.9.6.1.aarch64
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c11

--- Comment #11 from Ruediger Oertel <ro@suse.com> ---
Created attachment 842996
  --> https://bugzilla.suse.com/attachment.cgi?id=842996&action=edit
increasing load of
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c12

--- Comment #12 from Ruediger Oertel <ro@suse.com> ---
As a general question: is load > "number of procs" a valid state?

obs-arm-3:~ # cat /proc/loadavg ; ps waux | wc -l
693.77 693.95 694.28 43/973 43433
592
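In principle yes, even on a healthy system: the average is taken over kernel
tasks (threads), and one multi-threaded qemu process contributes a single
line to ps but dozens of tasks to the load. A quick comparison (an editorial
illustration, assuming procps):

    ps ax | wc -l         # processes, roughly what "ps waux | wc -l" shows
    ps -eLo pid= | wc -l  # kernel tasks (threads) - the population the load
                          # is computed over; with 8 VMs x ~24 qemu threads
                          # this is far larger than the process count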
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c13

--- Comment #13 from Ruediger Oertel <ro@suse.com> ---
Looks like the Qualcomm Centriq qdf2400 is also affected: on ibs-centriq-3,
with all workers (qemu processes) stopped, the load stays at > 650 (with the
CPUs being at 0%):

top - 16:37:17 up 6 days, 22:42, 1 user, load average: 668.99, 755.45, 800.31
Tasks: 482 total, 1 running, 481 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 96213.84+total, 1096.566 free, 82666.77+used, 12450.50+buff/cache
MiB Swap: 1003.902 total, 954.340 free, 49.562 used. 12136.80+avail Mem

 PID USER PR  NI   VIRT  RES  SHR S  %CPU  %MEM    TIME+ COMMAND
1104 root 20   0   9100 3808 2828 R 0.660 0.004  0:00.40 top
  10 root 20   0      0    0    0 I 0.330 0.000 10:15.34 rcu_sched
   1 root 20   0 162872 6508 4592 S 0.000 0.007 19:06.77 systemd
   2 root 20   0      0    0    0 S 0.000 0.000  0:09.05 kthreadd
   3 root  0 -20      0    0    0 I 0.000 0.000  0:00.00 rcu_gp
   4 root  0 -20      0    0    0 I 0.000 0.000  0:00.00 rcu_par_gp
   6 root  0 -20      0    0    0 I 0.000 0.000  0:00.00 kworker/0:0H-kblockd
   8 root  0 -20      0    0    0 I 0.000 0.000  0:00.00 mm_percpu_wq
https://bugzilla.suse.com/show_bug.cgi?id=1175893

Ruediger Oertel <ro@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |1178227
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c14

--- Comment #14 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Ruediger Oertel from comment #10)
> Looks like we have now inherited this problem in leap-15.2:
> obs-arm-1:~ # uname -a
> Linux obs-arm-1 5.3.18-lp152.47-default #1 SMP Thu Oct 15 16:05:25 UTC 2020
> (41f7396) aarch64 aarch64 aarch64 GNU/Linux
> obs-arm-1:~ # rpm -q qemu
> qemu-4.2.1-lp152.9.6.1.aarch64

And this didn't appear with the previous kernel (5.3.18-lp152.44)? That would
help narrow down the regression range.
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c16

--- Comment #16 from Ruediger Oertel <ro@suse.com> ---
No, the .44 kernel did not show this problem, only the .47 (and the similar
5.3.18-24.29-default on sle15sp2). Before this I had only seen it with the
5.8+ kernels in Tumbleweed.
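For a regression range like .44 -> .47, one way to confirm on a test box is
to install the older kernel in parallel and pick it from the boot menu (an
editorial sketch; the exact release suffix below is a hypothetical
placeholder, check the repo for the real version string):

    # multiversion installs are the default for kernel-default on openSUSE,
    # so this adds the old kernel alongside the new one
    zypper install --oldpackage kernel-default-5.3.18-lp152.44.1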
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c17

Jiri Slaby <jslaby@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jslaby@suse.com

--- Comment #17 from Jiri Slaby <jslaby@suse.com> ---
(In reply to Ruediger Oertel from comment #16)
> No, the .44 kernel did not show this problem,

914f31e37c30f32baa66f8440141b852519f4c47

> only the .47

41f7396defec6284cc6873246964a0b9e9ca9a6d
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c18

Jiri Slaby <jslaby@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mgorman@suse.com

--- Comment #18 from Jiri Slaby <jslaby@suse.com> ---
(In reply to Jiri Slaby from comment #17)
> (In reply to Ruediger Oertel from comment #16)
> > No, the .44 kernel did not show this problem,
>
> 914f31e37c30f32baa66f8440141b852519f4c47
>
> > only the .47
>
> 41f7396defec6284cc6873246964a0b9e9ca9a6d

I cannot see anything obvious in that range:

git log --oneline --no-merges \
    914f31e37c30f32baa66f8440141b852519f4c47..41f7396defec6284cc6873246964a0b9e9ca9a6d \
    | grep -viE 'scsi|media|crypto|btrfs|rbd|mmc|spi|bluetooth|serial|nfs'

Sched changes seem to be only numa-specific ones by Mel:

30c0b50f mm: call cond_resched() from deferred_init_memmap() (git fixes
         (mm/init), bsc#1177697).
759de680 Delete patches.suse/sched-fair-update_pick_idlest-Select-group-with-
         lowest-group_util-when-idle_cpus-are-equal.patch.
a9a7020e sched/fair: Ignore cache hotness for SMT migration (bnc#1155798
         (CPU scheduler functional and performance backports)).
c9c51db4 sched/numa: Use runnable_avg to classify node (bnc#1155798
         (CPU scheduler functional and performance backports)).
cedbcc9e sched/fair: Use dst group while checking imbalance for NUMA balancer
         (bnc#1155798 (CPU scheduler functional and performance backports)).
aa3fc2ac sched/numa: Avoid creating large imbalances at task creation time
         (bnc#1176588).
576f70b9 sched/numa: Check numa balancing information only when enabled
         (bnc#1176588).

Looking at them, I don't think they could cause this.
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c19

--- Comment #19 from Takashi Iwai <tiwai@suse.com> ---
The culprit is tracked in bug 1178227.
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c20

--- Comment #20 from Mel Gorman <mgorman@suse.com> ---
(In reply to Takashi Iwai from comment #19)
> The culprit is tracked in bug 1178227.

The upstream fix is f97bb5272d9e95d400d6c8643ebb146b3e3e7842.
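To see which stable release picked up a scheduler fix like this (an editorial
sketch, assuming a linux-stable clone with release tags; note that stable
backports get new commit ids, so searching the range is more reliable than
"git tag --contains" on the mainline id):

    # list scheduler changes that went into the 5.9.11 stable release
    git log --oneline v5.9.10..v5.9.11 -- kernel/sched/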
https://bugzilla.suse.com/show_bug.cgi?id=1175893

Mel Gorman <mgorman@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|1178227                     |
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c21

Matthias Brugger <mbrugger@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #21 from Matthias Brugger <mbrugger@suse.com> ---
The fix will be part of 5.9.11 and should be released within the next days.
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c22

--- Comment #22 from Jiri Slaby <jslaby@suse.com> ---
(In reply to Matthias Brugger from comment #21)
> The fix will be part of 5.9.11 and should be released within the next days.

FWIW: https://build.opensuse.org/request/show/850892
https://bugzilla.suse.com/show_bug.cgi?id=1175893#c23

--- Comment #23 from Mel Gorman <mgorman@suse.com> ---
(In reply to Jiri Slaby from comment #22)
> (In reply to Matthias Brugger from comment #21)
> > The fix will be part of 5.9.11 and should be released within the next
> > days.

Thanks. The correct fix is definitely included in 5.9.11. The upstream commit
ec618b84f6e15281cc3660664d34cd0dd2f2579e
(0481a0358d4268e5502a3fcecef4ac6f2668fd26 in stable) is also included. While
it did not crop up in this bug, it also fixes a small anomaly where IO wait
figures (the "wa" column in vmstat) could be higher than expected. I only
mention it in case someone on the cc list is aware of a bug where IO wait
times are too high or higher than expected.