[Bug 1169774] New: Slow down in OBS since kernel 5.6.0 on 32bit
http://bugzilla.suse.com/show_bug.cgi?id=1169774

Bug ID: 1169774
Summary: Slow down in OBS since kernel 5.6.0 on 32bit
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: adrian.schroeter@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

We noticed an increase of build time by a factor of 5 on some OBS workers. Example:

# eosc jobhistory openSUSE:Factory installation-images -M openSUSE standard i586

The goat & sheep systems suddenly need 2.5h instead of 0.5h build time for the same package. This is 32bit/i586 only.

These systems used to be fine before; it seems the guest kernel update to 5.6.0 triggered it (the timing matches). The slowed-down goat and sheep systems are AMD Epyc 3.01/3.02, while the unaffected lamb systems are AMD Opteron systems. The host systems stayed on openSUSE Leap 15.1.

-- You are receiving this mail because: You are on the CC list for the bug.
Adrian Schröter
Fabian Vogt
Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c1
--- Comment #1 from Borislav Petkov
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c2
--- Comment #2 from Adrian Schröter
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c3
--- Comment #3 from Adrian Schröter
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c4
--- Comment #4 from Borislav Petkov
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c5
--- Comment #5 from Adrian Schröter
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c6
--- Comment #6 from Adrian Schröter
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c7
--- Comment #7 from Adrian Schröter
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c8
--- Comment #8 from Adrian Schröter
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c9
--- Comment #9 from Borislav Petkov
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c10
--- Comment #10 from Adrian Schröter
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c11
--- Comment #11 from Adrian Schröter
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c12
--- Comment #12 from Jiri Slaby
# osc build --vm-type=kvm -M openSUSE standard i586
and add --userootforbuild --vm-disk-size=30000. On the Epyc machine (remus), I had to add -j 32, as -j 33 (or more) results in:
smpboot: Total of 33 processors activated (148221.48 BogoMIPS)
BUG: kernel NULL pointer dereference, address: 00000d24
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
*pdpt = 0000000000000000 *pde = f000ff53f000ff53
Oops: 0000 [#1] SMP NOPTI
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G S W 5.6.4-1-pae #1 openSUSE Tumbleweed (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.suse.com 04/01/2014
EIP: __alloc_pages_nodemask+0xd6/0x2b0
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c13
--- Comment #13 from Jiri Slaby
EIP: __alloc_pages_nodemask+0xd6/0x2b0
It dies somewhere in __alloc_pages_nodemask, trying to next_zones_zonelist, perhaps via for_each_zone_zonelist_nodemask. But that is not new...
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c14
--- Comment #14 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c15
--- Comment #15 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c16
--- Comment #16 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c18
--- Comment #18 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c19
--- Comment #19 from Jiri Slaby
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
vda             436.66       167.78      3114.85         0.00     935805   17372932          0
vda             270.82       822.20      9404.21         0.00     760877    8702840          0
The first is 5.6 = bad, the second is good = 5.5. 3 times slower writes, 5 times slower reads. But it could be due to accumulation of I/O buffers. Let's see if I can bisect it at last, as I failed 2 times already (the bisection led to a merge commit or so...).
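For reference, the "3 times slower writes, 5 times slower reads" figures can be recomputed directly from the two iostat samples (a self-contained sketch; the numbers are copied verbatim from the table, with $3 and $4 being iostat's kB_read/s and kB_wrtn/s columns):

```shell
# Line 1 is the 5.6 (bad) sample, line 2 the 5.5 (good) sample.
ratios=$(awk 'NR==1 { r_bad=$3;  w_bad=$4 }
              NR==2 { r_good=$3; w_good=$4 }
              END   { printf "reads: %.1fx slower, writes: %.1fx slower",
                             r_good/r_bad, w_good/w_bad }' <<'EOF'
vda 436.66 167.78 3114.85 0.00 935805 17372932 0
vda 270.82 822.20 9404.21 0.00 760877 8702840 0
EOF
)
echo "$ratios"
```

This prints roughly 4.9x for reads and 3.0x for writes, matching the quoted figures.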
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c20
--- Comment #20 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c21
--- Comment #21 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c22
--- Comment #22 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c23
--- Comment #23 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c24
--- Comment #24 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c25
--- Comment #25 from Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c26
--- Comment #26 from Jiri Slaby
2020-05-01 10:14:21  installation-images:openSUSE  source change    succeeded  33m 13s   lamb67:2
2020-05-01 14:12:51  installation-images:openSUSE  new build        succeeded  3h 1m 3s  goat01:4
2020-05-01 16:50:14  installation-images:openSUSE  new build        succeeded  36m 13s   lamb63:4
2020-05-01 19:23:50  installation-images:openSUSE  rebuild counter  succeeded  25m 10s   lamb71:3
2020-05-03 00:52:49  installation-images:openSUSE  new build        succeeded  33m 20s   goat17:4
2020-05-03 03:39:12  installation-images:openSUSE  rebuild counter  succeeded  24m 29s   lamb69:3
Now, we will work on fixing it upstream.
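To make the contrast in the jobhistory lines concrete, the durations can be converted to seconds (a sketch; to_secs is a hypothetical helper name, the input strings are taken from the table above):

```shell
# Convert "3h 1m 3s"-style jobhistory durations to seconds.
to_secs() {
  echo "$1" | awk '{
    s = 0
    for (i = 1; i <= NF; i++) {
      v = substr($i, 1, length($i) - 1)   # strip trailing h/m/s
      if      ($i ~ /h$/) s += v * 3600
      else if ($i ~ /m$/) s += v * 60
      else if ($i ~ /s$/) s += v
    }
    print s
  }'
}

goat=$(to_secs "3h 1m 3s")   # the slow goat01 build
lamb=$(to_secs "33m 13s")    # a lamb67 build of the same package
echo "goat01: ${goat}s, lamb67: ${lamb}s"
```

That is 10863s vs 1993s, roughly the factor-5 slowdown from the original report.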
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c28
--- Comment #28 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c29
--- Comment #29 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c30
--- Comment #30 from Jan Kara
Jan Kara
Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=1169774#c31
--- Comment #31 from Jiri Slaby
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c32
--- Comment #32 from Jan Kara
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c34
--- Comment #34 from Jan Kara
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c35
--- Comment #35 from Fabian Vogt
Fabian, are you sure this is the same problem? I mean, was this consistently failing like this for the past 6 months since the change was introduced? AFAIU you're speaking about the 'failed' builds on 'sheep' and 'goat' hosts, aren't you? 'lamb', 'cloud', and 'build' hosts seem to work fine.
The osc jobhist only goes back to the beginning of October. In that timeframe, it's been consistently slow/failing for i586 goat/sheep builds.

I did some tests using different filesystems inside the VM using "Buildflags: vmfstype:foo" in the prjconf. Using btrfs made no difference (compared to ext4). XFS appears to work much better. So the issue apparently impacts ext4 and btrfs the most. Kiwi is still using ext4 for the filesystem on the livecd though, so I could only compare the time until the final rsync.

XFS on sheep81, 23s for kernel-firmware:

[  94s] [ DEBUG ]: 16:02:24 | system: ( 47/1065) Installing: kbd-legacy-2.3.0-1.1.noarch [............done]
[  94s] [ DEBUG ]: 16:02:24 | system: Additional rpm output:
[  94s] [ DEBUG ]: 16:02:24 | system: warning: /var/cache/kiwi/packages/f796d7d2bc4daf38063ff386ebbc072d/kbd-legacy.rpm: Header V3 RSA/SHA256 Signature, key ID 3dbdc284: NOKEY
[ 117s] [ DEBUG ]: 16:02:47 | system: ( 48/1065) Installing: kernel-firmware-20201023-1.1.noarch [.......................done]
[ 117s] [ INFO ]: Processing: [# ] 2%[ DEBUG ]: 16:02:47 | system: Additional rpm output:
[ 117s] [ DEBUG ]: 16:02:47 | system: warning: /var/cache/kiwi/packages/f796d7d2bc4daf38063ff386ebbc072d/kernel-firmware.rpm: Header V3 RSA/SHA256 Signature, key ID 3dbdc284: NOKEY

btrfs on sheep82, 141s for kernel-firmware:

[ 108s] [ DEBUG ]: 16:27:22 | system: ( 47/1065) Installing: kbd-legacy-2.3.0-1.1.noarch [............done]
[ 108s] [ DEBUG ]: 16:27:22 | system: Additional rpm output:
[ 108s] [ DEBUG ]: 16:27:22 | system: warning: /var/cache/kiwi/packages/f796d7d2bc4daf38063ff386ebbc072d/kbd-legacy.rpm: Header V3 RSA/SHA256 Signature, key ID 3dbdc284: NOKEY
[ 249s] [ DEBUG ]: 16:29:43 | system: ( 48/1065) Installing: kernel-firmware-20201023-1.1.noarch [...................................................................done]
[ 249s] [ INFO ]: Processing: [# ] 2%[ DEBUG ]: 16:29:43 | system: Additional rpm output:
[ 249s] [ DEBUG ]: 16:29:43 | system: warning: /var/cache/kiwi/packages/f796d7d2bc4daf38063ff386ebbc072d/kernel-firmware.rpm: Header V3 RSA/SHA256 Signature, key ID 3dbdc284: NOKEY
If the start of the problem indeed dates back 6 months, I'd like to check whether mounting the filesystem with the 'dioread_lock' mount option fixes the issue. What would be the easiest way to try that with OBS? The option needs to be passed either to mount, or to mke2fs (as part of the default mount options), or I can provide a patched kernel with modified defaults...
I don't think it's possible for users to influence mount options directly, so if you think that test makes sense, a kernel-obs-build package for i586 would be ideal. FTR, my test projects are at https://build.opensuse.org/project/show/home:favogt:boo1169774 (xfs) https://build.opensuse.org/project/show/home:favogt:boo1169774-btrfs (btrfs)
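For the record, the two non-kernel routes mentioned above could look roughly like this (a sketch only: the device path is hypothetical, the commands are echoed rather than executed, and whether this particular kernel accepts 'dioread_lock' is exactly what the test would establish):

```shell
DEV=/dev/vda1   # hypothetical root device of the build VM

# Route 1: pass the option at mount time.
cmd_mount="mount -o dioread_lock $DEV /mnt"

# Route 2: store it in the filesystem's default mount options
# (the superblock's s_mount_opts field), so the mount command
# on the OBS side needs no change.
cmd_tune="tune2fs -E mount_opts=dioread_lock $DEV"

echo "$cmd_mount"
echo "$cmd_tune"
```

Route 2 matches the "to mke2fs (as part of the default mount options)" idea without rebuilding the image tooling.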
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c36
--- Comment #36 from Jan Kara
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c37
--- Comment #37 from Fabian Vogt
If this is happening on both ext4 and btrfs, it seems unlikely to be caused by an ext4-specific problem (provided btrfs ever worked before). But anyway, let's debug this and we'll see where the problem is. The difference between XFS and ext4 indeed shouldn't be that big.
The weird part is IMO that this only affects i586 builds on goat and sheep workers. x86_64 on goat/sheep and i586 on other workers are just fine.
Now I'm mostly ignorant of OBS (well, I use it to build simple packages, but that's all). So if I understand right, 'vmfstype' influences what is used as the root filesystem of the VM that is doing the build of the package? I.e. in your case of the live CD images?
Yep!
What exactly happens between the installation of kbd-legacy-2.3.0-1.1.noarch and kernel-firmware-20201023-1.1.noarch in the build system? This is the difference you care about, right?
The main issue is "general slowness" of builds, with the rsync difference being the most obvious. That always uses ext4 (not configurable) and takes too long for quick testing, so I picked some other part of the build for comparison.
But even in the "fast" case it takes 23 seconds so I'd like to understand what the system is doing during this time...
Probably just installation of the kernel-firmware rpm. It's quite massive (zypper says "219.3 MiB (563.5 MiB unpacked)") and contains 2554 files. Other package installations are affected as well, but most are just smaller library packages and a sub-second difference isn't that visible in the log.
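The rpm-install pattern described above (thousands of small files unpacked in sequence) can be mimicked with a self-contained sketch; the file count and sizes here are illustrative, not kernel-firmware's actual layout. Running this inside VMs with different vmfstype settings would isolate the filesystem effect:

```shell
# Write many small files and sync, roughly what rpm does when unpacking
# a package like kernel-firmware (2554 files, 563.5 MiB unpacked).
dir=$(mktemp -d)
start=$(date +%s)
i=1
while [ "$i" -le 200 ]; do          # 200 files keeps the sketch quick
  head -c 4096 /dev/zero > "$dir/f$i"
  i=$((i + 1))
done
sync
elapsed=$(( $(date +%s) - start ))
count=$(ls "$dir" | wc -l)
echo "wrote $count files in ${elapsed}s"
rm -rf "$dir"
```

Comparing the elapsed time for the same run on ext4, btrfs, and XFS guests would show whether small-file creation is the slow path.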
Also where can I see the full logs you've pasted from?
btrfs: https://build.opensuse.org/package/live_build_log/home:favogt:boo1169774-btr...
ext4: https://build.opensuse.org/package/live_build_log/home:favogt:boo1169774-ext...
xfs: https://build.opensuse.org/package/live_build_log/home:favogt:boo1169774/liv...
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c38
--- Comment #38 from Adrian Schröter
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c39
--- Comment #39 from Ruediger Oertel
The weird part is IMO that this only affects i586 builds on goat and sheep workers. x86_64 on goat/sheep and i586 on other workers are just fine.
this basically says 98% of all workers, IMHO this only leaves the cloud??? (101-138) and build?? (70-85) machines, all older Intel CPUs; all sheep/lamb machines are Opteron and goat are Epyc. Something known with 32bit on AMD?
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c40
--- Comment #40 from Dominique Leuenberger
The weird part is IMO that this only affects i586 builds on goat and sheep workers. x86_64 on goat/sheep and i586 on other workers are just fine.
this basically says 98% of all workers, IMHO this only leaves the cloud??? (101-138) and build?? (70-85) machines, all older Intel CPUs and all sheep/lamb machines are Opteron and goat are EPYC.
lamb is fine as far as I know. cloud is significantly slower than lamb, but IIUC, that's the older hardware responsible for that difference.

A good example of the timing differences is:

osc jobhist openSUSE:Factory installation-images:openSUSE standard i586

(in contrast to the live images, this at least succeeds, but 2.5 hours instead of 0.5 is a big difference; this is also the package used initially to report this bug)
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c41
--- Comment #41 from Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c42
--- Comment #42 from Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c43
--- Comment #43 from Jan Kara
Jan Kara
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c44
--- Comment #44 from Jan Kara
https://bugzilla.suse.com/show_bug.cgi?id=1169774#c45
--- Comment #45 from Jan Kara
participants (2):
- bugzilla_noreply@novell.com
- bugzilla_noreply@suse.com