[Bug 1181479] New: LLVM 11 can not be built because of a kernel or hardware problem
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479

Bug ID: 1181479
Summary: LLVM 11 can not be built because of a kernel or hardware problem
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: sarah.kriesch@ibm.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0
Build Identifier:

OBS has a failed build of LLVM 11 for s390x with the following message:

[41008s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[41008s] or the build host has a kernel or hardware problem...

Reproducible: Always

Steps to Reproduce:
see: https://build.opensuse.org/package/live_build_log/openSUSE:Factory:zSystems/...

Actual Results:
failed builds of LLVM 11 because of a kernel or hardware problem on the base system (the VM)

Expected Results:
successful builds of LLVM 11 in the OBS

--
You are receiving this mail because:
You are the assignee for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479

Sarah Julia Kriesch <sarah.kriesch@ibm.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P2 - High
                 CC|                            |azouhr@opensuse.org,
                   |                            |ihno@suse.com
           Hardware|Other                       |S/390-64
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479#c1

--- Comment #1 from Ruediger Oertel <ro@suse.com> ---
the last lines in that log are:

[10986s] [4035/4215] : && /usr/bin/cmake -E rm -f lib64/libclangStaticAnalyzerFrontend.a && /home/abuild/rpmbuild/BUILD/llvm-11.0.1.src/stage1/bin/llvm-ar Dqc lib64/libclangStaticAnalyzerFrontend.a tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/AnalysisConsumer.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/AnalyzerHelpFlags.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/CheckerRegistry.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/CreateCheckerManager.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/FrontendActions.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/ModelConsumer.cpp.o tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/ModelInjector.cpp.o && /home/abuild/rpmbuild/BUILD/llvm-11.0.1.src/stage1/bin/llvm-ranlib -D lib64/libclangStaticAnalyzerFrontend.a && :
[40994s] qemu-system-s390x: terminating on signal 15 from pid 125036 (<unknown process>)
Job seems to be stuck here, killed. (after 30000 seconds of inactivity)
[41007s] ### VM INTERACTION END ###
[41008s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[41008s] or the build host has a kernel or hardware problem...

so llvm is stuck for 30000 seconds, possibly because of memory pressure, and after that timeout the process gets killed (and subsequently fails with a dubious error message, as the build script can't find out what exactly happened).
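The 30000-second stall can be cross-checked against the log timestamps. What follows is only a minimal sketch, assuming the bracketed `[NNNNNs]` prefix is elapsed build time in seconds. The sample lines are abbreviated from the excerpt above (the long archive command is cut short), and the helper names `timestamps` and `largest_gap` are illustrative, not part of OBS.

```python
import re

# Abbreviated copies of the log lines quoted in comment #1.
LOG = """\
[10986s] [4035/4215] : && /usr/bin/cmake -E rm -f lib64/libclangStaticAnalyzerFrontend.a
[40994s] qemu-system-s390x: terminating on signal 15 from pid 125036 (<unknown process>)
[41007s] ### VM INTERACTION END ###
[41008s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[41008s] or the build host has a kernel or hardware problem...
"""

# Assumption: every log line of interest starts with "[<seconds>s]".
STAMP = re.compile(r"^\[\s*(\d+)s\]")


def timestamps(log):
    """Collect the leading second counters of all lines that carry one."""
    stamps = []
    for line in log.splitlines():
        match = STAMP.match(line)
        if match:
            stamps.append(int(match.group(1)))
    return stamps


def largest_gap(stamps):
    """Return (gap, start, end) for the longest silence between two lines."""
    return max((b - a, a, b) for a, b in zip(stamps, stamps[1:]))


gap, start, end = largest_gap(timestamps(LOG))
print(f"longest silence: {gap}s (from {start}s to {end}s)")
```

On these lines the longest silence is 30008 seconds, between the last build step and the kill, which is consistent with the reported 30000-second inactivity timeout.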
this is s390zp2a with worker VMs like

OBS_INSTANCE_MEMORY=$(( 12 * 1024 ))
OBS_WORKER_ROOT_SIZE=60000
OBS_WORKER_SWAP_SIZE=2000
OBS_WORKER_JOBS=6

so a VM with 12G memory gets stuck with 6 jobs. might be a compiler bug or the processes really get large enough to stall the machine due to memory pressure. these are already the largest worker instances we run, so you might try with manually setting a different (lower) number of jobs in the specfile or similar.
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479#c2

Aaron Puchert <aaronpuchert@alice-dsl.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aaronpuchert@alice-dsl.net

--- Comment #2 from Aaron Puchert <aaronpuchert@alice-dsl.net> ---
(In reply to Sarah Julia Kriesch from comment #0)
> https://build.opensuse.org/package/live_build_log/openSUSE:Factory:zSystems/...

I checked the link just now, and the latest build was successful. The history also shows the previous build as the only failing build.

(In reply to Ruediger Oertel from comment #1)
> so a VM with 12G memory gets stuck with 6 jobs. might be a compiler bug or the processes really get large enough to stall the machine due to memory pressure. these are already the largest worker instances we run, so you might try with manually setting a different (lower) number of jobs in the specfile or similar.

Generally I would have gone with OOM, but 2G per job should be more than enough. Ninja's job pools are strange, so we could have up to 5 compile jobs plus 1 ThinLTO link job running at the same time. But even that is unlikely to consume something remotely close to 12G.

Though a ThinLTO job being involved is likely: this is pretty late in the build, and while the compiler itself is single-threaded, so whatever it does is pretty well reproducible, ThinLTO jobs are multi-threaded. So it could be a deadlock, but also a race condition if s390x has weaker memory ordering than x86. (Which it likely has, x86 is pretty constrained.)

So there could be a bug, but unless we can observe this live, there is too little to go on, I think.
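The "2G per job" figure follows from the worker settings quoted from comment #1. A minimal sketch of that arithmetic, assuming `OBS_INSTANCE_MEMORY` is in MiB (it matches the `-m 12288` value qemu is started with) and, as a further assumption, that `OBS_WORKER_SWAP_SIZE` uses the same unit; the function name `per_job_mib` is illustrative:

```python
# Values taken from the worker configuration quoted in comment #1.
OBS_INSTANCE_MEMORY_MIB = 12 * 1024  # OBS_INSTANCE_MEMORY=$(( 12 * 1024 ))
OBS_WORKER_SWAP_MIB = 2000           # OBS_WORKER_SWAP_SIZE=2000 (unit assumed to be MiB)
OBS_WORKER_JOBS = 6                  # OBS_WORKER_JOBS=6


def per_job_mib(total_mib, jobs):
    """Evenly divide a memory budget across parallel build jobs."""
    return total_mib / jobs


ram_per_job = per_job_mib(OBS_INSTANCE_MEMORY_MIB, OBS_WORKER_JOBS)
with_swap = per_job_mib(OBS_INSTANCE_MEMORY_MIB + OBS_WORKER_SWAP_MIB, OBS_WORKER_JOBS)

print(f"RAM per job:        {ram_per_job:.0f} MiB")
print(f"RAM + swap per job: {with_swap:.0f} MiB")
```

This is only an even split. Real usage is uneven: a single ThinLTO link job may need far more than one compile job, which is why the average alone cannot rule memory pressure in or out.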
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479#c3

--- Comment #3 from Aaron Puchert <aaronpuchert@alice-dsl.net> ---
The 3 latest builds have succeeded, and it's almost certainly not a kernel bug. More likely it's a hang caused by a race condition, but since LLVM 12 is around the corner I'll probably not have the time to look into it. It's also not exactly a part of LLVM I'm familiar with. So I suggest closing this bug.

If anyone is curious, the bug is most likely in the gold linker plugin LLVMgold.so or the LTO library, whichever does the actual parallelization. ThreadSanitizer is sadly not supported on s390x (https://clang.llvm.org/docs/ThreadSanitizer.html), but the tool is sensitive enough to detect the race on other platforms. So this could be worth a try. However, my machine isn't big enough for that. ;)
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479#c4

Sarah Kriesch <ada.lovelace@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |INVALID

--- Comment #4 from Sarah Kriesch <ada.lovelace@gmx.de> ---
Thank you for watching! We can create a new bug report if that happens again.
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479#c5

--- Comment #5 from Aaron Puchert <aaronpuchert@alice-dsl.net> ---
By the way, I just got an actual kernel issue in devel:tools:compiler/llvm12:

[ 172s] [ 159.153691] User process fault: interruption code 0010 ilc:3 in cc1plus[1000000+13a6000]
[ 172s] [ 159.153864] Failing address: 0000000000000000 TEID: 0000000000000800
[ 172s] [ 159.153959] Fault in primary space mode while using user ASCE.
[ 172s] [ 159.154055] AS:00000000828781c7 R3:0000000082d20007 S:0000000000000020
[...]
[ 173s] c++: internal compiler error: Segmentation fault signal terminated program cc1plus
[ 173s] Please submit a full bug report,
[ 173s] with preprocessed source if appropriate.
[ 173s] See <https://bugs.opensuse.org/> for instructions.
[...]
[ 180s] [ 166.845717] BUG: Bad page state in process cc1plus pfn:165f01
[ 180s] [ 166.847446] BUG: Bad page state in process cc1plus pfn:165f02
[ 180s] [ 166.847937] BUG: Bad page state in process cc1plus pfn:165f03
[ 180s] [ 166.849085] BUG: Bad page state in process cc1plus pfn:165f04
[ 180s] [ 166.849462] BUG: Bad rss-counter state mm:00000000c03d85c2 type:MM_FILEPAGES val:-256
[ 180s] [ 166.850834] BUG: Bad page state in process cc1plus pfn:165f05
[... all numbers in between ...]
[ 180s] [ 166.898083] BUG: Bad page state in process cc1plus pfn:165f3c

But there is little to go on. This is started on s390zp29 via

[ 13s] /usr/bin/qemu-system-s390x -nodefaults -no-reboot -nographic -vga none -cpu host -enable-kvm -object rng-random,filename=/dev/random,id=rng0 -device virtio-rng-ccw,rng=rng0 -runas qemu -net none -kernel /var/cache/obs/worker/root_2/.mount/boot/kernel -initrd /var/cache/obs/worker/root_2/.mount/boot/initrd -append root=/dev/disk/by-id/virtio-0 rootfstype=ext4 rootflags=noatime ext4.allow_unsupported=1 mitigations=off panic=1 quiet no-kvmclock elevator=noop nmi_watchdog=0 rw rd.driver.pre=binfmt_misc console=hvc0 init=/.build/build -m 12288 -drive file=/var/cache/obs/worker/root_2/root,format=raw,if=none,id=disk,cache=unsafe -device virtio-blk-ccw,drive=disk,serial=0 -drive file=/var/cache/obs/worker/root_2/swap,format=raw,if=none,id=swap,cache=unsafe -device virtio-blk-ccw,drive=swap,serial=1 -device virtio-serial-ccw -device virtconsole,chardev=virtiocon0 -chardev stdio,mux=on,id=virtiocon0 -mon chardev=virtiocon0 -chardev socket,id=monitor,server,nowait,path=/var/cache/obs/worker/root_2/root.qemu/monitor -mon chardev=monitor,mode=readline -smp 6
[ 20s] ### VM INTERACTION END ###
[ 20s] 2nd stage started in virtual machine
[ 20s] machine type: s390x
[ 20s] Linux version: 5.12.0-2-default #1 SMP Thu Apr 29 12:08:56 UTC 2021 (c4830af)
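For anyone triaging a log like this, the affected page frame numbers can be pulled out mechanically. The following is a minimal sketch, not an established tool: the regular expression is written against the message format seen in the excerpt above, the sample contains only the lines quoted there (the elided ones are not reconstructed), and the name `bad_pfns` is made up for illustration.

```python
import re

# Only the "Bad page state" / "Bad rss-counter" lines quoted in comment #5.
LINES = [
    "[ 180s] [ 166.845717] BUG: Bad page state in process cc1plus pfn:165f01",
    "[ 180s] [ 166.847446] BUG: Bad page state in process cc1plus pfn:165f02",
    "[ 180s] [ 166.847937] BUG: Bad page state in process cc1plus pfn:165f03",
    "[ 180s] [ 166.849085] BUG: Bad page state in process cc1plus pfn:165f04",
    "[ 180s] [ 166.849462] BUG: Bad rss-counter state mm:00000000c03d85c2 type:MM_FILEPAGES val:-256",
    "[ 180s] [ 166.850834] BUG: Bad page state in process cc1plus pfn:165f05",
    "[ 180s] [ 166.898083] BUG: Bad page state in process cc1plus pfn:165f3c",
]

# Assumption: the PFN is printed in hexadecimal after "pfn:".
PFN = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")


def bad_pfns(lines):
    """Return (process, pfn) pairs for every 'Bad page state' message."""
    found = []
    for line in lines:
        match = PFN.search(line)
        if match:
            found.append((match.group(1), int(match.group(2), 16)))
    return found


pfns = [pfn for _, pfn in bad_pfns(LINES)]
first, last = min(pfns), max(pfns)
# Pages covered if the elided entries really are "all numbers in between".
span = last - first + 1
print(f"{len(pfns)} quoted entries, range {first:#x}..{last:#x}, {span} pages if contiguous")
```

Run against the full attached log rather than this excerpt, the same loop would show whether the reported frames are in fact one contiguous run.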
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479#c6

--- Comment #6 from Aaron Puchert <aaronpuchert@alice-dsl.net> ---
Created attachment 849275
  --> http://bugzilla.opensuse.org/attachment.cgi?id=849275&action=edit
OBS build log with bad page state.

Let's just attach the actual log. (That's https://build.opensuse.org/public/build/devel:tools:compiler/openSUSE_Factor... compressed.)
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479

Aaron Puchert <aaronpuchert@alice-dsl.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #849275|text/plain                  |application/x-xz;
          mime type|                            |charset=binary
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479
http://bugzilla.opensuse.org/show_bug.cgi?id=1181479#c7

--- Comment #7 from Sarah Kriesch <ada.lovelace@gmx.de> ---
Can you create a new bug report, please? This bug is closed. Thank you!