[Bug 1190740] New: Race condition when linking to LLVM on s390x
https://bugzilla.suse.com/show_bug.cgi?id=1190740

Bug ID: 1190740
Summary: Race condition when linking to LLVM on s390x
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: S/390-64
URL: https://openqa.opensuse.org/tests/1734657/modules/postgresql_server/steps/6
OS: Other
Status: CONFIRMED
Severity: Normal
Priority: P5 - None
Component: Development
Assignee: matz@suse.com
Reporter: max@suse.com
QA Contact: qa-bugs@suse.de
CC: aaronpuchert@alice-dsl.net, ada.lovelace@gmx.de, azouhr@opensuse.org, ihno@suse.com, martin.liska@suse.com, matz@suse.com, max@suse.com, okurz@suse.com
Depends on: 1185952
Found By: openQA
Blocker: Yes

+++ This bug was initially created as a clone of Bug #1185952 +++

When built with LLVM support on s390x, PostgreSQL 11, 12 and 13 fail most of the time, but not always, with the following error:

/usr/lib64/gcc/s390x-suse-linux/11/../../../../s390x-suse-linux/bin/ld: @GLIBCXX_3.4.11: TLS reference in /usr/lib64/libLLVM.so mismatches non-TLS reference in /usr/lib64/libLLVM.so
/usr/lib64/gcc/s390x-suse-linux/11/../../../../s390x-suse-linux/bin/ld: /usr/lib64/libLLVM.so: error adding symbols: bad value
collect2: error: ld returned 1 exit status
make[2]: *** [../../../../src/Makefile.shlib:309: llvmjit.so] Error 1

We suspect a race condition that might be related to link time optimisation. So far it has not been reproducible in local builds, only on OBS workers.

For the details gathered so far, see bug 1185952 comment 39 and following.

--
You are receiving this mail because:
You are on the CC list for the bug.
Reinhard Max
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c1
--- Comment #1 from Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c2
--- Comment #2 from Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c3
--- Comment #3 from Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c4
--- Comment #4 from Reinhard Max
> Is this bug report reproducible with SLES?
I don't think I've seen it happen there so far, but AFAICS even the upcoming SLE15-SP4 is still on llvm11, so the situation is not comparable. Maybe we should check if it also happens with llvm11 on openSUSE, or only with llvm12.
One thing I noticed is that the VMs for s390x in OBS seem to be a lot slower than the ones that build SLE, and maybe also compared to the ones that Michael was using in his manual tests where he couldn't reproduce it. So, maybe the race is more likely to get triggered on less powerful machines.
Ihno Krumreich
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c5
--- Comment #5 from Richard Biener
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c7
--- Comment #7 from Michael Matz
> There's not really anything "racy" in how GCC LTO is set up - the actual
That's correct. As usual we use the term 'race' for anything unexplainable and unreproducible, where data doesn't match anything reasonable. We could call it faulty memory, or moon phase, or bad luck, or maybe undefined behaviour :-)
> error message from ld is also not exactly helpful, referencing libLLVM.so twice:
Actually it's a bit helpful in showing that something strange is going on:
> [ 994s] /usr/lib64/gcc/s390x-suse-linux/11/../../../../s390x-suse-linux/bin/ld: @GLIBCXX_3.4.11: TLS reference in /usr/lib64/libLLVM.so mismatches non-TLS reference in /usr/lib64/libLLVM.so
Here, the symbol for which there are TLS and non-TLS references is mentioned: it's '@GLIBCXX_3.4.11'. That is of course not a sensible symbol. It's either an empty symbol name with a symbol version, or the symbol version itself interpreted as a symbol name. My guess is that the linker plugin does something strange (aka undefined), caused by a problem either in LLVM's use of it or in ld. The plugin manipulates the symbol tables, and the above definitely shows something strange in the symbol table.
> [ 994s] /usr/lib64/gcc/s390x-suse-linux/11/../../../../s390x-suse-linux/bin/ld: /usr/lib64/libLLVM.so: error adding symbols: bad value
> But given that /usr/lib64/libLLVM.so is not built when postgresql is, the error should be visible there. There are exactly two TLS references in libLLVM.so.12
> 23: 0000000000000000 0 TLS GLOBAL DEFAULT UND _ZSt11__once_call@GLIBCXX_3.4.11 (16)
> 29: 0000000000000000 0 TLS GLOBAL DEFAULT UND _ZSt15__once_callable@GLIBCXX_3.4.11 (16)
No no, if it were anything you can see directly in the input files, then the error would be reproducible. Of course the inputs are completely fine.
> to simplify things I would also try to remove the -Wl,--as-needed and try to filter out -ffat-lto-objects from the parts that don't need it.
> OTOH the error is from _bfd_elf_merge_symbol but calling that with two same bfds is a bit odd.
All the more reason to believe in some error of the linker plugin.
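To make the symbol-table point above easier to check on a given build host, here is a minimal Python sketch that parses dynamic symbol rows in the format quoted in this comment and lists the TLS entries, plus any entry whose name is empty but which still carries a version (the shape of the odd '@GLIBCXX_3.4.11' in the ld message). The helper names (`DynSym`, `parse_dynsyms`, `tls_symbols`, `nameless_versioned`) are hypothetical and not part of binutils. The parser assumes the column layout shown in the two quoted rows.

```python
from typing import List, NamedTuple


class DynSym(NamedTuple):
    num: int
    sym_type: str   # e.g. "TLS", "FUNC", "OBJECT"
    bind: str       # e.g. "GLOBAL", "WEAK"
    ndx: str        # section index, "UND" for undefined symbols
    name: str       # symbol name without the version suffix
    version: str    # symbol version, "" if unversioned


def parse_dynsyms(text: str) -> List[DynSym]:
    """Parse symbol rows shaped like the ones quoted in this thread."""
    syms = []
    for line in text.splitlines():
        fields = line.split()
        # Rows look like: "23: <value> <size> TLS GLOBAL DEFAULT UND name@VER (16)"
        if len(fields) < 8 or not fields[0].endswith(":"):
            continue
        if not fields[0][:-1].isdigit():
            continue  # skips header rows such as "Num: Value Size ..."
        name, _, version = fields[7].partition("@")
        syms.append(DynSym(int(fields[0][:-1]), fields[3], fields[4],
                           fields[6], name, version.lstrip("@")))
    return syms


def tls_symbols(syms: List[DynSym]) -> List[DynSym]:
    """Return only the entries whose type column is TLS."""
    return [s for s in syms if s.sym_type == "TLS"]


def nameless_versioned(syms: List[DynSym]) -> List[DynSym]:
    """Return entries with an empty name that still carry a version."""
    return [s for s in syms if s.name == "" and s.version != ""]


# The two rows quoted above for libLLVM.so.12.
SAMPLE = """\
23: 0000000000000000 0 TLS GLOBAL DEFAULT UND _ZSt11__once_call@GLIBCXX_3.4.11 (16)
29: 0000000000000000 0 TLS GLOBAL DEFAULT UND _ZSt15__once_callable@GLIBCXX_3.4.11 (16)
"""

for sym in tls_symbols(parse_dynsyms(SAMPLE)):
    print(sym.num, sym.name, sym.version)
```

On a real system the input text could come from `readelf --dyn-syms -W` run on the library. As noted above, the on-disk inputs are expected to look fine, so a check like this can only rule out a broken input file, not locate the fault in the linker plugin.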
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c8
--- Comment #8 from Richard Biener
> (In reply to Richard Biener from comment #5)
> > OTOH the error is from _bfd_elf_merge_symbol but calling that with two same bfds is a bit odd.
> All the more reason to believe in some error of the linker plugin.
Yes, I suspect some memory corruption/overflow issue is at play. It does seem that 'llvm' merely acts as source code providing the linker inputs and the compile is entirely under GCC's control.
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c12
--- Comment #12 from openQA Review
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c13
--- Comment #13 from Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c14
--- Comment #14 from Reinhard Max
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c15
--- Comment #15 from Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c16
--- Comment #16 from Reinhard Max
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c17
--- Comment #17 from Reinhard Max
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c18
--- Comment #18 from Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c19
--- Comment #19 from Reinhard Max
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c23
--- Comment #23 from openQA Review
Sarah Kriesch
Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c25
--- Comment #25 from Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c27
--- Comment #27 from Michael Matz
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c31
--- Comment #31 from Sarah Kriesch
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c32
Fabian Vogt
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c39
--- Comment #39 from Fabian Vogt
> But shouldn't we see stuff like OOM kills if there was a RAM shortage?
Yes. (Not 100% true, but close enough in this case).
> Could it be a build parallelism issue? How about trying a "make -j1" build to see if that makes results more stable, either all succeeding or all failing? That would of course also reduce the memory footprint compared to parallel builds.
In my tests, it worked fine when building the llvmjit backend first without parallelism and then the rest. Though the randomness of the failure makes all tests questionable to a certain extent.
make -C src/backend/jit/llvm PACKAGE_TARNAME=%pgname
make %{?_smp_mflags} PACKAGE_TARNAME=%pgname
What points against (make) parallelism as the cause of this, though, is that the objects being linked are identical in working and failed builds, and a retry after a failed link also fails.
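Since the failure is intermittent, one way to make a comparison such as "serial llvmjit build first" versus "fully parallel build" less questionable is to repeat the same step several times and count how often it fails. The following Python sketch does only that. `count_failures` is a hypothetical helper, not part of the PostgreSQL build system or OBS, and it assumes that a non-zero exit status of the command means failure.

```python
import subprocess
import sys
from typing import Sequence


def count_failures(cmd: Sequence[str], runs: int) -> int:
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,  # output is discarded, only the
            stderr=subprocess.DEVNULL,  # exit status is evaluated
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Self-check with two trivial commands: one always fails, one always succeeds.
always_fail = [sys.executable, "-c", "import sys; sys.exit(1)"]
always_ok = [sys.executable, "-c", "pass"]
print(count_failures(always_fail, 3), count_failures(always_ok, 3))
```

For the case discussed here, the command would be the build or link step under test, started from a clean tree on each run. Because a retry after a failed link was observed to fail again within the same build, repeating only the final link inside one build is unlikely to be informative; whole builds would have to be repeated.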
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c42
--- Comment #42 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c43
--- Comment #43 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c44
--- Comment #44 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c45
--- Comment #45 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c46
--- Comment #46 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c47
--- Comment #47 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c48
--- Comment #48 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c49
--- Comment #49 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c51
--- Comment #51 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c52
--- Comment #52 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c59
--- Comment #59 from Swamp Workflow Management
https://bugzilla.suse.com/show_bug.cgi?id=1190740#c60
--- Comment #60 from Swamp Workflow Management