[Bug 1206780] New: "zypper refresh" segfaults in ARM64 container
https://bugzilla.suse.com/show_bug.cgi?id=1206780

            Bug ID: 1206780
           Summary: "zypper refresh" segfaults in ARM64 container
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: libzypp
          Assignee: zypp-maintainers@suse.de
          Reporter: martin.wilck@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Happening on latest TW. I'm running TW/aarch64 in a container with qemu-linux-user.
podman run --arch arm64 -it --rm registry.opensuse.org/opensuse/tumbleweed:latest
:/ # zypper refresh
Segmentation fault (core dumped)
I don't think this is a qemu problem, because it happens both on TW/x86_64 with native qemu-linux-user and on Leap 15.4 with https://github.com/multiarch/qemu-user-static, and even on an Ubuntu system in GitHub Actions (https://github.com/mwilck/build-multipath/actions/runs/3820271701/jobs/65007...). I can't test this on native TW/aarch64 because I don't have a suitable test system. I can provide core dumps, but I'm unsure whether they'll be helpful or easy to debug, as they are formally core dumps of qemu-aarch64.
coredumpctl list
TIME                          PID   UID GID SIG     COREFILE EXE                   SIZE
Mon 2023-01-02 10:51:06 CET 26071 17326  50 SIGSEGV present  /usr/bin/qemu-aarch64 6.6M
-- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c1

--- Comment #1 from Martin Wilck <martin.wilck@suse.com> ---
On GitHub, I run the workflow in question every week. The workflow first failed a week ago (Dec 26th); the run on Dec 19th was still successful.

https://github.com/mwilck/build-multipath/actions/workflows/rolling.yaml

The failing Dockerfile looks like this:
FROM registry.opensuse.org/opensuse/tumbleweed:latest
LABEL org.opencontainers.image.title="build-multipath-tools"
LABEL org.opencontainers.image.description="container for building multipath-tools on opensuse/tumbleweed"
RUN zypper --quiet --non-interactive --gpg-auto-import-keys refresh && \
    zypper --quiet --non-interactive --gpg-auto-import-keys --no-refresh install --no-recommends --force-resolution --allow-downgrade \
    make gcc clang pkg-config gzip gawk \
    device-mapper-devel libaio-devel systemd-devel libjson-c-devel liburcu-devel readline-devel libedit-devel libmount-devel libcmocka-devel
VOLUME /build
RUN zypper --quiet --non-interactive --gpg-auto-import-keys clean --all
WORKDIR /build
It's the "zypper refresh" that fails; the "zypper install" step is never reached.
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c2

--- Comment #2 from Martin Wilck <martin.wilck@suse.com> ---
The issue occurred between opensuse/tumbleweed:20221214 and opensuse/tumbleweed:20221219. Comparing the package lists:

-cracklib-dict-small-2.9.7-3.6.aarch64
+cracklib-dict-small-2.9.8-1.1.aarch64
-openSUSE-release-appliance-docker-20221214-1435.1.aarch64
+openSUSE-release-appliance-docker-20221219-1442.1.aarch64
-libz1-1.2.12-2.1.aarch64
+libz1-1.2.12-3.1.aarch64
-libsqlite3-0-3.40.0-1.1.aarch64
+libsqlite3-0-3.40.0-2.1.aarch64
-libpcre2-8-0-10.42-1.1.aarch64
+libpcre2-8-0-10.42-2.1.aarch64
-liblzma5-5.2.8-1.1.aarch64
+liblzma5-5.2.8-2.1.aarch64
-libgcc_s1-12.2.1+git537-1.1.aarch64
+libgcc_s1-13.0.0+git197351-4.1.aarch64
-libstdc++6-12.2.1+git537-1.1.aarch64
+libstdc++6-13.0.0+git197351-4.1.aarch64
-libprotobuf-lite3_21_11-21.11-1.1.aarch64
+libprotobuf-lite3_21_12-21.12-1.1.aarch64
-libprocps8-3.3.17-10.3.aarch64
+libprocps8-3.3.17-11.1.aarch64
-procps-3.3.17-10.3.aarch64
+procps-3.3.17-11.1.aarch64
-bash-5.2.12-6.1.aarch64
+bash-5.2.15-7.1.aarch64
-bash-sh-5.2.12-6.1.noarch
+bash-sh-5.2.15-7.1.noarch
-xz-5.2.8-1.1.aarch64
+xz-5.2.8-2.1.aarch64
-openSUSE-release-20221214-1435.1.aarch64
+openSUSE-release-20221219-1442.1.aarch64
-libopenssl1_1-1.1.1s-1.1.aarch64
+libopenssl1_1-1.1.1s-2.1.aarch64
-libopenssl1_1-hmac-1.1.1s-1.1.aarch64
+libopenssl1_1-hmac-1.1.1s-2.1.aarch64
-krb5-1.20.1-1.1.aarch64
+krb5-1.20.1-2.1.aarch64
-libzypp-17.31.6-1.3.aarch64
+libzypp-17.31.6-1.4.aarch64

libgcc_s1 and libstdc++6 are the only package updates that catch the eye.
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c3

Martin Wilck <martin.wilck@suse.com> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                CC|            |rguenther@suse.com
             Flags|            |needinfo?(rguenther@suse.com)

--- Comment #3 from Martin Wilck <martin.wilck@suse.com> ---
Adding Richard as the gcc maintainer. This smells like a toolchain / C library issue to me.
https://bugzilla.suse.com/show_bug.cgi?id=1206780

Guillaume GARDET <guillaume.gardet@arm.com> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                CC|            |guillaume.gardet@arm.com
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c5

--- Comment #5 from Martin Wilck <martin.wilck@suse.com> ---
Not much to see in gdb. Program was killed with signal 11 (SIGSEGV).

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000070bf39 in __sigsuspend (set=set@entry=0x7ffe1e40bf88) at ../sysdeps/unix/sysv/linux/sigsuspend.c:26
26      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
[Current thread is 1 (Thread 0x1d03480 (LWP 21))]
(gdb) bt
#0  0x000000000070bf39 in __sigsuspend (set=set@entry=0x7ffe1e40bf88) at ../sysdeps/unix/sysv/linux/sigsuspend.c:26
#1  0x000000000062e16a in dump_core_and_abort (target_sig=target_sig@entry=11) at ../linux-user/signal.c:776
#2  0x000000000062e4e2 in handle_pending_signal (cpu_env=cpu_env@entry=0x1d2f8a0, sig=sig@entry=11, k=k@entry=0x1d76110) at ../linux-user/signal.c:1097
#3  0x00000000006301a5 in process_pending_signals (cpu_env=cpu_env@entry=0x1d2f8a0) at ../linux-user/signal.c:1172
#4  0x000000000040539a in cpu_loop (env=env@entry=0x1d2f8a0) at ../linux-user/aarch64/cpu_loop.c:186
#5  0x00000000004021e1 in main (argc=<optimized out>, argv=0x7ffe1e40cb48, envp=<optimized out>) at ../linux-user/main.c:918
(gdb) info threads
  Id   Target Id                      Frame
* 1    Thread 0x1d03480 (LWP 21)      0x000000000070bf39 in __sigsuspend (set=set@entry=0x7ffe1e40bf88) at ../sysdeps/unix/sysv/linux/sigsuspend.c:26
  2    Thread 0x7efe31d436c0 (LWP 23) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7efe314c16c0 (LWP 24) safe_syscall_base () at ../common-user/host/x86_64/safe-syscall.inc.S:75
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c6

--- Comment #6 from Martin Wilck <martin.wilck@suse.com> ---
Trying to run qemu-aarch64 with debug options:
podman run -v /usr/bin/qemu-aarch64:/usr/bin/qemu-aarch64 -v /tmp:/tmp --arch arm64 -it --rm registry.opensuse.org/opensuse/tumbleweed:latest
:/ # /usr/bin/qemu-aarch64 -d int,strace,tid -D /tmp/z%d.str /usr/bin/zypper --verbose ref
Two threads are alive when the SEGV occurs. One (123) is calling ppoll(), read(), and write(); the other one (120, the main thread) is doing more work. Last lines of thread 120:
120 openat(AT_FDCWD,"/var/tmp/AP_0xzw3xCb/geoip",O_RDONLY|O_CLOEXEC) = 13
120 read(13,0x2a718b0,5) = 0
120 close(13) = 0
120 openat(AT_FDCWD,"/var/tmp/AP_0xzw3xCb/geoip",O_RDONLY|O_CLOEXEC) = 13
120 lseek(13,0,SEEK_CUR) = 0
120 newfstatat(AT_FDCWD,"/var/tmp/AP_0xzw3xCb/geoip",0x0000005502a71990,0) = 0
120 openat(AT_FDCWD,"/var/tmp/AP_0xzw3xCb/geoip",O_RDONLY|O_CLOEXEC) = 17
120 read(17,0x2a71920,5) = 0
120 close(17) = 0
120 read(13,0x3cf230,8192) = 0
120 clock_gettime(CLOCK_REALTIME_COARSE,0x0000005502a716a8) = 0 ({tv_sec = 1672739909,tv_nsec = 34004325})
120 newfstatat(AT_FDCWD,"/etc/localtime",0x0000005502a71520,0) = 0
120 uname(0x5502a71530) = 0
120 getpid() = 120
120 sendto(11,365076633824,118,16384,0,0) = 118
120 clock_gettime(CLOCK_REALTIME_COARSE,0x0000005502a71618) = 0 ({tv_sec = 1672739909,tv_nsec = 34004325})
120 newfstatat(AT_FDCWD,"/etc/localtime",0x0000005502a71490,0) = 0
120 uname(0x5502a714a0) = 0
120 getpid() = 120
120 sendto(11,365075816992,125,16384,0,0) = 125
120 futex(0x00000055035500a4,FUTEX_PRIVATE_FLAG|FUTEX_WAKE,2147483647,NULL,NULL,0) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=1, si_addr=NULL} ---
120 clock_gettime(CLOCK_REALTIME_COARSE,0x0000005502a6f748) = 0 ({tv_sec = 1672739909,tv_nsec = 38004368})
120 newfstatat(AT_FDCWD,"/etc/localtime",0x0000005502a6f5c0,0) = 0
120 uname(0x5502a6f5d0) = 0
120 getpid() = 120
120 sendto(11,365075366992,102,16384,0,0) = 102
--- SIGSEGV {si_signo=SIGSEGV, si_code=1, si_addr=NULL} ---
Last lines of thread 123:
120 ppoll(365206462224,4,0,0,0,0) = 1
120 write(7,0x56fe1c0,8) = 8
120 ioctl(12,FIONREAD,0x00000055056fe004) = 0 (118)
120 read(12,0x8006800,118) = 118
120 write(10,0x2d77d0,118) = 118
120 write(7,0x56fe1c0,8) = 8
120 ppoll(365206462224,4,0,0,0,0) = 1
120 read(7,0x56fe1d8,16) = 8
120 ppoll(365206462224,4,0,0,0,0) = 1
120 write(7,0x56fe1c0,8) = 8
120 ioctl(12,FIONREAD,0x00000055056fe004) = 0 (125)
120 read(12,0x8006800,125) = 125
120 write(10,0x2d77d0,125) = 125
120 write(7,0x56fe1c0,8) = 8
120 ppoll(365206462224,4,0,0,0,0) = 1
120 read(7,0x56fe1d8,16) = 8
120 ppoll(365206462224,4,0,0,0,0) = 1
120 write(7,0x56fe1c0,8) = 8
120 ioctl(12,FIONREAD,0x00000055056fe004) = 0 (102)
120 read(12,0x8006800,102) = 102
120 write(10,0x2d77d0,102) = 102
120 write(7,0x56fe1c0,8) = 8
120 ppoll(365206462224,4,0,0,0,0) = 1
120 read(7,0x56fe1d8,16) = 8
120 ppoll(365206462224,4,0,0,0,0)
The main thread opens a geoip-related file and sends several pieces of data to the other thread, which receives them (118, 125, and 102 bytes, respectively), forwards them to another fd (10), and triggers an eventfd (fd 7). The main thread receives SIGSEGV twice (??), the first time while it is in futex() and the other thread in ppoll(). The futex() call is probably the most likely candidate for causing the issue. Looking at the futex() calls made by the main thread, the first argument (the futex address) is always distinct. I'm not sure why these futex calls happen at all.
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c7

--- Comment #7 from Martin Wilck <martin.wilck@suse.com> ---
Created attachment 863788
  --> https://bugzilla.suse.com/attachment.cgi?id=863788&action=edit
"strace" output of qemu-aarch64

Full output from the previous comment.
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c8

Benjamin Zeller <bzeller@suse.com> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                CC|            |bzeller@suse.com

--- Comment #8 from Benjamin Zeller <bzeller@suse.com> ---
(In reply to Martin Wilck from comment #6)
> Trying to run qemu-aarch64 with debug options:
>
> podman run -v /usr/bin/qemu-aarch64:/usr/bin/qemu-aarch64 -v /tmp:/tmp --arch arm64 -it --rm registry.opensuse.org/opensuse/tumbleweed:latest
> :/ # /usr/bin/qemu-aarch64 -d int,strace,tid -D /tmp/z%d.str /usr/bin/zypper --verbose ref
Thanks, I will try that locally too!
> There are two alive threads when the SEGV occurs. One (123) is calling ppoll(), read(), and write(), the other one (120, main thread) is doing more work.
Yes, one of the threads doing the poll is the log thread; it waits for new log lines on a unix domain socket that the main thread (and possibly other threads/subprocesses) can write to.
> The main thread opens a geoip-related file and sends several pieces of data to the other thread, which receives them (118, 125, and 102 bytes, respectively), forwards them to another fd (10) and triggers an eventfd (fd 7).
The segfault is not related to the geoip file; even running

  zypper --version

shows a segfault.
> The main thread receives SIGSEGV twice (??). The first time while the main thread has called futex() and the other thread ppoll().
I was trying to figure out where the futex() call comes from. Judging from the log file of the zypper --version call, I currently suspect it is triggered by throwing an exception, since the C++ throw (at least on x86) uses a futex call, and the crash happens right after that.
> The futex() call is probably the most likely candidate for causing the issue. Looking at the futex() calls made by the main thread, it appears that the first argument (futex address) is always distinct. Not sure why these futex calls happen at all.
I'm not sure this is the issue; as previously mentioned, throw() uses a futex, so every time zypp throws or rethrows an exception you will probably see one of those. There are also other stdlib functions / syscalls that use it, setlocale() for example.
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c9

--- Comment #9 from Martin Wilck <martin.wilck@suse.com> ---
The problem doesn't occur if I set any specific CPU other than "max".
:/ # /usr/bin/qemu-aarch64 -cpu cortex-a76 -d unimp /usr/bin/zypper --version
Unsupported syscall: 293
zypper 1.14.58
:/ # /usr/bin/qemu-aarch64 -cpu max -d unimp /usr/bin/zypper --version
Unsupported syscall: 293
zypper 1.14.58
Segmentation fault (core dumped)
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c10

--- Comment #10 from Guillaume GARDET <guillaume.gardet@arm.com> ---
(In reply to Martin Wilck from comment #9)
> The problem doesn't occur if I set any specific CPU other than "max".
What host machine is it?
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c11

--- Comment #11 from Benjamin Zeller <bzeller@suse.com> ---
(In reply to Guillaume GARDET from comment #10)
> (In reply to Martin Wilck from comment #9)
> > The problem doesn't occur if I set any specific CPU other than "max".
> What host machine is it?
For me it happens on a recently updated Tumbleweed install:
Linux linux-2ggq 6.1.1-1-default #1 SMP PREEMPT_DYNAMIC Thu Dec 22 15:37:40 UTC 2022 (e71748d) x86_64 x86_64 x86_64 GNU/Linux
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c12

--- Comment #12 from Guillaume GARDET <guillaume.gardet@arm.com> ---
(In reply to Benjamin Zeller from comment #11)
> (In reply to Guillaume GARDET from comment #10)
> > (In reply to Martin Wilck from comment #9)
> > > The problem doesn't occur if I set any specific CPU other than "max".
> > What host machine is it?
> For me it happens on a recently updated Tumbleweed install:
> Linux linux-2ggq 6.1.1-1-default #1 SMP PREEMPT_DYNAMIC Thu Dec 22 15:37:40 UTC 2022 (e71748d) x86_64 x86_64 x86_64 GNU/Linux
So this is a fully emulated (non-KVM) machine, and it runs PAC+BTI instructions (-mbranch-protection=standard). Could you try to disable Pointer Auth and/or BTI with arm64.nopauth and arm64.nobti to check if it improves things?
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c13

--- Comment #13 from Benjamin Zeller <bzeller@suse.com> ---
> So, fully emulated (non-KVM) machine. So, it runs PAC+BTI instructions (-mbranch-protection=standard). Could you try to disable Pointer Auth and/or BTI with arm64.nopauth and arm64.nobti to check if it improves things?
Is this a compile time setting or something that I can set on the VM?
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c14

--- Comment #14 from Guillaume GARDET <guillaume.gardet@arm.com> ---
(In reply to Benjamin Zeller from comment #13)
> > So, fully emulated (non-KVM) machine. So, it runs PAC+BTI instructions (-mbranch-protection=standard). Could you try to disable Pointer Auth and/or BTI with arm64.nopauth and arm64.nobti to check if it improves things?
> Is this a compile time setting or something that I can set on the VM?
This is a kernel parameter [0], so you can set it in grub, at the end of the 'linux' line.

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Docume...
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c15

--- Comment #15 from Benjamin Zeller <bzeller@suse.com> ---
> This is a kernel parameter [0]
> So, you can set it on grub, at the end of the 'linux' line.
> [0]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/admin-guide/kernel-parameters.txt
Ah, but this happens inside a qemu-user-static execution. It's not a full VM with its own kernel.
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c16

--- Comment #16 from Martin Wilck <martin.wilck@suse.com> ---
I've taken a lengthy look at the output of

  /usr/bin/qemu-aarch64 -cpu max -d in_asm,strace /usr/bin/zypper --version

after building qemu-aarch64 with capstone (disassembly) support.

The last thing zypper does is apparently a throw() from a function called _ZN6Zypper12shellCleanupEv@@Base(). From there, the code enters

  __cxa_throw()            (libstdc++6.so)
  _Unwind_RaiseException   (libgcc_s1.so)

After that, the code moves around in a code segment that (according to objdump) belongs to _Unwind_GetTextRelBase(). At some point I see a call to

  pthread_once()
  __enable_execute_stack()

At this point the last system call of zypper is observed,

  futex(0x00000055035500a4,FUTEX_PRIVATE_FLAG|FUTEX_WAKE,2147483647,NULL,NULL,0) = 0

apparently as part of pthread_once(). After that, the code jumps around in libgcc_s1.so after _Unwind_GetTextRelBase(), again. Finally I see a call chain

  _Unwind_GetTextRelBase
    _Unwind_find_FDE
      __GI__dl_find_object

The SIGSEGV then happens in _Unwind_GetTextRelBase, probably because _dl_find_object() returned an invalid object:

  0x550352e300:  f9418ea0  ldr x0, [x21, #0x318]
  0x550352e304:  52822d01  movz w1, #0x1168
  0x550352e308:  f9418aa5  ldr x5, [x21, #0x310]
  0x550352e30c:  72ba5001  movk w1, #0xd280, lsl #16
  0x550352e310:  b9400002  ldr w2, [x0]        <==== BAD VALUE IN x0
  0x550352e314:  6b01005f  cmp w2, w1
  0x550352e318:  54000b21  b.ne #0x550352e47c
  --- SIGSEGV {si_signo=SIGSEGV, si_code=1, si_addr=NULL} ---

The bad x0 value had been found by Fabiano Rosas with gdb. So something is going wrong in the unwind code that follows the throw(). But the details are still shady, and it is unclear how this relates to the pauth feature of qemu-aarch64.
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c17

--- Comment #17 from Guillaume GARDET <guillaume.gardet@arm.com> ---
It looks like a duplicate of https://bugzilla.suse.com/show_bug.cgi?id=1206684
https://bugzilla.suse.com/show_bug.cgi?id=1206780
https://bugzilla.suse.com/show_bug.cgi?id=1206780#c18

Martin Wilck <martin.wilck@suse.com> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
            Status|NEW         |RESOLVED
        Resolution|---         |DUPLICATE

--- Comment #18 from Martin Wilck <martin.wilck@suse.com> ---
Right.

*** This bug has been marked as a duplicate of bug 1206684 ***