[Bug 918158] New: error 4 segfault in systemd and systemd-udev
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

Bug ID: 918158
Summary: error 4 segfault in systemd and systemd-udev
Classification: openSUSE
Product: openSUSE Distribution
Version: 13.2
Hardware: x86-64
OS: openSUSE 13.2
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: ivan.topolsky@isb-sib.ch
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0
Build Identifier:

systemd-udevd[772]: segfault at 18 ip 00007fa5e0700a55 sp 00007fffc57d9f70 error 4 in libc-2.19.so[7fa5e0688000+19e000]

followed a little later by:

kernel:[73256.715384] systemd[1]: segfault at 7ce00000 ip 0000000000460384 sp 00007fff16f78d70 error 4 in systemd[400000+114000]

broadcast on all terminals.

Reproducible: Sometimes

Steps to Reproduce:
(0. not sure about this step: the first crash in systemd-udev might happen after resume from "Suspend-to-RAM")
1. "zypper update"
2. zypper updates a service (cups in my last occurrence)
3. zypper restarts a service

Actual Results:
segfault broadcast on all terminals. systemd dies. systemctl is unresponsive. The SysRq magic key is the only way to (more or less) cleanly restart the system.

Expected Results:
systemd should NOT die. (Or at least, there should be a clean way to reload it.)

This started following a recent upgrade of the system from the official opensuse update repository. It might be related to the glib2 problems that plague my KDE desktop.

Similar to this:
https://bugzilla.redhat.com/show_bug.cgi?id=1071128
https://bbs.archlinux.org/viewtopic.php?id=177801

Related to this (similar results, different component of systemd):
https://bugzilla.opensuse.org/show_bug.cgi?id=905369

--
You are receiving this mail because:
You are on the CC list for the bug.
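As a reading aid for the two kernel messages in the report, here is a minimal Python sketch that splits such a line into its fields, decodes the x86 page-fault error code, and computes the offset of the faulting instruction inside the mapped object. The function name `decode_segfault` and the regular expression are illustrative assumptions, not part of any tool mentioned in this thread. The bit meanings are the standard x86 page-fault error bits, and the kernel prints the code in hexadecimal.

```python
import re

# Standard x86 page-fault error-code bits (the kernel prints the code in hex).
PF_PROT = 0x1   # 0: page not present, 1: protection violation
PF_WRITE = 0x2  # 0: read access,      1: write access
PF_USER = 0x4   # 0: kernel mode,      1: user mode

# Hypothetical pattern for lines like the two quoted in the report.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    info = {
        "process": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "cause": "protection violation" if err & PF_PROT else "page not present",
        "access": "write" if err & PF_WRITE else "read",
        "mode": "user" if err & PF_USER else "kernel",
        "object": m.group("obj"),
        "offset": None,
    }
    if m.group("base") is not None:
        # Offset of the faulting instruction relative to the start of the mapping.
        info["offset"] = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return info


UDEVD_LINE = ("systemd-udevd[772]: segfault at 18 ip 00007fa5e0700a55 "
              "sp 00007fffc57d9f70 error 4 in libc-2.19.so[7fa5e0688000+19e000]")
SYSTEMD_LINE = ("kernel:[73256.715384] systemd[1]: segfault at 7ce00000 "
                "ip 0000000000460384 sp 00007fff16f78d70 error 4 in systemd[400000+114000]")
```

Under these assumptions, "error 4" in both lines reads as a user-mode read of a page that is not mapped. For the libc line, the computed offset is the value one would look up in the library's symbols; how that maps to a function name depends on the exact libc build.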
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

Martin Pluskal <mpluskal@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mpluskal@suse.com
           Assignee|bnc-team-screening@forge.pr |systemd-maintainers@suse.de
                   |ovo.novell.com              |
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

Martin Schröder <martin@oneiros.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |martin@oneiros.de

--- Comment #11 from Martin Schröder <martin@oneiros.de> ---
The new systemd also segfaults on 13.1: see bug 918226
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

Stefan Botter <jsj@jsj.dyndns.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jsj@jsj.dyndns.org
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #13 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Hello,

Sorry, I was busy at work.

I had a new occurrence of the same bug. (This time no resume from suspend and no systemd-udev crash. Only "segfault at e0000000 ip 0000000000460384 sp 00007ffff5624a20 error 4 in systemd[400000+114000]")

I'm still running the stock systemd, straight from the official "OpenSuse 13.2 Updates" repository. I haven't had time to try the "Base:system:legacy" repo yet.

I've managed to catch the core dump this time. (I'll try attaching it here.)

Regarding the rest:
- journalctl: no other message to report on either occasion.
- hardware: The hardware has not been changed recently and the system used to work until before the update.
- BIOS: no changes in BIOS settings. I recently switched from Classic BIOS to EFI booting, but no problems appeared at that time. They only appeared 2 weeks later, after upgrading.
- no filesystem corruption after checking for modification from RPM database

but:
- embedded controller: no idea, the laptop's EC might be acting strangely. I wouldn't be able to detect whether it's sending badly formatted data over ACPI.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #14 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 624300
  --> http://bugzilla.opensuse.org/attachment.cgi?id=624300&action=edit
systemd core dump

core dump from systemd 210-25.12.1 (the stock one from the official "OpenSuse 13.2 Update" repository)
http://bugzilla.opensuse.org/show_bug.cgi?id=918158 --- Comment #15 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
> - hardware: The hardware has not been changed recently and the system used to work until before the update.

Forgot to say:
- memtest86+ ran without problems for several cycles (once the laptop was booted into Classic BIOS mode, for obvious reasons). RAM corruption is thus very unlikely

> - no filesystem corruption after checking for modification from RPM database

Cron jobs also run periodic SMART tests (short nightly, long weekly) and btrfs scrubs (weekly, too). Data/filesystem corruption is very unlikely to go unnoticed, even before the RPM database comparison.

Apart from aging components on the motherboard or in the power supply, I can't see where the problem could be coming from.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #21 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 625063
  --> http://bugzilla.opensuse.org/attachment.cgi?id=625063&action=edit
systemd core dump (xz-compressed)

New core dump of systemd (this time xz-compressed because of the 10MiB limit on bugzilla).

I did upgrade the system to the 210-25.15.1 version found in the tsaupe home repo. It crashed during one update. Given the sequence of messages in the terminal, it probably crashed during the installation of VirtualBox. I would imagine: while restarting whichever unit is responsible for the host drivers (the modern systemd equivalent of /etc/init.d/vboxdrv).

I haven't installed the debug symbols, but according to gdb's backtrace it crashes on a "free" call. Double free? Free of a corrupt buffer? Seems problematic. (Even when exposed to corrupt data, systemd should handle it gracefully and report an error, not barf and crash on it.)

Today, I noticed a new 210-25.15.2 version was available. We'll see how this one survives.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #23 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 625769
  --> http://bugzilla.opensuse.org/attachment.cgi?id=625769&action=edit
systemd core dump (xz-compressed)

Crash on systemd (210-25.15.2). This time, I managed to trigger it manually: I noticed I still had consolekit lying around. While stopping it, systemd crashed. (Some conflict with the systemd-logind service, which is supposed to handle the same tasks and was running normally at the time?)

Anyway, I'm going to remove the useless service. I just find it strange that systemd segfaults, even given bogus input from wrong services. I would expect it to at least do some sanity checks on its input and handle corrupted input gracefully (error message, possibly a restart, no crash).

> Hm, the stack seems corrupted:

That's strange, I've always been able to run the backtrace to the end. I don't have the debug symbols installed, so I have less information, but my stack doesn't seem corrupted. It just passes through several symbols which are not publicly exported and named. Here's the current one:

#0  0x00007f4d9333575b in raise () from /lib64/libpthread.so.0
#1  0x000000000040d047 in ?? ()
#2  <signal handler called>
#3  0x000000000049c050 in ?? ()
#4  0x000000000049c5e0 in ?? ()
#5  0x000000000040f5d8 in ?? ()
#6  0x0000000000413367 in ?? ()
#7  0x000000000040bea6 in ?? ()
#8  0x00007f4d92fa0b05 in __libc_start_main () from /lib64/libc.so.6
#9  0x000000000040c5cc in ?? ()

I'm pretty sure that I'm running the stock version of most libraries. I'll check whether there are more unnecessary services lying around, and whether removing them makes the problems go away. Strange that ConsoleKit didn't get removed during the past updates.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #24 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 626122
  --> http://bugzilla.opensuse.org/attachment.cgi?id=626122&action=edit
systemd core dump (xz-compressed)

Crash again. As usual: during a system update (to be precise, during VirtualBox's update, at the moment the corresponding unit is reloaded). It's still the 210-25.15.2 version from the tsauppe repository.

A different unit is the one causing the crash this time: the Swap device.

I've double-checked:
- no file installed by the RPMs is reported to fail its checksum (besides a few unrelated configuration files that I might have edited)
- there are no funny packages (e.g. left-overs from before the upgrades, like consolekit) when comparing to a freshly installed machine
- nearly all the files in /usr/bin ../sbin ../lib ../lib64 come from RPMs (I know very well the few files not accounted for)

The system should be clean. I really doubt that there's something else causing the crash.
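The comment does not say which command produced the checksum report. One common way to get such a report is `rpm -Va`, and that choice is an assumption here. The following minimal Python sketch filters that kind of output for files whose digest differs (a `5` in the third column of the flag string), marking whether each one is a configuration file (attribute `c`). The function name and the sample lines are made up for illustration.

```python
import re

# One line of `rpm -Va`-style output: a flag string such as "S.5....T.",
# an optional attribute marker (c = config file), then the path.
VERIFY_RE = re.compile(
    r"^(?P<flags>[SM5DLUGTP.?]{8,9})\s+(?:(?P<attr>[cdglr])\s+)?(?P<path>/.*)$"
)


def verify_report(output):
    """Split verification output into digest mismatches and missing files."""
    report = {"digest_mismatch": [], "missing": []}
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("missing"):
            # Sketch only: assumes the path contains no spaces.
            report["missing"].append(line.split()[-1])
            continue
        m = VERIFY_RE.match(line)
        if m and m.group("flags")[2] == "5":
            is_config = m.group("attr") == "c"
            report["digest_mismatch"].append((m.group("path"), is_config))
    return report


# Made-up sample lines, not taken from the reporter's machine.
SAMPLE = """\
S.5....T.  c /etc/cups/cupsd.conf
S.5....T.    /usr/bin/example
.......T.    /usr/lib64/libexample.so.1
missing     /usr/share/doc/example/README
"""
```

With this split, an edited configuration file shows up with `is_config` set to `True`, while a digest mismatch on a binary or library would be the case worth investigating.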
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #27 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 37
Model name:            Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
Stepping:              5
CPU MHz:               1199.000
CPU max MHz:           2400.0000
CPU min MHz:           1199.0000
BogoMIPS:              4788.50
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
http://bugzilla.opensuse.org/show_bug.cgi?id=918158 --- Comment #29 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> --- Created attachment 629235 --> http://bugzilla.opensuse.org/attachment.cgi?id=629235&action=edit systemd core dump (xz-compressed)
> Also I'd like to know if this also happens with systemd-210 from Base:System:Legacy as there are a lot of upstream bug fixes included.

Things were stable for quite some time, including going through a few updates flawlessly, so seemingly my problems had been solved. But then systemd crashed again, and this time simply as I was hitting "tab" in bash to autocomplete the name of a service to be restarted (dnsmasq.service, I think).

> Do you have some optimization enabled in the CMOS/BIOS setup that is timing of RAM and/or cache lines? Which kind of RAM this is with or without ECC?

It's a Dell Latitude E6510 laptop running 2x4GB of non-ECC RAM from Crucial. As with most professional "enterprise"-y laptops, there isn't much RAM configuration in the BIOS. The only setting I've changed was turning on EFI boot a few months back, when I moved my system to an SSD.

I have been running memtest86+ scans for several hours without any reported problem.

Also, the only affected software is:
- systemd, which on a few rare occasions segfaults when receiving requests on the DBus (usually requests to restart services, but the other day it was simply answering a bash autocomplete plugin).
- a few KDE applications, which can segfault when quitting, or a few of them when waking from suspend-to-RAM (either a problem while freeing a memory buffer, or a pointer problem when handling an event received from glib).

These are the only problems I've noticed. None of the remaining software I run has any problems (I can edit images with GIMP, compile software with GCC, the background services are running fine, etc., all without crashes).

I've had RAM problems on another machine in the past, and the behaviour is completely different (after a while, random crashes happen, affecting nearly any program). This isn't the case on the laptop.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158 --- Comment #31 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> --- Created attachment 629804 --> http://bugzilla.opensuse.org/attachment.cgi?id=629804&action=edit systemd core dump (xz-compressed) Crash again. Happened during upgrade. clamav was the service being restarted. I didn't notice if this happened before or after the upgrade to 210-12.1
> The userdata should have been retrieved in function property_get_all_callbacks_run in src/libsystemd/sd-bus/bus-objects.c But they aren't We should check for userdata before we try to append a property.

Hey, at least I can serve as a convoluted strategy for fuzz testing! For the betterment of error handling and weird data handling in systemd, for the benefit of all users!

> Do you have some optimization enabled in the CMOS/BIOS setup that is timing of RAM and/or cache lines? Which kind of RAM this is with or without ECC?

Just to be sure that RAM isn't the problem, I ran memtest86+ overnight. No problem found.

Because memtest86+ is old BIOS *only*, and I'm currently running EFI, the next reboot was over classic MBR (just in case there is a subtle initialisation bug that appears only with EFI). The crash still happened anyway.

At least I think we can definitely rule out RAM-based problems.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #34 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 632025
  --> http://bugzilla.opensuse.org/attachment.cgi?id=632025&action=edit
systemd core dump (xz-compressed)

Well, got "lucky": I did manage to gather a core dump, not too long after moving to your version of systemd (without all the "symbol optimized out").

These kinds of weird misbehaviour are still happening exclusively after suspend-to-RAM/wake-up cycles.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #37 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 633311
  --> http://bugzilla.opensuse.org/attachment.cgi?id=633311&action=edit
systemd core dump (xz-compressed)

Next crash (while updating with zypper, at the moment when a service needed to be restarted).
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #38 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 633318
  --> http://bugzilla.opensuse.org/attachment.cgi?id=633318&action=edit
full log since reboot (xz-compressed)

Full system log since last boot associated with crash from dump #633311
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

Ivan Topolsky <ivan.topolsky@isb-sib.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #633318|0                           |1
        is obsolete|                            |

--- Comment #39 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 633326
  --> http://bugzilla.opensuse.org/attachment.cgi?id=633326&action=edit
rsyslog (xz-compresesd)

I just realized that, as systemd has crashed, journald would be missing some parts. (systemd's segfault isn't in there.)

Here's rsyslog's output in /var/log/messages for the same time frame.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158 --- Comment #43 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
>> RAM corruption is thus very unlikely
> It becomes more and more likely ...

I still don't buy damaged RAM sticks, as I've tried memtest86+ again for extended periods of time and still no problem was reported. I'll side with Thomas on that one.

> The question is: do you have any optimization around, that is e.g. preloaded libraries (echo $LD_LIBRARY_PATH $LD_PRELOAD) for the used CPU/chipset and/or selfcompiled and/or third party kernel,

No, the installation is stock. I use my laptop in production and thus haven't optimized much.

> third party modules

*That* is what I've been thinking too. Two third party modules:

- Nvidia closed source kernel driver. Contrary to their good reputation, Nvidia's binary drivers are usually very problematic on laptops. (All the laptops I've been through displayed random texture corruption, problems with the text-mode console as they don't do KMS, etc. It's far from being specific to this laptop. It seems more likely that Nvidia concentrates its efforts on supporting desktops/workstations on Linux and neglects laptops, see the debacle around Optimus.)
  As Nvidia has moved to G04 and excluded my chip from that set, I've switched to the open-source Nouveau driver. All the segfaults that happen here and there happened AFTER this switch. (Although I don't have texture corruption anymore. The only problem I have is that the very specific kernel version that opensuse uses, 3.16, has a bug that prevents nouveau from completely shutting down the backlight when putting the screen in DPMS OFF.)

- WL restricted source driver for the Broadcom WiFi b/g/n mini-PCIe card built in by Dell. That's what opensuse installed by default back when I acquired this laptop. Back then I tried experimenting with alternatives, but b43 was reverse engineered and was missing a lot of features, and the brcmsmac open-source driver was too new and had speed problems (missing 5 GHz, speed dropping to 10 Mbit/s on 2.4 GHz N).
  I've noticed in the logs that the wl driver makes the kernel spit messages. As brcmsmac nowadays seems to be in much better shape (only power saving, AP mode and some special channel modes are missing; N is fully functioning, including at 5 GHz), I'll try switching and see if the machine becomes more stable.
  That's also the main difference between this problematic laptop (has BCM wireless), the older laptops (external WiFi PC Card that I plug in when I actually need it) and the other desktops/workstations I use (no WiFi).

> Also which modules are unloaded before hibernate/suspend and reload afterwards if any ?

Is there a canonical way to list them?

I would suspect the USB drivers. Currently, the bluetooth (BCM5880) is wired over USB in the laptop, and I have a USB-based microSD reader mini card. They use stock kernel drivers.

Also, last but not least: ACPI is of course also 3rd party (byte) code running here. Not sure if I can do anything about that one (the BIOS/firmware from Dell is up to date).

I'll see what I can do. If the segfaults disappear, it might have been the BCM restricted source driver silently corrupting the rest.
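On the question of listing modules around a suspend cycle: the thread gives no canonical answer. One simple approach, offered here only as a sketch, is to snapshot `/proc/modules` (the file `lsmod` reads) before suspend and after resume and compare the two sets. The function names below are hypothetical, and the sample text is made up.

```python
from pathlib import Path


def parse_modules(text):
    """Return the set of module names from /proc/modules-style text.

    Each line starts with the module name, followed by size, use count,
    dependencies, state and load address.
    """
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def diff_modules(before, after):
    """Return (no_longer_loaded, newly_loaded) as sorted lists of names."""
    return sorted(before - after), sorted(after - before)


def snapshot(path="/proc/modules"):
    """Read the current module list; only works on a Linux system."""
    return parse_modules(Path(path).read_text())


# Made-up sample snapshots for illustration.
BEFORE = """\
wl 6367819 0 - Live 0xffffffffa0a3c000 (PO)
btusb 32691 0 - Live 0xffffffffa0500000
nouveau 1226279 3 - Live 0xffffffffa0100000
"""
AFTER = """\
brcmsmac 562901 0 - Live 0xffffffffa0b00000
btusb 32691 0 - Live 0xffffffffa0500000
nouveau 1226279 3 - Live 0xffffffffa0100000
"""
```

This only shows which modules differ between two points in time. A module that is unloaded before suspend and reloaded on resume appears in both snapshots, so catching it would need a snapshot taken from a suspend hook or a look at the kernel log.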
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #44 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
Created attachment 633507
  --> http://bugzilla.opensuse.org/attachment.cgi?id=633507&action=edit
core and logs

No luck. :-(
Even with Nouveau and brcmsmac (thus no 3rd party modules), systemd still segfaults.
I'm at a loss, I don't know what to do.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

--- Comment #46 from Ivan Topolsky <ivan.topolsky@isb-sib.ch> ---
@Thomas:
Yesterday's update from your repository produces "Invalid Hash Bucket Pointer" upon attempts to log in. Reverting to the stock opensuse systemd fixes the problem.

@Werner:
Any further pointers in that direction:
- lists of known good/bad drivers?
- recommendations on what to put in my /etc/pm/sleep.d?