[Bug 918158] New: error 4 segfault in systemd and systemd-udev
http://bugzilla.opensuse.org/show_bug.cgi?id=918158

Bug ID: 918158
Summary: error 4 segfault in systemd and systemd-udev
Classification: openSUSE
Product: openSUSE Distribution
Version: 13.2
Hardware: x86-64
OS: openSUSE 13.2
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: ivan.topolsky@isb-sib.ch
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0
Build Identifier:

systemd-udevd[772]: segfault at 18 ip 00007fa5e0700a55 sp 00007fffc57d9f70 error 4 in libc-2.19.so[7fa5e0688000+19e000]

followed a little bit later by:

kernel:[73256.715384] systemd[1]: segfault at 7ce00000 ip 0000000000460384 sp 00007fff16f78d70 error 4 in systemd[400000+114000]

broadcast on all terminals.

Reproducible: Sometimes

Steps to Reproduce:
(0. not sure about this step: the first crash in systemd-udev might happen after resume from "Suspend-to-RAM")
1. "zypper update"
2. zypper updates a service (cups in my last occurrence)
3. zypper restarts a service

Actual Results:
The segfault is broadcast on all terminals. systemd dies. systemctl is unresponsive. The SysRq magic key is the only way to (more or less) cleanly restart the system.

Expected Results:
systemd should NOT die. (Or at least, there should be a clean way to reload it.)

This started following a recent upgrade of the system from the official openSUSE update repository. It might be related to glib2 problems that plague my KDE desktop.

Similar to this:
https://bugzilla.redhat.com/show_bug.cgi?id=1071128
https://bbs.archlinux.org/viewtopic.php?id=177801

Related to this (similar results, different component of systemd):
https://bugzilla.opensuse.org/show_bug.cgi?id=905369

--
You are receiving this mail because:
You are on the CC list for the bug.
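For readers trying to interpret the two kernel messages in the report, here is a minimal sketch that decodes such a line. It assumes the usual x86 "segfault at ADDR ip IP sp SP error CODE in OBJECT[BASE+SIZE]" layout with all numbers in hexadecimal; the function name `decode_segfault` and the dictionary keys are made up for this illustration.

```python
import re

# Bits of the x86 page-fault error code that the kernel prints as "error N".
PF_PROT = 0x1   # 0: page not present, 1: protection violation
PF_WRITE = 0x2  # 0: read access, 1: write access
PF_USER = 0x4   # 0: fault in kernel mode, 1: fault in user mode

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode one kernel segfault line; return None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "object": m.group("obj"),
        "fault_address": int(m.group("addr"), 16),
        # Offset of the faulting instruction inside the mapped object.
        "offset_in_object": ip - base,
        "access": "write" if err & PF_WRITE else "read",
        "mode": "user" if err & PF_USER else "kernel",
        "cause": "protection violation" if err & PF_PROT else "page not present",
    }


if __name__ == "__main__":
    print(decode_segfault(
        "systemd-udevd[772]: segfault at 18 ip 00007fa5e0700a55 "
        "sp 00007fffc57d9f70 error 4 in libc-2.19.so[7fa5e0688000+19e000]"
    ))
```

Under these assumptions, "error 4" in both messages reads as a user-mode read of a page that is not mapped. The offset of the instruction pointer inside the object is the value that could later be matched against debug symbols.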
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
Martin Pluskal
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
Martin Schröder
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
Stefan Botter
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #13 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #14 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #15 from Ivan Topolsky
- hardware: The hardware has not been changed recently, and the system worked until the update.
Forgot to say: - memtest86+ ran without problems for several cycles (once the laptop was booted into classic BIOS mode, for obvious reasons). RAM corruption is thus very unlikely.
- no filesystem corruption was found when checking for modifications against the RPM database
Cron jobs also run periodic SMART tests (short ones nightly, long ones weekly) and btrfs scrubs (weekly, too). Data or filesystem corruption is very unlikely to go unnoticed, even before the RPM database comparison. Apart from aging components on the motherboard or in the power supply, I can't see where the problem could be coming from.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #21 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #23 from Ivan Topolsky
> Hm, the stack seems corrupted:
That's strange, I've always been able to run the backtrace to the end. I don't have the debug symbols installed, so I have less information, but my stack doesn't seem corrupted. It just passes through several symbols that are not publicly exported and named. Here's the current one:
#0 0x00007f4d9333575b in raise () from /lib64/libpthread.so.0
#1 0x000000000040d047 in ?? ()
#2 <signal handler called>
#3 0x000000000049c050 in ?? ()
#4 0x000000000049c5e0 in ?? ()
#5 0x000000000040f5d8 in ?? ()
#6 0x0000000000413367 in ?? ()
#7 0x000000000040bea6 in ?? ()
#8 0x00007f4d92fa0b05 in __libc_start_main () from /lib64/libc.so.6
#9 0x000000000040c5cc in ?? ()
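As an aside, the frames that gdb prints as "??" in a backtrace like the one above can be collected mechanically so they can be looked up once debug symbols are available. This is only a sketch: the function name `unresolved_frames` is made up here, and it assumes gdb's usual "#N 0xADDR in FUNC () [from LIB]" frame layout.

```python
import re

# One gdb frame: "#N 0xADDR in FUNC () [from LIB]". Frames without an
# address, such as "<signal handler called>", are skipped on purpose.
FRAME_RE = re.compile(r"#(\d+)\s+0x([0-9a-f]+) in (\S+) \(\)(?: from (\S+))?")


def unresolved_frames(backtrace):
    """Return (frame number, address) for every frame gdb printed as '??'."""
    return [
        (int(number), int(address, 16))
        for number, address, function, _library in FRAME_RE.findall(backtrace)
        if function == "??"
    ]


if __name__ == "__main__":
    sample = (
        "#0 0x00007f4d9333575b in raise () from /lib64/libpthread.so.0\n"
        "#1 0x000000000040d047 in ?? ()\n"
        "#2 <signal handler called>\n"
        "#3 0x000000000049c050 in ?? ()\n"
    )
    for number, address in unresolved_frames(sample):
        print(f"frame {number}: {address:#x}")
```

With the matching debuginfo package installed, such addresses could then be passed to a tool like `addr2line -f -e <path to the systemd binary> <address>`. The binary path is left open here because it depends on the installation.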
I'm pretty sure that I'm running the stock version of most libraries. I'll check whether there are more unnecessary services lying around, and whether removing them makes the problems go away. Strange that ConsoleKit didn't get removed during the past updates.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #24 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #27 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #29 from Ivan Topolsky
> Also I'd like to know if this also happens with systemd-210 from Base:System:Legacy, as there are a lot of upstream bug fixes included.
Things were stable for quite some time, including going through a few updates flawlessly, so my problems seemed to have been solved. But then systemd crashed again, this time simply as I was hitting "tab" in bash to autocomplete the name of a service to be restarted (dnsmasq.service, I think).
> Do you have some optimization enabled in the CMOS/BIOS setup, that is, timing of RAM and/or cache lines? Which kind of RAM is this, with or without ECC?
It's a Dell Latitude E6510 laptop running 2x4GB non-ECC RAM from Crucial. As with most professional "enterprise"-y laptops, there isn't much RAM configuration in the BIOS. The only setting I've changed was turning on EFI boot a few months back, when I moved my system to an SSD. I have been running memtest86+ scans for several hours without any reported problem.

Also, the only affected software is:
- systemd, which on a few rare occasions segfaults when receiving requests on the DBus (usually requests to restart services, but the other day's was simply answering a bash autocomplete plugin).
- a few KDE applications, which can segfault when quitting, or a few of them when waking from suspend-to-RAM (either a problem while freeing a memory buffer, or a pointer problem when handling an event received from glib).

These are the only problems I've noticed. None of the remaining software I run has any problems (I can edit images with GIMP, compile software with GCC, the background services are running fine, etc., all without crashes). I've had RAM problems in another machine in the past, and the behaviour is completely different (after a while, random crashes happen, affecting nearly any program). This isn't the case on the laptop.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #31 from Ivan Topolsky
> The userdata should have been retrieved in function property_get_all_callbacks_run in src/libsystemd/sd-bus/bus-objects.c, but it isn't. We should check for userdata before we try to append a property.
Hey, at least I can serve as a convoluted strategy for fuzz testing! For the betterment of systemd's error handling and weird-data handling, for the benefit of all users!
> Do you have some optimization enabled in the CMOS/BIOS setup, that is, timing of RAM and/or cache lines? Which kind of RAM is this, with or without ECC?
Just to be sure that RAM isn't the problem, I ran memtest86+ overnight. No problem found. Because memtest86+ works with legacy BIOS *only*, and I'm currently running EFI, the next reboot was over classic MBR (just in case there is a subtle initialisation bug that appears only with EFI). The crash still happened anyway. At least I think we can definitely rule out RAM-based problems.
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #34 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #37 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #38 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #43 from Ivan Topolsky
> > RAM corruption is thus very unlikely
> It becomes more and more likely ...
I still don't buy damaged RAM sticks, as I ran memtest86+ again for extended periods of time and there was still no reported problem. I'll side with Thomas on that one.
> The question is: do you have any optimization around, that is e.g. preloaded libraries (echo $LD_LIBRARY_PATH $LD_PRELOAD) for the used CPU/chipset and/or selfcompiled and/or third party kernel,
No, the installation is stock. I use my laptop in production and thus haven't optimized much.
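The preload check asked for above (`echo $LD_LIBRARY_PATH $LD_PRELOAD`) can be sketched a little more thoroughly. This minimal version also looks at `/etc/ld.so.preload`, which the dynamic linker consults system-wide; the function names `preload_report` and `is_stock` are invented for this illustration.

```python
import os


def preload_report(environ=None, preload_file="/etc/ld.so.preload"):
    """Collect the settings that can inject or redirect shared libraries."""
    environ = os.environ if environ is None else environ
    report = {
        name: environ.get(name, "")
        for name in ("LD_LIBRARY_PATH", "LD_PRELOAD")
    }
    try:
        with open(preload_file) as handle:
            report[preload_file] = handle.read().strip()
    except OSError:
        # A missing or unreadable file is treated as "nothing preloaded".
        report[preload_file] = ""
    return report


def is_stock(report):
    """True when none of the inspected settings is set."""
    return not any(report.values())


if __name__ == "__main__":
    report = preload_report()
    for key, value in report.items():
        print(f"{key}: {value or '(empty)'}")
    print("stock" if is_stock(report) else "non-default library loading")
```

Note that this only inspects the environment of the process running the script. PID 1 has its own environment, which is readable from `/proc/1/environ` with sufficient privileges.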
> third party modules
*That* is what I've been thinking too. There are two third-party modules:

- The Nvidia closed-source kernel driver. Contrary to their good reputation, Nvidia's binary drivers are usually very problematic on laptops. (All the laptops I've been through displayed random texture corruption, problems with the text-mode console as the driver doesn't do KMS, etc.; it's far from being specific to this laptop. It seems more likely that Nvidia concentrates its effort on supporting desktops/workstations on Linux and neglects laptops; see the debacle around Optimus.) As Nvidia has moved to G04 and excluded my chip from that set, I've switched to the open-source Nouveau driver. All the segfaults that happen here and there happened AFTER this switch. (Although I don't have texture corruption anymore. The only problem I have is that the very specific kernel version that openSUSE uses, 3.16, has a bug that prevents nouveau from completely shutting down the backlight when putting the screen in DPMS OFF.)

- The WL restricted-source driver for the Broadcom WiFi b/g/n mini-PCIe card built in by Dell. That's what openSUSE installed by default back when I acquired this laptop. Back then I tried experimenting with alternatives, but b43 was reverse engineered and was missing a lot of features, and the brcmsmac open-source driver was too new and had speed problems (missing 5 GHz, speed dropping to 10 Mbit/s on 2.4 GHz N). I've noticed in the logs that the wl driver makes the kernel spit messages. As nowadays brcmsmac seems to be in much better shape (only power saving, AP mode and some special channel modes are missing; N is fully functioning, including at 5 GHz), I'll try switching and see if the machine becomes more stable.

That's also the main difference between this problematic laptop (has BCM wireless) and the older laptops (external WiFi PC Card that I plug in when I actually need it) and the other desktops/workstations I use (no WiFi).
> Also which modules are unloaded before hibernate/suspend and reload afterwards if any ?
Is there a canonical way to list them? I would suspect the USB drivers. Currently, the Bluetooth (BCM5880) is wired over USB in the laptop, and I have a USB-attached microSD reader mini card. They use stock kernel drivers.

Also, last but not least: ACPI is of course also third-party (byte) code running here. Not sure if I can do anything about that one (the BIOS/firmware from Dell is up to date).

I'll see what I can do. If the segfaults disappear, it might have been the BCM restricted-source driver silently corrupting the rest.
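One way to observe which modules disappear or reappear around a suspend cycle, and which ones carry taint flags, is to compare two snapshots of `/proc/modules`. The sketch below is an assumption-laden illustration, not a canonical tool: the helper names are made up, and it assumes taint flags appear as a trailing parenthesised field such as `(PO)`.

```python
def parse_modules(text):
    """Map module name -> taint flags from /proc/modules-style text."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        last = fields[-1]
        # Taint flags, when present, are the final "(...)" field.
        flags = last.strip("()") if last.startswith("(") else ""
        modules[fields[0]] = flags
    return modules


def diff_modules(before, after):
    """Return (modules that vanished, modules that appeared), both sorted."""
    return sorted(set(before) - set(after)), sorted(set(after) - set(before))


def tainting(modules):
    """Names of modules that carry any taint flag."""
    return sorted(name for name, flags in modules.items() if flags)


if __name__ == "__main__":
    with open("/proc/modules") as handle:
        loaded = parse_modules(handle.read())
    print("modules with taint flags:", tainting(loaded))
```

Saving the output of `parse_modules` before suspend and again after resume, then passing both to `diff_modules`, would show any module that was not reloaded. In the kernel's taint notation, `P` marks a proprietary module and `O` an out-of-tree one.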
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #44 from Ivan Topolsky
http://bugzilla.opensuse.org/show_bug.cgi?id=918158
--- Comment #46 from Ivan Topolsky