Bug ID | 1104331 |
---|---|
Summary | OSD Segmentation fault in thread_name:safe_timer |
Classification | openSUSE |
Product | openSUSE Distribution |
Version | Leap 42.3 |
Hardware | Other |
OS | Other |
Status | NEW |
Severity | Normal |
Priority | P5 - None |
Component | Other |
Assignee | bnc-team-screening@forge.provo.novell.com |
Reporter | eugen.block@suse.com |
QA Contact | qa-bugs@suse.de |
Found By | --- |
Blocker | --- |

We're running a Ceph Luminous cluster on Leap 42.3:

    ceph1:~ # rpm -qi ceph-common
    Name        : ceph-common
    Version     : 12.2.5+git.1524775272.5e7ea8cf03
    Release     : 2.1
    Architecture: x86_64
    Install Date: Thu May 24 15:14:56 2018
    Group       : System/Filesystems
    Size        : 18064537
    License     : LGPL-2.1 and CC-BY-SA-3.0 and GPL-2.0 and BSL-1.0 and BSD-3-Clause and MIT
    Signature   : RSA/SHA256, Fri Apr 27 01:48:28 2018, Key ID 98c97fe7324e6311
    Source RPM  : ceph-12.2.5+git.1524775272.5e7ea8cf03-2.1.src.rpm
    Build Date  : Fri Apr 27 01:10:29 2018
    Build Host  : lamb72
    Relocations : (not relocatable)
    Vendor      : obs://build.opensuse.org/filesystems

For the past few days we have been experiencing random segfaults that cause OSDs to fail. We did not update any packages on our Ceph servers, and they ran stably until these segfaults started. Sometimes the OSDs recover by themselves; sometimes a whole host goes down and leaves the cluster degraded.

The description in [2] sounds exactly like our issue ([1] is a duplicate of [2]). I'll spare you the details, since the upstream bug has plenty of information. There seems to be a backport [3] for Luminous, but it is still in progress, although the change has been reviewed.

I wanted to raise awareness of this issue because it affects our production cluster, which had run stably for months. We would appreciate it if updated packages containing the fix could be made available soon.

Please let me know if any further information is needed.

[1] https://tracker.ceph.com/issues/23431
[2] https://tracker.ceph.com/issues/23352
[3] https://tracker.ceph.com/issues/26871