New subject: [Bug 1099745] [kubic][transactional server] systemd fails to boot, reports $subvolume already mounted

29 Jun 2018

      http://bugzilla.opensuse.org/show_bug.cgi?id=1099745

            Bug ID: 1099745
           Summary: [kubic][transactional server] systemd fails to boot,
                    reports $subvolume already mounted
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Other
          Assignee: fbui@suse.com
          Reporter: rbrown@suse.com
        QA Contact: qa-bugs@suse.de
                CC: iforster@suse.com, kukuk@suse.com
          Found By: ---
           Blocker: ---

Created attachment 775783
  --> http://bugzilla.opensuse.org/attachment.cgi?id=775783&action=edit
journal -b with debug log enabled

Franck and I have already debugged this bug significantly, but for
completeness here is the report as it stands right now.

# SYMPTOMS
On Transactional Tumbleweed or Kubic systems there is a clear race condition:
on some boots, mounting one or more of the .mount units required by
local-fs.target fails with errors like the following:
...
mount[661]: mount: /boot/grub2/x86_64-efi: /dev/sda2 already mounted on /.
The affected subvolume is not consistent, and on occasion more than one
subvolume fails to mount with the same error.
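
To count how often this failure occurs across many boots, the error line can
be matched mechanically. The following is a minimal Python sketch, assuming the
journal has been exported to plain text and that failures follow the format of
the sample line above. The function name and the regular expression are
illustrative and are not part of any systemd tooling.

```python
import re

# Pattern modelled on the single sample line in this report; mount errors
# that are worded differently would not match.
ALREADY_MOUNTED = re.compile(
    r"mount\[(?P<pid>\d+)\]: mount: (?P<mountpoint>\S+): "
    r"(?P<device>\S+) already mounted on (?P<existing>.+?)\.\s*$"
)


def find_already_mounted(journal_text):
    """Return one dict per 'already mounted' failure found in journal_text."""
    failures = []
    for line in journal_text.splitlines():
        match = ALREADY_MOUNTED.search(line)
        if match:
            failures.append(match.groupdict())
    return failures


if __name__ == "__main__":
    sample = (
        "mount[661]: mount: /boot/grub2/x86_64-efi: "
        "/dev/sda2 already mounted on /."
    )
    for failure in find_already_mounted(sample):
        print(failure["mountpoint"], failure["device"], failure["existing"])
```

Each returned dict names the mount point that failed, the device, and the
location the device was reported as already mounted on, which makes it easy to
see whether the same subvolume fails repeatedly.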

Once the user enters the Emergency shell, any attempt to mount the subvolume
works perfectly fine. It is not "already mounted" by the time the sysadmin can
log in to the Emergency shell.
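
One way to confirm from the Emergency shell that a subvolume really is not
mounted is to inspect /proc/self/mountinfo. A rough Python sketch follows,
assuming Python is available in that environment (it may not be). The helper
names are hypothetical.

```python
def mountpoints_for_source(mountinfo_text, source):
    """Return mount points whose mount source equals `source`.

    Expects text in the /proc/self/mountinfo format, where a lone "-"
    separates the per-mount fields from "fstype source super-options".
    """
    found = []
    for line in mountinfo_text.splitlines():
        before, sep, after = line.partition(" - ")
        if not sep:
            continue  # not a well-formed mountinfo line
        fields = before.split()
        tail = after.split()
        if len(fields) < 5 or len(tail) < 2:
            continue
        if tail[1] == source:
            found.append(fields[4])  # fifth field is the mount point
    return found


def read_mountinfo(path="/proc/self/mountinfo"):
    """Read the mount table of the current process (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Calling mountpoints_for_source(read_mountinfo(), "/dev/sda2") lists every
location where the device from the error message is currently mounted. If the
failed mount point is absent from that list, the "already mounted" error no
longer reflects the state of the system.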

# STEPS TO REPRODUCE

1. Install Tumbleweed with the transactional server role and all default
   settings.
2. Create a cron job that runs "reboot -f" every minute.
3. Wait until the system fails to boot.

This has been reproduced on an Intel i5 NUC somewhat reliably. It occurs on
average about once every dozen reboots.
To rule out any kind of bus/hardware issue, it has been successfully reproduced
with each of the following devices holding the system's rootfs:
SATA 7200 HDD
SATA SSD
M.2 SSD
USB SD Card (Multiple)
USB Pen Drive

Enabling systemd debug logging seems to slow things down enough that this bug
is much harder to catch, but not impossible. Debug logs are attached.

# USER IMPACT

Because the boot fails at this point, systemd drops to the emergency shell.

The system is therefore unusable, hence the critical severity of this bug.

Transactional Tumbleweed & Kubic machines reboot regularly (as a result of
every package change, update, etc), further increasing the severity of this bug
- if the bug occurs on average every 12 boots, users can expect one major
outage of each system at least every month.
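
The outage estimate can be made concrete with a little arithmetic. This is a
back-of-the-envelope Python sketch that assumes each boot fails independently
with probability 1/12 (the observed average). The independence assumption and
the reboot counts used are illustrative, not measured.

```python
def probability_of_failure(boots, per_boot_failure=1 / 12):
    """Chance of at least one failed boot in `boots` independent boots."""
    return 1 - (1 - per_boot_failure) ** boots


def expected_boots_until_failure(per_boot_failure=1 / 12):
    """Mean number of boots up to and including the first failure."""
    return 1 / per_boot_failure


if __name__ == "__main__":
    # Illustrative reboot counts, e.g. reboots per month for one system.
    for boots in (4, 12, 30):
        print(boots, round(probability_of_failure(boots), 3))
```

Under these assumptions, a system that reboots 12 times has roughly a 65%
chance of hitting the failure at least once, and a system that reboots daily
is very likely to hit it within a month.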

Neither distribution can be considered reliable in normal operation until this
bug is mitigated or resolved.

# AFFECTED SYSTEMS (Tested, able to reproduce)
Tumbleweed (Transactional Server Role)
Kubic (All Installations)

# SUSPECTED AFFECTED SYSTEMS
Future CaaS Platform versions and SLE 15 SPx (with Transactional updates) using
systemd versions equal to or later than the one currently in Tumbleweed

# UNAFFECTED SYSTEMS (Tested, unable to reproduce)
Leap 15 (Including Transactional Server Role)
Tumbleweed (non transactional roles)

# SUSPECTED UNAFFECTED SYSTEMS
SLE 15 GA

# PRELIMINARY HYPOTHESIS
As this doesn't happen on all Tumbleweed systems this bug is likely triggered
by the presence of /var (btrfs subvolume) and /etc (overlayfs related to var)
in fstab-sys in initrd. This is needed on a transactional system so the
contents of /etc's overlayfs can be read by the initrd. It is not present on
non-transactional Tumbleweed system roles. 

However, Leap 15 transactional systems have an identical initrd & transactional
update configuration. As this doesn't happen on Leap 15 systems, this bug is
clearly triggered by changes to systemd introduced in versions later than the
one in Leap 15 and SLE 15.

My hypothesis is that one of the recently introduced changes causes systemd to
attempt to mount subvolumes too early. I suspect systemd might be unmounting
the devices used by the initrd before remounting them as part of
local-fs.target.

If systemd is not waiting long enough for the unmount to complete before
attempting to mount the first .mount unit, that could match the observed
behaviour.

