On Wed, Mar 9, 2016 at 9:29 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
On Tue, Mar 8, 2016 at 9:55 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
On Tue, Mar 8, 2016 at 10:04 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 17:06, Greg Freemyer wrote:
On Tue, Mar 8, 2016 at 1:20 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 03:09, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
> On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
>> 07.03.2016 21:41, Greg Freemyer wrote:
>>>
>>> The destination does not exist (it used to).
>>>
>>> But ignoring that rather big issue, why does a failed mount of a
>>> tertiary filesystem cause systemd to crash and burn?
>>>
>>
>> Because all mount points in /etc/fstab are mandatory unless marked as
>> optional. There is nothing new - countless times I have had to "fix" a
>> server that stayed in "rescue mode" because someone removed access to
>> SAN storage but forgot to edit fstab. Without any systemd involved.
>
> +1.
>
> Or, RTFM, use "nofail" ... maybe
Anton,
I have read the fstab entry many times. I've also admin'ed servers for years (or decades).
I somehow missed:
== If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.
Your computer will fall back to a minimally functional systemd maintenance mode, and all troubleshooting must occur exclusively via the console. ==
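For the record, I do know what a "nofail" entry looks like; something along these lines (server and paths purely illustrative):

  # mount is attempted at boot, but a failure does not block boot
  server:/export  /mnt/backup  nfs  nofail  0 0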
I am not sure how you managed to reach this state, and I would be very interested in seeing logs at debug level (boot with systemd.log_level=debug added to the kernel command line and "quiet" removed, reproduce the problem, and upload the output of "journalctl -b" somewhere). Your system should have stayed in rescue mode from the very beginning (i.e. it would not have entered normal run level 3/5).
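Roughly, the procedure is (a sketch; the exact boot-menu editing steps depend on your bootloader, and the output filename is just an example):

  # at the boot menu, edit the kernel command line:
  #   remove "quiet" and append "systemd.log_level=debug"
  # after booting and reproducing the problem, capture the whole boot log:
  journalctl -b > journal-debug.txt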
Andrei,
Generating the logs is easy.
Would you like me to just boot and let the system sit for 15 minutes, OR go ahead and trigger maintenance mode earlier?
If it is easy, do both :)
Pull all 3 files from:
http://www.iac-forensics.com/temporary_files/
Let me know when you have them and I will delete them.
I did:
- restore fstab to failure mode
- reboot with quiet deleted and systemd.log_level=debug
- login on console
- journalctl -b > systemd-journal-debug-immediately-after-boot
The reason you were able to get as far as you did is another bug :)
Mar 08 13:06:00 cloud1 systemd[1]: Trying to enqueue job multi-user.target/start/isolate
Mar 08 13:06:00 cloud1 systemd[1]: Found ordering cycle on wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on local-fs.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on home-portal_backup-portal_backup.mount/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on srv_new.mount/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on network-online.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wicked.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Breaking ordering cycle by deleting job local-fs.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Job local-fs.target/start deleted to break ordering cycle starting with wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Deleting job postfix.service/start as dependency of job local-fs.target/start
So you have a dependency loop; it sounds like some of your local mount points depend on network (NFS? iSCSI?) paths. Could you show /etc/fstab?
=== See the NFS mount comment below ===
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part1 swap swap defaults 0 0
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part2 / ext4 acl,user_xattr 1 1
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part3 /home ext4 acl,user_xattr 1 2
proc /proc proc defaults 0 0
sysfs /sys sysfs noauto 0 0
debugfs /sys/kernel/debug debugfs noauto 0 0
usbfs /proc/bus/usb usbfs noauto 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0
# NFS MOUNT HERE
10.200.3.230:/mnt/pacers1/kvm672/kvm672 /srv_new nfs rw,relatime,vers=3 0 0
# NFS dependent mounts here
/srv_new/sftp-container-large /srv/sftp ext4 nofail,loop 0 0
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 nofail,loop 0 0
===
The NFS server doesn't allow me to control the file metadata (ownership, etc.). In particular, it forces all files on it to be owned by one UID/GID. So as root I've created a couple of large container files which I mount via loopback. Within the containers I have full control of the filesystem metadata (file ownership, etc.).
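For completeness, the containers were created roughly like this (the size shown is illustrative):

  # create a large sparse container file on the NFS share (size illustrative)
  truncate -s 500G /srv_new/portal_backup_container
  # put an ext4 filesystem inside the file (-F because it is not a block device)
  mkfs.ext4 -F /srv_new/portal_backup_container
  # mount it via a loop device so I control ownership inside the filesystem
  mount -o loop /srv_new/portal_backup_container /home/portal_backup/portal_backup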
Anyway, systemd decides to remove local-fs.target from the equation, which explains how you managed to boot at all.
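You can walk the chain yourself with something like this (unit names taken from your log):

  # what local-fs.target pulls in, including the loop mounts that live on the NFS share
  systemctl list-dependencies local-fs.target
  # ordering and requirement properties of one of the mounts in the cycle
  systemctl show -p After,Requires,RequiresMountsFor srv_new.mount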
- systemctl start postfix.service
- emergency mode triggered
Correct. postfix tries to start local-fs.target again (as a dependency). By now all the other services that were in the dependency loop have already started, so nothing prevents systemd from attempting to start local-fs.target again.
Mar 08 13:08:03 cloud1 systemd[1]: Trying to enqueue job postfix.service/start/replace
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job postfix.service/start as 680
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job var-run.mount/start as 681
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job local-fs.target/start as 690
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job home-portal_backup-portal_backup.mount/start as 691
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job lvm2-activation.service/start as 774
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job lvm2-activation-early.service/start as 776
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job var-lock.mount/start as 781
Mar 08 13:08:03 cloud1 systemd[1]: Enqueued job postfix.service/start as 680
...
Mar 08 13:08:03 cloud1 systemd[1]: Failed to mount /home/portal_backup/portal_backup.
Mar 08 13:08:03 cloud1 systemd[1]: Job local-fs.target/start finished, result=dependency
Mar 08 13:08:03 cloud1 systemd[1]: Dependency failed for Local File Systems.
Mar 08 13:08:03 cloud1 systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
Mar 08 13:08:03 cloud1 systemd[1]: Trying to enqueue job emergency.target/start/replace-irreversibly
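The jump straight to emergency mode comes from local-fs.target itself; a couple of commands to see it (nothing here is specific to your box):

  # the stock local-fs.target unit carries OnFailure=emergency.target
  systemctl cat local-fs.target
  # and this shows which units pull local-fs.target back in
  systemctl list-dependencies --reverse local-fs.target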
- journalctl -b > systemd-journal-debug-immediately-after-start-postfix-triggered-emergency-mode
- reboot with debug logs
- login to console
- do nothing for 15 minutes
- emergency mode triggered by 15 minute delay
This is due to a timer unit that fires after 15 minutes and causes an attempt to start local-fs.target:
Mar 08 13:32:51 cloud1 systemd[1]: Timer elapsed on systemd-tmpfiles-clean.timer
Mar 08 13:32:51 cloud1 systemd[1]: Trying to enqueue job systemd-tmpfiles-clean.service/start/replace
Mar 08 13:32:51 cloud1 systemd[1]: Installed new job systemd-tmpfiles-clean.service/start as 900
Mar 08 13:32:51 cloud1 systemd[1]: Installed new job local-fs.target/start as 901
... etc
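To see which timers are armed and when they fire, something like:

  # list pending timers and their next elapse times
  systemctl list-timers
  # the 15-minute delay corresponds to OnBootSec= in this timer
  systemctl cat systemd-tmpfiles-clean.timer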
So you need to fix the dependency loop, probably by adding _netdev to some of the mounts so that they are ordered after the network.
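Something along these lines, as an untested sketch based on the fstab you posted (I have not checked whether the loop mounts need the same treatment):

  # mark the NFS share as a network filesystem so it is ordered after the network
  10.200.3.230:/mnt/pacers1/kvm672/kvm672 /srv_new nfs rw,relatime,vers=3,_netdev 0 0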
Should I add it to the NFS mount and the 2 NFS-dependent mounts?

Thanks for looking so hard at this.

Greg