On Wed, Mar 9, 2016 at 9:29 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
On Tue, Mar 8, 2016 at 9:55 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
On Tue, Mar 8, 2016 at 10:04 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 17:06, Greg Freemyer wrote:
On Tue, Mar 8, 2016 at 1:20 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 03:09, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
> On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
>> 07.03.2016 21:41, Greg Freemyer wrote:
>>>
>>> The destination does not exist (it used to).
>>>
>>> But ignoring that rather big issue, why does a failed mount of a
>>> tertiary filesystem cause systemd to crash and burn?
>>>
>>
>> Because all mount points in /etc/fstab are mandatory unless marked as
>> optional. There is nothing new - countless times I have had to "fix" a
>> server that stayed in "rescue mode" because someone removed access to
>> SAN storage but forgot to edit fstab. Without any systemd involved.
>
> +1.
>
> Or, RTFM, use "nofail" ... maybe
Anton,
I have read the fstab entry many times. I've also admin'ed servers for years (or decades).
I somehow missed:
== If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.
Your computer will fall back to a minimally functional systemd maintenance mode, and all troubleshooting must occur exclusively via the console. ==
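For the record, I do know what a "nofail" entry looks like; something along these lines (server and paths purely illustrative):

  # mount is attempted at boot, but a failure does not block boot
  server:/export  /mnt/backup  nfs  nofail  0 0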
I am not sure how you managed to reach this state, and I would be very interested in seeing logs at debug level (boot with systemd.log_level=debug added to the kernel command line and "quiet" removed, reproduce the problem, and upload the output of "journalctl -b" somewhere). Your system should have stayed in rescue mode from the very beginning (i.e. it would not have entered normal run level 3/5).
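Roughly, the procedure is (a sketch; the exact boot-menu editing steps depend on your bootloader, and the output filename is just an example):

  # at the boot menu, edit the kernel command line:
  #   remove "quiet" and append "systemd.log_level=debug"
  # after booting and reproducing the problem, capture the whole boot log:
  journalctl -b > journal-debug.txt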
Andrei,
Generating the logs is easy.
Would you like me to just boot and let the system sit for 15 minutes, OR go ahead and trigger maintenance mode earlier?
If it is easy, do both :)
Pull all 3 files from:
http://www.iac-forensics.com/temporary_files/
Let me know when you have them and I will delete them.
I did:
- restore fstab to failure mode
- reboot with quiet deleted and systemd.log_level=debug
- login on console
- journalctl -b > systemd-journal-debug-immediately-after-boot
The reason you were able to get as far as you did is another bug :)
Mar 08 13:06:00 cloud1 systemd[1]: Trying to enqueue job multi-user.target/start/isolate
Mar 08 13:06:00 cloud1 systemd[1]: Found ordering cycle on wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on local-fs.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on home-portal_backup-portal_backup.mount/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on srv_new.mount/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on network-online.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wicked.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Breaking ordering cycle by deleting job local-fs.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Job local-fs.target/start deleted to break ordering cycle starting with wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Deleting job postfix.service/start as dependency of job local-fs.target/start
So you have a dependency loop; it sounds like some of your local mount points depend on network (NFS? iSCSI?) paths. Could you show /etc/fstab?
=== See the NFS mount comment below ===
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part1 swap swap defaults 0 0
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part2 / ext4 acl,user_xattr 1 1
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part3 /home ext4 acl,user_xattr 1 2
proc /proc proc defaults 0 0
sysfs /sys sysfs noauto 0 0
debugfs /sys/kernel/debug debugfs noauto 0 0
usbfs /proc/bus/usb usbfs noauto 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0
# NFS MOUNT HERE
10.200.3.230:/mnt/pacers1/kvm672/kvm672 /srv_new nfs rw,relatime,vers=3 0 0
# NFS dependent mounts here
/srv_new/sftp-container-large /srv/sftp ext4 nofail,loop 0 0
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 nofail,loop 0 0
===
The NFS server doesn't allow me to control the file metadata (ownership, etc.). In particular, it forces all files on it to be owned by one UID/GID. So as root I've created a couple of large container files which I mount via loopback. Within the containers I have full control of the filesystem metadata (file ownership, etc.).
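For completeness, the containers were created roughly like this (the size shown is illustrative):

  # create a large sparse container file on the NFS share (size illustrative)
  truncate -s 500G /srv_new/portal_backup_container
  # put an ext4 filesystem inside the file (-F because it is not a block device)
  mkfs.ext4 -F /srv_new/portal_backup_container
  # mount it via a loop device so I control ownership inside the filesystem
  mount -o loop /srv_new/portal_backup_container /home/portal_backup/portal_backup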
Anyway, systemd decides to remove local-fs.target from the equation, which explains how you managed to boot at all.
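You can walk the chain yourself with something like this (unit names taken from your log):

  # what local-fs.target pulls in, including the loop mounts that live on the NFS share
  systemctl list-dependencies local-fs.target
  # ordering and requirement properties of one of the mounts in the cycle
  systemctl show -p After,Requires,RequiresMountsFor srv_new.mount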
- systemctl start postfix.service
- emergency mode triggered
Correct. postfix tries to start local-fs.target again (as a dependency). By now all the other services that were in the dependency loop have already started, so nothing prevents systemd from attempting to start local-fs.target again.
Mar 08 13:08:03 cloud1 systemd[1]: Trying to enqueue job postfix.service/start/replace
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job postfix.service/start as 680
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job var-run.mount/start as 681
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job local-fs.target/start as 690
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job home-portal_backup-portal_backup.mount/start as 691
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job lvm2-activation.service/start as 774
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job lvm2-activation-early.service/start as 776
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job var-lock.mount/start as 781
Mar 08 13:08:03 cloud1 systemd[1]: Enqueued job postfix.service/start as 680
...
Mar 08 13:08:03 cloud1 systemd[1]: Failed to mount /home/portal_backup/portal_backup.
Mar 08 13:08:03 cloud1 systemd[1]: Job local-fs.target/start finished, result=dependency
Mar 08 13:08:03 cloud1 systemd[1]: Dependency failed for Local File Systems.
Mar 08 13:08:03 cloud1 systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
Mar 08 13:08:03 cloud1 systemd[1]: Trying to enqueue job emergency.target/start/replace-irreversibly
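The jump straight to emergency mode comes from local-fs.target itself; a couple of commands to see it (nothing here is specific to your box):

  # the stock local-fs.target unit carries OnFailure=emergency.target
  systemctl cat local-fs.target
  # and this shows which units pull local-fs.target back in
  systemctl list-dependencies --reverse local-fs.target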
- journalctl -b > systemd-journal-debug-immediately-after-start-postfix-triggered-emergency-mode
- reboot with debug logs
- login to console
- do nothing for 15 minutes
- emergency mode triggered by 15 minute delay
This is due to a timer unit that fires after 15 minutes and causes an attempt to start local-fs.target:
Mar 08 13:32:51 cloud1 systemd[1]: Timer elapsed on systemd-tmpfiles-clean.timer
Mar 08 13:32:51 cloud1 systemd[1]: Trying to enqueue job systemd-tmpfiles-clean.service/start/replace
Mar 08 13:32:51 cloud1 systemd[1]: Installed new job systemd-tmpfiles-clean.service/start as 900
Mar 08 13:32:51 cloud1 systemd[1]: Installed new job local-fs.target/start as 901
... etc
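To see which timers are armed and when they fire, something like:

  # list pending timers and their next elapse times
  systemctl list-timers
  # the 15-minute delay corresponds to OnBootSec= in this timer
  systemctl cat systemd-tmpfiles-clean.timer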
So you need to fix the dependency loop, probably by adding _netdev to some of the mounts so that they are ordered after the network.
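Something along these lines, as an untested sketch based on the fstab you posted (I have not checked whether the loop mounts need the same treatment):

  # mark the NFS share as a network filesystem so it is ordered after the network
  10.200.3.230:/mnt/pacers1/kvm672/kvm672 /srv_new nfs rw,relatime,vers=3,_netdev 0 0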
Should I add it to the NFS mount and the 2 NFS-dependent mounts?

Thanks for looking so hard at this.

Greg