[opensuse] The fight continues - systemd troubleshooting?
A server I admin has been nothing but trouble today. It started having a problem, and being smarter than the average bear I thought the best way out was to go ahead and upgrade from 13.1 to 13.2 and all would be good. Not so much.

I'm now running 13.2. zypper dup says I'm done and so does zypper verify.

The current trouble is that postfix isn't running, and when I try "systemctl start postfix.service" it triggers a systemd meltdown and I get thrown into systemd maintenance mode.

There is probably useful info in the systemd journal, but there is a lot of stuff in there and I don't know what I'm looking for.

Guidance appreciated.

Thanks
Greg
--
Greg Freemyer
www.IntelligentAvatar.net
Greg Freemyer wrote:
A server I admin has been nothing but trouble today.
It started having a problem, and being smarter than average bear I thought the best way out was to go ahead and upgrade from 13.1 to 13.2 and all would be good.
Not so much.
I'm now running 13.2. zypper dup says I'm done and so does zypper verify.
The current trouble is postfix isn't running.
And when I try "systemctl start postfix.service" it triggers a systemd meltdown and I get thrown into a systemd maintenance mode.
There is probably useful info in the systemd journal, but there is a lot of stuff in there and I don't know what I'm looking for.
Guidance appreciated.
First bit - don't write "systemd" in the subject line, using "sistemad" has proven to be a better alternative. :-)

Second, have you tried starting postfix manually? Just from the console with "postfix start"? Just in case something in the postfix startup is causing the problem.

I have upgraded several boxes from 13.1 to 13.2, never had an issue.

--
Per Jessen, Zürich (4.5°C)
http://www.hostsuisse.com/ - virtual servers, made in Switzerland.
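If the journal is too noisy, narrowing it to the one unit involved usually helps - e.g. something like:

journalctl -b -u postfix.service
systemctl status postfix.service

The first shows only the journal entries for that unit from the current boot; the second shows the unit's state plus its most recent log lines.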
On Mon, Mar 7, 2016 at 6:13 AM, Per Jessen <per@computer.org> wrote:
Greg Freemyer wrote:
A server I admin has been nothing but trouble today.
It started having a problem, and being smarter than average bear I thought the best way out was to go ahead and upgrade from 13.1 to 13.2 and all would be good.
Not so much.
I'm now running 13.2. zypper dup says I'm done and so does zypper verify.
The current trouble is postfix isn't running.
And when I try "systemctl start postfix.service" it triggers a systemd meltdown and I get thrown into a systemd maintenance mode.
There is probably useful info in the systemd journal, but there is a lot of stuff in there and I don't know what I'm looking for.
Guidance appreciated.
First bit - don't write "systemd" in the subject line, using "sistemad" has proven to be a better alternative. :-)
That is important advice!
Second, have you tried starting postfix manually? Just from the console with "postfix start"? Just in case something in the postfix startup is causing the problem.
I have upgraded several boxes from 13.1 to 13.2, never had an issue.
I have had my server running for the last 12 hours, so the urgency of troubleshooting is gone, but this was still a major bug for me.

I found that even if I did nothing after re-boot, the server would enter "systemd maintenance mode" after 15 minutes. Trying to start postfix (systemctl start postfix.service) simply triggered the failure sooner.

I got a journal dump before and after the failure (journalctl -xb > log). About 500 lines of logs were added in the post-failure log. I went through and fixed every minor complaint until my server would run smoothly.

Commenting out this line from fstab was the "fix".
===
#/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 loop 0 0
===

I don't know if that line was working last week or not. When I applied the last 13.1 patches Saturday it apparently started causing systemd to fail shortly after re-boot. Upgrading my box to 13.2 did not change that.

I haven't done any troubleshooting of the above yet. It may have to wait a couple days before I can dedicate my brainpower to it, but I'm glad to run troubleshooting commands others propose.

The big thing that makes this a pretty major bug for me is that the above line for some reason made my server unusable as a server. Something about it caused systemd to revert to maintenance mode 15 minutes after boot. If I had a ssh connection going, it simply halted at that time. My only access was via an interface my cloud provider provides that lets me function as if I'm typing on the console (VNC is used for the remote console).

Further, postfix wasn't running and trying to start it would trigger the same systemd failure. I assume because postfix was somehow dependent on all filesystems in fstab being mounted.

Greg
Greg Freemyer wrote:
I got a journal dump before and after the failure (journalctl -xb > log). About 500 lines of logs added in the post failure log.
I went through and fixed every minor complaint until my server would run smoothly.
Commenting out this line from fstab was the "fix".
=== #/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 loop 0 0 ===
So something is not quite right about "/srv_new/portal_backup_container". What happens if you try to mount it manually?
The big thing for me that makes this a pretty major bug is that the above line for some reason made my server unusable as a server. Something about it caused systemd to revert to maintenance mode 15 minutes after boot.
Most probably failure to mount that filesystem - I've seen that before.

--
Per Jessen, Zürich (6.9°C)
http://www.dns24.ch/ - your free DNS host, made in Switzerland.
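To test it by hand, something along these lines should reproduce the failure outside of systemd:

mount -o loop /srv_new/portal_backup_container /home/portal_backup/portal_backup

or, to check just the loop-device step:

losetup -f --show /srv_new/portal_backup_container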
On 03/07/2016 10:50 AM, Per Jessen wrote:
Greg Freemyer wrote:
I got a journal dump before and after the failure (journalctl -xb > log). About 500 lines of logs added in the post failure log.
I went through and fixed every minor complaint until my server would run smoothly.
Commenting out this line from fstab was the "fix".
=== #/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 loop 0 0 ===
So something is not quite right about "/srv_new/portal_backup_container". What happens if you try to mount it manually?
The big thing for me that makes this a pretty major bug is that the above line for some reason made my server unusable as a server. Something about it caused systemd to revert to maintenance mode 15 minutes after boot.
Most probably failure to mount that filesystem - I've seen that before.
Quite possibly so. Trying to mount it manually might show what is going on.

On the face of it, it looks like it's trying to do a "bind" mount (q.v. man page) but without the "--bind". After all, the "/srv/.." part is not a device. Does it actually exist? Does the destination exist?

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
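For comparison, the two forms would look roughly like this. A bind mount re-attaches an existing directory at a second path (the source directory here is hypothetical, just to show the shape):

mount --bind /srv_new/portal_backup_dir /home/portal_backup/portal_backup

whereas a loop mount mounts a filesystem image stored in a regular file:

mount -o loop /srv_new/portal_backup_container /home/portal_backup/portal_backup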
On Mon, Mar 7, 2016 at 12:01 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
On 03/07/2016 10:50 AM, Per Jessen wrote:
Greg Freemyer wrote:
I got a journal dump before and after the failure (journalctl -xb > log). About 500 lines of logs added in the post failure log.
I went through and fixed every minor complaint until my server would run smoothly.
Commenting out this line from fstab was the "fix".
=== #/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 loop 0 0 ===
So something is not quite right about "/srv_new/portal_backup_container". What happens if you try to mount it manually?
The big thing for me that makes this a pretty major bug is that the above line for some reason made my server unusable as a server. Something about it caused systemd to revert to maintenance mode 15 minutes after boot.
Most probably failure to mount that filesystem - I've seen that before.
Quite possibly so. Trying to mount it manually might show what is going on.
On the face of it, it looks like it's trying to do a "bind" mount (q.v. man page) but without the "--bind". After all, the "/srv/.." part is not a device. Does it actually exist? Does the destination exist?
Hmm...

The destination does not exist (it used to).

But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn?

If no one knows, I'll just open this as a bugzilla.

BTW: This isn't a --bind mount. It is a loopback mount. I'm supposed to have a large file at /srv_new/portal_backup_container that is itself a filesystem that I loopback mount.

I know that is strange, but it is needed to overcome an issue with the NFS mount options set up by my cloud provider. Inside the loopback mount I have total control of the mount options.

Greg
07.03.2016 21:41, Greg Freemyer wrote:
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new here - a countless number of times I have had to "fix" a server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
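Marking it optional is just the "nofail" mount option; for the entry in question that would be something like:

/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 loop,nofail 0 0

With nofail, a failure to mount it no longer fails local-fs.target, so the boot does not drop to emergency mode because of it.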
If no one knows, I'll just open this as a bugzilla.
BTW; This isn't a --bind mount. It is a loopback mount. I'm supposed to have a large file at /srv_new/portal_backup_container that itself is a filesystem that I loopback mount.
I know that is strange, but it is needed to overcome an issue with the NFS mount options set up by my cloud provider. Inside the loopback mount I have total control of the mount options.
Greg
On Mon, Mar 7, 2016 at 1:49 PM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
07.03.2016 21:41, Greg Freemyer wrote:
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new - countless number of times I had to "fix" server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
Andrei,

Would a bugzilla asking for a more descriptive error log be accepted?

All I got was this one liner:

Mar 06 19:14:51 cloud1 mount[1633]: mount: /srv_new/portal_backup_container: failed to setup loop device: No such file or directory

It would be nice to see something like:

"THE ABOVE FAILED MOUNT TRIGGERED SYSTEMD MAINTENANCE MODE"

Instead, the one liner was lost in a clutter of hundreds of lines of output in the journal.

In fact I started with 500 lines of journal logs that occurred near the failure. I whittled that down to the entries below before I found the issue, but nothing in them jumps out at me as a "MAJOR PROBLEM, MUST FIX BEFORE COMPUTER IS USABLE" type of log entry. In fact nothing in the logs even says systemd maintenance mode is being entered.

================
Mar 06 19:10:59 cloud1 clamd[1489]: SelfCheck: Database status OK.
Mar 06 19:14:10 cloud1 sudo[1620]: gaf : TTY=pts/0 ; PWD=/home/gaf ; USER=root ; COMMAND=/usr/bin/vi /usr/lib/systemd/system/systemd-udev-root-symlink.service
Mar 06 19:14:51 cloud1 systemd[1]: Cannot add dependency job for unit systemd-udev-root-symlink.service, ignoring: Unit systemd-udev-root-symlink.service failed to load: Invalid argument. See system logs and 'systemctl status systemd-udev-root-symlink.service' for details.
Mar 06 19:14:51 cloud1 mount[1633]: mount: /srv_new/portal_backup_container: failed to setup loop device: No such file or directory
Mar 06 19:14:51 cloud1 systemd[1]: Failed to mount /home/portal_backup/portal_backup.
Mar 06 19:14:51 cloud1 systemd[1]: Dependency failed for Local File Systems.
Mar 06 19:14:52 cloud1 xinetd[1015]: Exiting...
Mar 06 19:14:52 cloud1 dbus[502]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Mar 06 19:14:52 cloud1 dbus[502]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.login1.service': Refusing activation, D-Bus is shutting down.
Mar 06 19:14:52 cloud1 login[1335]: pam_systemd(login:session): Failed to release session: Refusing activation, D-Bus is shutting down.
Mar 06 19:14:53 cloud1 wickedd-dhcp4[565]: eth0: Request to release DHCPv4 lease with UUID 7ec4dc56-e344-0300-3702-000004000000
Mar 06 19:14:54 cloud1 systemd-journal[288]: Forwarding to syslog missed 1 messages.
-- Subject: One or more messages could not be forwarded to syslog
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- One or more messages could not be forwarded to the syslog service
-- running side-by-side with journald. This usually indicates that the
-- syslog implementation has not been able to keep up with the speed of
-- messages queued.
Mar 06 19:14:54 cloud1 systemd[1379]: pam_unix(systemd-user:session): session closed for user gaf
Mar 06 19:14:56 cloud1 wickedd[567]: ni_process_reap: process 1932 has not exited yet; now doing a blocking waitpid()
Mar 06 19:14:56 cloud1 wicked[1640]: eth0 device-ready
Mar 06 19:14:56 cloud1 wicked[1640]: eth1 device-ready
Mar 06 19:14:57 cloud1 freshclam[1232]: Update process terminated
Mar 06 19:16:31 cloud1 SuSEfirewall2[1974]: Not unloading firewall rules at system shutdown
==============

Thanks
Greg
--
Greg Freemyer
www.IntelligentAvatar.net
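One thing that does help cut the clutter after the fact is filtering the journal by priority, e.g.:

journalctl -b -p err

which limits the output to entries of priority "err" or more severe from the current boot - a much shorter list to hunt through.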
07.03.2016 22:23, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 1:49 PM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
07.03.2016 21:41, Greg Freemyer wrote:
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new - countless number of times I had to "fix" server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
Andrei,
Would a bugzilla asking for a more descriptive error log be accepted?
That is something for upstream. See below.
All I got was this one liner:
Mar 06 19:14:51 cloud1 mount[1633]: mount: /srv_new/portal_backup_container: failed to setup loop device: No such file or directory
It would be nice to see something like:
"THE ABOVE FAILED MOUNT TRIGGERED SYSTEMD MAINTENANCE MODE:"
Instead, the one liner was lost in a clutter of hundreds of lines of output in the journal.
In fact I started with 500 lines of Journal logs that occurred near the failure. I whittled that down to these before I found the issue, but nothing in the below jumps out at me as a "MAJOR PROBLEM, MUST FIX BEFORE COMPUTER IS USABLE" type of log entry. In fact nothing in the logs even says systemd maintenance mode is being entered.
================
Mar 06 19:10:59 cloud1 clamd[1489]: SelfCheck: Database status OK.
Mar 06 19:14:10 cloud1 sudo[1620]: gaf : TTY=pts/0 ; PWD=/home/gaf ; USER=root ; COMMAND=/usr/bin/vi /usr/lib/systemd/system/systemd-udev-root-symlink.service
Mar 06 19:14:51 cloud1 systemd[1]: Cannot add dependency job for unit systemd-udev-root-symlink.service, ignoring: Unit systemd-udev-root-symlink.service failed to load: Invalid argument. See system logs and 'systemctl status systemd-udev-root-symlink.service' for details.
Mar 06 19:14:51 cloud1 mount[1633]: mount: /srv_new/portal_backup_container: failed to setup loop device: No such file or directory
Mar 06 19:14:51 cloud1 systemd[1]: Failed to mount /home/portal_backup/portal_backup.
Mar 06 19:14:51 cloud1 systemd[1]: Dependency failed for Local File Systems.
Every now and then a request to provide information about *what* dependencies failed appears on the systemd list/tracker. Apparently it is not entirely trivial, and I'm not sure anyone has offered even a prototype implementation. But as I mentioned in another reply, I do not see how you could get that far in the first place.
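In the meantime, the closest thing available from the command line is probably something like:

systemctl --failed
systemctl list-dependencies local-fs.target

The first lists every unit currently in the failed state; the second shows what local-fs.target pulls in, so you can at least see which mount the target was waiting on.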
On 2016-03-07 20:23, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 1:49 PM, Andrei Borzenkov <> wrote:
Andrei,
Would a bugzilla asking for a more descriptive error log be accepted?
All I got was this one liner:
Mar 06 19:14:51 cloud1 mount[1633]: mount: /srv_new/portal_backup_container: failed to setup loop device: No such file or directory
It would be nice to see something like:
"THE ABOVE FAILED MOUNT TRIGGERED SYSTEMD MAINTENANCE MODE:"
Instead, the one liner was lost in a clutter of hundreds of lines of output in the journal.
I wrote one such bugzilla years ago. initd was more verbose about what the real problem was. I thought it was solved, but the situation is not simple to trigger and verify.

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
On 03/08/2016 10:02 AM, Carlos E. R. wrote:
I wrote one such bugzilla years ago. initd was more verbose of what the real problem was. I thought it was solved, but the situation is not simple to trigger and verify.
It's not that systemctl/logging isn't verbose. Traditionally UNIX was never verbose, compared to IBM or DEC OSes. Until you asked. The issue you face is that you're, effectively, not asking; you expect it to be in your face like old VMS.

I'm still learning all the systemd tools, but I have used systemctl to diagnose written units and generated units that aren't working properly, or as expected.

Systemd generates its unit files from /etc/fstab, and I discovered last week that they can be found in /run/systemd/generator/ - which seems logical in retrospect :-)

While all the .mount units are there, you'll want to trace through the "requires" (for the .mount) and "wants" (for the .service). You can build up the tree of how the mounts are sequenced and what their dependencies are.

I always find something new to learn.

--
It is bad luck to be superstitious.
-- Andrew W. Mathis
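For example, something along these lines (the unit name is taken from the journal excerpt earlier in the thread):

ls /run/systemd/generator/
systemctl show -p Requires,Wants,After,Before home-portal_backup-portal_backup.mount

The first lists the units the fstab generator wrote for this boot; the second prints the dependency and ordering properties systemd attached to the generated mount unit.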
On 2016-03-08 17:34, Anton Aylward wrote:
On 03/08/2016 10:02 AM, Carlos E. R. wrote:
I wrote one such bugzilla years ago. initd was more verbose of what the real problem was. I thought it was solved, but the situation is not simple to trigger and verify.
It's not that systemctl/logging isn't verbose. Traditionally UNIX was never verbose, compared to IBM or DEC OSes. Until you asked.
Well, I was not referring to "logs" here. Rather to the text printed on the screen when openSUSE booted and dumped you into emergency mode because a partition was not found, or fsck failed and required a manual run. The text was rather more to the point.

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
07.03.2016 21:41, Greg Freemyer wrote:
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new - countless number of times I had to "fix" server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
+1.

Or, RTFM, use "nofail" ... maybe

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
07.03.2016 21:41, Greg Freemyer wrote:
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new - countless number of times I had to "fix" server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
+1.
Or, RTFM, use "nofail" ... maybe
Anton,

I have read the fstab entry many times. I've also admin'ed servers for years (or decades).

I somehow missed:

==
If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.

Your computer will fall back to minimally functional systemd maintenance mode and all troubleshooting must occur exclusively via the console.
==

Greg
08.03.2016 03:09, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
07.03.2016 21:41, Greg Freemyer wrote:
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new - countless number of times I had to "fix" server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
+1.
Or, RTFM, use "nofail" ... maybe
Anton,
I have read the fstab entry many times. I've also admin'ed servers for years (or decades).
I somehow missed:
== If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.
Your computer will fall-back to minimally functional systemd maintenance mode and all troubleshooting must occur exclusively via the console. ==
I am not sure how you managed to reach this state and I would be very interested in seeing logs on debug level (boot with systemd.log_level=debug added to, and "quiet" removed from, the kernel command line, reproduce the problem and upload "journalctl -b" somewhere). Your system should have stayed in rescue mode from the very beginning (i.e. it would not enter normal run level 3/5).
On Tue, Mar 8, 2016 at 1:20 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 03:09, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
07.03.2016 21:41, Greg Freemyer wrote:
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new - countless number of times I had to "fix" server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
+1.
Or, RTFM, use "nofail" ... maybe
Anton,
I have read the fstab entry many times. I've also admin'ed servers for years (or decades).
I somehow missed:
== If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.
Your computer will fall-back to minimally functional systemd maintenance mode and all troubleshooting must occur exclusively via the console. ==
I am not sure how you managed to reach this state and I would be very interested in seeing logs on debug level (boot with systemd.log_level=debug added to and "quiet" removed from kernel command line, reproduce problem and upload "journalctl -b" somewhere). Your system should have stayed in rescue mode from the very beginning (i.e. it would not enter normal run level 3/5).
Andrei,

Generating the logs is easy.

Would you like me to just boot and let the system sit for 15 minutes, OR go ahead and trigger maintenance mode earlier?

==
My observation is that ssh in particular becomes active almost immediately after boot, then at the 15 minute mark systemd regresses to maintenance mode. Any ssh connections simply become non-responsive at that point.

Postfix is NOT started during that 15 minutes. If during that 15 minutes I manually run "systemctl start postfix.service" then I trigger systemd maintenance mode.
==

Greg
08.03.2016 17:06, Greg Freemyer wrote:
On Tue, Mar 8, 2016 at 1:20 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 03:09, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
07.03.2016 21:41, Greg Freemyer wrote:
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new - countless number of times I had to "fix" server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
+1.
Or, RTFM, use "nofail" ... maybe
Anton,
I have read the fstab entry many times. I've also admin'ed servers for years (or decades).
I somehow missed:
== If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.
Your computer will fall-back to minimally functional systemd maintenance mode and all troubleshooting must occur exclusively via the console. ==
I am not sure how you managed to reach this state and I would be very interested in seeing logs on debug level (boot with systemd.log_level=debug added to and "quiet" removed from kernel command line, reproduce problem and upload "journalctl -b" somewhere). Your system should have stayed in rescue mode from the very beginning (i.e. it would not enter normal run level 3/5).
Andrei,
Generating the logs is easy.
Would you like me to just boot and let the system sit for 15 minutes, OR go ahead and trigger maintenance mode earlier?
If it is easy, do both :)
On Tue, Mar 8, 2016 at 10:04 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 17:06, Greg Freemyer wrote:
On Tue, Mar 8, 2016 at 1:20 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 03:09, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
07.03.2016 21:41, Greg Freemyer wrote:
>
> The destination does not exist (it used to).
>
> But ignoring that rather big issue, why does a failed mount of a
> tertiary filesystem cause systemd to crash and burn.
>
Because all mount points in /etc/fstab are mandatory unless marked as optional. There is nothing new - countless number of times I had to "fix" server that stayed in "rescue mode" because someone removed access to SAN storage but forgot to edit fstab. Without any systemd involved.
+1.
Or, RTFM, use "nofail" ... maybe
Anton,
I have read the fstab entry many times. I've also admin'ed servers for years (or decades).
I somehow missed:
== If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.
Your computer will fall-back to minimally functional systemd maintenance mode and all troubleshooting must occur exclusively via the console. ==
I am not sure how you managed to reach this state and I would be very interested in seeing logs on debug level (boot with systemd.log_level=debug added to and "quiet" removed from kernel command line, reproduce problem and upload "journalctl -b" somewhere). Your system should have stayed in rescue mode from the very beginning (i.e. it would not enter normal run level 3/5).
Andrei,
Generating the logs is easy.
Would you like me to just boot and let the system sit for 15 minutes, OR go ahead and trigger maintenance mode earlier?
If it is easy, do both :)
Pull all 3 files from:

http://www.iac-forensics.com/temporary_files/

Let me know when you have them and I will delete them.

I did:

- restore fstab to failure mode
- reboot with quiet deleted and systemd.log_level=debug
- login on console
- journalctl -b > systemd-journal-debug-immediately-after-boot
- systemctl start postfix.service
- emergency mode triggered
- journalctl -b > systemd-journal-debug-immediately-after-start-postfix-triggered-emergency-mode
- reboot with debug logs
- login to console
- do nothing for 15 minutes
- emergency mode triggered by 15 minute delay
- journalctl -b > systemd-journal-debug-immediately-after-15-minute-wait

Thanks
Greg
On Tue, Mar 8, 2016 at 9:55 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
On Tue, Mar 8, 2016 at 10:04 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 17:06, Greg Freemyer wrote:
On Tue, Mar 8, 2016 at 1:20 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 03:09, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
> 07.03.2016 21:41, Greg Freemyer wrote:
>>
>> The destination does not exist (it used to).
>>
>> But ignoring that rather big issue, why does a failed mount of a
>> tertiary filesystem cause systemd to crash and burn.
>>
>
> Because all mount points in /etc/fstab are mandatory unless marked as
> optional. There is nothing new - countless number of times I had to
> "fix" server that stayed in "rescue mode" because someone removed access
> to SAN storage but forgot to edit fstab. Without any systemd involved.
+1.
Or, RTFM, use "nofail" ... maybe
Anton,
I have read the fstab entry many times. I've also admin'ed servers for years (or decades).
I somehow missed:
== If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.
Your computer will fall-back to minimally functional systemd maintenance mode and all troubleshooting must occur exclusively via the console. ==
I am not sure how you managed to reach this state and I would be very interested in seeing logs on debug level (boot with systemd.log_level=debug added to and "quiet" removed from kernel command line, reproduce problem and upload "journalctl -b" somewhere). Your system should have stayed in rescue mode from the very beginning (i.e. it would not enter normal run level 3/5).
Andrei,
Generating the logs is easy.
Would you like me to just boot and let the system sit for 15 minutes, OR go ahead and trigger maintenance mode earlier?
If it is easy, do both :)
Pull all 3 files from:
http://www.iac-forensics.com/temporary_files/
Let me know when you have them and I will delete them.
I did:
- restore fstab to failure mode
- reboot with quiet deleted and systemd.log_level=debug
- login on console
- journalctl -b > systemd-journal-debug-immediately-after-boot
The reason you were able to get that far is another bug :)

Mar 08 13:06:00 cloud1 systemd[1]: Trying to enqueue job multi-user.target/start/isolate
Mar 08 13:06:00 cloud1 systemd[1]: Found ordering cycle on wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on local-fs.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on home-portal_backup-portal_backup.mount/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on srv_new.mount/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on network-online.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wicked.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Breaking ordering cycle by deleting job local-fs.target/start
Mar 08 13:06:00 cloud1 systemd[1]: Job local-fs.target/start deleted to break ordering cycle starting with wickedd-nanny.service/start
Mar 08 13:06:00 cloud1 systemd[1]: Deleting job postfix.service/start as dependency of job local-fs.target/start

So you have a dependency loop; it sounds like some of the local mount points depend on network (NFS? iSCSI?) paths. Could you show /etc/fstab?

Anyway, systemd decides to remove local-fs.target from the equation. Which explains how you managed to boot at all.
- systemctl start postfix.service
- emergency mode triggered
Correct. postfix tries to start local-fs.target again (as dependency). Now all other services that had been in this dependency loop are already started, so nothing prevents systemd from attempting to start local-fs.target again.

Mar 08 13:08:03 cloud1 systemd[1]: Trying to enqueue job postfix.service/start/replace
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job postfix.service/start as 680
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job var-run.mount/start as 681
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job local-fs.target/start as 690
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job home-portal_backup-portal_backup.mount/start as 691
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job lvm2-activation.service/start as 774
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job lvm2-activation-early.service/start as 776
Mar 08 13:08:03 cloud1 systemd[1]: Installed new job var-lock.mount/start as 781
Mar 08 13:08:03 cloud1 systemd[1]: Enqueued job postfix.service/start as 680
...
Mar 08 13:08:03 cloud1 systemd[1]: Failed to mount /home/portal_backup/portal_backup.
Mar 08 13:08:03 cloud1 systemd[1]: Job local-fs.target/start finished, result=dependency
Mar 08 13:08:03 cloud1 systemd[1]: Dependency failed for Local File Systems.
Mar 08 13:08:03 cloud1 systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
Mar 08 13:08:03 cloud1 systemd[1]: Trying to enqueue job emergency.target/start/replace-irreversibly
- journalctl -b > systemd-journal-debug-immediately-after-start-postfix-triggered-emergency-mode
- reboot with debug logs
- login to console
- do nothing for 15 minutes
- emergency mode triggered by 15 minute delay
This is due to some timer unit that triggers after 15 minutes and causes an attempt to start local-fs.target:

Mar 08 13:32:51 cloud1 systemd[1]: Timer elapsed on systemd-tmpfiles-clean.timer
Mar 08 13:32:51 cloud1 systemd[1]: Trying to enqueue job systemd-tmpfiles-clean.service/start/replace
Mar 08 13:32:51 cloud1 systemd[1]: Installed new job systemd-tmpfiles-clean.service/start as 900
Mar 08 13:32:51 cloud1 systemd[1]: Installed new job local-fs.target/start as 901
... etc

So you need to fix the dependency loop, probably by adding _netdev to some filesystem to move it after the network.
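The timers themselves can be listed, e.g.:

systemctl list-timers --all

systemd-tmpfiles-clean.timer is normally configured with OnBootSec=15min, which matches the 15 minute delay you are seeing.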
On Wed, Mar 9, 2016 at 9:29 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
On Tue, Mar 8, 2016 at 9:55 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
On Tue, Mar 8, 2016 at 10:04 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 17:06, Greg Freemyer wrote:
On Tue, Mar 8, 2016 at 1:20 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
08.03.2016 03:09, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 6:18 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
> On 03/07/2016 01:49 PM, Andrei Borzenkov wrote:
>> 07.03.2016 21:41, Greg Freemyer wrote:
>>>
>>> The destination does not exist (it used to).
>>>
>>> But ignoring that rather big issue, why does a failed mount of a
>>> tertiary filesystem cause systemd to crash and burn.
>>>
>>
>> Because all mount points in /etc/fstab are mandatory unless marked as
>> optional. There is nothing new - countless number of times I had to
>> "fix" server that stayed in "rescue mode" because someone removed access
>> to SAN storage but forgot to edit fstab. Without any systemd involved.
>
> +1.
>
> Or, RTFM, use "nofail" ... maybe
Anton,
I have read the fstab entry many times. I've also admin'ed servers for years (or decades).
I somehow missed:
== If you have an fstab entry with the default "auto" mount and "fail" behavior and the mount fails, then mail processing will be disallowed. Further, all ssh access will be terminated. And for good measure any X-Windows GUI will also be terminated.
Your computer will fall-back to minimally functional systemd maintenance mode and all troubleshooting must occur exclusively via the console. ==
I am not sure how you managed to reach this state and I would be very interested in seeing logs on debug level (boot with systemd.log_level=debug added to and "quiet" removed from kernel command line, reproduce problem and upload "journalctl -b" somewhere). Your system should have stayed in rescue mode from the very beginning (i.e. it would not enter normal run level 3/5).
Andrei,
Generating the logs is easy.
Would you like me to just boot and let the system sit for 15 minutes, OR go ahead and trigger maintenance mode earlier?
If it is easy, do both :)
Pull all 3 files from:
http://www.iac-forensics.com/temporary_files/
Let me know when you have them and I will delete them.
I did:
- restore fstab to failure mode
- reboot with quiet deleted and systemd.log_level=debug
- login on console
- journalctl -b > systemd-journal-debug-immediately-after-boot
The reason you were able to get as far is another bug :)
Mar 08 13:06:00 cloud1 systemd[1]: Trying to enqueue job multi-user.target/start/isolate Mar 08 13:06:00 cloud1 systemd[1]: Found ordering cycle on wickedd-nanny.service/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on local-fs.target/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on home-portal_backup-portal_backup.mount/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on srv_new.mount/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on network-online.target/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wicked.service/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wickedd-nanny.service/start Mar 08 13:06:00 cloud1 systemd[1]: Breaking ordering cycle by deleting job local-fs.target/start Mar 08 13:06:00 cloud1 systemd[1]: Job local-fs.target/start deleted to break ordering cycle starting with wickedd-nanny.service/start Mar 08 13:06:00 cloud1 systemd[1]: Deleting job postfix.service/start as dependency of job local-fs.target/start
So you have dependency loop; sounds like some of local mount points depend on network (NFS? iSCSI?) paths. Could you show /etc/fstab?
=== See the NFS mount comment below ===
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part1 swap swap defaults 0 0
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part2 / ext4 acl,user_xattr 1 1
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part3 /home ext4 acl,user_xattr 1 2
proc /proc proc defaults 0 0
sysfs /sys sysfs noauto 0 0
debugfs /sys/kernel/debug debugfs noauto 0 0
usbfs /proc/bus/usb usbfs noauto 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0

# NFS MOUNT HERE
10.200.3.230:/mnt/pacers1/kvm672/kvm672 /srv_new nfs rw,relatime,vers=3 0 0

# NFS dependent mounts here
/srv_new/sftp-container-large /srv/sftp ext4 nofail,loop 0 0
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 nofail,loop 0 0
===

The NFS server doesn't allow me control of the file metadata (ownership etc.). In particular it forces all files on it to be owned by one UID / GID.

So as root I've created a couple of large container files which I mount via loopback. Within the containers I have full control of the filesystem metadata (file ownership, etc).
Anyway, systemd decides to remove local-fs.target from equation. Which explains how you managed to boot at all.
- systemctl start postfix.service
- emergency mode triggered
Correct. postfix tries to start local-fs.target again (as dependency). Now all other services that had been in this dependency loop are already started, so nothing prevents systemd from attempting to start local-fs.target again.
Mar 08 13:08:03 cloud1 systemd[1]: Trying to enqueue job postfix.service/start/replace Mar 08 13:08:03 cloud1 systemd[1]: Installed new job postfix.service/start as 680 Mar 08 13:08:03 cloud1 systemd[1]: Installed new job var-run.mount/start as 681 Mar 08 13:08:03 cloud1 systemd[1]: Installed new job local-fs.target/start as 690 Mar 08 13:08:03 cloud1 systemd[1]: Installed new job home-portal_backup-portal_backup.mount/start as 691 Mar 08 13:08:03 cloud1 systemd[1]: Installed new job lvm2-activation.service/start as 774 Mar 08 13:08:03 cloud1 systemd[1]: Installed new job lvm2-activation-early.service/start as 776 Mar 08 13:08:03 cloud1 systemd[1]: Installed new job var-lock.mount/start as 781 Mar 08 13:08:03 cloud1 systemd[1]: Enqueued job postfix.service/start as 680 ...
Mar 08 13:08:03 cloud1 systemd[1]: Failed to mount /home/portal_backup/portal_backup.
Mar 08 13:08:03 cloud1 systemd[1]: Job local-fs.target/start finished, result=dependency Mar 08 13:08:03 cloud1 systemd[1]: Dependency failed for Local File Systems. Mar 08 13:08:03 cloud1 systemd[1]: Triggering OnFailure= dependencies of local-fs.target. Mar 08 13:08:03 cloud1 systemd[1]: Trying to enqueue job emergency.target/start/replace-irreversibly
- journalctl -b > systemd-journal-debug-immediately-after-start-postfix-triggered-emergency-mode
- reboot with debug logs
- login to console
- do nothing for 15 minutes
- emergency mode triggered by 15 minute delay
This is due some timer unit that triggers after 15 minutes and causes attempt to start local-fs.target
Mar 08 13:32:51 cloud1 systemd[1]: Timer elapsed on systemd-tmpfiles-clean.timer Mar 08 13:32:51 cloud1 systemd[1]: Trying to enqueue job systemd-tmpfiles-clean.service/start/replace Mar 08 13:32:51 cloud1 systemd[1]: Installed new job systemd-tmpfiles-clean.service/start as 900 Mar 08 13:32:51 cloud1 systemd[1]: Installed new job local-fs.target/start as 901 ... etc
So you need to fix dependency loop, probably by adding _netdev to some filesystem to move it after network.
Should I add it to the NFS mount and the 2 NFS dependent mounts?

Thanks for looking so hard at this.
Greg
09.03.2016 21:11, Greg Freemyer wrote:
I did:
- restore fstab to failure mode
- reboot with quiet deleted and systemd.log_level=debug
- login on console
- journalctl -b > systemd-journal-debug-immediately-after-boot
The reason you were able to get as far is another bug :)
Mar 08 13:06:00 cloud1 systemd[1]: Trying to enqueue job multi-user.target/start/isolate Mar 08 13:06:00 cloud1 systemd[1]: Found ordering cycle on wickedd-nanny.service/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on local-fs.target/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on home-portal_backup-portal_backup.mount/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on srv_new.mount/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on network-online.target/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wicked.service/start Mar 08 13:06:00 cloud1 systemd[1]: Found dependency on wickedd-nanny.service/start Mar 08 13:06:00 cloud1 systemd[1]: Breaking ordering cycle by deleting job local-fs.target/start Mar 08 13:06:00 cloud1 systemd[1]: Job local-fs.target/start deleted to break ordering cycle starting with wickedd-nanny.service/start Mar 08 13:06:00 cloud1 systemd[1]: Deleting job postfix.service/start as dependency of job local-fs.target/start
So you have dependency loop; sounds like some of local mount points depend on network (NFS? iSCSI?) paths. Could you show /etc/fstab?
=== See the NFS mount comment below ===
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part1 swap swap defaults 0 0
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part2 / ext4 acl,user_xattr 1 1
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001-part3 /home ext4 acl,user_xattr 1 2
proc /proc proc defaults 0 0
sysfs /sys sysfs noauto 0 0
debugfs /sys/kernel/debug debugfs noauto 0 0
usbfs /proc/bus/usb usbfs noauto 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0

# NFS MOUNT HERE
10.200.3.230:/mnt/pacers1/kvm672/kvm672 /srv_new nfs rw,relatime,vers=3 0 0
Add _netdev to options of these two filesystems below. It should prevent dependency loop.
# NFS dependent mounts here
/srv_new/sftp-container-large /srv/sftp ext4 nofail,loop 0 0
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 nofail,loop 0 0
===
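i.e. something like (untested):

/srv_new/sftp-container-large /srv/sftp ext4 nofail,_netdev,loop 0 0
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 nofail,_netdev,loop 0 0

_netdev makes systemd treat them as remote mounts, so they are ordered after the network and pulled in by remote-fs.target instead of local-fs.target.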
On 03/09/2016 01:11 PM, Greg Freemyer wrote:
# NFS MOUNT HERE
10.200.3.230:/mnt/pacers1/kvm672/kvm672 /srv_new nfs rw,relatime,vers=3 0 0

# NFS dependent mounts here
/srv_new/sftp-container-large /srv/sftp ext4 nofail,loop 0 0
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 nofail,loop 0 0
Oh, I see what you're doing! I wouldn't do it that way.

Back in the 1980s, working with SUN workstations and *EVERYTHING* being NFS, it was done with symlinks.

When I did similar on openSUSE with my laptop connected to the home LAN I worked it out this way:

I had a /mnt/nfs/ main directory and had under that things like

/mnt/nfs/homelan/downloads

So there was a

homelan:/home.anton/downloads /mnt/nfs/homelan/downloads nfs \
    rw,rsize..,wsize...,_netdev 0 2

There was also

awaylan:/yamma/yamma/downloads /mnt/nfs/homelan/downloads nfs \
    rw,rsize..,wsize...,_netdev 0 2

I then had a symlink from /home/anton/downloads to /mnt/nfs/homelan/downloads

If I was working in a coffee shop and didn't have a connection to either homelan or awaylan, then my downloads directory appeared empty.

The "_netdev" is important.

The symlink avoids the bind mount that seems to be giving you problems.
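The symlink itself is just something like:

ln -s /mnt/nfs/homelan/downloads /home/anton/downloads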
===
The NFS server doesn't allow me to control of the file metadata (ownership etc.). In particular it forces all files on it to be owned by one UID / GID.
So as root I've created a couple large container files which I mount via loopback. Within the containers I have full control of the filesystem metadata (file ownership, etc).
I take it to mean that you can't run rpc.imapd? You do realise that imapd allows *arbitrary* remapping of uid and gid?

So if the remote files are 'owned' rw-r----- by hpotter;gryffindor you can map them to, for example, gregf:users

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
On 2016-03-09 20:43, Anton Aylward wrote:
I take it to mean that you can't run rpc.imapd ? You do realise that imapd allows *arbitrary* remapping of uid and gid?
So if the remote files are 'owned' rw-r----- by hpotter;gryffindor you can map them to, for example, gregf:users
Do you have more info on that? I seem to recall having read about this, but I don't think I got a conclusion :-?

cer@Telcontar:~> apropos rpc.imapd
rpc.imapd: nothing appropriate.
cer@Telcontar:~>
cer@Telcontar:~> locate rpc.imapd
cer@Telcontar:~>

The name is confusing, it seems to be about mail protocol (imap), not about nfs.

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
rpc.idmapd is the correct name of the binary and it's part of the nfs-client package.

--
Later,
Darin

On Wed, Mar 9, 2016 at 3:07 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
On 2016-03-09 20:43, Anton Aylward wrote:
I take it to mean that you can't run rpc.imapd ? You do realise that imapd allows *arbitrary* remapping of uid and gid?
So if the remote files are 'owned' rw-r----- by hpotter;gryffindor you can map them to, for example, gregf:users
Do you have more info on that? I seem to recall having read about this, but I don't think I got a conclusion :-?
cer@Telcontar:~> apropos rpc.imapd
rpc.imapd: nothing appropriate.
cer@Telcontar:~>
cer@Telcontar:~> locate rpc.imapd
cer@Telcontar:~>
The name is confusing, it seems to be about mail protocol (imap), not about nfs.
-- Cheers / Saludos,
Carlos E. R. (from 13.1 x86_64 "Bottle" at Telcontar)
On 2016-03-09 21:17, Darin Perusich wrote:
rpc.idmapd is the correct name of the binary and it's part of the nfs-client package.
Not here.

cer@Telcontar:~> rpm -q nfs-client
nfs-client-1.2.8-4.17.1.x86_64
cer@Telcontar:~> rpm -ql nfs-client | grep imap
cer@Telcontar:~>

It must be new.

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
It's not "imap", it's "idmap" rpm -qli nfs-client |grep idmap /etc/idmapd.conf /usr/sbin/nfsidmap /usr/sbin/rpc.idmapd /usr/share/man/man8/idmapd.8.gz /usr/share/man/man8/nfsidmap.8.gz /usr/share/man/man8/rpc.idmapd.8.gz -- Later, Darin On Wed, Mar 9, 2016 at 3:19 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
On 2016-03-09 21:17, Darin Perusich wrote:
rpc.idmapd is the correct name of the binary and it's part of the nfs-client package.
Not here.
cer@Telcontar:~> rpm -q nfs-client
nfs-client-1.2.8-4.17.1.x86_64
cer@Telcontar:~> rpm -ql nfs-client | grep imap
cer@Telcontar:~>
It must be new.
-- Cheers / Saludos,
Carlos E. R. (from 13.1 x86_64 "Bottle" at Telcontar)
On 03/10/2016 09:53 AM, Darin Perusich wrote:
It's not "imap", it's "idmap"
Sorry, my typo!

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
09.03.2016 22:43, Anton Aylward wrote:
You do realise that imapd allows *arbitrary* remapping of uid and gid?
That's a common misconception. While NFS v4+ does support user name mapping for display purposes, the *authentication* is done at the RPC level and remains unchanged. So the only way to actually allow mapping between different UIDs is to use GSSAPI, which in practice means Kerberos.

As long as traditional SYS credentials are used, you still get the same problem (in a sense even worse, because it looks like the user is correct ... :)
On 2016-03-09 21:21, Andrei Borzenkov wrote:
09.03.2016 22:43, Anton Aylward wrote:
You do realise that imapd allows *arbitrary* remapping of uid and gid?
That's common misconception. While NFS v4+ does support user name mapping for display purposes, the *authentication* is done on RPC level and remains unchanged. So the only way to actually allow mapping between different UIDs is to use GSSAPI which in practice means Kerberos. As long as traditional SYS credentials are used, you still get the same problem (in the sense even worse, because it looks like user is correct ... :)
This must be what I remember, that it is not possible to really and easily map users in nfs :-(

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
On Wed, Mar 9, 2016 at 2:43 PM, Anton Aylward <opensuse@antonaylward.com> wrote:
On 03/09/2016 01:11 PM, Greg Freemyer wrote:
# NFS MOUNT HERE 10.200.3.230:/mnt/pacers1/kvm672/kvm672 /srv_new nfs rw,relatime,vers=3 0 0
# NFS dependent mounts here /srv_new/sftp-container-large /srv/sftp ext4 nofail,loop 0 0 /srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 nofail,loop 0 0
Oh, I see what you're doing! I wouldn't do it that way.
back in the 1980s working with SUN workstations and *EVERYTING* being NFS it was done with symlinks.
When I did similar on opensuse with my laptop connected to the home LAN I worked it out this way:
I had a /mnt/nfs/ main directory had under that things like
/mnt/nfs/homelan/downloads
So there was a
homelan:/home.anton/dowloads /mnt/nts/homelan/downloads nfs \ rw,rsize..,wsize...,_netdev 0 2
There was also
awaylan:/yamma/yamma/dowloads /mnt/nts/homelan/downloads nfs \ rw,rsize..,wsize...,_netdev 0 2
I then had symlink from /home/anton/downloads to /mnt/nts/homelan/downloads
If I was working in a coffee shop and didn't have a connection to neither homelan nor awaylan then my downloads directory appeared empty.
You seem to be overcoming a different problem than the one I have.

My NFS mount should be reliable. The NFS server and the client are both in a data center (in the cloud).

This issue apparently came up because one of my container files somehow got deleted. (I haven't looked into that at all. It held backups from a third cloud VM, so I do want it to be reliable.)
The "_netdev" is important.
I'm testing now. How long should the _netdev mounts take to mount? After reboot I immediately logged in: the NFS mount is in place but the _netdev mount of /srv/sftp is not. It has been 20 minutes since reboot, so even the 15 minute point has passed? I'll keep watching.
The symlink avoid the bind mount that seems to be giving you problems.
I'm not doing bind mounts. They are loopback mounts. If you don't know what those are, an example is mounting an ISO image to allow files to be accessed inside the ISO.
===
The NFS server doesn't allow me to control of the file metadata (ownership etc.). In particular it forces all files on it to be owned by one UID / GID.
So as root I've created a couple large container files which I mount via loopback. Within the containers I have full control of the filesystem metadata (file ownership, etc).
I take it to mean that you can't run rpc.imapd ? You do realise that imapd allows *arbitrary* remapping of uid and gid?
So if the remote files are 'owned' rw-r----- by hpotter;gryffindor you can map them to, for example, gregf:users
The biggest thing I have on the NFS partition is a SFTP folder structure. One folder for each of my clients.

# mkdir -p /srv/sftp/<USER>/incoming
# mkdir -p /srv/sftp/<USER>/outgoing

For each client I create a Linux account with a /sbin/nologin shell.

# useradd -g sftpusers -d / -s /sbin/nologin <USER>

I make each client the owner of the folder their files are in.

# chown -R <USER>:sftpusers /srv/sftp/<USER>/*
# chmod 555 /srv/sftp/<USER>/outgoing

Thus I need to have numerous UIDs as owners of files on the NFS mount. I don't think imapd can let me do that.

FYI: My provider specifically doesn't allow that. They force all files created to be owned by one specific UID. I think they do that so they can sell more functional disk space at a higher per GB price. I pay the "backup space" rate.

I overcome the single UID limitation by creating a large container file. I think I did

# dd if=/dev/zero of=/srv_new/sftp-container-large count=1 seek=300GB
# mkfs.ext4 /srv_new/sftp-container-large

Then I do the loopback mount I showed in my fstab. With the new mount point I have full ability to create my SFTP folder structure.

Greg
--
Greg Freemyer
www.IntelligentAvatar.net
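For the record, a minimal sketch of the container setup as described above (file name and size taken from the message; the exact commands are an assumption, not necessarily what was originally run):

Create a 300 GB sparse file to hold the filesystem image:
# truncate -s 300G /srv_new/sftp-container-large

Put an ext4 filesystem inside it (-F is needed because the target is a regular file, not a block device):
# mkfs.ext4 -F /srv_new/sftp-container-large

Loop-mount it; inside this mount the local root controls ownership, not the NFS server:
# mount -o loop /srv_new/sftp-container-large /srv/sftp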
On 03/09/2016 04:09 PM, Greg Freemyer wrote:
[Big Snip]
You seem to overcoming a different problem than the one I have.
True.
My NFS mount should be reliable. The NFS server and the client are both in a data center (in the cloud)
The point I was making wasn't so much about reliability, which was my problem in the coffee shop, but about avoiding bind mounts, which is what I was doing on the SUN workstations where I started off with the technique.
This issue apparently came up because one of my container files somehow got deleted. (I haven't looked into that at all. It held backups from a third cloud VM, so I do want it to be reliable.)
Accidentally deleted files are quite another matter.
The "_netdev" is important.
I'm testing now. How long should the _netdev mounts take to mount?
The _netdev is about only mounting when the network is up. If you don't get a mount when there is a '_netdev' in there, then you have a network problem. However, the bind mount through the loop you describe
# NFS dependent mounts here
/srv_new/sftp-container-large /srv/sftp ext4 nofail,loop 0 0
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 nofail,loop 0 0
is another matter. What's going to stop that from happening if the NFS mount doesn't happen? Gee, this is so much easier with systemd mount units and explicit and clear dependencies. I don't know if the mount unit generator is dealing with this. Perhaps you can find the relevant units in /var/run/systemd/generator/. The NFS might be under /var/run/systemd/generator/remote-fs.target.d/; I'm not sure about the 'loop' ones.
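For example, something along these lines should show what the fstab generator produced and which mount units systemd currently knows about (the exact generator directory layout can differ between systemd versions):

# ls /var/run/systemd/generator/
# systemctl list-units --type=mount --all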
After reboot I immediately logged in: the NFS mount is in place but the _netdev mount of /srv/sftp is not.
Are you saying that you applied the "_netdev" to the entries that had "loop"? No, that doesn't sound right. Moment .... yes, that's definitely not right. What do the NFS monitoring tools say? nfsstat, nfsiostat
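For example, nfsstat -m lists each NFS mount with the options actually in effect, and nfsiostat shows per-mount I/O statistics (both ship with the NFS client utilities):

# nfsstat -m
# nfsiostat 5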
The symlink avoids the bind mount that seems to be giving you problems.
I'm not doing bind mounts. They are loopback mounts. If you don't know what those are, an example is mounting an ISO image to allow files to be accessed inside the ISO.
I know what they are! I use them BUT I don't put them in my fstab!
The biggest thing I have on the NFS partition is a SFTP folder structure. One folder for each of my clients.
# mkdir -p /srv/sftp/<USER>/incoming
# mkdir -p /srv/sftp/<USER>/outgoing
For each client I create a Linux account with a /sbin/nologin shell.
# useradd -g sftpusers -d / -s /sbin/nologin <USER>
I make each client the owner of the folder their files are in
# chown -R <USER>:sftpusers /srv/sftp/<USER>/*
# chmod 555 /srv/sftp/<USER>/outgoing
Thus I need to have numerous UIDs as owners of files on the NFS mount. I don't think idmapd can let me do that.
In the limiting case where you 'own'/admin both ends you can even have the same name/uid on both ends. That makes life very easy; in effect the idmapd mapping is 1:1 and so is redundant. I note Andrei's caveat, but then I've had control of both ends, in that I can also set up the name mapper on the 'far' end.

The issue, which you don't make clear, is that the "far" end needs to have the distinct set of UIDs. Let me put it this way: if on machine A you have a huge /etc/passwd file (perhaps implemented by NIS or LDAP) with all of those user IDs ..

Ah right, just like in the 1980s, where the SUN workstation had only a small ROOTFS and no /home or /usr or the rest; those came via NFS. When the user logged in, it triggered the mount of the relevant "home" from the server, mounted it on /mnt/nfs/home, and there was a symlink from /home to /mnt/nfs/home .... but only the user's home files appeared under the mount.

We don't quite do it that way today, not since SUN developed PAM. See pam_mkhomedir(8) and pam_mount(8):

<quote>
Name
pam_mount - A PAM module that can mount volumes for a user session

Overview
This module is aimed at environments with central file servers that a user wishes to mount on login and unmount on logout, such as (semi-)diskless stations where many users can logon and where statically mounting the entire /home from a server is a security risk, or listing all possible volumes in /etc/fstab is not feasible.
...
The module also supports mounting local filesystems of any kind the normal mount utility supports, with extra code to make sure certain volumes are set up properly because often they need more than just a mount call, such as encrypted volumes. This includes SMB/CIFS, FUSE, dm-crypt and LUKS.
</quote>

That last might include your "loop" situation, and

<quote>
NAME
pam_mkhomedir - PAM module to create users home directory

SYNOPSIS
pam_mkhomedir.so [silent] [umask=mode] [skel=skeldir]

DESCRIPTION
The pam_mkhomedir PAM module will create a users home directory if it does not exist when the session begins. This allows users to be present in central database (such as NIS, kerberos or LDAP) without using a distributed file system or pre-creating a large number of directories. The skeleton directory (usually /etc/skel/) is used to copy default files and also sets a umask for the creation.

The new users home directory will not be removed after logout of the user.
</quote>
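For illustration only, a per-user NFS volume like the old SUN-style setup could be declared for pam_mount in /etc/security/pam_mount.conf.xml along these lines (the server name and paths here are invented):

<volume user="*" fstype="nfs"
        server="homelan" path="/export/home/%(USER)"
        mountpoint="/home/%(USER)" />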
FYI: My provider specifically doesn't allow that. They force all files created to be owned by one specific UID. I think they do that so they can sell more functional disk space at a higher per GB price. I pay the "backup space" rate.
I overcome the single UID limitation by creating a large container file. I think I did
# dd if=/dev/zero of=/srv_new/sftp-container-large count=1 seek=300GB
# mkfs.ext4 /srv_new/sftp-container-large
Then I do the loopback mount I showed in my fstab. With the new mount point I have full ability to create my SFTP folder structure.
Greg
--
Greg Freemyer
www.IntelligentAvatar.net
--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On 2016-03-08 00:18, Anton Aylward wrote:
Or, RTFM, use "nofail" ... maybe
This option has another can of worms. If such a mount fails, you get no message at all that it failed - but sometimes you do:

Telcontar:~ # mount /mnt/Ext/Moria2
mount: can't find LABEL=Moria_320
Telcontar:~ #

There could be a non-blocking error condition (ie, continue with normal booting, but inform of the error); you could end up writing files on the parent directory instead of on the mounted disk you wanted, and run out of space or fail later because of missing files.

In this case, you want the services that depend on this mount not to start, or to fail and say clearly why, but the booting should continue so that you can diagnose and work on the problem.

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
On 2016-03-08 15:11, Carlos E. R. wrote:
On 2016-03-08 00:18, Anton Aylward wrote:
Or, RTFM, use "nofail" ... maybe
This option has another can of worms. If such a mount fails, you get no message at all that it failed - but sometimes you do:
Telcontar:~ # mount /mnt/Ext/Moria2
mount: can't find LABEL=Moria_320
Telcontar:~ #
There could be a non-blocking error condition (ie, continue with normal booting, but inform of the error); you could end up writing files on the parent directory instead of on the mounted disk you wanted, and run out of space or fail later because of missing files.
In this case, you want the services that depend on this mount not to start, or to fail and say clearly why, but the booting should continue so that you can diagnose and work on the problem.
You can write a script that runs soon after startup and that tries to access a subdirectory or file that is only present in the mounted directory. If it can't access the file, it raises an alarm.

Cheers, Dave
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
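A minimal sketch of the kind of check script Dave describes, using Greg's mount point as an example and assuming a marker file has been created inside the mounted filesystem and that local mail delivery works (the marker name is made up):

#!/bin/sh
# raise an alarm if the marker that only exists inside the mounted fs is missing
MARKER=/home/portal_backup/portal_backup/.this-is-the-container
if ! test -e "$MARKER"; then
    echo "portal_backup is not mounted" | mail -s "mount check failed on $(hostname)" root
fi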
On 2016-03-08 16:15, Dave Howorth wrote:
In this case, you want the services that depend on this mount not to start, or to fail and say clearly why, but the booting should continue so that you can diagnose and work on the problem.
You can write a script that runs soon after startup and that tries to access a subdirectory or file that is only present in the mounted directory. If it can't access the file, it raises an alarm.
Yes, of course, that's what I do; but there may be neat methods using systemd services :-?

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
On 03/08/2016 10:36 AM, Carlos E. R. wrote:
Yes, of course, that's what I do; But there may be neat methods using systemd services :-?
I've just posted about that; and yes, There's More Than One Way To Do It.

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On 03/08/2016 10:15 AM, Dave Howorth wrote:
In this case, you want the services that depend on this mount not to start, or to fail and say clearly why, but the booting should continue so that you can diagnose and work on the problem.
You can write a script that runs soon after startup and that tries to access a subdirectory or file that is only present in the mounted directory. If it can't access the file, it raises an alarm.
That's easy to do with systemd: a unit that has a dependency. It can be a timer, or it can be "After=" some other event. There may be a way to define a unit for the FS which has an "OnFailure=".
From RTFM ..
OnFailure=
    A space-separated list of one or more units that are activated when this unit enters the "failed" state.

RequiresMountsFor=
    Takes a space-separated list of absolute paths. Automatically adds dependencies of type Requires= and After= for all mount units required to access the specified path.

OnFailureJobMode=
    Takes a value of "fail", "replace", "replace-irreversibly", "isolate", "flush", "ignore-dependencies" or "ignore-requirements". Defaults to "replace". Specifies how the units listed in OnFailure= will be enqueued. See systemctl(1)'s --job-mode= option for details on the possible values. If this is set to "isolate", only a single unit may be listed in OnFailure=.

As far as I can make out, if a unit file exists for a mount, that is used; otherwise the result from running the generator on /etc/fstab is used. So take the line out of the fstab and create the mount unit with appropriate dependencies and the additional lines about what to do if and when it fails. You can use the currently existing generated mount file for that device as a template.

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
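To make that concrete, a rough sketch of a check unit using the OnFailure= directive quoted above, with invented unit names (mountpoint(1) from util-linux is just a convenient command that fails when the path is not a mount point):

# /etc/systemd/system/portal-backup-check.service   (hypothetical name)
[Unit]
Description=Check that the portal_backup container is mounted
# ordering only, so the check runs after local mounts have been attempted
After=local-fs.target
OnFailure=mount-alarm.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'mountpoint -q /home/portal_backup/portal_backup'

[Install]
WantedBy=multi-user.target

mount-alarm.service (also an invented name) would be whatever notification mechanism you prefer; RequiresMountsFor= is more useful on the services that actually need the data, so they refuse to start when the mount is absent.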
On Mon, 7 Mar 2016 19:41, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 12:01 PM, Anton Aylward wrote:
On 03/07/2016 10:50 AM, Per Jessen wrote:
Greg Freemyer wrote:
I got a journal dump before and after the failure (journalctl -xb > log). About 500 lines of logs added in the post failure log.
I went through and fixed every minor complaint until my server would run smoothly.
Commenting out this line from fstab was the "fix".
===
#/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 loop 0 0
===
So something is not quite right about "/srv_new/portal_backup_container". What happens if you try to mount it manually?
The big thing for me that makes this a pretty major bug is that the above line for some reason made my server unusable as a server. Something about it caused systemd to revert to maintenance mode 15 minutes after boot.
Most probably failure to mount that filesystem - I've seen that before.
Quite possibly so. Trying to mount it manually might show what is going on.
On the face of it, it looks like it's trying to do a "bind" mount (q.v. man page) but without the "--bind". After all, the "/srv/.." part is not a device. Does it actually exist? Does the destination exist?
Hmm...
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn?
If no one knows, I'll just open this as a bugzilla.
BTW; This isn't a --bind mount. It is a loopback mount. I'm supposed to have a large file at /srv_new/portal_backup_container that itself is a filesystem that I loopback mount.
I know that is strange, but it is needed to overcome an issue with the NFS mount options set up by my cloud provider. Inside the loopback mount I have total control of the mount options.
Oh! How about a workaround:

/etc/fstab: (added noauto to options)
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 noauto,loop 0 0

and for the service that uses this mount (added to the .service file):

[Service]
ExecStartPre=/bin/sh -c 'test -d /home/portal_backup/portal_backup || mkdir -p /home/portal_backup/portal_backup'
ExecStartPre=/bin/sh -c 'grep /home/portal_backup/portal_backup /proc/mounts || mount /home/portal_backup/portal_backup'

These two parts should ensure that
a) your system boots normally,
b) if the mount is not there, no other service is hurt,
c) the service that NEEDs the mount can do the mount itself.

- Yamaban.
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On Mon, Mar 7, 2016 at 1:59 PM, Yamaban <foerster@lisas.de> wrote:
On Mon, 7 Mar 2016 19:41, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 12:01 PM, Anton Aylward wrote:
On 03/07/2016 10:50 AM, Per Jessen wrote:
Greg Freemyer wrote:
I got a journal dump before and after the failure (journalctl -xb > log). About 500 lines of logs added in the post failure log.
I went through and fixed every minor complaint until my server would run smoothly.
Commenting out this line from fstab was the "fix".
===
#/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 loop 0 0
===
So something is not quite right about "/srv_new/portal_backup_container". What happens if you try to mount it manually?
The big thing for me that makes this a pretty major bug is that the above line for some reason made my server unusable as a server. Something about it caused systemd to revert to maintenance mode 15 minutes after boot.
Most probably failure to mount that filesystem - I've seen that before.
Quite possibly so. Trying to mount it manually might show what is going on.
On the face of it, it looks like it's trying to do a "bind" mount (q.v. man page) but without the "--bind". After all, the "/srv/.." part is not a device. Does it actually exist? Does the destination exist?
Hmm...
The destination does not exist (it used to).
But ignoring that rather big issue, why does a failed mount of a tertiary filesystem cause systemd to crash and burn.
If no one knows, I'll just open this as a bugzilla.
BTW; This isn't a --bind mount. It is a loopback mount. I'm supposed to have a large file at /srv_new/portal_backup_container that itself is a filesystem that I loopback mount.
I know that is strange, but it is needed to overcome an issue with the NFS mount options set up by my cloud provider. Inside the loopback mount I have total control of the mount options.
Oh! how about a workaround:
/etc/fstab: (added noauto to options)
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 noauto,loop 0 0
and for the service that uses this mount (added to the .service file):

[Service]
ExecStartPre=/bin/sh -c 'test -d /home/portal_backup/portal_backup || mkdir -p /home/portal_backup/portal_backup'
ExecStartPre=/bin/sh -c 'grep /home/portal_backup/portal_backup /proc/mounts || mount /home/portal_backup/portal_backup'
These two parts should ensure that
a) your system boots normally,
b) if the mount is not there, no other service is hurt,
c) the service that NEEDs the mount can do the mount itself.
- Yamaban.
Thanks,

I will definitely add "noauto".

As to the "service", I will have to think about that. Currently another VM in the same data center is sending this machine a backup, and it just does a scp to the mounted filesystem, so there isn't an obvious service to add your script to.

Maybe I will just put similar lines in .profile (or .bashrc) for the dedicated backup_user account.

Greg
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On Mon, 7 Mar 2016 20:30, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 1:59 PM, Yamaban wrote:
On Mon, 7 Mar 2016 19:41, Greg Freemyer wrote:
On Mon, Mar 7, 2016 at 12:01 PM, Anton Aylward wrote:
On 03/07/2016 10:50 AM, Per Jessen wrote:
Greg Freemyer wrote:
[snip] Oh! how about a workaround:
/etc/fstab: (added noauto to options)
/srv_new/portal_backup_container /home/portal_backup/portal_backup ext4 noauto,loop 0 0
and for the service that uses this mount (added to the .service file):

[Service]
ExecStartPre=/bin/sh -c 'test -d /home/portal_backup/portal_backup || mkdir -p /home/portal_backup/portal_backup'
ExecStartPre=/bin/sh -c 'grep /home/portal_backup/portal_backup /proc/mounts || mount /home/portal_backup/portal_backup'
These two parts should ensure that
a) your system boots normally,
b) if the mount is not there, no other service is hurt,
c) the service that NEEDs the mount can do the mount itself.
Thanks,
I will definitely add "noauto".
As to the "service". I will have to think about that. Currently another VM in the same data center is sending this machine a backup and it just does a scp to the mounted filesystem, so there isn't an obvious service to add your script to.
Maybe I will just put similar lines in .profile (or .bashrc) for the dedicated backup_user account.
Two ways to get around that:

First way: create a script that does the mount-point check and the mount, and call this script either
a) internally via cron (with @reboot; add a delay in the script if needed), or
b) externally via a call from ssh before the scp starts.

Second way: create a 'portal_backup.service' file with

[code]
======================
[Unit]
Description="mount the loopback portal_backup independently"
After=sshd.service postfix.service
JobTimeoutSec=15
RequiresMountsFor=/home

[Service]
Type=oneshot
ExecStartPre=/bin/sh -c 'test -d /home/portal_backup/portal_backup || mkdir -p /home/portal_backup/portal_backup'
ExecStart=/bin/sh -c '/usr/bin/grep /home/portal_backup/portal_backup /proc/mounts || /usr/bin/mount /home/portal_backup/portal_backup'
ExecStop=/usr/bin/umount /home/portal_backup/portal_backup

[Install]
========================
[/code]

and either create a portal_backup.timer (OnBootSec= or OnStartupSec=), or a call from cron (@reboot), or simply 'enable' this service.

IMHO a .service is the best way to go: you can call that via a .timer, from cron (systemctl) or from ssh (also systemd), and if you get some dependencies in (After=) you can add a general 'enable' for that.

- Yamaban.
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
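Purely as an illustration of the .timer route mentioned above, a matching portal_backup.timer could look roughly like this:

[code]
======================
[Unit]
Description=Mount the portal_backup loopback shortly after boot

[Timer]
OnBootSec=2min
Unit=portal_backup.service

[Install]
WantedBy=timers.target
======================
[/code]

It would then be enabled with 'systemctl enable portal_backup.timer' instead of enabling the service directly.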
On 03/05/2016 08:10 PM, Greg Freemyer wrote:
A server I admin has been nothing but trouble today.
It started having a problem, and being smarter than average bear I thought the best way out was to go ahead and upgrade from 13.1 to 13.2 and all would be good.
Not so much.
I'm now running 13.2. zypper dup says I'm done and so does zypper verify.
The current trouble is postfix isn't running.
And when I try "systemctl start postfix.service" it triggers a systemd meltdown and I get thrown into a systemd maintenance mode.
There is probably useful info in the systemd journal, but there is a lot of stuff in there and I don't know what I'm looking for.
Guidance appreciated.
From the systemd POV you should be able to get a specific drill down on postfix alone:
systemctl status postfix.service

If that proves to be catastrophic then I'd take a close look at the unit file that manages postfix. See systemd.unit(5)

As Per says, you can try starting postfix from the command line and see what it is reporting.

It may be that the upgrade has introduced new config files in /etc/postfix. Maybe/maybe-not there are rpmsave files there. Heck, I get them in a normal 'zypper up', so if you got them in the 'upgrade' I wouldn't be surprised!

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
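For example, to spot leftover config conflicts from the upgrade, and to see which postfix settings are actually in effect (postconf -n prints the non-default ones):

# find /etc/postfix -name '*.rpmnew' -o -name '*.rpmsave'
# postconf -n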
participants (8)
- Andrei Borzenkov
- Anton Aylward
- Carlos E. R.
- Darin Perusich
- Dave Howorth
- Greg Freemyer
- Per Jessen
- Yamaban