[opensuse] systemd stops postfix/nfsserver for false(?) reasons
Hi, I'm running a server on Leap 42.2. Amongst other things, it is mail and NFS server for the home directories. After a reboot today neither postfix nor the nfs server were running. A look at the boot log:

Oct 23 07:44:43 royac6 systemd[1]: Stopped Postfix Mail Transport Agent.
Oct 23 07:44:43 royac6 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Oct 23 07:44:43 royac6 systemd[1]: Starting Activate md array even though degraded...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS server and services.
Oct 23 07:44:43 royac6 systemd[1]: Stopping NFSv4 ID-name mapping service...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS Mount Daemon.
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFSv4 ID-name mapping service.
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.

So obviously it is because of a 'problem' with the home directories (/home is served by nfs-server, and postfix uses Maildir in the home directories). But /home is mounted... So I looked at that:

Oct 23 07:44:09 royac6 kernel: sdb: sdb1
Oct 23 07:44:09 royac6 kernel: sda: sda1
Oct 23 07:44:12 royac6 kernel: md: bind<sda1>
Oct 23 07:44:13 royac6 kernel: md: bind<sdb1>
Oct 23 07:44:13 royac6 kernel: md/raid1:md1: active with 2 out of 2 mirrors
Oct 23 07:44:13 royac6 kernel: created bitmap (8 pages) for device md1
Oct 23 07:44:13 royac6 kernel: md1: bitmap initialized from disk: read 1 pages, set 11 of 15260 bits
Oct 23 07:44:13 royac6 kernel: md1: detected capacity change from 0 to 1024061145088
Oct 23 07:44:13 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
Oct 23 07:44:13 royac6 systemd[1]: Mounting /home...
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): 1 orphan inode deleted
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): recovery complete
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: discard
Oct 23 07:44:13 royac6 systemd[1]: Mounted /home.

No further mentions of md1 or /home until the first message block above. A second after the umount I see:

Oct 23 07:44:44 royac6 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
Oct 23 07:44:44 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
Oct 23 07:44:44 royac6 systemd[1]: Mounting /home...

What the **** is going on there? Sidenote: it is stopping the nfs and mail services before they are even started...

As I don't see any real error message it seems to be some race condition in the systemd dependencies(?), but I'm not sure what to do to locate it. It seems to be the wait timer, but why is it still running? The device is already active, and has all its disks...

Any help appreciated!

Pit
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On 2017-10-23 13:05, Peter Suetterlin wrote:
Hi,
I'm running a server on Leap 42.2. Amongst other things, it is mail and NFS server for the home directories.
After a reboot today neither postfix nor the nfs server were running. A look at the boot log:
Oct 23 07:44:43 royac6 systemd[1]: Stopped Postfix Mail Transport Agent.
The actual reason will be before that.
Oct 23 07:44:43 royac6 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Oct 23 07:44:43 royac6 systemd[1]: Starting Activate md array even though degraded...
Notice that the RAID array has a problem, one disk missing.
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS server and services.
Oct 23 07:44:43 royac6 systemd[1]: Stopping NFSv4 ID-name mapping service...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS Mount Daemon.
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFSv4 ID-name mapping service.
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
It is also stopping the raid array. Apparently /home is mounted in the raid array, not NFS. You have to clarify your setup.
So obviously it is because of a 'problem' with the home directories (/home is served by nfs-server, and postfix uses Maildir in the home directories). But /home is mounted... So I looked at that:
Oct 23 07:44:09 royac6 kernel: sdb: sdb1
Oct 23 07:44:09 royac6 kernel: sda: sda1
Oct 23 07:44:12 royac6 kernel: md: bind<sda1>
Oct 23 07:44:13 royac6 kernel: md: bind<sdb1>
Oct 23 07:44:13 royac6 kernel: md/raid1:md1: active with 2 out of 2 mirrors
Oct 23 07:44:13 royac6 kernel: created bitmap (8 pages) for device md1
Oct 23 07:44:13 royac6 kernel: md1: bitmap initialized from disk: read 1 pages, set 11 of 15260 bits
Oct 23 07:44:13 royac6 kernel: md1: detected capacity change from 0 to 1024061145088
Oct 23 07:44:13 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
The missing disk for the RAID array now appears. Nowhere does it talk about NFS. So now it tries again to mount /home.
Oct 23 07:44:13 royac6 systemd[1]: Mounting /home...
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): 1 orphan inode deleted
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): recovery complete
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: discard
Oct 23 07:44:13 royac6 systemd[1]: Mounted /home.
No further mentions of md1 or /home until the first message block above.
A second after the umount I see
Oct 23 07:44:44 royac6 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
Oct 23 07:44:44 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
Oct 23 07:44:44 royac6 systemd[1]: Mounting /home...
What the **** is going on there? Sidenote: it is stopping the nfs and mail services before they are even started...
As I don't see any real error message it seems to be some race condition in the systemd dependencies(?), but I'm not sure what to do to locate it. It seems to be the wait timer, but why is it still running? The device is already active, and has all its disks...
Any help appreciated!
Pit
-- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
Carlos E. R. wrote:
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS server and services.
Oct 23 07:44:43 royac6 systemd[1]: Stopping NFSv4 ID-name mapping service...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS Mount Daemon.
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFSv4 ID-name mapping service.
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
It is also stopping the raid array. Apparently /home is mounted in the raid array, not NFS. You have to clarify your setup.
The box is an NFS server for /home, and in the garbled systemd output above, the nfs server is being stopped.
--
Per Jessen, Zürich (9.1°C)
http://www.cloudsuisse.com/ - your owncloud, hosted in Switzerland.
On 2017-10-23 15:51, Per Jessen wrote:
Carlos E. R. wrote:
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS server and services.
Oct 23 07:44:43 royac6 systemd[1]: Stopping NFSv4 ID-name mapping service...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS Mount Daemon.
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFSv4 ID-name mapping service.
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
It is also stopping the raid array. Apparently /home is mounted in the raid array, not NFS. You have to clarify your setup.
The box is an NFS server for /home, and in the garbled systemd output above, the nfs server is being stopped.
Ah, sorry, I thought it was importing an NFS share. Yes, the log says nfs server. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
Sorry - I was busy with other things...

Carlos E. R. wrote:
On 2017-10-23 13:05, Peter Suetterlin wrote:
Hi,
I'm running a server on Leap 42.2. Amongst other things, it is mail and NFS server for the home directories.
After a reboot today neither postfix nor the nfs server were running. A look at the boot log:
Oct 23 07:44:43 royac6 systemd[1]: Stopped Postfix Mail Transport Agent.
The actual reason will be before that.
The reason it is stopping is because it needs /home, and /home is to be unmounted. Same for the NFS server.
Oct 23 07:44:43 royac6 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Oct 23 07:44:43 royac6 systemd[1]: Starting Activate md array even though degraded...
Notice that the RAID array has a problem, one disk missing.
But the next part of my post shows that the disks are there, properly added, and the RAID is active with two of two disks. Or is this a false message? Also, the kernel found both disks (sda1 and sdb1).
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS server and services.
Oct 23 07:44:43 royac6 systemd[1]: Stopping NFSv4 ID-name mapping service...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS Mount Daemon.
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFSv4 ID-name mapping service.
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
It is also stopping the raid array. Apparently /home is mounted in the raid array, not NFS. You have to clarify your setup.
Yes, as I said, it's the server for home directories, mounts them from the RAID and then exports via NFS.
So obviously it is because of a 'problem' with the home directories (/home is served by nfs-server, and postfix uses Maildir in the home directories). But /home is mounted... So I looked at that:
Oct 23 07:44:09 royac6 kernel: sdb: sdb1
Oct 23 07:44:09 royac6 kernel: sda: sda1
Oct 23 07:44:12 royac6 kernel: md: bind<sda1>
Oct 23 07:44:13 royac6 kernel: md: bind<sdb1>
Oct 23 07:44:13 royac6 kernel: md/raid1:md1: active with 2 out of 2 mirrors
Oct 23 07:44:13 royac6 kernel: created bitmap (8 pages) for device md1
Oct 23 07:44:13 royac6 kernel: md1: bitmap initialized from disk: read 1 pages, set 11 of 15260 bits
Oct 23 07:44:13 royac6 kernel: md1: detected capacity change from 0 to 1024061145088
Oct 23 07:44:13 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
The missing disk for the RAID array now appears. Nowhere does it talk about NFS. So now it tries again to mount /home.
If you look at the timestamps, this is 30 seconds *before* it stops/unmounts /home and claims the disk is missing. Sorry for posting them out-of-sync.

So to sum up again: the kernel detects both disks/partitions, md properly fires up the RAID clean, mounts it and recovers from an orphaned inode. Then suddenly systemd decides that there is a disk missing and unmounts again, just to 'find' the RAID again directly after that.
23.10.2017 20:16, Peter Suetterlin wrote:
So to sum up again:
If you want any useful help, show the full log (journalctl -b). If you can reproduce this problem, boot without "quiet" on the kernel command line; it will give a more verbose and hopefully more useful log.
On 23/10/17 18:50, Andrei Borzenkov wrote:
23.10.2017 20:16, Peter Suetterlin wrote:
So to sum up again:
If you want any useful help, show the full log (journalctl -b). If you can reproduce this problem, boot without "quiet" on the kernel command line; it will give a more verbose and hopefully more useful log.
What version of mdadm? What version of udev?

Another possible problem is that mdadm is correctly assembling the array, and because udev is notified by the array appearing in /dev (the events are asynchronous), udev/systemd is getting confused over whether the array exists, or is degraded, or what. Doesn't sound that likely, but it's plausible.

Cheers,
Wol
Wols Lists wrote:
On 23/10/17 18:50, Andrei Borzenkov wrote:
23.10.2017 20:16, Peter Suetterlin wrote:
So to sum up again:
If you want any useful help, show the full log (journalctl -b). If you can reproduce this problem, boot without "quiet" on the kernel command line; it will give a more verbose and hopefully more useful log.
What version of mdadm? What version of udev?
As I wrote, Leap 42.2. So mdadm - v3.4 - 28th January 2016 and udev-228-25.9.1
Another possible problem is that mdadm is correctly assembling the array, and because udev is notified by the array appearing in /dev (the events are asynchronous), udev/systemd is getting confused over whether the array exists, or is degraded, or what. Doesn't sound that likely, but it's plausible.
This is what I suspect is happening. The log part (which I unfortunately posted out-of-time-order) says md properly started the raid w/o issues. systemd also sees the partition (around line 1185 in the boot log http://www.royac.iac.es/~pit/Stuff/boot.log.xz):

Oct 23 07:44:13 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.

and mounts it afterwards. But for some reason systemd thinks it is missing disks and is degraded, and I have no idea why :((
Andrei Borzenkov wrote:
23.10.2017 20:16, Peter Suetterlin wrote:
So to sum up again:
If you want any useful help, show the full log (journalctl -b).
http://www.royac.iac.es/~pit/Stuff/boot.log.xz
If you can reproduce this problem, boot without "quiet" on the kernel command line; it will give a more verbose and hopefully more useful log.
OK, I'll take the quiet out. Yes, I've seen it once before. But the machine is typically only booted once or twice a year....
24.10.2017 12:31, Peter Suetterlin wrote:
Andrei Borzenkov wrote:
23.10.2017 20:16, Peter Suetterlin wrote:
So to sum up again:
If you want any useful help, show the full log (journalctl -b).
Oct 23 07:44:09 royac6 systemd[1]: Started Timer to wait for more drives before activating degraded array..
Oct 23 07:44:09 royac6 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..

This was in initrd and is correct. Timer is started, array is assembled, device appears, timer is stopped.

Oct 23 07:44:13 royac6 systemd[1]: Started Timer to wait for more drives before activating degraded array..

And that is unexpected. There should be no reason to start the timer at this point, unless there is some race condition or bug in mdadm. What happens further is probably a direct consequence of this.

Oct 23 07:44:43 royac6 systemd[1]: Starting Activate md array even though degraded...

In Leap 42.2 this service has Conflicts against the device node. When systemd attempts to start this service, it also attempts to "stop" the device node. While this is not possible, it will cause anything depending on this device node (the mount point in the first place) to be stopped as well. So you get

Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.

As for Postfix, I suspect it has a dependency on /home.
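For illustration, the last-resort units described here look roughly like this (a sketch reconstructed from upstream mdadm; the exact contents of the Leap 42.2 package are an assumption):

```ini
# mdadm-last-resort@.timer (sketch)
[Unit]
Description=Timer to wait for more drives before activating degraded array.

[Timer]
OnActiveSec=30
```

```ini
# mdadm-last-resort@.service (sketch, the 42.2-style variant)
[Unit]
Description=Activate md array even though degraded
# This Conflicts= line is the problem: starting the service makes systemd
# try to stop the device unit, and the stop propagates to everything
# depending on it (the /home mount, nfs-server, postfix).
Conflicts=sys-devices-virtual-block-%i.device

[Service]
Type=oneshot
ExecStart=/sbin/mdadm --run /dev/%i
```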
If you can reproduce this problem, boot without "quiet" on the kernel command line; it will give a more verbose and hopefully more useful log.
OK, I'll take the quiet out. Yes, I've seen it once before. But the machine is typically only booted once or twice a year....
Given that it is likely a one-off condition, rebooting once a year is unlikely to provide much useful information. Besides, in 42.3 the service was changed to not conflict; instead it uses a conditional to skip starting.

It would be interesting to (try to) understand how the timer gets enabled. Could you provide the output of "udevadm info --export-db" (not as a compressed file, please)?
Thanks a LOT for your detailed analysis Andrei! Andrei Borzenkov wrote:
24.10.2017 12:31, Peter Suetterlin wrote:
Andrei Borzenkov wrote:
23.10.2017 20:16, Peter Suetterlin wrote:
So to sum up again:
If you want any useful help, show the full log (journalctl -b).
Oct 23 07:44:09 royac6 systemd[1]: Started Timer to wait for more drives before activating degraded array..
Oct 23 07:44:09 royac6 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
This was in initrd and is correct. Timer is started, array is assembled, device appears, timer is stopped.
Oct 23 07:44:13 royac6 systemd[1]: Started Timer to wait for more drives before activating degraded array..
And that is unexpected. There should be no reason to start the timer at this point, unless there is some race condition or bug in mdadm. What happens further is probably a direct consequence of this.
Ah. I had also assumed there's no issue with md0 because this is already handled during the initrd phase, as it contains the system. I assumed it would now start that timer for the second RAID, md1, which is /home. But the timers are a systemd thing, aren't they?
Oct 23 07:44:43 royac6 systemd[1]: Starting Activate md array even though degraded...
In Leap 42.2 this service has Conflicts against the device node. When systemd attempts to start this service, it also attempts to "stop" the device node. While this is not possible, it will cause anything depending on this device node (the mount point in the first place) to be stopped as well.
Ouch. That is nasty.
So you get
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
As for Postfix I suspect it has dependency on /home.
Yes, it needs $HOME/Maildir for mail delivery. So stopping both services is 'reasonable' if you umount /home. The questions are why it's unmounted, and why the services are not started after the second mount.
Given that it is likely one-off condition rebooting once a year is unlikely to provide much useful information.
Well, I can of course reboot for a test, just have to time it nicely with the users. I'll probably have a try later this evening.
Besides in 42.3 service was changed to not conflict, instead it is using conditional to skip starting.
Partially good news then - an update of the system is already planned.
It would be interesting to (try to) understand how timer gets enabled. Could you provide output of "udevadm info --export-db" (not as compressed file, please)?
I tried paste.opensuse.org, but it didn't like the size (7k lines...). So again here: http://www.royac.iac.es/~pit/Stuff/udev_info.txt

I did create a bugreport, too: https://bugzilla.opensuse.org/show_bug.cgi?id=1064887

Thanks again,
Pit
On Wed, Oct 25, 2017 at 1:29 PM, Peter Suetterlin <P.Suetterlin@royac.iac.es> wrote:
Thanks a LOT for your detailed analysis Andrei!
Actually I was partially wrong here.
Oct 23 07:44:13 royac6 systemd[1]: Started Timer to wait for more drives before activating degraded array..
And that is unexpected.
Well, you have two arrays, so the initrd part applies to root, and here it *is* expected for the second array. Unfortunately the normal logs do not show us which array is affected. I filed a change request.
Ah. I had also assumed there's no issue with md0 because this is already handled during the initrd phase, as it contains the system. I assumed it would now start that timer for the second RAID, md1, which is /home.
Yes. This shifts the question from "why it was started" to "why it was not stopped". It was for md0 apparently.
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
As for Postfix I suspect it has dependency on /home.
Yes, it needs $HOME/Maildir for mail delivery. So stopping both services is 'reasonable' if you umount /home. The questions are why it's unmounted, and why the services are not started after the second mount.
It is unmounted because systemd attempts to stop md1, which forces the filesystem that depends on it to be stopped too. That is how the dependency logic in systemd works.

The service is not started again because systemd is past the point where it would start it. The initial attempt was due to the dependency on the default target. Now the default target is already started, so nothing triggers a (re-)start of the services.
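The kind of dependency on /home suspected here would look like this in a unit (a sketch; whether SUSE's postfix.service actually declares it this way, or inherits it another way, is an assumption):

```ini
# Hypothetical drop-in: /etc/systemd/system/postfix.service.d/home.conf
[Unit]
# RequiresMountsFor adds Requires= and After= on home.mount, so when
# systemd stops the mount, the stop propagates to postfix -- but a later
# fresh start of the mount does not start postfix again.
RequiresMountsFor=/home
```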
On 2017-10-23 19:16, Peter Suetterlin wrote:
Sorry - I was busy with other things...
Carlos E. R. wrote:
On 2017-10-23 13:05, Peter Suetterlin wrote:
Yes, as I said, it's the server for home directories, mounts them from the RAID and then exports via NFS.
Yes, Per clarified this part.
If you look at the timestamps, this is 30 seconds *before* it stops/unmounts /home and claims the disk is missing. Sorry for posting them out-of-sync.
Ah!! Uff... Please, don't ever do that. Or if you do, please say clearly that the logs are not in order.

Please, can you repost a longer part of the log, all in correct order? Even better, the full (minutes) log, from boot till minutes later when you log in? I can try to have a go at it after reordering.

Oct 23 07:44:09 royac6 kernel: sdb: sdb1
Oct 23 07:44:09 royac6 kernel: sda: sda1
Oct 23 07:44:12 royac6 kernel: md: bind<sda1>
Oct 23 07:44:13 royac6 kernel: md: bind<sdb1>
Oct 23 07:44:13 royac6 kernel: md/raid1:md1: active with 2 out of 2 mirrors
Oct 23 07:44:13 royac6 kernel: created bitmap (8 pages) for device md1
Oct 23 07:44:13 royac6 kernel: md1: bitmap initialized from disk: read 1 pages, set 11 of 15260 bits
Oct 23 07:44:13 royac6 kernel: md1: detected capacity change from 0 to 1024061145088

The raid is assembled. The "capacity change" I don't understand; maybe the disks are external?

Oct 23 07:44:13 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.

Which disk is this one?

Oct 23 07:44:13 royac6 systemd[1]: Mounting /home...
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): 1 orphan inode deleted
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): recovery complete
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: discard
Oct 23 07:44:13 royac6 systemd[1]: Mounted /home.

Apparently it detects an error on the home filesystem and does recovery on it, before actually mounting it. Shouldn't there be more log entries here? There is a hole, half a minute.

Oct 23 07:44:43 royac6 systemd[1]: Stopped Postfix Mail Transport Agent.
Oct 23 07:44:43 royac6 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Oct 23 07:44:43 royac6 systemd[1]: Starting Activate md array even though degraded...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS server and services.
Oct 23 07:44:43 royac6 systemd[1]: Stopping NFSv4 ID-name mapping service...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS Mount Daemon.
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFSv4 ID-name mapping service.
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.

Yes, here it umounts /home because the array is degraded, but the reason for the degradation is missing.

Oct 23 07:44:44 royac6 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
Oct 23 07:44:44 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.

Same device as before. What is it? And then it mounts home again, possibly degraded.

Oct 23 07:44:44 royac6 systemd[1]: Mounting /home...
So to sum up again: Kernel detects both disks/partitions, md properly fires up the RAID clean, mounts it and recovers from an orphaned inode. Then suddenly systemd decides that there is a disk missing and unmounts again, just to 'find' the RAID again directly after that.
We need to see the complete boot log, without "quiet", and also the list of disks.
lsblk --bytes --output NAME,KNAME,RA,RM,RO,SIZE,TYPE,FSTYPE,LABEL,PARTLABEL,MOUNTPOINT,UUID,PARTUUID,WWN,MODEL,ALIGNMENT /dev/sd?
You can post those to susepaste.org for a limited time, and post the link here.
--
Cheers / Saludos,
Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
On 23/10/17 19:45, Carlos E. R. wrote:
pages, set 11 of 15260 bits
Oct 23 07:44:13 royac6 kernel: md1: detected capacity change from 0 to 1024061145088
The raid is assembled. The "capacity change" I don't understand, maybe the disks are external?
This, I think, is normal. I know it confuses people, but if the array is detected while partially assembled (sounds like that did happen) the partially assembled array has a size of 0, because it's unusable. Once it's fully assembled, the size gets reset.

Cheers,
Wol
Carlos E. R. wrote:
On 2017-10-23 19:16, Peter Suetterlin wrote:
If you look at the timestamps, this is 30 seconds *before* it stops/unmounts /home and claims the disk is missing. Sorry for posting them out-of-sync.
Ah!!
Uff...
Please, don't ever do that. Or if you do, please say clearly that the logs are not in order.
Mea maxima culpa. I posted it the way I was proceeding to find the issue :(
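For what it's worth, journal excerpts can be put back into chronological order before posting with a few lines of scripting (a hypothetical helper, not a standard tool; the year has to be supplied because the syslog-style timestamp lacks it):

```python
from datetime import datetime

def sort_journal_lines(lines, year=2017):
    # The classic journal prefix is "Mon DD HH:MM:SS host ...";
    # the first three whitespace-separated fields form the timestamp.
    def stamp(line):
        month, day, clock = line.split(None, 3)[:3]
        return datetime.strptime(f"{year} {month} {day} {clock}",
                                 "%Y %b %d %H:%M:%S")
    return sorted(lines, key=stamp)

log = [
    "Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...",
    "Oct 23 07:44:13 royac6 systemd[1]: Mounted /home.",
    "Oct 23 07:44:09 royac6 kernel: sdb: sdb1",
]
for line in sort_journal_lines(log):
    print(line)
```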
Please, can you repost a longer part of the log, all in correct order? Even better, the full (minutes) log, from boot till minutes later when you login?
http://www.royac.iac.es/~pit/Stuff/boot.log.xz
I can try to have a go at it after reordering.
Oct 23 07:44:09 royac6 kernel: sdb: sdb1
Oct 23 07:44:09 royac6 kernel: sda: sda1
Oct 23 07:44:12 royac6 kernel: md: bind<sda1>
Oct 23 07:44:13 royac6 kernel: md: bind<sdb1>
Oct 23 07:44:13 royac6 kernel: md/raid1:md1: active with 2 out of 2 mirrors
Oct 23 07:44:13 royac6 kernel: created bitmap (8 pages) for device md1
Oct 23 07:44:13 royac6 kernel: md1: bitmap initialized from disk: read 1 pages, set 11 of 15260 bits
Oct 23 07:44:13 royac6 kernel: md1: detected capacity change from 0 to 1024061145088
The raid is assembled. The "capacity change" I don't understand, maybe the disks are external?
No, this is normal and happens with all raid assemblies
Oct 23 07:44:13 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
Which disk is this one?
That's the one in question, to be mounted on /home, which is done in the next step
Oct 23 07:44:13 royac6 systemd[1]: Mounting /home...
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): 1 orphan inode deleted
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): recovery complete
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: discard
Oct 23 07:44:13 royac6 systemd[1]: Mounted /home.
Apparently it detects an error on the home filesystem and does recovery on it, before actually mounting it.
Yes, the only unusual thing I could find. But the mount succeeds without further delay.
Shouldn't there be more log entries here? There is a hole, half a minute.
There are, but not related. The 30s is (I suppose) the timeout of the wait-for-more-disks.
Oct 23 07:44:43 royac6 systemd[1]: Stopped Postfix Mail Transport Agent.
Oct 23 07:44:43 royac6 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Oct 23 07:44:43 royac6 systemd[1]: Starting Activate md array even though degraded...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS server and services.
Oct 23 07:44:43 royac6 systemd[1]: Stopping NFSv4 ID-name mapping service...
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFS Mount Daemon.
Oct 23 07:44:43 royac6 systemd[1]: Stopped NFSv4 ID-name mapping service.
Oct 23 07:44:43 royac6 systemd[1]: Started Activate md array even though degraded.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
Yes, here it umounts /home because the array is degraded, but the reason of the degradation is missing.
Exactly. There is absolutely nothing in the logs that suggests an issue with the RAID. Probably I have to put systemd in a more verbose mode?
Oct 23 07:44:44 royac6 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
Oct 23 07:44:44 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
Same device as before. What is it?
Not sure - I think it restarts the detection after the timeout and finds it (again).
And then mounts home again, possibly degraded.
Oct 23 07:44:44 royac6 systemd[1]: Mounting /home...
Yes, it mounts it again. In perfectly sane state. And I'm quite sure (from the logs) that it was clean before, too.
So to sum up again: Kernel detects both disks/partitions, md properly fires up the RAID clean, mounts it and recovers from an orphaned inode. Then suddenly systemd decides that there is a disk missing and unmounts again, just to 'find' the RAID again directly after that.
We need to see the complete boot log, without "quiet", and also the list of disks.
Without quiet will take a while, I'll have to schedule a reboot without affecting the observatory operations...
lsblk --bytes --output NAME,KNAME,RA,RM,RO,SIZE,TYPE,FSTYPE,LABEL,PARTLABEL,MOUNTPOINT,UUID,PARTUUID,WWN,MODEL,ALIGNMENT /dev/sd?
you can post those to susepaste.org for a limited time, and post here the link.
Wanted to do that with the boot log, but that's a bit too long (5k lines). Or is there an easy way to get that into the form? The lsblk output is here: http://paste.opensuse.org/94717111
On Monday, October 23, 2017 13:05 CEST, Peter Suetterlin <P.Suetterlin@royac.iac.es> wrote:
Oct 23 07:44:13 royac6 systemd[1]: Mounting /home...
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): 1 orphan inode deleted
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): recovery complete
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: discard
Oct 23 07:44:13 royac6 systemd[1]: Mounted /home.
It seems systemd is trying to mount /home twice. Check your fstab and make sure there is no duplicate entry.

The other thing is to look for the source of "Starting Activate md array even though degraded". That sounds like the RAID1 is broken somehow. Use the standard tools to check the array. systemd is probably just calling those tools as well; they report a problem and systemd is acting on that.

The other problems (mail and NFS) are stopped by systemd because they need /home or the target "Local File Systems". Since /home is broken, they are stopped and /home is unmounted. Also check your other local file systems (root, var, ...).
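For the duplicate-entry check, a quick sketch (run here against an inlined sample fstab; the devices and options below are made up for illustration, and on a real system /etc/fstab would be the input): print any mount point that appears more than once.

```shell
# Print the second field (mount point) of all non-comment fstab lines,
# then show only values that occur more than once. Empty output means
# there are no duplicate mount points.
awk '!/^#/ && NF >= 2 {print $2}' <<'EOF' | sort | uniq -d
# sample fstab with hypothetical devices
/dev/md0  /      ext4  defaults           0 1
/dev/md1  /home  ext4  defaults,discard   0 2
EOF
# -> no output (the sample has no duplicates)
```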
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
Why does it report /home twice?
Oct 23 07:44:44 royac6 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
Oct 23 07:44:44 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
Oct 23 07:44:44 royac6 systemd[1]: Mounting /home...
This sounds like systemd gave up waiting for the broken mirror and tried to mount /home anyway. That would be the third time /home is mounted during boot.

Regards,
-- 
Aaron "Optimizer" Digulla a.k.a. Philmann Dark
"It's not the universe that's limited, it's our imagination. Follow me and I'll show you something beyond the limits." http://blog.pdark.de/
Aaron Digulla wrote:
On Monday, October 23, 2017, 13:05 CEST, Peter Suetterlin <P.Suetterlin@royac.iac.es> wrote:
Oct 23 07:44:13 royac6 systemd[1]: Mounting /home...
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): 1 orphan inode deleted
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): recovery complete
Oct 23 07:44:13 royac6 kernel: EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: discard
Oct 23 07:44:13 royac6 systemd[1]: Mounted /home.
It seems systemd is trying to mount /home twice. Check your fstab and make sure there is no duplicate entry.
Eh - no. The first line says it's going to mount, the second says "now it's mounted". There's only one entry for /home in the fstab.
The other thing is to look for the source of "Starting Activate md array even though degraded". That sounds like the RAID1 is broken somehow. Use the standard tools to check the array. systemd is probably just calling those tools as well, they report a problem and systemd is acting on that.
The tools (/proc/mdstat, mdadm --detail / -E) tell me that the RAID is fine, and it had been properly assembled before the mount. But 30s later systemd thinks it is degraded. Which it is not(?).
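For reference, the degraded check can be scripted from /proc/mdstat: the `[2/2]` counter on the status line means both mirrors are active. A sketch against sample text modelled on what a healthy two-disk md1 looks like (on the host, read /proc/mdstat directly instead of the heredoc; the block count below is made up):

```shell
# Look at the line following the "md1" header in /proc/mdstat:
# "[2/2]" = all member devices active, anything else = degraded.
awk '/^md1/ {getline; if ($0 ~ /\[2\/2\]/) print "md1: not degraded"; else print "md1: DEGRADED"}' <<'EOF'
md1 : active raid1 sdb1[1] sda1[0]
      1000059712 blocks super 1.0 [2/2] [UU]
EOF
# -> md1: not degraded
```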
The other problems (mail and NFS) are stopped by systemd because they need /home or the target "Local File Systems". Since /home is broken, they are stopped and /home is unmounted. Also check your other local file systems (root, var, ...).
The rest is on a different RAID1 (md0) and has no issues.
Oct 23 07:44:43 royac6 systemd[1]: Stopped target Local File Systems.
Oct 23 07:44:43 royac6 systemd[1]: Unmounting /home...
Oct 23 07:44:43 royac6 systemd[1]: Stopped (with error) /dev/md1.
Oct 23 07:44:43 royac6 systemd[1]: Unmounted /home.
Why does it report /home twice?
Again, 'unmounting' and 'unmounted', so referring to the same operation.
Oct 23 07:44:44 royac6 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
Oct 23 07:44:44 royac6 systemd[1]: Found device /dev/disk/by-uuid/133b616a-1100-4278-86a7-9eb677783e9b.
Oct 23 07:44:44 royac6 systemd[1]: Mounting /home...
This sounds like systemd gave up waiting for the broken mirror and tried to mount /home anyway. That would be the third time /home is mounted during boot.
It's the second time, but yes. It had mounted it, then unmounted it, and mounted it again. But there is nothing in the logs that gives a hint why it thinks the disk is not there. The RAID is not degraded, and never was (else there would be *some* message in the logs, wouldn't it? Or is it systemd policy to hide such unimportant info from me?)

And the final question is, of course: after /home is finally mounted, why doesn't it start the NFS server and postfix services now?
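One hedged idea for that last question (a hypothetical drop-in; RequiresMountsFor= is real systemd.unit syntax, but whether this is appropriate for the stock nfs-server unit on Leap 42.2 would need checking): tie the service to the mount explicitly, so the ordering and dependency on /home are at least declared. Note that a dependency alone does not re-start a service that systemd already stopped, so this is a sketch of tightening the relationship, not a guaranteed fix:

```ini
# /etc/systemd/system/nfs-server.service.d/home-mount.conf
# Hypothetical drop-in: declare that nfs-server needs /home mounted.
# RequiresMountsFor= adds Requires= and After= on home.mount.
[Unit]
RequiresMountsFor=/home
```

The same drop-in could be made for postfix.service; `systemctl daemon-reload` is needed after adding it.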
On 2017-10-23 19:46, Peter Suetterlin wrote:
And the final question is, of course, after /home is finally mounted - why doesn't it start NFS server and postfix services now?
Maybe it boots in maintenance mode. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
participants (7)
- Aaron Digulla
- Andrei Borzenkov
- Carlos E. R.
- Carlos E. R.
- Per Jessen
- Peter Suetterlin
- Wols Lists