[Bug 821879] New: udev update breaks network
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c0

Summary: udev update breaks network
Classification: openSUSE
Product: openSUSE 12.3
Version: Final
Platform: x86-64
OS/Version: openSUSE 12.3
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Update Problems
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: bwiedemann@suse.com
QAContact: jsrain@suse.com
CC: meissner@suse.com, fcrozat@suse.com, rmilasan@suse.com
Found By: Development
Blocker: ---

Reproducible: Always

Steps to Reproduce:
have a 12.3 x86_64 VM with eth0 configured for dhcp4
ssh to the VM and do
zypper in udev-195-13.25.1

Actual Results:
Installing: udev-195-13.25.1 ...............<50%>====================[\]
in the middle of udev installation, network goes down, no dhcpcd running

This is very bad for servers where you often have no easy way to do
rcnetwork restart
(which reports an error after 31s but still brings the dhcpcd and network back)

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c1
--- Comment #1 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c3
--- Comment #3 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c
FeiXiang Zhang
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c4
Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c5
Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c6
--- Comment #6 from Marius Tomaschewski
in the middle of udev installation, network goes down, no dhcpcd running
"systemctl status network.service" shows it is stopped? Then I guess, systemd resolves some "Wants" dependencies and stops it or something like this. Please enable systemd debugging, so it is visible what systemd is doing (starting/stopping).
This is very bad for servers where you often have no easy way to do rcnetwork restart (which reports an error after 31s but still brings the dhcpcd and network back)
Sure. Which errors does it show exactly?
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c7
--- Comment #7 from Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c8
--- Comment #8 from Frederic Crozat
BTW, how do I enable debugging in systemd while the system is running? I know how to do it for udev, but that doesn't help; I don't get anything.
SIGRTMIN+22 to PID 1, which means: kill -56 1
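The number 56 follows from SIGRTMIN being 34, which holds on Linux with glibc but is an assumption about the platform rather than something stated in this report. A minimal sketch of the arithmetic, which only prints the command instead of signalling PID 1:

```shell
#!/bin/sh
# Assumption: SIGRTMIN is 34, as on Linux with glibc.
SIGRTMIN_NUM=34

# Offset named in the comment above (SIGRTMIN+22 switches systemd to debug logging).
DEBUG_OFFSET=22

debug_signal=$((SIGRTMIN_NUM + DEBUG_OFFSET))

# Print the command rather than running it, so nothing is sent to PID 1 here.
echo "kill -$debug_signal 1"
```

On a platform where SIGRTMIN differs, only SIGRTMIN_NUM would need to change.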
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c9
--- Comment #9 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c10
Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c11
--- Comment #11 from Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c12
--- Comment #12 from Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c13
--- Comment #13 from Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c14
--- Comment #14 from Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c15
--- Comment #15 from Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c16
--- Comment #16 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c17
--- Comment #17 from Marius Tomaschewski
Again another note here:
If we switch from dhcpcd to dhclient in /etc/sysconfig/network/dhcp the problem can't be reproduced.
opensuse:~ # cat /etc/sysconfig/network/dhcp|grep ^DHCLIENT_BIN
DHCLIENT_BIN="dhclient"
opensuse:~ # cat /etc/sysconfig/network/dhcp|grep ^DHCLIENT_DEBUG
DHCLIENT_DEBUG="yes"
But debug mode doesn't seem to work, at least when using dhclient.
It causes debug output to be written to /var/log/dhclient-script.$interface.log.

Bernhard,

We have to find out what's the problem with the network first, before we continue to check what happens during "zypper in -f udev".

Please stop the network, set DEBUG=EXTRA in /etc/sysconfig/network/config (bash -vx of all scripts), then start the network and attach the tar archive created by:
"tar cvjpf /tmp/network-debug.tar.bz2 /dev/.sysconfig/network"

On 12.3-GA, there is sysconfig-0.80.5-1.2.1. In updates, there is sysconfig-0.80.5-1.5.1, which provides a fix in this area ... but perhaps the fix does not work properly:

* Wed Mar 20 2013 mt@suse.de
- Fixed to wait for dhcp/ipv6 under systemd again. Fixed regression caused by bnc#785240, bnc#780644 fixes to not discard the dhcp/ipv6 duplicate address detection in progress error codes under systemd completely, but wait until dhcp/ipv6 dad finished or the WAIT_FOR_INTERFACES timeout is reached and then discard in the status returned to systemd (bnc#808718). It caused failures of other services trying to bind tentative IPv6 addresses, e.g. in mixed dhcp4 / static IPv6 setups. Thanks to Rolf Eike Beer for the report/tests/debug outputs. [0001-Fixed-to-wait-for-dhcp-ipv6-under-systemd-again.patch]

Which sysconfig are you using? If the newer one, please install the older version and try again. If this solves the rcnetwork errors, there is a bug in the fix. Please provide DEBUG=EXTRA output of the version causing it, or from both, but please make sure to remove all exdeb.* files from older runs first.

(In reply to comment #16)
found something: rcnetwork status has eth0 is up, but ipv6 duplicate address check failed dead
but IPv6 is not actually used in that network
No, but duplicate address detection is always done for IPv6, and you seem to have another machine in the network using the same MAC address as yours.
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c18
Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c19
--- Comment #19 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c20
Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c21
Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c22
Marius Tomaschewski
I did
echo net.ipv6.conf.eth0.accept_dad=0 >> /etc/sysctl.conf
reboot
and then, rcnetwork restart would succeed after 3.5s instead of failing after 31s ; and I could ping6 but after next boot I still get duplicate address errors maybe because network is initialized before sysctl.conf is used?
Move it to /etc/sysconfig/network/ifsysctl -- "man 5 ifsysctl"; this one is applied a) by udev, b) by ifup.

(In reply to comment #19)
err. host MAC starts with FE but VM's with FA
yes, they differ. but as they're generated/random, another vm on another host still may have it.
could still be a kernel bug http://ipv6-or-no-ipv6.blogspot.de/2013/02/ipv6-duplicate-address-in-linux.h...
This could be. You aren't using bonding, are you? Linux bonding AFAIR sends out DAD through each port. The IPv6 code is IMO not aware of this and detects dups when it gets two answers. It basically does not check received > sent, but received > 1. See also bug#715430.

The workaround is to set "disable_ipv6 = 1" or at least "accept_dad = 0" on bridge ports & bond slaves [where no IPv6 addr is assigned anyway].

(In reply to comment #21)
It looks as if the duplication happens in the bridge interface [...] The two packets are only 22 microseconds apart which is about the time a ping on 127.0.0.1 takes but much less than anything over the network.
also brctl showmacs br563 |grep 33:33:ff:23:f6:4d
                                ^^^^^^^^^^^^^^^^^
This (33:33:*) is an IPv6 multicast address mapped to an Ethernet MAC.
does not return anything, so I guess the default bridge behaviour is to pass the packet to all ports - but should that really include the port it came from?
Host kernel is 3.0.74-0.6.8-default
I've not noticed such behavior on my hosts + vm's using same kernel. Jiri, can you take a look / do you have an idea what goes wrong here?
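To illustrate the 33:33:* mapping discussed above: a DAD probe goes to the solicited-node multicast group of the tentative address, and its Ethernet destination is 33:33:ff followed by the low 24 bits of that address. The sketch below assumes the link-local address is derived from the MAC via EUI-64, so those 24 bits equal the last three octets of the MAC. The function name and the example MAC are hypothetical; only the FA prefix and the 23:f6:4d tail appear in this report.

```shell
#!/bin/sh
# solicited_node_mac MAC
# Prints the Ethernet multicast address that DAD probes use for the
# EUI-64 link-local address derived from MAC (assumption: EUI-64 addressing).
solicited_node_mac() {
    # Lower-case the MAC and keep octets 4-6, i.e. its low 24 bits.
    low24=$(printf '%s\n' "$1" | tr 'A-F' 'a-f' | cut -d: -f4-6)
    # 33:33 marks IPv6 multicast; ff is the tail of the ff02::1:ff group prefix.
    printf '33:33:ff:%s\n' "$low24"
}

# Hypothetical VM MAC whose last three octets match the address grepped for above.
solicited_node_mac fa:16:3e:23:f6:4d
```

Because this is a group address, it is not expected to show up as a learned entry in the bridge's MAC table.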
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c23
--- Comment #23 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c24
Jiri Bohac
https://bugs.launchpad.net/nova/+bug/1011134
so we really have 3 bugs:
1. VM interfaces on compute-hosts of SUSE Cloud 1.0 / OpenStack have hairpin_mode on, making them receive back their own traffic
I think you're correct. My understanding of VEPA is that the hairpin mode is there to allow the bridge to act as an uplink to another VEPA-mode bridge connected to the specific port. Setting the hairpin mode on a VSI is wrong.

The above-mentioned Ubuntu bug has a link to an openstack "fix" for this: https://review.openstack.org/#/c/14017/
Instead of not setting the hairpin mode on the VSIs, they set up a packet filter to work around the problem. Am I missing something?
2. with the Host's bridge treating multicast like broadcast, VM's Linux IPv6 duplicate address detection receives its own sent packet and assumes some other host is using the same addr
Multicasts are treated like broadcasts by L2 switches - I don't see what's wrong with that. They just should not be looped back to the port they came from, which is what the hairpin misconfiguration causes. David Miller has rejected a patch trying to work around such a network misconfiguration: http://www.spinics.net/lists/netdev/msg127696.html I agree. I think it's a good thing that DAD detects duplicate addresses caused by a duplicate MAC address - something quite likely to happen with virtual machines.
3. 12.3's systemd assumes this means a failure in the network script and on udev upgrade the network is killed
Yes, this looks incorrect. For example, one may have more than one IPv6 address configured. One of them failing DAD should not be a reason to deconfigure the network -- the other addresses can still be used, IPv4 can be used.
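For point 1 above, the hairpin setting of each bridge port can be read from sysfs on the host. This is a minimal sketch, assuming the usual Linux bridge layout where each port appears under /sys/class/net/BRIDGE/brif/PORT with a hairpin_mode file; the function name and the optional sysfs-root argument are made up for illustration.

```shell
#!/bin/sh
# list_hairpin_ports BRIDGE [SYSFS_ROOT]
# Prints the names of ports on BRIDGE whose hairpin_mode is 1.
list_hairpin_ports() {
    bridge=$1
    sysfs=${2:-/sys}
    for port in "$sysfs/class/net/$bridge/brif"/*; do
        # Skip the unexpanded glob when the bridge has no ports.
        [ -r "$port/hairpin_mode" ] || continue
        if [ "$(cat "$port/hairpin_mode")" = "1" ]; then
            basename "$port"
        fi
    done
}

# Usage on a compute host (bridge name taken from the report above):
#   list_hairpin_ports br563
```

Any port it prints will get its own frames reflected back, which is the condition that makes DAD see its own probe.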
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c25
--- Comment #25 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c26
Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c27
--- Comment #27 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c28
--- Comment #28 from Marius Tomaschewski
Created an attachment (id=542777) --> (http://bugzilla.novell.com/attachment.cgi?id=542777) [details] journal
it seems on boot there are two dhcpcd processes running at the same time
No, there are multiple stages: dhcpcd gets a lease and calls dhcpcd-hook with "up", "new", "down", "complete" [which results in "ifdown -o dhcp", "netconfig" and "ifup -o dhcp"]. Then it forks.

Basically the problem is, that when /etc/init.d/network reports !0 to systemd, systemd will stop it again and not start services requiring it. When it does not report failure on error conditions, other services will fail.

For IPv6 it is ON by default and when there is a dup detected on the link local address, IPv6 is basically not usable on this interface (multicasts, ... also a dhcpv6 client which requires it will fail). Further, there is no check which address fails or if other services are using (binding it) or not. Either everything worked or not. As the failure happens on a mandatory interface, rcnetwork fails.

You can a) disable dad via ifsysctl file, b) disable IPv6 (per interface).

(In reply to comment #25)
2. if there is a duplicate MAC address - shouldn't it then also refuse to do IPv4 networking? I guess our duplicate address detection code is less picky there. (admitted: IPv4 does more things in userspace than IPv6)
For IPv4 there is no duplicate address detection done by default for static IPs (see CHECK_DUPLICATE_IP & SEND_GRATUITOUS_ARP variables in network/config), but dhcpcd makes it (should be the "checking .. is available on attached networks" msg).
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c29
--- Comment #29 from Frederic Crozat
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c30
--- Comment #30 from Bernhard Wiedemann
For IPv6 it is ON by default and when there is a dup detected on the link local address, IPv6 is basically not usable on this interface (multicasts, ... also a dhcpv6 client which requires it will fail). Further, there is no check which address fails or if other services are using (binding it) or not. Either everything worked or not. As the failure happens on a mandatory interface, rcnetwork fails.
You can a) disable dad via ifsysctl file, b) disable IPv6 (per interface).
IPv6 dad is no more failing, as I workarounded this on the openstack side. So rcnetwork status always shows as green, but the udev bug can still occur. Also eth0 is set to DHCP4 so no dhcpv6 client should run.
(In reply to comment #25)
2. if there is a duplicate MAC address - shouldn't it then also refuse to do IPv4 networking? I guess our duplicate address detection code is less picky there. (admitted: IPv4 does more things in userspace than IPv6)
For IPv4 there is no duplicate address detection done by default for static IPs (see CHECK_DUPLICATE_IP & SEND_GRATUITOUS_ARP variables in network/config), but dhcpcd makes it (should be the "checking .. is available on attached networks" msg).
Does dhcpcd react to receiving its own gratuitous ARP packet? How? What is the difference in meaning of "active (running)" and "active (exited)" systemd/network states?
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c31
--- Comment #31 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c32
--- Comment #32 from Marius Tomaschewski
(In reply to comment #28)
For IPv6 it is ON by default and when there is a dup detected on the link local address, IPv6 is basically not usable on this interface (multicasts, ... also a dhcpv6 client which requires it will fail). Further, there is no check which address fails or if other services are using (binding it) or not. Either everything worked or not. As the failure happens on a mandatory interface, rcnetwork fails.
You can a) disable dad via ifsysctl file, b) disable IPv6 (per interface).
IPv6 dad is no more failing, as I workarounded this on the openstack side.
OK.
So rcnetwork status always shows as green, but the udev bug can still occur.
Only when some error occurs and duplicate addresses are definitely errors. The status isn't checked per address, but per interface and rcnetwork reports failure when there is an error on mandatory interfaces. You can use STARTMODE=hotplug or ifplugd for all "nice to have" interfaces.
Also eth0 is set to DHCP4 so no dhcpv6 client should run.
When BOOTPROTO=dhcp4 is set, dhcp6 will not be started/used, but this does not mean IPv6 is disabled -- it is enabled by default (static and autoconf). You can disable a) DAD or b) IPv6 altogether via ifsysctl, or set IPV6_DAD_WAIT="0" to disable the check (each one per-interface or globally).
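Option a) can be sketched as a small helper that adds a sysctl-style line to a file only if the key is not already set. The helper name is hypothetical, and the "key = value" form assumes that ifsysctl accepts the same syntax as sysctl.conf ("man 5 ifsysctl" is the authority here). The demo writes to a scratch file rather than to /etc/sysconfig/network/ifsysctl.

```shell
#!/bin/sh
# add_sysctl_line FILE KEY VALUE
# Appends "KEY = VALUE" to FILE unless a line for KEY already exists.
add_sysctl_line() {
    file=$1
    key=$2
    value=$3
    # Escape the dots so the key is matched literally.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if ! grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$file" 2>/dev/null; then
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Demo against a scratch file; on a real system the target would be
# /etc/sysconfig/network/ifsysctl.
demo_file=$(mktemp)
add_sysctl_line "$demo_file" net.ipv6.conf.eth0.accept_dad 0
cat "$demo_file"
```

For option b), the same helper would be called with the key net.ipv6.conf.eth0.disable_ipv6 and the value 1.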
(In reply to comment #25)
2. if there is a duplicate MAC address - shouldn't it then also refuse to do IPv4 networking? I guess our duplicate address detection code is less picky there. (admitted: IPv4 does more things in userspace than IPv6)
For IPv4 there is no duplicate address detection done by default for static IPs (see CHECK_DUPLICATE_IP & SEND_GRATUITOUS_ARP variables in network/config), but dhcpcd makes it (should be the "checking .. is available on attached networks" msg).
Does dhcpcd react to receiving its own gratuitous ARP packet? How?
I don't know/remember what it exactly does here.
What is the difference in meaning of "active (running)" and "active (exited)" systemd/network states?
/etc/init.d/network is using "X-Systemd-RemainAfterExit: true" LSB tag:

RemainAfterExit=
Takes a boolean value that specifies whether the service shall be considered active even when all its processes exited. Defaults to no.

"active (exited)" means the network.service cgroup is empty. You'll see this in static configurations (systemctl status network.service). "active (running)" means, there is some process in the cgroup, that is in e.g. dhcp mode where dhcp client is running.

Are you using dhcp and get "active (exited)"? Did the dhcpcd die?
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c33
--- Comment #33 from Marius Tomaschewski
Steps to reproduce:
cd /dev/shm
wget http://openqa.opensuse.org/openqa/img/12.3-mini64.qcow2 # 369MB
qemu-kvm -drive if=virtio,file=12.3-mini64.qcow2
Going to fetch it...
root linux rcnetwork status # will show "active (exited)"
No dhcpcd in the cgroup?!
zypper -n in -f udev
ifconfig # will show eth0 without inet addr
The problem seems to be here:
Jun 06 15:58:36 bwiedemann-12 systemd[1]: Received SIGCHLD from PID 2204 (ifstatus-route).
Jun 06 15:58:36 bwiedemann-12 systemd[1]: Got SIGCHLD for process 2204 (ifstatus-route)
Jun 06 15:58:36 bwiedemann-12 systemd[1]: Child 2204 died (code=exited, status=1/FAILURE)
Could it be that multiple ifup eth0 processes are spawned? When running this with -no-kvm option, bootup was so slow that rcnetwork status showed 3 different PIDs in lines of /bin/bash /sbin/ifup eth0 -o rc onboot
and even an ifdown in between?
ifdown is usually from "ifdown eth0 -o dhcp" -- started for dhcp post processing.
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c34
--- Comment #34 from Marius Tomaschewski
What is the difference in meaning of "active (running)" and "active (exited)" systemd/network states?
"active (exited)" means the network.service cgroup is empty. You'll see this in static configurations (systemctl status network.service). "active (running)" means, there is some process in the cgroup, that is in e.g. dhcp mode where dhcp client is running.
Wrong. "active (exited)" means, the /etc/init.d/network (network.service) is done; it exited with the code visible in the status. "active (running)" means, the /etc/init.d/network (network.service) is still running.
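The two states can be read directly from the ActiveState and SubState properties that "systemctl show" prints as KEY=VALUE lines. A minimal sketch that condenses that output into the "active (exited)" form used in this thread; the function name is made up:

```shell
#!/bin/sh
# unit_state: reads "systemctl show" output (KEY=VALUE lines) on stdin and
# prints "ActiveState (SubState)", e.g. "active (exited)".
unit_state() {
    awk -F= '
        $1 == "ActiveState" { active = $2 }
        $1 == "SubState"    { sub_state = $2 }
        END { printf "%s (%s)\n", active, sub_state }
    '
}

# Usage on a systemd host:
#   systemctl show network.service | unit_state
```

Logging this once after boot and once after "zypper in -f udev" would show which state a given run started from.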
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c35
--- Comment #35 from Marius Tomaschewski
IPv6 dad is no more failing, as I workarounded this on the openstack side.
And how?

For me, everything works fine using your VM on my host / network. Network works fine and systemd does not trigger restart while "zypper in -f udev":

08:02:09.641545-04:00 bwiedemann-12 systemd[1]: Stopping udev Control Socket.
08:02:09.641869-04:00 bwiedemann-12 systemd[1]: Closed udev Control Socket.
08:02:09.642369-04:00 bwiedemann-12 systemd[1]: Stopping udev Kernel Socket.
08:02:09.642600-04:00 bwiedemann-12 systemd[1]: Closed udev Kernel Socket.
08:02:09.643468-04:00 bwiedemann-12 systemd[1]: Stopping udev Kernel Device Manager...
08:02:09.644205-04:00 bwiedemann-12 systemd[1]: Stopped udev Kernel Device Manager.
08:02:09.644598-04:00 bwiedemann-12 systemd[1]: Stopped udev Kernel Device Manager.
08:02:09.747782-04:00 bwiedemann-12 systemd[1]: Reloading.
08:02:09.762553-04:00 bwiedemann-12 systemd[1]: Starting udev Kernel Socket.
08:02:09.762880-04:00 bwiedemann-12 systemd[1]: Listening on udev Kernel Socket.
08:02:09.763129-04:00 bwiedemann-12 systemd[1]: Starting udev Control Socket.
08:02:09.763413-04:00 bwiedemann-12 systemd[1]: Listening on udev Control Socket.
08:02:09.763656-04:00 bwiedemann-12 systemd[1]: Starting udev Kernel Device Manager...
08:02:09.771822-04:00 bwiedemann-12 systemd[1]: Started udev Kernel Device Manager.
08:02:10.067863-04:00 bwiedemann-12 kernel: [ 55.352623] device-mapper: uevent: version
08:02:10.067873-04:00 bwiedemann-12 kernel: [ 55.352847] device-mapper: ioctl: 4.23.0-io
08:02:12.343079-04:00 bwiedemann-12 cloud-init[1799]: 2013-06-07 08:02:12,342 - util.py[WA
08:02:15.243755-04:00 bwiedemann-12 kernel: [ 60.528958] SGI XFS with ACLs, security att
08:02:15.247743-04:00 bwiedemann-12 kernel: [ 60.532965] JFS: nTxBlock = 3928, nTxLock =
08:02:15.261746-04:00 bwiedemann-12 kernel: [ 60.546700] QNX4 filesystem 0.2.3 registere
08:02:15.274808-04:00 bwiedemann-12 kernel: [ 60.559710] Btrfs loaded
08:02:15.277978-04:00 bwiedemann-12 systemd[1]: Mounted FUSE Control File System.
08:02:15.278746-04:00 bwiedemann-12 kernel: [ 60.563293] fuse init (API version 7.20)
08:02:15.299619-04:00 bwiedemann-12 os-prober: debug: /dev/vda1: is active swap

Please set DEBUG=EXTRA in /etc/sysconfig/network/config and provide the output:
"tar cvjpf /tmp/network-debug.tar.bz2 /dev/.sysconfig/network"
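The debug procedure requested above can be sketched as follows. The helper name is hypothetical, and it assumes the variable already exists in the file as a NAME=... line, since it rewrites that line rather than appending one. The demo runs against a scratch file with made-up sample values; the real sequence is listed in the trailing comments.

```shell
#!/bin/sh
# set_sysconfig_var FILE NAME VALUE
# Rewrites an existing NAME=... line in FILE to NAME="VALUE".
# Leaves FILE unchanged if NAME is not present.
set_sysconfig_var() {
    file=$1
    name=$2
    value=$3
    tmp=$(mktemp)
    # Write via a temp file, then copy back so FILE keeps its permissions.
    sed "s|^${name}=.*|${name}=\"${value}\"|" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo on a scratch copy (sample values, not the shipped defaults).
demo=$(mktemp)
printf 'DEBUG="no"\nWAIT_FOR_INTERFACES="30"\n' > "$demo"
set_sysconfig_var "$demo" DEBUG EXTRA
cat "$demo"

# On the affected VM the sequence requested above would then be:
#   rcnetwork stop
#   set_sysconfig_var /etc/sysconfig/network/config DEBUG EXTRA
#   rcnetwork start
#   tar cvjpf /tmp/network-debug.tar.bz2 /dev/.sysconfig/network
```

Removing old exdeb.* files before the run, as asked in comment 17, keeps the archive limited to the run being reported.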
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c
Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c36
--- Comment #36 from Marius Tomaschewski
it might be worth shipping a .service for the "ifup" way of network, which can then tell systemd to accept additional error code (SuccessExitStatus,RestartPreventExitStatus, see man systemd.service)
BTW: Basically the only possibility to avoid all this trouble I see -- except for comment 29 -- is to use something like this:

--- a/scripts/network
+++ b/scripts/network
@@ -906,6 +906,8 @@ case "$ACTION" in
 reload_firewall
+ # do not report any errors to systemd
+ test "$SD_RUNNING" = "yes" && rc_reset
 ;;
 stop)

But is this really something we should use?!

Frederic, would it be possible to implement:
X-Systemd-RestartPreventExitStatus: ...
X-Systemd-SuccessExitStatus: ...
?
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c37
Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c38
--- Comment #38 from Frederic Crozat
Frederic, would it be possible to implement: X-Systemd-RestartPreventExitStatus: ... X-Systemd-SuccessExitStatus: ...
?
No, it is a wrong idea. For systemd, restart is stop + start. If we really need something, you should just create a network.service file in /usr/lib/systemd/system which will call /etc/init.d/network in ExecStart / ExecStop / ExecReload.
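A rough sketch of what such a wrapper unit might look like, written out by a small shell function so it can be inspected in a scratch directory first. Only the three Exec lines come from the suggestion above; Type=forking and RemainAfterExit=yes are assumptions meant to mirror how the LSB script is treated today, and the function name is made up.

```shell
#!/bin/sh
# write_network_unit DIR
# Writes an illustrative network.service into DIR.
write_network_unit() {
    cat > "$1/network.service" <<'EOF'
[Unit]
Description=LSB network script wrapper (illustrative sketch)

[Service]
# Assumptions: Type and RemainAfterExit are guesses, not taken from a
# shipped unit file.
Type=forking
RemainAfterExit=yes
ExecStart=/etc/init.d/network start
ExecStop=/etc/init.d/network stop
ExecReload=/etc/init.d/network reload
EOF
}

# Demo: write into a scratch directory and show the result.
unit_dir=$(mktemp -d)
write_network_unit "$unit_dir"
cat "$unit_dir/network.service"
```

A SuccessExitStatus= line, as floated in comment 36, could be added to the [Service] section, but the exit codes it would list are not given in this thread and are left out here.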
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c39
--- Comment #39 from Marius Tomaschewski
Marius: how often did you try? did you run the disk from tmpfs? on a fast host?
6 times on a i7 box with image on SSD.
I think there is a race condition happening and that depends on the speed of the system.
This could be. I'll retry on tmpfs. Could you provide the DEBUG=EXTRA outputs please?
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c40
--- Comment #40 from Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c41
--- Comment #41 from Marius Tomaschewski
Could you provide the DEBUG=EXTRA outputs please?
Together with debug logs from systemd please.
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c42
--- Comment #42 from Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c43
--- Comment #43 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c44
Dirk Mueller
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c45
--- Comment #45 from Marius Tomaschewski
(SuccessExitStatus,RestartPreventExitStatus, see man systemd.service)
It is IMO simpler to catch it in "rcnetwork start", where it is reported.
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c46
Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c47
Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c48
--- Comment #48 from Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c49
--- Comment #49 from Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c50
Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c51
Frederic Crozat
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c52
Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c53
--- Comment #53 from Robert Milasan
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c54
Marius Tomaschewski
for me, systemctl (and systemd) are ok:
main process (ie /etc/init.d/network, on the Process line) was executed, finished and returned an exit code of 0 => Active (because /etc/init.d/network has X-Systemd-RemainAfterExit: true) and Exited because the pid referenced on the Process line is no longer running
there is still one process running, which was started by this process, and it is dhcpcd.
And why does it behave like this only in 50% of the cases/boots [when VM is on tmpfs of a fast host, VM on HDD in <10% of the cases]? Usually it reports "active (running)" here. Also after "systemctl restart network.service" after it initially booted in "active (exited)" state.

The situation is exactly the same: rcnetwork exited with status 0, but there is still dhcpcd running. Just the systemd state differs from time to time.

And why does systemd, during the udev update, drop all state and kill all cgroup members (dhcpcd) in the "active (exited)" state [run1], but not in the "active (running)" state [run3], where it does not kill all network.service cgroup members?

=> See run1 and run3 in the attachment of comment 49.

(In reply to comment #52)
Re-assigning the bug to Marius, don't really see this as a udev issue.
It is not a network scripts bug. Network works, 0 status is reported. Re-assigning to Frederic as systemd is not consistent / behaves strangely.
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c55
--- Comment #55 from Frederic Crozat
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c56
--- Comment #56 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c57
--- Comment #57 from Frederic Crozat
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c58
--- Comment #58 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c59
--- Comment #59 from Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c60
--- Comment #60 from Frederic Crozat
(In reply to comment #55)
Hmm: - the systemd behavior is inconsistent.
With dhcp: it _usually_ reports correctly "active (running)", _sometimes_ "active (exited)"
You basically need a VM on a tmpfs of a fast host to get the "active (exited)".
Looks to me, that the RemainAfterExit logic is not sufficient or there is some kind of race condition. To 'considers all PID to be potentially "main PID"' is IMO correct.
Systemd should set the state consistently with RemainAfterExit: always "active (running)" when the main PID (of started init script) process exits and cgroup is not empty and when the cgroup becomes empty, change to "active (exited)".
What is the output of systemctl show network.service in both case ?
- Providing a network.service lease has potential to break the service switch in yast2 -- but I didn't tested it.
I don't see how it would break service switch ?
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c61
--- Comment #61 from Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c62
--- Comment #62 from Marius Tomaschewski
(In reply to comment #59)
(In reply to comment #55) What is the output of systemctl show network.service in both cases?
I'll provide it next week -- have to revert the changes first.
- Providing a network.service lease has potential to break the service switch in yast2 -- but I haven't tested it.
I don't see how it would break the service switch?
I don't see it now, but the potential is there: I was thinking about the 2nd stage tricks, because it calls "disable NetworkManager.service" without a following "enable network.service". But there is AFAIR also an "insserv network" to ensure the LSB script is active, which will be forwarded to systemd now. At the moment it looks fine -- seems to work.
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c
Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c64
--- Comment #64 from Marius Tomaschewski
What is the output of systemctl show network.service in both cases?
Attached in the archive. No difference except in timestamps and SubState.
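A comparison like this can be reproduced from two saved "systemctl show network.service" outputs. The following is a minimal sketch, assuming the usual Key=Value line format of that command; the helper names are made up, and properties are treated as timestamps simply because their name contains "Timestamp".

```python
def parse_show(text):
    """Parse 'systemctl show' style output (one Key=Value per line)."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            props[key] = value
    return props


def diff_show(first, second, ignore_timestamps=True):
    """Return {property: (value_in_first, value_in_second)} for every
    property whose value differs between the two outputs."""
    a, b = parse_show(first), parse_show(second)
    changed = {}
    for key in sorted(set(a) | set(b)):
        if ignore_timestamps and "Timestamp" in key:
            continue
        if a.get(key) != b.get(key):
            changed[key] = (a.get(key), b.get(key))
    return changed
```

Fed with one capture from an "active (running)" boot and one from an "active (exited)" boot, only the SubState entry should remain if the two really differ in nothing else.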
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c65
--- Comment #65 from Frederic Crozat
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c66
--- Comment #66 from Frederic Crozat
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c67
--- Comment #67 from Frederic Crozat
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c69
--- Comment #69 from Marius Tomaschewski
Just wondering: is it possible to have any "long running" process being run with the udev network rules? This is no longer possible with systemd / udev in 12.3 (RUN+= shouldn't have anything long running there, a systemd unit should be used instead) and udev will kill any process remaining (like a dhcpcd instance) once the script from RUN+= is finished. Feel free to tell me I'm wrong :)
Hmm.. The question is what is "long running" -- the complete boot needs just a few secs in both cases. The rule may start some background jobs. AFAIR, the udev rule timeout was much higher. I'll take a close look here... when udev kills the background processes that were started via RUN, it could be a reason. But I currently don't believe it is udev related.
There is also something I don't understand: when I look at the journal trace on attachment #3 (with systemd.log_level=debug), it looks like the dhcpcd PID and output which is left running is not the one which was initially started:
Jun 06 15:58:25 bwiedemann-12 dhcpcd[750]: eth0: exiting
vs
Jun 06 16:07:30 bwiedemann-12 dhcpcd[1387]: eth0: received SIGTERM, stopping
It looks like two dhcpcd daemons were started somehow, with one "blocking" the other.
No, dhcpcd was started; after it got a lease it forked and continued in the background, while the parent exited [pid 750 above].
It would be interesting to have, for the same problematic run (active(exited)), journal trace (journalctl -b, with booting with systemd.log_level=debug) and the network debug trace.
Yes, and perhaps also udev in debug mode too.
Another reminder is that network.service (and this kind of initscript) is "abusing" type=forking, since there is no real "daemon" running, nor a PIDFile to help systemd detect which PID is really the main one.
Yes, it is a drawback in the systemd / RemainAfterExit logic for proper SysV script support :-)
I'm also not 100% sure the issue is caused by active(exited) vs active(running), since upgrading the udev package doesn't cause any change to systemd's own configuration (no --daemon-reload nor --daemon-reexec in %scripts).
AFAIS, systemd/udev/migration hook kills all processes (using SIGTERM) from the "exited" cgroup. Maybe this happens because it removes background processes started by the RUN rule...? Enabling network tracing did not show any output, that is, there was no "rcnetwork stop" involved.
(In reply to comment #67)
However, I'm not sure this state is the reason why network goes down..
IMO it seems something we have to take a closer look at.
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c70
--- Comment #70 from Frederic Crozat
(In reply to comment #65)
Just wondering: is it possible to have any "long running" process being run with the udev network rules? This is no longer possible with systemd / udev in 12.3 (RUN+= shouldn't have anything long running there, a systemd unit should be used instead) and udev will kill any process remaining (like a dhcpcd instance) once the script from RUN+= is finished. Feel free to tell me I'm wrong :)
Hmm.. The question is what is "long running" -- the complete boot needs just a few secs in both cases. The rule may start some background jobs. AFAIR, the udev rule timeout was much higher.
Here is the exact explanation from upstream NEWS: * udev: when udevd is started by systemd, processes which are left behind by forking them off of udev rules, are unconditionally cleaned up and killed now after the event handling has finished. Services or daemons must be started as systemd services. Services can be pulled-in by udev to get started, but they can no longer be directly forked by udev rules.
I'll take a close look here... when udev kills the background processes that were started via RUN, it could be a reason. But I currently don't believe it is udev related.
The issue is appearing when udev is being upgraded, not systemd, so "something" is going on when udev is being restarted, I think.
There is also something I don't understand: when I look at the journal trace on attachment #3 (with systemd.log_level=debug), it looks like the dhcpcd PID and output which is left running is not the one which was initially started:
Jun 06 15:58:25 bwiedemann-12 dhcpcd[750]: eth0: exiting
vs
Jun 06 16:07:30 bwiedemann-12 dhcpcd[1387]: eth0: received SIGTERM, stopping
It looks like two dhcpcd daemons were started somehow, with one "blocking" the other.
No, dhcpcd was started; after it got a lease it forked and continued in the background, while the parent exited [pid 750 above].
ok
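The fork-and-exit pattern described in the reply above is why the journal shows two different dhcpcd PIDs for a single start. A minimal Python sketch of that pattern follows; it is not dhcpcd's actual code, the function name is made up, and it assumes a POSIX system. Unlike a real daemon, the background child here exits right after reporting its PID.

```python
import os


def start_and_background():
    """Start a 'foreground' process that forks a background child and exits.

    Returns (foreground_pid, background_pid).
    """
    read_fd, write_fd = os.pipe()
    foreground = os.fork()
    if foreground == 0:
        # Foreground process: the one the init script launched
        # (the role of pid 750 in the log above).
        os.close(read_fd)
        if os.fork() == 0:
            # Background child: the process that would hold the lease.
            os.write(write_fd, str(os.getpid()).encode())
        # The foreground process exits here; in this sketch the
        # background child does too, once it has reported its PID.
        os._exit(0)
    os.close(write_fd)
    os.waitpid(foreground, 0)  # the started process is gone now
    chunks = []
    while True:
        chunk = os.read(read_fd, 64)
        if not chunk:  # EOF: every writer has closed the pipe
            break
        chunks.append(chunk)
    os.close(read_fd)
    return foreground, int(b"".join(chunks))
```

The two returned PIDs always differ, so a service manager that only remembers the PID it started sees that process exit even though its child is still doing the work.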
I'm also not 100% sure the issue is caused by active(exited) vs active(running), since upgrading the udev package doesn't cause any change to systemd's own configuration (no --daemon-reload nor --daemon-reexec in %scripts).
AFAIS, systemd/udev/migration hook kills all processes (using SIGTERM) from the "exited" cgroup. Maybe this happens because it removes background processes started by the RUN rule...?
Hmm, after looking carefully at udev %pre/%post scripts, "systemctl daemon-reload" is being called. It would be interesting to see if just calling "systemctl daemon-reload" is causing the issue (I doubt it).
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c71
--- Comment #71 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c72
Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c
Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c
Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c
Marius Tomaschewski
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c73
--- Comment #73 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c
Frederic Crozat
https://bugzilla.novell.com/show_bug.cgi?id=821879
https://bugzilla.novell.com/show_bug.cgi?id=821879#c74
Robert Milasan