[Bug 821879] New: udev update breaks network
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c0 Summary: udev update breaks network Classification: openSUSE Product: openSUSE 12.3 Version: Final Platform: x86-64 OS/Version: openSUSE 12.3 Status: NEW Severity: Critical Priority: P5 - None Component: Update Problems AssignedTo: bnc-team-screening@forge.provo.novell.com ReportedBy: bwiedemann@suse.com QAContact: jsrain@suse.com CC: meissner@suse.com, fcrozat@suse.com, rmilasan@suse.com Found By: Development Blocker: --- Reproducible: Always Steps to Reproduce: have a 12.3 x86_64 VM with eth0 configured for dhcp4 ssh to the VM and do zypper in udev-195-13.25.1 Actual Results: Installing: udev-195-13.25.1 ...............<50%>====================[\] in the middle of udev installation, network goes down, no dhcpcd running This is very bad for servers where you often have no easy way to do rcnetwork restart (which reports an error after 31s but still brings the dhcpcd and network back) -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c1 --- Comment #1 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-27 18:10:28 CEST --- Created an attachment (id=541408) --> (http://bugzilla.novell.com/attachment.cgi?id=541408) /var/log/messages network breakage is at 12:03:33 -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c3 --- Comment #3 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-27 18:48:18 CEST --- interestingly, this does not happen on all systems, but at least on those affected systems it can also be triggered again with zypper in -f udev -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c FeiXiang Zhang <fxzhang@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|bnc-team-screening@forge.pr |rmilasan@suse.com |ovo.novell.com | -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c4 Robert Milasan <rmilasan@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |NEEDINFO InfoProvider| |bwiedemann@suse.com --- Comment #4 from Robert Milasan <rmilasan@suse.com> 2013-05-28 07:17:24 UTC --- Does this happen on 12.2? Don't think is udev to blame here, but not sure. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c5 Bernhard Wiedemann <bwiedemann@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |NEW InfoProvider|bwiedemann@suse.com | --- Comment #5 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-28 09:53:34 CEST --- 12.2 is not affected with libudev0-182-4.33.1 at least it is the udev update breaking 12.3 networking could still be problems with systemd or other background magic that sends the SIGTERM to dhcpcd -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c6 --- Comment #6 from Marius Tomaschewski <mt@suse.com> 2013-05-28 08:01:51 UTC --- (In reply to comment #0)
in the middle of udev installation, network goes down, no dhcpcd running
"systemctl status network.service" shows it is stopped? Then I guess, systemd resolves some "Wants" dependencies and stops it or something like this. Please enable systemd debugging, so it is visible what systemd is doing (starting/stopping).
This is very bad for servers where you often have no easy way to do rcnetwork restart (which reports an error after 31s but still brings the dhcpcd and network back)
Sure. Which errors does it show exactly? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c7 --- Comment #7 from Robert Milasan <rmilasan@suse.com> 2013-05-28 08:05:12 UTC --- You are right, in 12.3 seems to be an issue, but I believe only when using ifup/ifdown. I've enabled 'set -x' in ifup/ifdown and try to restart the network: /etc/init.d/network restart 2013-05-28T03:53:10.346919-04:00 rmilasan-test-server network[11284]: + exit 7 2013-05-28T03:53:10.347523-04:00 rmilasan-test-server network[11284]: ..failed eth0 interface could not be set up until now 2013-05-28T03:53:10.356923-04:00 rmilasan-test-server network[11284]: ..failedSetting up service network . . . . . . . . . . . . ...failed 2013-05-28T03:53:10.359944-04:00 rmilasan-test-server systemd[1]: network.service: control process exited, code=exited status=7 2013-05-28T03:53:10.359955-04:00 rmilasan-test-server systemd[1]: Failed to start LSB: Configure network interfaces and set up routing. 2013-05-28T03:53:10.359958-04:00 rmilasan-test-server systemd[1]: Unit network.service entered failed state This error doesn't happen if not running manually /etc/init.d/network restart. Seems the the interface is there, but for some reason it didn't start again dhcpd, it got SIGTERM, but nothing made it to come up. BTW, how do I enable debugging in systemd while the system is running. I know for udev, but doesn't help, don't get anything. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c8 --- Comment #8 from Frederic Crozat <fcrozat@suse.com> 2013-05-28 08:11:36 UTC --- (In reply to comment #7)
BTW, how do I enable debugging in systemd while the system is running. I know for udev, but doesn't help, don't get anything.
SIGRTMIN+22 to PID 1, which means: kill -56 1 -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c9 --- Comment #9 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-28 10:34:04 CEST --- # systemctl status network.service network.service - LSB: Configure network interfaces and set up routing Loaded: loaded (/etc/init.d/network) Active: failed (Result: exit-code) since Tue, 2013-05-28 04:30:52 EDT; 2min 14s ago Process: 9698 ExecStart=/etc/init.d/network start (code=exited, status=7) CGroup: name=systemd:/system/network.service └ 1544 /sbin/dhcpcd --netconfig -L -E -HHH -c /etc/sysconfig/network/scripts/dhcpcd-hook -t 0 -h ... May 28 04:31:42 bwiedemann-cloud dhcpcd[1544]: eth0: adding IP address 10.122.160.102/21 May 28 04:31:42 bwiedemann-cloud dhcpcd[1544]: eth0: adding default route via 10.122.160.77 metric 0 May 28 04:31:42 bwiedemann-cloud dhcpcd[1544]: eth0: setting hostname to `bwiedemann-cloud' May 28 04:31:43 bwiedemann-cloud ifup[13413]: eth0 name: RTL-8139/8139C/8139C+ May 28 04:32:37 bwiedemann-cloud dhcpcd[1544]: eth0: renewing lease of 10.122.160.102 May 28 04:32:37 bwiedemann-cloud dhcpcd[1544]: eth0: leased 10.122.160.102 for 120 seconds May 28 04:32:37 bwiedemann-cloud dhcpcd[1544]: eth0: adding IP address 10.122.160.102/21 May 28 04:32:37 bwiedemann-cloud dhcpcd[1544]: eth0: adding default route via 10.122.160.77 metric 0 May 28 04:32:37 bwiedemann-cloud dhcpcd[1544]: eth0: setting hostname to `bwiedemann-cloud' May 28 04:32:37 bwiedemann-cloud ifup[13703]: eth0 name: RTL-8139/8139C/8139C+ During network restart, /var/log/messages get such entries every second: 2013-05-28T04:26:17.199416-04:00 bwiedemann-cloud systemd[1]: Accepted connection on private bus. 2013-05-28T04:26:17.200410-04:00 bwiedemann-cloud systemd[1]: Got D-Bus request: org.freedesktop.DBus.Properties.GetAll() on /org/freedesktop/systemd1/unit/network_2eservice 2013-05-28T04:26:17.201079-04:00 bwiedemann-cloud systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local 2013-05-28T04:26:17.256321-04:00 bwiedemann-cloud systemd[1]: Received SIGCHLD from PID 8409 (ifstatus-route). 2013-05-28T04:26:17.256333-04:00 bwiedemann-cloud systemd[1]: Got SIGCHLD for process 8409 (ifstatus-route) 2013-05-28T04:26:17.256336-04:00 bwiedemann-cloud systemd[1]: Child 8409 died (code=exited, status=1/FAILURE) calling /etc/sysconfig/network/scripts/ifstatus-route eth0 directly in a root shell did not return an error -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c10 Robert Milasan <rmilasan@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED --- Comment #10 from Robert Milasan <rmilasan@suse.com> 2013-05-28 08:48:33 UTC --- Was thinking and maybe this is also one of the issue, stopping and starting udev. What if we only restart udevd (nothing else) at %post ? Will try this out now and will let you know. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c11 --- Comment #11 from Robert Milasan <rmilasan@suse.com> 2013-05-28 09:07:22 UTC --- OK, looks like my idea with the restart only didn't help. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c12 --- Comment #12 from Robert Milasan <rmilasan@suse.com> 2013-05-28 11:39:19 UTC --- OK, after some investigation looks like (at least to me) that there is a 'remove' event, but not an 'add|change' event from kernel to udev. The other weird part is that eth0 never really goes away, meaning it looks active, usually if you do: 'ifconfig eth0 down' then run only 'ifconfig' doesn't show 'eth0', but in our case it's always there without of course an ip address as dhcpcd is not running. Also, this can be reproduced in Virtualbox by installing normal openSUSE 12.3 machine and using ifup/ifdown instead of NetworkManager, but in Virtualbox you can restart the network without any actual errors. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c13 --- Comment #13 from Robert Milasan <rmilasan@suse.com> 2013-05-28 11:49:12 UTC --- Also as a note here (forgot to write before), running: zypper in -f udev # re-install udev, network goes down, eth0 doesn't have an ip address /etc/init.d/network restart # restarted network, ip address assign to eth0 zypper in -f udev # re-install udev, network up and running, ip address still assign to eth0 The second time re-installing udev on a running system doesn't kill dhcpcd and the network is up and running normally. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c14 --- Comment #14 from Robert Milasan <rmilasan@suse.com> 2013-05-28 11:58:32 UTC --- Again another note here: If we switch from dhcpcd to dhclient in /etc/sysconfig/network/dhcp the problem can't be reproduced. opensuse:~ # cat /etc/sysconfig/network/dhcp|grep ^DHCLIENT_BIN DHCLIENT_BIN="dhclient" opensuse:~ # cat /etc/sysconfig/network/dhcp|grep ^DHCLIENT_DEBUG DHCLIENT_DEBUG="yes" But debug mode doesn't seem to work, at least when using dhclient. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c15 --- Comment #15 from Robert Milasan <rmilasan@suse.com> 2013-05-28 12:06:20 UTC --- OK, the debug works (somewhat). I was thinking that the option DHCLIENT_DEBUG means to enable debugging in dhclient or dhcpcd, turns out is debugging for ifup/ifdown (set -x). -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c16 --- Comment #16 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-28 14:20:28 CEST --- found something: rcnetwork status has eth0 is up, but ipv6 duplicate address check failed dead but IPv6 is not actually used in that network yet, when I do ping6 -Ieth0 fe80::1234 connect: Cannot assign requested address and dmesg has [ 449.275203] IPv6: eth0: IPv6 duplicate address fe80::f816:3eff:fe3c:f792 detected! though nothing goes over the network. Can this be a kernel problem? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c17 --- Comment #17 from Marius Tomaschewski <mt@suse.com> 2013-05-28 12:57:16 UTC --- (In reply to comment #14)
Again another note here:
If we switch from dhcpcd to dhclient in /etc/sysconfig/network/dhcp the problem can't be reproduced.
opensuse:~ # cat /etc/sysconfig/network/dhcp|grep ^DHCLIENT_BIN DHCLIENT_BIN="dhclient"
opensuse:~ # cat /etc/sysconfig/network/dhcp|grep ^DHCLIENT_DEBUG DHCLIENT_DEBUG="yes" But debug mode doesn't seem to work, at least when using dhclient.
It causes to writes debug to /var/log/dhclient-script.$interface.log. Bernhard, We have to find out what's the problem with the network first before we continue to check what happens while "zypper in -f udev". Please stop the network, set DEBUG=EXTRA in /etc/sysconfig/network/config (bash -vx of all scripts), then start the network and attach the tar archive created by: "tar cvjpf /tmp/network-debug.tar.bz2 /dev/.sysconfig/network" On 12.3-GA, there is sysconfig-0.80.5-1.2.1. In updates, there is sysconfig-0.80.5-1.5.1, that provides a fix in this area ... but perhaps the fix does not work properly: * Mi Mär 20 2013 mt@suse.de - Fixed to wait for dhcp/ipv6 under systemd again. Fixed regression caused by bnc#785240, bnc#780644 fixes to not discard the dhcp/ipv6 dupplicate address detection in progress error codes under systemd completely, but wait until dhcp/ipv6 dad finished or the WAIT_FOR_INTERFACES timeout is reached and then discard in the status returned to systemd (bnc#808718). It caused failures of other services trying to bind tentative IPv6 addresses, e.g. in mixed dhcp4 / static IPv6 setups. Thanks to Rolf Eike Beer for the report/tests/debug outputs. [0001-Fixed-to-wait-for-dhcp-ipv6-under-systemd-again.patch] Which sysconfig are you using? When the newer one, please install the older version and try again. When this solves the rcnetwork errors, there is a bug in the fix. Please provide DEBUG=EXTRA output of the version causing it or from both, but please make sure to remove all exdeb.* files from older runs first. (In reply to comment #16)
found something: rcnetwork status has eth0 is up, but ipv6 duplicate address check failed dead
but IPv6 is not actually used in that network
No, but duplicate address detection is always done for IPv6 and you seem to have another machine in the network using same MAC address as yours. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c18 Bernhard Wiedemann <bwiedemann@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |bwiedemann@suse.com, | |jfehlig@suse.com --- Comment #18 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-29 12:33:07 CEST --- I did echo net.ipv6.conf.eth0.accept_dad=0 >> /etc/sysctl.conf reboot and then, rcnetwork restart would succeed after 3.5s instead of failing after 31s ; and I could ping6 but after next boot I still get duplicate address errors maybe because network is initialized before sysctl.conf is used? sysctl downgrade did not really help either I think I found it - the VM host interface is using the same MAC as the VM: # ifconfig vnet3 vnet3 Link encap:Ethernet HWaddr FE:16:3E:23:F6:4D inet6 addr: fe80::fc16:3eff:fe23:f64d/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:4832 errors:0 dropped:0 overruns:0 frame:0 TX packets:7310 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:500 RX bytes:530237 (517.8 Kb) TX bytes:12089457 (11.5 Mb) # ip a ... 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether fa:16:3e:23:f6:4d brd ff:ff:ff:ff:ff:ff inet 10.122.160.103/21 brd 10.122.160.255 scope global eth0 inet6 fe80::f816:3eff:fe23:f64d/64 scope link tentative dadfailed valid_lft forever preferred_lft forever so there really is a MAC-addr collision. normally, the vnet3 interface should not have an IP addr at all, because it is enslaved in the bridge I tried on the host ip addr del fe80::fc16:3eff:fe23:f64d/64 dev vnet3 but it did not help. IMHO, the thing to do is to use different MAC addrs for interfaces on the host-side and the VM-side because those are really two interfaces interconnected by a virtual cross-over cable. ifconfig vnet3 hw ether FE:16:3E:23:F6:4E did not help either though -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c19 --- Comment #19 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-29 12:45:16 CEST --- err. host MAC starts with FE but VM's with FA so no MAC / ipv6 collision there could still be a kernel bug http://ipv6-or-no-ipv6.blogspot.de/2013/02/ipv6-duplicate-address-in-linux.h... tcpdump (on the VM's host) shows the NDP which is used for duplicate addr detection 10:09:25.650943 00:26:f0:11:00:00 > 01:00:0c:cc:cc:cd SNAP Unnumbered, ui, Flags [Command], length 50 10:09:27.604912 IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28 10:09:27.604936 IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28 10:09:27.650957 00:26:f0:11:00:00 > 01:00:0c:cc:cc:cd SNAP Unnumbered, ui, Flags [Command], length 50 10:09:27.657902 IP6 :: > ff02::1:ff23:f64d: ICMP6, neighbor solicitation, who has fe80::f816:3eff:fe23:f64d, length 24 10:09:27.657911 IP6 :: > ff02::1:ff23:f64d: ICMP6, neighbor solicitation, who has fe80::f816:3eff:fe23:f64d, length 24 10:09:27.919212 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from fa:16:3e:23:f6:4d, length 300 10:09:27.919243 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from fa:16:3e:23:f6:4d, length 300 10:09:27.919387 IP 10.122.160.97.67 > 10.122.160.103.68: BOOTP/DHCP, Reply, length 300 interestingly, sent packages are shown twice -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c20 Bernhard Wiedemann <bwiedemann@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC|jfehlig@suse.com |kernel-maintainers@forge.pr | |ovo.novell.com --- Comment #20 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-29 16:25:26 CEST --- http://www.spinics.net/lists/netdev/msg127648.html is the underlying kernel problem and there must be something in the network (e.g. not properly tagged VLANs) that duplicates those multicasts -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c21 Bernhard Wiedemann <bwiedemann@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mrueckert@suse.com --- Comment #21 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-29 16:44:01 CEST --- It looks as if the duplication happens in the bridge interface # tcpdump -s 1600 -epni vnet3 icmp6 tcpdump: WARNING: vnet3: no IPv4 address assigned tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on vnet3, link-type EN10MB (Ethernet), capture size 1600 bytes 14:37:23.392257 fa:16:3e:23:f6:4d > 33:33:ff:23:f6:4d, ethertype IPv6 (0x86dd), length 78: :: > ff02::1:ff23:f64d: ICMP6, neighbor solicitation, who has fe80::f816:3eff:fe23:f64d, length 24 14:37:23.392279 fa:16:3e:23:f6:4d > 33:33:ff:23:f6:4d, ethertype IPv6 (0x86dd), length 78: :: > ff02::1:ff23:f64d: ICMP6, neighbor solicitation, who has fe80::f816:3eff:fe23:f64d, length 24 # tcpdump -s 1600 -epni br563 icmp6 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on br563, link-type EN10MB (Ethernet), capture size 1600 bytes 14:37:23.392257 fa:16:3e:23:f6:4d > 33:33:ff:23:f6:4d, ethertype IPv6 (0x86dd), length 78: :: > ff02::1:ff23:f64d: ICMP6, neighbor solicitation, who has fe80::f816:3eff:fe23:f64d, length 24 The two packets are only 22 microseconds apart which is about the time a ping on 127.0.0.1 takes but much less than anything over the network. also brctl showmacs br563 |grep 33:33:ff:23:f6:4d does not return anything, so I guess the default bridge behaviour is to pass the packet to all ports - but should that really include the port it came from? Host kernel is 3.0.74-0.6.8-default -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c22 Marius Tomaschewski <mt@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEEDINFO InfoProvider| |jbohac@suse.com --- Comment #22 from Marius Tomaschewski <mt@suse.com> 2013-05-30 10:54:22 UTC --- (In reply to comment #18)
I did echo net.ipv6.conf.eth0.accept_dad=0 >> /etc/sysctl.conf reboot
and then, rcnetwork restart would succeed after 3.5s instead of failing after 31s ; and I could ping6 but after next boot I still get duplicate address errors maybe because network is initialized before sysctl.conf is used?
Move it to /etc/sysconfig/network/ifsysctl -- "man 5 ifsysctl"; this one is applied a) by udev, b) by ifup. (In reply to comment #19)
err. host MAC starts with FE but VM's with FA
yes, they differ. but as they're generated/random, another vm on another host still may have it.
could still be a kernel bug http://ipv6-or-no-ipv6.blogspot.de/2013/02/ipv6-duplicate-address-in-linux.h...
This could be. You aren't using bonding, are you? Linux bonding AFAIR sends out dad through each port. The ipv6 code is IMO not aware of and detects dups when it gets two answers. It basically does not check received > sent, but received > 1. See also bug#715430. Workaround is to set "disable_ipv6 = 1" or at least "accept_dad = 0" an bridge-ports & bond-slaves [where no IPv6 addr is assigned anyway]. (In reply to comment #21)
It looks as if the duplication happens in the bridge interface [...] The two packets are only 22 microseconds apart which is about the time a ping on 127.0.0.1 takes but much less than anything over the network.
also brctl showmacs br563 |grep 33:33:ff:23:f6:4d ^^^^^^^^^^^^^^^^^
This (33:33:*) is a IPv6 multicast address mapped to ethernet MAC.
does not return anything, so I guess the default bridge behaviour is to pass the packet to all ports - but should that really include the port it came from?
Host kernel is 3.0.74-0.6.8-default
I've not noticed such behavior on my hosts + vm's using same kernel. Jiri, can you take a look / do you have an idea what goes wrong here? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c23 --- Comment #23 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-31 17:06:56 CEST --- after reading linux/net/bridge/br_forward.c should_deliver function I finally managed to track it down to the "hairpin_mode" causing the trouble and it turns out that other people also hit this: https://bugs.launchpad.net/nova/+bug/1011134 so we really have 3 bugs: 1. VM interfaces on compute-hosts of SUSE Cloud 1.0 / OpenStack have hairpin_mode on, making them receive back their own traffic 2. with the Host's bridge treating multicast like broadcast, VM's Linux IPv6 duplicate address detection receives its own sent packet and assumes some other host is using the same addr 3. 12.3's systemd assumes this means a failure in the network script and on udev upgrade the network is killed Steps to Reproduce: have a br0 with tap0 enslaved echo 1 > /sys/class/net/br0/brif/tap0/hairpin_mode qemu-kvm -net nic,macaddr=56:00:56:76:54:32 -net tap,ifname=tap0,script=no,downscript=no -m 1000 -hda test.img -monitor stdio -vnc :9 and that still happens with a 12.3 Host and 3.7.10 kernel -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c24 Jiri Bohac <jbohac@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |ASSIGNED CC| |jbohac@suse.com InfoProvider|jbohac@suse.com | --- Comment #24 from Jiri Bohac <jbohac@suse.com> 2013-05-31 18:30:55 CEST --- (In reply to comment #23)
https://bugs.launchpad.net/nova/+bug/1011134
so we really have 3 bugs:
1. VM interfaces on compute-hosts of SUSE Cloud 1.0 / OpenStack have hairpin_mode on, making them receive back their own traffic
I think you're correct. My understanding of VEPA is that the hairpin mode is there to allow the bridge to act as an uplink to another VEPA-mode bridge connected to the specific port. Setting the hairpin mode on a VSI is wrong. The above mentioned Ubuntu bug, has a link to an openstack "fix" for this: https://review.openstack.org/#/c/14017/ Instead of not setting the hairpin mode on the VSIs, they set up a packet filter to work around the problem. Am I missing something?
2. with the Host's bridge treating multicast like broadcast, VM's Linux IPv6 duplicate address detection receives its own sent packet and assumes some other host is using the same addr
Multicasts are treated like broadcasts by L2 switches - I don't see what's wrong with that. They just should not be looped back to the port they came from, which is what the hairpin misconfiguration causes. David Miller has rejected a patch trying to work around such a network misconfiguration: http://www.spinics.net/lists/netdev/msg127696.html I agree. I think it's a good thing that DAD detects duplicate addresses caused by a duplicate MAC address - something quite likely to happen with virtual machines.
3. 12.3's systemd assumes this means a failure in the network script and on udev upgrade the network is killed
Yes, this looks incorrect. For example, one may have more than one IPv6 addresses configured. One of them failing DAD should not be a reason to deconfigure the network -- the other addresses can still be used, IPv4 can be used. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c25 --- Comment #25 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-05-31 19:07:03 CEST --- 1. yes, IMHO upstream did an ugly workaround (and will even leave all other broadcasts coming back) using git blame on openstack nova code showed that this was introduced for https://bugs.launchpad.net/nova/+bug/933640 https://github.com/openstack/nova/commit/b61e1ea12cd41ea507b1f6496ec1413c93b... mentioning the need to reach your own floating IP that is only NATed on the host but I just tested that and it seems to not be working anyway so we could just as well disable hairpin_mode to avoid all associated trouble 2. if there is a duplicate MAC address - shouldn't it then also refuse to do IPv4 networking? I guess our duplicate address detection code is less picky there. (admitted: IPv4 does more things in userspace than IPv6) -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c26 Bernhard Wiedemann <bwiedemann@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC|kernel-maintainers@forge.pr | |ovo.novell.com | --- Comment #26 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-06-05 09:19:06 CEST --- It seems the original problem about udev/network is unrelated to the duplicate address detection from openstack as it still randomly happens sometimes after workarounding it. Thus I split the OpenStack issue into a separate bug 823345 . When it happens, the network status is not "active (running)" but "active (exited)" redirecting to "systemctl status network.service" network.service - LSB: Configure network interfaces and set up routing Loaded: loaded (/etc/init.d/network) Active: active (exited) since Wed, 2013-06-05 09:02:54 EDT; 57s ago Process: 384 ExecStart=/etc/init.d/network start (code=exited, status=0/SUCCESS) CGroup: name=systemd:/system/network.service └ 1500 /sbin/dhcpcd --netconfig -L -E -HHH -c /etc/syscon Jun 05 09:02:54 bwiedemann-udevtest ifup[1446]: eth0 name: RTL-8139/813 Jun 05 09:02:54 bwiedemann-udevtest dhcpcd[1036]: eth0: exiting Jun 05 09:02:54 bwiedemann-udevtest network[384]: eth0 Starting DHCP4 cl Jun 05 09:02:54 bwiedemann-udevtest ifup-dhcp[1512]: Jun 05 09:02:54 bwiedemann-udevtest network[384]: eth0 IP address: 10.12 Jun 05 09:02:54 bwiedemann-udevtest ifup-dhcp[1514]: Jun 05 09:02:54 bwiedemann-udevtest ifup-dhcp[1517]: eth0 IP address: 10 Jun 05 09:02:54 bwiedemann-udevtest ifup-dhcp[1520]: eth0 IP address: 10 Jun 05 09:02:54 bwiedemann-udevtest network[384]: ..doneSetting up service ne Jun 05 09:02:54 bwiedemann-udevtest systemd[1]: Started LSB: Configure netwo -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c27 --- Comment #27 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-06-05 09:41:36 CEST --- Created an attachment (id=542777) --> (http://bugzilla.novell.com/attachment.cgi?id=542777) journal it seems on boot there are two dhcpcd processes running at the same time ifup-dhcp[1033]: Starting DHCP4 client on eth0 dhcpcd[1038]: eth0: dhcpcd 3.2.3 starting dhcpcd[1038]: eth0: hardware address = fa:16:3e:23:e2:65 dhcpcd[1038]: eth0: broadcasting for a lease dhcpcd[1038]: eth0: offered 10.122.160.107 from 10.122.160.97 dhcpcd[1038]: eth0: checking 10.122.160.107 is available on attached networks ifup-dhcp[1075]: eth0 Starting DHCP4 client dhcpcd[1077]: eth0: dhcpcd already running on pid 1038 (/var/run/dhcpcd-eth0.pid) dhcpcd[1077]: eth0: exiting ifup-dhcp[1094]: . ifup-dhcp[1113]: . dhcpcd[1038]: eth0: leased 10.122.160.107 for 120 seconds dhcpcd[1038]: eth0: adding IP address 10.122.160.107/21 dhcpcd[1038]: eth0: adding default route via 10.122.160.97 metric 0 ifdown[1354]: eth0 name: RTL-8139/8139C/8139C+ dhcpcd[1038]: eth0: setting hostname to `bwiedemann-udevtest' ifup[1448]: eth0 name: RTL-8139/8139C/8139C+ dhcpcd[1038]: eth0: exiting network[380]: eth0 Starting DHCP4 client. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c28 --- Comment #28 from Marius Tomaschewski <mt@suse.com> 2013-06-05 08:14:39 UTC --- (In reply to comment #27)
Created an attachment (id=542777) --> (http://bugzilla.novell.com/attachment.cgi?id=542777) [details] journal
it seems on boot there are two dhcpcd processes running at the same time
No, there are multiple stages: dhcpcd gets a lease and calls dhcpcd-hook with "up", "new", "down", "complete" [which results in "ifdown -o dhcp", "netconfig" and "ifup -o dhcp"]. Then it forks. Basically the problem is, that when /etc/init.d/network reports !0 to systemd, systemd will stop it again and not start services requiring it. When it does not report failure on error conditions, other services will fail. For IPv6 it is ON by default and when there is a dup detected on the link local address, IPv6 is basically not usable on this interface (multicasts, ... also a dhcpv6 client which requires it will fail). Further, there is no check which address fails or if other services are using (binding it) or not. Either everything worked or not. As the failure happens on a mandatory interface, rcnetwork fails. You can a) disable dad via ifsysctl file, b) disable IPv6 (per interface). (In reply to comment #25)
2. if there is a duplicate MAC address - shouldn't it then also refuse to do IPv4 networking? I guess our duplicate address detection code is less picky there. (admitted: IPv4 does more things in userspace than IPv6)
For IPv4 there is no duplicate address detection done by default for static IPs (see CHECK_DUPLICATE_IP & SEND_GRATUITOUS_ARP variables in network/config), but dhcpcd makes it (should be the "checking .. is available on attached networks" msg). -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c29 --- Comment #29 from Frederic Crozat <fcrozat@suse.com> 2013-06-05 08:25:12 UTC --- it might be worth shipping a .service for the "ifup" way of network, which can then tell systemd to accept additional error code (SuccessExitStatus,RestartPreventExitStatus, see man systemd.service) -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c30 --- Comment #30 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-06-06 15:24:15 CEST --- (In reply to comment #28)
For IPv6 it is ON by default and when there is a dup detected on the link local address, IPv6 is basically not usable on this interface (multicasts, ... also a dhcpv6 client which requires it will fail). Further, there is no check which address fails or if other services are using (binding it) or not. Either everything worked or not. As the failure happens on a mandatory interface, rcnetwork fails.
You can a) disable dad via ifsysctl file, b) disable IPv6 (per interface).
IPv6 dad is no more failing, as I workarounded this on the openstack side. So rcnetwork status always shows as green, but the udev bug can still occur. Also eth0 is set to DHCP4 so no dhcpv6 client should run.
(In reply to comment #25)
2. if there is a duplicate MAC address - shouldn't it then also refuse to do IPv4 networking? I guess our duplicate address detection code is less picky there. (admitted: IPv4 does more things in userspace than IPv6)
For IPv4 there is no duplicate address detection done by default for static IPs (see CHECK_DUPLICATE_IP & SEND_GRATUITOUS_ARP variables in network/config), but dhcpcd makes it (should be the "checking .. is available on attached networks" msg).
Does dhcpcd react to receiving its own gratuitous ARP packet? How? What is the difference in meaning of "active (running)" and "active (exited)" systemd/network states? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c31 --- Comment #31 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-06-06 16:36:26 CEST --- Created an attachment (id=543109) --> (http://bugzilla.novell.com/attachment.cgi?id=543109) journal with systemd debug This happens completely independent of Cloud and hairpin_mode, too. Steps to reproduce: cd /dev/shm wget http://openqa.opensuse.org/openqa/img/12.3-mini64.qcow2 # 369MB qemu-kvm -drive if=virtio,file=12.3-mini64.qcow2 root linux rcnetwork status # will show "active (exited)" zypper -n in -f udev ifconfig # will show eth0 without inet addr Could it be that multiple ifup eth0 processes are spawned? When running this with -no-kvm option, bootup was so slow that rcnetwork status showed 3 different PIDs in lines of /bin/bash /sbin/ifup eth0 -o rc onboot and even an ifdown in between? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c32 --- Comment #32 from Marius Tomaschewski <mt@suse.com> 2013-06-07 04:32:19 UTC --- (In reply to comment #30)
(In reply to comment #28)
For IPv6 it is ON by default and when there is a dup detected on the link local address, IPv6 is basically not usable on this interface (multicasts, ... also a dhcpv6 client which requires it will fail). Further, there is no check which address fails or if other services are using (binding it) or not. Either everything worked or not. As the failure happens on a mandatory interface, rcnetwork fails.
You can a) disable dad via ifsysctl file, b) disable IPv6 (per interface).
IPv6 dad is no more failing, as I workarounded this on the openstack side.
OK.
So rcnetwork status always shows as green, but the udev bug can still occur.
Only when some error occurs and duplicate addresses are definitely errors. The status isn't checked per address, but per interface and rcnetwork reports failure when there is an error on mandatory interfaces. You can use STARTMODE=hotplug or ifplugd for all "nice to have" interfaces.
Also eth0 is set to DHCP4 so no dhcpv6 client should run.
When BOOTPROTO=dhcp4 is set, dhcp6 will be not started/used, but this does not mean IPv6 is disabled -- it is enabled by default (static and autoconf). You can disable a) DAD or b) IPv6 at all via ifsysctl or set IPV6_DAD_WAIT="0" to disable it the check (each one per-interface or globally).
(In reply to comment #25)
2. if there is a duplicate MAC address - shouldn't it then also refuse to do IPv4 networking? I guess our duplicate address detection code is less picky there. (admitted: IPv4 does more things in userspace than IPv6)
For IPv4 there is no duplicate address detection done by default for static IPs (see CHECK_DUPLICATE_IP & SEND_GRATUITOUS_ARP variables in network/config), but dhcpcd makes it (should be the "checking .. is available on attached networks" msg).
Does dhcpcd react to receiving its own gratuitous ARP packet? How?
I don't know/remember what it exactly does here.
What is the difference in meaning of "active (running)" and "active (exited)" systemd/network states?
/etc/init.d/network is using "X-Systemd-RemainAfterExit: true" LSB tag: RemainAfterExit= Takes a boolean value that specifies whether the service shall be considered active even when all its processes exited. Defaults to no. "active (exited)" means the network.service cgroup is empty. You'll see this in static configurations (systemctl status network.service). "active (running)" means, there is some process in the cgroup, that is in e.g. dhcp mode where dhcp client is running. Are you using dhcp and get "active (exited)"? Did the dhcpcd die? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c33 --- Comment #33 from Marius Tomaschewski <mt@suse.com> 2013-06-07 04:51:37 UTC --- (In reply to comment #31)
Steps to reproduce: cd /dev/shm wget http://openqa.opensuse.org/openqa/img/12.3-mini64.qcow2 # 369MB qemu-kvm -drive if=virtio,file=12.3-mini64.qcow2
Going to fetch it...
root linux rcnetwork status # will show "active (exited)"
No dhcpcd in the cgroup?!
zypper -n in -f udev ifconfig # will show eth0 without inet addr
The problem seems to be here: Jun 06 15:58:36 bwiedemann-12 systemd[1]: Received SIGCHLD from PID 2204 (ifstatus-route). Jun 06 15:58:36 bwiedemann-12 systemd[1]: Got SIGCHLD for process 2204 (ifstatus-route) Jun 06 15:58:36 bwiedemann-12 systemd[1]: Child 2204 died (code=exited, status=1/FAILURE)
Could it be that multiple ifup eth0 processes are spawned? When running this with -no-kvm option, bootup was so slow that rcnetwork status showed 3 different PIDs in lines of /bin/bash /sbin/ifup eth0 -o rc onboot
and even an ifdown in between?
ifdown is usually from "ifdown eth0 -o dhcp" -- started for dhcp post processing. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c34 --- Comment #34 from Marius Tomaschewski <mt@suse.com> 2013-06-07 05:46:30 UTC --- (In reply to comment #32)
What is the difference in meaning of "active (running)" and "active (exited)" systemd/network states?
"active (exited)" means the network.service cgroup is empty. You'll see this in static configurations (systemctl status network.service). "active (running)" means, there is some process in the cgroup, that is in e.g. dhcp mode where dhcp client is running.
Wrong. "active (exited)" means, the /etc/init.d/network (network.service) is done; it exited with the code visible in the status. "active (running)" means, the /etc/init.d/network (network.service) is still running. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c35 --- Comment #35 from Marius Tomaschewski <mt@suse.com> 2013-06-07 06:15:56 UTC --- Created an attachment (id=543261) --> (http://bugzilla.novell.com/attachment.cgi?id=543261) snapshot3 (In reply to comment #32)
IPv6 dad is no more failing, as I workarounded this on the openstack side.
And how? For me, everything works fine using your VM on my host / network. Network works fine and systemd does not trigger restart while "zypper in -f udev": 08:02:09.641545-04:00 bwiedemann-12 systemd[1]: Stopping udev Control Socket. 08:02:09.641869-04:00 bwiedemann-12 systemd[1]: Closed udev Control Socket. 08:02:09.642369-04:00 bwiedemann-12 systemd[1]: Stopping udev Kernel Socket. 08:02:09.642600-04:00 bwiedemann-12 systemd[1]: Closed udev Kernel Socket. 08:02:09.643468-04:00 bwiedemann-12 systemd[1]: Stopping udev Kernel Device Manager... 08:02:09.644205-04:00 bwiedemann-12 systemd[1]: Stopped udev Kernel Device Manager. 08:02:09.644598-04:00 bwiedemann-12 systemd[1]: Stopped udev Kernel Device Manager. 08:02:09.747782-04:00 bwiedemann-12 systemd[1]: Reloading. 08:02:09.762553-04:00 bwiedemann-12 systemd[1]: Starting udev Kernel Socket. 08:02:09.762880-04:00 bwiedemann-12 systemd[1]: Listening on udev Kernel Socket. 08:02:09.763129-04:00 bwiedemann-12 systemd[1]: Starting udev Control Socket. 08:02:09.763413-04:00 bwiedemann-12 systemd[1]: Listening on udev Control Socket. 08:02:09.763656-04:00 bwiedemann-12 systemd[1]: Starting udev Kernel Device Manager... 08:02:09.771822-04:00 bwiedemann-12 systemd[1]: Started udev Kernel Device Manager. 08:02:10.067863-04:00 bwiedemann-12 kernel: [ 55.352623] device-mapper: uevent: version 08:02:10.067873-04:00 bwiedemann-12 kernel: [ 55.352847] device-mapper: ioctl: 4.23.0-io 08:02:12.343079-04:00 bwiedemann-12 cloud-init[1799]: 2013-06-07 08:02:12,342 - util.py[WA 08:02:15.243755-04:00 bwiedemann-12 kernel: [ 60.528958] SGI XFS with ACLs, security att 08:02:15.247743-04:00 bwiedemann-12 kernel: [ 60.532965] JFS: nTxBlock = 3928, nTxLock = 08:02:15.261746-04:00 bwiedemann-12 kernel: [ 60.546700] QNX4 filesystem 0.2.3 registere 08:02:15.274808-04:00 bwiedemann-12 kernel: [ 60.559710] Btrfs loaded 08:02:15.277978-04:00 bwiedemann-12 systemd[1]: Mounted FUSE Control File System. 08:02:15.278746-04:00 bwiedemann-12 kernel: [ 60.563293] fuse init (API version 7.20) 08:02:15.299619-04:00 bwiedemann-12 os-prober: debug: /dev/vda1: is active swap Please set DEBUG=EXTRA in /etc/sysconfig/network/config and provide the output: "tar cvjpf /tmp/network-debug.tar.bz2 /dev/.sysconfig/network" -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c Marius Tomaschewski <mt@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEEDINFO InfoProvider| |bwiedemann@suse.com -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c36 --- Comment #36 from Marius Tomaschewski <mt@suse.com> 2013-06-07 06:35:29 UTC --- (In reply to comment #29)
it might be worth shipping a .service for the "ifup" way of network, which can then tell systemd to accept additional error code (SuccessExitStatus,RestartPreventExitStatus, see man systemd.service)
BTW: Basically the only possibility to avoid all this trouble I see -- except of comment 29 -- is to use something like this: --- a/scripts/network +++ b/scripts/network @@ -906,6 +906,8 @@ case "$ACTION" in reload_firewall + # do not report any errors to systemd + test "$SD_RUNNING" = "yes" && rc_reset ;; stop) But is this really something we should use?! Frederic, would it be possible to implement: X-Systemd-RestartPreventExitStatus: ... X-Systemd-SuccessExitStatus: ... ? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c37 Bernhard Wiedemann <bwiedemann@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |ASSIGNED InfoProvider|bwiedemann@suse.com | --- Comment #37 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-06-07 08:40:49 CEST --- Marius: how often did you try? did you run the disk from tmpfs? on a fast host? on tmpfs I get the bug >90% of the time but on HDD I only got it on 2 of 5 tries. I think there is a race condition happening and that depends on the speed of the system. The three test systems on which I reproduced the bug use DDR3 RAM and 4 cores if there was really an error in the networking, we still want it reported -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c38 --- Comment #38 from Frederic Crozat <fcrozat@suse.com> 2013-06-07 06:44:55 UTC --- (In reply to comment #36)
Frederic, would it be possible to implement: X-Systemd-RestartPreventExitStatus: ... X-Systemd-SuccessExitStatus: ...
?
No, it is a wrong idea. for systemd, restart is stop + start. If we really need something, you should just create a network.service file in /usr/lib/systemd/system which will call /etc/init.d/network in ExecStart / ExecStop / ExecReload. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c39 --- Comment #39 from Marius Tomaschewski <mt@suse.com> 2013-06-07 07:06:24 UTC --- (In reply to comment #37)
Marius: how often did you try? did you run the disk from tmpfs? on a fast host?
6 times on a i7 box with image on SSD.
I think there is a race condition happening and that depends on the speed of the system.
This could be. I'll retry on tmpfs. Could you provide the DEBUG=EXTRA outputs please? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c40 --- Comment #40 from Marius Tomaschewski <mt@suse.com> 2013-06-07 07:25:44 UTC --- Bernhard, with img on tmpfs I've got it once - dhcpcd were stopped after -- but there were no EXTRA logs from network.. why? Further, there are also other services which fail, e.g. boot.local. Perhaps this is the reason... I'll retry next week. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c41 --- Comment #41 from Marius Tomaschewski <mt@suse.com> 2013-06-07 07:26:45 UTC --- (In reply to comment #39)
Could you provide the DEBUG=EXTRA outputs please?
Together with debug logs from systemd please. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c42 --- Comment #42 from Marius Tomaschewski <mt@suse.com> 2013-06-07 07:34:41 UTC --- Created an attachment (id=543274) --> (http://bugzilla.novell.com/attachment.cgi?id=543274) mt-snapshot4 Hmm... from tmpfs all sort of things are failing. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c43 --- Comment #43 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-06-13 08:17:47 CEST --- Marius, any news on this issue? Can you reproduce it? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c44 Dirk Mueller <dmueller@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Priority|P5 - None |P2 - High --- Comment #44 from Dirk Mueller <dmueller@suse.com> 2013-07-10 10:51:24 CEST --- Hit the issue as well. It is certainly inconvenient having a remote server go down when installing updates. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c45 --- Comment #45 from Marius Tomaschewski <mt@suse.com> 2013-07-10 13:18:00 UTC --- Created an attachment (id=547421) --> (http://bugzilla.novell.com/attachment.cgi?id=547421) [PATCH] Disard /etc/init.d/network start errors under systemd (In reply to comment #29)
(SuccessExitStatus,RestartPreventExitStatus, see man systemd.service)
It is IMO simplier to catch it in "rcnetwork start", where it is reported. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c46 Marius Tomaschewski <mt@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEEDINFO InfoProvider| |bwiedemann@suse.com --- Comment #46 from Marius Tomaschewski <mt@suse.com> 2013-07-10 13:20:42 UTC --- Bernhard, Dirk, does it work for you? Package is in $OBS/home:mtomaschewski:branches:openSUSE:12.3:Update/sysconfig -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c47 Bernhard Wiedemann <bwiedemann@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |ASSIGNED InfoProvider|bwiedemann@suse.com | --- Comment #47 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-07-10 18:39:12 CEST --- I installed http://download.opensuse.org/repositories/home:/mtomaschewski:/branches:/ope... and still could reproduce the problem as described in comment 31 with network status showing active (exited) and udev install killing network -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c48 --- Comment #48 from Marius Tomaschewski <mt@suse.com> 2013-07-11 10:54:54 UTC --- Created an attachment (id=547568) --> (http://bugzilla.novell.com/attachment.cgi?id=547568) 20130711-mt-test-1_3 output archive Yes, the network sometimes has "active (exited)" and sometimes "active (running)" state after booting -- although it reports 0 to systemd, dhcpcd is still running [so definitely not in exited] in both cases and everything with it looks fine. Anything else seems to depend on it. While udev update in exited state, this causes to kill network processes while it migrates to the new udev -- network scripts are AFAIS not involved in this. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c49 --- Comment #49 from Marius Tomaschewski <mt@suse.com> 2013-07-11 11:03:49 UTC --- Created an attachment (id=547569) --> (http://bugzilla.novell.com/attachment.cgi?id=547569) screenshot: network in active exited state, even it reported 0 and dhcpcd is running -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c50 Marius Tomaschewski <mt@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEEDINFO InfoProvider| |fcrozat@suse.com --- Comment #50 from Marius Tomaschewski <mt@suse.com> 2013-07-11 11:07:57 UTC --- Frederic, why systemd shows the network.service in "active (exited)" state, even rcnetwork start reported 0 and dhcpcd is still running in the network.service cgroup? This looks like a systemd bug to me. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c51 Frederic Crozat <fcrozat@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |ASSIGNED InfoProvider|fcrozat@suse.com | --- Comment #51 from Frederic Crozat <fcrozat@suse.com> 2013-07-11 11:23:42 UTC --- for me, systemctl (and systemd) are ok: main process (ie /etc/init.d/network, on the Process line) was executed, finished and returned an exit code of 0 => Active (because /etc/init.d/network has X-Systemd-RemainAfterExit: true) and Exited because the pid referenced on the Process line is no longer running there is still one process running, which was started by this process, and it is dhcpcd. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c52 Robert Milasan <rmilasan@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|rmilasan@suse.com |mt@suse.com --- Comment #52 from Robert Milasan <rmilasan@suse.com> 2013-07-11 11:27:25 UTC --- Re-assigning the bug to Marius, don't really see this as a udev issue. Testing now a different way to install udev (%post and %pre part), just to see if it makes a difference. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c53 --- Comment #53 from Robert Milasan <rmilasan@suse.com> 2013-07-11 11:36:10 UTC --- OK, the idea didn't make a difference. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c54 Marius Tomaschewski <mt@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|mt@suse.com |fcrozat@suse.com --- Comment #54 from Marius Tomaschewski <mt@suse.com> 2013-07-11 12:01:53 UTC --- (In reply to comment #51)
for me, systemctl (and systemd) are ok:
main process (ie /etc/init.d/network, on the Process line) was executed, finished and returned an exit code of 0 => Active (because /etc/init.d/network has X-Systemd-RemainAfterExit: true) and Exited because the pid referenced on the Process line is no longer running
there is still one process running, which was started by this process, and it is dhcpcd.
And why does it behaves like this only in 50% of the cases/boots [when VM is on tmpfs of a fast host, VM on HDD in <10% of the cases]? Usually it reports "active (running)" here. Also after "systemctl restart network.service" after it initially booted in "active (exited)" state. The situation is exactly same: rcnetwork exited with status 0, but there is still dhcpcd running. Just the systemd state differs from time to time. And why does systemd -- while udev update -- drops all state and kills all cgroup members (dhcpcd) in "active (exited)" state [run1], but not in the "active (running)" [run3], where it does not kill all network.service cgroup members? => See run1 and run3 in the attachment of comment 49. (In reply to comment #52)
Re-assigning the bug to Marius, don't really see this as a udev issue.
It is not a network scripts bug. Network works, 0 status is reported. Re-assigning to Frederic as systemd is not consistent / behaves strange. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c55 --- Comment #55 from Frederic Crozat <fcrozat@suse.com> 2013-07-11 14:31:11 UTC --- The issue is mostly network initscript which can have two opposite behaviour: - either be a initscript which have no process left running (static IP) - or some process running (dhcp). systemd can't handle easily those two opposite constraints (see manpage systemd.service) - RemainAfterExit is supposed to tell if a service show stay "active" when no process is left running (issue here static vs dhcp). - GuessMainPID is supposed to search which PID is the "master one", when there is no PIDFile. For sysvinit scripts, systemd was disabling this feature (because it is not reliable) and considers all PID to be potentially "main PID" -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c56 --- Comment #56 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-07-11 16:37:49 CEST --- I dropped RemainAfterExit in /etc/init.d/network in my testvm and the status still went to "active (exited)" on boot with the associated breakage on udev upgrade How can I enable GuessMainPID for the network script to test if this makes a difference? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c57 --- Comment #57 from Frederic Crozat <fcrozat@suse.com> 2013-07-11 14:55:55 UTC --- you need to create a network.service file which will call the initscript in ExecStart / ExecStop / ExecReload .. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c58 --- Comment #58 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-07-11 17:19:04 CEST --- Created an attachment (id=547604) --> (http://bugzilla.novell.com/attachment.cgi?id=547604) network.service with attached service file in /usr/lib/systemd/system/ and systemd enable network.service network always came up correctly as "active (running)" but rcnetwork stop would lead to failed state maybe because dhcpcd exits on SIGTERM with return code 1 which is counted as error -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c59 --- Comment #59 from Marius Tomaschewski <mt@suse.com> 2013-07-12 13:20:28 UTC --- (In reply to comment #55) Hmm: - the systemd behavior is inconsistent. With dhcp: it _usually_ reports correctly "active (running)", _sometimes_ "active (exited)" You basically need a VM on a tmpfs of a fast host to get the "active (exited)". Looks to me, that the RemainAfterExit logic is not sufficient or there is some kind of race condition. To 'considers all PID to be potentially "main PID"' is IMO correct. Systemd should set the state consistently with RemainAfterExit: always "active (running)" when the main PID (of started init script) process exits and cgroup is not empty and when the cgroup becomes empty, change to "active (exited)". When the dhcp client gets an infinite lease, it exits. That is also with dhcp, you can get the "static setup" behavior. But it will not produce problem on update because dhcpd does not deconfigure ifaces. - Dropping RemainAfterExit would break static setups -> a no-go. - Providing a network.service lease has potential to break the service switch in yast2 -- but I didn't tested it. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c60 --- Comment #60 from Frederic Crozat <fcrozat@suse.com> 2013-07-12 13:27:32 UTC --- (In reply to comment #59)
(In reply to comment #55)
Hmm: - the systemd behavior is inconsistent.
With dhcp: it _usually_ reports correctly "active (running)", _sometimes_ "active (exited)"
You basically need a VM on a tmpfs of a fast host to get the "active (exited)".
Looks to me, that the RemainAfterExit logic is not sufficient or there is some kind of race condition. To 'considers all PID to be potentially "main PID"' is IMO correct.
Systemd should set the state consistently with RemainAfterExit: always "active (running)" when the main PID (of started init script) process exits and cgroup is not empty and when the cgroup becomes empty, change to "active (exited)".
What is the output of systemctl show network.service in both case ?
- Providing a network.service lease has potential to break the service switch in yast2 -- but I didn't tested it.
I don't see how it would break service switch ? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c61 --- Comment #61 from Marius Tomaschewski <mt@suse.com> 2013-07-12 14:17:44 UTC --- Created an attachment (id=547772) --> (http://bugzilla.novell.com/attachment.cgi?id=547772) network.service (mt 1) OK, this one seems to work -- at least with dhcp. I've switched between NM and sysconfig several times in yast2, this seem to work too. Unlike in attachment 547604 I keep the KillMode=control-group as this is what's needed I think. (I'm thinking of starting eth0 as hotplug [at boot], eth1 as hotplug later, ...), where the remaining pid changes. More tests (static, ...) in the next weeks. [I guess, the state difference results from the GuessMainPID=yes default in the service file. This seems to be required with RemainAfterExit=yes] -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c62 --- Comment #62 from Marius Tomaschewski <mt@suse.com> 2013-07-12 14:28:14 UTC --- (In reply to comment #60)
(In reply to comment #59)
(In reply to comment #55) What is the output of systemctl show network.service in both case ?
I'll provide it next week -- have to revert the changes first.
- Providing a network.service lease has potential to break the service switch in yast2 -- but I didn't tested it.
I don't see how it would break service switch ?
I don't see it now, but the potential is there: I were thinking about the 2nd stage tricks and because it calls "disable NetworkManager.service" without a following "enable network.service", but there is AFAIR also a "insserv network" to ensure lsb script is active, that will be forwarded to systemd now. At the moment it looks fine -- seems to work. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c Swamp Workflow Management <swamp@suse.de> changed: What |Removed |Added ---------------------------------------------------------------------------- Status Whiteboard| |maint:running:53579:moderat | |e -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c64 --- Comment #64 from Marius Tomaschewski <mt@suse.com> 2013-07-17 21:30:47 UTC --- Created an attachment (id=548391) --> (http://bugzilla.novell.com/attachment.cgi?id=548391) systemctl show network.service and dump in both states (In reply to comment #60)
What is the output of systemctl show network.service in both case ?
Attached in the archive. No difference except in timestamps and SubState. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c65 --- Comment #65 from Frederic Crozat <fcrozat@suse.com> 2013-07-19 14:38:38 UTC --- Just wondering: is it possible to have any "long running" process being run with the udev network rules ? This is no longer possible with systemd / udev in 12.3 (RUN+= shouldn't have anything long running there, systemd unit should be used instead) and udev will kill any process remaining (like an dhcpcd instance) once the script from RUN+ is finished. Feel free to tell me I'm wrong :) There is also something I don't understand: when I look at the journal trace on attachment #3 (with systemd.log_level=debug), it looks like dhcpcd PID and output which is left running is not the one which was initially started: Jun 06 15:58:25 bwiedemann-12 dhcpcd[750]: eth0: exiting vs Jun 06 16:07:30 bwiedemann-12 dhcpcd[1387]: eth0: received SIGTERM, stopping It looks like two dhcpcd daemons were started somehow, with one "blocking" the other. It would be interesting to have, for the same problematic run (active(exited)), journal trace (journalctl -b, with booting with systemd.log_level=debug) and the network debug trace. Another reminder is network.service (and this kind of initscript) is "abusing" type=forking, since there is no real "daemon" running, nor a PIDFile to help systemd to detect if there is really which PID is the main one. I'm also not 100% the issue is caused by active(exited) vs active(running), since upgrading udev package doesn't cause any change at systemd own configuration (no --daemon-reload nor --daemon-reexec in %scripts). -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c66 --- Comment #66 from Frederic Crozat <fcrozat@suse.com> 2013-07-19 14:43:39 UTC --- I've backported an upstream patch which fixes incorrect termination of service when some child terminates. I don't think it is related to this particular bug but it might be worth testing anyway ( http://download.opensuse.org/repositories/home:/fcrozat:/branches:/openSUSE:... ). Should be available in one hour.. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c67 --- Comment #67 from Frederic Crozat <fcrozat@suse.com> 2013-07-19 15:30:43 UTC --- I might have incorrectly understood one specific commit for our case : commit 1835f23c2a53e632959270e79dbf3143874e6111 Author: Lennart Poettering <lennart@poettering.net> Date: Tue Oct 11 20:21:06 2011 +0200 service: don't try to guess PID for SysV services anymore As it turns out there are quite a number of SysV services too broken to make the guessing work: instead of returning in the parent only after the child is fully initialized they return immediately. The effect is that the guessing in systemd might happen too early, at a time where the final main process doesn't exist yet. By turning this off we won't try to detect the main pid anymore, with the effect that all processes of the service in question are considered equally likely to be the main process. This means systemd will pick one process as the "main" process, in the remaining processes and if this process terminated, it will cause the service to enter active(exited), even if there are other processes remaining in the cgroup. In our case, it might be because there are two dhcpcd processes running, one is picked by systemd, but it will terminated because another one is running and this would cause active(exited). However, I'm not sure this state is the reason why network goes down.. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c69 --- Comment #69 from Marius Tomaschewski <mt@suse.com> 2013-07-22 12:48:06 UTC --- (In reply to comment #65)
Just wondering: is it possible to have any "long running" process being run with the udev network rules ? This is no longer possible with systemd / udev in 12.3 (RUN+= shouldn't have anything long running there, systemd unit should be used instead) and udev will kill any process remaining (like an dhcpcd instance) once the script from RUN+ is finished. Feel free to tell me I'm wrong :)
Hmm.. The question is what is "long running" -- the complete boot needs just few secs in both cases. The rule may start some background jobs. AFAIR, the udev rule timeout were much higher. I'll take a close look here... when udev kills the background processes that were started via RUN, it could be a reason. But I currently don't believe it is udev related.
There is also something I don't understand: when I look at the journal trace on attachment #3 [details] (with systemd.log_level=debug), it looks like dhcpcd PID and output which is left running is not the one which was initially started:
Jun 06 15:58:25 bwiedemann-12 dhcpcd[750]: eth0: exiting vs Jun 06 16:07:30 bwiedemann-12 dhcpcd[1387]: eth0: received SIGTERM, stopping
It looks like two dhcpcd daemons were started somehow, with one "blocking" the other.
No, dhcpcd were started, after it got a lease it forks continuing in background, while the parent exits [pid 750 above].
It would be interesting to have, for the same problematic run (active(exited)), journal trace (journalctl -b, with booting with systemd.log_level=debug) and the network debug trace.
Yes, and perhaps also udev in debug mode too.
Another reminder is network.service (and this kind of initscript) is "abusing" type=forking, since there is no real "daemon" running, nor a PIDFile to help systemd to detect if there is really which PID is the main one.
Yes, it is a drawback is in systemd / RemainOnExit logic for proper systemV script support :-)
I'm also not 100% the issue is caused by active(exited) vs active(running), since upgrading udev package doesn't cause any change at systemd own configuration (no --daemon-reload nor --daemon-reexec in %scripts).
AFAIS, systemd/udev/migration hook kills all processes (using SIGTERM) from the "exited" cgroup. Maybe this happens because it removes background processes started by the RUN rule...? Try to enable network tracing, did not show any output, that is there were no "rcnework stop" involved. (In reply to comment #67)
However, I'm not sure this state is the reason why network goes down..
IMO it seems something we have to take a closer look at. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c70 --- Comment #70 from Frederic Crozat <fcrozat@suse.com> 2013-07-22 13:02:14 UTC --- (In reply to comment #69)
(In reply to comment #65)
Just wondering: is it possible to have any "long running" process being run with the udev network rules ? This is no longer possible with systemd / udev in 12.3 (RUN+= shouldn't have anything long running there, systemd unit should be used instead) and udev will kill any process remaining (like an dhcpcd instance) once the script from RUN+ is finished. Feel free to tell me I'm wrong :)
Hmm.. The question is what is "long running" -- the complete boot needs just few secs in both cases. The rule may start some background jobs. AFAIR, the udev rule timeout were much higher.
Here is the exact explanation from upstream NEWS: * udev: when udevd is started by systemd, processes which are left behind by forking them off of udev rules, are unconditionally cleaned up and killed now after the event handling has finished. Services or daemons must be started as systemd services. Services can be pulled-in by udev to get started, but they can no longer be directly forked by udev rules.
I'll take a close look here... when udev kills the background processes that were started via RUN, it could be a reason. But I currently don't believe it is udev related.
The issue is appearing when udev is being upgraded, not systemd, so "something" is going on when udev is being restarted, I think.
There is also something I don't understand: when I look at the journal trace on attachment #3 [details] [details] (with systemd.log_level=debug), it looks like dhcpcd PID and output which is left running is not the one which was initially started:
Jun 06 15:58:25 bwiedemann-12 dhcpcd[750]: eth0: exiting vs Jun 06 16:07:30 bwiedemann-12 dhcpcd[1387]: eth0: received SIGTERM, stopping
It looks like two dhcpcd daemons were started somehow, with one "blocking" the other.
No, dhcpcd were started, after it got a lease it forks continuing in background, while the parent exits [pid 750 above].
ok
I'm also not 100% the issue is caused by active(exited) vs active(running), since upgrading udev package doesn't cause any change at systemd own configuration (no --daemon-reload nor --daemon-reexec in %scripts).
AFAIS, systemd/udev/migration hook kills all processes (using SIGTERM) from the "exited" cgroup. Maybe this happens because it removes background processes started by the RUN rule...?
Hmm, after looking carefully at udev %pre/%post scripts, "systemctl daemon-reload" is being called. It would be interesting to see if just calling "systemctl daemon-reload" is causing the issue (I doubt it). -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c71 --- Comment #71 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-07-22 15:39:49 CEST --- just tested that "systemctl daemon-reload" does not kill dhcpcd but zypper in -f udev does -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c72 Swamp Workflow Management <swamp@suse.de> changed: What |Removed |Added ---------------------------------------------------------------------------- Status Whiteboard|maint:running:53579:moderat |maint:running:53579:moderat |e |e | |maint:released:sle11-sp2:53 | |588 --- Comment #72 from Swamp Workflow Management <swamp@suse.de> 2013-08-02 14:46:20 UTC --- Update released for: openstack-nova, openstack-nova-api, openstack-nova-cert, openstack-nova-compute, openstack-nova-network, openstack-nova-objectstore, openstack-nova-scheduler, openstack-nova-test, openstack-nova-vncproxy, openstack-nova-volume, python-nova Products: SUSE-CLOUD 1.0 (x86_64) -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c Marius Tomaschewski <mt@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Blocks| |835812 -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c Marius Tomaschewski <mt@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Blocks|835812 | -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c Marius Tomaschewski <mt@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- See Also| |https://bugzilla.novell.com | |/show_bug.cgi?id=835812 -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c73 --- Comment #73 from Bernhard Wiedemann <bwiedemann@suse.com> 2013-09-10 16:00:15 CEST --- This is an autogenerated message for OBS integration: This bug (821879) was mentioned in https://build.opensuse.org/request/show/198226 Factory / sysconfig -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c Frederic Crozat <fcrozat@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|fcrozat@suse.com |rmilasan@suse.com -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=821879 https://bugzilla.novell.com/show_bug.cgi?id=821879#c74 Robert Milasan <rmilasan@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED --- Comment #74 from Robert Milasan <rmilasan@suse.com> 2014-06-06 08:09:15 UTC --- Setting this to RESOLVED/FIXED. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@novell.com