[Bug 342148] New: after kernel update: e1000 ethernet only works while in promiscuous mode
https://bugzilla.novell.com/show_bug.cgi?id=342148#c1

Summary: after kernel update: e1000 ethernet only works while in promiscuous mode
Product: openSUSE 10.3
Version: Final
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: koenig@linux.de
QAContact: qa@suse.de
Found By: ---

last weekend I got a new kernel via online update to

harald > uname -a
Linux harald 2.6.22.12-0.1-default #1 SMP 2007/11/06 23:05:18 UTC x86_64 x86_64 x86_64 GNU/Linux

on an IBM T60 notebook with e1000.

since monday I have major problems with dhcp and network setup in our company. none of these problems showed up before (not with 10.3 before the last update and not with 10.1/10.2).

1) the weirdest effect to me is that a ping to the default gateway now only works while eth1 (e1000) is in promiscuous mode by running tcpdump. as soon as I stop tcpdump, I can't reach the gateway anymore :-( completely stopping the firewall doesn't make any difference. finally I did a "rmmod e1000 ; modprobe e1000" and "ping gateway" works again without running tcpdump.

other observations:

2) on monday (day 1 after update;) the same "can't ping gateway" happened. at that time I finally stopped testing and rebooted the notebook (usually I only do suspend2ram for weeks -- typically until the next kernel update;).

3) dhcp takes very long (many minutes and/or multiple unplug/plug of RJ45). today our net admin checked the dhcp server logs and noticed that the first dhcp answer from the dhcp server seemed to be ignored by the notebook. it took 15 minutes until the notebook got its (static/reserved) ip address via dhcpcd. ping to the gateway didn't work then, though (see above).

are there known/similar problems with the current kernel 2.6.22.12-0.1 ? any suggestion for testing in case the same problems show up again tomorrow morning ?

before I found out that running tcpdump fixes the "can't ping gateway" problem, I planned for next morning to run tcpdump on eth1 before I connect the RJ45 and capture all packets, so that later I can compare the dhcp server log with the tcpdump log of the received dhcp packets. but now running tcpdump might influence the very problem I am trying to test -- can this be a heisenbug ?

any thoughts about good tests -- or any suggestion about new kernel/settings ? one scenario for next week is to step back to the last kernel version. comments on that ?

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=342148#c1
Harald Koenig
https://bugzilla.novell.com/show_bug.cgi?id=342148#c2
Brandon Philips
https://bugzilla.novell.com/show_bug.cgi?id=342148#c3
--- Comment #3 from Harald Koenig
https://bugzilla.novell.com/show_bug.cgi?id=342148#c4
--- Comment #4 from Harald Koenig
Could you attach the output from `dmesg` while you are seeing the issue?
done (without/before the test suspend cycles below).
Is there any pattern to the times it works and the times it doesn't? Have you tried other wired networks?
not that I'd realize. right now I have the impression that once "it happens" the driver/stack is stuck until I rmmod/modprobe the e1000 driver... any idea how to query some driver (tcp stack?) data before/after rmmod/modprobe such that a comparison might give some hints ? I just did a rmmod/modprobe e1000 and then did a few suspend/resume to/from ram cycles with unplug/replug of ethernet etc (typical everyday duty cycle) but I haven't been successful in triggering the problem -- right now ping works fine without tcpdump...
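One possible way to do such a before/after comparison, sketched under assumptions: the generic per-interface counters that Linux exposes under /sys/class/net/<iface>/statistics can be read once in the stuck state and once more a few seconds later, and the deltas compared with the same measurement after the driver reload. The function names below are hypothetical and the interface name eth1 is taken from the report; this does not query anything e1000-specific.

```python
import os


def read_counters(iface, root="/sys/class/net"):
    """Return {counter_name: int} read from <root>/<iface>/statistics."""
    stats_dir = os.path.join(root, iface, "statistics")
    counters = {}
    for name in sorted(os.listdir(stats_dir)):
        with open(os.path.join(stats_dir, name)) as fh:
            counters[name] = int(fh.read().strip())
    return counters


def diff_counters(before, after):
    """Return {name: after - before} for every counter whose value changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

A rough usage pattern would be to call read_counters("eth1"), wait a few seconds while the ping is running, call it again and print diff_counters(before, after). Since the counters normally start from zero again after rmmod/modprobe, it is probably more useful to compare deltas over equal intervals (stuck vs. working) than absolute values across the reload.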
Perhaps you could try testing the latest kernel-default from this repository? http://download.opensuse.org/repositories/Kernel:/HEAD/openSUSE_Factory/
can't promise it but I'll try to...
https://bugzilla.novell.com/show_bug.cgi?id=342148#c5
Brandon Philips
https://bugzilla.novell.com/show_bug.cgi?id=342148#c6
--- Comment #6 from Harald Koenig
It looks like you are running the fglrx driver. Can you reproduce the issue without it?
not possible right now (at least as long as I don't know how to quickly trigger the problem). I need the fglrx module and I have to kill the X server and my whole display session to remove it. not an option right now, sorry.
Also, it looks like in some cases you are specifying parameters for `modprobe e1000`. Why are you doing this?
since the update 10.2 -> 10.3 I again suffer severe latency problems on our gigabit network (had similar problems already with 10.2, but with 10.2 "InterruptThrottleRate=0" fixed this). now with 10.3 I get ping output to local hosts or the gateway like this:

64 bytes from atuin (10.0.3.70): icmp_seq=1 ttl=63 time=99.4 ms
64 bytes from atuin (10.0.3.70): icmp_seq=2 ttl=63 time=108 ms
64 bytes from atuin (10.0.3.70): icmp_seq=3 ttl=63 time=0.639 ms
64 bytes from atuin (10.0.3.70): icmp_seq=4 ttl=63 time=108 ms
64 bytes from atuin (10.0.3.70): icmp_seq=5 ttl=63 time=0.562 ms
64 bytes from atuin (10.0.3.70): icmp_seq=6 ttl=63 time=108 ms
64 bytes from atuin (10.0.3.70): icmp_seq=7 ttl=63 time=1000 ms
64 bytes from atuin (10.0.3.70): icmp_seq=8 ttl=63 time=108 ms
64 bytes from atuin (10.0.3.70): icmp_seq=9 ttl=63 time=0.645 ms
64 bytes from atuin (10.0.3.70): icmp_seq=10 ttl=63 time=104 ms

sustained throughput of tcp connections (e.g. scp) is ok though, but e.g. running X11 sessions over ssh completely sucks (starting a plain X client takes 20-30 seconds because of those latencies for every small packet!).

with the 10.3 driver (and the most recent e1000 driver from intel.com) I get sustained good ping rates only with "RxIntDelay=0" -- and I have to patch the driver for this chip in order to get "RxIntDelay=0" working (this value is out of range by default). unfortunately, this setting only optimizes ping/latency and reduces sustained unidirectional tcp stream throughput dramatically :-(

all the switch/can't-ping problems have occurred with the unpatched suse e1000.ko module -- patched/recompiled modules have been loaded only for short periods for some ping tests.

so back to the original problem...
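To make the latency pattern in the ping output above easier to compare between driver settings, the round-trip times could be summarized with a small script. This is only a sketch: the function names are hypothetical, the sample lines are copied from the report, and the 10 ms "slow" threshold is an arbitrary assumption.

```python
import re

# Matches the "icmp_seq=N ttl=N time=X ms" part of a ping reply line.
RTT_RE = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")

# Sample replies copied from the report.
SAMPLE = """\
64 bytes from atuin (10.0.3.70): icmp_seq=1 ttl=63 time=99.4 ms
64 bytes from atuin (10.0.3.70): icmp_seq=2 ttl=63 time=108 ms
64 bytes from atuin (10.0.3.70): icmp_seq=3 ttl=63 time=0.639 ms
64 bytes from atuin (10.0.3.70): icmp_seq=4 ttl=63 time=108 ms
64 bytes from atuin (10.0.3.70): icmp_seq=5 ttl=63 time=0.562 ms
64 bytes from atuin (10.0.3.70): icmp_seq=6 ttl=63 time=108 ms
64 bytes from atuin (10.0.3.70): icmp_seq=7 ttl=63 time=1000 ms
64 bytes from atuin (10.0.3.70): icmp_seq=8 ttl=63 time=108 ms
64 bytes from atuin (10.0.3.70): icmp_seq=9 ttl=63 time=0.645 ms
64 bytes from atuin (10.0.3.70): icmp_seq=10 ttl=63 time=104 ms
"""


def parse_rtts(text):
    """Return a list of (icmp_seq, rtt_ms) tuples found in ping output."""
    return [(int(seq), float(ms)) for seq, ms in RTT_RE.findall(text)]


def summarize(rtts, slow_ms=10.0):
    """Return count, min, max and the sequence numbers at or above slow_ms."""
    times = [ms for _, ms in rtts]
    return {
        "count": len(times),
        "min": min(times),
        "max": max(times),
        "slow_seqs": [seq for seq, ms in rtts if ms >= slow_ms],
    }
```

Calling summarize(parse_rtts(SAMPLE)) on the replies above flags seven of the ten as slow, with the fast ones all below 1 ms, which matches the "either about 100 ms or under 1 ms" impression from reading the raw output.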
Can you reproduce the problem without specifying InterruptThrottleRate or RxIntDelay?
I've removed my modprobe.conf.local for now, I'll wait and see (possibly tomorrow morning...?!)
I am lowering the priority since there is no clear pattern to reproducing the issue and it seems to be working now.
ok. thanks for your input so far!
https://bugzilla.novell.com/show_bug.cgi?id=342148
Harald Koenig
https://bugzilla.novell.com/show_bug.cgi?id=342148#c7
--- Comment #7 from Brandon Philips
https://bugzilla.novell.com/show_bug.cgi?id=342148#c8
--- Comment #8 from Harald Koenig
Are you sure that the interface is up when you resume?
yes
Can you run `rcnetwork restart` or `ifconfig ethX up` to fix the issue without the rmmod/modprobe cycle?
I tried that many times before -- it never helped.

situation today: I don't get an ip address via dhcp (as before;). now I checked the dhcp server log: there is a DHCPDISCOVER/DHCPOFFER every minute (for hours, until I needed the notebook just now;). only after running "tcpdump -i eth1" I got the ip address (and as usual, ping stops working after killing tcpdump).

I get a new piece for the jigsaw every day, and all of them fit the overall picture: "no packets without promiscuous mode" [since last kernel update]. this is now the plain suse e1000.ko without any module options... (but still with fglrx)

this is the tail from the dhcpd log for my T60:

Nov 22 14:20:45 obitest dhcpd: DHCPDISCOVER from 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:20:45 obitest dhcpd: DHCPOFFER on 10.10.8.60 to 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:21:49 obitest dhcpd: DHCPDISCOVER from 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:21:49 obitest dhcpd: DHCPOFFER on 10.10.8.60 to 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:22:53 obitest dhcpd: DHCPDISCOVER from 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:22:53 obitest dhcpd: DHCPOFFER on 10.10.8.60 to 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:22:53 obitest dhcpd: DHCPREQUEST for 10.10.8.60 (10.0.5.22) from 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:22:53 obitest dhcpd: DHCPACK on 10.10.8.60 to 00:16:41:ad:9c:b0 via 10.10.11.254
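For longer server logs, the DISCOVER/OFFER loop could be counted per client instead of read by eye. The following is a rough sketch that assumes log lines in exactly the format shown above; the function names are hypothetical and the sample is the log tail from this comment.

```python
import re

# Message type and client MAC from a dhcpd log line in the format shown above.
LINE_RE = re.compile(
    r"dhcpd: (DHCP[A-Z]+)\b.*?\b((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\b",
    re.IGNORECASE,
)

# Log tail copied from the report.
SAMPLE = """\
Nov 22 14:20:45 obitest dhcpd: DHCPDISCOVER from 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:20:45 obitest dhcpd: DHCPOFFER on 10.10.8.60 to 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:21:49 obitest dhcpd: DHCPDISCOVER from 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:21:49 obitest dhcpd: DHCPOFFER on 10.10.8.60 to 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:22:53 obitest dhcpd: DHCPDISCOVER from 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:22:53 obitest dhcpd: DHCPOFFER on 10.10.8.60 to 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:22:53 obitest dhcpd: DHCPREQUEST for 10.10.8.60 (10.0.5.22) from 00:16:41:ad:9c:b0 via 10.10.11.254
Nov 22 14:22:53 obitest dhcpd: DHCPACK on 10.10.8.60 to 00:16:41:ad:9c:b0 via 10.10.11.254
"""


def events(log_text, mac):
    """Return the DHCP message types logged for one client MAC, in order."""
    wanted = mac.lower()
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group(2).lower() == wanted:
            found.append(match.group(1).upper())
    return found


def offers_before_request(types):
    """Count DHCPOFFERs the server sent before the client's first DHCPREQUEST."""
    count = 0
    for msg in types:
        if msg == "DHCPREQUEST":
            break
        if msg == "DHCPOFFER":
            count += 1
    return count
```

On the log tail above, offers_before_request(events(SAMPLE, "00:16:41:ad:9c:b0")) counts three offers before the first request, i.e. the client only reacted to the third offer shown here (the one received while tcpdump was running).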
https://bugzilla.novell.com/show_bug.cgi?id=342148#c9
Karsten Keil
https://bugzilla.novell.com/show_bug.cgi?id=342148#c10
--- Comment #10 from Harald Koenig
This looks like the driver doesn't see the packet, and since setting promiscuous mode helps, it looks like the HW lost the MAC address in the receive path. If you are in the fail situation, can you please run ifconfig and check if the MAC address is correct ? I do not remember that we changed the driver since RC1, so I do not understand why this problem comes from a recent kernel update, but I'll recheck.
"ifconfig eth1" always shows the same MAC address, even without running tcpdump (ping not working). now I run "watch -dn1 ifconfig eth1" and noticed that even in "blocked" state without tcpdump, both Rx and Tx packet counts change (there are open ssh and X11 sessions), so the hardware [driver] itself doesn't seem to be the full/only problem. any suggestion what to check next ?
https://bugzilla.novell.com/show_bug.cgi?id=342148#c11
Karsten Keil
https://bugzilla.novell.com/show_bug.cgi?id=342148
User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=342148#c12
Jeff Mahoney