[Bug 337003] New: system crash on data transfer - driver problem with Intel 3 Series chipset ?
https://bugzilla.novell.com/show_bug.cgi?id=337003

Summary: system crash on data transfer - driver problem with Intel 3 Series chipset ?
Product: openSUSE 10.3
Version: Final
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Basesystem
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: bernard.delley@psi.ch
QAContact: qa@suse.de
Found By: ---

I am trying to use Intel Core 2 Quad processors on the appropriate board with the Intel 3 Series chipset for MPICH2 parallel calculations with several nodes. The system crashes systematically when a big real program is running. My small data transfer test program has produced a system crash occasionally.

I feel it could be a network driver problem. For this network interface I have not found a driver source that I could compile, as was possible (and often necessary) for previous Intel 1000 network chips with previous SUSE versions. Yast shows the hardware correctly identified: Intel 82801I (ICH9 Family) Gigabit Ethernet Controller (lspci output could be supplied -- I see no attachment option here).

The installation is openSUSE-10.3-GM-DVD-x86_64.iso, text only, ssh open, firewall set off in yast (needed for MPICH2). The solution of the installation problem is given in https://bugzilla.novell.com/show_bug.cgi?id=328471

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c1

Mark Gordon <mtgordon@novell.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |mtgordon@novell.com
             Status|NEW     |NEEDINFO
      Info Provider|        |bernard.delley@psi.ch

--- Comment #1 from Mark Gordon <mtgordon@novell.com> 2007-10-26 15:34:10 MST ---
Re: attachments, "Add an attachment (proposed patch, testcase, etc.)" below is a link to the attachment-adding interface. It's not there when the bug is first created, but it's available subsequently. Please attach the lspci output now that you have that option.

http://en.opensuse.org/Bugs/Kernel has several useful tips on reporting kernel bugs. Using that information, you should be able to provide a more precise description than "system crash" (e.g. if it's an oops, it will have been written to /var/log/messages). Please attach anything that seems relevant. Feel free to err on the side of adding too much.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c2

--- Comment #2 from Bernard Delley <bernard.delley@psi.ch> 2007-11-02 10:26:48 MST ---
Created an attachment (id=181832)
 --> (https://bugzilla.novell.com/attachment.cgi?id=181832)
messages while network connection was lost

This is /var/log/messages from an event that made the machine lose its network connection at about 9:45 am on Nov 2. It had not actually crashed. The connection was fixed by invoking yast, changing the IP address back and forth, proceeding to finish, and quitting yast. This was in the late afternoon, at about 16:48. The whole time, the machine was running a 4-process mpich2 job on its 4 CPUs with a local mpd.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c3

--- Comment #3 from Bernard Delley <bernard.delley@psi.ch> 2007-11-02 10:32:20 MST ---
Created an attachment (id=181834)
 --> (https://bugzilla.novell.com/attachment.cgi?id=181834)
lspci output to describe the northgate of the DG33TL board

Note that the Gbit network is integrated into the ICH9 bridge. The rtl8139 card is still plugged in, in case alternate network connections are to be used. It has no ethernet connection now.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c4

--- Comment #4 from Bernard Delley <bernard.delley@psi.ch> 2007-11-02 10:35:19 MST ---
Created an attachment (id=181836)
 --> (https://bugzilla.novell.com/attachment.cgi?id=181836)
dmesg output

Just for further documentation.
https://bugzilla.novell.com/show_bug.cgi?id=337003

Mark Gordon <mtgordon@novell.com> changed:

           What              |Removed                  |Added
----------------------------------------------------------------------------
Attachment #181832 mime type |application/octet-stream |text/plain
Attachment #181834 mime type |application/octet-stream |text/plain
Attachment #181836 mime type |application/octet-stream |text/plain
https://bugzilla.novell.com/show_bug.cgi?id=337003

Mark Gordon <mtgordon@novell.com> changed:

           What    |Removed                                   |Added
----------------------------------------------------------------------------
         AssignedTo|bnc-team-screening@forge.provo.novell.com |kernel-maintainers@forge.provo.novell.com
             Status|NEEDINFO                                  |NEW
          Component|Basesystem                                |Kernel
      Info Provider|bernard.delley@psi.ch                     |
https://bugzilla.novell.com/show_bug.cgi?id=337003#c5

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |npiggin@novell.com, jeffm@novell.com, andrea@novell.com

--- Comment #5 from Jeff Mahoney <jeffm@novell.com> 2007-11-07 13:16:34 MST ---
Is the fan on this machine running? I'm seeing a number of things that might indicate overheating:

Nov 2 09:45:00 fn18 smartd[3577]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 109 to 110

.. and then lots of ATA failures.

There's also this:

Dmol39n90m: Corrupted page table at address 2b34dc76e000
PGD c184c067 PUD bc51f067 PMD 668a3067 PTE 200058410067
Bad pagetable: 000d [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:1e.0/0000:06:00.0/irq
CPU 2
Modules linked in: nfs lockd nfs_acl sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 microcode firmware_class snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq coretemp hwmon loop dm_mod e1000 8139too snd_hda_intel ohci1394 snd_pcm i2c_i801 ieee1394 rtc_cmos 8139cp rtc_core snd_timer i2c_core rtc_lib button mii e1000e snd serio_raw intel_agp soundcore snd_page_alloc sg usbhid hid ff_memless sd_mod ehci_hcd uhci_hcd usbcore edd ext3 mbcache jbd fan thermal processor pata_marvell ahci libata scsi_mod
Pid: 7589, comm: Dmol39n90m Tainted: G N 2.6.22.5-31-default #1
RIP: 0033:[<0000000000d511eb>] [<0000000000d511eb>]
RSP: 002b:00007fffdc9f5d18 EFLAGS: 00010206
RAX: 00002b34d3c94640 RBX: 0000000000000000 RCX: 00002b34d5379640
RDX: 00002b34dc76e014 RSI: 00002b34db089014 RDI: 00002b34d3c94640
RBP: 00007fffdc9f5f60 R08: 0000000000a63700 R09: 0000000000200000
R10: 0000000002148700 R11: 0000000000d4fe70 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000000 R15: 000000000129aab0
FS: 0000000001260900(0063) GS:ffff810127402340(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002b34dc76e000 CR3: 00000000c0077000 CR4: 00000000000006a0
Process Dmol39n90m (pid: 7589, threadinfo ffff8100ac398000, task ffff8100c31c4850)
RIP [<0000000000d511eb>] RSP <00007fffdc9f5d18>
Bad pte = 200058410067, process = ???, vm_flags = 100077, vaddr = 2b34dc76e000

Call Trace:
 [<ffffffff8026eda8>] vm_normal_page+0x62/0x7e
 [<ffffffff8026f7a5>] unmap_vmas+0x312/0x7a4
 [<ffffffff8027389c>] exit_mmap+0x78/0xed
 [<ffffffff8022ee6c>] mmput+0x28/0xa0
 [<ffffffff80234371>] do_exit+0x23e/0x81e
 [<ffffffff803fb05b>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff8021f850>] is_prefetch+0x0/0x1e4
 [<ffffffff803fce15>] do_page_fault+0x2f6/0x769
 [<ffffffff803f8c8c>] __sched_text_start+0x194/0x8ac
 [<ffffffff8027551f>] do_mmap_pgoff+0x643/0x798
 [<ffffffff803fb48d>] error_exit+0x0/0x84
https://bugzilla.novell.com/show_bug.cgi?id=337003#c6

--- Comment #6 from Nick Piggin <npiggin@novell.com> 2007-11-07 19:19:03 MST ---
The corrupted pte message means the pte unexpectedly points outside the range of the regular mem_map. There is not too much I can gather from the crash: the call trace is pretty jumbled, and even if it were accurate, it is only the path via which the corruption was detected, not where it occurred.

vm_flags is VM_READ|VM_WRITE|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|VM_ACCOUNT

It could be a library or program text mapping, I suppose (we don't print out vm_file or vm_ops information in these messages, so it is hard to know).

The pte is (_PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED|_PAGE_DIRTY), which is not surprising -- just a regular user page. The corruption is likely to be in the higher order bits of the pte.

It's almost impossible to make any useful progress on this unless we can reproduce it. If you can reproduce it reliably, we could eventually narrow it down. I do think it might be an overheating issue, though: when hardware overheats it is very common to see the kernel hit memory corruption in pages and pagetables...
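The flag decoding in comment #6 can be reproduced with a little bit arithmetic. The sketch below is only illustrative: the bit values are the x86-64 page-table flags and the 2.6-era VM_* constants as recalled from the kernel headers, so treat both tables as assumptions and check them against the source for the kernel in question. The pte and vm_flags values are the ones printed in the oops in comment #5.

```python
# Assumed bit values (x86-64 pte flags, 2.6-era VM_* flags); verify against
# the kernel headers before relying on them.
PTE_FLAGS = {
    0x001: "_PAGE_PRESENT",
    0x002: "_PAGE_RW",
    0x004: "_PAGE_USER",
    0x008: "_PAGE_PWT",
    0x010: "_PAGE_PCD",
    0x020: "_PAGE_ACCESSED",
    0x040: "_PAGE_DIRTY",
    0x080: "_PAGE_PSE",
    0x100: "_PAGE_GLOBAL",
}
VM_FLAGS = {
    0x000001: "VM_READ",
    0x000002: "VM_WRITE",
    0x000004: "VM_EXEC",
    0x000008: "VM_SHARED",
    0x000010: "VM_MAYREAD",
    0x000020: "VM_MAYWRITE",
    0x000040: "VM_MAYEXEC",
    0x000080: "VM_MAYSHARE",
    0x100000: "VM_ACCOUNT",
}

# Bits 12..51 of a pte hold the physical frame address on x86-64.
ADDR_MASK = 0x000FFFFFFFFFF000


def decode(value, table):
    """Return the names of all flags in `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & bit]


PTE = 0x200058410067   # "Bad pte = 200058410067" from the oops
VM = 0x100077          # "vm_flags = 100077" from the oops

pte_flags = decode(PTE & 0xFFF, PTE_FLAGS)
vm_flags = decode(VM, VM_FLAGS)
frame_addr = PTE & ADDR_MASK

print("pte flags :", "|".join(pte_flags))
print("vm_flags  :", "|".join(vm_flags))
print("frame addr: %#x (highest set bit: %d)"
      % (frame_addr, frame_addr.bit_length() - 1))
```

Under these assumptions the low bits decode to the same flag lists given in the comment, and the frame address has bit 45 set, which puts it at 32 TiB or more and is consistent with the remark that the corruption sits in the higher-order bits.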
https://bugzilla.novell.com/show_bug.cgi?id=337003#c7

--- Comment #7 from Bernard Delley <bernard.delley@psi.ch> 2007-11-08 06:34:50 MST ---
Overheating? sensors works with 10.3 for this hardware. Great!

While the program is running on all 4 processors, sensors indicates a temperature of 78 - 81 on the four processors (single chip, but each CPU has a reading). The processor cooler felt quite warm by hand when I tried. The passive northgate cooler is quite warm (hot) judged by hand. The machine is on the lowest level of the shelf in a cooled room; for extra reserve the side panel remains removed.

The smartd readings in /var/log/messages show a maximum of 110. Could that be a correct reading of the northgate (ICH9 bridge) temperature?

Is the mention of Dmol39m90m in dmesg indicative? This is my user program that I run in MPICH2 parallel mode. Does the dmesg suggest it was involved in a crash previous to the reboot? If there were any segmentation violations coming from a user program, that should crash the program, but not the system.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c8

Bernard Delley <bernard.delley@psi.ch> changed:

           What                |Removed |Added
----------------------------------------------------------------------------
Attachment #181832 is obsolete |0       |1

--- Comment #8 from Bernard Delley <bernard.delley@psi.ch> 2007-11-08 06:50:15 MST ---
Created an attachment (id=182612)
 --> (https://bugzilla.novell.com/attachment.cgi?id=182612)
var/log/messages while crashing the network interface on mpich2 slave

I can reproducibly crash the network interface by running a small mpi test program a few times. The program sends fake data between the two nodes. The machines remain almost idle. The program worked fine, for example, on older nodes operated with suse10.1 and on many others.

Here are the last lines appearing in messages on fn18 after the mentioned actions were taken on fn19 or on the fn18 console.

ssh fn18 tail -f messages

after reboot, mount, ssh root@fn18
Nov 8 13:56:25 fn18 sshd[3881]: Accepted publickey for root from 192.168.40.42 port 47757 ssh2

mpdboot -n 2 --verbose --chkup
Nov 8 13:58:35 fn18 sshd[3928]: PAM audit_log_acct_message() failed: Operation not permitted

mpdtrace -l
Nov 8 13:59:08 fn18 sshd[3950]: PAM audit_log_acct_message() failed: Operation not permitted

mpiexec -machinefile mpd.hosts -n 2 Mpt
Nov 8 13:59:46 fn18 sshd[3979]: PAM audit_log_acct_message() failed: Operation not permitted

ssh fn18 then exit
Nov 8 14:00:26 fn18 sshd[4028]: Accepted publickey for delley from 192.168.40.59 port 45559 ssh2
Nov 8 14:00:29 fn18 sshd[4030]: PAM audit_log_acct_message() failed: Operation not permitted

mpiexec -machinefile mpd.hosts -n 2 Mpt
Nov 8 14:01:10 fn18 mpdman: mpdman starting new log; fn18_mpdman_1

ssh fn18 then several times mpiexec -machinefile mpd.hosts -n 2 Mpt
after a while:
Nov 8 14:03:10 fn18 mpd: fn18_38610 (runmainloop 308): no pulse_ack from rhs; re-entering ring

ping fn18
PING fn18.site (192.168.40.58) 56(84) bytes of data.
From fn19.site (192.168.40.59) icmp_seq=1 Destination Host Unreachable

the directly connected console showed messages at
Nov 8 14:03:10 fn18 mpd: fn18_38610 (runmainloop 308): no pulse_ack from rhs; re-entering ring
Nov 8 14:07:23 fn18 mpd: mpd ending mpdid=fn18_38610 (inside cleanup)

at console /etc/init.d/network restart eth0
Nov 8 14:11:22 fn18 kernel: eth0: no IPv6 routers present

Then mpdboot etc. After a few mpiexec runs the network interfaces on fn18 and fn19 were dead, and were later revived from the console with "network restart eth0".
https://bugzilla.novell.com/show_bug.cgi?id=337003#c9

--- Comment #9 from Bernard Delley <bernard.delley@psi.ch> 2007-11-08 07:01:48 MST ---
Created an attachment (id=182622)
 --> (https://bugzilla.novell.com/attachment.cgi?id=182622)
messages from the mpich2 master node

The mpich2 master node had a crashed eth0 shortly before 14:20. Crashes on the master node were not frequent. On the slave node, running the Mpt program fewer than 5 times was sufficient to crash eth0.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c10

--- Comment #10 from Nick Piggin <npiggin@novell.com> 2007-11-08 18:50:38 MST ---
Well, I'm not sure whether it is overheating or some other cause, but it does seem somewhat like hardware failure.

On the other hand, the fact that your network driver is crashing indicates that the driver might have a bug in it. And the buggy driver could have corrupted memory and caused the 'Dmol39m90m message' (that message tells us the Dmol39m90m process had its page table memory corrupted).

Let's put this problem aside for the moment and assume it is due to either a hardware or software bug elsewhere.

It would be nice to try your workload with a different type of card / driver. I guess the realtek ethernet card is only 100Mbps, so it won't help too much? Can you try it anyway? Or can you easily put a different (not Intel based) gigabit card in the system?

The other thing we can try is to use a different version of the e1000 driver. I'm not an expert on the Linux network stack, so I don't know the best way to go with this. Do you have the time, and are you comfortable with compiling and installing a new kernel?
https://bugzilla.novell.com/show_bug.cgi?id=337003#c11

--- Comment #11 from Bernard Delley <bernard.delley@psi.ch> 2007-11-14 06:04:59 MST ---
Excellent hint at a further easy test opportunity!

I used yast to delete the configuration of the Intel 82801I (ICH9 Family) Gigabit Ethernet Controller (on board), configured the rtl8139 card instead, and plugged the network cables into the 8139 outlets. Yast alone did not do it; the machine needed a reboot to show a proper ifconfig and get an actual connection. The 8139 now appears as eth1, and I use the same static address as before.

mpich now appears to work just fine. No crash has been seen yet after a multiple of the time previously needed to see a crash, and after many network transfers. The 8139 performance is great for fast Ethernet (nominally 100Mb/s) at 170 Mb/s, but 6 times lower than we shoot for with the Gb/s network.

The experiment strongly suggests that the PROBLEM is really with the DRIVER used for the on board Ethernet.

I have not found out yet what driver is used for the Intel 82801I. I see two candidate drivers, e1000 and e1000e, in /lib/modules/2.6.22.5-31-default/kernel/drivers/net/ and yast says e1000. Possibly neither is the right driver. For the previously installed Intel DG965ss mainboard I needed to compile the driver (for SUSE 10.1) from the e1000 source provided by Intel. I found no Ethernet driver source supplied by Intel for the current DG33tl.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c12

Nick Piggin <npiggin@novell.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |kkeil@novell.com

--- Comment #12 from Nick Piggin <npiggin@novell.com> 2007-11-14 17:41:29 MST ---
OK, that should help a lot, and I agree it does strongly suggest a network driver (or possibly controller) problem. I hope it is also enough for your production usage so you can bear with us until we work out what the problem is...

I believe the e1000 driver is the correct one. If it actually detects your card and brings up an interface, that suggests it matches the hardware PCI IDs... and given that Intel maintains both the hardware and the software, they should have this right.

I'm not completely sure how the e1000 driver is maintained in the suse kernels, so I'm cc'ing Karsten.

Thanks for your help and patience so far.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c13

Karsten Keil <kkeil@novell.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |john.ronciak@intel.com

--- Comment #13 from Karsten Keil <kkeil@novell.com> 2007-11-15 05:04:52 MST ---
You can try the e1000e driver. If the Intel onboard controller is in use, do:

ifdown eth0
rmmod e1000
modprobe e1000e

This should set up eth0 again, but using e1000e.

We have e1000 7.6.5 on 10.3; you could also try 7.6.9.1 from sourceforge.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c14

--- Comment #14 from Bernard Delley <bernard.delley@psi.ch> 2007-11-16 02:35:59 MST ---
Created an attachment (id=183654)
 --> (https://bugzilla.novell.com/attachment.cgi?id=183654)
lspci out for identifying network chips

I replaced the rtl8139 with an intel pro1000GT (PWLA8391GTBLK) pci card. After setup by yast (+reboot) this eth2 works reliably (1h experience). This older intel Gb/s chip does 630Mb/s, which is about half of what is expected for more current onboard chips.

Trying e1000e for the onboard chip:

rmmod e1000
modprobe e1000e

yast then knows neither of the ethernet controllers, so:

modprobe e1000
yast, specify hardware e1000e
rmmod e1000
double check with lsmod: no e1000 is listed, just e1000e

This gives an ethernet connection. But it is unreliable again, maybe slightly better than with e1000.

For now, the system is used with the intel pci card.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c15

--- Comment #15 from Bernard Delley <bernard.delley@psi.ch> 2007-11-16 02:39:00 MST ---
I always did the ifdown/ifup via a reboot after each replugging of the network connection.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c16

--- Comment #16 from Bernard Delley <bernard.delley@psi.ch> 2007-12-19 06:47:04 MST ---
The same problem was found with the Intel Q35 chipset, again ICH9 family:

Intel DQ35joe mainboard
core2 quad core processor Q6600
installation openSUSE-10.3-GM-DVD-x86_64.iso with tricks relating to acpi=on and dual or quad processors, see report 328471

Accidentally a boot with acpi=off was done. As reported in 328471, this makes only one of the four cpus available -- not as intended with such a machine.

BUT! With safe settings (acpi=off) and consequently 1 cpu per node, the on board ethernet appears to run without problems.

So I will have to fall back to using a pci network card for now.
https://bugzilla.novell.com/show_bug.cgi?id=337003

Jeff Mahoney <jeffm@novell.com> changed:

           What              |Removed                                   |Added
----------------------------------------------------------------------------
Attachment #182622 mime type |application/octet-stream                  |text/plain
Attachment #183654 mime type |application/octet-stream                  |text/plain
                   AssignedTo|kernel-maintainers@forge.provo.novell.com |kkeil@novell.com
https://bugzilla.novell.com/show_bug.cgi?id=337003#c17

Karsten Keil <kkeil@novell.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Status|NEW     |NEEDINFO
      Info Provider|        |bernard.delley@psi.ch

--- Comment #17 from Karsten Keil <kkeil@novell.com> 2008-01-24 07:29:58 MST ---
Can you please check whether the new e1000 7.6.15 driver from sourceforge.net works stably with your onboard card?
https://bugzilla.novell.com/show_bug.cgi?id=337003#c18

Bernard Delley <bernard.delley@psi.ch> changed:

           What    |Removed               |Added
----------------------------------------------------------------------------
             Status|NEEDINFO              |NEW
      Info Provider|bernard.delley@psi.ch |

--- Comment #18 from Bernard Delley <bernard.delley@psi.ch> 2008-01-29 08:12:25 MST ---
I downloaded and compiled that e1000 7.6.15 driver. Encouragingly, e1000_ich9lan and 82566 occur in the source code. I replaced /lib/modules/2.6.22.5-31-default/kernel/drivers/net/e1000/e1000.ko with the new e1000.ko.

I tested this on the DQ35joe board (Intel Q35 chipset, ICH9 family). lspci says:

Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)

mpich2 performance with my little test program is over 1Gb/s while it lasts, but soon the network interface hangs... until my root cron job restarts things:

1,11,21,31,41,51 * * * * /etc/init.d/network restart eth0 > /dev/null 2>& 1
https://bugzilla.novell.com/show_bug.cgi?id=337003#c19

--- Comment #19 from John Ronciak <john.ronciak@intel.com> 2008-01-29 11:07:01 MST ---
Is the board this hang is happening on a pre-production board? What type of traffic and load does the test program generate? What does the output from /proc/interrupts show during the testing?

This isn't the original problem being worked in this BZ, is it? This is something new, right?
https://bugzilla.novell.com/show_bug.cgi?id=337003#c20

--- Comment #20 from John Ronciak <john.ronciak@intel.com> 2008-01-29 21:44:55 MST ---
In addition to the above, can you please supply the output of 'lspci -vv' and 'ethtool -S eth0'.

It's also strange that we can't see the driver loading in the dmesg. Can somebody look at that and get back to us? We were looking for error messages from the driver, which don't appear to be there either.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c21

--- Comment #21 from Bernard Delley <bernard.delley@psi.ch> 2008-01-30 04:05:39 MST ---
Created an attachment (id=192291)
 --> (https://bugzilla.novell.com/attachment.cgi?id=192291)
output of lspci -vv on DQ35JO board

In hindsight, I am not sure that I ever had a true system crash as originally reported. The beowulf node is certainly not reachable from the network when the bug has hit, and originally I rebooted the node.

These are current intel boards with the ich9 southgate. I assume that these are full scale production boards from a most reputed maker. They are shown on http://www.intel.com/products/desktop/motherboard/index.htm as DG33TL and DQ35JO. The next older board I mentioned, which works fine, is the DG965SS, also shown on that page; it features an ich8 southgate.

The small fortran mpi test program used in comment 18 does a series of mpi allreduce operations and records the time used.

I add ethtool, lspci -vv and dmesg outputs for that DQ35JO board.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c22

--- Comment #22 from Bernard Delley <bernard.delley@psi.ch> 2008-01-30 04:06:52 MST ---
Created an attachment (id=192292)
 --> (https://bugzilla.novell.com/attachment.cgi?id=192292)
ethtool -S from DQ35JO board

https://bugzilla.novell.com/show_bug.cgi?id=337003#c23

--- Comment #23 from Bernard Delley <bernard.delley@psi.ch> 2008-01-30 04:10:21 MST ---
Created an attachment (id=192293)
 --> (https://bugzilla.novell.com/attachment.cgi?id=192293)
dmesg on DQ35JO board
https://bugzilla.novell.com/show_bug.cgi?id=337003#c24

--- Comment #24 from John Ronciak <john.ronciak@intel.com> 2008-01-31 15:43:24 MST ---
The ethtool data reported in #22 is actually lspci data.

The dmesg shows that link keeps being acquired but then lost again, over and over. This probably means you have something wrong with your connection, port, cable, switch, etc., or possibly that the HW is bad. Bad HW is highly unlikely. It happens once in a great while, but it is very rare.

With link going down all the time it's very unlikely that you'll be able to get any sort of test running, let alone performance data from the test. Please correct the link problem, which will probably solve the other problems.
https://bugzilla.novell.com/show_bug.cgi?id=337003#c25

Bernard Delley <bernard.delley@psi.ch> changed:

           What                |Removed |Added
----------------------------------------------------------------------------
Attachment #192292 is obsolete |0       |1

--- Comment #25 from Bernard Delley <bernard.delley@psi.ch> 2008-02-06 07:29:55 MST ---
Created an attachment (id=193425)
 --> (https://bugzilla.novell.com/attachment.cgi?id=193425)
ethtool -S from DQ35JO board

Here is the correct file showing the ethtool output.

It's not a cable, switch, etc. problem! I think the dmesg output shows mainly the result of my cron job, which resets the network at 10 min intervals:

crontab -l
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXzZSMVv installed on Wed Dec 19 17:05:16 2007)
# (Cron version V5.0 -- $Id: crontab.c,v 1.12 2004/01/23 18:56:42 vixie Exp $)
1,11,21,31,41,51 * * * * /etc/init.d/network restart eth0 > /dev/null 2>& 1

This ugly kludge saves me walking to the room every time the network has been lost.

There appear to be 'normal' sequences, where the network was reset without need, and others which have "nfs: server fn02 not responding, still trying". The latter actually say that the network was dead and the node could not get through to the nfs server (the server is OK).
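Because the cron job in comment #25 restarts eth0 unconditionally, it also produces link resets that were not needed, which makes the dmesg harder to read. A minimal sketch of a conditional variant follows; it restarts the interface only when a peer stops answering pings. The function names are hypothetical, the choice of fn02 as the probe target is an assumption taken from the nfs message quoted above, and the `ping -c`/`-W` options are those of the Linux iputils ping.

```python
import subprocess


def peer_reachable(host, count=3, timeout_s=2):
    """Return True if `ping` gets at least one reply from `host`."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_interface(iface="eth0"):
    """Run the same restart command the cron job in the thread uses."""
    subprocess.run(["/etc/init.d/network", "restart", iface], check=False)


def watchdog(host, iface="eth0", probe=peer_reachable, restart=restart_interface):
    """Restart `iface` only if `host` is unreachable.

    Returns True when a restart was triggered, False otherwise. `probe` and
    `restart` are injectable so the decision logic can be exercised without
    touching the network.
    """
    if probe(host):
        return False
    restart(iface)
    return True


# Example call from a root cron job (not executed here):
#     watchdog("fn02", iface="eth0")
```

With something like this in place, every link reset that still shows up in dmesg would correspond to a real loss of connectivity, which should make the logs easier to interpret for whoever is debugging the driver.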
https://bugzilla.novell.com/show_bug.cgi?id=337003#c26

--- Comment #26 from John Ronciak <john.ronciak@intel.com> 2008-02-06 09:19:15 MST ---
Comments about the link in #24 still apply. The attachment in #25 shows no errors happening, but a fair number of xon/xoff packets. We normally don't see this many.

Again, the link keeps being lost and regained, which is normally not a NIC/LOM HW issue. Has this been repro'd on another similar system, or is it only this system?
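To compare the xon/xoff counts mentioned in comment #26 between runs or between nodes, the `ethtool -S` output can be reduced to its flow-control counters. The sketch below assumes the usual `name: value` layout of `ethtool -S` and the e1000 counter names `rx_flow_control_xon` and friends; the sample text and its numbers are invented for illustration and are not taken from the attachment.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skips the "NIC statistics:" header and any non-numeric line.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def flow_control_counters(stats):
    """Keep only the pause-frame (xon/xoff) counters."""
    return {name: value for name, value in stats.items() if "flow_control" in name}


# Made-up sample in the assumed format; real values come from `ethtool -S eth0`.
SAMPLE = """NIC statistics:
     rx_packets: 1000
     tx_packets: 900
     rx_errors: 0
     rx_flow_control_xon: 12
     rx_flow_control_xoff: 12
     tx_flow_control_xon: 3
     tx_flow_control_xoff: 3
"""

fc = flow_control_counters(parse_ethtool_stats(SAMPLE))
for name in sorted(fc):
    print(name, fc[name])
```

Taking one snapshot before and one after a test run and subtracting the two would show how many pause frames the test itself generated.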
https://bugzilla.novell.com/show_bug.cgi?id=337003

User bernard.delley@psi.ch added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c27

Bernard Delley <bernard.delley@psi.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #192293|0                           |1
        is obsolete|                            |

--- Comment #27 from Bernard Delley <bernard.delley@psi.ch> 2008-02-07 02:00:38 MST ---
Created an attachment (id=193555)
 --> (https://bugzilla.novell.com/attachment.cgi?id=193555)
dmesg (since reboot) done after network crash on DQ35jo

The cron job resetting eth0 was removed, and a reboot was done on node fn19, one of the DQ35JO nodes. These nodes, together with the DG33TL nodes, all show such network problems with the onboard network chip.

Please note the boot options in dmesg; these are related to bugs 325995 and 328471.

Network testing was done until the network interface crashed. The messages appearing in the terminal windows on the two machines in the test are shown in the next comment.

The network driver is e1000-7.6.15, as suggested in a previous comment.

dmesg was recorded before the network restart. The network restart added these lines to dmesg:

ADDRCONF(NETDEV_UP): eth0: link is not ready
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present
usb 7-1: USB disconnect, address 2
https://bugzilla.novell.com/show_bug.cgi?id=337003

User bernard.delley@psi.ch added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c28

--- Comment #28 from Bernard Delley <bernard.delley@psi.ch> 2008-02-07 02:02:32 MST ---
Created an attachment (id=193556)
 --> (https://bugzilla.novell.com/attachment.cgi?id=193556)
messages during network testing
https://bugzilla.novell.com/show_bug.cgi?id=337003

User john.ronciak@intel.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c29

--- Comment #29 from John Ronciak <john.ronciak@intel.com> 2008-02-07 15:10:56 MST ---
So let us review what is going on here.

1) The link keeps dropping and being regained. The cause is unknown, but this is the crux of the problem, at least I think so.

2) Link loss can be due to lots of reasons. Here are a few questions that have not been answered yet:

q1) Does this happen on other systems or just this one?
q2) Does this happen with other tests like nttcp or iperf?
q3) Do other cables and ports (both switch and NIC) show the same problem?

Above it's stated that "it's not a cable - switch etc problem". How do you know this? Bad cables cause link loss all the time. So do bad ports.
https://bugzilla.novell.com/show_bug.cgi?id=337003

Karsten Keil <kkeil@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |bernard.delley@psi.ch
https://bugzilla.novell.com/show_bug.cgi?id=337003

User bernard.delley@psi.ch added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c30

--- Comment #30 from Bernard Delley <bernard.delley@psi.ch> 2008-05-19 03:15:25 MST ---
Comment 29 missed the message of 27 + 28: the frequent link loss was completely artificial, caused by a cron job doing a network restart (to bring the computer back to a reachable state automatically). 27 + 28 report the basic problem, permanent network loss on mpich2 runs, with the cron network restart removed. So the needed info on the basic problem is in 27 and 28.

New info: the DQ35JO board appears to run correctly with the on-board network when kernel 2.6.24-default is used with SUSE 10.3.

The same fix does not work for the other board in question. The DG33TL board reports that the on-board network is started and immediately afterwards hangs, showing the login prompt line (it is a non-graphical installation).
https://bugzilla.novell.com/show_bug.cgi?id=337003

User john.ronciak@intel.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c31

--- Comment #31 from John Ronciak <john.ronciak@intel.com> 2008-05-19 14:39:33 MST ---
The details in #27 show that only one interface is active, the on-board LOM (ICH9 LAN device). The dmesg does not show that link was lost, however. The other NIC described in #25 is an 82541PI, which has been out in the field for a long time now with no bug reports open on it. This still looks like a link loss problem that has nothing to do with the HW or driver. This still leaves the cable, switch ports and the like as causing the problem.
https://bugzilla.novell.com/show_bug.cgi?id=337003

User bernard.delley@psi.ch added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c32

--- Comment #32 from Bernard Delley <bernard.delley@psi.ch> 2008-05-20 00:58:06 MST ---
NO! It's not a loose contact or the like. And we are talking about 4 machines, two for each mainboard, which show the problems addressed here.

DQ35JO
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connnecti
works now with the 2.6.24 kernel.

DG33TL, when used with the onboard
00:19.0 Ethernet controller: Intel Corporation 82801I (ICH9 Family) Gigabit Ethern
gets into a hang state on use with the SUSE 10.3 default kernel, and hangs on boot with the 2.6.24-default kernel.
==> This is the remaining problem of interest.

On DG33TL I use, for production calculations at reduced performance, the additional
06:00.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)
This older PCI plug-in card works fine, but performance is about half that of the on-board chip.
https://bugzilla.novell.com/show_bug.cgi?id=337003

User john.ronciak@intel.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c33

--- Comment #33 from John Ronciak <john.ronciak@intel.com> 2008-05-20 10:41:33 MST ---
OK, here are some questions, as this whole thing has me very confused as to what is working and what isn't.

1) When you say the 82566DM is now working with the 2.6.24 kernel, do you mean the 2.6.24 running on top of a SuSE install or a SuSE 10.3 upgrade?

2.1) Using the on-board ICH9 LOM, what do you mean when you say it hangs with 2.6.24 or on boot with 10.3? How does it hang? Does the system hang? Just no networking? What are the error messages from the logs? Is there a back trace? Was there a test running when it hangs? If so, what is the test?

2.2) The reports above indicate link loss. Is the link stable or is it still being lost? Lost only when the hang happens?

3) The 82541 is a low-end 32-bit PCI controller. The ICH9 LOM is a PCIe device, so it's no wonder it would perform better. But if it is hanging all the time, how are you able to run performance tests? What tests are you running?

We have no reports of anything like this happening anywhere else. This seems to be your system or environment.
https://bugzilla.novell.com/show_bug.cgi?id=337003

User digidietze@draisberghof.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c34

--- Comment #34 from Josua Dietze <digidietze@draisberghof.de> 2008-07-09 12:30:03 MDT ---
Created an attachment (id=226828)
 --> (https://bugzilla.novell.com/attachment.cgi?id=226828)
Oops of Fedora PAE kernel when uploading microcode
https://bugzilla.novell.com/show_bug.cgi?id=337003

User bernard.delley@psi.ch added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c35

Bernard Delley <bernard.delley@psi.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|bernard.delley@psi.ch       |

--- Comment #35 from Bernard Delley <bernard.delley@psi.ch> 2008-07-10 00:56:04 MDT ---
Reply to #33:

1) I simply added vmlinuz-2.6.24-default and the corresponding lib/modules tree on top of the SuSE 10.3 "text" + enhanced base system installation. The grub menu is modified to boot 2.6.24 by default. This works for the DQ35JO board. I take it as evidence that the drivers in 2.6.24 are OK for the 82566DM interface.

2.1) On the DG33TL board this fix does not work. I did not investigate the crash at boot further, as I do not expect support for running a SuSE system with a different kernel.

2.2) My problem occurs with mpich2 parallel calculations involving several nodes. After a small number of data transfers the link is lost, and it remains lost indefinitely. A restart of the network interface fixes the problem, but requires typing at the console (walk to the basement first) -- or a cron job doing that periodically to spare the walk.

3) With the small mpich2 test job I get a few data transfers, so I can assess performance before the link is lost.

The problem is not only here in Zurich. I got the hint about 2.6.24 from Hamburg, where mpich2 jobs are run on nodes with the same Intel DQ35JO mainboard. (The Hamburg computer programs communicating via mpich2 are not the same as here.)
https://bugzilla.novell.com/show_bug.cgi?id=337003

User kkeil@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c36

Karsten Keil <kkeil@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |bernard.delley@psi.ch

--- Comment #36 from Karsten Keil <kkeil@novell.com> 2008-08-07 09:59:32 MDT ---
So a plain 2.6.24 runs stably on the DQ35JO board but doesn't boot on the DG33TL because of a kernel crash at boot? Do you have details about the crash?
https://bugzilla.novell.com/show_bug.cgi?id=337003

Karsten Keil <kkeil@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P2 - High
https://bugzilla.novell.com/show_bug.cgi?id=337003

User kkeil@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c37

--- Comment #37 from Karsten Keil <kkeil@novell.com> 2008-10-24 08:42:13 MDT ---
If it is the microcode crash from the attachment, what if you boot without loading the microcode module?
https://bugzilla.novell.com/show_bug.cgi?id=337003

Karsten Keil <kkeil@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Critical                    |Major
           Priority|P2 - High                   |P3 - Medium
https://bugzilla.novell.com/show_bug.cgi?id=337003

User kkeil@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337003#c38

Karsten Keil <kkeil@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
      Info Provider|bernard.delley@psi.ch       |
         Resolution|                            |NORESPONSE

--- Comment #38 from Karsten Keil <kkeil@novell.com> 2009-02-06 06:44:36 MST ---
OK, no answer, so I close this now.