[Bug 712958] New: Kernel Problems Under Multi-User Loads

bugzilla_noreply＠novell.com

18 Aug 2011

18:34

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c0

           Summary: Kernel Problems Under Multi-User Loads
    Classification: openSUSE
           Product: openSUSE 11.4
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 11.4
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: drichard@largo.com
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0

I've tried changing and testing every known bottleneck that I can find, and so far nothing has improved our OpenSuse 11.4 server. My belief now is that this is a kernel/scheduling problem of some type. All of my research seems to come back to the kernel.

After about 10-15 concurrent users the server gets extraordinarily slow. On earlier versions of OpenSuse, we were able to get 100-200 users onto hardware that was far less robust.

The first visible sign of a problem is the network. Here are the network stats on an OpenSuse 11.3 server which is being hammered with networking and working well. Note the low number of dropped RX packets.

eth0      Link encap:Ethernet  HWaddr F4:CE:46:C0:EA:A8
          inet addr:128.222.233.237  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1970250698 errors:0 dropped:175 overruns:0 frame:0

On this OpenSuse 11.4 server (which is used to deploy GNOME to thin clients), here is the same information:

eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:67783895 errors:0 dropped:175438 overruns:0 frame:0

The other very visible issue is disk performance. I've noticed:

- If you type in 'vi /etc/hosts' it sits for 3-4 seconds with a blinking cursor before the file appears. I tried it from the console and it's doing the same thing.
- If you change a password for a user, it sits for 3-4 seconds before the prompt comes back.
- If you try to install any software with Yast2 under a user load, the whole server basically crawls. All X events freeze during the install and you have to wait for it to finish.

The server is acting like it would if we were swapping, but that's not the case. top shows barely any load; we have 64GB on the server and only 12 is in use. I installed iotop and it's barely showing a load, and even when iotop is at zero, it still takes multiple seconds for a file to open in vi. When we get back below 10 users, everything immediately gets faster and things work as expected. Something is happening at a low level, and no tools seem to report the failure.

Things we have tried:

- Copied it to a VM instance, and after it had a load the same issues appeared. This more or less rules out the hardware.
- I turned off nscd with the idea that it was slowing down file access; no change in speed.
- I turned off barriers on the ext4 file system; no change.
- The VM instance actually downgrades it to ext3, with the same results, so it seems unrelated to the physical file system.
- Various sysctl.conf settings that people have mentioned, none of which seem to affect this issue.

Kernel is currently: kernel-desktop-2.6.37.6-0.5.1.x86_64

Any tips or ideas are appreciated. Whatever this problem is, it seems like it will prohibit Enterprise use and could cause future SLED problems.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

bugzilla_noreply＠novell.com

18 Aug

18:36

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c1

--- Comment #1 from David Richards <drichard@largo.com> 2011-08-18 18:36:13 UTC ---
Another change discussed in the Fedora/Red Hat bug reports is setting slice_idle to zero on the controller. This is a deeper change that I won't make until I get more of your thoughts.

bugzilla_noreply＠novell.com

20 Aug

07:47

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c2

Jiri Slaby <jslaby@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
                 CC|                            |jslaby@novell.com
       InfoProvider|                            |drichard@largo.com

--- Comment #2 from Jiri Slaby <jslaby@novell.com> 2011-08-20 07:47:24 UTC ---
Just a shot in the dark. Does latencytop show anything useful?

bugzilla_noreply＠novell.com

22 Aug

13:34

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c3

David Richards <drichard@largo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
       InfoProvider|drichard@largo.com          |

--- Comment #3 from David Richards <drichard@largo.com> 2011-08-22 13:34:13 UTC ---
latencytop shows nothing of interest. Each new pass shows processes finishing in just a few milliseconds. I also changed from the cfq to the deadline scheduler (which some have recommended) with no change in performance. Right now with 40 users, it takes nearly 4 seconds for a 20 line file to open.
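When switching I/O schedulers it helps to confirm which one is actually active for the device. On 2.6-era kernels the file /sys/block/<device>/queue/scheduler lists the available schedulers and marks the active one in square brackets, for example "noop deadline [cfq]". The following is a minimal sketch in Python, not something posted in the bug; the function names are hypothetical and the device name has to be supplied by the caller.

```python
from pathlib import Path


def parse_active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if none is marked.

    Example input: 'noop deadline [cfq]'
    """
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def active_scheduler(device):
    """Read the active scheduler for one block device from sysfs.

    `device` is the directory name under /sys/block (an assumption the
    caller must get right; it is not taken from the bug report).
    """
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_active_scheduler(path.read_text())
```

Calling active_scheduler() before and after the change shows whether the switch took effect on the device that backs the root file system.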

bugzilla_noreply＠novell.com

14:23

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c4

Michal Hocko <mhocko@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mhocko@novell.com

--- Comment #4 from Michal Hocko <mhocko@novell.com> 2011-08-22 16:23:56 CEST ---
(In reply to comment #0)
[...]

...

The first visible sign of problem is the network. Here is the network stats on an OpenSuse 11.3 server which is being hammered with networking and working well. Note the low RX packets.

eth0 Link encap:Ethernet HWaddr F4:CE:46:C0:EA:A8 inet addr:128.222.233.237 Bcast:128.222.255.255 Mask:255.255.0.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1970250698 errors:0 dropped:175 overruns:0 frame:0

On this OpenSuse 11.4 server (which is used to deploy GNOME to thin clients), here is the same information:

eth0 Link encap:Ethernet HWaddr 00:1C:C4:93:DF:72 inet addr:128.222.99.243 Bcast:128.222.255.255 Mask:255.255.0.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:67783895 errors:0 dropped:175438 overruns:0 frame:0

You seem to have a _much_ bigger packet drop rate with 11.4 than with 11.3. Does any of your infrastructure rely on the network (ldap...)?
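To put numbers on the difference, the drop rate can be computed from the two RX lines quoted above. This is a minimal sketch in Python, assuming the older net-tools ifconfig layout shown in the report ("RX packets:N errors:N dropped:N"); the function name is hypothetical.

```python
import re

# Matches the RX counter line in the ifconfig format quoted in comment #0.
RX_LINE = re.compile(r"RX packets:(\d+) errors:(\d+) dropped:(\d+)")


def rx_drop_percent(ifconfig_text):
    """Return dropped RX packets as a percentage of RX packets."""
    match = RX_LINE.search(ifconfig_text)
    if match is None:
        raise ValueError("no 'RX packets:' line found")
    packets, _errors, dropped = (int(group) for group in match.groups())
    return 100.0 * dropped / packets


# RX lines copied from comment #0.
suse_11_3 = "RX packets:1970250698 errors:0 dropped:175 overruns:0 frame:0"
suse_11_4 = "RX packets:67783895 errors:0 dropped:175438 overruns:0 frame:0"

for label, line in (("11.3", suse_11_3), ("11.4", suse_11_4)):
    print(f"OpenSuse {label}: {rx_drop_percent(line):.6f}% of RX packets dropped")
```

On these figures the 11.4 server drops roughly a quarter of a percent of what it receives, while the 11.3 server's rate is several orders of magnitude lower.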

...

The other very visible issue is disk performance. I've noticed: - If you type in 'vi /etc/hosts' it sits for 3-4 seconds with a blinking cursor before the file appears. I tried it from the console and it's doing the same thing.

Have you tried to strace/ltrace this command with timing information? (strace -tt -o strace.log vi /etc/hosts)

...

- If you change a password for a user, it sits for 3-4 seconds before the prompt comes back.

This would point to ldap.

bugzilla_noreply＠novell.com

15:53

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c5

--- Comment #5 from David Richards <drichard@largo.com> 2011-08-22 15:53:52 UTC ---
@michal: There is no LDAP involved here at all; we are using local /etc/passwd | shadow for passwords. That makes the delays in writing to disk even stranger. When we get below a certain number of users, it starts working much faster again. It appears to me to be a problem with the interaction between disk and networking.

Looking over the strace for vi-ing a file shows nothing unusual. It reads in all of the configuration settings for vim and then opens the file. Today with 40 users, it's taking a good 4 seconds for the file to open. :\

The one thing that I am going to try is to install the vanilla kernel instead of -desktop. It shouldn't affect us greatly, but maybe something changed in there that is causing this issue. I'm kind of running out of ideas.

bugzilla_noreply＠novell.com

17:07

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c6

--- Comment #6 from David Richards <drichard@largo.com> 2011-08-22 17:07:14 UTC ---
@all: I can provide an ssh tunnel to this server if anyone wants to see it first hand. These conditions (40 people working concurrently) would be impossible to reproduce on a test machine.

bugzilla_noreply＠novell.com

19:09

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c7

Greg Kroah-Hartman <gregkh@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
                 CC|                            |gregkh@novell.com
       InfoProvider|                            |drichard@largo.com

--- Comment #7 from Greg Kroah-Hartman <gregkh@novell.com> 2011-08-22 19:09:40 UTC ---
Can you attach the output of 'hwinfo' here for the machine that is having problems?

Is a different network controller involved than on the other "working" 11.3 machines? Having so many dropped packets is worrisome. Any chance it's just a flaky cable/router, or some other hardware-related issue?

bugzilla_noreply＠novell.com

19:27

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c8

David Richards <drichard@largo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
       InfoProvider|drichard@largo.com          |

--- Comment #8 from David Richards <drichard@largo.com> 2011-08-22 19:27:51 UTC ---
@Greg: Attaching hwinfo. The other servers are running similar hardware.

Here is some more information: We believe the networking problem is in the operating system because we duped the physical machine and moved it to VMware. The VM server is using Intel cards vs the physical server which is running Broadcom. On the VM:

eth0      Link encap:Ethernet  HWaddr 00:50:56:8B:00:05
          inet addr:128.222.99.242  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:24185635 errors:0 dropped:548189 overruns:0 frame:0

So it's doing the same thing as the physical machine, dropping tons of packets.

Back to the disk IO issue, I found something very odd. I'm seeing speed differences based on the sub-directory, even though we only have one partition of / !

- If I vi /etc/hosts it takes about 4-5 seconds to open.
- I then copied the hosts file into various folders. If I vi /tmp/hosts it opens immediately.
- If I vi /home/hosts, it opens in about 2 seconds.
- I then placed it on an NFS mount and that takes about 2 seconds to open.

This is just crazy:

/dev/disk/by-id/cciss-3600508b100184a395358363055330002-part1 swap swap defaults 0 0
/dev/disk/by-id/cciss-3600508b100184a395358363055330002-part2 / ext4 acl,user_xattr,barrier=0 1 1

/etc, /tmp and /home are on the same physical mount.
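One way to take the editor out of the picture is to time the bare file operations in each directory. The strace output later in the thread shows vi creating its swap file with O_RDWR|O_CREAT|O_EXCL and mode 0600, writing 4096 bytes, then closing and unlinking it, so the sketch below repeats that sequence. It is a minimal Python sketch, not something that was run on the affected server; the function name and the probe file name are hypothetical.

```python
import os
import time


def time_create_unlink(directory, name=".probe-712958.tmp"):
    """Time an exclusive create, a 4 KiB write, close and unlink in `directory`.

    Returns a dict of elapsed seconds per step. The file name is a
    hypothetical placeholder; pick one that cannot clash with real files.
    """
    path = os.path.join(directory, name)
    timings = {}

    start = time.perf_counter()
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    timings["open"] = time.perf_counter() - start

    try:
        start = time.perf_counter()
        os.write(fd, b"\0" * 4096)
        timings["write"] = time.perf_counter() - start
    finally:
        # Always close and remove the probe file, even if the write failed.
        start = time.perf_counter()
        os.close(fd)
        timings["close"] = time.perf_counter() - start

        start = time.perf_counter()
        os.unlink(path)
        timings["unlink"] = time.perf_counter() - start

    return timings
```

Running it as root against /etc, /tmp and /home while users are logged in would show whether the per-directory difference is already present at the system call level.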

bugzilla_noreply＠novell.com

19:28

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c9

--- Comment #9 from David Richards <drichard@largo.com> 2011-08-22 19:28:44 UTC ---
Created an attachment (id=447027)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447027)
Output from hwinfo

bugzilla_noreply＠novell.com

19:56

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c10

Greg Kroah-Hartman <gregkh@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
       InfoProvider|                            |drichard@largo.com

--- Comment #10 from Greg Kroah-Hartman <gregkh@novell.com> 2011-08-22 19:56:31 UTC ---
(In reply to comment #9)

...

Created an attachment (id=447027) --> (http://bugzilla.novell.com/attachment.cgi?id=447027) [details] Output from hwinfo

Are you sure? It usually starts out looking like:

    ============ start debug info ============
    libhd version 18.3u (x86-64) using /var/lib/hardware
    ..

and then ends with the pci information you provided.

Care to try attaching the whole thing? Perhaps it needs to be compressed to let bugzilla accept it?

bugzilla_noreply＠novell.com

20:06

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c11

--- Comment #11 from David Richards <drichard@largo.com> 2011-08-22 20:06:12 UTC ---
Created an attachment (id=447034)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447034)
Full Dump Of hwinfo

bugzilla_noreply＠novell.com

20:14

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c12

David Richards <drichard@largo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
       InfoProvider|drichard@largo.com          |

--- Comment #12 from David Richards <drichard@largo.com> 2011-08-22 20:14:39 UTC ---
We are going to turn off AppArmor tonight at 5pm and see if that affects performance.

bugzilla_noreply＠novell.com

23 Aug

01:42

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c13

Greg Kroah-Hartman <gregkh@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
       InfoProvider|                            |drichard@largo.com

--- Comment #13 from Greg Kroah-Hartman <gregkh@novell.com> 2011-08-23 01:42:19 UTC ---
If AppArmor doesn't change anything, could you try unloading the ipmi_si kernel module (unless you are really using the IPMI interface, are you?)

Also, you have 2 network devices here. Might there be some odd type of routing issue that gets packets confused?

Yeah, I'm grasping :)

Oh, one other "fun" thing you might try: could you use the latest 3.0 kernel version that we have in Kernel:stable? I understand if you can't do that (note, install it and don't upgrade it, keeping both kernels on the machine so you can switch easily at boot time.)

If that happens to resolve the issue, we have a better chance of being able to track the issue down through bisection and elimination.

bugzilla_noreply＠novell.com

13:51

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c14

--- Comment #14 from David Richards <drichard@largo.com> 2011-08-23 13:51:05 UTC ---
@greg: AppArmor was unloaded last night and made no change to the performance.

I had to Google IPMI, so I guess we aren't using it. :) I modprobe -r'd the module and dmesg reports the module unloaded. No change. We have 41 users logged in today and it's taking almost 5 seconds for /etc/hosts to open with vi.

I mentioned yesterday that when I cp /etc/hosts to /tmp/hosts and then vi that file it opens immediately. If I cp /etc/hosts to /etc/hosts2 and then vi /etc/hosts2 it's still slow. It's the oddest thing I have ever seen. When we get 2 users in Nautilus, the server just comes to a crawl and freezes. There is definitely something wrong with disk access.

We have many servers with two NICs and have never seen this problem. Our older GNOME servers are configured in the same manner. We have Evolution running on SLED and it's configured with two networks too.

The network RX dropped count continues to grow. When I check about every 1-2 seconds it usually goes up by at least one, sometimes more.

eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:141457074 errors:0 dropped:537878 overruns:0 frame:0

We will download, install and boot the 3.0 kernel. I'll try to schedule that tonight when the users are off.
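Because the counters are cumulative, the drop rate over the days between comment #0 and comment #14 can be computed from the differences of the two eth0 readings. This is a minimal sketch in Python; it assumes the counters were not reset (for example by a reboot) between the two readings, which the thread does not state, and the function name is hypothetical.

```python
def interval_drop_percent(earlier, later):
    """Percentage of RX packets dropped between two (packets, dropped) samples."""
    packets = later[0] - earlier[0]
    dropped = later[1] - earlier[1]
    if packets <= 0:
        raise ValueError("later sample must have more RX packets than the earlier one")
    return 100.0 * dropped / packets


# (RX packets, dropped) for eth0 on the 11.4 server.
comment_0 = (67783895, 175438)     # reading from comment #0
comment_14 = (141457074, 537878)   # reading from comment #14

print(f"{interval_drop_percent(comment_0, comment_14):.3f}% dropped in the interval")
```

Under that assumption, a bit under half a percent of the packets received in the interval were dropped, which is a higher rate than the cumulative figure from comment #0 alone suggests.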

bugzilla_noreply＠novell.com

14:01

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c15

--- Comment #15 from Michal Hocko <mhocko@novell.com> 2011-08-23 16:01:22 CEST ---
I think that having the strace -tt output attached can still be valuable. Could you attach it please?

bugzilla_noreply＠novell.com

14:13

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c16

--- Comment #16 from David Richards <drichard@largo.com> 2011-08-23 14:13:27 UTC ---
@all: I appreciate all of your expertise in this area; I don't open bugs until we are stumped.

Attaching the strace after this message. vi of /etc/services is slow too. It's just crazy that certain areas of the disk are hammered and others not, when they are all mounted in / !

bugzilla_noreply＠novell.com

14:14

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c17

--- Comment #17 from David Richards <drichard@largo.com> 2011-08-23 14:14:09 UTC ---
Created an attachment (id=447195)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447195)
strace of opening /etc/hosts with vi

bugzilla_noreply＠novell.com

14:27

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c18

--- Comment #18 from Michal Hocko <mhocko@novell.com> 2011-08-23 16:27:09 CEST ---
(In reply to comment #17)

...

Created an attachment (id=447195) --> (http://bugzilla.novell.com/attachment.cgi?id=447195) [details] strace of opening /etc/hosts with vi

Could you run strace with the -tt -f parameters, so we get the timing information and also catch all forked processes, please?

Could you also give us /proc/mounts?

Could you also run vmstat 1 and capture the output while you are accessing the file that shows the issue?

bugzilla_noreply＠novell.com

14:41

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c19

--- Comment #19 from David Richards <drichard@largo.com> 2011-08-23 14:41:16 UTC ---
Created an attachment (id=447199)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447199)
strace -o /tmp/vietchostsfull.out -tt -f vi /etc/hosts

Looks like 9 seconds in the dump, but that is longer than when invoked without strace. Clearly strace has some overhead.

bugzilla_noreply＠novell.com

14:42

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c

David Richards <drichard@largo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #447199|application/octet-stream    |text/plain
          mime type|                            |

bugzilla_noreply＠novell.com

14:44

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c20

--- Comment #20 from David Richards <drichard@largo.com> 2011-08-23 14:44:47 UTC ---
Created an attachment (id=447200)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447200)
Contents of proc mounts. Local drives + NFS + Share + GVFS for users

bugzilla_noreply＠novell.com

14:49

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c21

--- Comment #21 from David Richards <drichard@largo.com> 2011-08-23 14:49:14 UTC ---
I started vmstat; the slow open is visible where cpu/sy gets to 23. When it returns to 2, that's when I :q'd out of vi.

vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b swpd     free   buff   cache si so bi  bo   in   cs us sy id wa st
 4  0    0 48967620 257616 9364624  0  0  0   2    3    3  1  1 98  0  0
 0  0    0 48967356 257616 9364628  0  0  0 112 2884 1355  1  7 92  0  0
 0  0    0 48967356 257616 9364624  0  0  0   0 1936 1083  0  5 94  0  0
 3  0    0 48966356 257616 9364624  0  0  0   0 2873 1472  1  9 90  0  0
 9  0    0 48965852 257616 9364624  0  0  0   0 4480 2233  1 23 76  0  0
 3  0    0 48965852 257616 9364624  0  0  0   0 4415 2226  1 24 75  0  0
 1  0    0 48965836 257616 9364624  0  0  0   0 5186 2205  1 26 73  0  0
 1  0    0 48966472 257616 9364624  0  0  0  88 4186 1974  1 19 79  0  0
 0  0    0 48965860 257616 9364624  0  0  0 344 1128  516  0  2 98  0  0
 1  0    0 48965860 257616 9364624  0  0  0   0 2272 1064  1  5 94  0  0
 1  0    0 48964436 257616 9364624  0  0  4   4 6569 1971  1 28 70  0  0
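In the samples above, I/O wait (wa) stays at 0 while system time (sy) climbs during the slow open, which is easier to see once the rows are parsed. This is a minimal sketch in Python that picks out the busy samples; the function names and the threshold of 15 are arbitrary assumptions, and the column list matches the 17-column header shown in this comment.

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat(text):
    """Turn `vmstat 1` output into a list of dicts, skipping header lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have one all-numeric field per column; headers do not.
        if len(fields) == len(VMSTAT_COLUMNS) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows


def busy_samples(rows, sy_threshold=15):
    """Return indexes of samples whose system CPU time meets the threshold."""
    return [i for i, row in enumerate(rows) if row["sy"] >= sy_threshold]


# Four of the rows posted in comment #21.
sample = """\
 r  b swpd     free   buff   cache si so bi  bo   in   cs us sy id wa st
 0  0    0 48967356 257616 9364624  0  0  0   0 1936 1083  0  5 94  0  0
 9  0    0 48965852 257616 9364624  0  0  0   0 4480 2233  1 23 76  0  0
 1  0    0 48965836 257616 9364624  0  0  0   0 5186 2205  1 26 73  0  0
 0  0    0 48965860 257616 9364624  0  0  0 344 1128  516  0  2 98  0  0
"""

rows = parse_vmstat(sample)
for index in busy_samples(rows):
    print(index, rows[index]["sy"], rows[index]["wa"])
```

Time spent in the kernel with no I/O wait is at least consistent with the lock contention guess made later in the thread, though vmstat alone cannot confirm it.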

bugzilla_noreply＠novell.com

14:54

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c22

--- Comment #22 from Jiri Slaby <jslaby@novell.com> 2011-08-23 14:54:57 UTC ---
(In reply to comment #19)

...

Created an attachment (id=447199) --> (http://bugzilla.novell.com/attachment.cgi?id=447199) [details] strace -o /tmp/vietchostsfull.out -tt -f vi /etc/hosts

It looks like *everything* is slow. Could you attach strace -tt -T -f output?

My guess is that some lock (BTM?) is contended. Could you also attach sysrq-t output?

bugzilla_noreply＠novell.com

16:11

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c23

--- Comment #23 from David Richards <drichard@largo.com> 2011-08-23 16:11:52 UTC ---
Created an attachment (id=447224)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447224)
Showing cpus/hypercores.

Attaching all the information that I can think of. This server is 4 CPU x 4 core. Top rarely/never shows more than a few percent busy.

bugzilla_noreply＠novell.com

16:16

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c24

--- Comment #24 from David Richards <drichard@largo.com> 2011-08-23 16:16:25 UTC ---
Created an attachment (id=447225)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447225)
strace -o /tmp/vietchostsmore.out -tt -f -T vi /etc/hosts

With -T flag in strace

bugzilla_noreply＠novell.com

16:31

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c25

--- Comment #25 from David Richards <drichard@largo.com> 2011-08-23 16:31:30 UTC ---
@Jiri: I had to google sysrq-t and think I did what you are asking. I echo'd a t to /proc/sysrq-trigger and it threw lines into /var/log/messages. Attaching the output, bzip'd.

bugzilla_noreply＠novell.com

16:32

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c26

--- Comment #26 from David Richards <drichard@largo.com> 2011-08-23 16:32:07 UTC ---
Created an attachment (id=447226)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447226)
sysrq-trigger was passed a "t" and this was dumped.

bugzilla_noreply＠novell.com

16:45

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c27

--- Comment #27 from David Richards <drichard@largo.com> 2011-08-23 16:45:52 UTC ---
@all: Also of interest, comparing the old GNOME server with the new one reveals that many more files are now open. I'll attach the lsof output from OpenSuse 11.4.

OpenSuse 10.2, 140 users logged in
lsof | wc -l
240444

OpenSuse 11.4, 40 users logged in
lsof | wc -l
287837
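Normalised per logged-in user, the two lsof counts above are further apart than the raw totals suggest. This is a minimal sketch of the arithmetic in Python; the function name is hypothetical. Note that lsof output also includes entries such as mapped libraries and working directories, so a line is not the same thing as a distinct open file.

```python
def lsof_lines_per_user(lsof_lines, users):
    """Average number of lsof output lines per logged-in user."""
    return lsof_lines / users


old_server = lsof_lines_per_user(240444, 140)  # OpenSuse 10.2, 140 users
new_server = lsof_lines_per_user(287837, 40)   # OpenSuse 11.4, 40 users

print(f"OpenSuse 10.2: {old_server:.0f} lines per user")
print(f"OpenSuse 11.4: {new_server:.0f} lines per user")
print(f"ratio: {new_server / old_server:.1f}x")
```

That works out to roughly four times as many lsof entries per user on the 11.4 server.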

bugzilla_noreply＠novell.com

16:46

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c28

--- Comment #28 from David Richards <drichard@largo.com> 2011-08-23 16:46:42 UTC ---
Created an attachment (id=447231)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447231)
Output of lsof on OpenSuse 11.4, 40 users logged into GNOME desktop.

bugzilla_noreply＠novell.com

17:12

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c29

--- Comment #29 from Jiri Slaby <jslaby@novell.com> 2011-08-23 17:12:14 UTC ---
(In reply to comment #24)

...

Created an attachment (id=447225) --> (http://bugzilla.novell.com/attachment.cgi?id=447225) [details] strace -o /tmp/vietchostsmore.out -tt -f -T vi /etc/hosts

Interesting times (100x slower than on my machine):

open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.411674>
..
open("/etc/.hosts.swpx", O_RDWR|O_CREAT|O_EXCL, 0600) = 5 <0.339978>
..
close(5) = 0 <0.461197>
unlink("/etc/.hosts.swpx") = 0 <0.404862>
close(4) = 0 <0.313377>
unlink("/etc/.hosts.swp") = 0 <0.407398>
..
open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.445237>
..
write(4, "..."..., 4096) = 4096 <0.417035>
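Pulling the slowest calls out of a strace -T log can be scripted. This is a minimal sketch in Python, assuming the line layout shown in this thread (an optional PID and -tt timestamp, then the call, the return value and the elapsed time in angle brackets); the function name is hypothetical, and split "unfinished"/"resumed" lines from -f are simply skipped.

```python
import re

# Matches e.g. 'open("/etc/hosts", O_RDONLY) = 3 <0.000467>' with an
# optional 'PID HH:MM:SS.usec ' prefix as produced by strace -f -tt.
STRACE_LINE = re.compile(
    r"^(?:\d+\s+)?(?:\d{2}:\d{2}:\d{2}\.\d+\s+)?"
    r"(?P<call>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+).*<(?P<secs>\d+\.\d+)>\s*$"
)


def slowest_calls(lines, count=5):
    """Return (seconds, syscall, args) tuples for the slowest successful calls."""
    found = []
    for line in lines:
        match = STRACE_LINE.match(line.strip())
        # Skip lines that do not parse and calls that returned an error.
        if match is None or int(match["ret"]) < 0:
            continue
        found.append((float(match["secs"]), match["call"], match["args"]))
    return sorted(found, reverse=True)[:count]


# A few lines copied from the strace output discussed in this thread.
example = [
    'open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.411674>',
    'close(5) = 0 <0.461197>',
    'unlink("/etc/.hosts.swpx") = 0 <0.404862>',
    '24197 12:13:07.949504 open("/etc/hosts", O_RDONLY) = 3 <0.000467>',
]

for secs, call, args in slowest_calls(example, count=3):
    print(f"{secs:.6f} {call}({args})")
```

Unlike a filter on open() alone, this ranks every system call, so slow close(), unlink() and write() calls such as the ones listed above show up as well.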

...

sysrq-trigger was passed a "t" and this was dumped.

This is not very useful; it's full of scheduler stats. Could you run

    # dmesg -c -s 1000000

and then

    # echo t > /proc/sysrq-trigger
    # dmesg > log

and attach log?

bugzilla_noreply＠novell.com

17:49

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c30

--- Comment #30 from David Richards <drichard@largo.com> 2011-08-23 17:49:09 UTC ---
Created an attachment (id=447246)
 --> (http://bugzilla.novell.com/attachment.cgi?id=447246)
Output of dmesg after -c -s and then sysrq-trigger to t

bugzilla_noreply＠novell.com

24 Aug

12:51

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c31

--- Comment #31 from David Richards <drichard@largo.com> 2011-08-24 12:51:08 UTC ---
@all: We did not install the 3.0 kernel last night because it seemed like we were getting good data out of the current one. Does everyone feel we should upgrade? I can limp along a few more days if this information is useful. Otherwise, I'll upgrade and reboot tomorrow.

bugzilla_noreply＠novell.com

12:51

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c

David Richards <drichard@largo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
       InfoProvider|drichard@largo.com          |

bugzilla_noreply＠novell.com

13:26

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c32

Michal Hocko <mhocko@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
       InfoProvider|                            |drichard@largo.com

--- Comment #32 from Michal Hocko <mhocko@novell.com> 2011-08-24 15:26:04 CEST ---
(In reply to comment #30)

...

Created an attachment (id=447246) --> (http://bugzilla.novell.com/attachment.cgi?id=447246) [details] Output of dmesg after -c -s and then sysrq-trigger to t

This still doesn't contain anything useful (any traces):
$ head bug-712958_dmesg.out
000000
[631329.878618] .se->statistics.exec_max : 1.528704
[631329.878618] .se->statistics.slice_max : 1.122945
[631329.878618] .se->statistics.wait_max : 2.041506
[631329.878618] .se->statistics.wait_sum : 3.526940
[631329.878618] .se->statistics.wait_count : 30
[631329.878618] .se->load.weight : 512

Anyway, as Jiri already pointed out, a file open taking almost half a second looks really suspicious. Let's have a look at the worst (successful) file open times:
$ grep "\<open\>" bug-712958_vietchostsmore.log | grep -v "= -1" | sed 's/.*<\(.*\)>/\1/' | sort | tail -n5
0.339978
0.379258
0.403296
0.411674
0.445237

Which are those files?
$ grep "\<open\>" bug-712958_vietchostsmore.log | grep -v "= -1" | sed 's/.*<\(.*\)>/\1/' | sort | tail -n5 > worst_open
$ grep -f worst_open bug-712958_vietchostsmore.log
24197 12:13:07.959485 open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.411674>
24197 12:13:08.382062 open("/etc/.hosts.swpx", O_RDWR|O_CREAT|O_EXCL, 0600) = 5 <0.339978>
24197 12:13:10.370000 open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.445237>
24197 12:13:14.776717 open("/etc/4913", O_WRONLY|O_CREAT|O_EXCL, 0100644) = 3 <0.379258>
24197 12:13:17.044206 open("/etc/hosts", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 3 <0.403296>

Let's have a look at all file opens from /etc:
$ grep "open.*etc" bug-712958_vietchostsmore.log
24197 12:13:07.042719 open("/etc/ld.so.cache", O_RDONLY) = 3 <0.002339>
24197 12:13:07.465060 open("/etc/vimrc", O_RDONLY) = 3 <0.000014>
24197 12:13:07.789405 open("/etc/nsswitch.conf", O_RDONLY) = 3 <0.000467>
24197 12:13:07.812897 open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000534>
24197 12:13:07.841753 open("/etc/ld.so.cache", O_RDONLY) = 3 <0.001435>
24197 12:13:07.907670 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3 <0.000288>
24197 12:13:07.949504 open("/etc/hosts", O_RDONLY) = 3 <0.000467>
24197 12:13:07.955735 open("/etc/.hosts.swp", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000703>
24197 12:13:07.959485 open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.411674>
24197 12:13:08.378219 open("/etc/.hosts.swpx", O_RDONLY) = -1 ENOENT (No such file or directory) <0.001717>
24197 12:13:08.382062 open("/etc/.hosts.swpx", O_RDWR|O_CREAT|O_EXCL, 0600) = 5 <0.339978>
24197 12:13:10.370000 open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.445237>
24197 12:13:11.601865 open("/etc/hosts", O_RDONLY) = 3 <0.000019>
24197 12:13:14.776717 open("/etc/4913", O_WRONLY|O_CREAT|O_EXCL, 0100644) = 3 <0.379258>
24197 12:13:17.044206 open("/etc/hosts", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 3 <0.403296>

...

From this, it looks like any write access is terribly slow (O_RDWR or O_WRONLY). I guess that /etc/ is on your root partition, which seems to be ext4? Have you formatted the filesystem from scratch or migrated it from ext3? Is the filesystem almost full?

You said that if you copied the file to /tmp then it opens just fine. Is /tmp on the same root partition or is it a link to tmpfs which you have mounted?
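The pipeline in comment #32 prints only the timings and then needs a second grep to recover the file names. As a minimal sketch, the two steps can be combined so that each timing is printed next to its path. The helper name `slowest_opens` and the default of five results are assumptions made for illustration, and the log is assumed to be `strace -T` output in the layout quoted above.

```shell
#!/bin/sh
# Sketch: print "<seconds> <path>" for the N slowest successful open() calls
# in an strace -T log. slowest_opens is a hypothetical helper name.
slowest_opens() {
    log=$1
    n=${2:-5}   # number of entries to show (assumed default: 5)
    # 1. keep open() lines, 2. drop failed calls, 3. extract time and path,
    # 4. sort numerically, 5. keep the N largest.
    grep -E '(^|[[:space:]])open\(' "$log" \
        | grep -v '= -1' \
        | sed -n 's/.*open("\([^"]*\)".*<\([0-9.]*\)>[[:space:]]*$/\2 \1/p' \
        | LC_ALL=C sort -n \
        | tail -n "$n"
}
```

Called as `slowest_opens bug-712958_vietchostsmore.log 5`, this should list the same five timings as above, each followed by the file that was opened.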

bugzilla_noreply＠novell.com

13:46

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c33

--- Comment #33 from David Richards <drichard@largo.com> 2011-08-24 13:46:10 UTC ---
The history of this server was that OpenSuse 11.3 was installed with one big / file system formatted as ext4. When OpenSuse 11.4 was released, we did an upgrade. /tmp and /etc are both under /, and the same physical drive.

desktop:~ # df -h
Filesystem                Size  Used Avail Use% Mounted on
rootfs                    250G   15G  222G   7% /
devtmpfs                   32G  160K   32G   1% /dev
tmpfs                      32G  1.7M   32G   1% /dev/shm
/dev/cciss/c0d0p2         250G   15G  222G   7% /
128.222.75.50:/largo      1.7T  832G  729G  54% /home/largo
128.222.75.50:/users      1.7T  832G  729G  54% /users
//192.168.47.33/pdshare/  317G  139G  179G  44% /home/largo/tmp/pdshare

I'm not sure if the slow disk problem is causing the networking problem or it's another separate issue. The prospect of blowing the server and doing a fresh 11.4 install and reloading all of this is possible, but not pleasant. :) I'd hate to do all of that work and have this issue resurface again. :\

bugzilla_noreply＠novell.com

15:45

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c34

--- Comment #34 from Jiri Slaby <jslaby@novell.com> 2011-08-24 15:45:40 UTC ---
(In reply to comment #32)

...

(In reply to comment #30)

...
Created an attachment (id=447246) --> (http://bugzilla.novell.com/attachment.cgi?id=447246) [details] [details] Output of dmesg after -c -s and then sysrq-trigger to t

This still doesn't contain anything useful (any traces): $ head bug-712958_dmesg.out .000000 [631329.878618] .se->statistics.exec_max : 1.528704 [631329.878618] .se->statistics.slice_max : 1.122945

Why is this crap a part of sysrq-t output at all?

The only option here is to increase the log buffer. It can be done by the log_buf_len kernel option. Something like log_buf_len=1000000. So if you are going to reboot, pass this in. Then, when you encounter this weird behaviour, repeat the process with dmesg+sysrq-trigger.

Thanks.
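The procedure requested in this thread (boot with a larger log buffer, then clear the buffer, trigger the task dump and save it) can be sketched as follows. Both function names are hypothetical, the capture step must be run as root, and nothing is executed until the functions are called.

```shell
#!/bin/sh
# Sketch of the capture procedure from this thread. Function names are
# hypothetical; capture_task_dump needs root and a kernel with sysrq enabled.

# Print the log_buf_len value found in a kernel command line string, if any.
log_buf_len_from_cmdline() {
    for word in $1; do
        case $word in
            log_buf_len=*) printf '%s\n' "${word#log_buf_len=}" ;;
        esac
    done
}

# Clear the ring buffer, request a task dump (sysrq "t"), save the result.
capture_task_dump() {
    out=${1:-log}                           # output file, "log" as in the thread
    dmesg -c > /dev/null || return 1        # discard old messages
    echo t > /proc/sysrq-trigger || return 1
    dmesg -s 1000000 > "$out"               # read back with a large buffer
}
```

After the reboot, `log_buf_len_from_cmdline "$(cat /proc/cmdline)"` shows whether the option was actually passed; `capture_task_dump log` then produces the file to attach.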

bugzilla_noreply＠novell.com

16:22

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c35

--- Comment #35 from David Richards <drichard@largo.com> 2011-08-24 16:22:48 UTC ---
@all: We had a meeting today and the plan is to make a VM of the current machine for our beta testers. We'll move them to the VM and then we are going to do a clean install of 11.4 from scratch and then reload all of the customizations. We are going to pull the drives out of the RAID array and force them into new positions to 100% ensure a clean wipe. I know that sometimes an upgrade doesn't always work exactly like a clean install, so possibly something just didn't quite upgrade correctly.

If anyone has any thoughts about changes we should make or techniques to use, let me know. I guess we'll continue to use ext4. I should be able to get 200+ users easily on this hardware when it's working correctly.

bugzilla_noreply＠novell.com

25 Aug

13:25

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c36

--- Comment #36 from David Richards <drichard@largo.com> 2011-08-25 13:25:06 UTC ---
@all: Confirmed, complete reformat and reload of OS 11.4 is now scheduled for next Tuesday. We are hopeful that something failed in going from 11.3 -> 11.4.

bugzilla_noreply＠novell.com

13:29

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c37

--- Comment #37 from Michal Hocko <mhocko@novell.com> 2011-08-25 15:29:50 CEST ---
(In reply to comment #36)

...

@all: Confirmed, complete reformat and reload of OS 11.4 is now scheduled for next Tuesday. We are hopeful that something failed in going from 11.3 -> 11.4.

I would rather start with the / partition reformatting (with back up of course). It would be good to rule out any fs related issue.

bugzilla_noreply＠novell.com

13:37

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c38

--- Comment #38 from David Richards <drichard@largo.com> 2011-08-25 13:37:49 UTC ---
@michal: I'm not sure I would know the best way to do that. Would you maybe dd off the whole disk image to another external disk, and then boot a stand alone distro and then format the drive and then dd it all back? Something like Clonezilla won't work because I believe it would bring back the failure.

I have mentioned already too that when we made a physical to virtual copy and put it in VMware, the same types of issues are seen. VMware downgrades ext4 to ext3.

bugzilla_noreply＠novell.com

14:29

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c39

--- Comment #39 from Michal Hocko <mhocko@novell.com> 2011-08-25 16:29:49 CEST ---
I would just boot from a live CD, mount the / partition somewhere, tar it up, umount, format, mount it back and untar it back. Maybe there is some handy tool for that, I do not know. This is the way I would do it.

Btw. having an image of the partition could be useful for further offline debugging if the problem disappears by reformatting.
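The sequence in comment #39 (mount, tar up, umount, format, mount, untar) might look roughly like the commands printed by the sketch below. It only prints the commands and runs none of them. The function name and all four arguments are hypothetical, and the sketch deliberately ignores details a real run would have to handle, such as extended attributes and any bootloader or /etc/fstab entries that refer to the old filesystem UUID.

```shell
#!/bin/sh
# Sketch: print (do not run) a tar / reformat / untar sequence for one partition.
# print_reformat_plan and its arguments are hypothetical names for illustration.
print_reformat_plan() {
    dev=$1      # partition holding /, e.g. /dev/cciss/c0d0p2
    fstype=$2   # target filesystem type, e.g. ext3 or ext4
    mnt=$3      # temporary mount point on the live system
    backup=$4   # tar archive located on a different disk
    cat <<EOF
mount $dev $mnt
tar --numeric-owner -cpf $backup -C $mnt .
umount $mnt
mkfs -t $fstype $dev
mount $dev $mnt
tar --numeric-owner -xpf $backup -C $mnt
umount $mnt
EOF
}
```

For example, `print_reformat_plan /dev/cciss/c0d0p2 ext3 /mnt/root /backup/root.tar` prints the seven steps for review before anything is typed at a live-CD prompt; the mount point and archive path here are made up.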

bugzilla_noreply＠novell.com

30 Aug

14:41

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c40

--- Comment #40 from David Richards <drichard@largo.com> 2011-08-30 14:41:29 UTC ---
@all: We created a Clonezilla of the OS 11.4 image in case we need to debug or test the older version. So today we pulled the disk drives from the RAID array and reseated them, deleted and recreated a new RAID 1+0 array (which has proven to work well on multi-user servers). We then formatted as ext3 and reloaded OS 11.4. It's going to take me all afternoon to restore all of the customizations and test them; hopefully we can move users back on this hardware tomorrow.

One issue of note, whatever is wrong with the networking is back already on a fresh install, fully patched from the Internet and then rebooted:

linux-gsss:~ # ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:33726 errors:0 dropped:358 overruns:0 frame:0

Now to re-install everything.

bugzilla_noreply＠novell.com

31 Aug

21:06

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c41

--- Comment #41 from David Richards <drichard@largo.com> 2011-08-31 21:06:32 UTC ---
@all: Update, server is fully reloaded and we will have our first real loads tomorrow. Right now everything is very fast, which is normal with just a few users. Networking is still having the same problems as before. If the disk issue is better with the reinstall and downgrade to ext3 I can focus on networking.

linux-gsss:/ # ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5714698 errors:0 dropped:20439 overruns:0 frame:0

bugzilla_noreply＠novell.com

1 Sep

13:13

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c42

David Richards <drichard@largo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
       InfoProvider|drichard@largo.com          |

--- Comment #42 from David Richards <drichard@largo.com> 2011-09-01 13:13:50 UTC ---
@all: Very sadly I must report that the issue with disk and networking performance has not gone away after the server was completely reformatted and reloaded. I should be able to get 200 users on this hardware. The symptoms are identical.

Files that are sitting in /etc have very slow disk access speeds. With 26 users, vi /etc/services blinks for 2 seconds before the file opens. If I copy the file to /tmp or /home, it opens immediately. Tested on other servers with heavier users loads, running SLED 11 and OS 11.3 and this isn't seen at all. I'm sure if we get to our peak loads of 40, that it's going to behave as it did before and be painfully slow.

There are two network cards activated just as before, and both are behaving as before. Lots of dropped RX packets. eth0 is the internal network, and eth1 goes out to the Internet. We use this same configuration on all servers with no similar issues on earlier versions of OpenSuse.

Does everyone agree that we should move to the 3.0 kernel? I can't even imagine what could be causing these issues.

eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11322212 errors:0 dropped:70978 overruns:0 frame:0
          TX packets:11728443 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4811162089 (4588.2 Mb)  TX bytes:9917306091 (9457.8 Mb)
          Interrupt:16 Memory:f8000000-f8012800

eth1      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:74
          inet addr:172.23.1.235  Bcast:172.23.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:70739 errors:0 dropped:238 overruns:0 frame:0
          TX packets:7314 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:15062173 (14.3 Mb)  TX bytes:760390 (742.5 Kb)
          Interrupt:17 Memory:fa000000-fa01280

bugzilla_noreply＠novell.com

19:11

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c43

Bjørn Lie <bjorn.lie@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bjorn.lie@gmail.com

--- Comment #43 from Bjørn Lie <bjorn.lie@gmail.com> 2011-09-01 21:11:00 CEST ---
I'm not a kernel-dev, but are you still running on kernel-desktop and not kernel-default?

bugzilla_noreply＠novell.com

23:20

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c44

--- Comment #44 from Greg Kroah-Hartman <gregkh@suse.com> 2011-09-01 23:20:23 UTC ---
Yes, trying 3.0 would be great, but the number of dropped receive packets is worrisome, and might be indicative of some other type of interrupt routing issue that might be causing the disk delay as well.

This hardware did work fine on 11.3, right? So that would kind of rule out the hardware problem, very strange.

Yes, trying 3.0 would be a nice test if you can do it.

bugzilla_noreply＠novell.com

2 Sep

13:28

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c45

--- Comment #45 from David Richards <drichard@largo.com> 2011-09-02 13:28:19 UTC ---
@all: OK, I'll try and get 3.0 installed today and schedule a reboot for early next week.

Yes, we have 11.2 and 11.3, and SLED 11 all in production and they are very fast and stable with heavy user loads on this hardware (HP). I feel that hardware has been ruled out because the issues followed us into a VM clone which has completely different hardware. VM is Intel NICs vs Broadcom for the physical server.

Another interesting thing to note is that it appears ext3 is slightly faster under the conditions of this problem. When I use passwd to change a password on OS 11.3 it's immediate. On this 11.4 machine with ext3 it sits for about 2-3 seconds. When we had ext4 it was closer to 5 seconds.

When I was logged on by myself after the reinstall, everything was very fast and it's gotten slower with each additional concurrent user. We have about 30 people on today.

bugzilla_noreply＠novell.com

4 Sep

16:20

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c46

--- Comment #46 from Greg Kroah-Hartman <gregkh@suse.com> 2011-09-04 16:20:44 UTC ---
Wait, the issues followed to a VM clone? The network packet loss and the filesystem delays? On the same exact machine or a different one?

I ask about the hardware to try to rule out your specific individual machine having issues somewhere in it. Has this specific machine worked properly with the 11.3 image? Or did you buy it and only try 11.4 on it?

bugzilla_noreply＠novell.com

6 Sep

12:50

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c47

--- Comment #47 from David Richards <drichard@largo.com> 2011-09-06 12:50:16 UTC ---
@Greg: Lots of comments in this issue, but yes I've mentioned the same problems follow OS 11.4 when moved physical-to-virtual and then run on completely different hardware. Here is the NIC on the virtual copy of this machine:

desktop:~ # ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:50:56:8B:00:0B
          inet addr:128.222.99.242  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:54371901 errors:0 dropped:528752 overruns:0 frame:0

RX dropped packets follow us to new hardware. I don't believe it's the hardware, we have other servers of identical configuration running older versions of OpenSuse that are working wonderfully.

We are back now from the long weekend, and I'll get 3.0 installed today and schedule a reboot for tomorrow AM.

Things that might be a bit more unique in our configuration:
- Class B network
- Lots of users concurrently logging into the same computer.
- Running remote Xwindows via XDMCP

However, the dropped RX packets seem to happen even not under a user load.

bugzilla_noreply＠novell.com

16:49

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c48

--- Comment #48 from David Richards <drichard@largo.com> 2011-09-06 16:49:17 UTC ---
@all: I added repository:

http://download.opensuse.org/repositories/openSUSE:/Tumbleweed/standard/

and will bring down kernel-desktop 3.0.4-43.1 tonight and get it rebooted tomorrow AM for testing.

bugzilla_noreply＠novell.com

17:31

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c49

--- Comment #49 from Greg Kroah-Hartman <gregkh@suse.com> 2011-09-06 17:31:31 UTC ---
Note, you need more than just the kernel package to boot 3.0 properly. Please use the Kernel:stable repo instead, that should have everything you need, and you will not pull in other Tumbleweed packages you might not want at the moment.

And also, don't use the kernel-desktop flavor for a server, that doesn't make much sense, and might make things not work as well. Please use the "kernel-default" package instead.
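Since the thread keeps returning to which kernel flavor is actually running, a quick check after the reboot may help. This is a minimal sketch that assumes the release string reported by `uname -r` ends in `-<flavor>`, as openSUSE kernel packages normally arrange; `kernel_flavor` is a hypothetical helper name and the release string in the usage note is made up.

```shell
#!/bin/sh
# Sketch: extract the flavor suffix from a kernel release string.
# kernel_flavor is a hypothetical helper; the "<version>-<flavor>" layout of
# the release string is an assumption about openSUSE kernel naming.
kernel_flavor() {
    rel=${1:-$(uname -r)}        # default to the running kernel
    printf '%s\n' "${rel##*-}"   # keep everything after the last "-"
}
```

With no argument it inspects the running kernel; `kernel_flavor 3.0.4-43-default` prints `default`.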

bugzilla_noreply＠novell.com

17:37

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c50

--- Comment #50 from David Richards <drichard@largo.com> 2011-09-06 17:37:12 UTC ---
@Greg: Will do Greg, adjusting everything now. The issue of kernel-desktop vs kernel-default was a bit unknown in my mind because while it is a server, we are using it as a desktop. Will follow your instructions, reboot is scheduled for tomorrow AM.

bugzilla_noreply＠novell.com

7 Sep

13:55

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c51

--- Comment #51 from David Richards <drichard@largo.com> 2011-09-07 13:55:13 UTC ---
@all: Very happy to report partial success when using the 3.0 kernel. Disk access is working as expected now and very fast. We have 40 users on the server, and with the previous kernel we would be seeing a good amount of slowness on the file systems.

The high number of packet loss has not gone away however. Dropped packets go up every few seconds in jumps of 1-3 at a time.

linux-gsss:~ # ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:12244039 errors:0 dropped:10358 overruns:0 frame:0

I'm not sure if this still falls under a kernel issue or if a new bug report should be created. But we're very happy with the incremental fix, things are running much faster now.

bugzilla_noreply＠novell.com

17:52

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c52

--- Comment #52 from David Richards <drichard@largo.com> 2011-09-07 17:52:36 UTC ---
41 users and still very fast. I'd say for sure the disk issue is resolved in kernel 3.0.4.

Dropping lots of packets, here is the current total:

ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:23636204 errors:0 dropped:22611 overruns:0 frame:0

bugzilla_noreply＠novell.com

18:00

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c53

--- Comment #53 from Bjørn Lie <bjorn.lie@gmail.com> 2011-09-07 20:00:27 CEST ---
Or the switch from kernel-desktop to kernel-default, unless you test, we'll never know.

http://doc.opensuse.org/products/opensuse/openSUSE/opensuse-tuning/cha.tunin...

bugzilla_noreply＠novell.com

19 Sep

18:28

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c

Jiri Slaby <jslaby@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
       InfoProvider|                            |drichard@largo.com

bugzilla_noreply＠novell.com

20 Sep

13:04

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c54

--- Comment #54 from David Richards <drichard@largo.com> 2011-09-20 13:04:37 UTC ---
@all: Performance has been excellent with the new kernel, we have already had 50 users logged in and they hardly show up in top. Disk speed is excellent, reads and writes are fast.

The only issue that remains is the issue of continued dropped RX packets (below is the current snapshot). Never saw this on earlier versions of OpenSuse and the same issue follows us into the VMware player clone of this server...which is completely different hardware and NICs.

linux-gsss:~ # ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:178875656 errors:0 dropped:519193 overruns:0 frame:0

Should I open a new bug report or just alter this one? This issue does not seem to greatly affect speed, but sure would like to see this working better.
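The raw counters in these snapshots are hard to compare because the uptime differs each time; the ratio of dropped to received packets is easier to track. In the last snapshot, 519193 of 178875656 received packets were dropped, roughly 0.29%. A minimal sketch for computing that ratio follows. `rx_drop_percent` is a hypothetical helper, and it assumes the older net-tools `ifconfig` layout shown in this thread ("RX packets:N errors:N dropped:N ...").

```shell
#!/bin/sh
# Sketch: percentage of received packets that were dropped, read from
# ifconfig-style text on stdin. rx_drop_percent is a hypothetical helper and
# assumes the "RX packets:N ... dropped:N" layout quoted in this thread.
rx_drop_percent() {
    LC_ALL=C awk '
        /RX packets:/ {
            for (i = 1; i <= NF; i++) {
                # Take only the first packets:/dropped: pair (the RX one).
                if (rx == "" && $i ~ /^packets:/) { split($i, a, ":"); rx = a[2] }
                if (dr == "" && $i ~ /^dropped:/) { split($i, b, ":"); dr = b[2] }
            }
            if (rx + 0 > 0) printf "%.2f\n", 100 * dr / rx
            exit
        }'
}
```

Usage would be `ifconfig eth0 | rx_drop_percent`, run at intervals to see whether the ratio changes with user load.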

bugzilla_noreply＠novell.com

14:50

New subject: [Bug 712958] Kernel Problems Under Multi-User Loads

https://bugzilla.novell.com/show_bug.cgi?id=712958
https://bugzilla.novell.com/show_bug.cgi?id=712958#c55

Greg Kroah-Hartman <gregkh@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
       InfoProvider|drichard@largo.com          |
         Resolution|                            |FIXED

--- Comment #55 from Greg Kroah-Hartman <gregkh@suse.com> 2011-09-20 14:50:29 UTC ---
Yes, if you want us to work on that issue, it should be a separate bug.

Glad the 3.0 kernel is working properly for you, will mark this as fixed now.
