[Bug 712958] New: Kernel Problems Under Multi-User Loads
https://bugzilla.novell.com/show_bug.cgi?id=712958

https://bugzilla.novell.com/show_bug.cgi?id=712958#c0

           Summary: Kernel Problems Under Multi-User Loads
    Classification: openSUSE
           Product: openSUSE 11.4
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 11.4
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: drichard@largo.com
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0

I've tried changing and testing every known bottleneck that I can find and so far nothing has improved our OpenSuse 11.4 server. My belief now is that this is a kernel/scheduling problem of some type. All of my research seems to come back to the kernel.

After about 10-15 concurrent users the server gets extraordinarily slow. On earlier versions of OpenSuse, we were able to get 100-200 users into hardware that was far less robust.

The first visible sign of problem is the network. Here is the network stats on an OpenSuse 11.3 server which is being hammered with networking and working well. Note the low RX packets.

eth0      Link encap:Ethernet  HWaddr F4:CE:46:C0:EA:A8
          inet addr:128.222.233.237  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1970250698 errors:0 dropped:175 overruns:0 frame:0

On this OpenSuse 11.4 server (which is used to deploy GNOME to thin clients), here is the same information:

eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:67783895 errors:0 dropped:175438 overruns:0 frame:0

The other very visible issue is disk performance. I've noticed:

- If you type in 'vi /etc/hosts' it sits for 3-4 seconds with a blinking cursor before the file appears. I tried it from the console and it's doing the same thing.
- If you change a password for a user, it sits for 3-4 seconds before the prompt comes back.
- If you try and install any software with Yast2 with a user load, the whole server basically crawls. All X events freeze during the install and you have to wait for them to install.

The server is acting like it would if we were swapping, but that's not the case. top shows barely any load, we have 64GB on the server and only 12 is in use. I installed iotop and it's barely showing a load, and even when iotop is at zero, it still takes multiple seconds for a file to open in vi. When we get back below 10 users, everything immediately gets faster and things work as expected. Something is happening at a low level, and no tools seem to report the failure.

Things we have tried:

- Copied it to a VM instance and after it had a load, the same issues appeared. This kind of rules out the hardware.
- I turned off nscd with the idea that it was slowing down file access, no changes in speed.
- I turned off barriers on the ext4 file system, no change.
- The VM instance actually downgrades it to ext3, with same results, so it seems not related to the physical file system.
- Various sysctl.conf settings that people have mentioned, none of which seem to affect this issue.

Kernel is currently: kernel-desktop-2.6.37.6-0.5.1.x86_64

Any tips or ideas are appreciated, whatever this problem is...it seems like it will prohibit Enterprise use and potential future SLED problems.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=712958#c1
--- Comment #1 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c2
Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=712958#c3
David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c4
Michal Hocko
The first visible sign of problem is the network. Here is the network stats on an OpenSuse 11.3 server which is being hammered with networking and working well. Note the low RX packets.
eth0      Link encap:Ethernet  HWaddr F4:CE:46:C0:EA:A8
          inet addr:128.222.233.237  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1970250698 errors:0 dropped:175 overruns:0 frame:0
On this OpenSuse 11.4 server (which is used to deploy GNOME to thin clients), here is the same information:
eth0      Link encap:Ethernet  HWaddr 00:1C:C4:93:DF:72
          inet addr:128.222.99.243  Bcast:128.222.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:67783895 errors:0 dropped:175438 overruns:0 frame:0
You seem to have a _much_ bigger packet drop with 11.4 wrt 11.3. Is any of your infrastructure relying on the network (ldap...)?
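To put a number on that difference, here is a minimal Python sketch that computes the dropped-to-received ratio from the two RX counter lines quoted above. The function name `rx_drop_ratio` is made up for illustration, and the ratio is simply dropped divided by RX packets as ifconfig reports them. With these counters the 11.4 box drops roughly 0.26 percent of its RX packets, while the 11.3 box stays far below 0.00001 percent.

```python
import re

# RX counter lines copied from the two ifconfig outputs quoted in this thread.
RX_LINE_11_3 = "RX packets:1970250698 errors:0 dropped:175 overruns:0 frame:0"
RX_LINE_11_4 = "RX packets:67783895 errors:0 dropped:175438 overruns:0 frame:0"

# Matches the old net-tools ifconfig layout shown in the report.
_RX_PATTERN = re.compile(r"RX packets:(\d+)\s+errors:\d+\s+dropped:(\d+)")


def rx_drop_ratio(rx_line: str) -> float:
    """Return dropped / received for one ifconfig RX counter line.

    Hypothetical helper, not part of any tool mentioned in the thread.
    """
    match = _RX_PATTERN.search(rx_line)
    if match is None:
        raise ValueError(f"not an ifconfig RX counter line: {rx_line!r}")
    packets, dropped = int(match.group(1)), int(match.group(2))
    if packets == 0:
        raise ValueError("no packets received, ratio is undefined")
    return dropped / packets


if __name__ == "__main__":
    for label, line in (("11.3", RX_LINE_11_3), ("11.4", RX_LINE_11_4)):
        print(f"{label}: {rx_drop_ratio(line):.7%} of RX packets dropped")
```

The counters are cumulative since boot, so the ratio only hints at a difference between the two hosts; it does not say when the drops happened.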
The other very visible issue is disk performance. I've noticed: - If you type in 'vi /etc/hosts' it sits for 3-4 seconds with a blinking cursor before the file appears. I tried it from the console and it's doing the same thing.
Have you tried to strace/ltrace this command with timing information? (strace -tt -o strace.log vi /etc/hosts)
- If you change a password for a user, it sits for 3-4 seconds before the prompt comes back.
This would point to ldap.
https://bugzilla.novell.com/show_bug.cgi?id=712958#c5
--- Comment #5 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c6
--- Comment #6 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c7
Greg Kroah-Hartman
https://bugzilla.novell.com/show_bug.cgi?id=712958#c8
David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c9
--- Comment #9 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c10
Greg Kroah-Hartman
Created an attachment (id=447027) --> (http://bugzilla.novell.com/attachment.cgi?id=447027) [details] Output from hwinfo
Are you sure? It usually starts out looking like:

============ start debug info ============
libhd version 18.3u (x86-64)
using /var/lib/hardware
..

and then ends with the pci information you provided.

Care to try attaching the whole thing? Perhaps it needs to be compressed to let bugzilla accept it?
https://bugzilla.novell.com/show_bug.cgi?id=712958#c11
--- Comment #11 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c12
David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c13
Greg Kroah-Hartman
https://bugzilla.novell.com/show_bug.cgi?id=712958#c14
--- Comment #14 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c15
--- Comment #15 from Michal Hocko
https://bugzilla.novell.com/show_bug.cgi?id=712958#c16
--- Comment #16 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c17
--- Comment #17 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c18
--- Comment #18 from Michal Hocko
Created an attachment (id=447195) --> (http://bugzilla.novell.com/attachment.cgi?id=447195) [details] strace of opening /etc/hosts with vi
Could you run strace with the -tt -f parameters so we get the timing information and also catch all forked processes, please? Could you also give us /proc/mounts? Could you also run vmstat 1 and capture the output while you are accessing the file that shows the issue?
https://bugzilla.novell.com/show_bug.cgi?id=712958#c19
--- Comment #19 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c
David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c20
--- Comment #20 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c21
--- Comment #21 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c22
--- Comment #22 from Jiri Slaby
Created an attachment (id=447199) --> (http://bugzilla.novell.com/attachment.cgi?id=447199) [details] strace -o /tmp/vietchostsfull.out -tt -f vi /etc/hosts
It looks like *everything* is slow. Could you attach strace -tt -T -f output? My guess is that some lock (BTM?) is contended. Could you also attach sysrq-t output?
https://bugzilla.novell.com/show_bug.cgi?id=712958#c23
--- Comment #23 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c24
--- Comment #24 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c25
--- Comment #25 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c26
--- Comment #26 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c27
--- Comment #27 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c28
--- Comment #28 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c29
--- Comment #29 from Jiri Slaby
Created an attachment (id=447225) --> (http://bugzilla.novell.com/attachment.cgi?id=447225) [details] strace -o /tmp/vietchostsmore.out -tt -f -T vi /etc/hosts
Interesting times (100x slower than on my machine):

open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.411674>
..
open("/etc/.hosts.swpx", O_RDWR|O_CREAT|O_EXCL, 0600) = 5 <0.339978>
..
close(5) = 0 <0.461197>
unlink("/etc/.hosts.swpx") = 0 <0.404862>
close(4) = 0 <0.313377>
unlink("/etc/.hosts.swp") = 0 <0.407398>
..
open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.445237>
..
write(4, "..."..., 4096) = 4096 <0.417035>
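For anyone wanting to pull such lines out of a large strace log, the following is a rough Python sketch that ranks calls by the `<seconds>` value that strace -T appends to each line. The names `TimedCall` and `slowest_calls` are hypothetical, the sample lines are the ones quoted above, and the regular expression assumes the usual `name(args) = result <seconds>` layout with an optional PID and -tt timestamp in front, as written by `strace -f -o`.

```python
import re
from typing import Iterable, List, NamedTuple

# Sample lines taken from the strace excerpt quoted in this thread.
SAMPLE = '''
open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.411674>
open("/etc/.hosts.swpx", O_RDWR|O_CREAT|O_EXCL, 0600) = 5 <0.339978>
close(5) = 0 <0.461197>
unlink("/etc/.hosts.swpx") = 0 <0.404862>
close(4) = 0 <0.313377>
unlink("/etc/.hosts.swp") = 0 <0.407398>
open("/etc/.hosts.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4 <0.445237>
write(4, "..."..., 4096) = 4096 <0.417035>
'''


class TimedCall(NamedTuple):
    seconds: float  # time spent in the call, from strace -T
    syscall: str
    args: str
    result: str


_CALL = re.compile(
    r"^(?:\d+\s+)?"      # optional PID column (strace -f -o FILE)
    r"(?:[\d:.]+\s+)?"   # optional wall-clock timestamp (strace -tt)
    r"(\w+)\((.*)\)\s+=\s+(\S+).*?<(\d+\.\d+)>\s*$"
)


def slowest_calls(lines: Iterable[str], threshold: float = 0.0) -> List[TimedCall]:
    """Return calls taking at least `threshold` seconds, slowest first.

    Lines without a trailing <seconds> field (signals, unfinished
    calls, elisions such as "..") are skipped.
    """
    calls = []
    for line in lines:
        match = _CALL.match(line.strip())
        if match is None:
            continue
        name, args, result, seconds = match.groups()
        if float(seconds) >= threshold:
            calls.append(TimedCall(float(seconds), name, args, result))
    return sorted(calls, reverse=True)


if __name__ == "__main__":
    for call in slowest_calls(SAMPLE.splitlines(), threshold=0.1):
        print(f"{call.seconds:.6f}s  {call.syscall}({call.args}) = {call.result}")
```

Every call in the excerpt touches a file under /etc in a way that modifies the filesystem (create, write, close after write, unlink), which is consistent with the later observation in this thread that write access in particular is slow.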
sysrq-trigger was passed a "t" and this was dumped.
This is not much use, it's full of scheduler stats. Could you run
# dmesg -c -s 1000000
and then
# echo t > /proc/sysrq-trigger
# dmesg > log
and attach log?
https://bugzilla.novell.com/show_bug.cgi?id=712958#c30
--- Comment #30 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c31
--- Comment #31 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c
David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c32
Michal Hocko
Created an attachment (id=447246) --> (http://bugzilla.novell.com/attachment.cgi?id=447246) [details] Output of dmesg after -c -s and then sysrq-trigger to t
This still doesn't contain anything useful (any traces):
$ head bug-712958_dmesg.out
000000
[631329.878618] .se->statistics.exec_max : 1.528704
[631329.878618] .se->statistics.slice_max : 1.122945
[631329.878618] .se->statistics.wait_max : 2.041506
[631329.878618] .se->statistics.wait_sum : 3.526940
[631329.878618] .se->statistics.wait_count : 30
[631329.878618] .se->load.weight : 512
Anyway, as Jiri already pointed out, a file open taking almost half a second looks really suspicious.
Let's have a look at the worst (successful) file open times:
$ grep "\
From this, it looks like any write access is terribly slow (O_RDWR or O_WRONLY). I guess that /etc/ is on your root partition, which seems to be ext4? Did you format the filesystem from scratch or migrate it from ext3? Is the filesystem almost full?
You said that if you copied the file to /tmp then it opens just fine. Is /tmp on the same root partition or is it a link to tmpfs which you have mounted?
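The /proc/mounts output requested earlier in the thread can answer that question directly. Below is a small Python sketch, with made-up names (`Mount`, `mount_for_path`, `same_mount`) and a purely hypothetical sample table, that picks the mount entry covering a given absolute path by longest mount-point prefix. It is not the reporter's actual mount table.

```python
from typing import NamedTuple, Optional

# HYPOTHETICAL /proc/mounts content for illustration only; the real
# table from the affected server is not shown in this thread.
SAMPLE_MOUNTS = """\
rootfs / rootfs rw 0 0
/dev/sda2 / ext4 rw,relatime 0 0
tmpfs /dev/shm tmpfs rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime 0 0
"""


class Mount(NamedTuple):
    device: str
    mountpoint: str
    fstype: str
    options: str


def mount_for_path(path: str, mounts_text: str) -> Optional[Mount]:
    """Return the mount entry covering an absolute `path`.

    The longest matching mount point wins; on a tie the later line
    wins, since later mounts shadow earlier ones. Escaped characters
    in mount points (such as \\040 for a space) are not decoded.
    """
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount = Mount(*fields[:4])
        point = mount.mountpoint
        covers = (
            point == "/"
            or path == point
            or path.startswith(point.rstrip("/") + "/")
        )
        if covers and (best is None or len(point) >= len(best.mountpoint)):
            best = mount
    return best


def same_mount(path_a: str, path_b: str, mounts_text: str) -> bool:
    """True if both paths are covered by the same mount entry."""
    mount_a = mount_for_path(path_a, mounts_text)
    return mount_a is not None and mount_a == mount_for_path(path_b, mounts_text)


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        table = handle.read()
    for path in ("/etc/hosts", "/tmp"):
        print(path, "->", mount_for_path(path, table))
```

On a live system, comparing `os.stat("/etc").st_dev` with `os.stat("/tmp").st_dev` is a quicker way to tell whether two paths share a filesystem, though it does not report the filesystem type.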
https://bugzilla.novell.com/show_bug.cgi?id=712958#c33
--- Comment #33 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c34
--- Comment #34 from Jiri Slaby
(In reply to comment #30)
Created an attachment (id=447246) --> (http://bugzilla.novell.com/attachment.cgi?id=447246) [details] Output of dmesg after -c -s and then sysrq-trigger to t
This still doesn't contain anything useful (any traces):
$ head bug-712958_dmesg.out
.000000
[631329.878618] .se->statistics.exec_max : 1.528704
[631329.878618] .se->statistics.slice_max : 1.122945
Why is this crap a part of sysrq-t output at all?

The only option here is to increase the log buffer. It can be done with the log_buf_len kernel option, something like log_buf_len=1000000. So if you are going to reboot, pass this in. Then, when you encounter this weird behaviour, repeat the process with dmesg+sysrq-trigger.

Thanks.
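After the reboot it may be worth confirming that the option actually reached the kernel. The sketch below is only an illustration in Python: the helper names are hypothetical, it parses a command-line string such as the content of /proc/cmdline, and it also reports the next power of two. That last step rests on an assumption not made in this thread: the kernel-parameters documentation describes log_buf_len as a byte count that should be a power of two, and how 2.6.37 treats other values such as 1000000 is not established here.

```python
from typing import Optional

# Size suffixes in the style of kernel command-line sizes (assumed: K, M, G).
_SUFFIXES = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def parse_size(value: str) -> int:
    """Convert strings like '1000000', '512K' or '1M' to a byte count."""
    value = value.strip()
    factor = _SUFFIXES.get(value[-1:].lower(), 1)
    digits = value[:-1] if factor != 1 else value
    return int(digits) * factor


def log_buf_len_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the log_buf_len value in bytes, or None if it is absent.

    If the option appears more than once, the last occurrence is used.
    """
    size = None
    for token in cmdline.split():
        if token.startswith("log_buf_len="):
            size = parse_size(token.split("=", 1)[1])
    return size


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


if __name__ == "__main__":
    with open("/proc/cmdline") as handle:
        requested = log_buf_len_from_cmdline(handle.read())
    if requested is None:
        print("log_buf_len was not passed on the kernel command line")
    else:
        print(f"requested {requested} bytes, "
              f"next power of two is {next_power_of_two(requested)}")
```

If the power-of-two assumption holds for the running kernel, passing log_buf_len=1M (1048576 bytes) would be the nearest valid size above the suggested value.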
https://bugzilla.novell.com/show_bug.cgi?id=712958#c35
--- Comment #35 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c36
--- Comment #36 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c37
--- Comment #37 from Michal Hocko
@all: Confirmed, complete reformat and reload of OS 11.4 is now scheduled for next Tuesday. We are hopeful that something failed in going from 11.3 -> 11.4.
I would rather start with reformatting the / partition (with a backup, of course). It would be good to rule out any fs-related issue.
https://bugzilla.novell.com/show_bug.cgi?id=712958#c38
--- Comment #38 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c39
--- Comment #39 from Michal Hocko
https://bugzilla.novell.com/show_bug.cgi?id=712958#c40
--- Comment #40 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c41
--- Comment #41 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c42
David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c43
Bjørn Lie
https://bugzilla.novell.com/show_bug.cgi?id=712958#c44
--- Comment #44 from Greg Kroah-Hartman
https://bugzilla.novell.com/show_bug.cgi?id=712958#c45
--- Comment #45 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c46
--- Comment #46 from Greg Kroah-Hartman
https://bugzilla.novell.com/show_bug.cgi?id=712958#c47
--- Comment #47 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c48
--- Comment #48 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c49
--- Comment #49 from Greg Kroah-Hartman
https://bugzilla.novell.com/show_bug.cgi?id=712958#c50
--- Comment #50 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c51
--- Comment #51 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c52
--- Comment #52 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c53
--- Comment #53 from Bjørn Lie
https://bugzilla.novell.com/show_bug.cgi?id=712958#c
Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=712958#c54
--- Comment #54 from David Richards
https://bugzilla.novell.com/show_bug.cgi?id=712958#c55
Greg Kroah-Hartman