[Bug 662083] New: process accounting kind of broken on kernel-xen 2.6.34.x dom0
https://bugzilla.novell.com/show_bug.cgi?id=662083#c0

           Summary: process accounting kind of broken on kernel-xen 2.6.34.x dom0
    Classification: openSUSE
           Product: openSUSE 11.3
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 11.3
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Xen
        AssignedTo: jdouglas@novell.com
        ReportedBy: samuel.kvasnica@ims.co.at
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---
        User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13

After quite a hard time with the soft-lockup nightmare on 2.6.31.x Xen kernels, I'm happy to say that 2.6.34 seems to be soft-lockup free by now (yes, it got better on Nehalem with the patch, but something is still broken there, just less frequently...). However, there are other 'cosmetic' glitches related to the clock that were not present in 2.6.31: in particular, the load average shown in dom0 is broken; it acquires a stable positive integer offset (typically something like 3.0 or 5.0 on my systems). I could reproduce this on two different 4-core/HT Nehalem systems as well as on an old 2-core Pentium D system. It looks as if the dom0 kernel gets initialized with all CPUs at boot, but the data structures do not get correctly reinitialized when the number of CPUs is reduced after xend has started? While this is basically a cosmetic issue, it makes host monitoring hell. Furthermore, the clock in domUs now drifts, while it was stable in 2.6.31; I'm submitting a separate bug for that.

Reproducible: Always

Steps to Reproduce:
1. Configure a Xen dom0 with some domU.
2. Set dom0-cpus to a value > 0 but lower than the number of CPUs.
3. Start xend.
4. Look at top.

Actual Results:

top - 17:59:26 up 22:52, 3 users, load average: 5.00, 5.00, 5.00
Tasks: 172 total, 1 running, 169 sleeping, 0 stopped, 2 zombie
Cpu0 : 2.3%us, 1.2%sy, 0.0%ni, 96.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
Cpu1 : 2.6%us, 1.2%sy, 0.0%ni, 96.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   1026560k total,  597420k used,  429140k free,      12k buffers
Swap:  8385924k total,   14676k used, 8371248k free,  267036k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5913 root      20   0  8588 1156  576 S    1  0.1   3:22.53 xenstored
 5922 root      20   0  294m  34m 2116 S    1  3.4   4:17.65 xend
 5816 root      20   0     0    0    0 S    0  0.0   2:13.16 netback/0
25286 root      20   0  8668 1148  788 R    0  0.1   0:00.19 top
28712 root      35  15 77504 5428 1856 S    0  0.5   2:15.57 snmpd
    1 root      20   0 12408  636  596 S    0  0.1   0:00.93 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:00.30 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.96 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0

Expected Results:
A load average of 0.00.
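For reference, the reproduction steps above condense to roughly the following shell session. This is only a sketch of the recipe from the report; it assumes the stock openSUSE 11.3 layout, with the xend configuration in /etc/xen/xend-config.sxp and the rcxend init script:

  # 1. In /etc/xen/xend-config.sxp, set dom0-cpus to a value > 0 but lower
  #    than the number of physical CPUs, e.g.:
  #        (dom0-cpus 2)
  # 2. Restart xend so that it offlines the surplus dom0 vCPUs:
  rcxend restart
  # 3. The surplus vCPUs should now show up as paused ("--p"):
  xm vcpu-list 0
  # 4. On an otherwise idle host, the load average then settles at a constant
  #    positive offset instead of dropping towards 0.00:
  uptime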
https://bugzilla.novell.com/show_bug.cgi?id=662083#c1

--- Comment #1 from Samuel Kvasnica <samuel.kvasnica@ims.co.at> 2011-01-03 17:24:18 UTC ---

Forgot to write: the kernel is exactly 2.6.34.7-42-xen, and xen is xen-4.0.1_02-94.2 (but that does not seem to play a role).
https://bugzilla.novell.com/show_bug.cgi?id=662083

Charles Arnold <carnold@novell.com> changed:

           What    |Removed             |Added
----------------------------------------------------------------------------
         AssignedTo|jdouglas@novell.com |jbeulich@novell.com
          QAContact|qa@suse.de          |jdouglas@novell.com
https://bugzilla.novell.com/show_bug.cgi?id=662083#c2

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Status|NEW     |NEEDINFO
           Found By|---     |Community User
       InfoProvider|        |samuel.kvasnica@ims.co.at

--- Comment #2 from Jan Beulich <jbeulich@novell.com> 2011-01-10 16:35:06 UTC ---

So you can observe this only when you reduce the number of Dom0's vCPU-s post-boot (i.e. does "dom0_max_vcpus=" allow you to work around this issue)? Does this depend on the offlining being done during system (xend) initialization, or can this also be observed if you remove vCPU-s manually once the system is up?

Since there is very little Xen-specific code involved here - did you check whether, on a native kernel, soft-offlining CPUs has similar effects?

Does the reported number correlate in any way with the number of CPUs originally assigned to Dom0 and/or the number of those that got removed?
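The native-kernel cross-check suggested here can be scripted along the following lines. This is only a sketch; it assumes a machine with at least two CPUs and the standard CPU-hotplug sysfs interface:

  echo 0 > /sys/devices/system/cpu/cpu1/online   # soft-offline CPU 1
  sleep 300                                      # let the 5-minute average settle
  cat /proc/loadavg                              # should stay near 0.00 on an idle box
  echo 1 > /sys/devices/system/cpu/cpu1/online   # bring CPU 1 back online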
https://bugzilla.novell.com/show_bug.cgi?id=662083#c3

--- Comment #3 from Samuel Kvasnica <samuel.kvasnica@ims.co.at> 2011-01-10 21:36:49 UTC ---

Uhm, this is going to be complicated. I don't see any straight mathematical relationship between this offset and the number of CPUs; at the least, another parameter must be involved. It began with the 2.6.34 kernel - well, actually already with 2.6.33 if you look into my good old Bug #584554. I get the following offsets:

PentiumD, 2 cores: dom0-cpus=1: 1.0
Xeon X3450, 4 cores + HT: dom0-cpus=2: 5.0
Xeon E5530, 4 cores + HT (1 socket only): dom0-cpus=2: 2.0 or 3.0

...so I wouldn't dare to fit a 5th-degree polynomial over 3 points...

I just rebooted the PentiumD system several times. Results so far:

- native kernel, playing with /sys/devices/system/cpu/cpu1/online => correct load average

- dom0_max_vcpus=1 => correct load average

- I've observed this happening immediately after starting 'rcxend start' by hand; in that state I see:

top - 22:15:37 up 42 min, 3 users, load average: 0.97, 0.67, 0.40
Tasks: 172 total, 1 running, 171 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.1%us, 0.4%sy, 0.2%ni, 99.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st

xm vcpu-list 0
Name        ID  VCPU  CPU  State  Time(s)  CPU Affinity
Domain-0     0     0    0  r--      346.7  any cpu
Domain-0     0     1    -  --p       20.7  any cpu

...why are there actually 2 CPUs shown? And if I invoke 'xm vcpu-set 0 2' now, the load average becomes correct. Switching vcpu-set between 1 and 2 several times => still everything correct.

- however (!), I've observed cases (about 30%) where this did not happen at all during a manual xend start; I have the PentiumD system in such a state at the moment and will wait to see whether it climbs up overnight. I would say it is very likely to happen if xend is started automatically during the boot process => race condition?

Other specialities: openais and drbd are running on all these systems. And there are no DomUs running while testing this.
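The manual toggling described in this comment amounts to something like the following sketch; 'xm vcpu-set <domid> <count>' sets the number of online vCPUs for a domain, with domain 0 being dom0:

  xm vcpu-set 0 2      # bring the second dom0 vCPU back online - the load average recovers
  xm vcpu-set 0 1      # pause it again
  cat /proc/loadavg    # per this comment, the value stays correct after manual toggling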
https://bugzilla.novell.com/show_bug.cgi?id=662083#c4

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed   |Added
----------------------------------------------------------------------------
           Priority|P5 - None |P4 - Low

--- Comment #4 from Jan Beulich <jbeulich@novell.com> 2011-01-11 12:57:27 UTC ---

(In reply to comment #3)
> - I've observed this happening immediately after starting 'rcxend start'
>   by hand; in that state I see:
> top - 22:15:37 up 42 min, 3 users, load average: 0.97, 0.67, 0.40

Which do you consider right or wrong?

> xm vcpu-list 0
> Name        ID  VCPU  CPU  State  Time(s)  CPU Affinity
> Domain-0     0     0    0  r--      346.7  any cpu
> Domain-0     0     1    -  --p       20.7  any cpu
>
> ...why are there actually 2 CPUs shown?

Because vCPU-s can't be removed altogether, they can only be marked unused (or really, paused). If the resources tied to this worry you, dom0_max_vcpus= is your friend.

> - however (!), I've observed cases (about 30%) where this did not happen at
>   all during a manual xend start; I have the PentiumD system in such a state
>   at the moment and will wait to see whether it climbs up overnight. I would
>   say it is very likely to happen if xend is started automatically during
>   the boot process => race condition?

Does it perhaps matter whether the CPU being removed happens to be under load? If so, the question then would (again) be whether the same applies to native.
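To illustrate the dom0_max_vcpus= workaround mentioned above: it is a Xen hypervisor boot parameter, not a dom0 kernel parameter, so it belongs on the xen.gz line of the boot entry. A hypothetical openSUSE 11.3 /boot/grub/menu.lst entry might look like this (the paths, root device, and module lines are placeholders, not taken from the report):

  title Xen -- openSUSE 11.3
      root (hd0,0)
      kernel /boot/xen.gz dom0_max_vcpus=2
      module /boot/vmlinuz-2.6.34.7-42-xen root=/dev/sda1
      module /boot/initrd-2.6.34.7-42-xen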
https://bugzilla.novell.com/show_bug.cgi?id=662083#c5

--- Comment #5 from Samuel Kvasnica <samuel.kvasnica@ims.co.at> 2011-01-11 13:32:00 UTC ---

(In reply to comment #4)
> (In reply to comment #3)
> > - I've observed this happening immediately after starting 'rcxend start'
> >   by hand; in that state I see:
> > top - 22:15:37 up 42 min, 3 users, load average: 0.97, 0.67, 0.40
> Which do you consider right or wrong?

Wrong, it is the +1.0 offset case.

> > ...why are there actually 2 CPUs shown?
> Because vCPU-s can't be removed altogether, they can only be marked unused
> (or really, paused). If the resources tied to this worry you,
> dom0_max_vcpus= is your friend.

OK.

> > - however (!), I've observed cases (about 30%) where this did not happen at
> >   all during a manual xend start; I have the PentiumD system in such a state
> >   at the moment and will wait to see whether it climbs up overnight. I would
> >   say it is very likely to happen if xend is started automatically during
> >   the boot process => race condition?

I just looked at that box again: it did not climb up, so this seems to happen only on xend start.

> Does it perhaps matter whether the CPU being removed happens to be under
> load? If so, the question then would (again) be whether the same applies to
> native.

I will test this; CPU load is very probably involved.
https://bugzilla.novell.com/show_bug.cgi?id=662083#c6

--- Comment #6 from Jan Beulich <jbeulich@novell.com> 2011-01-11 16:01:01 UTC ---

(In reply to comment #5)
> > > - I've observed this happening immediately after starting 'rcxend start'
> > >   by hand; in that state I see:
> > > top - 22:15:37 up 42 min, 3 users, load average: 0.97, 0.67, 0.40
> > Which do you consider right or wrong?
> Wrong, it is the +1.0 offset case.

But your original description said that the numbers would all be equal - here, they aren't even close.

> I just looked at that box again: it did not climb up, so this seems to happen
> only on xend start. ... I will test this; CPU load is very probably involved.

I wasn't able to see anything unusual doing "xm vcpu-set 0 ..." with or without load (other than the per-CPU load percentages being sticky). It would be really helpful to have a means to reproduce this outside of xend starting, as that would be rather cumbersome to debug.
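A reproduction attempt outside of xend, along the lines discussed here, could be sketched as follows. The assumptions are a dom0 with two online vCPUs and bash as the shell; the busy loops merely generate CPU load while a vCPU is taken offline, similar to what xend's dom0-cpus handling does at startup:

  # generate CPU load on both dom0 vCPUs
  for i in 1 2; do ( while :; do :; done ) & done
  sleep 60
  # offline one dom0 vCPU while it is busy
  xm vcpu-set 0 1
  sleep 60
  # stop the busy loops and check whether a constant load-average offset remains
  kill $(jobs -p)
  sleep 600
  cat /proc/loadavg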
https://bugzilla.novell.com/show_bug.cgi?id=662083#c7

--- Comment #7 from Samuel Kvasnica <samuel.kvasnica@ims.co.at> 2011-01-11 16:59:56 UTC ---

> > > Which do you consider right or wrong?
> > Wrong, it is the +1.0 offset case.
>
> But your original description said that the numbers would all be equal -
> here, they aren't even close.

Ehm, I did 'rcxend start', waited a minute, and took an xterm snapshot. Since I did not wait very long, the 5-minute and 15-minute averages had not crept up yet, but they would have, of course.

> I wasn't able to see anything unusual doing "xm vcpu-set 0 ..." with or
> without load (other than the per-CPU load percentages being sticky).
> It would be really helpful to have a means to reproduce this outside of xend
> starting, as that would be rather cumbersome to debug.

More results coming very soon... But does that mean you have not seen this on your systems so far? I'm a bit afraid of some other interdependency, e.g. drbd.
https://bugzilla.novell.com/show_bug.cgi?id=662083#c8

--- Comment #8 from Jan Beulich <jbeulich@novell.com> 2011-01-11 17:11:19 UTC ---

(In reply to comment #7)
> But does that mean you have not seen this on your systems so far? I'm a bit
> afraid of some other interdependency, e.g. drbd.

I indeed didn't see it yet, but I also didn't try the xend way (as I'm convinced this can't be the only one), and I'm doing this with .37 rather than .34 (knowing that the Xen-specific time handling code didn't really change between the two). Figuring out the conditions is the most important thing, and until we can reproduce it we entirely depend on your input.
https://bugzilla.novell.com/show_bug.cgi?id=662083#c9

--- Comment #9 from Jan Beulich <jbeulich@novell.com> 2011-07-06 08:15:10 UTC ---

Ping?
https://bugzilla.novell.com/show_bug.cgi?id=662083#c10

Stefan Behlert <behlert@suse.com> changed:

           What    |Removed                   |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                  |RESOLVED
       InfoProvider|samuel.kvasnica@ims.co.at |
         Resolution|                          |NORESPONSE

--- Comment #10 from Stefan Behlert <behlert@suse.com> 2012-06-19 12:03:08 UTC ---

No answer for several months. Please re-open when you have the requested information, thanks.