[Bug 662083] New: process accounting kind of broken on kernel-xen 2.6.34.x dom0
https://bugzilla.novell.com/show_bug.cgi?id=662083

https://bugzilla.novell.com/show_bug.cgi?id=662083#c0

           Summary: process accounting kind of broken on kernel-xen 2.6.34.x dom0
    Classification: openSUSE
           Product: openSUSE 11.3
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 11.3
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Xen
        AssignedTo: jdouglas@novell.com
        ReportedBy: samuel.kvasnica@ims.co.at
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13

After having quite a hard time with the soft-lockup nightmare on 2.6.31.x kernels with Xen, I'm happy to say that 2.6.34 seems to be soft-lockup free by now. (Yes, it got better on Nehalem with the patch, but something is still broken there, just not so frequently...)

But there are other 'cosmetic' glitches related to the clock which were not present in 2.6.31. In particular, the load average shown in dom0 is broken: it gets a stable positive integer offset (typically something like 3.0 or 5.0 on my systems). I could reproduce this on 2 different 4-core/HT Nehalem systems as well as on an old 2-core Pentium D system. It seems that the dom0 kernel gets initialized with all CPUs on boot, but the data structures do not get correctly reinitialized when the number of CPUs is reduced after xend has started?

While this is basically a cosmetic issue, it makes host monitoring hell. Further, the clock in domUs now drifts, while it was stable in 2.6.31; I'm submitting a separate bug for that.
Reproducible: Always

Steps to Reproduce:
1. Configure Xen dom0 with some domU
2. Set dom0-cpus to a value >0 but lower than ncpu
3. Start xend
4. Look at top

Actual Results:
top - 17:59:26 up 22:52,  3 users,  load average: 5.00, 5.00, 5.00
Tasks: 172 total,   1 running, 169 sleeping,   0 stopped,   2 zombie
Cpu0  :  2.3%us,  1.2%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu1  :  2.6%us,  1.2%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1026560k total,   597420k used,   429140k free,       12k buffers
Swap:  8385924k total,    14676k used,  8371248k free,   267036k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5913 root      20   0  8588 1156  576 S    1  0.1   3:22.53 xenstored
 5922 root      20   0  294m  34m 2116 S    1  3.4   4:17.65 xend
 5816 root      20   0     0    0    0 S    0  0.0   2:13.16 netback/0
25286 root      20   0  8668 1148  788 R    0  0.1   0:00.19 top
28712 root      35  15 77504 5428 1856 S    0  0.5   2:15.57 snmpd
    1 root      20   0 12408  636  596 S    0  0.1   0:00.93 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:00.30 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.96 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0

Expected Results:
.. would expect load average of 0.00

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
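For host monitoring, the symptom in the report above (all three load averages pinned at the same positive integer while the CPUs are almost idle) could be flagged automatically. The following is only a minimal sketch: the function names, the tolerance and the idle threshold are assumptions made for illustration and are not part of any existing tool. The input is assumed to be in the `/proc/loadavg` format, with the idle percentage taken from elsewhere (e.g. the `%id` column of top).

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one: float
    five: float
    fifteen: float


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the first three fields of a /proc/loadavg style line."""
    fields = text.split()
    return LoadAvg(float(fields[0]), float(fields[1]), float(fields[2]))


def looks_stuck(load: LoadAvg, idle_percent: float,
                tol: float = 0.05, min_idle: float = 90.0) -> bool:
    """Return True if all three averages sit near the same positive
    integer while the CPUs are mostly idle.

    tol and min_idle are arbitrary illustrative thresholds.
    """
    nearest = round(load.one)
    if nearest < 1:
        return False  # a load near 0 on an idle box is the expected case
    pinned = all(abs(value - nearest) <= tol for value in load)
    return pinned and idle_percent >= min_idle


# Values taken from the top output in the report (load 5.00, 96.2% idle).
sample = parse_loadavg("5.00 5.00 5.00 1/172 25286")
print(looks_stuck(sample, idle_percent=96.2))
```

On a live system the line would be read from `/proc/loadavg` instead of the hard-coded sample string.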
https://bugzilla.novell.com/show_bug.cgi?id=662083#c1
--- Comment #1 from Samuel Kvasnica
https://bugzilla.novell.com/show_bug.cgi?id=662083#c
Charles Arnold
https://bugzilla.novell.com/show_bug.cgi?id=662083#c2
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=662083#c3
--- Comment #3 from Samuel Kvasnica
https://bugzilla.novell.com/show_bug.cgi?id=662083#c4
--- Comment #4 from Jan Beulich
> - I've observed this happening immediately after starting 'rcxend start' by hand, in that state I see:
> top - 22:15:37 up 42 min, 3 users, load average: 0.97, 0.67, 0.40

Which you consider right or wrong?

> xm vcpu-list 0
> Name        ID  VCPU   CPU  State   Time(s)  CPU Affinity
> Domain-0     0     0     0  r--       346.7  any cpu
> Domain-0     0     1     -  --p        20.7  any cpu
> ...why are there actually 2 CPUs shown ?

Because vCPU-s can't be removed altogether, they can only be marked unused (or really, paused). If the resources tied to this worry you, dom0_max_vcpus= is your friend.

> - however (!), I've observed cases (about 30%) when this did not happen at all during manual xend start, I have the pentiumD system in such state at the moment and will wait if it climbs up over night. I would say it is very likely to happen if xend is started automatically during the boot process => race condition ?

Does it perhaps matter whether the CPU being removed happens to be under load? If so, the question then would (again) be whether the same applies to native.
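To check how many dom0 vCPUs are merely paused rather than removed, the `xm vcpu-list` output quoted above can be parsed mechanically. This is a rough sketch that assumes the column layout shown in this thread (Name, ID, VCPU, CPU, State, Time(s), CPU Affinity) and that a `p` in the State column marks a paused vCPU, as in the sample; the function name and the dictionary keys are made up for illustration.

```python
SAMPLE = """\
Name        ID  VCPU   CPU  State   Time(s)  CPU Affinity
Domain-0     0     0     0  r--       346.7  any cpu
Domain-0     0     1     -  --p        20.7  any cpu
"""


def parse_vcpu_list(output: str) -> list:
    """Turn `xm vcpu-list` style text into a list of dicts.

    Assumes seven whitespace-separated columns, the last of which
    (CPU Affinity) may itself contain spaces.
    """
    rows = []
    for line in output.strip().splitlines():
        parts = line.split(None, 6)
        if len(parts) < 7 or parts[0] == "Name":
            continue  # skip the header and malformed lines
        name, dom_id, vcpu, cpu, state, seconds, affinity = parts
        rows.append({
            "name": name,
            "domid": int(dom_id),
            "vcpu": int(vcpu),
            "cpu": None if cpu == "-" else int(cpu),  # "-" = not on a CPU
            "paused": "p" in state,
            "seconds": float(seconds),
            "affinity": affinity.strip(),
        })
    return rows


vcpus = parse_vcpu_list(SAMPLE)
paused = [v["vcpu"] for v in vcpus if v["paused"]]
print(f"{len(vcpus)} vCPUs listed, paused: {paused}")
```

In the sample from this thread, vCPU 1 is the paused one, which matches the explanation that vCPUs are only marked unused rather than removed.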
https://bugzilla.novell.com/show_bug.cgi?id=662083#c5
--- Comment #5 from Samuel Kvasnica
(In reply to comment #3)
> > - I've observed this happening immediately after starting 'rcxend start' by hand, in that state I see:
> > top - 22:15:37 up 42 min, 3 users, load average: 0.97, 0.67, 0.40
> Which you consider right or wrong?

wrong, it is the +1.0 offset case

> > ...why are there actually 2 CPUs shown ?
> Because vCPU-s can't be removed altogether, they can only be marked unused (or really, paused). If the resources tied to this worry you, dom0_max_vcpus= is your friend.

ok

> > - however (!), I've observed cases (about 30%) when this did not happen at all during manual xend start, I have the pentiumD system in such state at the moment and will wait if it climbs up over night. I would say it is very likely

just looked at that box again, it did not climb up, seems to be happening only on xend start

> > to happen if xend is started automatically during the boot process => race condition ?
> Does it perhaps matter whether the CPU being removed happens to be under load? If so, the question then would (again) be whether the same applies to native.

I will test this, cpu load is very probably involved.
https://bugzilla.novell.com/show_bug.cgi?id=662083#c6
--- Comment #6 from Jan Beulich
> > > - I've observed this happening immediately after starting 'rcxend start' by hand, in that state I see:
> > > top - 22:15:37 up 42 min, 3 users, load average: 0.97, 0.67, 0.40
> > Which you consider right or wrong?
> wrong, it is the +1.0 offset case

But your original description said that the numbers would all be equal - here, they aren't even close.

> just looked at that box again, it did not climb up, seems to be happening only on xend start
> ...
> I will test this, cpu load is very probably involved.

I wasn't able to get to see anything unusual doing "xm vcpu-set 0 ..." with or without load (other than the per-CPU load percentages being sticky). It would be really helpful to have a means to reproduce this outside of xend starting, as that would be rather cumbersome to debug.
https://bugzilla.novell.com/show_bug.cgi?id=662083#c7
--- Comment #7 from Samuel Kvasnica
> > > Which you consider right or wrong?
> > wrong, it is the +1.0 offset case
> But your original description said that the numbers would all be equal - here, they aren't even close.

ehm, I did 'rcxend start', waited a minute and took an xterm snapshot. Since I did not wait that long, the 5-minute and 15-minute averages had not crept up yet, but they would, of course.

> I wasn't able to get to see anything unusual doing "xm vcpu-set 0 ..." with or without load (other than the per-CPU load percentages being sticky).
> It would be really helpful to have a means to reproduce this outside of xend starting, as that would be rather cumbersome to debug.

More results coming very soon... But does it mean, you did not see this on your systems so far ? I'm a bit afraid of some other interdependency like e.g. drbd.
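The remark above that the 5-minute and 15-minute averages "did not creep up yet" follows from how the load average is computed: each of the three figures is an exponentially damped average with its own time constant, so a constant phantom contribution shows up in the 1-minute figure first. The sketch below is a simplified floating-point model, not the kernel's fixed-point code; the 5-second update interval and the function name `simulate` are assumptions of this illustration, and it starts from an idle system, which may not match the state of the box in the snapshot.

```python
import math

PERIODS = (60.0, 300.0, 900.0)  # 1, 5 and 15 minute time constants (seconds)
STEP = 5.0                      # assumed update interval in seconds


def simulate(runnable: float, seconds: float,
             start=(0.0, 0.0, 0.0)) -> tuple:
    """Apply load = load * decay + runnable * (1 - decay) every STEP
    seconds and return the (1 min, 5 min, 15 min) averages."""
    loads = list(start)
    for _ in range(int(seconds // STEP)):
        for i, period in enumerate(PERIODS):
            decay = math.exp(-STEP / period)
            loads[i] = loads[i] * decay + runnable * (1.0 - decay)
    return tuple(loads)


# One phantom "runnable" task on an otherwise idle system.
for elapsed in (60, 300, 900, 3600):
    one, five, fifteen = simulate(1.0, elapsed)
    print(f"{elapsed:5d}s  {one:.2f} {five:.2f} {fifteen:.2f}")
```

In this model all three values converge to the offset (1.0 here) given enough time, with the 1-minute figure leading, which is consistent with a snapshot taken shortly after 'rcxend start' showing unequal numbers.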
https://bugzilla.novell.com/show_bug.cgi?id=662083#c8
--- Comment #8 from Jan Beulich
> But does it mean, you did not see this on your systems so far ? I'm a bit afraid of some other interdependency like e.g. drbd.

I indeed didn't see it yet, but also didn't try the xend way (as I'm convinced this can't be the only one), and I'm doing this with .37 rather than .34 (knowing that the Xen specific time handling code didn't really change between the two). Figuring out the conditions is the most important thing, and until we can reproduce it we entirely depend on your input.
https://bugzilla.novell.com/show_bug.cgi?id=662083#c9
--- Comment #9 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=662083#c10
Stefan Behlert