[Bug 589788] New: Xen hypervisor memory utilization does not reflect sum of domain allocations (leaks over time)
http://bugzilla.novell.com/show_bug.cgi?id=589788
http://bugzilla.novell.com/show_bug.cgi?id=589788#c0

Summary: Xen hypervisor memory utilization does not reflect sum of domain allocations (leaks over time)
Classification: openSUSE
Product: openSUSE 11.2
Version: Final
Platform: x86-64
OS/Version: openSUSE 11.2
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Xen
AssignedTo: jdouglas@novell.com
ReportedBy: ffejes@searshc.com
QAContact: qa@suse.de
Found By: ---
Blocker: ---
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100218 Fedora/3.6.1-2.fc14 Firefox/3.6

Over the past couple of months I've noticed that a few of our Xen servers report (via xentop) a discrepancy between the sum of the VMs' memory allocations and the hypervisor memory actually in use. Over time, free memory drops to the point where no new VMs can be started. Below is a recent xentop sample with the extra columns trimmed. The hypervisor reports that it is using 32 GB of RAM, yet the running VMs sum to only 16 GB. Ballooning is disabled and dom0_mem is set to 4 GB.
xentop - 15:38:20   Xen 3.4.1_19718_04-2.1
12 domains: 2 running, 10 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 33549248k total, 33241552k used, 307696k free    CPUs: 4 @ 2992MHz

      NAME   STATE  CPU(sec)  CPU(%)    MEM(k)  MEM(%)  MAXMEM(k)  MAXMEM(%)
xxxxxxxxxx  --b---      1584     0.2    547840     1.6     547840        1.6
  Domain-0  -----r    165894    34.3   4192768    12.5   no limit        n/a
xxxxxxxxxx  --b---     15166    22.3   3071916     9.2    3076096        9.2
xxxxxxxxxx  -----r      1328    77.8   1028012     3.1    1028096        3.1
xxxxxxxxxx  --b---     55433     4.7   1027880     3.1    1028096        3.1
xxxxxxxxxx  --b---     67527     4.8   1027880     3.1    1028096        3.1
xxxxxxxxxx  --b---     73019     6.8   1027880     3.1    1028096        3.1
xxxxxxxxxx  --b---     61266     6.8   1027880     3.1    1028096        3.1
xxxxxxxxxx  --b---     78880     6.2   1028012     3.1    1028096        3.1
xxxxxxxxxx  --b---       969     7.7    516012     1.5     516096        1.5
xxxxxxxxxx  --b---     10544     1.5   1027880     3.1    1028096        3.1
xxxxxxxxxx  --b---     21001     5.8   1027880     3.1    1028096        3.1

This behavior is only seen on our "development" Xen servers, where we routinely create, clone, and destroy a large number of (mostly HVM) VMs. However, I cannot reproduce the "leak" by performing any of those operations directly. That is to say, when I create a VM the hypervisor's free-memory statistic drops by an amount corresponding to the size of the VM, and when the VM is destroyed the free statistic rises back to where it was previously. I have not found any way to get the hypervisor to reclaim the lost memory and, as a result, I have been forced to reboot the server. On systems with 20+ VMs where we don't have the option of live migration, this can be rather traumatic. Thank you!

Reproducible: Couldn't Reproduce

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
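As a sanity check, the arithmetic behind the complaint can be reproduced from the sample above. A minimal sketch (not part of the original report) that sums the per-domain MEM(k) column and compares it against the hypervisor's reported "used" figure:

```python
# MEM(k) values copied from the xentop sample above, in the order listed.
domain_mem_k = [
    547840,   # first guest
    4192768,  # Domain-0 (dom0_mem is set to 4gb)
    3071916, 1028012, 1027880, 1027880, 1027880,
    1027880, 1028012, 516012, 1027880, 1027880,
]
hv_used_k = 33241552  # "Mem: ... used" from the same sample

alloc_sum_k = sum(domain_mem_k)
unaccounted_k = hv_used_k - alloc_sum_k

print(f"sum of domain allocations: {alloc_sum_k / 2**20:.1f} GiB")
print(f"hypervisor 'used':         {hv_used_k / 2**20:.1f} GiB")
print(f"unaccounted by domains:    {unaccounted_k / 2**20:.1f} GiB")
```

With these numbers the domains account for roughly 15.8 GiB while the hypervisor reports about 31.7 GiB used, so close to 16 GiB is unaccounted for, matching the "32gb vs. 16gb" observation in the report.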
http://bugzilla.novell.com/show_bug.cgi?id=589788#c
Jason Douglas
http://bugzilla.novell.com/show_bug.cgi?id=589788#c1
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=589788#c2
--- Comment #2 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c3
--- Comment #3 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c4
--- Comment #4 from Jan Beulich
The 'xm debug-key' command exits with a 0 status and no text output for either q or H.
The output goes to the Xen log (i.e. you have to run "xm dmesg" afterwards).
Regarding logs, it's tricky to annotate them with information about the bad behavior, since I have not yet narrowed down any particular time or activity that causes the memory discrepancies to occur. For example, the server whose logs I'll attach now has held fairly steady at around 61 GB used for a couple of weeks, even though we've only had around 25-30 GB worth of VM allocations running at any one time. Is there any sort of periodic command I could run that would generate information useful in tracking this down?
(In reply to comment #3) Precisely the actions described above (i.e. the two debug keys plus "xm info").
Created attachment 349791 (http://bugzilla.novell.com/attachment.cgi?id=349791): Xen info output for a server exhibiting the behavior
This isn't useful alone - it ought to match up with log and debug key output.
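The periodic capture asked about above could be scripted, e.g. from cron. A hypothetical sketch, assuming a dom0 with the xm toolstack; the output directory and file-naming scheme are invented here, and the debug keys ("q" and "H") are the ones discussed in this comment:

```python
import shutil
import subprocess
import time


def snapshot_name(prefix: str, epoch: int) -> str:
    """Build a timestamped file name, e.g. 'xen-debug-20100401T120000.log'."""
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(epoch))
    return f"{prefix}-{stamp}.log"


def capture(outdir: str = "/var/log/xen-debug") -> str:
    """Fire the 'q' and 'H' debug keys, then save 'xm dmesg' and 'xm info'."""
    if shutil.which("xm") is None:  # only meaningful on a Xen dom0
        raise RuntimeError("xm toolstack not found; run this in dom0")
    for key in ("q", "H"):  # debug-key output lands in the Xen log
        subprocess.run(["xm", "debug-key", key], check=True)
    path = snapshot_name(f"{outdir}/xen-debug", int(time.time()))
    with open(path, "w") as log:
        for cmd in (["xm", "dmesg"], ["xm", "info"]):
            result = subprocess.run(cmd, check=True, capture_output=True, text=True)
            log.write(f"$ {' '.join(cmd)}\n{result.stdout}\n")
    return path

# Example (on a dom0): print("snapshot written to", capture())
```

Collecting these snapshots periodically would give before/after pairs to correlate against the free-memory drops, which is what the missing reproduction steps make hard to do by hand.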
http://bugzilla.novell.com/show_bug.cgi?id=589788#c5
--- Comment #5 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c6
--- Comment #6 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c7
--- Comment #7 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c8
--- Comment #8 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c9
--- Comment #9 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=589788#c10
--- Comment #10 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c11
--- Comment #11 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c12
--- Comment #12 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=589788#c13
--- Comment #13 from Frank Fejes
What block device protocol are you using for your guests? In particular, do you observe a difference in behavior between using tap:... and file: ones?
The vast majority of these are tap:aio, though we have a few file: guests. We've now begun creating all our guests with phy: devices in another environment (CentOS/Xen 3.4.2), and it would appear that we have not triggered this problem there. Is the thinking that perhaps this would be a workaround? Thanks again.
http://bugzilla.novell.com/show_bug.cgi?id=589788#c14
--- Comment #14 from Jan Beulich
The vast majority of these are tap:aio, though we have a few file: guests. Right now we've begun creating all our guests with phy: devices in another environment (CentOS/Xen 3.4.2) and it would appear that we have not triggered this problem. Is there a thought that perhaps this would be a workaround? Thanks again.
Depends on which part of your reply you mean by "this". Using file: instead of any of the tap: variants is likely a workaround, which is why we asked the question in comment #12 (i.e. we had hoped you could confirm this theory). Per further analysis done on SLE11 SP1, using tap:tapdisk:aio: might also be a workaround (but it requires one or more tools-side fixes to deal with the tapdisk: part of the protocol specification).
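For reference, the protocol prefix under discussion is chosen on the disk line of the guest configuration. The paths and names below are hypothetical examples following the xm config syntax, not taken from the report:

```
# blktap (tap:aio) - the variant implicated in this report
disk = [ 'tap:aio:/var/lib/xen/images/guest1/disk0.raw,xvda,w' ]

# loopback (file:) - the likely workaround suggested in comment #14
disk = [ 'file:/var/lib/xen/images/guest1/disk0.raw,xvda,w' ]

# raw block device (phy:) - what comment #13 switched to in another environment
disk = [ 'phy:/dev/vg0/guest1-disk0,xvda,w' ]
```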
http://bugzilla.novell.com/show_bug.cgi?id=589788#c15
--- Comment #15 from Frank Fejes
http://bugzilla.novell.com/show_bug.cgi?id=589788#c18
Vadim Ponomarev
http://bugzilla.novell.com/show_bug.cgi?id=589788#c19
--- Comment #19 from Vadim Ponomarev
http://bugzilla.novell.com/show_bug.cgi?id=589788#c20
--- Comment #20 from Vadim Ponomarev
Would you be able to rebuild the hypervisor if we provided you with a debugging patch?
A patch is welcome.
http://bugzilla.novell.com/show_bug.cgi?id=589788#c21
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=589788#c22
--- Comment #22 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=589788#c23
--- Comment #23 from Vadim Ponomarev
We now appear to have a (kernel-side) fix for this; if you could test that instead, I'll attach the patch in a second.
The patch helped: there is no longer a difference in free_memory before DomU start and after DomU shutdown, for both tap:qcow2 and tap:aio.
http://bugzilla.novell.com/show_bug.cgi?id=589788#c
Vadim Ponomarev
http://bugzilla.novell.com/show_bug.cgi?id=589788#c
Vadim Ponomarev
http://bugzilla.novell.com/show_bug.cgi?id=589788#c24
Vadim Ponomarev
http://bugzilla.novell.com/show_bug.cgi?id=589788#c
Vadim Ponomarev
http://bugzilla.novell.com/show_bug.cgi?id=589788#c25
--- Comment #25 from Vadim Ponomarev
http://bugzilla.novell.com/show_bug.cgi?id=589788#c26
--- Comment #26 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=589788#c27
--- Comment #27 from Vadim Ponomarev
http://bugzilla.novell.com/show_bug.cgi?id=589788#c28
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=589788#c29
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=589788#c30
--- Comment #30 from Henry Laurent
https://bugzilla.novell.com/show_bug.cgi?id=589788#c31
Swamp Workflow Management
participants (1): bugzilla_noreply@novell.com