[Bug 954765] New: systemd in TumbleWeed adding processes to multiple cgroup controllers causes performance regressions
http://bugzilla.suse.com/show_bug.cgi?id=954765

            Bug ID: 954765
           Summary: systemd in TumbleWeed adding processes to multiple
                    cgroup controllers causes performance regressions
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: 2015*
          Hardware: All
                OS: Other
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Basesystem
          Assignee: bnc-team-screening@forge.provo.novell.com
          Reporter: mel.gorman@microfocus.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

The version of systemd in Tumbleweed matches the behaviour of current upstream by adding user processes to multiple cgroup controllers. This incurs a severe performance impact on a variety of workloads.

Early testing on a 3.12-based kernel and a similar version of systemd showed that adding processes to the blkio controller causes problems between a journal-related kernel thread in one cgroup and the IO submitter in another. To give the fairness guarantees required by the controller, the journal thread and the IO submitter stall waiting on further IO requests that never happen.

The cpuacct controller simply adds a lot of overhead to give the guarantees required by that controller. This incurs a large penalty on scheduler-intensive workloads.

The memory controller has different semantics for direct reclaim whereby old data could be artificially preserved even when it's unused. There is also additional overhead during reclaim as multiple cgroups get scanned.

The controllers are meant to give guarantees about access to resources and there is a performance penalty incurred to give those guarantees.

To illustrate the point, Franck Bui ran a simple benchmark on openSUSE 13.2 userspace with just these packages updated to Tumbleweed:

 - systemd-224-1.1.x86_64
 - kernel-default-4.2.4-1.2.x86_64

The use of the older userspace avoids us having to worry about changes in other packages. In this version of systemd, Delegate=yes is set in /usr/lib/systemd/system/user@.service.
This causes a simple shell to be in multiple controllers, as can be illustrated by this:

  # for FILE in `find /sys/fs/cgroup -name tasks | grep user-0.slice`; do grep -H $$ $FILE; done
  /sys/fs/cgroup/devices/user.slice/user-0.slice/tasks:1820
  /sys/fs/cgroup/memory/user.slice/user-0.slice/tasks:1820
  /sys/fs/cgroup/cpu,cpuacct/user.slice/user-0.slice/session-3.scope/tasks:1820
  /sys/fs/cgroup/systemd/user.slice/user-0.slice/session-3.scope/tasks:1820

The difference between Delegate=no and Delegate=yes is as follows:

pipetest
                              test              test
                 run-1-delegate-no  run-delegate-yes
Min         Time    4.65 (  0.00%)    7.74 (-66.45%)
1st-qrtle   Time    5.52 (  0.00%)    9.89 (-79.17%)
2nd-qrtle   Time    5.59 (  0.00%)   10.44 (-86.76%)
3rd-qrtle   Time    5.64 (  0.00%)   10.66 (-89.01%)
Max-90%     Time    5.67 (  0.00%)   10.76 (-89.77%)
Max-93%     Time    5.68 (  0.00%)   10.77 (-89.61%)
Max-95%     Time    5.69 (  0.00%)   10.78 (-89.46%)
Max-99%     Time    5.78 (  0.00%)   10.81 (-87.02%)
Max         Time    5.88 (  0.00%)   10.88 (-85.03%)
Mean        Time    5.55 (  0.00%)   10.19 (-83.40%)
Best99%Mean Time    5.55 (  0.00%)   10.18 (-83.39%)
Best95%Mean Time    5.54 (  0.00%)   10.16 (-83.23%)
Best90%Mean Time    5.54 (  0.00%)   10.12 (-82.88%)
Best50%Mean Time    5.46 (  0.00%)    9.72 (-77.94%)
Best10%Mean Time    5.19 (  0.00%)    8.70 (-67.46%)
Best5%Mean  Time    5.02 (  0.00%)    8.23 (-64.00%)
Best1%Mean  Time    4.66 (  0.00%)    7.83 (-68.13%)

That is basically saying that the systemd default incurs a massive penalty (83.4% on average) with the current Tumbleweed kernel.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
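For readers checking the table: the bracketed percentages appear to be the change in time relative to the Delegate=no baseline, negative when the run is slower. That reading reproduces the Min and 1st-qrtle rows exactly. The sketch below assumes that formula; the helper name pct_change is made up for illustration and is not part of the benchmark tooling. Some rows (the Mean, for instance) come out slightly different when recomputed from the rounded times shown, presumably because the report was generated from unrounded values.

```shell
# pct_change BASELINE NEW
# Prints the change of NEW relative to BASELINE as a percentage with two
# decimals. The result is negative when NEW is larger, i.e. slower.
# (Hypothetical helper; the formula is inferred from the table.)
pct_change() {
    LC_ALL=C awk -v b="$1" -v n="$2" 'BEGIN { printf "%.2f\n", (b - n) / b * 100 }'
}

pct_change 4.65 7.74    # Min row
pct_change 5.52 9.89    # 1st-qrtle row
```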
http://bugzilla.suse.com/show_bug.cgi?id=954765#c1
--- Comment #1 from Dr. Werner Fink
http://bugzilla.suse.com/show_bug.cgi?id=954765#c2
--- Comment #2 from Mel Gorman
> Maybe it's worth watching the talk about control groups at systemd.conf 2015 on
So I watched it but I still don't see why the activation of the resource controllers is necessary via Delegate=yes. Note that I see no problem with grouping related processes together, it's the controller activation I'm confused by.

The talk opens by saying that cgroups are a means of hierarchically labelling processes and uses this to manage service lifetime. cgroups are primarily about resource control and guaranteeing of fairness, which is what some of the controllers do. As a side-effect, this can map PIDs to services and systemd uses this for service management and notification about service shutdown.

While I can see why the grouping of processes is desirable to cleanly start up and shut down services, I cannot see why the controllers get activated even when full resource control is not required. It's known that the resource control enforcement incurs an 83% penalty on a scheduler microbenchmark due to the cpuacct controller. I observed myself a case where dbench4 regressed 80% due to the blkio controller as the journalling kernel thread and IO submitter were in separate cgroups.

Even if the unified hierarchy was fully in place (and I understand why the separate cgroup hierarchies are difficult), it would not remove the overhead. For example, any process in the user slice with the cpuacct controller activated is going to force all processes to update what is essentially global data (the cumulative cpu usage for all processes in the user slice). Multiple processes updating the same data incurs a high penalty due to cache misses.

Even if the overhead was zero (and I don't know anyone who is working specifically on eliminating the overhead), there is a semantic difference when controllers are enabled. For example, the memory controller creates per-memcg LRU lists. In the event of global memory pressure, those lists are reclaimed proportionally to each other. Their existence alters the order memory is reclaimed in and now a relatively new process can get reclaimed prematurely because it's being aged relative to other older cgroups that are idle.

Now I can see why enabling such resource control would be a good idea in some cases. For example, virtual machines or containers could justify being resource controlled to avoid interfering with each other but that's a special case. It seems like a very bad idea to incur the same overhead and semantic differences for a single user on a single machine running basic workloads that do not require strict resource control.
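The reclaim-order point can be illustrated with a deliberately crude toy model. It is not the kernel's algorithm: real memcg reclaim weighs far more than list size. Everything here is an assumption for illustration: pages are lines of "<cgroup> <age>" with a larger age meaning older, one cgroup is idle with 100 old pages, the other is active with 100 recent pages, and all function names are invented.

```shell
# Toy model only. Pages arrive on stdin as "<cgroup> <age>" lines,
# where a larger age means the page was used longer ago.

# One global LRU: reclaim the N oldest pages regardless of cgroup,
# then report how many pages each cgroup lost.
reclaim_global() {
    sort -k2,2nr | head -n "$1" |
        awk '{ taken[$1]++ } END { for (g in taken) print g, taken[g] }' | sort
}

# Per-cgroup lists: each cgroup gives up a share of N proportional to
# its page count, no matter how old its pages are compared to the others.
reclaim_proportional() {
    awk -v want="$1" '
        { size[$1]++ }
        END { for (g in size) print g, int(want * size[g] / NR) }' | sort
}

# Hypothetical workload: 100 old pages in "idle", 100 recent pages in "active".
sample_pages() {
    awk 'BEGIN {
        for (i = 1; i <= 100; i++) print "idle", 1000 + i
        for (i = 1; i <= 100; i++) print "active", i
    }'
}

sample_pages | reclaim_global 20         # only the idle cgroup loses pages
sample_pages | reclaim_proportional 20   # the active cgroup loses pages too
```

In this model the global list takes all 20 pages from the idle cgroup, while the proportional scheme takes 10 recently used pages from the active one, which is the premature reclaim described above.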
http://bugzilla.suse.com/show_bug.cgi?id=954765#c3
--- Comment #3 from Mike Galbraith
> Even if the overhead was zero (and I don't know anyone who is working specifically on eliminating the overhead), there is a semantic difference when controllers are enabled.

Case in point for the CPU controller: nice level, SCHED_BATCH and SCHED_IDLE classes lose their global scope when group scheduling is enabled.
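A small worked calculation shows what losing global scope means for nice levels. It assumes two always-runnable tasks competing for one CPU, the scheduler's nice-to-weight table values of 1024 for nice 0 and 15 for nice 19, and, in the grouped case, each task alone in its own sibling cgroup with the default cpu.shares of 1024. The helper share_pct is hypothetical.

```shell
# share_pct WEIGHT OTHER_WEIGHT
# CPU percentage an entity of weight WEIGHT receives when competing with
# exactly one other entity of weight OTHER_WEIGHT. (Hypothetical helper.)
share_pct() {
    LC_ALL=C awk -v w="$1" -v o="$2" 'BEGIN { printf "%.1f\n", 100 * w / (w + o) }'
}

# No group scheduling: the nice 19 task (weight 15) competes directly
# with the nice 0 task (weight 1024).
share_pct 15 1024

# Group scheduling: the two cgroups compete with equal default shares,
# so the nice 19 task's group gets the same split as the other group.
share_pct 1024 1024
```

Under these assumptions the nice 19 task goes from roughly 1.4% of the CPU to 50%, because its nice level now only matters relative to other tasks in the same group.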
http://bugzilla.suse.com/show_bug.cgi?id=954765#c4
--- Comment #4 from Mel Gorman
http://bugzilla.suse.com/show_bug.cgi?id=954765#c5
--- Comment #5 from Franck Bui
> It is possible this was an oversight. I do not have the full picture of what was planned but it appears that the intent was to activate this only for new containers. It's just the case that in Tumbleweed normal sessions (or at least ssh sessions from root) are also impacted. If only processes within a container or a VM were resource controlled then it would limit the impact of this bug.

I think the intent was to really activate this for 'root' user sessions too. Regular user sessions are left alone for now unless unified hierarchy is supported and used by the kernel.

See the commit which introduced the Delegate property for some details: https://github.com/systemd/systemd/commit/a931ad47a8623163a29d898224d8a8c117...
http://bugzilla.suse.com/show_bug.cgi?id=954765#c6
--- Comment #6 from Mel Gorman
> (In reply to Mel Gorman from comment #4)
> > It is possible this was an oversight. I do not have the full picture of what was planned but it appears that the intent was to activate this only for new containers. It's just the case that in Tumbleweed normal sessions (or at least ssh sessions from root) are also impacted. If only processes within a container or a VM were resource controlled then it would limit the impact of this bug.
>
> I think the intent was to really activate this for 'root' user sessions too. Regular user sessions are left alone for now unless unified hierarchy is supported and used by the kernel.

This means that any process launched by root, be it a manual launch of a server or a benchmark, is going to be throttled. It cannot be what they really intended or, if they did, it's a massive overhead to incur for very little, if any, gain.

> See commit which introduced the Delegate property for some details: https://github.com/systemd/systemd/commit/a931ad47a8623163a29d898224d8a8c1177ffdaf

The commit to me seems to say that it was only intended to be activated for containers :(
http://bugzilla.suse.com/show_bug.cgi?id=954765#c7
--- Comment #7 from Franck Bui
> (In reply to Franck Bui from comment #5)
> > (In reply to Mel Gorman from comment #4)
> > > It is possible this was an oversight. I do not have the full picture of what was planned but it appears that the intent was to activate this only for new containers. It's just the case that in Tumbleweed normal sessions (or at least ssh sessions from root) are also impacted. If only processes within a container or a VM were resource controlled then it would limit the impact of this bug.
> >
> > I think the intent was to really activate this for 'root' user sessions too. Regular user sessions are left alone for now unless unified hierarchy is supported and used by the kernel.
>
> This means that any process launched by root, be it a manual launch of a server or a benchmark, is going to be throttled. It cannot be what they really intended or, if they did, it's a massive overhead to incur for very little, if any, gain.
>
> > See commit which introduced the Delegate property for some details: https://github.com/systemd/systemd/commit/a931ad47a8623163a29d898224d8a8c1177ffdaf
>
> The commit to me seems to say that it was only intended to be activated for containers :(

From the commit message:

"Delegate=yes should also be set for user@.service, so that systemd --user can run, controlling its own cgroup tree."

user@.service also handles the root user session (in that case it's named 'user@0.service') and this unit is a privileged one when executed for the root user.

And still from the commit message:

"For privileged units this resource control property ensures that the processes have all controllers systemd manages enabled."
http://bugzilla.suse.com/show_bug.cgi?id=954765#c8
--- Comment #8 from Mel Gorman
> > The commit to me seems to say that it was only intended to be activated for containers :(
>
> From the commit message:
>
> "Delegate=yes should also be set for user@.service, so that systemd --user can run, controlling its own cgroup tree."
>
> user@.service also handles the root user session (in that case it's named 'user@0.service') and this unit is a privileged one when executed for the root user.
>
> And still from the commit message:
>
> "For privileged units this resource control property ensures that the processes have all controllers systemd manages enabled."

Given the level of performance damage this decision incurs, can we not disable it until such time as the controllers are all fixed to reduce the overhead? It seems to gain us very little except in specific corner cases but incurs a penalty for basic scenarios. As it is, LEAP is going to be damaged long term in comparison to other distributions.
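For anyone wanting to compare the two settings locally, one possible way to get the Delegate=no side of the benchmark is a systemd drop-in for user@.service. This is only a sketch and not necessarily the workaround shipped in SLE12: the function name and the drop-in file name are arbitrary choices, and the install steps are left as comments because they need root.

```shell
# Print a drop-in that overrides Delegate= for every user@.service instance.
# (Hypothetical helper; it only writes to stdout.)
delegate_dropin() {
    printf '%s\n' '[Service]' 'Delegate=no'
}

delegate_dropin

# Possible install steps, run as root. The file name is an arbitrary choice:
#   mkdir -p /etc/systemd/system/user@.service.d
#   delegate_dropin > /etc/systemd/system/user@.service.d/50-no-delegate.conf
#   systemctl daemon-reload
# Sessions that are already running keep their cgroup setup until the
# user logs out and back in.
```

Afterwards, the reporter's loop over /sys/fs/cgroup from the original description can be rerun in a fresh session to see which controllers the shell still belongs to.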
http://bugzilla.suse.com/show_bug.cgi?id=954765#c9
--- Comment #9 from Franck Bui
> Given the level of performance damage this decision incurs, can we not disable it until such time as the controllers are all fixed to reduce the overhead? It seems to gain us very little except in specific corner cases but incurs a penalty for basic scenarios. As it is, LEAP is going to be damaged long term in comparison to other distributions.

Hi Mel,

Leap is following SLE12 so it should be disabled unless it hasn't got the latest updates yet. I don't know how Leap is updated, but I'll make sure it gets the workaround.

Regarding the rest of the distros (TW, Factory, ...) based on a newer version of systemd, we're still waiting for upstream to see what decision will be taken on this matter. AFAIK we haven't got any feedback on the bug you opened upstream, unfortunately.

Thanks,