[opensuse-kernel] Why does SUSE still choose SLAB as the default?
Hi, I would like to know why SUSE, in both SLES and openSUSE, still uses CONFIG_SLAB as the allocator instead of CONFIG_SLUB. I've seen worst-case scenarios locking up on buffer_head when there is fragmentation and the number of slab objects is in the millions. Is there any particular case where it performs better than SLUB? I'm not investigating low-end systems, where SLOB is another option. Regards, Navin
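A quick way to tell which allocator a running kernel was built with is to probe for SLUB's sysfs interface: SLUB creates per-cache directories under /sys/kernel/slab, while SLAB and SLOB do not. A minimal sketch, assuming sysfs is mounted at /sys:

/* slab_or_slub.c - minimal sketch: guess the kernel's slab allocator
 * by probing for SLUB's sysfs directory.  Assumes sysfs is mounted at /sys. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    /* SLUB registers per-cache directories under /sys/kernel/slab;
     * SLAB and SLOB do not create this directory. */
    if (stat("/sys/kernel/slab", &st) == 0 && S_ISDIR(st.st_mode))
        printf("CONFIG_SLUB=y (found /sys/kernel/slab)\n");
    else
        printf("likely CONFIG_SLAB or CONFIG_SLOB (no /sys/kernel/slab)\n");

    return 0;
}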
On 8/27/15 10:04 PM, Navin Parakkal wrote:
Hi, I would like to know why SUSE through SLES and openSUSE still uses CONFIG_SLAB as the allocator instead of CONFIG_SLUB ?.
I've seen worst case scenarios locking up buffer_head when there is fragmentation and the number of slabs are in millions.
Any particular case where it performs better than SLUB . I'm not investigating on low end systems where SLOB is another option.
Mel can probably comment further, but there are two answers. 1) Inertia 2) Our testing hasn't shown any clear winner under all workloads between the two allocators and our mm experts have many years of experience working with the slab code. -Jeff -- Jeff Mahoney SUSE Labs
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
On 8/27/15 10:04 PM, Navin Parakkal wrote:
Hi, I would like to know why SUSE through SLES and openSUSE still uses CONFIG_SLAB as the allocator instead of CONFIG_SLUB ?.
I've seen worst case scenarios locking up buffer_head when there is fragmentation and the number of slabs are in millions.
Any particular case where it performs better than SLUB . I'm not investigating on low end systems where SLOB is another option.
Mel can probably comment further, but there are two answers.
1) Inertia
I'll skip this one.
2) Our testing hasn't shown any clear winner under all workloads between the two allocators and our mm experts have many years of experience working with the slab code.
https://oss.oracle.com/projects/codefragments/src/trunk/bufferheads Once you allocate around 200M+ objects using this buffer.ko, you then insmod https://oss.oracle.com/projects/codefragments/src/trunk/fragment-slab/ and trigger fragmentation with 8192 > /proc/test. I found that on RHEL 6.6, which uses CONFIG_SLAB=y, CPUs got stuck on a box with 128 GB of physical RAM and 32 CPUs. But on CentOS 7.1, which uses CONFIG_SLUB=y, I didn't notice the problem of a CPU getting stuck, or that message in dmesg. I did insmod buffer.ko, then 987654321 > /proc/test, and it allocated around 400 million+ buffer_head objects as per /proc/slabinfo. I might have done something wrong, or there could be other things that are missing. It would be of much help and good learning if you can help. Regards, Navin
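A minimal userspace sketch of the reproducer described above, assuming the out-of-tree buffer.ko module has been loaded and has created the /proc/test file it uses (run as root, since /proc/slabinfo is root-readable on recent kernels):

/* drive_test.c - illustrative driver for the reproducer above.
 * Assumes the out-of-tree buffer.ko from the URL above is loaded and has
 * created /proc/test; the "987654321" count is the one used in the thread. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/test", "w");
    if (!f) { perror("open /proc/test"); return 1; }
    /* Ask the module to allocate this many buffer_head objects. */
    fprintf(f, "987654321\n");
    fclose(f);

    /* Report how big the buffer_head cache actually got. */
    char line[512];
    f = fopen("/proc/slabinfo", "r");
    if (!f) { perror("open /proc/slabinfo"); return 1; }
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "buffer_head", strlen("buffer_head")) == 0)
            fputs(line, stdout);
    fclose(f);
    return 0;
}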
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
1) Inertia
i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk.
2) Our testing hasn't shown any clear winner under all workloads between the two allocators and our mm experts have many years of experience working with the slab code.
https://oss.oracle.com/projects/codefragments/src/trunk/bufferheads
Once you allocate around 200M+ objects using this buffer.ko. The you insmod https://oss.oracle.com/projects/codefragments/src/trunk/fragment-slab/ and do fragment of 8192 > /proc/test
I would rather appreciate a real-life scenario showing SLUB performing (significantly) better, not an artificially crafted extreme test case. Michal Kubeček
On Fri, Aug 28, 2015 at 11:03 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
1) Inertia
i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk.
2) Our testing hasn't shown any clear winner under all workloads between the two allocators and our mm experts have many years of experience working with the slab code.
https://oss.oracle.com/projects/codefragments/src/trunk/bufferheads
Once you allocate around 200M+ objects using this buffer.ko. The you insmod https://oss.oracle.com/projects/codefragments/src/trunk/fragment-slab/ and do fragment of 8192 > /proc/test
I would rather appreciate a real life scenario showing SLUB performing (significantly) better, not an artificially crafted extreme testcase.
Well, I have to admit and agree here that after 800 million+ buffer_head objects, RHEL 7.1 with CONFIG_SLUB=y on the same hardware configuration, i.e. 32 CPUs and 128 GB, is also getting into a soft lockup. But CONFIG_SLAB hits this faster, at around the 200+ million object range. As for a real-life scenario, if you have access to SUSE customer bug reports, you can look up the one below; there are some details there. [Bug 936077] L3: Watchdog detected hard LOCKUP. Michal has provided the details. What worries me is the comment that you shouldn't touch /proc/slabinfo; slabtop certainly does. It would be good if someone can tell me where SLAB performs better than SLUB and vice versa, and what is common between them that could be causing this. Maybe some bit of learning. Regards, Navin
On Fri 28-08-15 11:16:22, Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 11:03 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
1) Inertia
i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk.
2) Our testing hasn't shown any clear winner under all workloads between the two allocators and our mm experts have many years of experience working with the slab code.
https://oss.oracle.com/projects/codefragments/src/trunk/bufferheads
Once you allocate around 200M+ objects using this buffer.ko. The you insmod https://oss.oracle.com/projects/codefragments/src/trunk/fragment-slab/ and do fragment of 8192 > /proc/test
I would rather appreciate a real life scenario showing SLUB performing (significantly) better, not an artificially crafted extreme testcase.
Well i have to admit and agree here that after 800 million + buffer head objects , RHEL 7.1 with CONFIG_SLUB=y with same hardware configuration ie 32 cpus and 128GB is also getting into soft lockup.
But CONFIG_SLAB is hits this faster at like 200+ million range of objects.
Well for real lice scenario , if you have access to suse customer bug reports, you can lookup on the below, there are some details.
[Bug 936077] L3: Watchdog detected hard LOCKUP .
Michal has provided the details. What worries me is that the comment you shouldn't touch /proc/slabinfo. slabtop certainly does.
Yes, and slabtop, like /proc/slabinfo, is a debugging tool which shouldn't be overused. It is true that slabinfo sucks for SLAB, but that alone is not a reason to change the allocator, especially when there was no clear winner as per Mel, who has compared the two recently.
It would be good if someone can tell me where SLAB performs better than SLUB and vice versa and what is the common between them that could be causing this. Maybe some bit of learning.
IIRC, neither allocator consistently outperformed the other. Some loads were slightly better off with SLUB, others with SLAB. Mel will know better and can share some numbers. We haven't changed the default because SLAB has been working well for us for years; I do not remember any significant bug in that area for years. -- Michal Hocko SUSE Labs
On Fri, Aug 28, 2015 at 1:21 PM, Michal Hocko <mhocko@suse.cz> wrote:
On Fri 28-08-15 11:16:22, Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 11:03 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
1) Inertia
i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk.
2) Our testing hasn't shown any clear winner under all workloads between the two allocators and our mm experts have many years of experience working with the slab code.
https://oss.oracle.com/projects/codefragments/src/trunk/bufferheads
Once you allocate around 200M+ objects using this buffer.ko. The you insmod https://oss.oracle.com/projects/codefragments/src/trunk/fragment-slab/ and do fragment of 8192 > /proc/test
I would rather appreciate a real life scenario showing SLUB performing (significantly) better, not an artificially crafted extreme testcase.
Well i have to admit and agree here that after 800 million + buffer head objects , RHEL 7.1 with CONFIG_SLUB=y with same hardware configuration ie 32 cpus and 128GB is also getting into soft lockup.
But CONFIG_SLAB is hits this faster at like 200+ million range of objects.
Well for real lice scenario , if you have access to suse customer bug reports, you can lookup on the below, there are some details.
[Bug 936077] L3: Watchdog detected hard LOCKUP .
Michal has provided the details. What worries me is that the comment you shouldn't touch /proc/slabinfo. slabtop certainly does.
Yes and slabtop like /proc/slabinfo are both debugging tools which shouldn't be overused. It is true that slabinfo sucks for SLAB but that alone is not a reason to change the allocator especially when there was no clear winner as per Mel who has compared those two recently.
It would be good if someone can tell me where SLAB performs better than SLUB and vice versa and what is the common between them that could be causing this. Maybe some bit of learning.
IIRC none of the allocator outperformed consistently. Some loads were slightly better off with SLUB others with SLAB. Mel will know better and can share some numbers.
We haven't changed the default because SLAB has been working well for us for years. I do not remember any significant bug in that area for years. --
Can we do some caching like that discussed in this thread: https://lkml.org/lkml/2015/8/12/964 ? That got me thinking that this would at least alleviate the problem. The /proc data gets updated only once per second, or half a second, for example; if you want more detailed and real-time info, go to /sys. I don't know how much of this is feasible, but I thought it would be worth some discussion and learning.
On Fri 28-08-15 14:08:10, Navin Parakkal wrote: [...]
Can we do some caching like discussed in this thread ? https://lkml.org/lkml/2015/8/12/964
Some level of caching might be possible; I would have to think about it some more, but slab, unlike vmalloc, is much more volatile, so we would end up updating the cached values quite often in the end, and that requires the locks. It would be better to collect this information during allocation, but that risks affecting hot paths, which is not desirable.
That got me thinking that this would atleast alleviate the problem . The proc stuff gets updated only per second or 1/2 a second for example . If you want more detailed and realtime info go to /sys.
Any runtime throttling can be done from userspace (just do not read the file _that_ often). This is different from what you refer to above, because there it is glibc which reads /proc/meminfo, so in-kernel throttling is appropriate. So far I haven't heard _why_ collecting that information during normal load is useful. What kind of decision can you make based on the numbers? Sure, it might be interesting when _debugging_ unexpected behavior, but I wouldn't call that a production run, and sub-optimality there is not nice but it is not a killer either.
I don't know how much of this is feasible , but thought would be worth some discussion and learning.
-- Michal Hocko SUSE Labs
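As a sketch of the userspace-side throttling Michal suggests (simply not reading the file that often), a monitoring tool could keep its own cached copy of /proc/slabinfo and refresh it at most once per interval; the one-second interval below is arbitrary:

/* slabinfo_cache.c - minimal sketch of userspace throttling: keep a cached
 * snapshot of /proc/slabinfo and only re-read it when the cache is older
 * than CACHE_SECS.  The interval and buffer size are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CACHE_SECS 1          /* arbitrary refresh interval */

static char *cached;          /* last snapshot of /proc/slabinfo */
static time_t cached_at;      /* when it was taken */

static const char *slabinfo_cached(void)
{
    time_t now = time(NULL);

    if (cached && now - cached_at < CACHE_SECS)
        return cached;                   /* still fresh: no /proc read */

    FILE *f = fopen("/proc/slabinfo", "r");
    if (!f)
        return cached;                   /* keep stale data on failure */

    size_t cap = 1 << 20;                /* 1 MiB; plenty for slabinfo */
    char *buf = malloc(cap);
    if (buf) {
        size_t len = fread(buf, 1, cap - 1, f);
        buf[len] = '\0';
        free(cached);
        cached = buf;
        cached_at = now;
    }
    fclose(f);
    return cached;
}

int main(void)
{
    /* Two back-to-back calls: the second is served from the cache. */
    for (int i = 0; i < 2; i++) {
        const char *s = slabinfo_cached();
        printf("%.120s\n", s ? s : "(slabinfo unavailable)");
    }
    return 0;
}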
On 2015-08-28 T 07:33 +0200 Michal Kubecek wrote:
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney wrote:
1) Inertia
i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk.
Well said. And that's probably why "cautiousness" is the better word here, ... So long - MgE -- Matthias G. Eckermann - Senior Product Manager SUSE® Linux Enterprise SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Michal Kubecek wrote:
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
1) Inertia i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk. ==== SUSE already changed this. The onus of justifying the use of the outdated tech is on SuSE, as they stopped following the upstream default at 2.6.23.
From stackoverflow:
Slab is the original, based on Bonwick's seminal paper and available since Linux kernel version 2.2. It is a faithful implementation of Bonwick's proposal, augmented by the multiprocessor changes described in Bonwick's follow-up paper[2]. Slub is the next-generation replacement memory allocator, which has been the default in the Linux kernel since 2.6.23. It continues to employ the basic "slab" model, but fixes several deficiencies in Slab's design, particularly around systems with large numbers of processors. Slub is simpler than Slab.

What should you use? Slub, unless you are building a kernel for an embedded device with limited memory. In that case, I would benchmark Slub versus SLOB and see what works best for your workload. There is no reason to use Slab; it will likely be removed from future Linux kernel releases.

==============

I.e., if SUSE is aiming at small embedded devices, or optimizing for toasters and microwave ovens, SLAB is recommended. But if SUSE wants to support computers with *more processors* (which seems to be the trend for most customers), then SLUB is recommended. So, to turn the inertia question around, what GOOD reasons does SUSE have for staying with a memory allocator designed for single-processor/embedded devices vs. larger multi-core systems? Of course, it's too bad such conservative thinking didn't go into areas regarding system boot and service management.... *ahem*....
Hello Linda, it seems you haven't quite understood what was written. On Fri, 28 Aug 2015 12:01:11 -0700 "Linda A. Walsh" <suse@tlinx.org> wrote:
[...] From stackoverflow:
First, what makes stackoverflow a better source of information than an in-house benchmark?
[...] What should you use? Slub, unless you are building a kernel for an embedded device with limited in memory. In that case, I would benchmark Slub versus SLOB and see what works best for your workload. There is no reason to use Slab; it will likely be removed from future Linux kernel releases.
==============
I.e. if suse's aiming at small embedded devices/or optimizing it for toasters and microwave ovens, SLAB is recommended.
Wrong. Read the above quotation again. If you aim at small embedded devices, you should use SLOB, not SLAB. This was even mentioned in this thread before.
But if suse wants to support computers with *more processors* (seems to be the trend for most customers), then SLUB is recommended.
The wonders of using passive voice. WHO made the recommendation? What did they use apart from argumentum ad novitatem? (See https://en.wikipedia.org/wiki/Appeal_to_novelty)
So to turn the inertial question around, why GOOD reasons does Suse have for staying with a memory allocator designed for single procesor-embeded devices vs. larger multi-core systems?
We, our partners and our customers have tested this memory allocator on systems with thousands of CPUs and a variety of different workloads. No bugs were found, and the performance was not an issue. Why should SUSE suddenly throw away results of all that testing?
Of course, it's too bad such conservative thinking didn't go into areas regarding system boot and service management.... *ahem*....
While I would probably agree here, it's not really relevant to the discussion at hand. Even if I eventually agree that switching to systemd/dracut (whatever you had in mind) was a mistake, does it mean that SUSE must from now on repeat the same mistake in other areas (choosing the kernel allocator)? Cheers, Petr Tesarik
On 8/30/15 6:55 AM, Petr Tesarik wrote:
Hello Linda,
it seems you haven't quite understood what was written.
On Fri, 28 Aug 2015 12:01:11 -0700 "Linda A. Walsh" <suse@tlinx.org> wrote:
[...] From stackoverflow:
First, what makes stackoverflow a better source of information than an in-house benchmark?
Indeed. Not only do the engineers responsible for maintaining SLAB in our kernels have decades of combined Linux kernel memory management experience, they have roles within the larger global kernel community that put them in the position to have a serious part of any conversation that ends with "it will likely be removed from future kernel releases." -Jeff -- Jeff Mahoney SUSE Labs
On Fri, Aug 28, 2015 at 12:01:11PM -0700, Linda A. Walsh wrote:
Michal Kubecek wrote:
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
1) Inertia i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk.
==== Suse already changed this. The onus proving the use of the outdated tech is on SuSE as they stopped using the default since 2.6.23.
That assumes that the upstream default was changed for a compelling reason -- it wasn't. The commit that enabled it by default simply read "There are some reports that 2.6.22 has SLUB as the default. Not true! This will make SLUB the default for 2.6.23." It was asserted at the time, by the author of SLUB, that slab had inherent unfixable flaws and that it needed to go away but there never was a universal consensus on this. Around the same time, SLUB was found by multiple independent tests to be slower than SLAB. There were attempts to address some limitations in the SLUB implementation (https://lwn.net/Articles/311502/) but they were never finished as the author moved on.
From stackoverflow:
Slab is the original, based on Bonwick's seminal paper and available since Linux kernel version 2.2. It is a faithful implementation of Bonwick's proposal, augmented by the multiprocessor changes described in Bonwick's follow-up paper[2].
Slub is the next-generation replacement memory allocator, which has been the default in the Linux kernel since 2.6.23. It continues to employ the basic "slab" model, but fixes several deficiencies in Slab's design, particularly around systems with large numbers of processors. Slub is simpler than Slab.
What should you use? Slub, unless you are building a kernel for an embedded device with limited in memory. In that case, I would benchmark Slub versus SLOB and see what works best for your workload. There is no reason to use Slab; it will likely be removed from future Linux kernel releases.
==============
I.e. if suse's aiming at small embedded devices/or optimizing it for toasters and microwave ovens, SLAB is recommended.
A stack overflow post supported by no data is not the basis for making a decision on what small object allocator to use in the kernel. SLOB is an allocator that may be designed for a toaster or a microwave oven. SLAB has been and currently is used on machines with >= 1024 CPUs. It is extremely rare to identify a workload where SLAB is the limiting factor except in the case where the debugging interfaces are used aggressively. I say extremely rare because I'm not aware of one. The only case where I heard that SLUB mattered was on specialised applications that heavily relied on the performance of the SLUB fast path and *only* the SLUB fast path. That scenario does not apply to many workloads.
But if suse wants to support computers with *more processors* (seems to be the trend for most customers), then SLUB is recommended.
A recommendation on Stack Overflow is rarely an authoritative source. The author of SLUB used to consistently assert that SLUB was the only way that more processors could be effectively supported, but this was rarely backed up by independent tests. For example, SLUB became the default in 2007. In 2010, tests by a Google engineer indicated that SLUB was 11.5% to 20.9% slower than SLAB (https://lkml.org/lkml/2010/7/14/448). Five years later, the same engineer states that SLUB is still problematic for the workloads they care about (https://lkml.org/lkml/2015/8/27/559). I know that at the time I briefly looked at SLUB performance and also found it to be much slower, but did not do anything with the information as I was not working on distribution kernels then.
So to turn the inertial question around, why GOOD reasons does Suse have for staying with a memory allocator designed for single procesor-embeded devices vs. larger multi-core systems?
I do not speak for SUSE, so do not consider this an official response, but here is my take on it: in the 8 years since SLUB became the default, there has been a significant number of commits aimed at improving the performance of SLUB. According to my own research, these 8 years of effort have brought the performance of SLUB *in line* with SLAB, and it rarely significantly exceeds it except in microbenchmarks. I know that vendors of very large machines have conducted their own performance evaluations of the kernel and, in that time, to the best of my knowledge, SLAB has not been identified as a limiting factor. I'm not aware of a compelling reason to pick one over the other at the moment, but the fact that SLUB was known to have severe performance regressions 3 years after it became the default means that openSUSE staying on SLAB was probably the correct choice. -- Mel Gorman SUSE Labs
On Tue, Sep 1, 2015 at 7:21 PM, Mel Gorman <mgorman@suse.de> wrote:
On Fri, Aug 28, 2015 at 12:01:11PM -0700, Linda A. Walsh wrote:
Michal Kubecek wrote:
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
1) Inertia i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk.
====
The problem doesn't occur with Oracle systems because during backup they have ways to use direct I/O, and many DBAs set that by default. That sounds like a good thing to do, because a backup is basically read source, write destination, both with direct I/O, and your buffer caches hardly grow. But when you don't explicitly specify direct I/O while copying files of huge size, like 1000 files of 1 GB or 10000 files of 100 MB, the buffer_head and ext4_inode_cache or xfs_inode_cache caches grow very fast. When you are backing up terabytes of data, i.e. a full export and backup, I think it is better not to fill up the kernel caches that you are going to seldom use. Any thoughts on this?
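A minimal sketch of the direct-I/O copy described above: opening both files with O_DIRECT bypasses the page cache, so the copy does not inflate buffer_head or inode caches the way a buffered copy does. The file names, 1 MiB block size and 4 KiB alignment are arbitrary; O_DIRECT requires the buffer, offset and transfer size to be aligned to the device's logical block size, and a non-aligned tail would need a buffered fallback:

/* direct_copy.c - sketch of copying a file with O_DIRECT so the copy
 * bypasses the page cache.  Block size and alignment are arbitrary;
 * error handling is kept minimal. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK (1024 * 1024)   /* must be a multiple of the device block size */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int in = open(argv[1], O_RDONLY | O_DIRECT);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    ssize_t n;
    while ((n = read(in, buf, BLOCK)) > 0) {
        /* Note: a partial tail that is not block-aligned would need a
         * buffered fallback; omitted here to keep the sketch short. */
        if (write(out, buf, n) != n) { perror("write"); return 1; }
    }
    if (n < 0) perror("read");

    close(in);
    close(out);
    free(buf);
    return 0;
}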
On Thu, Sep 03, 2015 at 12:07:03PM +0530, Navin Parakkal wrote:
On Tue, Sep 1, 2015 at 7:21 PM, Mel Gorman <mgorman@suse.de> wrote:
On Fri, Aug 28, 2015 at 12:01:11PM -0700, Linda A. Walsh wrote:
Michal Kubecek wrote:
On Friday 28 of August 2015 10:34:48 Navin Parakkal wrote:
On Fri, Aug 28, 2015 at 9:20 AM, Jeff Mahoney <jeffm@suse.com> wrote:
1) Inertia i'll skip this one.
Well, you shouldn't. To change things, one should have good reason. The more intrusive the change, the stronger the arguments for it should be. Unless there is a substantial gain, the change is not worth the effort and the risk.
====
The problem doesn't occur with Oracle systems because during backup they have ways to use direct I/O and many DBAs set that by default . Sounds like a good thing to do because backup is like read source and write dest in both in direct I/O. Your buffer caches hardly grow but when you don't you explicity specify direct i/o when you copy files of huge size like 1000 files of 1G or 10000 files of 100 MB , the buffer_head and ext4_inode_cache or xfs_inode_cache grows very fast.
When you are backing up TeraBytes of data ie full export and backup , i think it is better not to fill up the kernel caches that you are going to seldom use. Any thoughts on this ?
The madvise and fadvise calls are there to discard unnecessary data read during backup. There is a prototype wrapper that forces it to happen at https://code.google.com/p/pagecache-mangagement/. Alternatively, a backup running inside a memcg would not be able to fill memory with unnecessary data, as it would be forced to reclaim in a timely fashion. It's not a problem that's directly related to either SLAB or SLUB, as the page reclaim and shrinker interface that handles both remains the same. There may be an indirect problem when freeing many millions of slab objects at the same time, but a soft lockup in such a path would likely be fixable with a cond_resched() call at the correct time. -- Mel Gorman SUSE Labs
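A minimal sketch of the fadvise approach Mel mentions: read a file for backup and periodically tell the kernel that the just-read range will not be needed again. POSIX_FADV_DONTNEED only drops clean pages, so freshly written data would need an fdatasync() first; the 8 MiB drop interval is arbitrary:

/* backup_read_fadvise.c - sketch of reading a file for backup while asking
 * the kernel to drop the page cache behind us.  POSIX_FADV_DONTNEED only
 * discards clean pages, so just-written data would need fdatasync() first. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)   /* drop cache every 8 MiB; arbitrary */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[64 * 1024];
    off_t done = 0, last_drop = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* ... feed buf to the backup stream here ... */
        done += n;
        if (done - last_drop >= CHUNK) {
            /* Tell the kernel we will not re-read this range. */
            posix_fadvise(fd, last_drop, done - last_drop, POSIX_FADV_DONTNEED);
            last_drop = done;
        }
    }
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  /* drop the remainder */
    close(fd);
    return n < 0 ? 1 : 0;
}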
On Thu 03-09-15 12:07:03, Navin Parakkal wrote: [...]
The problem doesn't occur with Oracle systems because during backup they have ways to use direct I/O and many DBAs set that by default .
I am still not sure I understand what is _the problem_. Do you see any considerable stalls during backups caused by excessive slab usage? You have mentioned that you can see stalls during an artificial load earlier but nothing about a real world example. -- Michal Hocko SUSE Labs
On 03/09/2015 12:51, Michal Hocko wrote:
On Thu 03-09-15 12:07:03, Navin Parakkal wrote: [...]
The problem doesn't occur with Oracle systems because during backup they have ways to use direct I/O and many DBAs set that by default . ... When you are backing up TeraBytes of data ie full export and backup , i think it is better not to fill up the kernel caches that you are going to seldom use.
I am still not sure I understand what is _the problem_. Do you see any considerable stalls during backups caused by excessive slab usage? You have mentioned that you can see stalls during an artificial load earlier but nothing about a real world example.
As an end user, I can suggest a scenario. I don't know if it relates to SLAB/SLUB, but I know that it is a real scenario similar to the one Navin describes. On some systems, when writing the openSUSE DVD ISO image to a USB stick, as the stick is slow at writing, the kernel tries to cache as much as it can in memory, maybe the full 4 GB. This apparently has the consequence of starving other processes of cache and RAM on some machines, causing the entire machine to crawl and become unresponsive for minutes. Using certain flags with dd to drop the cache helps, but it would be better to tell the kernel to limit the amount of cache used for a certain device, or per process, or have it automatically make the proper decisions, whatever they may be :-) -- Saludos/Cheers, Carlos E.R. (Minas-Morgul - W7)
On Thu 03-09-15 13:38:22, Carlos E. R. wrote:
On 03/09/2015 12:51, Michal Hocko wrote:
On Thu 03-09-15 12:07:03, Navin Parakkal wrote: [...]
The problem doesn't occur with Oracle systems because during backup they have ways to use direct I/O and many DBAs set that by default . ... When you are backing up TeraBytes of data ie full export and backup , i think it is better not to fill up the kernel caches that you are going to seldom use.
I am still not sure I understand what is _the problem_. Do you see any considerable stalls during backups caused by excessive slab usage? You have mentioned that you can see stalls during an artificial load earlier but nothing about a real world example.
As end user, I can suggest an scenario. I don't know if it relates to SLAB/SLUB, but I know that it is a real scenario similar to the one Navin describes.
On some systems, when writing the openSUSE DVD iso image to an USB stick, as the stick is slow writing, the kernel tries to cache as much as it can in memory, maybe the full 4 gigs. This, apparently, has the consequence of starving other processes of cache use and ram, on some machines, causing the entire machine to crawl and become unresponsive for minutes.
This doesn't seem to have anything to do with SL.B, though. It is more about the amount of dirty pages which are allowed to accumulate before writers get throttled (dirty_{background_}ratio resp. _bytes). And yes, we have seen these issues a lot in the past, especially with slow devices. The situation should be _somewhat_ better these days, but it is still recommended to use the _bytes variants to throttle writers or start writeback sooner. It would, of course, be much better if the system could auto-tune, but this is a hard problem which is still waiting to be solved. -- Michal Hocko SUSE Labs
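For completeness, the _bytes variants Michal refers to are the vm.dirty_bytes and vm.dirty_background_bytes sysctls. A sketch of setting them from a small program follows, with 64 MiB / 16 MiB as purely illustrative values, not a recommendation (writing dirty_bytes makes the kernel ignore dirty_ratio, and vice versa; needs root):

/* set_dirty_limits.c - sketch of switching to the vm.dirty_*_bytes knobs,
 * so writeback starts earlier and writers are throttled sooner when copying
 * to a slow device.  The 64 MiB / 16 MiB values are only an example. */
#include <stdio.h>

static int write_sysctl(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%ld\n", value);
    return fclose(f);
}

int main(void)
{
    /* Throttle writers once 64 MiB of dirty data is outstanding... */
    write_sysctl("/proc/sys/vm/dirty_bytes", 64L * 1024 * 1024);
    /* ...and start background writeback at 16 MiB. */
    write_sysctl("/proc/sys/vm/dirty_background_bytes", 16L * 1024 * 1024);
    return 0;
}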
On Thu, Aug 27, 2015 at 11:50:35PM -0400, Jeff Mahoney wrote:
On 8/27/15 10:04 PM, Navin Parakkal wrote:
Hi, I would like to know why SUSE through SLES and openSUSE still uses CONFIG_SLAB as the allocator instead of CONFIG_SLUB ?.
I've seen worst case scenarios locking up buffer_head when there is fragmentation and the number of slabs are in millions.
Any particular case where it performs better than SLUB . I'm not investigating on low end systems where SLOB is another option.
Mel can probably comment further, but there are two answers.
1) Inertia 2) Our testing hasn't shown any clear winner under all workloads between the two allocators and our mm experts have many years of experience working with the slab code.
This question was raised before and my answer at the time was http://lists.opensuse.org/opensuse-kernel/2014-06/msg00047.html. Much of what I said there is still relevant, even if it lacked data at the time.

I recently looked at the performance of SLAB vs SLUB on 4 mid-range servers using kernel 4.1. The objective was to see if there was a compelling reason to switch to SLUB in openSUSE. I didn't release the report as it would need a lot of polish before it'd be fit for public consumption.

tldr: There is no known compelling reason to make the switch, but there is no known barrier to it either. SLUB is potentially easier to debug, but it has some potential degradation over long periods of time that is not quantified. There were also some gaps in the study that ideally would be closed before making a switch. Right now, I'm favouring keeping the status quo, but would be open to hearing about a realistic workload or hardware configuration that clearly distinguished between them.

Longer version: SLOB was ruled out as an option because it's known not to scale on even low-end hardware. It really is for embedded environments with constrained memory. The following workloads were evaluated:

  dbt5 on ext4
  pgbench on ext4
  netperf with multiple sizes, both TCP and UDP, streaming and RR
  netperf with multiple sizes, both TCP and UDP RR
  netperf with multiple instances with 256 bytes fixed size
  Hackbench with pipes
  Hackbench with sockets
  Deduplication and compression of large files

Ultimately, there was little difference in performance between SLAB and SLUB but, crucially, neither was consistently better or worse than the other. I'm not going to write up the results in email format as, by and large, they are not that interesting. Instead, here is a slightly edited version of the study's conclusion.

=== begin report extract ===

Early evaluations of SLAB and SLUB indicated that SLUB was slower in a variety of workloads. The shorter paths and smaller management overhead were theoretically beneficial, but offset by a reliance on high-order pages and a long slow path. In recent years there have been a number of aggressive attempts to improve the performance of SLUB while SLAB remained stable. This study indicates that there is little difference in performance in many cases. In the cases where there are large performance differences, neither SLAB nor SLUB is the consistent winner. For network workloads, it appeared that SLUB was generally faster for TCP-based workloads, but we cannot make a general statement on whether openSUSE users' workloads prefer TCP over UDP performance.

The observation of this study is that there are no known performance-related barriers to using SLUB in openSUSE if it's based on a 4.1 kernel, albeit there is also no compelling reason to make the switch. With no known distinction in performance, the fact that SLUB can be debugged at runtime is compelling, and the lower memory footprint is attractive. This is offset by some limitations in the study.

A key limitation of the network portions of this study was that they were mostly run over localhost. An evaluation was conducted on a pair of machines that match the description, but the results showed no difference between SLAB and SLUB. That result is not very interesting as it indicates that the network speed was the limiting factor; it would be necessary to install faster network cards or a faster switch to properly conduct that evaluation. The machines that were used in 2007 and 2012 to demonstrate network-related performance regressions were also 4-node machines, which is an important factor as SLAB and SLUB differ in how remote data is managed. That configuration should be replicated if possible.

A second important gap in the study was the evaluation of an in-memory database. A configuration was included, but the results were not reported. Kernel 4.1 has a regression that caused the workload to swap, and other data indicates that this regression was introduced between 4.0 and 4.1. The root cause of this regression is not known at the time of writing.

The final recommendation is to keep the status quo until the gaps are addressed or a compelling reason is found to justify the risk of switching.

-- Mel Gorman SUSE Labs
participants (10):
- Carlos E. R.
- Jeff Mahoney
- Linda A. Walsh
- Matthias G. Eckermann
- Mel Gorman
- Mel Gorman
- Michal Hocko
- Michal Kubecek
- Navin Parakkal
- Petr Tesarik