x86-64-v2: oggenc, zstd
== oggenc/libvorbis

First, I'll reconfirm my observation from 2009 (https://lwn.net/Articles/357445/); still holds!

AMD AXP, ion.wav (file length: 16m 06.0s):
i586:                    Elapsed time: 1m 58.1s  Rate: 8.1879
i586 -msse -mfpmath=sse: Elapsed time: 1m 41.6s  Rate: 9.5162 (+16%)

All remaining numbers are with an AMD 3700X. jump.wav (file length: 219m 48.0s):
-m32:        Elapsed time: 4m 13.8s  Rate: 51.9569 (-44.9%)
-m32+sse:    Elapsed time: 2m 50.0s  Rate: 77.5843 (-17.7%)
-m32+sse2:   Elapsed time: 2m 48.1s  Rate: 78.4664 (-16.8%)
x32(debian): Elapsed time: 2m 21.4s  Rate: 93.2656
x86-64:      Elapsed time: 2m 19.8s  Rate: 94.3681 (baseline)
x86-64-v2:   Elapsed time: 2m 16.5s  Rate: 96.5983 (+2.3%)
x86-64-v3:   Elapsed time: 2m 11.6s  Rate: 100.2360 (+6.2%)

== oggdec jump.ogg

-m32:      real 0m25.519s  user 0m23.618s  sys 0m1.892s
-m32+sse:  real 0m20.964s  user 0m19.231s  sys 0m1.720s
-m32+sse2: real 0m20.929s  user 0m19.026s  sys 0m1.896s
x32(deb):  real 0m19.510s  user 0m17.552s  sys 0m1.924s
x86-64:    real 0m18.878s  user 0m17.066s  sys 0m1.808s
           real 0m18.892s  user 0m17.469s  sys 0m1.416s
           real 0m19.157s  user 0m17.345s  sys 0m1.804s
x86-64-v2: real 0m19.617s  user 0m17.534s  sys 0m1.688s
           real 0m19.032s  user 0m17.489s  sys 0m1.536s
           real 0m19.286s  user 0m17.301s  sys 0m1.972s
x86-64-v3: real 0m19.489s  user 0m17.306s  sys 0m1.899s
           real 0m18.898s  user 0m16.920s  sys 0m1.972s
           real 0m19.046s  user 0m17.151s  sys 0m1.889s

== zstd -15 linux-6.0.tar

Note that zstd has hand-crafted assembly and may pick BMI2 already. Just like with libvorbis, what is being tested here is the improvement of the *assembler-free* paths (but that much should be obvious from -march).
opensuse i586: real 1m50.580s  user 1m50.621s  sys 0m0.184s
debian x32:    real 1m55.176s  user 1m55.180s  sys 0m0.580s
opensuse x86_64 (-O2 -Wall -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -flto=auto -ffat-lto-objects):
    real 2m0.109s   user 2m0.156s   sys 0m0.244s
    real 1m59.523s  user 1m59.569s  sys 0m0.224s
    real 2m0.330s   user 2m0.339s   sys 0m0.260s
fromsource CFLAGS=-O2:
    real 1m41.340s  user 1m41.486s  sys 0m0.504s
    real 1m40.347s  user 1m40.445s  sys 0m0.340s
    real 1m41.182s  user 1m41.343s  sys 0m0.496s
fromsource CFLAGS=-O2 -march=x86-64-v2:
    real 1m39.001s  user 1m39.128s  sys 0m0.288s
    real 1m39.615s  user 1m39.761s  sys 0m0.344s
    real 1m39.999s  user 1m40.165s  sys 0m0.660s

== zstd -10 linux-6.0.tar

opensuse i586: real 0m32.076s  user 0m32.130s  sys 0m0.192s
debian x32:    real 0m22.772s  user 0m22.819s  sys 0m0.232s
opensuse x86_64:
    real 0m17.454s  user 0m17.583s  sys 0m0.428s
    real 0m17.678s  user 0m17.773s  sys 0m0.340s
    real 0m17.769s  user 0m17.916s  sys 0m0.377s
fromsource CFLAGS=-O2:
    real 0m18.801s  user 0m18.954s  sys 0m0.297s
    real 0m19.430s  user 0m19.574s  sys 0m0.599s
    real 0m18.818s  user 0m18.914s  sys 0m0.335s
fromsource CFLAGS=-O2 -march=x86-64-v2:
    real 0m18.491s  user 0m18.620s  sys 0m0.361s
    real 0m18.495s  user 0m18.660s  sys 0m0.294s
    real 0m18.698s  user 0m18.796s  sys 0m0.489s

== zstd -3 linux-6.0.tar

opensuse i586: real 0m4.757s  user 0m4.866s  sys 0m0.116s
debian x32:    real 0m3.702s  user 0m3.843s  sys 0m0.122s
opensuse x86_64:
    real 0m4.749s  user 0m4.826s  sys 0m0.201s
    real 0m4.839s  user 0m4.919s  sys 0m0.153s
    real 0m4.858s  user 0m4.882s  sys 0m0.212s
fromsource CFLAGS=-O2:
    real 0m2.996s  user 0m3.140s  sys 0m0.478s
    real 0m3.027s  user 0m3.165s  sys 0m0.601s
    real 0m3.064s  user 0m3.174s  sys 0m0.352s
fromsource -march=x86-64-v2:
    real 0m3.008s  user 0m3.158s  sys 0m0.269s
    real 0m2.987s  user 0m3.214s  sys 0m0.528s
    real 0m2.996s  user 0m3.172s  sys 0m0.589s

== Summary ==

Gains from x86-64-v2 are small if any,
but the penalties for going i586-on-64bit can be really significant.
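As an aside to the numbers above (an illustration added here, not part of the original mail): to interpret -march results on one's own hardware, it helps to know which level the local CPU actually satisfies. A minimal sketch, assuming Linux's /proc/cpuinfo flag names and the v2/v3 feature sets from the x86-64 psABI ("pni" is the cpuinfo name for SSE3, "abm" covers lzcnt); on non-x86 machines it simply falls back to the baseline.

```shell
# Determine the highest x86-64 microarchitecture level this CPU meets,
# by checking /proc/cpuinfo flags against the psABI v2/v3 feature sets.
flags=" $(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2) "

# has FLAG... -> succeed only if every named flag is present
has() {
    for f in "$@"; do
        case "$flags" in *" $f "*) ;; *) return 1 ;; esac
    done
}

level=x86-64    # baseline (v1)
if has cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; then
    level=x86-64-v2
    if has avx avx2 abm bmi1 bmi2 f16c fma movbe xsave; then
        level=x86-64-v3
    fi
fi
echo "highest supported level: $level"
```

On a glibc-based distribution the same information also drives which glibc-hwcaps library directories the dynamic linker searches.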
Am 06.10.22 um 09:45 schrieb Jan Engelhardt:
== Summary ==
Gains from x86-64-v2 are small if any, but the penalties for going i586-on-64bit can be really significant.
Thanks for the numbers Jan. Question remains if those micro servers that people have around are really used for performance critical tasks.
But the "if any" still makes me believe we should offer V3 compiled versions for tools where it matters and ignore all of V2 for Tumbleweed.
Disclaimer: SUSE ALP is requiring V2 and its relationship to openSUSE Leap is yet to be defined. We're discussing Tumbleweed here :)
Greetings, Stephan
On 07. 10. 22, 7:48, Stephan Kulow wrote:
Am 06.10.22 um 09:45 schrieb Jan Engelhardt:
== Summary ==
Gains from x86-64-v2 are small if any, but the penalties for going i586-on-64bit can be really significant. Thanks for the numbers Jan.
+1
But the "if any" still makes me believe we should offer V3 compiled versions for tools where it matters and ignore all of V2 for Tumbleweed.
Agreed, this makes a lot of sense to me. But maybe we should establish some deprecation policy for TW too? For example drop support something like 10-15 years after the last model released? I am not even sure we can deprecate 32 bits now (due to those small laptops released few years back).
thanks,
-- js suse labs
On Fri, 7 Oct 2022 08:25:08 +0200 Jiri Slaby wrote:
For example drop support something like 10-15 years after the last model released? I am not even sure we can deprecate 32 bits now (due to those small laptops released few years back).
I do not consider the release date of CPUs as a good indicator for their later usability. Sometimes at this date they are pure slideware and the actual hardware, e.g. in laptops, only gets available a year or more later. On the other hand, the performance versions of an architecture can still compete with budget versions of several later generations. Yes, systems mainly doing heavy work will be upgraded when faster hardware and budget is available. But the former high performance systems remain suitable for doing ordinary work for long time. Kind regards, Dieter
Am 07.10.22 um 15:45 schrieb dieter:
On Fri, 7 Oct 2022 08:25:08 +0200 Jiri Slaby wrote:
For example drop support something like 10-15 years after the last model released? I am not even sure we can deprecate 32 bits now (due to those small laptops released few years back).
I do not consider the release date of CPUs as a good indicator for their later usability. Sometimes at this date they are pure slideware and the actual hardware, e.g. in laptops, only gets available a year or more later.
Right, there might be some interval between the initial release and last shipment. I think it's the latter that's more relevant.
On the other hand, the performance versions of an architecture can still compete with budget versions of several later generations. Yes, systems mainly doing heavy work will be upgraded when faster hardware and budget is available. But the former high performance systems remain suitable for doing ordinary work for long time.
Also agreed, a time threshold is kind of arbitrary and doesn't take the speed of hardware evolution into account. I would view it as undesirable if we artificially cut off hardware that can otherwise still reasonably run Tumbleweed, independent of when it was released.

So the original Pentium is likely fair game to cut off, as it's likely too slow to do anything interesting and probably doesn't even support enough memory for booting. (Though I can't find anything on memory limits right now, so please feel free to correct me on this.) At the end of x86_32 it gets a bit muddy. But upping the baseline to SSE should definitely be good.

Sure, with this criterion we might need a long time to cut off -v1, especially if software doesn't get considerably slower or more memory hungry. (Which some of us are working hard to avert.) But I don't see this as a bad thing: sure, being able to run Linux everywhere might only be a symbolic value to some, but it's an important symbolic value.

And I don't buy the argument that we're leaving lots of performance on the table, due to generic performance improvements (bigger caches, deeper pipelines, better branch prediction etc.) and dynamic dispatch. I've already laid out why newer instructions are already heavily used, and are already used in most places where they matter.

Aaron
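On the dynamic-dispatch point, one concrete mechanism worth noting (an illustration added here, not part of the original mail): since glibc 2.33, the dynamic linker on x86-64 probes glibc-hwcaps subdirectories, so a distribution can ship v2/v3 builds of hot libraries without raising the install baseline. A sketch, assuming an x86-64 Linux system with a reasonably recent glibc; on anything else the count simply comes out as 0.

```shell
# Ask the dynamic linker how many glibc-hwcaps library directories it
# would search on this machine; "--help" output (glibc >= 2.33) lists
# lines like:
#   /usr/lib64/glibc-hwcaps/x86-64-v3 (supported, searched)
ldso=/lib64/ld-linux-x86-64.so.2
hwcaps=0
if [ -x "$ldso" ]; then
    hwcaps=$("$ldso" --help 2>/dev/null | grep -c 'x86-64-v[0-9].*supported' || true)
fi
echo "glibc-hwcaps levels searched here: $hwcaps"
```

This is the "dynamic dispatch at the library level" variant; per-function dispatch (ifunc, `__builtin_cpu_supports`) covers the cases inside a single binary.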
On Sat, Oct 08, 2022 at 01:24:23AM +0200, Aaron Puchert wrote:
Am 07.10.22 um 15:45 schrieb dieter:
On Fri, 7 Oct 2022 08:25:08 +0200 Jiri Slaby wrote:
For example drop support something like 10-15 years after the last model released? I am not even sure we can deprecate 32 bits now (due to those small laptops released few years back).
I do not consider the release date of CPUs as a good indicator for their later usability. Sometimes at this date they are pure slideware and the actual hardware, e.g. in laptops, only gets available a year or more later.
Right, there might be some interval between the initial release and last shipment. I think it's the latter that's more relevant.
On the other hand, the performance versions of an architecture can still compete with budget versions of several later generations. Yes, systems mainly doing heavy work will be upgraded when faster hardware and budget is available. But the former high performance systems remain suitable for doing ordinary work for long time.
Also agreed, a time threshold is kind of arbitrary and doesn't take the speed of hardware evolution into account. I would view it as undesirable if we artificially cut off hardware that can otherwise still reasonably run Tumbleweed, independent of when it was released.
So the original Pentium is likely fair game to cut off, as it's likely too slow to do anything interesting and probably doesn't even support enough memory for booting. (Though I can't find anything on memory limits right
I used to have a late Pentium board, and it went up to something like 150MHz and something like 128MB RAM (theoretically: 64MB was SIMM, which was doable, and another 64MB was DIMM of a layout that was basically vaporware). I trashed it more than 10 years ago when it became unusable for general-purpose distributions, and for embedded it was way too big and power hungry. You can get similar-spec ARM systems these days, but they are not meant for general-purpose computing, and are way smaller.
now, so please feel free to correct me on this.) At the end of x86_32 it gets a bit muddy. But upping the baseline to SSE should definitely be good.
And that's the thing. x86-64-v2 includes an assortment of features. There are popcnt, lahf, and 16-byte atomics, which are largely uncontroversial because most CPUs have them. And then there is SSE all the way to SSE4.2, but there are quite a few CPUs that go only up to SSE4.1. I don't know what the practical difference is, other than trashing a lot of usable CPUs.
Sure, with this criterion we might need a long time to cut off -v1, especially if software doesn't get considerably slower or more memory hungry. (Which some of us are working hard to avert.)
It will happen anyway, and in 10 years the CPUs that don't have SSE4.2 are likely going to be unusable for other reasons as well. But then we will have the same conversation about v3 or v4, because CPU vendors like to differentiate between product lines by (not) including some features.
Thanks
Michal
On Sat, 8 Oct 2022 01:24:23 +0200 Aaron Puchert wrote:
So the original Pentium is likely fair game to cut off, as it's likely too slow to do anything interesting and probably doesn't even support enough memory for booting. (Though I can't find anything on memory limits right now, so please feel free to correct me on this.)
From what I found, there existed Socket 7 mainboards supporting up to 512MB of RAM. I do not know if the required memory modules were actually available and affordable. For most chipsets and mainboards, the memory covered by the level-2 cache was limited to 64MB. So I would also assume original Pentiums (P54C) are hardly usable any more to run a current Linux desktop. Pentium II and Pentium III with sufficient RAM could still be usable. And an Athlon XP at 2GHz with sufficient memory is definitely usable with a lightweight window manager like IceWM. There the problem with 32-bit Tumbleweed is that SSE2 instructions are sneaking into some packages, which these CPUs do not support.
At the end of x86_32 it gets a bit muddy. But upping the baseline to SSE should definitely be good.
I would say SSE is probably supported by all CPUs which are suitable for running Tumbleweed. SSE2 is too high a requirement in my opinion.
Sure, with this criterion we might need a long time to cut off -v1, especially if software doesn't get considerably slower or more memory hungry. (Which some of us are working hard to avert.)
I think there are also other components which will define the lifetime of a system, like PATA or SATA1 hard disks dying and the availability of replacements.
For x86-64-v2, I would say the tiny (if any) performance gain is not worth obsoleting any system which a user still considers usable.
Kind regards, Dieter
Am 07.10.22 um 15:45 schrieb dieter:
I do not consider the release date of CPUs as a good indicator for their later usability. Sometimes at this date they are pure slideware and the actual hardware, e.g. in laptops, only gets available a year or more later. On the other hand, the performance versions of an architecture can still compete with budget versions of several later generations. Yes, systems mainly doing heavy work will be upgraded when faster hardware and budget is available. But the former high performance systems remain suitable for doing ordinary work for long time.
Yes. But the question remains on how many of those you plan to run Tumbleweed on.
Greetings, Stephan
On Sat, 8 Oct 2022 08:13:11 +0200 Stephan Kulow wrote:
But the former high performance
systems remain suitable for doing ordinary work for long time.
Yes. But the question remains on how many of those you plan to run Tumbleweed on.
Tumbleweed is now the only maintained openSUSE distribution supporting x86_32bit. When Leap 15 reaches end of life, only ALP (requiring x86-64-v2) is available, and Tumbleweed also requires x86-64-v2, this means openSUSE users can:
1) dispose of (an) x86-64-v1 system(s) they consider still usable and get something newer in order to keep using a maintained openSUSE version. It could even be that the newer system is slower than their obsoleted system if they can only afford a budget replacement of a former high-performance system.
2) stay at a snapshot which still works and have no more bug fixes and security updates,
3) choose another distribution.
Hard-core openSUSE lovers will choose 1); others might prefer options 2) or 3). I think both should not be the goal of openSUSE - especially when the gain of x86-64-v2 is as small as the data show so far. And my specific numbers: I have 2 x86-64-v1 systems running Leap very fine and another x86-64-v1 system running Tumbleweed a bit slowly.
Kind regards, Dieter
On 10/7/22 16:18, Stephan Kulow wrote:
Am 06.10.22 um 09:45 schrieb Jan Engelhardt:
== Summary ==
Gains from x86-64-v2 are small if any, but the penalties for going i586-on-64bit can be really significant.
Thanks for the numbers Jan. Question remains if those micro servers that people have around are really used for performance critical tasks.
But the "if any" still makes me believe we should offer V3 compiled versions for tools where it matters and ignore all of V2 for Tumbleweed.
Disclaimer: SUSE ALP is requiring V2 and its relationship to openSUSE Leap is yet to be defined. We're discussing Tumbleweed here :)
I think the relationship has been very well defined actually; it has been made quite clear to the community that Leap 15.5 will be the last version of Leap built from Code 15 and that any future version will be built from ALP artifacts. While we don't know enough about what ALP will look like in the end to know which and what kind of artifacts they will be, if SUSE decides that all ALP x86_64 artifacts require V2 or later, it would create a huge amount of effort for the openSUSE community to do anything other than that.
--
Simon Lees (Simotek)  http://simotek.net
Emergency Update Team  keybase.io/simotek
SUSE Linux  Adelaide Australia, UTC+10:30
GPG Fingerprint: 5B87 DB9D 88DC F606 E489 CEC5 0922 C246 02F0 014B
Am 07.10.22 um 08:38 schrieb Simon Lees:
Disclaimer: SUSE ALP is requiring V2 and its relationship to openSUSE Leap is yet to be defined. We're discussing Tumbleweed here :)
I think the relationship has been very well defined actually, it has been made quite clear to the community that Leap 15.5 will be the last version of Leap built from Code 15 and that any future version will be built from ALP artifacts,
Where was that said? Can you link please?
Greetings, Stephan
On Friday 2022-10-07 07:48, Stephan Kulow wrote:
Am 06.10.22 um 09:45 schrieb Jan Engelhardt:
== Summary ==
Gains from x86-64-v2 are small if any, but the penalties for going i586-on-64bit can be really significant.
Question remains if those micro servers that people have around are really used for performance critical tasks.
My upcoming use case is a big screen projection of a livestream of a LAN gathering. A spare laptop was/is used to run VLC to play 1080p 60fps MPEG4. The CPU demand as per top(1) is something like 80%, or perhaps 75 fps decoding speed. It's critical, but also not critical until the decoding speed is at a borderline 61 fps. In essence, the machine is still good for the task, can probably even take another round of Retbleed/Spectre mitigations or a fourth level of FORTIFY_SOURCE. But the use case prospectively can't take the instakill that would come with downgrading from x64 to i586.
On Freitag, 7. Oktober 2022 07:48:25 CEST Stephan Kulow wrote:
Am 06.10.22 um 09:45 schrieb Jan Engelhardt:
== Summary ==
Gains from x86-64-v2 are small if any, but the penalties for going i586-on-64bit can be really significant.
Thanks for the numbers Jan. Question remains if those micro servers that people have around are really used for performance critical tasks.
But the "if any" still makes me believe we should offer V3 compiled versions for tools where it matters and ignore all of V2 for Tumbleweed.
Disclaimer: SUSE ALP is requiring V2 and its relationship to openSUSE Leap is yet to be defined. We're discussing Tumbleweed here :)
Greetings, Stephan
I think having different settings in ALP/Leap/SLE vs Tumbleweed would be a very bad idea. Many packages are mostly maintained in Tumbleweed (after all, we have a "Factory first" policy). This also means the packages are tested primarily in Tumbleweed. Running different code in different streams introduces subtle errors. This is not a theoretical problem, I have seen packages failing on Leap and running fine on Tumbleweed again and again. I don't see a reason to invest any significant amount of time in Leap specific problems (last but not least because I have not a single computer running Leap). Of course there are more differences than compile flags between Leap and Tumbleweed, but each additional change is a possible error cause. Kind regards, Stefan
Am 08.10.22 um 22:20 schrieb Stefan Brüns:
This is not a theoretical problem, I have seen packages failing on Leap and running fine on Tumbleweed again and again. I don't see a reason to invest any significant amount of time in Leap specific problems (last but not least because I have not a single computer running Leap).
Leap binaries compiled with gcc7 and Tumbleweed binaries with gcc12 is a large enough difference not to worry about compiler flags IMO.
Greetings, Stephan
Of course there are more differences than compile flags between Leap and Tumbleweed, but each additional change is a possible error cause.
Am 06.10.22 um 09:45 schrieb Jan Engelhardt:
== zstd -15 linux-6.0.tar
[...]
opensuse x86_64 (-O2 -Wall -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -flto=auto -ffat-lto-objects):
    real 2m0.109s   user 2m0.156s   sys 0m0.244s
    real 1m59.523s  user 1m59.569s  sys 0m0.224s
    real 2m0.330s   user 2m0.339s   sys 0m0.260s
fromsource CFLAGS=-O2:
    real 1m41.340s  user 1m41.486s  sys 0m0.504s
    real 1m40.347s  user 1m40.445s  sys 0m0.340s
    real 1m41.182s  user 1m41.343s  sys 0m0.496s
Certainly I'm misunderstanding something, or is this benchmark saying that our additional flags make zstd around 20% slower? If it is saying that, would you mind bisecting which flag that is? It certainly seems unexpected to me.
== zstd -3 linux-6.0.tar
[...]
opensuse x86_64:
    real 0m4.749s  user 0m4.826s  sys 0m0.201s
    real 0m4.839s  user 0m4.919s  sys 0m0.153s
    real 0m4.858s  user 0m4.882s  sys 0m0.212s
fromsource CFLAGS=-O2:
    real 0m2.996s  user 0m3.140s  sys 0m0.478s
    real 0m3.027s  user 0m3.165s  sys 0m0.601s
    real 0m3.064s  user 0m3.174s  sys 0m0.352s
Same question here (60%?), this seems like a big difference to me. Also strange that we're slightly better with -10. Aaron
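The bisection Aaron asks for could be run along these lines (a hypothetical sketch added here, not from the thread: the `make` invocation, the `./zstd` path, and using `linux-6.0.tar` as the corpus are placeholder assumptions, and only the likely perf-relevant subset of the flags quoted above is included). Shown as a dry run that prints the build matrix:

```shell
# Bisect a distro flag set: start from -O2, add one suspect flag per
# step, and time the same zstd -15 run after each rebuild.
BASE="-O2"
EXTRAS="-D_FORTIFY_SOURCE=3 -fstack-protector-strong -fstack-clash-protection -flto=auto -ffat-lto-objects"

cflags="$BASE"
steps=1
echo "make CFLAGS='$cflags' && time ./zstd -15 -f -o /dev/null linux-6.0.tar"
for f in $EXTRAS; do
    cflags="$cflags $f"
    steps=$((steps + 1))
    echo "make CFLAGS='$cflags' && time ./zstd -15 -f -o /dev/null linux-6.0.tar"
done
```

Adding flags cumulatively (rather than one at a time in isolation) also catches interactions such as LTO changing what the hardening flags cost.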
On Saturday 2022-10-08 01:53, Aaron Puchert wrote:
Am 06.10.22 um 09:45 schrieb Jan Engelhardt:
== zstd -15 linux-6.0.tar
Certainly I'm misunderstanding something, or is this benchmark saying that our additional flags make zstd around 20% slower?
Aaah, that might just be because /usr/bin/zstd is still version 1.5.2 :-/ Let's redo it then.

== /usr/bin/zstd prebuilt (1.5.2!) ==
level 3:  real 0m4.766s   user 0m4.908s   sys 0m0.080s
level 10: real 0m32.276s  user 0m32.353s  sys 0m0.132s
level 15: real 1m49.700s  user 1m49.726s  sys 0m0.192s

== CFLAGS="$(rpm --eval %optflags)", a.k.a. -O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables (1.5.3/git) ==
level 3:
    real 0m2.934s  user 0m3.143s  sys 0m0.488s
    real 0m2.967s  user 0m3.165s  sys 0m0.499s
    real 0m2.930s  user 0m3.142s  sys 0m0.482s
level 10:
    real 0m17.120s  user 0m17.247s  sys 0m0.236s
    real 0m17.372s  user 0m17.545s  sys 0m0.201s
level 15:
    real 1m30.501s  user 1m30.666s  sys 0m0.412s
    real 1m31.255s  user 1m31.448s  sys 0m0.488s
    real 1m31.035s  user 1m31.178s  sys 0m0.344s

== CFLAGS="-O2 -g -m64" (1.5.3/git) ==
level 3:  real 0m3.001s   user 0m3.150s   sys 0m0.543s
level 10: real 0m17.458s  user 0m17.595s  sys 0m0.212s
level 15: real 1m30.458s  user 1m30.607s  sys 0m0.316s

== CFLAGS="-O2 -g -march=x86-64-v2" (1.5.3/git) ==
level 3:  real 0m2.919s   user 0m3.059s   sys 0m0.205s
level 10: real 0m17.261s  user 0m17.450s  sys 0m0.377s
level 15: real 1m30.268s  user 1m30.343s  sys 0m0.424s
participants (8):
- Aaron Puchert
- dieter
- Jan Engelhardt
- Jiri Slaby
- Michal Suchánek
- Simon Lees
- Stefan Brüns
- Stephan Kulow