which configuration changes for x86_64-v2+ necessary?
Hi, can someone describe what the exact changes in which OBS projects / packages will be to enable x86_64-v2-or-later-only? My idea is that I'd like to build a set of packages that allow "legacy" machines to still boot / be useful. My guess / hope is, that only a few packages actually need to be built with "legacy" flags enabled (glibc of course, maybe gcc?) and most others will just keep working. More packages can then be added if we see "SIGILL" crashes. -- Stefan Seyfried "For a successful technology, reality must take precedence over public relations, for nature cannot be fooled." -- Richard Feynman
On Tuesday 2022-09-27 09:18, Stefan Seyfried wrote:
can someone describe what the exact changes in which OBS projects / packages will be to enable x86_64-v2-or-later-only?
osc meta prjconf -e
add a line
Optflags: x86_64 -march=x86-64(-v2)
My idea is that I'd like to build a set of packages that allow "legacy" machines to still boot / be useful.
This would be the time to opine for a https://download.opensuse.org/ports/x86_64/ project.
My guess / hope is, that only a few packages actually need to be built with "legacy" flags enabled (glibc of course, maybe gcc?) and most others will just keep working.
About 15 years ago I could run i586 packages on a 386DX, but that only worked because the kernel had soft emulation for 387 floating-point instructions. I don't see that for SSSE3/SSE4.2/POPCNT (x86-64-v2).
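Since there is no kernel emulation to fall back on, it can help to know up front whether a given machine has the v2 instructions at all. The following is a minimal sketch, assuming GCC or Clang on x86; the function names are made up for illustration. It only checks the features that `__builtin_cpu_supports` can query by name (SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT); CMPXCHG16B and LAHF/SAHF, which also belong to x86-64-v2, are not covered.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the running CPU reports the listed x86-64-v2 features,
 * 0 otherwise (also 0 on non-x86 builds or unsupported compilers).
 * Assumption: this is only a subset of the full v2 feature list. */
int cpu_has_v2_subset(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse3")
        && __builtin_cpu_supports("ssse3")
        && __builtin_cpu_supports("sse4.1")
        && __builtin_cpu_supports("sse4.2")
        && __builtin_cpu_supports("popcnt");
#else
    return 0;
#endif
}

/* Prints a one-line verdict to `out` and returns the same value as
 * cpu_has_v2_subset(). */
int report_v2_subset(FILE *out)
{
    assert(out != NULL);
    int ok = cpu_has_v2_subset();
    fprintf(out, "x86-64-v2 feature subset: %s\n",
            ok ? "present" : "missing or not checkable");
    return ok;
}
```

Calling `report_v2_subset(stdout)` on a "legacy" machine would be expected to report the subset as missing, which is exactly the case where v2-built binaries can die with SIGILL.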
On 27.09.22 10:30, Jan Engelhardt wrote:
On Tuesday 2022-09-27 09:18, Stefan Seyfried wrote:
can someone describe what the exact changes in which OBS projects / packages will be to enable x86_64-v2-or-later-only?
osc meta prjconf -e
add a line
Optflags: x86_64 -march=x86-64(-v2)
OK, this would be easy. Isn't there stuff in glibc that decides to no longer build the "legacy hwcaps" stuff etc?
My idea is that I'd like to build a set of packages that allow "legacy" machines to still boot / be useful.
This would be the time to opine for a https://download.opensuse.org/ports/x86_64/ project.
Absolutely. Or maybe /ports/x86_64-v2+ and keep the default as-is ;-)
My guess / hope is, that only a few packages actually need to be built with "legacy" flags enabled (glibc of course, maybe gcc?) and most others will just keep working.
About 15 years ago I could run i586 packages on a 386DX, but that only worked because the kernel had soft emulation for 387 floating-point instructions.
I don't see that for SSSE3/SSE4.2/POPCNT (x86-64-v2).
But does the compiler even use these often for "normal" software? My naive idea would be to just recompile the packages that actually use them.

I'd really like to avoid, for now, doing

  for i in $(osc ls openSUSE:Factory); do
    osc linkpac openSUSE:Factory $i home:seife:factory
  done

and start building *everything* ;-)
On Tue, 27 Sep 2022, Stefan Seyfried wrote:
Hi,
can someone describe what the exact changes in which OBS projects / packages will be to enable x86_64-v2-or-later-only?
There is mainly the project config and its 'Optflags' setting, but in the past (mostly for non-x86 architectures) we have also hard-wired the lowest supported architecture in the GCC configuration. I now have a pending change there for x86-64-v2 which looks like

  %if "%{TARGET_ARCH}" == "i586"
  %if 0%{?sle_version:%sle_version} >= 150000
  %if %{suse_version} >= 1599
          --with-arch-32=x86-64-v2 \
  %else
          --with-arch-32=x86-64 \
  %endif
  %else
          --with-arch-32=i586 \
  %endif
          --with-tune=generic \
  %endif
  %if "%{TARGET_ARCH}" == "x86_64"
  %ifnarch %{disable_multilib_arch}
          --enable-multilib \
  %if %{suse_version} >= 1599
          --with-arch-32=x86-64-v2 \
  %else
          --with-arch-32=x86-64 \
  %endif
  %endif

(mighty complicated - there's also still no sle_version in the ALP repositories I could use).

The above effectively means -march=x86-64-v2 will be the default unless overridden on the command line. It also means GCC's runtime is built with this setting.

Note that on SLES 15 we build the 'i586' tree with -march=x86-64 so that the 32bit multilibs also benefit from features of x86-64 where applicable. In Factory we're building the 'i586' tree (and thus the 32bit multilibs for x86_64) with -march=i586 so they are usable in a pure 32bit environment as well.
My idea is that I'd like to build a set of packages that allow "legacy" machines to still boot / be useful. My guess / hope is, that only a few packages actually need to be built with "legacy" flags enabled (glibc of course, maybe gcc?) and most others will just keep working. More packages can then be added if we see "SIGILL" crashes.
With the GCC change above you'd need to override Optflags in the project config with

  Optflags: x86_64 -march=x86-64

but note that packages that do not use RPM_OPT_FLAGS appropriately will still contain x86-64-v2 code via GCC's default unless you rebuild the compiler.

Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; HRB 36809 (AG Nuernberg)
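To see which baseline a given compile actually used (for instance in a package that ignores RPM_OPT_FLAGS and therefore picks up the compiler default), one option is to look at the compiler's predefined macros. This is only a sketch with hypothetical function names: the macros checked are a representative subset per level, not the complete psABI feature list, and level 3 here simply means "v3 or later".

```c
#include <assert.h>

/* Approximates the x86-64 level this translation unit was compiled for,
 * based on a few predefined feature macros.
 * 0 = not an x86-64 build, 1 = baseline, 2 = v2, 3 = v3 or later. */
int compiled_baseline_level(void)
{
#if !defined(__x86_64__)
    return 0;
#elif defined(__AVX2__) && defined(__BMI2__) && defined(__FMA__)
    return 3;
#elif defined(__SSE4_2__) && defined(__SSSE3__) && defined(__POPCNT__)
    return 2;
#else
    return 1;
#endif
}

/* Maps a level from compiled_baseline_level() to a readable name. */
const char *baseline_name(int level)
{
    static const char *const names[] = {
        "not an x86-64 build",
        "x86-64 (baseline)",
        "x86-64-v2 features enabled",
        "x86-64-v3 (or later) features enabled",
    };
    assert(level >= 0 && level <= 3);
    return names[level];
}
```

Built once with -march=x86-64 and once without any -march, the reported level would differ if the compiler's configured default is v2.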
On Tuesday 2022-09-27 11:09, Stefan Seyfried wrote:
About 15 years ago I could run i586 packages on a 386DX, but that only worked because the kernel had soft emulation for 387 floating-point instructions.
I don't see that for SSSE3/SSE4.2/POPCNT (x86-64-v2).
But does the compiler even use these often for "normal" software?
If the compiler sees an opportunity, sure.

  unsigned int popcnt_for_the_poor(unsigned int v)
  {
          unsigned int z = 0;
          for (int i = 0; i < 32; ++i)
                  if (v & (1 << i))
                          ++z;
          return z;
  }

$ g++ x.cpp -c -O3 -march=x86-64-v2 && objdump -d x.o

x.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_Z19popcnt_for_the_poorj>:
   0:   31 c9                   xor    %ecx,%ecx
   2:   31 d2                   xor    %edx,%edx
   4:   be 01 00 00 00          mov    $0x1,%esi
   9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  10:   89 f0                   mov    %esi,%eax
  12:   d3 e0                   shl    %cl,%eax
  14:   21 f8                   and    %edi,%eax
  16:   83 f8 01                cmp    $0x1,%eax
  19:   83 da ff                sbb    $0xffffffff,%edx
  1c:   83 c1 01                add    $0x1,%ecx
  1f:   83 f9 20                cmp    $0x20,%ecx
  22:   75 ec                   jne    10 <_Z19popcnt_for_the_poorj+0x10>
  24:   89 d0                   mov    %edx,%eax
  26:   c3                      ret

$ g++ x.cpp -c -O3 -march=x86-64-v3 && objdump -d x.o

x.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_Z19popcnt_for_the_poorj>:
   0:   c5 f9 6e cf             vmovd  %edi,%xmm1
   4:   c5 e1 ef db             vpxor  %xmm3,%xmm3,%xmm3
   8:   b8 01 00 00 00          mov    $0x1,%eax
   d:   c4 e2 7d 58 c9          vpbroadcastd %xmm1,%ymm1
  12:   c5 f5 db 05 00 00 00    vpand  0x0(%rip),%ymm1,%ymm0        # 1a <_Z19popcnt_for_the_poorj+0x1a>
  19:   00
  1a:   c5 f9 6e e0             vmovd  %eax,%xmm4
  1e:   c5 f5 db 15 00 00 00    vpand  0x0(%rip),%ymm1,%ymm2        # 26 <_Z19popcnt_for_the_poorj+0x26>
  25:   00
  26:   c4 e2 7d 58 e4          vpbroadcastd %xmm4,%ymm4
  2b:   c5 fd 76 c3             vpcmpeqd %ymm3,%ymm0,%ymm0
  2f:   c5 ed 76 d3             vpcmpeqd %ymm3,%ymm2,%ymm2
  33:   c5 fd 76 c3             vpcmpeqd %ymm3,%ymm0,%ymm0
  37:   c5 ed df d4             vpandn %ymm4,%ymm2,%ymm2
  3b:   c5 ed fa d0             vpsubd %ymm0,%ymm2,%ymm2
  3f:   c5 f5 db 05 00 00 00    vpand  0x0(%rip),%ymm1,%ymm0        # 47 <_Z19popcnt_for_the_poorj+0x47>
  46:   00
  47:   c5 f5 db 0d 00 00 00    vpand  0x0(%rip),%ymm1,%ymm1        # 4f <_Z19popcnt_for_the_poorj+0x4f>
  4e:   00
  4f:   c5 f5 76 cb             vpcmpeqd %ymm3,%ymm1,%ymm1
  53:   c5 fd 76 c3             vpcmpeqd %ymm3,%ymm0,%ymm0
  57:   c5 f5 76 cb             vpcmpeqd %ymm3,%ymm1,%ymm1
  5b:   c5 fd df c4             vpandn %ymm4,%ymm0,%ymm0
  5f:   c5 fd fa c1             vpsubd %ymm1,%ymm0,%ymm0
  63:   c5 fd fe c2             vpaddd %ymm2,%ymm0,%ymm0
  67:   c5 f9 6f c8             vmovdqa %xmm0,%xmm1
  6b:   c4 e3 7d 39 c0 01       vextracti128 $0x1,%ymm0,%xmm0
  71:   c5 f1 fe c0             vpaddd %xmm0,%xmm1,%xmm0
  75:   c5 f1 73 d8 08          vpsrldq $0x8,%xmm0,%xmm1
  7a:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
  7e:   c5 f1 73 d8 04          vpsrldq $0x4,%xmm0,%xmm1
  83:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
  87:   c5 f9 7e c0             vmovd  %xmm0,%eax
  8b:   c5 f8 77                vzeroupper
  8e:   c3                      ret
On Tue, 27 Sep 2022, Jan Engelhardt wrote:
On Tuesday 2022-09-27 11:09, Stefan Seyfried wrote:
About 15 years ago I could run i586 packages on a 386DX, but that only worked because the kernel had soft emulation for 387 floating-point instructions.
I don't see that for SSSE3/SSE4.2/POPCNT (x86-64-v2).
But does the compiler even use these often for "normal" software?
If the compiler sees an opportunity, sure.
  unsigned int popcnt_for_the_poor(unsigned int v)
  {
          unsigned int z = 0;
          for (int i = 0; i < 32; ++i)
                  if (v & (1 << i))
                          ++z;
          return z;
  }
We only handle

  int fancy_popcnt (long b)
  {
      int c = 0;
      while (b) {
          b &= b - 1;
          c++;
      }
      return c;
  }

not your very poor variant. Produces

  PopCount:
  .LFB0:
          .cfi_startproc
          xorl    %eax, %eax
          popcntq %rdi, %rax
          ret

with v2.

Richard.
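For reference, the two loop shapes discussed here compute the same value; they differ only in whether the compiler's loop idiom recognition turns them into a single POPCNT. The sketch below restates both on `unsigned int` (the function names are illustrative, and `1u` is used instead of `1` to avoid shifting into the sign bit) and cross-checks them, against `__builtin_popcount` too when the compiler is GCC or Clang.

```c
#include <assert.h>
#include <limits.h>

/* Bit-by-bit test loop, the "very poor" variant from the thread. */
unsigned int popcount_bitloop(unsigned int v)
{
    unsigned int z = 0;
    for (unsigned int i = 0; i < sizeof v * CHAR_BIT; ++i)
        if (v & (1u << i))
            ++z;
    return z;
}

/* Clear-lowest-set-bit loop, the shape said to be recognised as popcount. */
unsigned int popcount_clearlowest(unsigned int v)
{
    unsigned int c = 0;
    while (v) {
        v &= v - 1;   /* drops the lowest set bit */
        ++c;
    }
    return c;
}

/* Asserts that all variants agree on a handful of sample values. */
void check_popcount_variants(void)
{
    static const unsigned int samples[] = { 0u, 1u, 0xFFu, 0xF0F0u, ~0u };
    for (unsigned int i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        unsigned int v = samples[i];
        assert(popcount_bitloop(v) == popcount_clearlowest(v));
#if defined(__GNUC__) || defined(__clang__)
        assert(popcount_bitloop(v) == (unsigned int)__builtin_popcount(v));
#endif
    }
}
```

Compiling this file with -O3 and comparing the output for -march=x86-64 and -march=x86-64-v2 is one way to see which of the two loops, if any, a particular compiler version collapses.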
DimStar already tried to enable v2 on Staging:O, as he did before for v3:
https://build.opensuse.org/projects/openSUSE:Factory:Staging:O/prjconf

So far it says:

  Optflags: x86_64 -march=x86-64-v2

On Tue, 2022-09-27 at 10:55 +0000, Richard Biener wrote:
On Tue, 27 Sep 2022, Stefan Seyfried wrote:
Hi,
can someone describe what the exact changes in which OBS projects / packages will be to enable x86_64-v2-or-later-only?
There is mainly the project config and its 'Optflags' setting but in the past (mostly for non-x86 architectures) we have also hard-wired the lowest supported architecture in the GCC configuration. I now have a pending change there for x86-64-v2 which looks like

  %if "%{TARGET_ARCH}" == "i586"
  %if 0%{?sle_version:%sle_version} >= 150000
  %if %{suse_version} >= 1599
          --with-arch-32=x86-64-v2 \
  %else
          --with-arch-32=x86-64 \
  %endif
  %else
          --with-arch-32=i586 \
  %endif
          --with-tune=generic \
  %endif
  %if "%{TARGET_ARCH}" == "x86_64"
  %ifnarch %{disable_multilib_arch}
          --enable-multilib \
  %if %{suse_version} >= 1599
          --with-arch-32=x86-64-v2 \
  %else
          --with-arch-32=x86-64 \
  %endif
  %endif

(mighty complicated - there's also still no sle_version in the ALP repositories I could use).
The above effectively means -march=x86-64-v2 will be default unless overridden on the command-line. It also means GCCs runtime is built with this setting.
Note on SLES 15 we build the 'i586' tree with -march=x86-64 to also get the 32bit multilibs benefit from features of x86-64 where applicable. In Factory we're building the 'i586' tree (and thus 32bit multilibs for x86_64) with -march=i586 so they are usable in a pure 32bit environment as well.
My idea is that I'd like to build a set of packages that allow "legacy" machines to still boot / be useful. My guess / hope is, that only a few packages actually need to be built with "legacy" flags enabled (glibc of course, maybe gcc?) and most others will just keep working. More packages can then be added if we see "SIGILL" crashes.
With the GCC change above you'd need to override Optflags in the project config with
Optflags: x86_64 -march=x86-64
but note that packages that do not use RPM_OPT_FLAGS appropriately will have x86-64-v2 uses left via GCCs default unless you re-build the compiler.
Richard.
-- Luboš Kocman (he/him) openSUSE Leap Release Manager SUSE +1 2345678910 www.suse.com
On 27.09.22 12:55, Richard Biener wrote:
On Tue, 27 Sep 2022, Stefan Seyfried wrote:
My idea is that I'd like to build a set of packages that allow "legacy" machines to still boot / be useful. My guess / hope is, that only a few packages actually need to be built with "legacy" flags enabled (glibc of course, maybe gcc?) and most others will just keep working. More packages can then be added if we see "SIGILL" crashes.
This obviously will not work: as Jan and you already explained, too many packages will have x86-64-v2 code hidden somewhere.
With the GCC change above you'd need to override Optflags in the project config with
Optflags: x86_64 -march=x86-64
but note that packages that do not use RPM_OPT_FLAGS appropriately will have x86-64-v2 uses left via GCCs default unless you re-build the compiler.
OK, so I'll patch "my" gcc version to revert that change, then link everything from openSUSE:Factory to home:seife:Factory, and still set

  Optflags: x86_64 -march=x86-64

just to be sure. Then I'll see what happens if I update my machine to the repo that is generated by that stunt :-)
Am 27.09.22 um 15:03 schrieb Richard Biener:
On Tue, 27 Sep 2022, Jan Engelhardt wrote:
If the compiler sees an opportunity, sure.
  unsigned int popcnt_for_the_poor(unsigned int v)
  {
          unsigned int z = 0;
          for (int i = 0; i < 32; ++i)
                  if (v & (1 << i))
                          ++z;
          return z;
  }

We only handle

  int fancy_popcnt (long b)
  {
      int c = 0;
      while (b) {
          b &= b - 1;
          c++;
      }
      return c;
  }

not your very poor variant. Produces

  PopCount:
  .LFB0:
          .cfi_startproc
          xorl    %eax, %eax
          popcntq %rdi, %rax
          ret
with v2.
Same for LLVM [1], though I'm beginning to think we should support the "very poor" variant as it naturally comes up when using e.g. <algorithm> on an std::bitset or std::vector<bool>. After all, if you can write the "b &= b - 1" loop, you can also write __builtin_popcount, so we're essentially just doing this to optimize legacy code. It's not really "necessary".

Aaron

[1] <https://github.com/llvm/llvm-project/blob/llvmorg-15.0.1/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp#L1616-L1744>
Am 27.09.22 um 11:09 schrieb Stefan Seyfried:
But does the compiler even use these often for "normal" software? My naive idea would be to just recompile these packages that actually use these.
I've written about this before [1]: it might actually be pretty rare. The SIMD instructions are largely special-purpose, and POPCNT doesn't have very good loop idiom recognition in either GCC or LLVM (in my view); it's only emitted for __builtin_popcount and one special loop. On top of that, counting set bits is not an awfully common pattern.

I would love to see some kind of assembly diff for building the whole distribution with x86_64-v2, but this might be pretty hard. Perhaps Bernhard might be able to do this using the reproducible builds infrastructure.

Aaron

[1] <https://lists.opensuse.org/archives/list/factory@lists.opensuse.org/message/TZRE33GEC42AMT4FA56YRLP6ETCOVHR2/>
On Wednesday 2022-09-28 02:20, Aaron Puchert wrote:
Am 27.09.22 um 15:03 schrieb Richard Biener:
On Tue, 27 Sep 2022, Jan Engelhardt wrote:
Stefan Seyfried wrote:
But does the compiler even use these [POPCNT, AVX, etc.] often for "normal" software?
Jan Engelhardt wrote:
If the compiler sees an opportunity, sure. unsigned int popcnt_for_the_poor(unsigned int v){...}
Richard Biener wrote:
We only handle int fancy_popcnt (long b) {...}
While gcc did not manage to make a POPCNT insn out of poor_popcnt, gcc did turn poor_popcnt into a loopless set of vector instructions, therefore showing that "normal" software indeed can suddenly gain "special" instructions.

Aaron Puchert wrote:
Same for LLVM [1], though I'm beginning to think we should support the "very poor" variant as it naturally comes up when using e.g. <algorithm> on an std::bitset or std::vector<bool>.
After all, if you can write the "b &= b - 1" loop, you can also write __builtin_popcount [..] so we're essentially just doing this to optimize legacy code. It's not really "necessary".
An original software provider may choose not to utilize __builtin_popcount (I do not think this is available in MSVC under the exact name), and packagers/users downstream don't normally go looking for code and manually replace any code by __builtin_xyz. So, it is fair to demand that compilers like gcc and clang ought to recognize poor_popcnt patterns. We will always have legacy code, might as well deal with it.
On 27/09/2022 12.55, Richard Biener wrote:
  %if %{suse_version} >= 1599
          --with-arch-32=x86-64-v2 \
  %else
          --with-arch-32=x86-64 \
  %endif
It would be nicer if this did not depend on suse_version directly, but used a macro set in prjconf. Then ports projects could set it differently.
On 28/09/2022 02.29, Aaron Puchert wrote:
I would love to see some kind of assembly diff for building the whole distribution with x86_64-v2, but this might be pretty hard. Perhaps Bernhard might be able to do this using the reproducible builds infrastructure.
I have seen software use -march=native and that always differed. In any non-trivial code, the compiler will find places where newer instructions can be used.

Just for the hello-world, it produced the same assembly with

  gcc-11 -O2 -march=x86-64-v2 -S hello.c

Ciao
Bernhard M.
Am 28.09.22 um 02:38 schrieb Jan Engelhardt:
An original software provider may choose not to utilize __builtin_popcount (I do not think this is available in MSVC under the exact name),
Right, MSVC tends to use different names. It has __popcnt. [1] So you can write a little inline helper function (or macro if you want) with some #ifdefs that covers the three big compilers, and a fallback for other compilers. There is also std::popcount in C++20. [2] So at some point in the future this particular idiom might not be that important anymore.
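Such a helper could look roughly like the following sketch. The names `portable_popcount` and `popcount_words` are made up for illustration. One caveat worth stating: MSVC's `__popcnt` always emits the POPCNT instruction, so on that branch the helper is itself only safe on CPUs that have it, whereas `__builtin_popcount` follows whatever -march the file is compiled with.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

/* Counts the set bits in v using the best facility the compiler offers. */
static inline unsigned int portable_popcount(unsigned int v)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned int)__builtin_popcount(v);
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    return __popcnt(v);          /* emits POPCNT unconditionally */
#else
    unsigned int c = 0;          /* fallback: clear-lowest-set-bit loop */
    while (v) {
        v &= v - 1;
        ++c;
    }
    return c;
#endif
}

/* Sums the set bits over an array of n words (e.g. a hand-rolled bitmap). */
size_t popcount_words(const unsigned int *words, size_t n)
{
    assert(words != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += portable_popcount(words[i]);
    return total;
}
```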
packagers/users downstream don't normally go looking for code and manually replace any code by __builtin_xyz.

It depends. Lots of software won't do this and won't need this. But it's definitely not uncommon to use intrinsics for popcnt or c{t,l}z alias bs{f,r}. Also quite popular are the __builtin_{add,sub,mul}_overflow intrinsics, especially the multiplication variant. LLVM has recently become able to convert a division overflow check into a (much faster) check of OF, but this wasn't always the case. Also, things get messy with signed integers.

So, it is fair to demand that compilers like gcc and clang ought to recognize poor_popcnt patterns. We will always have legacy code, might as well deal with it.
I wasn't so much saying that we should drop this pattern, but rather that it's probably not as interesting as your initial example. Such code might still arise even if you're willing to use builtins, because you can't run __builtin_popcount on an std::bitset or std::vector<bool> (they don't expose their bit representation). Now with std::bitset you can use bitset::count, which is likely using popcnt if available. But for std::vector<bool> you'd have to iterate over the vector, and you have no choice but to write the "very poor" loop. Which makes that idiom a bit more relevant than the one we support, in my view.

Aaron

[1] <https://learn.microsoft.com/en-us/cpp/intrinsics/popcnt16-popcnt-popcnt64>
[2] <https://en.cppreference.com/w/cpp/numeric/popcount>
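As an illustration of the overflow intrinsics mentioned in this message, here is a minimal sketch of a checked size multiplication. The function name is hypothetical; `__builtin_mul_overflow` is the GCC/Clang builtin, and the fallback is the classic division-based check that the builtin is meant to replace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes count * size into *out.
 * Returns 1 if the product fits in size_t, 0 on overflow.
 * On overflow the value left in *out should not be relied upon. */
int checked_mul_size(size_t count, size_t size, size_t *out)
{
    assert(out != NULL);
#if defined(__GNUC__) || defined(__clang__)
    /* The builtin returns nonzero when the multiplication overflowed. */
    return !__builtin_mul_overflow(count, size, out);
#else
    /* Portable fallback: compare against SIZE_MAX / size first. */
    if (size != 0 && count > SIZE_MAX / size)
        return 0;
    *out = count * size;
    return 1;
#endif
}
```

A typical use would be guarding an allocation size, e.g. calling `checked_mul_size(n, sizeof(struct item), &bytes)` before `malloc(bytes)`, where `struct item` stands in for whatever element type the caller has.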
Am 28.09.22 um 02:58 schrieb Bernhard M. Wiedemann:
On 28/09/2022 02.29, Aaron Puchert wrote:
I would love to see some kind of assembly diff for building the whole distribution with x86_64-v2, but this might be pretty hard. Perhaps Bernhard might be able to do this using the reproducible builds infrastructure.
I have seen software use -march=native and that always differed. In any non-trivial code, the compiler will find places where newer instructions can be used.
Of course on i586 we can expect -march=native to have a big impact, but x86_64 base already has SSE2 and doesn't use x87 anymore. Of course there's wider vectorization with AVX, but lots of code doesn't vectorize well. I guess -march=xxx also implies -mtune=xxx, or at least changes scheduling decisions. So we don't only see newer instructions being used, we also see older instructions being shuffled around or replaced by different older instructions because newer hardware has different latency/throughput characteristics. There is probably no way to reliably assess what changed due to new instructions being available. Aaron
On 28/09/2022 03.28, Aaron Puchert wrote:
Am 28.09.22 um 02:58 schrieb Bernhard M. Wiedemann:
On 28/09/2022 02.29, Aaron Puchert wrote:
I would love to see some kind of assembly diff for building the whole distribution with x86_64-v2, but this might be pretty hard. Perhaps Bernhard might be able to do this using the reproducible builds infrastructure.
I have seen software use -march=native and that always differed. In any non-trivial code, the compiler will find places where newer instructions can be used.
Of course on i586 we can expect -march=native to have a big impact, but x86_64 base already has SSE2 and doesn't use x87 anymore. Of course there's wider vectorization with AVX, but lots of code doesn't vectorize well.
I guess -march=xxx also implies -mtune=xxx, or at least changes scheduling decisions. So we don't only see newer instructions being used, we also see older instructions being shuffled around or replaced by different older instructions because newer hardware has different latency/throughput characteristics.
There is probably no way to reliably assess what changed due to new instructions being available.
Aaron
https://rb.zq1.de/temp/zstd-compare.out

I created this with a new still-to-be-pushed version of my reproducibleopensuse tools with

  c="-O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection" ; optflags1="$c -march=x86-64" optflags2="$c -march=x86-64-v2" rbk

The gcc-11 man page says that in the case of -march=x86-64* generic tuning is applied, and you can see a difference in the instructions created, not just in ordering:

https://github.com/facebook/zstd/blob/eadb6c874f9d0c9e90c835f8b0181da802361e...
--- old /usr/lib64/libzstd.a/cover.o (disasm)
+++ new /usr/lib64/libzstd.a/cover.o (disasm)
@@ -122,19 +122,19 @@
 jg <COVER_ctx_init + ofs>
 mov (%rsp),%rax
 sub %r15,%rbp
+movd %ebx,%xmm0
 mov %r12,offset(%r13)
 lea offset(%rbp),%r14
+pinsrd $something,%r11d,%xmm0
 mov %esi,offset(%rsp)
 mov %rax,offset(%r13)
-mov %ebx,%eax
+mov %r10d,%eax
+pmovzxdq %xmm0,%xmm0
 lea offset(,%r14,4),%r15
 mov %rax,offset(%r13)
-mov %r11d,%eax
 mov %r15,%rdi
-mov %rax,offset(%r13)
-mov %r10d,%eax
-mov %rax,offset(%r13)
 mov %r14,offset(%r13)
+movups %xmm0,offset(%r13)
 mov %r9d,offset(%rsp)
 call <COVER_ctx_init + ofs>
Am 28.09.22 um 08:25 schrieb Bernhard M. Wiedemann:
On 28/09/2022 03.28, Aaron Puchert wrote:
Am 28.09.22 um 02:58 schrieb Bernhard M. Wiedemann:
On 28/09/2022 02.29, Aaron Puchert wrote:
I would love to see some kind of assembly diff for building the whole distribution with x86_64-v2, but this might be pretty hard. Perhaps Bernhard might be able to do this using the reproducible builds infrastructure.
I have seen software use -march=native and that always differed. In any non-trivial code, the compiler will find places where newer instructions can be used.
Of course on i586 we can expect -march=native to have a big impact, but x86_64 base already has SSE2 and doesn't use x87 anymore. Of course there's wider vectorization with AVX, but lots of code doesn't vectorize well.
I guess -march=xxx also implies -mtune=xxx, or at least changes scheduling decisions. So we don't only see newer instructions being used, we also see older instructions being shuffled around or replaced by different older instructions because newer hardware has different latency/throughput characteristics.
There is probably no way to reliably assess what changed due to new instructions being available.
Aaron
Thanks, that's interesting. Of course we don't know how hot these instructions are, but we can already see what has changed.
I created this with a new still-to-be-pushed version of my reproducibleopensuse tools with
c="-O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection" ; optflags1="$c -march=x86-64" optflags2="$c -march=x86-64-v2" rbk
The gcc-11 man-page says, that in the case of -march=x86-64* generic tuning is applied
True, but how do you explain the changes in /usr/lib64/libzstd.so.1.5.2 (disasm)? This doesn't seem to use any new instructions. There are also workarounds for false register dependencies, and these would no longer be required if you know you have a newer chip.
and you can see a difference in instructions created, not just in ordering:
Fair enough, that shaves off a few instructions. Even more impressive is pmaxud, which is apparently replacing a sequence of roughly pcmpgtd, pand, pandn, movdqa and por. Probably corresponds to that loop:

  for (s=0; s<256; s++) {
      Counting1[s] += Counting2[s] + Counting3[s] + Counting4[s];
      if (Counting1[s] > max) max = Counting1[s];
  }

But the containing function (HIST_count_parallel_wksp) with its manually-unrolled loop also seems like an obvious candidate for dynamic dispatch.

Aaron
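To make the dynamic dispatch idea concrete, here is a rough sketch; it is not zstd's code, and all names are hypothetical. The same merge-and-max loop is compiled twice, once for the file's baseline and once with `__attribute__((target("sse4.1")))` so the compiler may use pmaxud there, and the variant is chosen at run time. It assumes GCC or Clang on x86; elsewhere only the generic version is built. Whether the SSE4.1 copy is really vectorized depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_DISPATCH 1
#endif

/* Baseline version: adds three tables into c1 and returns the maximum. */
static unsigned int merge_max_generic(unsigned int *c1, const unsigned int *c2,
                                      const unsigned int *c3,
                                      const unsigned int *c4, size_t n)
{
    unsigned int max = 0;
    for (size_t s = 0; s < n; s++) {
        c1[s] += c2[s] + c3[s] + c4[s];
        if (c1[s] > max) max = c1[s];
    }
    return max;
}

#ifdef HAVE_X86_DISPATCH
/* Same loop, but this function alone may use SSE4.1 instructions. */
__attribute__((target("sse4.1")))
static unsigned int merge_max_sse41(unsigned int *c1, const unsigned int *c2,
                                    const unsigned int *c3,
                                    const unsigned int *c4, size_t n)
{
    unsigned int max = 0;
    for (size_t s = 0; s < n; s++) {
        c1[s] += c2[s] + c3[s] + c4[s];
        if (c1[s] > max) max = c1[s];
    }
    return max;
}
#endif

/* Picks the SSE4.1 variant only if the running CPU reports support. */
unsigned int merge_counts_max(unsigned int *c1, const unsigned int *c2,
                              const unsigned int *c3, const unsigned int *c4,
                              size_t n)
{
    assert(c1 != NULL && c2 != NULL && c3 != NULL && c4 != NULL);
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1"))
        return merge_max_sse41(c1, c2, c3, c4, n);
#endif
    return merge_max_generic(c1, c2, c3, c4, n);
}
```

The point of this shape is that the binary as a whole can stay at the x86-64 baseline while one hot function still benefits on newer CPUs; a production version would typically resolve the choice once (for example through a function pointer or an ifunc) rather than on every call.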
On Thu, 29 Sep 2022, Aaron Puchert wrote:
Am 28.09.22 um 08:25 schrieb Bernhard M. Wiedemann:
On 28/09/2022 03.28, Aaron Puchert wrote:
Am 28.09.22 um 02:58 schrieb Bernhard M. Wiedemann:
On 28/09/2022 02.29, Aaron Puchert wrote:
I would love to see some kind of assembly diff for building the whole distribution with x86_64-v2, but this might be pretty hard. Perhaps Bernhard might be able to do this using the reproducible builds infrastructure.
I have seen software use -march=native and that always differed. In any non-trivial code, the compiler will find places where newer instructions can be used.
Of course on i586 we can expect -march=native to have a big impact, but x86_64 base already has SSE2 and doesn't use x87 anymore. Of course there's wider vectorization with AVX, but lots of code doesn't vectorize well.
I guess -march=xxx also implies -mtune=xxx, or at least changes scheduling decisions. So we don't only see newer instructions being used, we also see older instructions being shuffled around or replaced by different older instructions because newer hardware has different latency/throughput characteristics.
There is probably no way to reliably assess what changed due to new instructions being available.
Aaron
Thanks, that's interesting. Of course we don't know how hot these instructions are, but we can already see what has changed.
I created this with a new still-to-be-pushed version of my reproducibleopensuse tools with
c="-O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection" ; optflags1="$c -march=x86-64" optflags2="$c -march=x86-64-v2" rbk
The gcc-11 man-page says, that in the case of -march=x86-64* generic tuning is applied
True, but how do you explain the changes in /usr/lib64/libzstd.so.1.5.2 (disasm)? This doesn't seem to use any new instructions.
There are also workarounds for false register dependencies, and these would no longer be required if you know you have a newer chip.
Note that -mtune=generic is free to second-guess the set of CPUs to care about when newer HW capabilities are enabled so the expectation that without any new instructions used the generated code is the same for -march=x86-64 vs -march=x86-64-vN is false. It's not necessarily different, but "generic" tuning isn't a fully independent knob from the -march= setting.
and you can see a difference in instructions created, not just in ordering:
Fair enough, that shaves off a few instructions. Even more impressive is pmaxud, which is apparently replacing a sequence of roughly pcmpgtd, pand, pandn, movdqa and por. Probably corresponds to that loop:
  for (s=0; s<256; s++) {
      Counting1[s] += Counting2[s] + Counting3[s] + Counting4[s];
      if (Counting1[s] > max) max = Counting1[s];
  }
But the containing function (HIST_count_parallel_wksp) with its manually-unrolled loop also seems like an obvious candidate for dynamic dispatch.
Possibly. I'd say to truly identify worthwhile spots many users would need to do a lot of system-wide profiling (with all debuginfo installed!) and collecting the 1000 hottest pieces of code. Of course I bet that 90% will be in glibc and the kernel and the rest depends on what the user was doing during the profiling. It's way easier to just build everything with -v2 (or -v3) than to start bikeshedding on what _really_ matters. Richard.
On Wed, 28 Sep 2022, Bernhard M. Wiedemann wrote:
On 27/09/2022 12.55, Richard Biener wrote:
  %if %{suse_version} >= 1599
          --with-arch-32=x86-64-v2 \
  %else
          --with-arch-32=x86-64 \
  %endif
would be nicer, if this would not depend on suse_version directly, but use a macro set in prjconf. Then there can be ports projects setting it differently.
Hmm. Note this part of the spec file is already quite ugly :/ So we could parse %{optflags} and look for any -march and supply that to the configury (note --with-arch doesn't accept everything that -march= accepts ...) Richard.
On Thu, Sep 29, 2022 at 07:22:12AM +0000, Richard Biener wrote:
On Thu, 29 Sep 2022, Aaron Puchert wrote:
and you can see a difference in instructions created, not just in ordering:
Fair enough, that shaves off a few instructions. Even more impressive is pmaxud, which is apparently replacing a sequence of roughly pcmpgtd, pand, pandn, movdqa and por. Probably corresponds to that loop:
  for (s=0; s<256; s++) {
      Counting1[s] += Counting2[s] + Counting3[s] + Counting4[s];
      if (Counting1[s] > max) max = Counting1[s];
  }
But the containing function (HIST_count_parallel_wksp) with its manually-unrolled loop also seems like an obvious candidate for dynamic dispatch.
Possibly. I'd say to truly identify worthwhile spots many users would need to do a lot of system-wide profiling (with all debuginfo installed!) and collecting the 1000 hottest pieces of code. Of course I bet that 90% will be in glibc and the kernel and the rest depends on what the user was doing during the profiling.
It's way easier to just build everything with -v2 (or -v3) than to
Yes, and you get testing of code generation with these instructions on a large codebase. But other than that it is not clear that it provides a benefit. Skimming through the only performance numbers provided so far, it looks like -v3 would be mostly a win, and -v2 is not even a clear win. Because CPU manufacturers like to differentiate product lines by disabling features, -v3 excludes even some very recent hardware. So we are left with an option that excludes some older hardware which is at the end of its useful lifetime but which people are still using, and it is not clear that it will provide a performance gain.
start bikeshedding on what _really_ matters.
Without the numbers you cannot tell what matters. Making performance 'optimizations' without measuring the result is pure nonsense. And yes, it's much easier to determine that on the CPU I am using compiling this package with these options gives this benefit for this use case than it is to argue that enabling some random set of CPU features for the whole distro generally benefits the whole distro across all use cases and all CPU models that it still supports after the change. Thanks Michal
Am 29.09.22 um 09:22 schrieb Richard Biener:
Possibly. I'd say to truly identify worthwhile spots many users would need to do a lot of system-wide profiling (with all debuginfo installed!) and collecting the 1000 hottest pieces of code. Of course I bet that 90% will be in glibc and the kernel and the rest depends on what the user was doing during the profiling.
Debug info for the entire system is a lot. I wish we wouldn't strip .symtab, which is usually quite small and gets you half the way there. My biggest CPU users are probably Firefox, codecs and compilers. The former two are doing dispatch quite frequently to my knowledge, the latter are probably unimpressed by new features. (Ok, maybe bextr as Clang uses lots of bit fields.)
It's way easier to just build everything with -v2 (or -v3) than to start bikeshedding on what _really_ matters.
As long as turning a bunch of hardware into scrap metal is on the table I'm afraid we're way out of bikeshedding territory. I think it should at least be discussed to which extent we might use newer hardware features without pulling the nuclear option. Of course simply building the entire distribution with different levels is also an option, but it doesn't seem we've converged on that yet. Aaron
On Fri, Sep 30, 2022 at 01:28:36AM +0200, Aaron Puchert wrote:
Am 29.09.22 um 09:22 schrieb Richard Biener:
Possibly. I'd say to truly identify worthwhile spots many users would need to do a lot of system-wide profiling (with all debuginfo installed!) and collecting the 1000 hottest pieces of code. Of course I bet that 90% will be in glibc and the kernel and the rest depends on what the user was doing during the profiling.
Debug info for the entire system is a lot. I wish we wouldn't strip .symtab, which is usually quite small and gets you half the way there.
My biggest CPU users are probably Firefox, codecs and compilers. The former two are doing dispatch quite frequently to my knowledge, the latter are probably unimpressed by new features. (Ok, maybe bextr as Clang uses lots of bit fields.)
It's way easier to just build everything with -v2 (or -v3) than to start bikeshedding on what _really_ matters.
As long as turning a bunch of hardware into scrap metal is on the table I'm afraid we're way out of bikeshedding territory. I think it should at least be discussed to which extent we might use newer hardware features without pulling the nuclear option.
Probably using something like -march=x86-64-v2 -mno-sse4.2 to reflect actual hardware feature availability. That could have been done by people defining these 'architecture levels' already but they picked arbitrary feature sets, did not provide any performance data, noted that many CPUs lack some features, and moved on with their plan to define 'architecture levels'. Thanks Michal
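To see which of the level's features a given machine actually lacks, a run-time probe along the following lines could be used. It is a sketch under assumptions: it relies on the GCC/Clang builtin __builtin_cpu_supports, the function name missing_v2_features and the bit names are hypothetical, and it checks only the SSE3, SSSE3, SSE4.1, SSE4.2 and POPCNT parts of x86-64-v2, not CMPXCHG16B or LAHF/SAHF.

```c
/* One bit per probed feature; a set bit means the CPU does not report it. */
enum {
    MISSING_SSE3   = 1u << 0,
    MISSING_SSSE3  = 1u << 1,
    MISSING_SSE4_1 = 1u << 2,
    MISSING_SSE4_2 = 1u << 3,
    MISSING_POPCNT = 1u << 4
};

/* Return a bitmask of probed x86-64-v2 features the running CPU lacks.
 * On non-x86_64 targets, or with compilers lacking the builtin, nothing
 * is probed and 0 is returned. */
unsigned missing_v2_features(void)
{
    unsigned missing = 0;
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (!__builtin_cpu_supports("sse3"))
        missing |= MISSING_SSE3;
    if (!__builtin_cpu_supports("ssse3"))
        missing |= MISSING_SSSE3;
    if (!__builtin_cpu_supports("sse4.1"))
        missing |= MISSING_SSE4_1;
    if (!__builtin_cpu_supports("sse4.2"))
        missing |= MISSING_SSE4_2;
    if (!__builtin_cpu_supports("popcnt"))
        missing |= MISSING_POPCNT;
#endif
    return missing;
}
```

A result with only MISSING_SSE4_2 set, for example, would describe a CPU for which a reduced flag set like the one suggested above still fits. The probe itself must of course be built for the baseline architecture, otherwise it could hit SIGILL before reporting anything.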
participants (7)
- Aaron Puchert
- Bernhard M. Wiedemann
- Jan Engelhardt
- Lubos Kocman
- Michal Suchánek
- Richard Biener
- Stefan Seyfried