Re: which configuration changes for x86_64-v2+ necessary?

29 Sep 2022

On Thu, 29 Sep 2022, Aaron Puchert wrote:
...
On 28.09.22 at 08:25, Bernhard M. Wiedemann wrote:
...
On 28/09/2022 03.28, Aaron Puchert wrote:
...
On 28.09.22 at 02:58, Bernhard M. Wiedemann wrote:
...
On 28/09/2022 02.29, Aaron Puchert wrote:
...
I would love to see some kind of assembly diff for building the whole
distribution with x86_64-v2, but this might be pretty hard. Perhaps
Bernhard might be able to do this using the reproducible builds
infrastructure.
I have seen software use -march=native and that always differed. In any
non-trivial code, the compiler will find places where newer instructions
can be used.
Of course on i586 we can expect -march=native to have a big impact, but
x86_64 base already has SSE2 and doesn't use x87 anymore. Of course there's
wider vectorization with AVX, but lots of code doesn't vectorize well.
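To illustrate where the baseline and x86-64-v2 can differ even without vectorization: POPCNT is part of x86-64-v2 but not of the x86-64 baseline, so the compiler can only turn a popcount builtin into a single instruction when built with -march=x86-64-v2 or higher. The sketch below is not from this thread. The function names are made up for illustration, and the macro checks cover only some of the features that define each level.

```c
#include <stdint.h>

/* Hypothetical helper: number of set bits in x.
 * With -march=x86-64 GCC/Clang cannot assume POPCNT and fall back to
 * a bit-twiddling sequence or a libgcc call; with -march=x86-64-v2
 * they may emit a single popcnt instruction. */
unsigned count_bits(uint64_t x)
{
#if defined(__GNUC__)
    return (unsigned)__builtin_popcountll(x);
#else
    /* Portable fallback: clear the lowest set bit until none remain. */
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
#endif
}

/* Hypothetical helper: reports which level the compiler was allowed
 * to assume, judged from predefined macros. This is only a partial
 * check, not a complete test of the level definitions. */
const char *compiled_isa_hint(void)
{
#if defined(__AVX2__) && defined(__BMI2__) && defined(__FMA__)
    return "x86-64-v3 or newer (partial check)";
#elif defined(__SSE4_2__) && defined(__POPCNT__) && defined(__SSSE3__)
    return "x86-64-v2 (partial check)";
#elif defined(__x86_64__)
    return "x86-64 baseline";
#else
    return "not x86-64";
#endif
}
```

Building this once with -march=x86-64 and once with -march=x86-64-v2 and comparing the disassembly of count_bits should show the kind of difference discussed here.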
I guess -march=xxx also implies -mtune=xxx, or at least changes scheduling
decisions. So we don't only see newer instructions being used, we also see
older instructions being shuffled around or replaced by different older
instructions because newer hardware has different latency/throughput
characteristics.
There is probably no way to reliably assess what changed due to new
instructions being available.
Aaron
https://rb.zq1.de/temp/zstd-compare.out
Thanks, that's interesting. Of course we don't know how hot these instructions
are, but we can already see what has changed.
...
I created this with a new, still-to-be-pushed version of my
reproducibleopensuse tools, using
    c="-O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection"
    optflags1="$c -march=x86-64" optflags2="$c -march=x86-64-v2" rbk
The gcc-11 man page says that for -march=x86-64*, generic tuning
is applied.
True, but how do you explain the changes in /usr/lib64/libzstd.so.1.5.2
(disasm)? This doesn't seem to use any new instructions.
There are also workarounds for false register dependencies, and these would no
longer be required if you know you have a newer chip.
Note that -mtune=generic is free to second-guess the set of CPUs to care
about when newer HW capabilities are enabled, so the expectation that the
generated code stays the same for -march=x86-64 vs. -march=x86-64-vN as
long as no new instructions are used is false.  It's not necessarily
different, but "generic" tuning isn't a fully independent knob from
the -march= setting.
...
and you can see a difference in instructions created, not just in ordering:
Fair enough, that shaves off a few instructions. Even more impressive is
pmaxud, which is apparently replacing a sequence of roughly pcmpgtd, pand,
pandn, movdqa and por. This probably corresponds to the following loop:
    for (s=0; s<256; s++) {
        Counting1[s] += Counting2[s] + Counting3[s] + Counting4[s];
        if (Counting1[s] > max) max = Counting1[s];
    }
But the containing function (HIST_count_parallel_wksp) with its
manually-unrolled loop also seems like an obvious candidate for dynamic
dispatch.
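A minimal sketch of what such dynamic dispatch could look like with GCC or Clang, assuming the counters are unsigned 32-bit values. This is not zstd's code: the function names and the HIST_SYMBOLS constant are hypothetical, and only the merge-and-max loop is shown, not the whole of HIST_count_parallel_wksp. Whether the compiler really picks pmaxud for the SSE4.1 variant depends on the optimization level and compiler version.

```c
#include <stddef.h>

#define HIST_SYMBOLS 256  /* hypothetical name for the table size */

/* Baseline build of the merge-and-max loop. */
static unsigned merge_max_generic(unsigned *c1, const unsigned *c2,
                                  const unsigned *c3, const unsigned *c4)
{
    unsigned max = 0;
    for (size_t s = 0; s < HIST_SYMBOLS; s++) {
        c1[s] += c2[s] + c3[s] + c4[s];
        if (c1[s] > max) max = c1[s];
    }
    return max;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same source, but the compiler is allowed to use SSE4.1 here,
 * so the vectorizer may choose pmaxud for the maximum. */
__attribute__((target("sse4.1")))
static unsigned merge_max_sse41(unsigned *c1, const unsigned *c2,
                                const unsigned *c3, const unsigned *c4)
{
    unsigned max = 0;
    for (size_t s = 0; s < HIST_SYMBOLS; s++) {
        c1[s] += c2[s] + c3[s] + c4[s];
        if (c1[s] > max) max = c1[s];
    }
    return max;
}
#endif

/* Picks the variant at run time based on what the CPU supports. */
unsigned merge_max(unsigned *c1, const unsigned *c2,
                   const unsigned *c3, const unsigned *c4)
{
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("sse4.1"))
        return merge_max_sse41(c1, c2, c3, c4);
#endif
    return merge_max_generic(c1, c2, c3, c4);
}
```

The run-time check costs a branch per call, which is why this only pays off for functions that do a substantial amount of work per invocation.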
Possibly.  I'd say to truly identify worthwhile spots many users would
need to do a lot of system-wide profiling (with all debuginfo installed!)
and collecting the 1000 hottest pieces of code.  Of course I bet that
90% will be in glibc and the kernel and the rest depends on what the
user was doing during the profiling.

It's way easier to just build everything with -v2 (or -v3) than to
start bikeshedding on what _really_ matters.

Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)