On Thu, 29 Sep 2022, Aaron Puchert wrote:
On 28.09.22 at 08:25, Bernhard M. Wiedemann wrote:
On 28/09/2022 03.28, Aaron Puchert wrote:
On 28.09.22 at 02:58, Bernhard M. Wiedemann wrote:
On 28/09/2022 02.29, Aaron Puchert wrote:
I would love to see some kind of assembly diff for building the whole distribution with x86_64-v2, but this might be pretty hard. Perhaps Bernhard might be able to do this using the reproducible builds infrastructure.
I have seen software use -march=native and that always differed. In any non-trivial code, the compiler will find places where newer instructions can be used.
Of course on i586 we can expect -march=native to have a big impact, but x86_64 base already has SSE2 and doesn't use x87 anymore. Of course there's wider vectorization with AVX, but lots of code doesn't vectorize well.
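As a rough illustration of what the baseline and x86-64-v2 differ in at compile time, the sketch below reports which of the usual predefined macros the compiler has set. The helper name report_v2_macros and the HAS_* defines are made up for this example, and the list covers only a subset of the v2 level (CMPXCHG16B and LAHF/SAHF are left out).

```c
#include <stdio.h>

/* Map each predefined compiler macro to a 0/1 constant (HAS_* names are
   hypothetical, introduced only for this sketch). */
#ifdef __SSE3__
#  define HAS_SSE3 1
#else
#  define HAS_SSE3 0
#endif
#ifdef __SSSE3__
#  define HAS_SSSE3 1
#else
#  define HAS_SSSE3 0
#endif
#ifdef __SSE4_1__
#  define HAS_SSE4_1 1
#else
#  define HAS_SSE4_1 0
#endif
#ifdef __SSE4_2__
#  define HAS_SSE4_2 1
#else
#  define HAS_SSE4_2 0
#endif
#ifdef __POPCNT__
#  define HAS_POPCNT 1
#else
#  define HAS_POPCNT 0
#endif

/* Print one line per extension and return how many are enabled for this
   translation unit. */
int report_v2_macros(FILE *out)
{
    static const struct { const char *name; int enabled; } flags[] = {
        { "SSE3",   HAS_SSE3 },
        { "SSSE3",  HAS_SSSE3 },
        { "SSE4.1", HAS_SSE4_1 },
        { "SSE4.2", HAS_SSE4_2 },
        { "POPCNT", HAS_POPCNT },
    };
    int count = 0;
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        fprintf(out, "%-7s %s\n", flags[i].name, flags[i].enabled ? "yes" : "no");
        count += flags[i].enabled;
    }
    return count;
}
```

Calling report_v2_macros(stdout) from a file built once with -march=x86-64 and once with -march=x86-64-v2 should show the difference in what the compiler is allowed to use. It says nothing about what it actually emits.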
I guess -march=xxx also implies -mtune=xxx, or at least changes scheduling decisions. So we don't only see newer instructions being used, we also see older instructions being shuffled around or replaced by different older instructions because newer hardware has different latency/throughput characteristics.
There is probably no way to reliably assess what changed due to new instructions being available.
Aaron
Thanks, that's interesting. Of course we don't know how hot these instructions are, but we can already see what has changed.
I created this with a new still-to-be-pushed version of my reproducibleopensuse tools with
c="-O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection" ; optflags1="$c -march=x86-64" optflags2="$c -march=x86-64-v2" rbk
The gcc-11 man page says that generic tuning is applied in the case of -march=x86-64*.
True, but how do you explain the changes in /usr/lib64/libzstd.so.1.5.2 (disasm)? This doesn't seem to use any new instructions.
There are also workarounds for false register dependencies, and these would no longer be required if you know you have a newer chip.
Note that -mtune=generic is free to second-guess the set of CPUs to care about when newer HW capabilities are enabled. So the expectation that, as long as no new instructions are used, the generated code is the same for -march=x86-64 and -march=x86-64-vN is false. It's not necessarily different, but "generic" tuning isn't a fully independent knob from the -march= setting.
and you can see a difference in instructions created, not just in ordering:
Fair enough, that shaves off a few instructions. Even more impressive is pmaxud, which is apparently replacing a sequence of roughly pcmpgtd, pand, pandn, movdqa and por. Probably corresponds to that loop:
    for (s=0; s<256; s++) {
        Counting1[s] += Counting2[s] + Counting3[s] + Counting4[s];
        if (Counting1[s] > max) max = Counting1[s];
    }
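To make the instruction-count difference a bit more concrete, here is a per-lane model in portable C of the two ways to get an unsigned 32-bit maximum. It is only a sketch: the function names are invented for illustration, the loop is a simplified stand-in for the one above, and the bias step reflects the assumption that pcmpgtd compares signed values, so an unsigned comparison needs the sign bits flipped first.

```c
#include <stddef.h>
#include <stdint.h>

/* Models the SSE2-only sequence for one lane: compare, then blend with
   and / and-not / or. The cast to int32_t relies on two's-complement
   conversion, which GCC and Clang provide. */
static uint32_t max_u32_blend(uint32_t a, uint32_t b)
{
    int32_t sa = (int32_t)(a ^ 0x80000000u);        /* bias to signed order */
    int32_t sb = (int32_t)(b ^ 0x80000000u);
    uint32_t mask = (sa > sb) ? 0xFFFFFFFFu : 0u;   /* like pcmpgtd */
    return (a & mask) | (b & ~mask);                /* like pand, pandn, por */
}

/* Models what a single pmaxud (SSE4.1) does for one lane. */
static uint32_t max_u32_direct(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

typedef uint32_t (*max_fn)(uint32_t, uint32_t);

/* Simplified stand-in for the histogram merge: sum four count arrays into
   c1 and return the largest merged count. */
static uint32_t merge_counts(uint32_t *c1, const uint32_t *c2,
                             const uint32_t *c3, const uint32_t *c4,
                             size_t n, max_fn pick_max)
{
    uint32_t max = 0;
    for (size_t s = 0; s < n; s++) {
        c1[s] += c2[s] + c3[s] + c4[s];
        max = pick_max(max, c1[s]);
    }
    return max;
}
```

Both variants return the same result. The point is only that the blend version needs several operations per vector where SSE4.1 needs one.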
But the containing function (HIST_count_parallel_wksp) with its manually-unrolled loop also seems like an obvious candidate for dynamic dispatch.
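For what such dynamic dispatch could look like, here is a minimal sketch using the GCC/Clang target attribute and __builtin_cpu_supports. It is not how zstd does it; all function names are hypothetical, the loop is a simplified stand-in for the real function, and on non-x86 or non-GNU compilers it just falls back to the baseline version.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && defined(__x86_64__)
#  define HAVE_X86_DISPATCH 1
#else
#  define HAVE_X86_DISPATCH 0
#endif

/* Shared loop body; inlined into each variant so it gets compiled with
   that variant's instruction set. */
static inline uint32_t merge_body(uint32_t *c1, const uint32_t *c2,
                                  const uint32_t *c3, const uint32_t *c4,
                                  size_t n)
{
    uint32_t max = 0;
    for (size_t s = 0; s < n; s++) {
        c1[s] += c2[s] + c3[s] + c4[s];
        if (c1[s] > max) max = c1[s];
    }
    return max;
}

/* Built with whatever -march the rest of the file uses. */
static uint32_t merge_baseline(uint32_t *c1, const uint32_t *c2,
                               const uint32_t *c3, const uint32_t *c4,
                               size_t n)
{
    return merge_body(c1, c2, c3, c4, n);
}

#if HAVE_X86_DISPATCH
/* Same code, but the compiler may use SSE4.1 (e.g. pmaxud) in here. */
__attribute__((target("sse4.1")))
static uint32_t merge_sse41(uint32_t *c1, const uint32_t *c2,
                            const uint32_t *c3, const uint32_t *c4,
                            size_t n)
{
    return merge_body(c1, c2, c3, c4, n);
}
#endif

typedef uint32_t (*merge_fn)(uint32_t *, const uint32_t *,
                             const uint32_t *, const uint32_t *, size_t);

/* Pick an implementation based on what the running CPU reports. */
static merge_fn select_merge(void)
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1"))
        return merge_sse41;
#endif
    return merge_baseline;
}

/* Entry point: resolves the implementation on first use.
   Not thread-safe as written; a real version would need to handle that. */
uint32_t merge_counts_dispatch(uint32_t *c1, const uint32_t *c2,
                               const uint32_t *c3, const uint32_t *c4,
                               size_t n)
{
    static merge_fn impl;
    if (!impl)
        impl = select_merge();
    return impl(c1, c2, c3, c4, n);
}
```

Whether the SSE4.1 variant really ends up with pmaxud depends on the optimization level and on the helper being inlined, so the disassembly would have to be checked. The indirect call also has a cost, which is part of why this only pays off for hot, larger functions.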
Possibly. I'd say that to truly identify worthwhile spots, many users would need to do a lot of system-wide profiling (with all debuginfo installed!) and collect the 1000 hottest pieces of code. Of course I bet that 90% will be in glibc and the kernel, and the rest depends on what the user was doing during the profiling. It's way easier to just build everything with -v2 (or -v3) than to start bikeshedding on what _really_ matters.

Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)