
On Thu, Sep 29, 2022 at 07:22:12AM +0000, Richard Biener wrote:
On Thu, 29 Sep 2022, Aaron Puchert wrote:
and you can see a difference in instructions created, not just in ordering:
Fair enough, that shaves off a few instructions. Even more impressive is pmaxud, which is apparently replacing a sequence of roughly pcmpgtd, pand, pandn, movdqa and por. Probably corresponds to that loop:
    for (s=0; s<256; s++) {
        Counting1[s] += Counting2[s] + Counting3[s] + Counting4[s];
        if (Counting1[s] > max) max = Counting1[s];
    }
But the containing function (HIST_count_parallel_wksp) with its manually-unrolled loop also seems like an obvious candidate for dynamic dispatch.
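For illustration, a minimal sketch of what dynamic dispatch for such a loop could look like with GCC or Clang on x86. This is not the zstd code: the function names (merge_max, merge_max_generic, merge_max_sse41) and the signature are hypothetical, and the assumption is that the compiler auto-vectorizes the SSE4.1 variant (for example with pmaxud) at a suitable optimization level. On other architectures or compilers the sketch simply falls back to the generic version.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline variant: compiled for whatever the default -march is. */
static uint32_t merge_max_generic(uint32_t *c1, const uint32_t *c2,
                                  const uint32_t *c3, const uint32_t *c4,
                                  size_t n)
{
    uint32_t max = 0;
    for (size_t s = 0; s < n; s++) {
        c1[s] += c2[s] + c3[s] + c4[s];
        if (c1[s] > max) max = c1[s];
    }
    return max;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_SSE41_VARIANT 1
/* Same source, but the compiler may use SSE4.1 instructions here
 * (assumption: the optimizer picks pmaxud for the max reduction). */
__attribute__((target("sse4.1")))
static uint32_t merge_max_sse41(uint32_t *c1, const uint32_t *c2,
                                const uint32_t *c3, const uint32_t *c4,
                                size_t n)
{
    uint32_t max = 0;
    for (size_t s = 0; s < n; s++) {
        c1[s] += c2[s] + c3[s] + c4[s];
        if (c1[s] > max) max = c1[s];
    }
    return max;
}
#endif

/* Dispatcher: picks a variant at run time based on the CPU in use. */
static uint32_t merge_max(uint32_t *c1, const uint32_t *c2,
                          const uint32_t *c3, const uint32_t *c4,
                          size_t n)
{
#ifdef HAVE_SSE41_VARIANT
    if (__builtin_cpu_supports("sse4.1"))
        return merge_max_sse41(c1, c2, c3, c4, n);
#endif
    return merge_max_generic(c1, c2, c3, c4, n);
}
```

Both variants compute the same result, so the baseline build keeps running on older hardware while newer CPUs take the faster path. Whether the per-call check pays off for a 256-entry table is something that would still have to be measured.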
Possibly. I'd say to truly identify worthwhile spots many users would need to do a lot of system-wide profiling (with all debuginfo installed!) and collecting the 1000 hottest pieces of code. Of course I bet that 90% will be in glibc and the kernel and the rest depends on what the user was doing during the profiling.
It's way easier to just build everything with -v2 (or -v3) than to
Yes, and you get testing of code generation with these instructions on a large codebase. But other than that it is not clear that it provides a benefit. Skimming through the only performance numbers provided so far, it looks like -v3 would be mostly a win, and -v2 is not even a clear win. Because CPU manufacturers like to differentiate product lines by disabling features, -v3 excludes even some very recent hardware. So we are left with -v2, an option that excludes some older hardware that is at the end of its useful lifetime but that people are still using, and it is not clear it will provide a performance gain.
start bikeshedding on what _really_ matters.
Without the numbers you cannot tell what matters. Making performance 'optimizations' without measuring the result is pure nonsense. And yes, it's much easier to determine that, on the CPU I am using, compiling this package with these options gives this benefit for this use case than it is to argue that enabling some random set of CPU features for the whole distro generally benefits the whole distro across all use cases and all CPU models that it still supports after the change.

Thanks

Michal