Hi Fabian,

On Thu, 22 Dec 2022 at 15:12, Fabian Vogt <fvogt@suse.de> wrote:
> The question is which is more niche: moving installations, or hwcaps being worth it? I think we might end up needing both hwcaps libs + -vX specific packages.
Agreed, we can always choose to provide both; it's just not very well supported in the software stacks yet, which is something that hwcaps neatly avoids. The main issue is that the RPM architecture is slightly ambiguous now. In ALP, x86_64 is actually x86_64_v2, while in Tumbleweed it is baseline (and in the future there would be an x86_64_v3). Fun.
> There's the target_clones attribute, which handles clones and dispatching automatically. It's supported by GCC as well as Clang:
>
>   __attribute__((target_clones("default", "avx", "avx512f"))) float sum(float a[], int c) {
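For reference, a complete, compilable version of such a function might look like the sketch below. Only the attribute and the signature come from the snippet above; the loop body and the MULTIVERSION guard macro are assumptions added for illustration. Note that a float reduction like this is typically only vectorized when the compiler is allowed to reorder floating-point additions (for example with -ffast-math), so the clones may differ less than one would hope at plain -O2.

```c
/*
 * MULTIVERSION is a hypothetical helper macro (not part of the quoted
 * snippet). It only applies target_clones where the attribute is expected
 * to work: GCC or Clang on x86-64 Linux. Elsewhere it expands to nothing
 * and a single baseline version of the function is built.
 */
#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
#define MULTIVERSION \
    __attribute__((target_clones("default", "avx", "avx512f")))
#else
#define MULTIVERSION
#endif

/*
 * The compiler emits one clone per listed target plus a resolver that
 * selects among them at run time based on the CPU's features.
 * The body is an assumed completion of the truncated snippet: a plain
 * sum over the first c elements of a.
 */
MULTIVERSION
float sum(float a[], int c)
{
    float total = 0.0f;

    for (int i = 0; i < c; i++)
        total += a[i];

    return total;
}
```

Callers simply call sum(a, n) and never see the clones. The annotation still has to be added by hand to each function that is expected to benefit.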
That isn't "automatic" for me. I meant without modifying the upstream source code with these annotations (where a human has to sit down and reason about which functions would likely benefit from them). "Automatic" would mean the compiler is clever enough (via heuristics) to identify which functions should be annotated and to clone them without any change to the source. Most likely this would have to be profile-guided, though.

It could work like this: first, compile the sources with profile guidance, twice: once with -march=x86-64 and once with -march=x86-64-v3, each time adding the extra flags "-ftree-vectorize -fopt-info-vec", and store the resulting output. That output is a debug print from the compiler showing where and how it found vectorizing opportunities. For the -march=x86-64-v3 build:

  inflate.c:76:20: optimized: basic block part vectorized using 32 byte vectors
  inflate.c:76:20: optimized: basic block part vectorized using 32 byte vectors
  inflate.c:76:20: optimized: basic block part vectorized using 32 byte vectors

In this case the printout for the -march=x86-64 build says "16 byte vectors" instead, indicating that the compiler did find an optimisation opportunity specific to the architecture sublevel. If many of those prints fall within the same source function, and that function is considered hot (based on profiling feedback), then the compiler would be able to apply target_clones to it automatically.

As you can see, that is quite expensive when done that way: profile guiding already means compiling twice, and that is then doubled for the two architecture levels. It does not seem worthwhile to invest that kind of effort for most of the distro code.

Greetings,
Dirk