Update on adding additional x86_64-vX subarchitecture levels via glibc hwcaps
Hi all,

just as an update before the end-of-year holidays. Michael Schroeder, Fabian Vogt and I have been finishing off all the missing pieces of implementation work needed to transparently use the build service to provide glibc-hwcaps overlays of shared libraries optimized for higher architecture levels than baseline. While the original intention was to implement the proposal https://en.opensuse.org/openSUSE:X86-64-Architecture-Levels#Available_option... which addresses an x86_64-specific need, all of the implemented pieces are reusable on other architectures with similar needs (s390x comes to mind, for example).

What needs to be done per package? By default, nothing. The x86_64-vX builds (I suggest targeting -v3, as it offers a good balance of performance improvement and hardware compatibility compared to -v4 or -v2) will be limited to those packages that opt into the glibc-hwcaps feature. If a package opts in (which probably needs some analysis, as we have also seen no increase or even a slight performance decrease), the only thing needed is to add one extra line to the spec file:

%{?suse_build_hwcaps_libs}

It only works if the package builds a shared library following the shared library policy as a subpackage and provides a `baselibs.conf` file that identifies that subpackage by name.
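As a concrete sketch of what the opt-in could look like (the subpackage name `libfoo1` is hypothetical, and the macro expands to nothing until it is defined):

```
# spec file of a package with a shared-library subpackage
# following the shared library policy (libfoo1 is hypothetical):
%{?suse_build_hwcaps_libs}

# baselibs.conf, naming that subpackage:
libfoo1
```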
See https://en.opensuse.org/openSUSE:Packaging_Conventions_RPM_Macros#%{?suse_build_hwcaps_libs} for details.

Currently this macro is undefined, as it relies on a number of not-yet-upstream-accepted changes:

* https://github.com/rpm-software-management/rpm/pull/2315
* https://github.com/openSUSE/obs-build/pull/904
* https://github.com/openSUSE/obs-build/pull/907
* https://github.com/openSUSE/post-build-checks/pull/54
* https://build.opensuse.org/package/rdiff/Base:System/filesystem?opackage=filesystem&oproject=openSUSE%3AFactory&rev=233
* optional: a yet-undetermined way of giving the user the ability to "opt into the optimized libraries" (the current suggestion is a specific Provides in the system, supplied by the patterns package, but it could also be done via some zypper-specific feature).

Once all those changes are in, we can start implementing this in Tumbleweed, if there is consensus on the approach. Unlikely to still happen in 2022.

In my testing, it appears beneficial to enable this for most of the dependencies of ImageMagick, the web-browser dependencies, and also the toolkit libraries (gtk, glib, qt etc.) for an improved desktop experience. In total that would be around 50 packages to submit with the additional macro, but I don't yet have a very scientific way of determining whether it is worth it.

Happy to hear your input or feedback.

Greetings, Dirk
Hwcaps wastes disk space for a niche use case (moving an installation frequently between machines with different CPUs), and IIRC it is currently broken because glibc and GCC have different ideas of what a given microarchitecture level corresponds to.

Also, baselibs.conf should have died long ago. The right way to do this is what glibc does with i586/i686: that way not only libraries would benefit, but also language runtimes that have a lot of code in the main executable.
Hi Bruno, On Wed, Dec 21, 2022 at 17:01, Bruno Pitrus <brunopitrus@hotmail.com> wrote:
Hwcaps wastes disk space for a niche use case (moving an installation frequently between machines with different CPUs)
only for the case where you want the performance increase of going with a newer microarchitecture, for the libraries that you select to install, and where the recompiled library is actually worth it.
and IIRC is currently broken bc glibc and GCC have different ideas of what a given microarch corresponds to.
A fix for that has been submitted by Fabian as far as I know.
the right way to do this is what glibc does with i586/i686.
Are you referring to IFUNCs, i.e. per-function differently-compiled code that resolves to a different implementation at runtime? Otherwise, please elaborate.

While custom IFUNCs are a nicer solution, they also require either a compiler that can produce them automatically (AFAIK no open source implementation of that exists) or a lot of manual implementation work to get upstream. They can still be mixed with this approach.

Greetings, Dirk
Look at the glibc package. It is compiled twice on ix86: once for i586 and once for i686 (which can use the CMOV instruction). These are distinct RPM architectures. If you install the i686 build, it replaces the i586 one. (Probably at least some packages would benefit from a pentium4 build, but the detection of SSE2 in RPM has been broken for 20 years.)
I maintain the nodejs-electron package. I guess it could benefit from a v2/v3 build, which the hwcaps solution would not allow. This software's upstream claims that merely allowing SSE3 (which isn't even the full x86-64-v2 feature set) produces a significant speed boost, and they have dropped support for baseline x86-64 in their official builds.
Hi, On Wednesday, December 21, 2022 at 17:37:07 CET, Dirk Müller wrote:
Hi Bruno,
On Wed, Dec 21, 2022 at 17:01, Bruno Pitrus <brunopitrus@hotmail.com> wrote:
Hwcaps wastes disk space for a niche use case (moving an installation frequently between machines with different CPUs)
only for the case where you want the performance increase of going with a newer microarchitecture, for the libraries that you select to install, and where the recompiled library is actually worth it.
The question is what's more niche, moving installations or hwcaps being worth it? I think we might end up needing both hwcaps libs + -vX specific packages. Fortunately there's overlap between building packages for hwcaps and arch-level specific packages, so we can experiment around. Dirk is doing tests beyond just libraries as well.
and IIRC is currently broken bc glibc and GCC have different ideas of what a given microarch corresponds to.
A fix for that has been submitted by Fabian as far as I know.
The main divergence was missing BMI(2) in glibc, which is fixed in master meanwhile (2.37+, not in TW yet). I thought that OSXSAVE was also missing, but that's actually already checked, just implicitly in a different place.
the right way to do this is what glibc does with i586/i686.
Are you referring to IFUNCs, i.e. per-function differently-compiled code that resolves to a different implementation at runtime? Otherwise, please elaborate.
While custom IFUNCs are a nicer solution, they also require either a compiler that can produce them automatically (AFAIK no open source implementation of that exists) or a lot of manual implementation work to get upstream. They can still be mixed with this approach.
There's the target_clones attribute, which handles clones and dispatching automatically. It's supported by GCC as well as Clang:

__attribute__((target_clones("default", "avx", "avx512f")))
float sum(float a[], int c)
{
    float ret = 0;
    for (int i = 0; i < c; ++i)
        ret += a[i];
    return ret;
}

(for best results build with -ffast-math)

Cheers, Fabian
Greetings, Dirk
Hi Fabian, On Thu, Dec 22, 2022 at 15:12, Fabian Vogt <fvogt@suse.de> wrote:
The question is what's more niche, moving installations or hwcaps being worth it? I think we might end up needing both hwcaps libs + -vX specific packages.
Agreed, we can always choose to provide both; it's just not yet very well supported in the software stacks, which is something hwcaps neatly avoids. The main issue is that the rpm architecture is slightly ambiguous now: in ALP, x86_64 is actually x86_64_v2, while in Tumbleweed it is baseline (and in the future there would be an x86_64_v3). Fun.
There's the target_clones attribute which handles clones and dispatching automatically. It's supported by GCC as well as Clang:
__attribute__((target_clones("default", "avx", "avx512f"))) float sum(float a[], int c) {
That isn't "automatic" for me. I meant without modifying the upstream/source code with these annotations (where a human has to sit down and reason about which functions would likely benefit from them). Automatic would mean the compiler is clever enough (via heuristics) to identify which functions should be annotated and does so without modifying the source. Most likely this will have to be profile-guided though.

How this could work: first compile the sources with profile guiding, twice, once with -march=x86-64 and once with -march=x86-64-v3, using the extra flags "-ftree-vectorize -fopt-info-vec", and store away the resulting output. This is a debug print of the compiler that shows how/where it saw vectorizing opportunities:

inflate.c:76:20: optimized: basic block part vectorized using 32 byte vectors
inflate.c:76:20: optimized: basic block part vectorized using 32 byte vectors
inflate.c:76:20: optimized: basic block part vectorized using 32 byte vectors

In this case the vectorizing printout says "16 byte vectors" for the -march=x86-64 build, indicating that it did find an architecture-sublevel-specific optimization opportunity. If many of those prints are in the same source function, and that function is considered hot (based on profiling feedback), then the compiler would be able to apply target cloning automatically.

As you can see, that is quite intensive when done this way (profile guiding, which compiles twice, and then compiling twice for two architectures). It does not seem worthwhile to invest that kind of effort for most of the distro code.

Greetings, Dirk
I tried to build a native x86_64_v3 package on OBS (as opposed to a hwcaps package generated by baselibs). Setting `BuildArch: x86_64_v3` fails with "no compatible architectures", while setting `#!BuildTarget: x86_64_v3-linux` makes the build fail near the end at "installing all built rpms", claiming the package is built for a different architecture. That's despite ld.so claiming the build host supports v3. x86_64_v2 does not have any of the above problems; it fails later due to a bug in this script, which tries to remove a non-empty directory: https://github.com/openSUSE/post-build-checks/blob/master/post-mkbaselibs-ch...
participants (3)

- Bruno Pitrus
- Dirk Müller
- Fabian Vogt