[opensuse-factory] compiler options and boinc
LS,

I noticed that when I compile boinc, it is always around 4x slower on the CPU benchmarks than the version distributed by Tumbleweed. I do use '-O3 -funroll-loops -ffast-math' as specified on the boinc website. So, I wonder what options I might be missing. Or is the binary distributed by TW custom optimized?

--- Frans.

-- To unsubscribe, e-mail: opensuse-factory+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-factory+owner@opensuse.org
On Saturday 2020-03-07 23:16, Frans de Boer wrote:
I noticed that when I compile boinc, it is always around 4x slower on the CPU benchmarks than the version distributed by Tumbleweed. I do use '-O3 -funroll-loops -ffast-math' as specified on the boinc website.
Did somebody blindly grab something from the manpage without understanding - and conveying to subsequent readers - the implications?

-funroll-loops: "This option makes code larger, and may or may not make it run faster."

(It does not say so explicitly, but "larger" leads to "slower" at a certain point.)
On 2020-03-08 18:24, Jan Engelhardt wrote:
On Saturday 2020-03-07 23:16, Frans de Boer wrote:
I noticed that when I compile boinc, it is always around 4x slower on the CPU benchmarks than the version distributed by Tumbleweed. I do use '-O3 -funroll-loops -ffast-math' as specified on the boinc website.

Did somebody blindly grab something from the manpage without understanding - and conveying to subsequent readers - the implications?
-funroll-loops: "This option makes code larger, and may or may not make it run faster."
(It does not say it explicitly, but "larger" leads to "slower" at a certain point.)
Yes, past a certain point the benefit of sequential code can turn negative, for instance when the code no longer fits in the cache and requires additional memory fetches.

And no, it was not just a man page, but a page on the boinc site itself, as well as discussions on various fora. Note that -ffast-math is even inserted automatically by boinc's own build system. Apparently, they do not care so much about precision. Then again, it only affects boinc itself, not the external tasks.

But to be sure, I tested various configurations and found that neither option makes a difference to the issue. In fact, after experimenting with various other benchmark tools, I see a difference of approx. 7% between two systems. Using the CPU benchmark from the TW boinc distribution, the difference on one machine is 3.8x (AMD) and on the other 6.3x (Intel) compared with the self-compiled programs. So, what did the openSUSE team do to get these benchmark results, which do not reflect the real-world results?

--- Frans.
Am 07.03.20 um 23:16 schrieb Frans de Boer:
LS,
I noticed that when I compile boinc, it is always around 4x slower on the CPU benchmarks than the version distributed by Tumbleweed. I do use '-O3 -funroll-loops -ffast-math' as specified on the boinc website.
The Tumbleweed build seems to use CFLAGS="-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -flto=auto -ffat-lto-objects -g -g -W -pipe -fno-strict-aliasing -D_REENTRANT"

So building with just "-O2" and no other "optimizations" might be worth trying. If this is as good as the TW version, then I'd check whether "-O3" improves anything.

-- Stefan Seyfried

"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled." -- Richard Feynman
I noticed that when I compile boinc, it is always around 4x slower on the CPU benchmarks than the version distributed by Tumbleweed. I do use '-O3 -funroll-loops -ffast-math' as specified on the boinc website.
So, I wonder what options I might be missing. Or is the binary distributed by TW custom optimized?
You can inspect the package sources, there is nothing unusual. [1]

The obvious suspect would be link-time optimization (-flto=auto), which is enabled in TW by default. My suspicion: the test for integer operations [2] is split among several functions that a compiler would be unwilling to inline because of their length, but LTO can see the whole program, observe that a function is only used once and then inline it anyway. This might enable further optimizations; for example, aliasing information from the caller can be used in the now-inlined subroutines.

Looking at the floating-point benchmark [3], there seems to be just one function, but then we have this:

// External array; store results here so that optimizing compilers
// don't do away with their computation.
// suggested by Ben Herndon
//
double extern_array[12];

Well, this breaks down with LTO: the linker can "internalize" the array, observing that there are no accesses to it outside of this translation unit.

This is just speculation though, I didn't look at the actual assembly.

If you want to know more about what LTO can do, I can recommend this talk by Teresa Johnson: https://www.youtube.com/watch?v=p9nH2vZ2mNo. It's specifically about LLVM's ThinLTO, but many things apply to LTO in general.

[1] https://build.opensuse.org/package/show/network/boinc-client
[2] https://github.com/BOINC/boinc/blob/master/client/dhrystone.cpp
[3] https://github.com/BOINC/boinc/blob/master/client/whetstone.cpp
Am 08.03.20 um 21:59 schrieb Aaron Puchert:
Oh, I missed the most obvious thing: Dhrystone is actually split into two translation units [4], so LTO opens up new inlining opportunities anyway.

[4] https://github.com/BOINC/boinc/blob/master/client/dhrystone2.cpp
On 2020-03-08 22:12, Aaron Puchert wrote:
Oh, I missed the most obvious thing: Dhrystone is actually split into two translation units [4], so LTO opens up new inlining opportunities anyway.
[4] https://github.com/BOINC/boinc/blob/master/client/dhrystone2.cpp
Ah, I understand now. So, if I enable LTO I should get the same results. I will check this later this week. I only wonder if the external tasks are compiled with that same option, but that is beyond TW's control.

--- Frans
I only wonder if the external tasks are compiled with that same option, but that is beyond TW control.
They might not use LTO, but BOINC apps are usually quite well optimized; there was, for example, an entire community around optimizing the client for SETI@home. [1] Some of that ended up in the upstream application.

Many BOINC apps are open source and I'm sure they will happily accept contributions, especially if they improve performance.

[1] http://lunatics.kwsn.info/
On 2020-03-09 23:27, Aaron Puchert wrote:
I only wonder if the external tasks are compiled with that same option, but that is beyond TW control.

They might not use LTO, but BOINC apps are usually quite well optimized; there was, for example, an entire community around optimizing the client for SETI@home. [1] Some of that ended up in the upstream application.
Many BOINC apps are open source and I'm sure they will happily accept contributions, especially if they improve performance.
I added the -flto switch and found that this was the missing switch. The speed - in this particular instance - improved to 7-8x the original speed. So, I now know where the difference came from. Now it's time to experiment with other software to see what the impact on execution speed might be. ;)

I never used LTO before, and the concept was also not clear to me. Now it is. Thanks for the pointers.

--- Frans.
participants (4)
-
Aaron Puchert
-
Frans de Boer
-
Jan Engelhardt
-
Stefan Seyfried