[Bug 1040589] New: bash/gcc/gzip/python differ between builds because of profiling
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589

            Bug ID: 1040589
           Summary: bash/gcc/gzip/python differ between builds because of profiling
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: openSUSE 13.2
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Basesystem
          Assignee: rguenther@suse.com
          Reporter: bwiedemann@suse.com
        QA Contact: qa-bugs@suse.de
                CC: matz@suse.com
          Found By: Development
           Blocker: ---

In https://build.opensuse.org/project/prjconf/openSUSE:Factory we have
%do_profiling 1
and because of that in our bash.spec we enable gcc's 'profile feedback directed optimizations'
but that causes the jobs.o and resulting bash binary to differ between builds, even when running on the same build host.
And because of that, build-compare always thinks there is a change and triggers a re-publish and rebuild of depending packages.

We also have such binary diffs in gcc6, and gcc6.spec calls a
make profiledbootstrap

The do_profiling macro is used in
bash gzip hello python3-base python-base sed xz

and in
http://rb.zq1.de/compare.factory-20170523/bash-compare.out
http://rb.zq1.de/compare.factory-20170523/gcc6-compare.out
http://rb.zq1.de/compare.factory-20170523/gzip-compare.out
http://rb.zq1.de/compare.factory-20170523/python-base-compare.out
http://rb.zq1.de/compare.factory-20170523/python3-base-compare.out
we have strange diffs in assembler that I could not trace down to other sources of non-determinism until today. Diffs did go away when building without profiling (it was harder to disable for gcc6 and bash though).

Do the profiles just count invocations of functions or do they depend on the type and speed of the system?
In the first case, it should be possible to fix the profiling runs to be deterministic, but for that it would be useful to be able to see the differences between runs. How could I diff gcc's .gcda files?

--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c1

--- Comment #1 from Dr. Werner Fink <werner@suse.com> ---
(In reply to Bernhard Wiedemann from comment #0)
> In https://build.opensuse.org/project/prjconf/openSUSE:Factory we have %do_profiling 1
> and because of that in our bash.spec we enable gcc's 'profile feedback directed optimizations'
> but that causes the jobs.o and resulting bash binary to differ between builds, even when running on the same build host.
> And because of that, build-compare always thinks there is a change and triggers a re-publish and rebuild of depending packages
> We also have such binary diffs in gcc6 and gcc6.spec calls a make profiledbootstrap
> The do_profiling macro is used in bash gzip hello python3-base python-base sed xz
> and in
> http://rb.zq1.de/compare.factory-20170523/bash-compare.out
> http://rb.zq1.de/compare.factory-20170523/gcc6-compare.out
> http://rb.zq1.de/compare.factory-20170523/gzip-compare.out
> http://rb.zq1.de/compare.factory-20170523/python-base-compare.out
> http://rb.zq1.de/compare.factory-20170523/python3-base-compare.out
> we have strange diffs in assembler that I could not trace down to other sources of non-determinism until today. Diffs did go away when building without profiling (it was harder to disable for gcc6 and bash though)
> Do the profiles just count invocations of functions or do they depend on the type and speed of the system?
> In the first case, it should be possible to fix the profiling runs to be deterministic, but for that it would be useful to be able to see the differences between runs. How could I diff gcc's .gcda files?

That is your problem, not mine ... do not touch the bash test suite!
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589

Bernhard Wiedemann <bwiedemann@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |1081754
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c6

--- Comment #6 from Bernhard Wiedemann <bwiedemann@suse.com> ---
Apparently, .gcda file creation in gcc7 is racy. Otherwise, why would those SRs help make the builds reproducible?
https://build.opensuse.org/request/show/596431 sed
https://build.opensuse.org/request/show/596586 pcre

gcc-4.8 from 42.3 seems to behave the same, so this is not a regression.
(Unfortunately, no older gccs are available anymore for Factory.)
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c9

Martin Liška <martin.liska@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |martin.liska@suse.com

--- Comment #9 from Martin Liška <martin.liska@suse.com> ---
So I analyzed 2 packages before the changes in Factory happened:

1) gzip - as mentioned, a temp file was used and zipped. The temp file name probably causes the divergence, as it's always different.

2) sed - it uses pthreads (but that is not passed in CFLAGS), thus I added -fprofile-update=atomic to cflags. See documentation here:
https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html

and I've got it. The diff between 2 runs:

./gnulib-tests/nanosleep.gcda: a3000000: 37:PROGRAM_SUMMARY checksum=0x99d0fa93
[snip]
./gnulib-tests/nanosleep.gcda: 01a10000: 14:COUNTERS arcs 7 counts
./gnulib-tests/nanosleep.gcda: 0 2 1 1 0 1 29 30

./gnulib-tests/nanosleep.gcda: a3000000: 37:PROGRAM_SUMMARY checksum=0xc3046d04
[snip]
./gnulib-tests/nanosleep.gcda: 0 2 1 1 0 1 30 31

./lib/stat-time.h:

    stat_time_normalize (int result, struct stat *st _GL_UNUSED)
    {
    #if defined __sun && defined STAT_TIMESPEC
      if (result == 0)
        {
          long int timespec_resolution = 1000000000;
          short int const ts_off[] = { offsetof (struct stat, st_atim),
                                       offsetof (struct stat, st_mtim),
                                       offsetof (struct stat, st_ctim) };
          int i;
          for (i = 0; i < sizeof ts_off / sizeof *ts_off; i++)
            {
              struct timespec *ts = (struct timespec *) ((char *) st + ts_off[i]);
              long int q = ts->tv_nsec / timespec_resolution;
              long int r = ts->tv_nsec % timespec_resolution;
              if (r < 0)
                {
                  r += timespec_resolution;
                  q--;
                }
              ts->tv_nsec = r;
              /* Overflow is possible, as Solaris 11 stat can yield
                 tv_sec == TYPE_MINIMUM (time_t) && tv_nsec == -1000000000.
                 INT_ADD_WRAPV is OK, since time_t is signed on Solaris.  */
              if (INT_ADD_WRAPV (q, ts->tv_sec, &ts->tv_sec))
                {
                  errno = EOVERFLOW;
                  return -1;
                }
            }
        }
    #endif
      return result;
    }

It loops based on how fast time flies. This can't be stable.
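The comparison shown in comment #9 can be sketched in a few lines of Python. This is only an illustration, assuming each .gcda file has already been turned into text (the output above looks like what `gcov-dump -l` prints, but that is an assumption). The helper names `counter_lines` and `diff_dumps` are hypothetical. PROGRAM_SUMMARY lines are dropped because their checksum changes whenever any counter changes, which would hide the interesting lines.

```python
import difflib

# The two dumps quoted in comment #9, with the [snip] parts left out.
RUN1 = """\
./gnulib-tests/nanosleep.gcda: a3000000: 37:PROGRAM_SUMMARY checksum=0x99d0fa93
./gnulib-tests/nanosleep.gcda: 0 2 1 1 0 1 29 30"""
RUN2 = """\
./gnulib-tests/nanosleep.gcda: a3000000: 37:PROGRAM_SUMMARY checksum=0xc3046d04
./gnulib-tests/nanosleep.gcda: 0 2 1 1 0 1 30 31"""


def counter_lines(dump_text):
    """Return the dump's lines without the PROGRAM_SUMMARY checksum lines."""
    return [line for line in dump_text.splitlines()
            if "PROGRAM_SUMMARY" not in line]


def diff_dumps(dump_a, dump_b):
    """Return only the removed ('-') and added ('+') lines between two dumps."""
    diff = list(difflib.unified_diff(counter_lines(dump_a),
                                     counter_lines(dump_b),
                                     "run1", "run2", lineterm="", n=0))
    # Skip the two file header lines, then keep only the changed lines
    # (this also drops the '@@' hunk markers).
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    for line in diff_dumps(RUN1, RUN2):
        print(line)
```

Applied to two full profiling runs, the remaining lines point at the source files whose counters vary, which is how the nanosleep test was singled out above.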
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c10

Martin Liška <martin.liska@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(jh@suse.com)      |

--- Comment #10 from Martin Liška <martin.liska@suse.com> ---
I can't reproduce a binary diff in 10 runs of the 'xz' package. But I guess it's multi-threaded, thus -fprofile-update=atomic should be used.
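Why multi-threaded programs need -fprofile-update=atomic can be shown with a toy model. This is a sketch, not gcc's implementation: a counter increment is split into a load step and a store step, and a schedule string says which thread runs next. The functions `run_nonatomic` and `run_atomic` and the schedule encoding are made up for this illustration.

```python
def run_nonatomic(schedule):
    """Replay increments that are split into separate load and store steps.

    Each character in `schedule` is a thread id. A thread's first step
    loads the counter; its next step stores the loaded value plus one.
    """
    counter = 0
    pending = {}  # thread id -> value loaded but not yet stored back
    for tid in schedule:
        if tid in pending:
            counter = pending.pop(tid) + 1  # store: may overwrite another thread's update
        else:
            pending[tid] = counter          # load
    return counter


def run_atomic(schedule):
    """Replay the same schedule with each increment as one indivisible step."""
    counter = 0
    started = set()
    for tid in schedule:
        if tid in started:
            started.remove(tid)
            counter += 1  # the whole increment takes effect at once
        else:
            started.add(tid)
    return counter


if __name__ == "__main__":
    for schedule in ("AABB", "ABAB"):
        print(schedule, run_nonatomic(schedule), run_atomic(schedule))
```

With the schedule "AABB" both variants count 2. With "ABAB" the non-atomic variant loses one update and counts 1. Since thread scheduling differs from run to run, so can the recorded counters, and with them the code generated from the profile.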
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c11

Martin Liška <martin.liska@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |IN_PROGRESS
           Assignee|rguenther@suse.com          |martin.liska@suse.com

--- Comment #11 from Martin Liška <martin.liska@suse.com> ---
I've done https://build.opensuse.org/request/show/597754
Then we can check the packages and eventually revert the changes that were done as a workaround.
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c43

Bernhard Wiedemann <bwiedemann@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Andreas.Stieger@gmx.de

--- Comment #43 from Bernhard Wiedemann <bwiedemann@suse.com> ---
bison-3.7.3 started to vary from PGO.
For profiling it uses a pretty large 'make check' run with 650 substeps.
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c44

--- Comment #44 from Andreas Stieger <Andreas.Stieger@gmx.de> ---
(In reply to Bernhard Wiedemann from comment #43)
> bison-3.7.3 started to vary from PGO.

Can you help me understand if this is a (new) problem in 3.7.3, and give the steps to compare the binaries?
Last I found was: http://rb.zq1.de/compare.factory-20200430/bison-compare.out
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c45

--- Comment #45 from Bernhard Wiedemann <bwiedemann@suse.com> ---
I found my debugging note from/before 2020-06-29:
> PGO again via ASLR, filesys, date

https://build.opensuse.org/request/show/676711 made it reproducible back then. Looking closer at old results, 3.5.2 was the first one marked to differ:
https://rb.zq1.de/compare.factory-20200303/bison-compare.out

Here is the general description of the problem of PGO with reproducibility:
https://github.com/bmwiedemann/theunreproduciblepackage/tree/master/pgo

Would disabling PGO for bison be an option?
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c46

--- Comment #46 from Martin Pluskal <mpluskal@suse.com> ---
(In reply to Bernhard Wiedemann from comment #45)
> I found my debugging note from/before 2020-06-29:
> > PGO again via ASLR, filesys, date
> https://build.opensuse.org/request/show/676711 made it reproducible back then and looking closer at old results, 3.5.2 was the first one marked to differ https://rb.zq1.de/compare.factory-20200303/bison-compare.out
> Here is the general description of the problem of PGO with reproducibility: https://github.com/bmwiedemann/theunreproduciblepackage/tree/master/pgo
> Would disabling PGO for bison be an option?

No! I do not wish to damage speed and/or optimized size of binaries because of this.
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c47

--- Comment #47 from Martin Liška <martin.liska@suse.com> ---
@Bernhard Wiedemann: Can we close this issue, please?
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c48

--- Comment #48 from Bernhard Wiedemann <bwiedemann@suse.com> ---
I don't think this has been fully solved.
gcc, bison and python all still suffer from PGO.
I suspect bash-4.4 in 15.3 might also be affected. Maybe also MozillaFirefox.

One intermediate solution would be to use in these .spec files
%if 0%{?do_profiling} && !0%{?only_reproducible_profiling}
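How the condition proposed in comment #48 would evaluate can be modelled in a few lines of Python. This is a rough sketch, assuming both macros are either undefined or hold an integer. `rpm_flag` and `pgo_enabled` are hypothetical helpers, not part of rpm. In a spec file `0%{?name}` expands to `0` when the macro is undefined and to `0` followed by its value otherwise, which the model imitates with string concatenation.

```python
def rpm_flag(macros, name):
    """Model `0%{?name}`: an undefined macro expands to nothing, leaving "0"."""
    return int("0" + macros.get(name, "")) != 0


def pgo_enabled(macros):
    """Model `%if 0%{?do_profiling} && !0%{?only_reproducible_profiling}`."""
    return (rpm_flag(macros, "do_profiling")
            and not rpm_flag(macros, "only_reproducible_profiling"))


if __name__ == "__main__":
    print(pgo_enabled({"do_profiling": "1"}))
    print(pgo_enabled({"do_profiling": "1", "only_reproducible_profiling": "1"}))
    print(pgo_enabled({}))
```

Under this model a project that defines only `do_profiling` keeps PGO, while one that also sets `only_reproducible_profiling` skips it for the packages carrying this condition.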
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589#c57

--- Comment #57 from OBSbugzilla Bot <bwiedemann+obsbugzillabot@suse.com> ---
This is an autogenerated message for OBS integration:
This bug (1040589) was mentioned in
https://build.opensuse.org/request/show/962180 Factory / grep
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589

Bernhard Wiedemann <bwiedemann@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |1197575
http://bugzilla.opensuse.org/show_bug.cgi?id=1040589

Bug 1040589 depends on bug 1197575, which changed state.

Bug 1197575 Summary: hello varies from PGO and date
http://bugzilla.opensuse.org/show_bug.cgi?id=1197575

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|IN_PROGRESS                 |RESOLVED
         Resolution|---                         |FIXED