[opensuse-buildservice] OBS build failing in endless loop
We are trying to build a recent version of Ceph in the OBS and failing. The exact same code builds fine in IBS, where one can get a build worker with 8GB of memory. In order to get past the OBS scheduler, however, I have to specify 4GB of memory.

We can build locally with "make -j1" and 4GB of memory (i.e. specify build-memory = 4096 in .oscrc). In OBS, however, the build chugs along for some time and then fails like this:

[ 9900s] CXX osd/libosd_types_la-PGLog.lo
[ 9952s] CXX osd/libosd_types_la-osd_types.lo
[10050s] CXX osd/libosd_types_la-ECUtil.lo
[10057s] CXXLD libosd_types.la
[10057s] CXX osd/libosd_la-PG.lo
[10170s] /var/run/obs/worker/1/build/build-vm-kvm: line 191: 11733 Killed "$@"
[10170s] ### WATCHDOG MARKER END ###
[10171s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[10171s] or the build host has a kernel or hardware problem...

The result is unhappy: the builds get retriggered and fail in the same way, over and over . . . and over . . . and over . . .

The package in question, where you can see this endless loop in action right now:
https://build.opensuse.org/package/show/filesystems:ceph:Unstable/ceph

Any ideas? We would really like to get this package to build in OBS.

Regards,
Nathan

--
To unsubscribe, e-mail: opensuse-buildservice+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-buildservice+owner@opensuse.org
Update: I don't know what changed, if anything, overnight but this
morning we have some successful builds in OBS!
On Thu, Nov 12, 2015 at 5:20 PM, Nathan Cutler wrote:
> We are trying to build a recent version of Ceph in the OBS and failing. [...]
On Friday 13 November 2015, 11:41:58 wrote Nathan Cutler:
> Update: I don't know what changed, if anything, overnight but this morning we have some successful builds in OBS!

Check "osc jobhistory". If you can see some pattern, it may help to debug the cause. E.g. if it fails on cloud* systems but works on build2X systems, it might be due to Intel vs AMD.
--
Adrian Schroeter
email: adrian@suse.de
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Maxfeldstraße 5
90409 Nürnberg
Germany
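The pattern check suggested above (cloud* versus build2X hosts) can be done by grouping job outcomes by worker host family. The following is only a sketch: it assumes worker ids of the form "host:slot" (e.g. "cloud105:3"), and the sample records are hypothetical, not taken from the real job history. The names host_family and tally_by_family are made up for this illustration.

```python
import re
from collections import Counter, defaultdict


def host_family(worker):
    """Reduce a worker id such as 'cloud105:3' to a family name such as 'cloud'.

    Assumes the 'host:slot' form; this format is an assumption.
    """
    host = worker.split(":", 1)[0]            # drop the ':<slot>' suffix
    return re.sub(r"\d+$", "", host) or host  # drop the trailing host number


def tally_by_family(records):
    """Count statuses per host family.

    records: iterable of (worker, status) pairs.
    Returns a dict mapping family name to {status: count}.
    """
    tally = defaultdict(Counter)
    for worker, status in records:
        tally[host_family(worker)][status] += 1
    return {family: dict(counts) for family, counts in tally.items()}


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = [
        ("cloud105:3", "failed"),
        ("cloud112:1", "failed"),
        ("build21:2", "succeeded"),
        ("build24:4", "succeeded"),
    ]
    for family, counts in sorted(tally_by_family(sample).items()):
        print(family, counts)
```

If failures cluster in one family and successes in another, that would support the hardware-dependent explanation; an even spread would point elsewhere.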
Hi Adrian:
> check the "osc jobhistory"
> If you can see some pattern, it may help to debug the cause.

Unfortunately, I cannot comply. When the job fails with, e.g.,

[ 3766s] /var/run/obs/worker/2/build/build-vm-kvm: line 191: 21120 Killed "$@"

the build restarts automatically and no new entry is added to jobhistory. It is as if the build never happened :-(

What is the advantage of restarting, and not recording?

Nathan
FWIW I caught the error again in a live build log. Note the "rcu_sched detected stalls" kernel error at 3035s. . .

[ 2619s] CXX mds/StrayManager.lo
[ 2715s] CXX mds/Locker.lo
[ 3035s] [ 2784.809736] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 6, t=19562 jiffies, g=49154, c=49153, q=293)
[ 3071s] CXX mds/Migrator.lo
[ 3194s] CXX mds/MDBalancer.lo
[ 3224s] CXX mds/CDentry.lo
[ 3323s] CXX mds/CDir.lo
[ 3331s] CXX mds/CInode.lo
[ 3473s] CXX mds/LogEvent.lo
[ 3517s] CXX mds/MDSTable.lo
[ 3581s] CXX mds/InoTable.lo
[ 3624s] CXX mds/JournalPointer.lo
[ 3683s] CXX mds/MDSTableClient.lo
[ 3696s] CXX mds/MDSTableServer.lo
[ 3766s] /var/run/obs/worker/2/build/build-vm-kvm: line 191: 21120 Killed "$@"
[ 3766s] ### WATCHDOG MARKER END ###
[ 3767s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[ 3767s] or the build host has a kernel or hardware problem...

Thanks for looking into this!

Nathan
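Since the failed runs leave no jobhistory entry, the live build log is the only record. A small scanner can pull out the interesting lines while the log is still available. This is a sketch only: the marker patterns are derived from the log excerpts in this thread, and the function and marker names (scan_build_log, "killed", "rcu_stall", "no_buildstatus") are made up for illustration.

```python
import re

# Build log lines look like "[ 3766s] message"; capture seconds and message.
LINE_RE = re.compile(r"^\[\s*(\d+)s\]\s?(.*)$")

# Patterns taken from the log excerpts quoted in this thread.
MARKERS = {
    "killed": re.compile(r"build-vm-kvm: line \d+:\s+\d+\s+Killed"),
    "rcu_stall": re.compile(r"rcu_sched detected stalls"),
    "no_buildstatus": re.compile(r"No buildstatus set"),
}


def scan_build_log(text):
    """Return a list of (seconds, marker_name) for every marker found."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a "[NNNNs]" prefix
        seconds, message = int(match.group(1)), match.group(2)
        for name, pattern in MARKERS.items():
            if pattern.search(message):
                events.append((seconds, name))
    return events


if __name__ == "__main__":
    import sys
    # Usage: save the live log to a file and pipe it in on stdin.
    for seconds, name in scan_build_log(sys.stdin.read()):
        print(f"{seconds:>6}s  {name}")
```

Run against the log above, this would flag the stall at 3035s and the kill at 3766s, which makes it easier to compare several failed runs side by side.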
> FWIW I caught the error again in a live build log. Note the "rcu_sched detected stalls" kernel error at 3035s. . .

Oh, and BTW this error occurred when I switched from "make -j1" to "make -j2".

And might I add that I cannot use "make %{?_smp_mflags}" in the OBS because I can't seem to get a build worker with >4G of memory.

So, to summarize:

- make -j1: OBS builds take upwards of 6 hours, but complete
- make -j2: OBS builds sometimes complete, other times not - auto-restart, no entry in jobhistory
- make %{?_smp_mflags}: fail OOM 100% of the time

HTHTD (Hope This Helps To Debug)

Nathan
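The three cases in the summary differ only in how many compile jobs share the worker's memory. One common way to reason about this is to cap the job count by memory as well as by CPU count. The sketch below assumes a per-job memory budget (mb_per_job), which is a hypothetical figure; the thread does not state how much memory one Ceph compile job needs.

```python
def capped_jobs(cpus, mem_mb, mb_per_job):
    """Largest 'make -j' value that fits both the CPU count and the memory.

    cpus:       number of CPUs available to the build
    mem_mb:     memory available to the build, in MB
    mb_per_job: assumed peak memory of one compile job, in MB (hypothetical)

    Never returns less than 1, since the build needs at least one job.
    """
    if cpus < 1 or mem_mb < 0 or mb_per_job <= 0:
        raise ValueError("cpus must be >= 1, mem_mb >= 0, mb_per_job > 0")
    return max(1, min(cpus, mem_mb // mb_per_job))


if __name__ == "__main__":
    # Illustrative numbers only: an 8-CPU worker with 4096 MB.
    for budget in (1024, 2048, 3000):
        print(budget, capped_jobs(8, 4096, budget))
```

With a 4096 MB worker, any budget above 2048 MB per job forces -j1 regardless of CPU count, which is consistent with the pattern described above, although it does not prove memory is the cause.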
On Monday 16 November 2015, 09:00:55 wrote Nathan Cutler:
> FWIW I caught the error again in a live build log. Note the "rcu_sched detected stalls" kernel error at 3035s. . .
> Oh, and BTW this error occurred when I switched from "make -j1" to "make -j2".
> And might I add that I cannot use "make %{?_smp_mflags}" in the OBS because I can't seem to get a build worker with >4G of memory.

Yes, only one exists which has 6GB, and that one is not always online.

> So, to summarize:
> - make -j1: OBS builds take upwards of 6 hours, but complete
> - make -j2: OBS builds sometimes complete, other times not - auto-restart, no entry in jobhistory
> - make %{?_smp_mflags}: fail OOM 100% of the time
> HTHTD (Hope This Helps To Debug)

The question is what there actually is to debug here from the OBS side. We could let jobs fail on OOM, but since the constraints are not configured properly in many packages, that would cause more failures. On the other hand, we might then get proper constraint configs in the packages.

--
Adrian Schroeter
email: adrian@suse.de
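For reference, a memory constraint lives in a _constraints file in the package. The sketch below generates a minimal one; the element layout (constraints/hardware/memory/size with a unit attribute) is my understanding of the OBS constraints format and should be checked against the OBS documentation. The 4096 value simply mirrors the 4GB figure mentioned earlier in the thread, and the function name memory_constraint is made up.

```python
import xml.etree.ElementTree as ET


def memory_constraint(size_mb):
    """Return a minimal _constraints document asking for size_mb MB of memory.

    The element names follow the OBS constraints format as I understand it;
    treat the layout as an assumption and verify it before use.
    """
    root = ET.Element("constraints")
    hardware = ET.SubElement(root, "hardware")
    memory = ET.SubElement(hardware, "memory")
    size = ET.SubElement(memory, "size", unit="M")
    size.text = str(size_mb)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the document; redirect to a file named _constraints to use it.
    print(memory_constraint(4096))
```

A constraint larger than what any worker offers leaves the job unscheduled, so the value has to stay within what the available workers provide.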
> The question is what to debug here from OBS side actually. We could let jobs fail on OOM, but since the constraints are not configured properly in many packages it will increase more failures.

So this error

[ 5386s] /var/run/obs/worker/1/build/build-vm-kvm: line 191: 23385 Killed "$@"

you consider OOM?

Correction: it happens even with "make -j1" :-(

Nathan