[opensuse-buildservice] "discarded" workers
Hi,
In our private OBS instance often workers become unavailable and their state is reported as "discarded". I'm not sure, but it seems like one cause of this might be when OBS kills a job (like if sources change for the package being built). Just a moment ago a worker became discarded while it was doing a build for a package that we know for a fact had its sources changed during the build. This is what was on the worker's screen (some details obfuscated with %TOKENS%):
[ 144s] qemu: terminating on signal 15 from pid 81153
[ 144s] No buildstatus set, either the base system is broken (glibc/bash/perl)
[ 144s] or the build host has a kernel or hardware problem...
2015-08-19 09:59:26: build finished '%PACKAGENAME%-%PACKAGEVER%' for project 'home:%USERNAME%:%PROJECTNAME%' repository 'common_xUbuntu_14.04' arch 'x86_64'
build discarded...
umount: /var/cache/obs/worker/root_1: not mounted
umount tmpfs failed: No such file or directory
could not kill build in /var/cache/obs/worker/root_1/root
The last line is repeated dozens of times, and the worker is hung. The only way I know to fix this is to restart all the workers with /etc/init.d/obsworker.
Do you know why workers might hang in discarded states like this, and/or do you know how I can restart a single worker instead of all of them?
Thanks.
Andy
--
To unsubscribe, e-mail: opensuse-buildservice+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-buildservice+owner@opensuse.org
On Wednesday 19 August 2015, 12:16:02 wrote Andrew Davidoff:
Hi,
In our private OBS instance often workers become unavailable and their state is reported as "discarded". I'm not sure, but it seems like one cause of this might be when OBS kills a job (like if sources change for the package being built). Just a moment ago a worker became discarded while it was doing a build for a package that we know for a fact had its sources changed during the build. This is what was on the worker's screen (some details obfuscated with %TOKENS%):
[ 144s] qemu: terminating on signal 15 from pid 81153
[ 144s] No buildstatus set, either the base system is broken (glibc/bash/perl)
[ 144s] or the build host has a kernel or hardware problem...
2015-08-19 09:59:26: build finished '%PACKAGENAME%-%PACKAGEVER%' for project 'home:%USERNAME%:%PROJECTNAME%' repository 'common_xUbuntu_14.04' arch 'x86_64'
build discarded...
umount: /var/cache/obs/worker/root_1: not mounted
umount tmpfs failed: No such file or directory
could not kill build in /var/cache/obs/worker/root_1/root
The last line is repeated dozens of times, and the worker is hung. The only way I know to fix this is to restart all the workers with /etc/init.d/obsworker.
Do you know why workers might hang in discarded states like this, and/or do you know how I can restart a single worker instead of all of them?
This usually appears when a worker either gets explicitly killed or detects a known bug in the VM handling of the host. The latter happens when any of these is part of the build log:
$kill_job = "kvm memory page bug" if $tail =~ /BUG: unable to handle kernel NULL pointer dereference at/;
$kill_job = "kvm spinnlock bug" if $tail =~ /INFO: rcu_sched self-detected stall on CPU/;
$kill_job = "xen soft lockup" if $tail =~ /BUG: soft lockup - CPU#\d+ stuck for/;
--
Adrian Schroeter
email: adrian@suse.de
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Maxfeldstraße 5
90409 Nürnberg
Germany
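For anyone who wants to check a saved build log for these markers by hand, the three patterns above can be tried with a small shell sketch. This is not the worker's own code: the function name detect_kill_reason, the idea of reading a log file from disk, and the 200-line tail window are assumptions made for illustration. Only the patterns and the reason strings are taken from the snippet above.

```shell
#!/bin/sh
# Sketch only: detect_kill_reason is a hypothetical helper, not part of OBS.
# Usage: detect_kill_reason LOGFILE
# Prints the matching reason string, or nothing if no pattern matches.
detect_kill_reason() {
    dkr_log=$1
    dkr_reason=""
    # Assumption: only the last 200 lines of the log are inspected.
    dkr_tail=$(tail -n 200 "$dkr_log")
    # A later match overrides an earlier one, mirroring the order of the
    # Perl assignments quoted above.
    if printf '%s\n' "$dkr_tail" | grep 'BUG: unable to handle kernel NULL pointer dereference at' >/dev/null; then
        dkr_reason="kvm memory page bug"
    fi
    if printf '%s\n' "$dkr_tail" | grep 'INFO: rcu_sched self-detected stall on CPU' >/dev/null; then
        dkr_reason="kvm spinnlock bug"
    fi
    if printf '%s\n' "$dkr_tail" | grep -E 'BUG: soft lockup - CPU#[0-9]+ stuck for' >/dev/null; then
        dkr_reason="xen soft lockup"
    fi
    if [ -n "$dkr_reason" ]; then
        printf '%s\n' "$dkr_reason"
    fi
}
```

Calling detect_kill_reason with the path of a saved log prints one of the three reason strings if the log contains a matching line, and prints nothing otherwise.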
On Wed, Aug 19, 2015 at 12:16:02PM -0400, Andrew Davidoff wrote:
[ 144s] qemu: terminating on signal 15 from pid 81153
[ 144s] No buildstatus set, either the base system is broken (glibc/bash/perl)
[ 144s] or the build host has a kernel or hardware problem...
2015-08-19 09:59:26: build finished '%PACKAGENAME%-%PACKAGEVER%' for project 'home:%USERNAME%:%PROJECTNAME%' repository 'common_xUbuntu_14.04' arch 'x86_64'
build discarded...
umount: /var/cache/obs/worker/root_1: not mounted
umount tmpfs failed: No such file or directory
could not kill build in /var/cache/obs/worker/root_1/root
That's because there's a 'die("umount tmpfs failed")' statement in cleanup_job(). Because of the die, the state is never set to 'idle' again.
I think the correct fix is to not call umount if tmpfs is not mounted anymore/yet.
Cheers, Michael.
--
Michael Schroeder mls@suse.de
SUSE LINUX GmbH, GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
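The guard Michael suggests (only call umount when the path is really mounted) can be illustrated with a short shell sketch. The actual fix would belong in the worker's cleanup_job() code, which is not shown in this thread, so the helper names is_mounted and cleanup_root and the optional mounts-file parameter below are hypothetical and exist only to show the idea.

```shell
#!/bin/sh
# Sketch only: is_mounted and cleanup_root are hypothetical helpers that
# illustrate the "check before umount" guard; they are not OBS code.

# is_mounted DIR [MOUNTS_FILE]
# Succeeds if DIR is listed as a mount point (second field) in MOUNTS_FILE.
# MOUNTS_FILE defaults to /proc/mounts (Linux); the parameter exists so the
# check can be exercised against a sample file.
is_mounted() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# cleanup_root DIR [MOUNTS_FILE]
# Unmounts DIR only if it is mounted, so a missing mount is not an error
# and the caller can go on to reset the worker state.
cleanup_root() {
    cr_root=$1
    cr_mounts=${2:-/proc/mounts}
    if is_mounted "$cr_root" "$cr_mounts"; then
        umount "$cr_root" || return 1
    else
        echo "$cr_root: not mounted, skipping umount"
    fi
}
```

With this guard, the "not mounted" case from the log above becomes a skipped step instead of a fatal error, which is the behaviour the proposed fix is after.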
Thanks for the information. I'll dig into this a bit.
Regarding restarting a single worker that has become discarded, is there a way to do this?
Thanks.
Andy
On Thu, Aug 20, 2015 at 4:48 AM, Michael Schroeder <mls@suse.de> wrote:
On Wed, Aug 19, 2015 at 12:16:02PM -0400, Andrew Davidoff wrote:
[ 144s] qemu: terminating on signal 15 from pid 81153
[ 144s] No buildstatus set, either the base system is broken (glibc/bash/perl)
[ 144s] or the build host has a kernel or hardware problem...
2015-08-19 09:59:26: build finished '%PACKAGENAME%-%PACKAGEVER%' for project 'home:%USERNAME%:%PROJECTNAME%' repository 'common_xUbuntu_14.04' arch 'x86_64'
build discarded...
umount: /var/cache/obs/worker/root_1: not mounted
umount tmpfs failed: No such file or directory
could not kill build in /var/cache/obs/worker/root_1/root
That's because there's a 'die("umount tmpfs failed")' statement in cleanup_job(). Because of the die, the state is never set to 'idle' again.
I think the correct fix is to not call umount if tmpfs is not mounted anymore/yet.
Cheers, Michael.
On Thu, Aug 20, 2015 at 04:09:22PM -0600, Andrew Davidoff wrote:
Thanks for the information. I'll dig into this a bit.
Regarding restarting a single worker that has become discarded, is there a way to do this?
If the single worker is run in screen, you should be able to ^C the worker and then "resurrect" the window by pressing 'n'.
Cheers, Michael.
On Fri, Aug 21, 2015 at 3:26 AM, Michael Schroeder <mls@suse.de> wrote:
On Thu, Aug 20, 2015 at 04:09:22PM -0600, Andrew Davidoff wrote:
Thanks for the information. I'll dig into this a bit.
Regarding restarting a single worker that has become discarded, is there a way to do this?
If the single worker is run in screen, you should be able to ^C the worker and then "resurrect" the window by pressing 'n'.
I just gave this a shot but I must be doing something wrong. I switched to the screen window of the discarded worker and hit ^C, which caused the worker to fully exit and closed that screen window completely. Pressing 'n' from any other screen window simply types the letter 'n', as you'd expect, and meta-n of course just cycles to the next window.
Do I misunderstand?
Thanks.
Andy
On Monday 24 August 2015, 19:29:23 wrote Andrew Davidoff:
On Fri, Aug 21, 2015 at 3:26 AM, Michael Schroeder <mls@suse.de> wrote:
On Thu, Aug 20, 2015 at 04:09:22PM -0600, Andrew Davidoff wrote:
Thanks for the information. I'll dig into this a bit.
Regarding restarting a single worker that has become discarded, is there a way to do this?
If the single worker is run in screen, you should be able to ^C the worker and then "resurrect" the window by pressing 'n'.
I just gave this a shot but I must be doing something wrong. I switched to the screen window of the discarded worker and hit ^C, which caused the worker to fully exit and closed that screen window completely. Pressing 'n' from any other screen window simply types the letter 'n', as you'd expect, and meta-n of course just cycles to the next window. Do I misunderstand?
Do you use OBS < 2.6? Then your screenrc has no "zombie on". We only enabled it again at the beginning of this year. Check /etc/init.d/obsworker for the line that outputs "zombie on"; it is most likely commented out.
--
Adrian Schroeter
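A quick way to see whether the "zombie on" line is active in the init script is a grep along the following lines. This is only a sketch: the function name zombie_status is made up, and it assumes a disabled line is one whose first non-blank character is "#"; the exact wording of the line in /etc/init.d/obsworker is not shown in this thread.

```shell
#!/bin/sh
# Sketch only: zombie_status is a hypothetical helper.
# Usage: zombie_status [SCRIPT]   (SCRIPT defaults to /etc/init.d/obsworker)
# Prints "enabled", "commented out" or "absent".
zombie_status() {
    zs_script=${1:-/etc/init.d/obsworker}
    # Collect every line that mentions "zombie on"; no match is not an error.
    zs_lines=$(grep 'zombie on' "$zs_script" || true)
    if [ -z "$zs_lines" ]; then
        echo "absent"
    # Assumption: a line whose first non-blank character is '#' is disabled.
    elif printf '%s\n' "$zs_lines" | grep -v '^[[:space:]]*#' >/dev/null; then
        echo "enabled"
    else
        echo "commented out"
    fi
}
```

If the result is "commented out", removing the comment marker and restarting the workers should keep a window around after the worker exits, which is what the 'n' resurrection described earlier relies on.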
Andrew Davidoff <davidoff@qedmf.net> writes:
I just gave this a shot but I must be doing something wrong. I switched to the screen window of the discarded worker and hit ^C, which caused the worker to fully exit and closed that screen window completely.
You can resurrect it by changing directory to /var/run/obs/worker/boot, finding the worker's command line in screenrc, and running that line in the shell prefixed with "screen -X" (which sends the command to the running screen instance).
Andreas.
--
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
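The steps Andreas describes can be sketched as a small helper that looks up a worker's line in screenrc and prints it with the "screen -X" prefix, so it can be inspected before it is run. The helper name resurrect_cmd is hypothetical, and the layout of the screenrc lines is not shown in this thread, so the sketch simply takes the first line containing a search string you supply (for example the worker's root directory name).

```shell
#!/bin/sh
# Sketch only: resurrect_cmd is a hypothetical helper, not part of OBS.
# Usage: resurrect_cmd SEARCH_STRING [SCREENRC]
# SCREENRC defaults to /var/run/obs/worker/boot/screenrc.
# Prints the first matching screenrc line prefixed with "screen -X";
# it does not execute anything.
resurrect_cmd() {
    rc_pattern=$1
    rc_file=${2:-/var/run/obs/worker/boot/screenrc}
    # Fixed-string match; take the first matching line only.
    rc_line=$(grep -F -- "$rc_pattern" "$rc_file" | head -n 1)
    if [ -z "$rc_line" ]; then
        echo "no screenrc line matches '$rc_pattern'" >&2
        return 1
    fi
    printf 'screen -X %s\n' "$rc_line"
}
```

After checking the printed command, run it from /var/run/obs/worker/boot as described above. If more than one screen session is running, you would probably also need to select the right one with screen's -S option.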
On Tue, Aug 25, 2015 at 12:52 AM, Andreas Schwab <schwab@suse.de> wrote:
You can resurrect it by changing directory to /var/run/obs/worker/boot, then finding the command line in screenrc which you run here in the shell prefixed with "screen -X" (which sends the command to the running screen instance).
Excellent, this appears to have worked. "zombie on" is in fact commented out in /etc/init.d/obsworker, though I may leave it that way.
Thanks!
Andy
participants (4)
- Adrian Schröter
- Andreas Schwab
- Andrew Davidoff
- Michael Schroeder