[opensuse-buildservice] Need help with debugging an endless loop of events - bs_sched never takes a break!
Hello List, I need some help in debugging a situation in my private OBS (2.9.x) instance that has been going on for a couple of months!! The schedulers keep scanning projects in a loop - never taking a break. On a Sunday, after 12 hours and 20 minutes of runtime, here are some numbers: # grep -ci "calculating project dependencies" /bs/log/scheduler_*.log /bs/log/scheduler_i586.log:19032 /bs/log/scheduler_local.log:10256 /bs/log/scheduler_x86_64.log:18915 # grep -ci "waiting for an event" /bs/log/scheduler_*.log /bs/log/scheduler_i586.log:0 /bs/log/scheduler_local.log:19150 /bs/log/scheduler_x86_64.log:0 I have not managed to find any hints as to what is causing this loop. Any help or pointers would be greatly appreciated! Let me share what I've found / done so far: 0. This problem started after we applied all available updates for SLES12 SP3 on July 31st. We had to, of course, reboot the system after that. I want to plan an upgrade to SLES12 SP5 but the current situation is making it very difficult to plan the upgrade. I just don't know if I'll making things even worse with an upgrade! 1. Dumped scheduler state by running `obs_admin --dump-state x86_64` and looked inside the state file. A few observations: * 'lookat' element has 1,230 items. * 'lookat_oob' element has 11 items. * 'retryevents' element has 66 items. * Number of items in 'prpfinished' and 'prps' do not match. Difference is 106. * 'changed_med' has 2 entries. 2. /bs/events/lastevents is always filled. After stopping all backend services, there are almost 5,000 entries in the file. 3. /srv/www/obs/api/log/backend_access.log has backend calls which do not show up in the production.log file. I thought, all backend calls must go through the API. Since today is Sunday, I suspected that the project consistency check could have been triggered. While it was indeed triggered, that job was finished 4 hours 23 minutes: Performed ConsistencyCheckJob (Job ID: c492970a-bfd7-4683-abf1-816f5eb5a6b6) from DelayedJob(consistency_check) in 259020.22ms 4. I have checked clockworkd.ruby2.5.clock.output to see if there are any strange jobs that are getting triggered. But the problem that I've always faced with this log is that it only logs when a job is triggered while it never contains when a job finished. When I try to run the "jobs:check" command, I get a MySQL error: $ run_in_api rails.ruby2.5 jobs:check rails aborted! ActiveRecord::StatementInvalid: Mysql2::Error: Unknown column 'created_at' in 'where clause': SELECT COUNT(*) FROM `delayed_jobs` WHERE (attempts = 0 AND created_at < '2020-10-25 10:26:42.114022') bin/rails:4:in `require' bin/rails:4:in `<main>' Caused by: Mysql2::Error: Unknown column 'created_at' in 'where clause' bin/rails:4:in `require' bin/rails:4:in `<main>' Tasks: TOP => jobs:check (See full trace by running task with --trace) Clearly, I'm going nowhere with this and I'm desperately trying to improve the situation. Any hints or pointers would be greatly appreciated. Regards, Srinidhi. N�����r��y隊Z)z{.���Wlz��qﮞ˛���m�)z{.��+�:�{Zr�az�'z��j)h���Ǜ�)]���Ǿ� ޮ�^�ˬz��
Hello List, Any hints or pointers? The system load is always high because of schedulers and this is impacting our backup job schedules. Regards, Srinidhi. On Sun, 2020-10-25 at 10:47 +0000, Srinidhi B wrote:
Hello List,
I need some help in debugging a situation in my private OBS (2.9.x) instance that has been going on for a couple of months!!
The schedulers keep scanning projects in a loop - never taking a break. On a Sunday, after 12 hours and 20 minutes of runtime, here are some numbers:
# grep -ci "calculating project dependencies" /bs/log/scheduler_*.log /bs/log/scheduler_i586.log:19032 /bs/log/scheduler_local.log:10256 /bs/log/scheduler_x86_64.log:18915
# grep -ci "waiting for an event" /bs/log/scheduler_*.log /bs/log/scheduler_i586.log:0 /bs/log/scheduler_local.log:19150 /bs/log/scheduler_x86_64.log:0
I have not managed to find any hints as to what is causing this loop. Any help or pointers would be greatly appreciated!
Let me share what I've found / done so far:
0. This problem started after we applied all available updates for SLES12 SP3 on July 31st. We had to, of course, reboot the system after that. I want to plan an upgrade to SLES12 SP5 but the current situation is making it very difficult to plan the upgrade.
I just don't know if I'll making things even worse with an upgrade!
1. Dumped scheduler state by running `obs_admin --dump-state x86_64` and looked inside the state file. A few observations:
* 'lookat' element has 1,230 items. * 'lookat_oob' element has 11 items. * 'retryevents' element has 66 items. * Number of items in 'prpfinished' and 'prps' do not match. Difference is 106. * 'changed_med' has 2 entries.
2. /bs/events/lastevents is always filled. After stopping all backend services, there are almost 5,000 entries in the file.
3. /srv/www/obs/api/log/backend_access.log has backend calls which do not show up in the production.log file. I thought, all backend calls must go through the API.
Since today is Sunday, I suspected that the project consistency check could have been triggered. While it was indeed triggered, that job was finished 4 hours 23 minutes:
Performed ConsistencyCheckJob (Job ID: c492970a-bfd7-4683-abf1- 816f5eb5a6b6) from DelayedJob(consistency_check) in 259020.22ms
4. I have checked clockworkd.ruby2.5.clock.output to see if there are any strange jobs that are getting triggered. But the problem that I've always faced with this log is that it only logs when a job is triggered while it never contains when a job finished.
When I try to run the "jobs:check" command, I get a MySQL error:
$ run_in_api rails.ruby2.5 jobs:check rails aborted! ActiveRecord::StatementInvalid: Mysql2::Error: Unknown column 'created_at' in 'where clause': SELECT COUNT(*) FROM `delayed_jobs` WHERE (attempts = 0 AND created_at < '2020-10-25 10:26:42.114022') bin/rails:4:in `require' bin/rails:4:in `<main>'
Caused by: Mysql2::Error: Unknown column 'created_at' in 'where clause' bin/rails:4:in `require' bin/rails:4:in `<main>' Tasks: TOP => jobs:check (See full trace by running task with --trace)
Clearly, I'm going nowhere with this and I'm desperately trying to improve the situation. Any hints or pointers would be greatly appreciated.
Regards, Srinidhi. N�����r��y隊Z)z{.���Wlz��qﮞ˛���m�)z{.��+�:�{Zr�az�'z��j)h���Ǜ�)]���Ǿ� ޮ�^�ˬz�
Hello List, I seem to have found the problem. Sending this email so that this might help someone else in future. I stopped all schedulers and started the scheduler using: $ sudo ./bs_sched --testmode x86_64 2>&1 | tee /tmp/sched.log Looking at the /tmp/sched.log file, I started realizing a few things: * A few projects (~100) are using a remoteurl that is no longer accessible from my private OBS instance. * There are a lot of ":schedulerstate.dirty" files inside /srv/obs/build/ directory. I started editing project meta of those projects that weren't already "locked" or whose build were already marked as "disabled". After a few tries, schedulers started taking a break! At 1:55 AM, here are the numbers: # grep -ci "waiting for an event" /bs/log/scheduler_*.log /bs/log/scheduler_i586.log:353 /bs/log/scheduler_local.log:2693 /bs/log/scheduler_x86_64.log:124 Much better!! Although, the situation hasn't been resolved completely, I consider this as a good improvement. I'll continue to clean up the remaining projects that are still using the unreachable remoteurl. Regards, Srinidhi. On Thu, 2020-10-29 at 11:12 +0000, Srinidhi B wrote:
Hello List,
Any hints or pointers?
The system load is always high because of schedulers and this is impacting our backup job schedules.
Regards, Srinidhi.
On Sun, 2020-10-25 at 10:47 +0000, Srinidhi B wrote:
Hello List,
I need some help in debugging a situation in my private OBS (2.9.x) instance that has been going on for a couple of months!!
The schedulers keep scanning projects in a loop - never taking a break. On a Sunday, after 12 hours and 20 minutes of runtime, here are some numbers:
# grep -ci "calculating project dependencies" /bs/log/scheduler_*.log /bs/log/scheduler_i586.log:19032 /bs/log/scheduler_local.log:10256 /bs/log/scheduler_x86_64.log:18915
# grep -ci "waiting for an event" /bs/log/scheduler_*.log /bs/log/scheduler_i586.log:0 /bs/log/scheduler_local.log:19150 /bs/log/scheduler_x86_64.log:0
I have not managed to find any hints as to what is causing this loop. Any help or pointers would be greatly appreciated!
Let me share what I've found / done so far:
0. This problem started after we applied all available updates for SLES12 SP3 on July 31st. We had to, of course, reboot the system after that. I want to plan an upgrade to SLES12 SP5 but the current situation is making it very difficult to plan the upgrade.
I just don't know if I'll making things even worse with an upgrade!
1. Dumped scheduler state by running `obs_admin --dump-state x86_64` and looked inside the state file. A few observations:
* 'lookat' element has 1,230 items. * 'lookat_oob' element has 11 items. * 'retryevents' element has 66 items. * Number of items in 'prpfinished' and 'prps' do not match. Difference is 106. * 'changed_med' has 2 entries.
2. /bs/events/lastevents is always filled. After stopping all backend services, there are almost 5,000 entries in the file.
3. /srv/www/obs/api/log/backend_access.log has backend calls which do not show up in the production.log file. I thought, all backend calls must go through the API.
Since today is Sunday, I suspected that the project consistency check could have been triggered. While it was indeed triggered, that job was finished 4 hours 23 minutes:
Performed ConsistencyCheckJob (Job ID: c492970a-bfd7-4683-abf1- 816f5eb5a6b6) from DelayedJob(consistency_check) in 259020.22ms
4. I have checked clockworkd.ruby2.5.clock.output to see if there are any strange jobs that are getting triggered. But the problem that I've always faced with this log is that it only logs when a job is triggered while it never contains when a job finished.
When I try to run the "jobs:check" command, I get a MySQL error:
$ run_in_api rails.ruby2.5 jobs:check rails aborted! ActiveRecord::StatementInvalid: Mysql2::Error: Unknown column 'created_at' in 'where clause': SELECT COUNT(*) FROM `delayed_jobs` WHERE (attempts = 0 AND created_at < '2020-10-25 10:26:42.114022') bin/rails:4:in `require' bin/rails:4:in `<main>'
Caused by: Mysql2::Error: Unknown column 'created_at' in 'where clause' bin/rails:4:in `require' bin/rails:4:in `<main>' Tasks: TOP => jobs:check (See full trace by running task with --trace)
Clearly, I'm going nowhere with this and I'm desperately trying to improve the situation. Any hints or pointers would be greatly appreciated.
Regards, Srinidhi. N�����r��y隊Z)z{.���Wlz��qﮞ˛���m�)z{.��+�:�{Zr�az�'z��j)h���Ǜ�)]���Ǿ� ޮ�^�ˬz� �{.n�+�����������n�vǫ�'��칻�&ޢ��������'��- ���w�zf���^�ˬy�������{�0�����Ǩ� N�����r��y隊Z)z{.���Wlz��qﮞ˛���m�)z{.��+�:�{Zr�az�'z��j)h���Ǜ�)]���Ǿ� ޮ�^�ˬz��
participants (1)
-
Srinidhi B