Re: [heroes] postmortem: mariadb galera cluster outage 2018-01-03 19:30 UTC

5 Jan 2018

Hello,

On Thursday, 4 January 2018, 14:47:25 CET, Theo Chatzimichos wrote:
...
3) what we did to avoid the problem from happening again
We changed the time at which the daily cron jobs run on the second
and the third node, in order to keep the issue from happening again.
So now the daily cron jobs will run at 19:00 on the first node, at
19:30 on the second node and at 20:00 on the third node.
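
As an illustration of the staggered schedule quoted above, here is a
minimal sketch that derives one daily cron expression per node from a
start time and a fixed offset. The host names galera1 and galera2, the
function name and its parameters are assumptions made up for this
example; only the 19:00 / 19:30 / 20:00 times come from the postmortem.

```python
from datetime import datetime, timedelta


def staggered_cron_entries(first="19:00", step_minutes=30,
                           nodes=("galera1", "galera2", "galera3")):
    """Return a daily cron expression per node, offset by step_minutes.

    Node names and defaults are hypothetical; adjust them to the real hosts.
    """
    start = datetime.strptime(first, "%H:%M")
    entries = {}
    for index, node in enumerate(nodes):
        run_at = start + timedelta(minutes=index * step_minutes)
        # cron field order: minute hour day-of-month month day-of-week
        entries[node] = f"{run_at.minute} {run_at.hour} * * *"
    return entries


if __name__ == "__main__":
    for node, expression in staggered_cron_entries().items():
        print(node, expression)
```

The same offset idea would apply to any other job that should not run
on all three hosts at the same time.
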

I'm afraid logrotate was not the problematic part (or at least not the
only one) - the galera cluster is down again. The first notice from
monitoring was at 19:40 UTC for events.o.o, followed by progress.o.o
5 minutes later.

Interestingly the IRC bot did not say anything about mariadb, but 
monitor.o.o/icinga/ shows mysql is down on all of them again.
...
4) what we could also do to improve the situation
- more frequent backups (e.g. 4 times per day)
- enable backups on galera3 as well
- make sure that the auto-update script doesn't run at the same time
on all three hosts
- add connect.opensuse.org webpage to the monitoring
.. and ensure that LDAP logins work on the galera machines ;-)

This also means the only things I can do are updating status.o.o
(already done) and waiting for someone to fix the galera cluster and
write the next postmortem.

Regards,

Christian Boltz
-- 
Encryption is only for terrorists and as such not supported :-)
[Stefan Seyfried in opensuse-packaging]
