[heroes] postmortem: mariadb galera cluster outage 2018-01-03 19:30 UTC

4 Jan 2018

      Hello,

(all times mentioned below are UTC)

1) identifying the problem

Yesterday, 2018-01-03, at 19:37 the IRC bot sent monitoring messages that the
websites events.opensuse.org and progress.opensuse.org were returning error
500. I verified the issue and noticed that the mysql service was down on all
three nodes. I pinged darix for help with getting it back up, and darix
figured out the cause: logrotate ran at the same time on all nodes (~19:30)
and restarted the mysql service on each of them, which took the whole cluster
down.

2) solving the problem

After verifying that we had recent backups (on the second node), we first
took a backup of /var/lib/mysql on all three nodes. Then we started recreating
the cluster on the master node, which was successful and brought the websites
back up. We then rejoined the two slave nodes, also successfully. As a last
step we verified that all the websites were up again and that there was no
data loss. We finished around 20:15, so the total downtime was around 45
minutes.
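
This mail does not go into how the node to recreate the cluster from was
chosen. A common approach with Galera is to compare the seqno recorded in each
node's /var/lib/mysql/grastate.dat and bootstrap from the node with the
highest value. Below is a minimal Python sketch of that comparison, assuming
the file contents have already been collected from the nodes. The function
names are made up for illustration, and the node names galera1 and galera2
are assumed by analogy with galera3.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a Galera grastate.dat file."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the leading '# GALERA saved state' comment.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states):
    """Return the name of the node with the highest seqno.

    `states` maps a (hypothetical) node name to its grastate.dat content.
    A seqno of -1 means the position was not saved, which is usual after
    an unclean shutdown; the position then has to be recovered first
    (for example with `mysqld --wsrep-recover`), so we refuse to guess.
    """
    seqnos = {}
    for node, text in states.items():
        seqno = int(parse_grastate(text).get("seqno", "-1"))
        if seqno < 0:
            raise ValueError(f"{node}: seqno unknown, recover the position first")
        seqnos[node] = seqno
    return max(seqnos, key=seqnos.get)
```

The sketch only decides where to start; the backups of /var/lib/mysql taken
beforehand remain the safety net if that decision turns out to be wrong.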

3) what we did to prevent the problem from happening again

We changed the time at which the daily cron jobs run on the second and the
third node, so that they no longer run simultaneously on all nodes. The daily
cron jobs now run at 19:00 on the first node, at 19:30 on the second node and
at 20:00 on the third node.
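
The staggering can be expressed as a small calculation, which is handy if the
schedule is ever generated by configuration management instead of being set by
hand. This is a minimal Python sketch under that assumption. The function
names are hypothetical, and the node names galera1 and galera2 are assumed by
analogy with galera3.

```python
from datetime import datetime, timedelta


def staggered_times(nodes, first="19:00", step_minutes=30):
    """Map each node to an HH:MM start time, `step_minutes` apart."""
    start = datetime.strptime(first, "%H:%M")
    return {
        node: (start + timedelta(minutes=i * step_minutes)).strftime("%H:%M")
        for i, node in enumerate(nodes)
    }


def min_gap_minutes(times):
    """Smallest gap in minutes between any two start times.

    Needs at least two nodes, and ignores wrap-around at midnight.
    """
    minutes = sorted(int(t[:2]) * 60 + int(t[3:]) for t in times.values())
    return min(b - a for a, b in zip(minutes, minutes[1:]))


times = staggered_times(["galera1", "galera2", "galera3"])
print(times)                   # galera1 19:00, galera2 19:30, galera3 20:00
print(min_gap_minutes(times))  # 30
```

A check like min_gap_minutes could also flag the original failure mode, where
all three nodes shared the same start time and the gap was zero.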

4) what we could also do to improve the situation

- more frequent backups (e.g. 4 times per day)
- enable backups on galera3 as well
- make sure that the auto-update script doesn't run at the same time on all
  three hosts
- add the connect.opensuse.org webpage to the monitoring
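
For the monitoring item, the check that matters here is "does the site answer
with a 5xx error, or not at all". This is a minimal Python sketch of such a
check, not the monitoring setup actually in use. The function names are
hypothetical, and the use of https:// URLs is an assumption.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still useful.
        return err.code
    except OSError:
        # URLError (a subclass of OSError), timeouts, refused connections.
        return None


def failing_sites(urls, fetch=http_status):
    """Return the URLs that are unreachable or answer with a 5xx status."""
    failing = []
    for url in urls:
        status = fetch(url)
        if status is None or status >= 500:
            failing.append(url)
    return failing
```

It would be called with the sites named in this mail, for example
failing_sites(["https://events.opensuse.org", "https://progress.opensuse.org",
"https://connect.opensuse.org"]). The fetch parameter exists so that the logic
can be tested without network access.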

Special thanks to darix, who saved the day!
-- 
Theo Chatzimichos <tampakrap@opensuse.org> <tchatzimichos@suse.com>
System Administrator
SUSE Operations and Services Team