[heroes] postmortem: mariadb galera cluster outage 2018-01-03 19:30 UTC
Hello,

(all times mentioned below are UTC)

1) identifying the problem

Yesterday, 2018-01-03 at 19:37, the IRC bot sent monitoring messages that the websites events.opensuse.org and progress.opensuse.org were giving error 500. I verified the issue and noticed that the mysql service was down on all three nodes. I pinged darix to help me get it back up; he figured out that logrotate had run at the same time on all nodes (~19:30) and also restarted the mysql service, which brought the cluster down.

2) solving the problem

After we verified that we have recent backups (on the second node), we first took a backup of /var/lib/mysql on all three nodes. Then we recreated the cluster on the master node, which was successful and brought the websites up, and then continued on the two slave nodes successfully. As a last step we verified that all the websites were up again and that there was no data loss. We finished around 20:15, so the total downtime was around 45 minutes.

3) what we did to avoid the problem from happening again

We changed the time at which the daily cron jobs run on the second and third node, to avoid the issue happening again. The daily cron jobs now run at 19:00 on the first node, at 19:30 on the second node and at 20:00 on the third node.

4) what we could also do to improve the situation

- more frequent backups (e.g. 4 times per day)
- enable backups on galera3 as well
- make sure that the auto-update script doesn't run at the same time on all three hosts
- add the connect.opensuse.org webpage to the monitoring

Special thanks to darix, who saved the day!

--
Theo Chatzimichos <tampakrap@opensuse.org> <tchatzimichos@suse.com>
System Administrator
SUSE Operations and Services Team
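For context, this is roughly what "recreating the cluster on the master node" involves with the standard MariaDB/Galera tooling. A minimal sketch only - the exact commands, paths and service name on the galera hosts are assumptions, not a record of what was actually run:

    # On each node, find the most advanced one (highest seqno) and/or the one
    # marked safe_to_bootstrap: 1 in the Galera state file:
    cat /var/lib/mysql/grastate.dat

    # Bootstrap a new cluster from that node only (wrapper that starts the
    # server with --wsrep-new-cluster):
    galera_new_cluster

    # Start the remaining nodes normally; they rejoin via IST/SST:
    systemctl start mysql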
Theo Chatzimichos wrote:
3) what we did to avoid the problem from happening again
We changed the time at which the daily cron jobs run on the second and third node, to avoid the issue happening again. The daily cron jobs now run at 19:00 on the first node, at 19:30 on the second node and at 20:00 on the third node.
4) what we could also do to improve the situation
- more frequent backups (e.g. 4 times per day)
- enable backups on galera3 as well
- make sure that the auto-update script doesn't run at the same time on all three hosts
- add the connect.opensuse.org webpage to the monitoring
It might be worth considering whether there really is a need to run logrotate for mysql. mysql usually doesn't log much unless someone has turned on query logging (for debugging).

--
Per Jessen, Zürich (14.1°C)
openSUSE mailing list admin
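To illustrate Per's point, a hedged sketch of a trimmed logrotate stanza that only touches the error log and tells mysqld to reopen its logs instead of restarting the service. The file name, log path and credential handling are assumptions about the setup, not the actual config on the galera nodes:

    # /etc/logrotate.d/mysql (sketch)
    /var/log/mysql/mysqld.log {
        weekly
        rotate 4
        missingok
        notifempty          # skip rotation entirely if mysql hasn't logged anything
        compress
        postrotate
            # reopen the log files without restarting mysqld
            /usr/bin/mysqladmin --defaults-file=/root/.my.cnf flush-logs
        endscript
    }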
Hello,

On Thursday, 4 January 2018, 14:47:25 CET, Theo Chatzimichos wrote:
3) what we did to avoid the problem from happening again
We changed the time at which the daily cron jobs run on the second and third node, to avoid the issue happening again. The daily cron jobs now run at 19:00 on the first node, at 19:30 on the second node and at 20:00 on the third node.
I'm afraid logrotate was not the problematic part (or at least not the only one) - the galera cluster is down again. The first notice from monitoring was at 19:40 UTC for events.o.o, followed by progress.o.o 5 minutes later. Interestingly, the IRC bot did not say anything about mariadb, but monitor.o.o/icinga/ shows mysql is down on all of them again.
4) what we could also do to improve the situation
- more frequent backups (e.g. 4 times per day)
- enable backups on galera3 as well
- make sure that the auto-update script doesn't run at the same time on all three hosts
- add the connect.opensuse.org webpage to the monitoring
... and ensure that LDAP logins work on the galera machines ;-)

This also means the only things I can do are updating status.o.o (already done) and waiting for someone who fixes the galera cluster and writes the next postmortem.

Regards,

Christian Boltz
--
Encryption is only for terrorists and as such not supported :-)
[Stefan Seyfried in opensuse-packaging]
Christian Boltz wrote:
This also means the only things I can do are updating status.o.o (already done) and waiting for someone who fixes the galera cluster and writes the next postmortem.
Earlier today I received the status notifications "Progress - The Project management tool status changed from Major Outage to Operational." I never received any saying "changed from Operational to Major Outage" ?

--
Per Jessen, Zürich (4.8°C)
openSUSE mailing list admin
Hello,

On Sunday, 7 January 2018, 19:06:26 CET, Per Jessen wrote:
Christian Boltz wrote:
This also means the only things I can do are updating status.o.o (already done) and waiting for someone who fixes the galera cluster and writes the next postmortem.
Earlier today I received the status notifications "Progress - The Project management tool status changed from Major Outage to Operational."
I never received any saying "changed from Operational to Major Outage" ?
Because of the database outage, connect.o.o, progress.o.o and events.o.o were down. While this doesn't sound too bad, the connect.o.o downtime also broke the @opensuse.org mail aliases :-(

You are subscribed to status notifications with your opensuse.org mail address, and therefore...

[log shortened and slightly edited to avoid feeding spambots]
2018-01-05T20:49:26.390029+00:00 status2 ... dsn=5.1.1, status=bounced (host mx2.suse.de[195.135.220.15] said: 550 5.1.1 <per@o....org>: Recipient address rejected: User unknown in virtual alias table (in reply to RCPT TO command))

The member mail aliases have been working again since yesterday evening (thanks to Theo for bringing the database master up again!), but we'll obviously have to improve the setup to make it more reliable and error-proof.

Oh, and today I noticed that handling status.o.o via haproxy and distributing the load between status1 and status2 is a bad idea. Both use their own standalone MySQL database, and the load balancing therefore resulted in a split brain. This is fixed now by making status.o.o a CNAME to status1.o.o instead of going through haproxy.

Regards,

Christian Boltz
--
Yeah, my wife made serious trouble when I told her that I can't go to vacation because Kyrill needs his flash. And sorry, but she won that argument.
[Stephan Kulow in opensuse-factory]
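For reference, a small sketch of the DNS-level fix described above (status.o.o pointing directly at status1 instead of being balanced by haproxy). The zone layout is an assumption; only the CNAME itself is taken from the mail:

    # hypothetical zone snippet - the actual zone file layout is not known here:
    #   status    IN    CNAME    status1.opensuse.org.

    # verify from any client that the record is in place:
    dig +short status.opensuse.org CNAME
    # expected output: status1.opensuse.org.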
Christian Boltz wrote:
Hello,
Am Sonntag, 7. Januar 2018, 19:06:26 CET schrieb Per Jessen:
Christian Boltz wrote:
This also means the only things I can do are updating status.o.o (already done) and waiting for someone who fixes the galera cluster and writes the next postmortem.
Earlier today I received the status notifications "Progress - The Project management tool status changed from Major Outage to Operational."
I never received any saying "changed from Operational to Major Outage" ?
Because of the database outage, connect.o.o, progress.o.o and events.o.o were down. While this doesn't sound too bad, the connect.o.o downtime also broke the @opensuse.org mail aliases :-(
You are subscribed to status notifications with your opensuse.org mail address, and therefore...
Ah, got it. Okay.

--
Per Jessen, Zürich (2.6°C)
openSUSE mailing list admin
On Thu, Jan 04, 2018 at 02:47:25PM +0100, Theo Chatzimichos wrote:
Hello,
(all times mentioned below are UTC)
1) identifying the problem
Yesterday, 2018-01-03 at 19:37, the IRC bot sent monitoring messages that the websites events.opensuse.org and progress.opensuse.org were giving error 500. I verified the issue and noticed that the mysql service was down on all three nodes. I pinged darix to help me get it back up; he figured out that logrotate had run at the same time on all nodes (~19:30) and also restarted the mysql service, which brought the cluster down.
2) solving the problem
After we verified that we have recent backups (on the second node), we first took a backup of /var/lib/mysql on all three nodes. Then we recreated the cluster on the master node, which was successful and brought the websites up, and then continued on the two slave nodes successfully. As a last step we verified that all the websites were up again and that there was no data loss. We finished around 20:15, so the total downtime was around 45 minutes.
3) what we did to avoid the problem from happening again
We changed the time at which the daily cron jobs run on the second and third node, to avoid the issue happening again. The daily cron jobs now run at 19:00 on the first node, at 19:30 on the second node and at 20:00 on the third node.
4) what we could also do to improve the situation
- more frequent backups (e.g. 4 times per day)
- enable backups on galera3 as well
- make sure that the auto-update script doesn't run at the same time on all three hosts
- add the connect.opensuse.org webpage to the monitoring
Special thanks to darix, who saved the day!
It happened again twice, so it was definitely not a logrotate issue but an upstream bug, possibly this one [1]. I have now updated the hosts, and mariadb was updated to a new patch version. Let's see if it crashes again; if it does, I'll file a ticket against our package.

[1] https://jira.mariadb.org/browse/MDEV-12023

Theo
Hello,

On Wednesday, 10 January 2018, 11:54:03 CET, Theo Chatzimichos wrote:
It happened again twice, so it was definitely not a logrotate issue but an upstream bug, possibly this one [1]. I have now updated the hosts, and mariadb was updated to a new patch version. Let's see if it crashes again; if it does, I'll file a ticket against our package.
It turned out that the mariadb update didn't change anything, and the cluster crashed at 19:31 UTC again - at least it was timely ;-)

At least now we know what triggers the crash - as I already guessed [1] yesterday, it's the backup script (no kidding!). This script does database dumps and then optimizes all tables. The good news is that creating the database dumps works. The problematic part is optimizing all tables, therefore we have disabled this part of the script for now. After this change, the galera cluster survived two test runs of the backup script.

The relevant part of the script that triggers the crash is:

    MYSQL_CHECK="/usr/bin/mysqlcheck"
    # ...
    MYSQL="/usr/bin/mysql"
    # ...
    function optimize() {
        if [ -x "$MYSQL_CHECK" ]; then
            LOG "Starting automatic repair and optimization of the databases/tables"
            "$MYSQL_CHECK" \
                --all-databases \
                --skip-database=lost+found \
                --compress \
                --auto-repair \
                --optimize \
                -u root 1>/dev/null 2>"$TMPFILE"
            "$MYSQL" -e "FLUSH QUERY CACHE;" 2>>"$TMPFILE"
        fi
    }

It typically takes 2 seconds from writing the log entry to the crash.

I found two bug reports that describe our problem exactly, including an exact match of our mysqld.log:
https://github.com/codership/galera/issues/486
https://jira.percona.com/browse/PXC-881

I also found http://msutic.blogspot.de/2015/10/confusion-and-problems-with-lostfound.html - but the precondition "/var/lib/mysql/lost+found/ exists" doesn't match our setup. However, we have a root-owned mysql_upgrade_info (only) on galera1. It's a file, not a directory like lost+found would be, but it _could_ [2] still somehow be related.

Regards,

Christian Boltz

[1] educated guess after reading the backup script, checking the database dumps' content and timestamps etc.
[2] wild guess, and given that lost+found looks like a database directory to mysql while a file doesn't, I doubt the mysql_upgrade_info file is really the problem. OTOH - who would have thought that optimizing all tables crashes the cluster? ;-)
--
In C we had to code our own bugs. In C++ we can inherit them.
[Prof. Gerald Karam]
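One possible shape of the "we disabled this part of the script" change - a sketch only, under the assumption that a simple switch was added; the OPTIMIZE_TABLES variable is a made-up name, while LOG and optimize() come from the excerpt above:

    # hypothetical switch near the top of the backup script:
    OPTIMIZE_TABLES="no"    # disabled 2018-01: mysqlcheck --optimize crashes the galera cluster

    function optimize() {
        if [ "$OPTIMIZE_TABLES" != "yes" ]; then
            LOG "Skipping table optimization (disabled, see the 2018-01 galera crashes)"
            return 0
        fi
        # ... original mysqlcheck / FLUSH QUERY CACHE code from above ...
    }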
Christian Boltz wrote:
Hello,
Am Mittwoch, 10. Januar 2018, 11:54:03 CET schrieb Theo Chatzimichos:
It happened again twice, so it was definitely not a logrotate issue, but instead it was an upstream bug, possibly this [1]. So now I updated the hosts and mariadb was updated to a new patch version. Let's see if it crashes again now, if it does I'll file a ticket against our package.
It turned out that the mariadb update didn't change anything, and the cluster crashed at 19:31 UTC again - at least it was timely ;-)
Predictability is a Good Thing(R) :-)
OTOH - who would have thought that optimizing all tables crashes the cluster? ;-)
Sounds like it ought to be reproducible. Unless we're optimizing in parallel?

--
Per Jessen, Zürich (3.4°C)
openSUSE mailing list admin
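In case someone wants to try reproducing it outside production, a sketch of the minimal trigger based on the excerpt above - the same mysqlcheck invocation the backup script uses, run on one node of a disposable test cluster (the log path is an assumption):

    # on one node of a throwaway test cluster:
    /usr/bin/mysqlcheck --all-databases --skip-database=lost+found \
        --auto-repair --optimize -u root

    # meanwhile, on every node, watch for the crash signature:
    tail -f /var/log/mysql/mysqld.log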