Re: [heroes] postmortem: mariadb galera cluster outage 2018-01-03 19:30 UTC

10 Jan 2018

Hello,

On Wednesday, 10 January 2018, 11:54:03 CET, Theo Chatzimichos wrote:
> ...
> It happened again twice, so it was definitely not a logrotate issue,
> but instead it was an upstream bug, possibly this [1]. So now I
> updated the hosts and mariadb was updated to a new patch version.
> Let's see if it crashes again now, if it does I'll file a ticket
> against our package.
>
> [1] https://jira.mariadb.org/browse/MDEV-12023

It turned out that the mariadb update didn't change anything, and the 
cluster crashed at 19:31 UTC again - at least it was timely ;-)

At least now we know what triggers the crash - as I already guessed [1]
yesterday, it's the backup script (no kidding!).

This script does database dumps and then optimizes all tables. The good 
news is that creating the database dumps works. The problematic part is 
optimizing all tables, therefore we disabled this part of the script 
now. After this change, the galera cluster survived two test runs of the 
backup script.

The relevant part of the script that triggers the crash is:

MYSQL_CHECK="/usr/bin/mysqlcheck"
# ...
MYSQL="/usr/bin/mysql"
# ...
function optimize() {
        if [ -x "$MYSQL_CHECK" ]; then
                LOG "Starting automatic repair and optimization of the databases/tables"
                "$MYSQL_CHECK"  \
                                        --all-databases \
                                        --skip-database=lost+found \
                                        --compress \
                                        --auto-repair \
                                        --optimize \
                                        -u root 1>/dev/null 2>"$TMPFILE"
                "$MYSQL" -e "FLUSH QUERY CACHE;" 2>>"$TMPFILE"
        fi
}

It typically takes 2 seconds from writing the log entry to the crash.
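
For illustration, here is a minimal sketch of how the optimize step could be made switchable instead of being removed from the script. The RUN_OPTIMIZE variable and the maybe_optimize wrapper are hypothetical names that do not exist in the real backup script; the sketch only assumes that optimize() is defined as shown above.

```shell
#!/bin/bash
# Hypothetical switch, NOT part of the original backup script.
# Defaults to "no" so that the crashing step stays disabled.
RUN_OPTIMIZE="${RUN_OPTIMIZE:-no}"

# Hypothetical wrapper: call optimize() only when explicitly enabled.
maybe_optimize() {
        if [ "$RUN_OPTIMIZE" = "yes" ]; then
                optimize
        else
                echo "Skipping table optimization (RUN_OPTIMIZE=$RUN_OPTIMIZE)"
        fi
}
```

The script would then call maybe_optimize where it used to call optimize, and the step could be re-enabled for a test run with RUN_OPTIMIZE=yes without editing the script again.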

I found two bug reports that describe our problem exactly, including an
exact match of our mysqld.log:
https://github.com/codership/galera/issues/486
https://jira.percona.com/browse/PXC-881

I also found
http://msutic.blogspot.de/2015/10/confusion-and-problems-with-lostfound.html
- but the precondition "/var/lib/mysql/lost+found/ exists" doesn't apply to
our setup. However, we do have a root-owned mysql_upgrade_info file, on
galera1 only. It's a file, not a directory like lost+found would be, but it
_could_ [2] still somehow be related.
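
To look for similar ownership oddities on the other nodes, something along these lines might help. This is only a sketch: the function name list_not_owned_by is made up, and the example call assumes that the data directory is /var/lib/mysql and that everything in it should belong to the mysql user.

```shell
#!/bin/bash
# list_not_owned_by DIR OWNER
# Print the top-level entries of DIR that are NOT owned by OWNER
# (OWNER may be a user name or a numeric user ID).
list_not_owned_by() {
        local dir="$1" owner="$2"
        find "$dir" -mindepth 1 -maxdepth 1 ! -user "$owner"
}

# Example call (assumed path and user; run on each galera node):
#   list_not_owned_by /var/lib/mysql mysql
```

Empty output would mean that no top-level file or directory has an unexpected owner.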

Regards,

Christian Boltz

[1] educated guess after reading the backup script, checking the 
    database dumps' content and timestamps etc.

[2] wild guess, and given that lost+found looks like a database 
    directory to mysql while a file doesn't, I doubt the 
    mysql_upgrade_info file is really the problem. OTOH - who would
    have thought that optimizing all tables crashes the cluster? ;-)
-- 
In C we had to code our own bugs. In C++ we can inherit them.
[Prof. Gerald Karam]
