[heroes] Postmortem for download.opensuse.org unplanned outage on 2018-10-03
Hello,

on Tuesday 2018-10-02 around 22:00 UTC the download.opensuse.org mirrorbrain service became unavailable (relevant ticket: [1]).

Christian Boltz reacted almost immediately. As he reported on the heroes@opensuse.org mailing list [2], dmesg showed a filesystem corruption on the /srv partition of the machine that hosts download.opensuse.org, called pontifex2. To work around the issue, he started mirrorbrain on provo-mirror.opensuse.org and redirected all traffic from pontifex2 to provo-mirror. This served a large share of the requests well, but a number of issues remained, both known and newly emerged:

- the mirrorbrain database on provo-mirror was quite outdated
- there was an outdated SSL certificate
- there is no IPv6 in our openSUSE infrastructure in Provo

The next day, 2018-10-03, Theo and Martin took over. The dmesg errors matched what any attempt to access the /srv partition showed:

pontifex2 (download.o.o):~ # ls /srv
ls: cannot access '/srv': Input/output error

Additionally, we realized that the disk was full, and after discussing the issue on #opensuse-admin on Freenode between various parties, we came to the conclusion that the full disk could even be causing the IO errors. We could not find any indication, though, whether this was an LVM, XFS or kernel issue. Even with a full disk, mirrorbrain should still have been operating and should even have healed itself.

The first step was to point the download.o.o and downloadcontent.o.o domains to provo-mirror immediately, so that we could stop services on pontifex2. Then we stopped all the services that were trying to access /srv. Afterwards we increased the disk space to 2 TB on the partition and on the PV, but gave only one extra TB to the LV, as a workaround for our above-mentioned wild guess; a rough sketch of this procedure is appended at the end of this mail. We ran a filesystem check, which took some time but indicated no errors. After rebooting the machine, everything was working again. We switched the DNS records back, and everything was back to normal around 13:00 UTC.

Steps to be taken to avoid the issue in the future:

- Try to reproduce the IO error on a similar setup using LVM, XFS and Leap 42.3 or 15.x (a possible test recipe is also sketched at the end of this mail). That should make it clearer why this weird IO error happens and how to fix it properly in production.
- Figure out why /srv ran out of space so fast.
- Gather all the issues that prevented provo-mirror from acting as a proper mirrorbrain replacement and fix them, in order to have a better failover for any future planned or unplanned outage of pontifex2.

[1] https://progress.opensuse.org/issues/41933
[2] https://lists.opensuse.org/heroes/2018-10/msg00002.html

Special thanks to:
- Martin Caj
- Christian Boltz
- Per Jessen
- Marcus Rueckert (for his valuable consulting)

Theo, on behalf of the openSUSE Heroes team
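As mentioned above, here is a rough sketch of how the /srv recovery looked. It is a minimal reconstruction, not a transcript of the exact commands run on pontifex2; the device, VG and LV names (/dev/vdb, vg0, srv) are assumptions:

# after the underlying disk/partition has already been grown to 2 TB
pontifex2:~ # pvresize /dev/vdb                # let LVM pick up the enlarged disk
pontifex2:~ # lvextend -L +1T /dev/vg0/srv     # give only one extra TB to the LV
pontifex2:~ # xfs_repair -n /dev/vg0/srv       # read-only filesystem check, with /srv unmounted
pontifex2:~ # xfs_repair /dev/vg0/srv          # repair pass; in our case it indicated no errors
pontifex2:~ # mount /srv
pontifex2:~ # xfs_growfs /srv                  # grow XFS into the new LV space (must be mounted)

Leaving part of the new space unallocated in the VG also gives us room to extend /srv again quickly, should it fill up once more.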
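For the reproduction item in the list above, something along these lines on a Leap 42.3 or 15.x test VM should do; the VG name and sizes are placeholders:

test:~ # lvcreate -L 20G -n srvtest vg0        # scratch LV on an LVM test volume group
test:~ # mkfs.xfs /dev/vg0/srvtest
test:~ # mount /dev/vg0/srvtest /mnt
test:~ # dd if=/dev/zero of=/mnt/fill bs=1M    # fill the filesystem until it hits ENOSPC
test:~ # dmesg | tail -n 50                    # watch for XFS / IO errors once the disk is full
test:~ # ls /mnt                               # compare with the 'Input/output error' seen on /srv

If ls starts failing with IO errors on the full test filesystem, we have a reproducer; if not, the LVM/XFS/kernel combination on pontifex2 needs a closer look.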