[heroes] Postmortem for download.opensuse.org unplanned outage on 2018-10-03
Hello,

on Tuesday 2018-10-02 around 22:00 UTC the download.opensuse.org mirrorbrain service became unavailable (relevant ticket: [1]).

Christian Boltz reacted almost immediately. As he reported on the heroes@opensuse.org mailing list [2], dmesg showed a filesystem corruption on the /srv partition of the machine that hosts download.opensuse.org, called pontifex2. To work around the issue, he started mirrorbrain on provo-mirror.opensuse.org and redirected all traffic from pontifex2 to provo-mirror. This served a large share of the requests well, but a number of issues remained, both known and newly emerged:

- the mirrorbrain database on provo-mirror was quite outdated
- there was an outdated SSL certificate
- there is no IPv6 in our openSUSE infrastructure in Provo

The next day, 2018-10-03, Theo and Martin took over. The dmesg errors matched what any attempt to access the /srv partition showed:

pontifex2 (download.o.o):~ # ls /srv
ls: cannot access '/srv': Input/output error

Additionally, we realized that the disk was full, and after discussing the issue on #opensuse-admin on Freenode between various parties, we came to the conclusion that the full disk could even be causing the IO errors. We could not find any indication, though, whether this was an LVM, XFS or kernel issue. Even with a full disk, mirrorbrain should still have been operating and should even have healed itself.

The first step was to point the download.o.o and downloadcontent.o.o domains to provo-mirror immediately, so that we could stop services on pontifex2. Then we stopped all the services that were trying to access /srv. Afterwards we increased the disk space to 2 TB on the partition and on the PV, but gave only one extra TB to the LV, as a workaround for our above-mentioned wild guess; a rough sketch of this procedure is appended at the end of this mail. We ran a filesystem check, which took some time but indicated no errors. After rebooting the machine, everything was working again. We switched the DNS records back, and everything was back to normal around 13:00 UTC.

Steps to be taken to avoid the issue in the future:

- Try to reproduce the IO error on a similar setup using LVM, XFS and Leap 42.3 or 15.x (a possible test recipe is also sketched at the end of this mail). That should make it clearer why this weird IO error happens and how to fix it properly in production.
- Figure out why /srv ran out of space so fast.
- Gather all the issues that prevented provo-mirror from acting as a proper mirrorbrain replacement and fix them, in order to have a better failover for any future planned or unplanned outage of pontifex2.

[1] https://progress.opensuse.org/issues/41933
[2] https://lists.opensuse.org/heroes/2018-10/msg00002.html

Special thanks to:
- Martin Caj
- Christian Boltz
- Per Jessen
- Marcus Rueckert (for his valuable consulting)

Theo, on behalf of the openSUSE Heroes team
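As mentioned above, here is a rough sketch of how the /srv recovery looked. It is a minimal reconstruction, not a transcript of the exact commands run on pontifex2; the device, VG and LV names (/dev/vdb, vg0, srv) are assumptions:

# after the underlying disk/partition has already been grown to 2 TB
pontifex2:~ # pvresize /dev/vdb                # let LVM pick up the enlarged disk
pontifex2:~ # lvextend -L +1T /dev/vg0/srv     # give only one extra TB to the LV
pontifex2:~ # xfs_repair -n /dev/vg0/srv       # read-only filesystem check, with /srv unmounted
pontifex2:~ # xfs_repair /dev/vg0/srv          # repair pass; in our case it indicated no errors
pontifex2:~ # mount /srv
pontifex2:~ # xfs_growfs /srv                  # grow XFS into the new LV space (must be mounted)

Leaving part of the new space unallocated in the VG also gives us room to extend /srv again quickly, should it fill up once more.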
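For the reproduction item in the list above, something along these lines on a Leap 42.3 or 15.x test VM should do; the VG name and sizes are placeholders:

test:~ # lvcreate -L 20G -n srvtest vg0        # scratch LV on an LVM test volume group
test:~ # mkfs.xfs /dev/vg0/srvtest
test:~ # mount /dev/vg0/srvtest /mnt
test:~ # dd if=/dev/zero of=/mnt/fill bs=1M    # fill the filesystem until it hits ENOSPC
test:~ # dmesg | tail -n 50                    # watch for XFS / IO errors once the disk is full
test:~ # ls /mnt                               # compare with the 'Input/output error' seen on /srv

If ls starts failing with IO errors on the full test filesystem, we have a reproducer; if not, the LVM/XFS/kernel combination on pontifex2 needs a closer look.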