I have a help request response ready, but can't submit it. :(
--
Evolution as taught in public schools is, like religion, based on faith, not based on science.
Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!
Felix Miata
On Mon, 17 Jan 2022 19:49:00 -0500, Felix Miata wrote:
I have a help request response ready, but can't submit it. :(
I can confirm that they're down - the forums admin e-mail alias got a notification from another user as well, and I am not able to reach the site - the error reports that the site is unresponsive (tunnel connection failed). Name resolution seems to be working OK.

Jim
--
Jim Henderson
Please keep on-topic replies on the list so everyone benefits
On 18/01/2022 01.49, Felix Miata wrote:
I have a help request response ready, but can't submit it. :(
I fixed it and here is the writeup: https://www.reddit.com/r/openSUSE/comments/s6s4u6/service_outage_postmortem/
Some people noticed problems with our forums earlier today. They were down from around 00:01 to 06:16 UTC.
Other services were also affected: the wiki and id.o.o, and with that, logins to openqa and chat.o.o were impossible as well.
So why did this outage happen? Here is what I found: on our login-proxy we have a custom AppArmor profile for apache2-worker / httpd processes to limit what they can do in case of a break-in. Also, we have a symlink `/etc/systemd/system/timers.target.wants/suse-online-update.timer -> /usr/lib/systemd/system/suse-online-update.timer` that enables a timer which auto-installs updates daily.
These two nice security features both did what we asked them to, and since computers do what we say (not what we mean), the updater installed a new `apache2-worker-2.4.51-3.37.1` package that included a minor upstream version update from the previous 2.4.43 version. The new version wanted to create a `/run/httpd.pid.Gy7vP` on start, but the AppArmor profile prevented that, so startup failed. As a result, there was no proxying to the services behind it, and not even a nice error-503 page with an upside-down chameleon.
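To make the mismatch concrete, the relevant profile rules might look roughly like the fragment below. This is only a sketch: the actual profile is not shown in this thread, the `rw` permissions are an assumption, and only the two paths come from the writeup. In AppArmor path globs, `*` matches any run of characters other than `/`, so a rule for the exact path `/run/httpd.pid` does not cover a temporary name such as `/run/httpd.pid.Gy7vP`, while `/run/httpd.pid*` does.

```text
# Hypothetical excerpt from the custom apache2 profile (permissions assumed).

# Old rule: matches only the exact path, so /run/httpd.pid.Gy7vP is denied.
/run/httpd.pid rw,

# Widened rule: '*' matches any characters except '/', which covers
# /run/httpd.pid as well as temporary names like /run/httpd.pid.Gy7vP.
/run/httpd.pid* rw,
```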
And since the job defaults to "daily", which seems to be an alias for 00:00, there were no admins around to fix it.
So now, apart from the immediate fix of the AppArmor profile to allow not just `/run/httpd.pid` but also `/run/httpd.pid*`, I used `systemctl edit suse-online-update.timer` to ensure that updates happen at a more convenient time of day:
[Timer]
OnCalendar=*-*-* 8:00:00
So chances are, the next outage will not be as long.
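For reference, `systemctl edit suse-online-update.timer` stores its changes in a drop-in file, normally `/etc/systemd/system/suse-online-update.timer.d/override.conf`. The sketch below shows what such a drop-in could contain. It assumes the packaged timer sets `OnCalendar=daily`, which the writeup implies but does not quote. One detail worth checking: `OnCalendar=` is a list setting, so a drop-in that only adds a new value triggers in addition to the packaged one, and an empty assignment is needed to clear the list first. The snippet as posted does not show such a reset, and whether the real override contains one is not visible here.

```ini
# Sketch of /etc/systemd/system/suse-online-update.timer.d/override.conf
# (the usual location for "systemctl edit" drop-ins).
[Timer]
# Clear the inherited schedule (assumed to be OnCalendar=daily, i.e. 00:00).
OnCalendar=
# Run every day at 08:00 in the system's local time zone.
OnCalendar=*-*-* 8:00:00
```

To check the result, `systemd-analyze calendar '*-*-* 8:00:00'` prints the normalized expression and its next elapse time, and `systemctl list-timers suse-online-update.timer` shows when the timer will actually fire next.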
Ciao Bernhard M.
On 18/01/2022 09.00, Bernhard M. Wiedemann wrote: ...
So now, apart from the immediate fix of the AppArmor profile to allow not just `/run/httpd.pid` but also `/run/httpd.pid*`, I used `systemctl edit suse-online-update.timer` to ensure that updates happen at a more convenient time of day:
[Timer]
OnCalendar=*-*-* 8:00:00
So chances are, the next outage will not be as long.
What about weekends? ;-)
--
Cheers / Saludos,
Carlos E. R.
(from oS Leap 15.3 x86_64 (Erebor-4))
On 18/01/2022 12.16, Carlos E. R. wrote:
What about weekends? ;-)
We tend to be awake on weekends as well ;-)
The old "daily" included weekends as well. The expectation is that breakages will be the exception and not happen every month.

Ciao
Bernhard M.
On Tue, 18 Jan 2022 09:00:16 +0100, Bernhard M. Wiedemann wrote:
I fixed it and here is the writeup:
https://www.reddit.com/r/openSUSE/comments/s6s4u6/service_outage_postmortem/
Thanks, Bernhard, appreciate the info and the quick resolution.
--
Jim Henderson
Please keep on-topic replies on the list so everyone benefits