Hi Bernhard,

Thanks for saving this!

On Sat, 22 May 2021 10:35:40 +0200 "Bernhard M. Wiedemann" <bwiedemann@suse.de> wrote:
> My mirror hasn't received a single iso or package (ZYpp) download since yesterday morning. [21/May/2021:08:56:32 +0200]
>
> I fixed it for now by switching pontifex2:/etc/apache2/vhosts.d/_download.conf to use mirrordb1 again.
>
> mirrordb2's postgres would not start (probably replication needs some kicking again), but I wonder if that replication is really worth all the hassle.
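Just to make the switch above concrete: assuming the usual MirrorBrain setup, where the download vhost talks to PostgreSQL through Apache's mod_dbd, it boils down to changing the database host in the DBD connection string. The following is only a sketch; the directive contents and credentials are assumptions, not the real file:

    # on pontifex2: check which DB host the vhost currently uses
    grep -n "DBDParams" /etc/apache2/vhosts.d/_download.conf
    #   DBDParams "host=mirrordb2 dbname=mirrorbrain user=mirrorbrain"
    # reads normally go to the slave (mirrordb2); point them back at the
    # master while mirrordb2's postgres is down, then reload Apache
    sed -i 's/host=mirrordb2/host=mirrordb1/' /etc/apache2/vhosts.d/_download.conf
    systemctl reload apache2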
A look into the log file in /var/lib/pgsql/log showed:

    could not open usermap file "/var/lib/pgsql/data/pg_ident.conf": No such file or directory

So this was a leftover from the migration to version 13 => my fault, but nothing specific to the master -> slave setup (note: no HA). As mirrordb2 was too far out of sync, I simply recovered it via repmgr, so it is up again now, but currently without connections.

Luckily, we have a second system that I did not break during the upgrade, so you could switch over to the unaffected one (where I had already linked the correct file, which allows it to survive the reboot for the final 15.3 kernel).
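For the record, a recovery of this kind looks roughly like the following. Treat it as a sketch only, not as the exact commands that were run; the repmgr.conf path and the repmgr user and database name are assumptions:

    # 1) the immediate cause: the usermap file was missing after the
    #    upgrade to 13 - an empty file (or a link to the preserved copy)
    #    is enough if no user name mapping is configured
    touch /var/lib/pgsql/data/pg_ident.conf
    chown postgres:postgres /var/lib/pgsql/data/pg_ident.conf

    # 2) mirrordb2 was too far out of sync, so re-seed it from the
    #    primary with repmgr and register it again
    sudo -u postgres repmgr -f /etc/repmgr.conf \
        -h mirrordb1 -U repmgr -d repmgr standby clone --force
    systemctl start postgresql
    sudo -u postgres repmgr -f /etc/repmgr.conf standby register --force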
> Often enough HA setups decrease availability via increased complexity causing increased fragility.
I have to admit that I hear such statements over and over from people - and I have started to become a bit allergic to such generic statements. So please take the following with some sarcasm...

I hope that nobody wants to sit in a plane or boat that has no redundancy. No second motor, no life raft, nothing. If something breaks, "luckily it's just the single plane or boat that is affected". This is how you want to run your software stack. It might be fine for playground projects, but even openSUSE, with its hundreds or thousands of users, deserves something better. At least in my eyes.

Feel free to remove mirrordb2 (and mirrordb3 in Provo). But I bet the problem only arose because I forgot a symlink; this looked more like an OSI layer 7 problem to me.

Our PostgreSQL setup in particular is currently not really highly available. It is just a master plus a slave, for performance reasons. We had quite some load problems in the past when there was just a single instance; that's why we split the reads (from pontifex) and the writes (from olaf + pontifex) between the two nodes. This might no longer be necessary, as our current machines have more resources, better software and are also better tuned to survive the overall load - thanks to the evolution of the software and of the people who took care of it over all the years.

These people learned a lot while setting up and maintaining the systems over the years. Now they have moved on and have other interests. It's fair for the "newbies" (I don't see the people currently handling the infrastructure as real newbies, but I think they are new to these parts of the openSUSE infra) to question the setup and even to do things differently. But please keep a disaster like a power outage or a burning data center in mind: IMHO you still want to deliver our service, so there has to be some "magic" that allows a failover.

The current setup even turned out to be kind of "maintenance resistant": we can do maintenance on our infrastructure without having to worry too much about our customers, as most of the systems are redundant.

BTW: sadly, some service frontends are not able to handle outages. A (temporarily) lost DB connection, for example, is still a problem for some of them. This problem does not go away if you place the DB on the node itself - that way you just reduce the pressure on the developers to fix their code.

With kind regards,
Lars

PS: The additional node in Provo (aka mirrordb3) is currently not connected to the nodes in NUE, but holds an old copy of their databases - to be ready for the next power outage of the Nuremberg office. My plan was to add this node (via the commands above) as an additional slave of mirrordb1 in Nuremberg. But if there is an objection to running this kind of "HA setup" (and I still say it's just master - slave ;-), I'm happy to leave it as it is. It just means that we will have (as during the last DC downtime) no current data in Provo and cannot switch over.
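PPS: Concretely, attaching mirrordb3 would again be just a handful of repmgr commands. The following is only a sketch under the same assumptions as above (repmgr.conf path, repmgr user and database name), not necessarily the exact commands meant by "the commands above":

    # on mirrordb3 in Provo: seed it from the primary in NUE and
    # register it as an additional standby
    sudo -u postgres repmgr -f /etc/repmgr.conf \
        -h mirrordb1 -U repmgr -d repmgr standby clone --force
    systemctl start postgresql
    sudo -u postgres repmgr -f /etc/repmgr.conf standby register --force
    sudo -u postgres repmgr -f /etc/repmgr.conf cluster show

    # during a NUE outage the Provo copy could then be promoted and the
    # frontends pointed at it
    sudo -u postgres repmgr -f /etc/repmgr.conf standby promote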