Hi Bernhard,

Thanks for saving this!

On Sat, 22 May 2021 10:35:40 +0200 "Bernhard M. Wiedemann" <bwiedemann@suse.de> wrote:
> My mirror hasn't received a single iso or package (ZYpp) download since yesterday morning. [21/May/2021:08:56:32 +0200]
>
> I fixed it for now by switching pontifex2:/etc/apache2/vhosts.d/_download.conf to use mirrordb1 again.
>
> mirrordb2's postgres would not start (probably replication needs some kicking again), but I wonder if that replication is really worth all the hassle.
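Just to make the switch above concrete: assuming the usual MirrorBrain setup, where the download vhost talks to PostgreSQL through Apache's mod_dbd, it boils down to changing the database host in the DBD connection string. The following is only a sketch; the directive contents and credentials are assumptions, not the real file:

    # on pontifex2: check which DB host the vhost currently uses
    grep -n "DBDParams" /etc/apache2/vhosts.d/_download.conf
    #   DBDParams "host=mirrordb2 dbname=mirrorbrain user=mirrorbrain"
    # reads normally go to the slave (mirrordb2); point them back at the
    # master while mirrordb2's postgres is down, then reload Apache
    sed -i 's/host=mirrordb2/host=mirrordb1/' /etc/apache2/vhosts.d/_download.conf
    systemctl reload apache2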
A look into the log file in /var/lib/pgsql/log showed:

    could not open usermap file "/var/lib/pgsql/data/pg_ident.conf": No such file or directory

So this was a leftover from the migration to version 13 => my fault, but nothing specific to the master -> slave setup (note: no HA). As mirrordb2 was too far out of sync, I simply recovered it via repmgr, so it is up again now, but currently without connections.

Luckily, we have a second system that I did not break during the upgrade, so you could switch over to the unaffected one (where I had already linked the correct file, which allows it to survive the reboot for the final 15.3 kernel).
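For the record, a recovery of this kind looks roughly like the following. Treat it as a sketch only, not as the exact commands that were run; the repmgr.conf path and the repmgr user and database name are assumptions:

    # 1) the immediate cause: the usermap file was missing after the
    #    upgrade to 13 - an empty file (or a link to the preserved copy)
    #    is enough if no user name mapping is configured
    touch /var/lib/pgsql/data/pg_ident.conf
    chown postgres:postgres /var/lib/pgsql/data/pg_ident.conf

    # 2) mirrordb2 was too far out of sync, so re-seed it from the
    #    primary with repmgr and register it again
    sudo -u postgres repmgr -f /etc/repmgr.conf \
        -h mirrordb1 -U repmgr -d repmgr standby clone --force
    systemctl start postgresql
    sudo -u postgres repmgr -f /etc/repmgr.conf standby register --force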
> Often enough HA setups decrease availability via increased complexity causing increased fragility.
I have to admit that I hear such statements over and over from people - and I have started to become a bit allergic to such generic statements. So please take the following with some sarcasm...

I hope that nobody wants to sit in a plane or boat that has no redundancy. No second motor, no life raft, nothing. If something breaks, "luckily it's just the single plane or boat that is affected". This is how you want to run your software stack. It might be fine for playground projects, but even openSUSE, with its hundreds or thousands of users, deserves something better. At least in my eyes.

Feel free to remove mirrordb2 (and mirrordb3 in Provo). But I bet the problem only arose because I forgot a symlink; this looked more like an OSI layer 7 problem to me.

Our PostgreSQL setup in particular is currently not really highly available. It is just a master plus a slave, for performance reasons. We had quite some load problems in the past when there was just a single instance; that's why we split the reads (from pontifex) and the writes (from olaf + pontifex) between the two nodes. This might no longer be necessary, as our current machines have more resources, better software and are also better tuned to survive the overall load - thanks to the evolution of the software and of the people who took care of it over all the years.

These people learned a lot while setting up and maintaining the systems over the years. Now they have moved on and have other interests. It's fair for the "newbies" (I don't see the people currently handling the infrastructure as real newbies, but I think they are new to these parts of the openSUSE infra) to question the setup and even to do things differently. But please keep a disaster like a power outage or a burning data center in mind: IMHO you still want to deliver our service, so there has to be some "magic" that allows a failover.

The current setup even turned out to be kind of "maintenance resistant": we can do maintenance on our infrastructure without having to worry too much about our customers, as most of the systems are redundant.

BTW: sadly, some service frontends are not able to handle outages. A (temporarily) lost DB connection, for example, is still a problem for some of them. This problem does not go away if you place the DB on the node itself - that way you just reduce the pressure on the developers to fix their code.

With kind regards,
Lars

PS: The additional node in Provo (aka mirrordb3) is currently not connected to the nodes in NUE, but holds an old copy of their databases - to be ready for the next power outage of the Nuremberg office. My plan was to add this node (via the commands above) as an additional slave of mirrordb1 in Nuremberg. But if there is an objection to running this kind of "HA setup" (and I still say it's just master - slave ;-), I'm happy to leave it as it is. It just means that we will have (as during the last DC downtime) no current data in Provo and cannot switch over.
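PPS: Concretely, attaching mirrordb3 would again be just a handful of repmgr commands. The following is only a sketch under the same assumptions as above (repmgr.conf path, repmgr user and database name), not necessarily the exact commands meant by "the commands above":

    # on mirrordb3 in Provo: seed it from the primary in NUE and
    # register it as an additional standby
    sudo -u postgres repmgr -f /etc/repmgr.conf \
        -h mirrordb1 -U repmgr -d repmgr standby clone --force
    systemctl start postgresql
    sudo -u postgres repmgr -f /etc/repmgr.conf standby register --force
    sudo -u postgres repmgr -f /etc/repmgr.conf cluster show

    # during a NUE outage the Provo copy could then be promoted and the
    # frontends pointed at it
    sudo -u postgres repmgr -f /etc/repmgr.conf standby promote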