[heroes] Please inform admins if you change something
Hi

I was just checking status.opensuse.org today to see if there is something where I can help - and was very surprised that "my" machine, keyserver.opensuse.org, was marked as "has major issues"...

The fun started here:
2017-08-04 15:07:39|install|sks|1.1.6-1.1|x86_64||repo-oss|379efdc6ef2ed4de875d14414e43de8f4e1a6a17|

which introduced the new sks package that switched over to the "machine user/group scheme" requested by Ludwig: the new sks user and group now have a "_" in front: user:_sks; group:_sks - while the old package had simply user:sks; group:sks

Whoever logged in and installed that package - instead of keeping the service down and NOT reading the changelog of the package:

* Do Jan 19 2017 lars@linux-schulserver.de
- follow the recommended user/group naming scheme:
  https://github.com/LinuxStandardBase/lsb/pull/21 and
  https://lists.opensuse.org/opensuse-factory/2015-04/msg00336.html
  => using _sks user and _sks group now

...could you imagine that informing the administrator of the machine at least after you break something would have helped?

If you want to take over the machine, feel free to do so, but talking to people you should... :-/

with kind regards,
Lars
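A user rename like the one described above (sks to _sks) can leave files behind that are still owned by the old account. The thread does not say which files were affected on the keyserver, so the following is only a minimal sketch of how one might look for such leftovers. The helper names `uid_for` and `paths_not_owned_by` are hypothetical, and the data directory `/var/lib/sks` in the usage comment is an assumption, not something the original message confirms.

```python
import os
import pwd


def uid_for(user):
    """Resolve a user name to its numeric uid (raises KeyError if unknown)."""
    return pwd.getpwnam(user).pw_uid


def paths_not_owned_by(root, expected_uid):
    """Return all paths under root (root included) not owned by expected_uid."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            candidates.append(os.path.join(dirpath, name))
    # lstat() so that symlinks are checked themselves, not their targets
    return sorted(p for p in candidates if os.lstat(p).st_uid != expected_uid)


# Usage (assumed path, needs the _sks user to exist on the host):
#   for path in paths_not_owned_by("/var/lib/sks", uid_for("_sks")):
#       print(path)
```

Any path this prints would still belong to some other account, for example the old sks user, and may need a manual ownership change before the service can start again.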
On Wed, Aug 09, 2017 at 06:46:30PM +0200, Lars Vogdt wrote:
Hi
I was just checking status.opensuse.org today to see if there is something where I can help - and was very surprised that "my" machine, keyserver.opensuse.org, was marked as "has major issues"...
The fun started here: 2017-08-04 15:07:39|install|sks|1.1.6-1.1|x86_64||repo-oss|379efdc6ef2ed4de875d14414e43de8f4e1a6a17|
which introduced the new sks package that switched over to the "machine user/group scheme" requested by Ludwig: the new sks user and group now have a "_" in front: user:_sks; group:_sks - while the old package had simply user:sks; group:sks
Whoever logged in and installed that package - instead of keeping the service down and NOT reading the changelog of the package: * Do Jan 19 2017 lars@linux-schulserver.de - follow the recommended user/group naming scheme: https://github.com/LinuxStandardBase/lsb/pull/21 and https://lists.opensuse.org/opensuse-factory/2015-04/msg00336.html => using _sks user and _sks group now
...could you imagine that informing the administrator of the machine at least after you break something would have helped?
If you want to take over the machine, feel free to do so, but talking to people you should... :-/
Hello Lars,

I am the person to blame for this. I updated all heroes-managed 42.2 and sle12sp2 machines to 42.3, so statistically one of them had to break :)

I did announce that I will do the upgrade on the heroes meeting 10 days ago, see cboltz's summary. I saw that the service broke, and I tried to fix it but I couldn't figure out what was the problem. I did not send you a mail to inform you about the situation because of bad reasons, I don't have a really good excuse for this.

Either way, point taken, I will stay away from major upgrades from this machine from now on. Many apologies for breaking it and for not informing you immediately, and thanks for fixing it fast

Theo
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 10 Aug 2017 11:18:51 +0200, Theo Chatzimichos <tampakrap@opensuse.org> wrote:
I am the person to blame for this. I updated all heroes-managed 42.2 and sle12sp2 machines to 42.3, so statistically one of them had to break :)
I understand your intention, but keyserver as well as status, daffy, elsa and anna were running 42.3 even before the official release - so I don't understand why you did something that was not needed at all...
I did announce that I will do the upgrade on the heroes meeting 10 days ago, see cboltz's summary.
My fault not to read this, sorry.
Either way, point taken, I will stay away from major upgrades from this machine from now on.
No, sorry: point not taken. Please get it correct: I'm not against you (or someone else) trying to fix or enhance things - rather the opposite... ;-)

But I'm against doing something hidden (which was my fault, as I could have read the minutes, so just take this as one argument in my first post that is not right any more) and especially against not talking to the responsible guys if you break something.

=> doing something good is never a problem
=> fixing/enhancing things is also never a problem
=> making a mistake is not nice, but happens all the time
=> making a mistake and learning from it is the way it goes since school
=> NOT TALKING is the real problem.

My apologies for not reading the minutes (which is some kind of talking) so I'm also guilty, but I would have expected that you directly inform the admin of the machine in case of a problem.

Especially something like a system upgrade (or the de-installation of packages, changed configs, and so on), that might break things, should be communicated to the main admin before starting the work. You might benefit from a helping hand at the right time - or the admin might want to do that stuff on his own anyway, which frees your time.

I know from my own experience that sometimes you just install a normal package update and stuff breaks - and you might spend a lot of time fixing a service that you never wanted to touch. Regardless of whether you are successful in the end or not: please inform the admin of the service! Inform him about what happened and why - and maybe also what you did to fix it.

Missing communication is my concern, not that you did something. I hope this clarifies it?

Now, let's use this case to learn and enhance our communication.

I am asking myself how we can enhance our internal communication in case something like this happens again? I know that everybody is reachable via different ways - and some communication platforms might not be useful for everyone - so we might need some "table"(?) somewhere where everyone puts some information on how someone else inside the Heroes team can reach her or him.

We might also think about creating a ticket as initial step (consolidating the information that was gathered so far) before or while we try to reach an admin, so we already have all information at hand when this admin is available (or we want to include someone else)?

So my suggestion for an emergency workflow like "service down - and the main admin is not reachable" is:

1. create a ticket at https://progress.opensuse.org and fill it in with the initial problem/experiment/whatever that led to the problem (including the start time, if it differs too much from the ticket creation time).
2. enhance the ticket with all the attempts made so far to restore the service
3. assign the ticket to the original maintainer/admin of the service
4. try to reach the admin via the reachability table mentioned above
5. try to include others that might be able to help by pointing them to the ticket
6. everyone involved should update the ticket with his findings until the problem is solved
7. once the problem is solved and might happen again, it's up to the one who created the problem (if a service stopped working without reason, this is the main admin) to write/enhance a wiki page in our internal admin wiki with a runbook (emergency documentation) about the service, the problem and the solution.

If the points above make sense to everybody, I would like to get this documented in our wiki. - OK?

Regards,
Lars
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iEYEARECAAYFAlmPGHUACgkQzgVLKvYrdYRbIgCgy8Kako8Muyjkhdia7tjDsR1a
OKMAnivB6CJ51JagUWYEDggkEjL8bksj
=MV48
-----END PGP SIGNATURE-----
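The emergency workflow proposed in the message above could be modelled roughly as follows. This is only an in-memory sketch to make the responsibilities in steps 1, 2, 3, 6 and 7 concrete: the `Incident` class and its fields are hypothetical, it does not talk to progress.opensuse.org, and the contact steps 4 and 5 are left out because the reachability table does not exist yet.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical stand-in for a ticket on progress.opensuse.org."""
    service: str
    problem: str                     # step 1: what led to the outage
    started_at: datetime             # step 1: when the problem started
    main_admin: str                  # usual maintainer of the service
    caused_by: Optional[str] = None  # None if the service failed on its own
    assignee: Optional[str] = None
    log: List[str] = field(default_factory=list)
    resolved: bool = False

    def record(self, who: str, note: str) -> None:
        """Steps 2 and 6: append every attempt and finding to the ticket."""
        self.log.append(f"{who}: {note}")

    def assign_to_admin(self) -> None:
        """Step 3: assign the ticket to the original maintainer."""
        self.assignee = self.main_admin

    def resolve(self) -> str:
        """Step 7: mark as solved and return who should write the runbook."""
        self.resolved = True
        return self.caused_by or self.main_admin


# Illustrative values taken loosely from this thread
incident = Incident(
    service="keyserver.opensuse.org",
    problem="sks stopped working after the upgrade to 42.3",
    started_at=datetime(2017, 8, 4, 15, 7, 39),
    main_admin="lars",
    caused_by="theo",
)
incident.record("theo", "tried to fix the service, cause not found")
incident.assign_to_admin()
runbook_owner = incident.resolve()
```

The point of `resolve()` returning a name is that step 7 assigns the runbook to whoever triggered the problem, and falls back to the main admin when the service stopped without an obvious cause.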