Mailinglist Archive: heroes (27 mails)

Communication (was: Re: [heroes] Please inform admins if you change something)

On Thu, 10 Aug 2017 11:18:51 +0200,
Theo Chatzimichos <tampakrap@xxxxxxxxxxxx> wrote:
> I am the person to blame for this. I updated all heroes-managed 42.2
> and sle12sp2 machines to 42.3, so statistically one of them had to
> break :)

I understand your intention, but keyserver, as well as status, daffy,
elsa and anna, were running 42.3 even before the official release - so
I don't understand why you did something that was not needed at all...

> I did announce that I will do the upgrade on the heroes
> meeting 10 days ago, see cboltz's summary.

My fault not to read this, sorry.

> Either way, point taken, I will stay away from major upgrades from
> this machine from now on.

No, sorry: point not taken. Please get me right: I'm not against you
(or anyone else) trying to fix or improve things - quite the
opposite... ;-)

But I am against doing things unannounced (here the fault was mine, as
I could have read the minutes, so please consider that argument from my
first post withdrawn) and especially against not talking to the people
responsible when you break something.

=> doing something good is never a problem
=> fixing/improving things is also never a problem
=> making a mistake is not nice, but it happens all the time
=> making a mistake and learning from it is how it has worked since
   school

=> NOT TALKING is the real problem. My apologies for not reading the
   minutes (which are a kind of talking), so I'm also guilty, but I
   would have expected you to inform the admin of the machine
   directly in case of a problem.

   Especially something like a system upgrade (or the de-installation
   of packages, changed configs, and so on) that might break things
   should be communicated to the main admin before the work starts.
   You might benefit from a helping hand at the right time - or the
   admin might want to do that work on their own anyway, which frees
   up your time.

   I know from my own experience that sometimes you just install a
   normal package update and stuff breaks - and you might spend a lot
   of time fixing a service that you never wanted to touch. Regardless
   of whether you are successful in the end or not: please inform the
   admin of the service! Tell them what happened and why - and maybe
   also what you did to fix it.

Missing communication is my concern, not that you did something.

I hope this clarifies it?

Now, let's use this case to learn and improve our communication. How
can we improve our internal communication in case something like this
happens again?

I know that everybody is reachable in different ways - and some
communication platforms might not be useful for everyone - so we might
need some kind of "table" somewhere where everyone puts information on
how others inside the Heroes team can reach her or him.

We might also think about creating a ticket as initial step
(consolidating the information that was gathered so far) before or
while we try to reach an admin, so we already have all information at
hand when this admin is available (or we want to include someone else)?

So my suggestion for an emergency workflow like "service down - and the
main admin is not reachable" is:
1. create a ticket and fill it in with the initial
   problem/experiment/whatever that led to the problem (including the
   start time, if it differs too much from the ticket creation time).
2. extend the ticket with all the attempts made so far to restore the
   service
3. assign the ticket to the original maintainer/admin of the service
4. try to reach the admin via the reachability table mentioned above
5. try to include others who might be able to help by pointing them to
   the ticket
6. everyone involved should update the ticket with their findings
   until the problem is solved
7. once the problem is solved, and if it might happen again, it's up to
   the one who caused the problem (if a service stopped working without
   reason, this is the main admin) to write/extend a page in our
   internal admin wiki with a runbook (emergency documentation) about
   the service, the problem and the solution.
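
To make the steps above a bit more concrete, here is a rough sketch of
what such a ticket could track. This is only an illustration: the class
name, the field names and the step texts are made up by me and do not
correspond to anything in our real ticket system.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class IncidentTicket:
    """Hypothetical record mirroring the emergency workflow steps."""
    service: str
    initial_problem: str                       # step 1: what led to the problem
    created_at: datetime
    incident_start: Optional[datetime] = None  # step 1: only if it differs
    attempts: List[str] = field(default_factory=list)  # step 2
    assignee: Optional[str] = None             # step 3: main admin
    findings: List[str] = field(default_factory=list)  # step 6
    solved: bool = False
    runbook_written: bool = False              # step 7

    def open_steps(self) -> List[str]:
        """Return the workflow steps that still need attention."""
        steps = []
        if not self.attempts:
            steps.append("record the attempts made so far")
        if self.assignee is None:
            steps.append("assign the ticket to the main admin")
        if not self.solved:
            steps.append("keep updating the ticket until solved")
        elif not self.runbook_written:
            steps.append("write or extend the runbook in the admin wiki")
        return steps


if __name__ == "__main__":
    ticket = IncidentTicket(
        service="example-service",             # placeholder name
        initial_problem="service down after a system upgrade",
        created_at=datetime.now(),
    )
    for step in ticket.open_steps():
        print("TODO:", step)
```

The point of `open_steps()` is only that anyone who joins later can see
from the ticket alone what has been done and what is still missing.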

If the points above make sense to everybody, I would like to get this
documented in our wiki. OK?

