[heroes] Re: RECOVERY Service Alert: anna.infra.opensuse.org/Public SSL cert is OK

26 Jan 2020

Hi all,

the certificates for *.opensuse.org were about to expire in 28 days or
so. Unfortunately, the automatic issuance and deployment of new
certificates (via Let's Encrypt) didn't work as intended, once again.

Luckily we have some monitoring set up for this, so we were notified
of the problem.
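
For anyone not familiar with that kind of check, here is a rough sketch
of the idea in Python. It is not the actual monitoring plugin we run;
the function names and the 30-day threshold are my own assumptions,
only ssl.cert_time_to_seconds() comes from the standard library.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return the whole days left until the certificate's notAfter time.

    not_after uses the textual format found in certificates,
    e.g. "Feb 23 12:00:00 2020 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires - now) // SECONDS_PER_DAY)


def is_expiring(not_after, threshold_days=30, now=None):
    """True if fewer than threshold_days remain (threshold is an assumption)."""
    return days_until_expiry(not_after, now) < threshold_days
```

With a notAfter date 28 days away and a 30-day threshold, is_expiring()
returns True, which is roughly the situation the alert reported.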

It has been complaining about this for some days now, but I was somehow
hoping for a miracle and for the problem to "fix itself".
(Un)fortunately computers are deterministic after all, so the problem
didn't go away.

After logging in to "crtmgr.infra.opensuse.org", I noticed two failed
services, namely:

- dehydrated
- sssd

sssd fails with the following error messages:
...
Jan 26 22:03:57 crtmgr sssd[682]: SSSD couldn't load the configuration database [2]: No such file or directory.
Not sure why sssd is enabled at all on this machine, since the sssd
configuration file doesn't contain anything at all. Maybe this was the
result of some (unfinished) Salt run in the past?

(The machine is intentionally not managed in Salt at the moment.)

I've disabled sssd for now and consider it to be (totally) unrelated
to the actual issue.

The journal said the following for dehydrated:
...
Jan 26 01:39:07 crtmgr systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 01:39:07 crtmgr systemd[1]: Failed to start Certificate Update Runner for Dehydrated.
Jan 26 01:39:07 crtmgr systemd[1]: dehydrated.service: Unit entered failed state.
Jan 26 01:39:07 crtmgr systemd[1]: dehydrated.service: Failed with result 'exit-code'.
While dehydrated is a great set of scripts for Let's Encrypt, there
seems to be no further logging. The man page [1] does not mention
debugging or logging at all :-/.

We had similar issues in the past (broken dehydrated service), and had
to debug them by running the script with "bash -x" and so on.

This time all I did was reboot the machine (due to pending kernel
updates) and trigger the "dehydrated.service" manually afterwards.

To my surprise, it ran successfully, fetched new certificates and
deployed them through the infrastructure, so the "problem is fixed" now.

However, I still don't know why it didn't work when triggered by the
systemd timer and I don't see a way to learn more about the issue.
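
One thing that might help next time, purely as a sketch: wrap the call
in something that keeps the complete output of every run, so a failed
run leaves a trace behind. The helper name run_logged and the log
format are my own invention, not something dehydrated or our setup
provides.

```python
import subprocess
import time


def run_logged(cmd, log_path):
    """Run cmd, append its combined stdout/stderr to log_path,
    and return the command's exit status."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # "bash -x" traces go to stderr
        text=True,
    )
    with open(log_path, "a") as log:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log.write("=== %s: %s\n" % (stamp, " ".join(cmd)))
        log.write(result.stdout)
        log.write("=== exit status %d\n" % result.returncode)
    return result.returncode
```

The service could then call something like
run_logged(["bash", "-x", "/usr/bin/dehydrated", "--cron"],
"/var/log/dehydrated-trace.log"), where both paths are assumptions on
my part and would need to be adjusted to the machine.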

I'm not super familiar with dehydrated, and I was not the one who
originally set up all of this. Are there any recommendations on how to
deal with issues like this in the future? Have I missed some log/debug
output that might be helpful in understanding what the root cause of the
issue has been? Or is it really impossible to tell why a previous run of
the script failed? This seems odd to me ...

Best regards,
Karol Babioch

[1]: https://github.com/lukas2511/dehydrated/blob/master/docs/man/dehydrated.1
