Hi all,

the certificates for *.opensuse.org were about to expire in 28 days or so. Unfortunately, the automatic issuance and deployment of new certificates (via Let's Encrypt) didn't work as intended, once again.

Luckily we have some monitoring set up for this, so we were notified of the problem. It has been complaining about this for some days now, but I was somehow hoping for a miracle and for the problem to "fix itself". (Un)fortunately computers are deterministic after all, so the problem didn't go away.

Upon logging in to "crtmgr.infra.opensuse.org", I noticed that there were two failed services, namely:

- dehydrated
- sssd

sssd fails with the following error message:

Jan 26 22:03:57 crtmgr sssd[682]: SSSD couldn't load the configuration database [2]: No such file or directory.

Not sure why sssd is enabled at all on this machine, since the sssd configuration file doesn't contain anything at all. Maybe this was the result of some (unfinished) Salt run in the past? (The machine is currently, and intentionally, not managed in Salt.) I've disabled sssd for now and consider it to be (totally) unrelated to the actual issue.

The journal said the following for dehydrated:

Jan 26 01:39:07 crtmgr systemd[1]: dehydrated.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 01:39:07 crtmgr systemd[1]: Failed to start Certificate Update Runner for Dehydrated.
Jan 26 01:39:07 crtmgr systemd[1]: dehydrated.service: Unit entered failed state.
Jan 26 01:39:07 crtmgr systemd[1]: dehydrated.service: Failed with result 'exit-code'.

While dehydrated is a great set of script(s) for Let's Encrypt, there seems to be no further logging. The man page [1] does not even mention debugging or logging at all :-/. We had similar issues in the past (broken dehydrated service), and had to debug them by running the script with "bash -x", etc.

This time all I did was reboot the machine (due to pending kernel updates) and trigger "dehydrated.service" manually afterwards. To my surprise, it ran successfully, fetched new certificates and deployed them throughout the infrastructure, so the "problem is fixed" now.

However, I still don't know why it didn't work when triggered by the systemd timer, and I don't see a way to learn more about the issue. I'm not super familiar with dehydrated, and I was not the one who originally set all of this up.

Are there any recommendations on how to deal with issues like this in the future? Have I missed some log/debug output that might be helpful in understanding what the root cause of the issue was? Or is it really impossible to tell why a previous run of the script failed? This seems odd to me ...

Best regards,
Karol Babioch

[1]: https://github.com/lukas2511/dehydrated/blob/master/docs/man/dehydrated.1