commit | dc6e7726bf742fae4f576dd8b95ae2d800378420 | |
---|---|---|
author | Jan Kundrát <jan.kundrat@cesnet.cz> | Thu Mar 14 23:13:41 2024 +0100 |
committer | Jan Kundrát <jan.kundrat@cesnet.cz> | Thu Mar 14 23:23:41 2024 +0100 |
tree | c920c8223a5bd22b6e060002fb60e9d5e9d14b96 | |
parent | 7d61c678b26a9db60cb752476d450dba2b2e5b59 | |
lab: save the most recent logs whenever a service crashes

We have a central systemd-journald "syslog" server these days, but the logs are very, very verbose, including a full copy of the SPI traffic, for example. This has some merit, but at the same time the log volume is just too much, even in a lab setup. Let's store the most recent one minute's worth of logging whenever something crashes on any given lab device.

This is implemented through a simple Python script which sets up a filter listening for all systemd messages which say that a service has failed. Once that happens, the code spawns two processes: a `journalctl` for exporting the relevant part of the recent logs, and a `systemd-journal-remote` for storing that just-exported stream in a native journal file on disk. This two-step dance is required because `journalctl` cannot produce a native journal file on disk, and I think it's a good idea to have these stored in a native format -- if only because it allows for some easy filtering. The code also dumps (a part of) that log into a text file, just for convenience.

To deploy this, simply run:

    ansible-playbook -i production site.yml -l czl-logs

This includes a workaround for a "too old" systemd which by default just wouldn't rotate the log files captured from a remote journal. The new files with the "relevant snippet of the logs", however, are *not* rotated in any manner; in my testing it's about 16 MB per crash. This means that we have space for about 1500 crashes on that 30 GB rootfs, which Should Be Enough For Everybody™.

Change-Id: I9261247608cfcc4afe373e72935489c66064e8dd
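The script itself is not shown here, but a minimal sketch of such a watcher, assuming the python-systemd bindings, could look like the following. The message ID constant, the binary path, and the output directory are assumptions, not taken from the commit:

    #!/usr/bin/env python3
    # Hypothetical sketch of the crash-log watcher described above.
    import select
    import subprocess

    from systemd import journal

    # systemd's well-known message ID for "Unit ... entered the 'failed'
    # state" (SD_MESSAGE_UNIT_FAILED in sd-messages.h) -- an assumption
    # about which filter the actual script uses.
    UNIT_FAILED = 'be02cf6855d2428ba40df7e9d022f03d'

    reader = journal.Reader()
    reader.add_match(MESSAGE_ID=UNIT_FAILED)
    reader.seek_tail()
    reader.get_previous()  # position the cursor at the current journal end

    poller = select.poll()
    poller.register(reader.fileno(), reader.get_events())

    while True:
        poller.poll()
        if reader.process() != journal.APPEND:
            continue
        for entry in reader:
            unit = entry.get('UNIT', 'unknown')
            # Step 1: export the last minute of logs in the journal
            # export format, since journalctl cannot write a native
            # journal file itself.
            export = subprocess.Popen(
                ['journalctl', '--since', '-1min', '-o', 'export'],
                stdout=subprocess.PIPE)
            # Step 2: feed that stream to systemd-journal-remote, which
            # stores it as a native journal file on disk. Binary path
            # and output directory are assumptions.
            subprocess.run(
                ['/usr/lib/systemd/systemd-journal-remote', '-o',
                 f'/var/log/crash-dumps/{unit}.journal', '-'],
                stdin=export.stdout)
            export.stdout.close()
            export.wait()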
This repository currently powers the CI infrastructure tied to our Gerrit. It's mostly about Zuul v3 with Nodepool, log storage, etc.
Note that some pieces (Gerrit itself in particular) are still deployed via Puppet for legacy reasons. That configuration is internal.
    # Example: provision the Zuul server
    ansible-playbook -i production site.yml -l zuul-server