health: avoid double registration of a systemd unit

Recently I saw the following error in our CI pipeline:

   41/73 Test #63: test-health_systemd-units ..................................***Failed    0.19 sec
   50.160 [19252 health T] Registered systemd unit watcher for 'unit1.service'
   50.161 [19252 health D] Systemd unit 'unit1.service' changed state (active running)
   50.163 [19252 health T] Registered systemd unit watcher for 'unit2.service'
   50.163 [19252 health D] Systemd unit 'unit2.service' changed state (activating auto-restart)
   50.165 [19252 health T] Registered systemd unit watcher for 'unit3.service'
   50.166 [19252 health D] Systemd unit 'unit3.service' changed state (failed failed)
   50.167 [19244 health T] Registered systemd unit watcher for 'unit1.service'
   50.168 [19244 health T] Systemd unit 'unit1.service' changed state but it is the same state as before (active, running)
   50.169 [19244 health T] Registered systemd unit watcher for 'unit2.service'
   50.170 [19244 health T] Systemd unit 'unit2.service' changed state but it is the same state as before (activating, auto-restart)
   50.172 [19244 health T] Registered systemd unit watcher for 'unit3.service'
   50.172 [19244 health T] Systemd unit 'unit3.service' changed state but it is the same state as before (failed, failed)
   50.272 [19244 main E] dataFrom /ietf-alarms:alarms/alarm-inventory
   [doctest] doctest version is "2.4.6"
   [doctest] run with "--help" for options
   /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:79
   fakeAlarmServer.rpcCalled(alarmRPC, std::map<std::string, std::string>{ {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-text", "systemd unit state: (active, running)"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id", "velia-alarms:systemd-unit-failure"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier", ""}, {"/sysrepo-ietf-alarms:create-or-update-alarm/resource", "unit1.service"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/severity", "cleared"} }) with.
     param  _1 == /sysrepo-ietf-alarms:create-or-update-alarm
     param  _2 == { { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-text, systemd unit state: (active, running) }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id, velia-alarms:systemd-unit-failure }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier,  }, { /sysrepo-ietf-alarms:create-or-update-alarm/resource, unit1.service }, { /sysrepo-ietf-alarms:create-or-update-alarm/severity, cleared } }

   /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:81
   fakeAlarmServer.rpcCalled(alarmRPC, std::map<std::string, std::string>{ {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-text", "systemd unit state: (activating, auto-restart)"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id", "velia-alarms:systemd-unit-failure"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier", ""}, {"/sysrepo-ietf-alarms:create-or-update-alarm/resource", "unit2.service"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/severity", "critical"} }) with.
     param  _1 == /sysrepo-ietf-alarms:create-or-update-alarm
     param  _2 == { { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-text, systemd unit state: (activating, auto-restart) }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id, velia-alarms:systemd-unit-failure }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier,  }, { /sysrepo-ietf-alarms:create-or-update-alarm/resource, unit2.service }, { /sysrepo-ietf-alarms:create-or-update-alarm/severity, critical } }

   /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:83
   fakeAlarmServer.rpcCalled(alarmRPC, std::map<std::string, std::string>{ {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-text", "systemd unit state: (failed, failed)"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id", "velia-alarms:systemd-unit-failure"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier", ""}, {"/sysrepo-ietf-alarms:create-or-update-alarm/resource", "unit3.service"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/severity", "critical"} }) with.
     param  _1 == /sysrepo-ietf-alarms:create-or-update-alarm
     param  _2 == { { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-text, systemd unit state: (failed, failed) }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id, velia-alarms:systemd-unit-failure }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier,  }, { /sysrepo-ietf-alarms:create-or-update-alarm/resource, unit3.service }, { /sysrepo-ietf-alarms:create-or-update-alarm/severity, critical } }

   ===============================================================================
   /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:63:
   TEST CASE:  systemd unit state monitoring (alarms)

   /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:90: FATAL ERROR: REQUIRE( dataFromSysrepo(srSess, "/ietf-alarms:alarms/alarm-inventory", sysrepo::Datastore::Operational) == std::map<std::string, std::string>{ {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']", ""}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-id", "velia-alarms:systemd-unit-failure"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-qualifier", ""}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[1]", "unit1.service"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[2]", "unit2.service"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[3]", "unit3.service"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/will-clear", "true"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/severity-level[1]", "critical"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/description", "The systemd service is considered in failed state."} } ) is NOT correct!
     values: REQUIRE( {
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']": "",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-id": "velia-alarms:systemd-unit-failure",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-qualifier": "",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/description": "The systemd service is considered in failed state.",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[1]": "unit1.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[2]": "unit2.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[3]": "unit3.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[4]": "unit1.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[5]": "unit2.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[6]": "unit3.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/severity-level[1]": "critical",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/will-clear": "true",
   } == {
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']": "",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-id": "velia-alarms:systemd-unit-failure",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-qualifier": "",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/description": "The systemd service is considered in failed state.",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[1]": "unit1.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[2]": "unit2.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[3]": "unit3.service",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/severity-level[1]": "critical",
     "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/will-clear": "true",
   } )

From both the logger and the failed REQUIRE it is obvious that the units
were registered twice. I think this can happen when the UnitNew signal
comes right after the signal handler is registered but before the
ListUnits call in the constructor.

The fix is trivial: Check that the unit is not yet registered prior to
storing it. However, and this bothers me, is that code that registers
units received from ListUnits and the UnitNew signal handler run in
different threads yet the code is the same: We run registerSystemdUnit
method. But tsan does not pick up any possible data races when accessing
std::map storing the units and their states.

We also guard the access the other member (m_unitState) storing the
last state of a systemd unit. After the PropertiesChanged signal handler
is registered the current thread (reading state of the unit) and the
sdbus++ thread (processing the signal) can both access and modify the
m_unitState member.

Fixes: ccd80c3d ("input: Systemd unit monitor input")
Change-Id: I888cd896a4fc1b531a3a15f4c96075628d6c9e7b
2 files changed
tree: 2b3a841576638c9331495daffaafc2c14bd5c873
  1. .clang-format
  2. .gitmodules
  3. .zuul.yaml
  4. CMakeLists.txt
  5. Doxyfile.in
  6. LICENSE
  7. LICENSE.md
  8. README.md
  9. ci/
  10. cmake/
  11. docs/
  12. src/
  13. tests/
  14. yang/
README.md

YANG System management for embedded devices running Linux

Together with sysrepo, this software provides "general system management" of embedded devices. The target platform is anything that runs Linux with systemd. This runs in production on CzechLight SDN DWDM devices.

Health tracking

This component tracks the overal health state of the system, including various sensors, or the state of systemd units. As an operator-friendly LED at the front panel of the appliance shows the aggregated health state.

System management

Firmware can be updated via RAUC, and various aspects of the system's configuration can be adjusted. This includes a firewall, basic network settings, and authentication management.

Supported YANG models

For a full list, consult the yang/ directory in this repository.