commit | 0f8f15b14f9db7973996b971aaf84cd302db58d9 | [log] [tgz] |
---|---|---|
author | Tomáš Pecka <tomas.pecka@cesnet.cz> | Wed May 17 20:00:48 2023 +0200 |
committer | Tomáš Pecka <tomas.pecka@cesnet.cz> | Fri May 19 12:19:10 2023 +0200 |
tree | 2b3a841576638c9331495daffaafc2c14bd5c873 | |
parent | 0e8748424208ac64702dfba6cc6c3e629d54e62d [diff] |
health: avoid double registration of a systemd unit Recently I saw the following error in our CI pipeline: 41/73 Test #63: test-health_systemd-units ..................................***Failed 0.19 sec 50.160 [19252 health T] Registered systemd unit watcher for 'unit1.service' 50.161 [19252 health D] Systemd unit 'unit1.service' changed state (active running) 50.163 [19252 health T] Registered systemd unit watcher for 'unit2.service' 50.163 [19252 health D] Systemd unit 'unit2.service' changed state (activating auto-restart) 50.165 [19252 health T] Registered systemd unit watcher for 'unit3.service' 50.166 [19252 health D] Systemd unit 'unit3.service' changed state (failed failed) 50.167 [19244 health T] Registered systemd unit watcher for 'unit1.service' 50.168 [19244 health T] Systemd unit 'unit1.service' changed state but it is the same state as before (active, running) 50.169 [19244 health T] Registered systemd unit watcher for 'unit2.service' 50.170 [19244 health T] Systemd unit 'unit2.service' changed state but it is the same state as before (activating, auto-restart) 50.172 [19244 health T] Registered systemd unit watcher for 'unit3.service' 50.172 [19244 health T] Systemd unit 'unit3.service' changed state but it is the same state as before (failed, failed) 50.272 [19244 main E] dataFrom /ietf-alarms:alarms/alarm-inventory [doctest] doctest version is "2.4.6" [doctest] run with "--help" for options /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:79 fakeAlarmServer.rpcCalled(alarmRPC, std::map<std::string, std::string>{ {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-text", "systemd unit state: (active, running)"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id", "velia-alarms:systemd-unit-failure"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier", ""}, {"/sysrepo-ietf-alarms:create-or-update-alarm/resource", "unit1.service"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/severity", "cleared"} }) with. param _1 == /sysrepo-ietf-alarms:create-or-update-alarm param _2 == { { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-text, systemd unit state: (active, running) }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id, velia-alarms:systemd-unit-failure }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier, }, { /sysrepo-ietf-alarms:create-or-update-alarm/resource, unit1.service }, { /sysrepo-ietf-alarms:create-or-update-alarm/severity, cleared } } /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:81 fakeAlarmServer.rpcCalled(alarmRPC, std::map<std::string, std::string>{ {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-text", "systemd unit state: (activating, auto-restart)"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id", "velia-alarms:systemd-unit-failure"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier", ""}, {"/sysrepo-ietf-alarms:create-or-update-alarm/resource", "unit2.service"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/severity", "critical"} }) with. param _1 == /sysrepo-ietf-alarms:create-or-update-alarm param _2 == { { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-text, systemd unit state: (activating, auto-restart) }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id, velia-alarms:systemd-unit-failure }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier, }, { /sysrepo-ietf-alarms:create-or-update-alarm/resource, unit2.service }, { /sysrepo-ietf-alarms:create-or-update-alarm/severity, critical } } /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:83 fakeAlarmServer.rpcCalled(alarmRPC, std::map<std::string, std::string>{ {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-text", "systemd unit state: (failed, failed)"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id", "velia-alarms:systemd-unit-failure"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier", ""}, {"/sysrepo-ietf-alarms:create-or-update-alarm/resource", "unit3.service"}, {"/sysrepo-ietf-alarms:create-or-update-alarm/severity", "critical"} }) with. param _1 == /sysrepo-ietf-alarms:create-or-update-alarm param _2 == { { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-text, systemd unit state: (failed, failed) }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-id, velia-alarms:systemd-unit-failure }, { /sysrepo-ietf-alarms:create-or-update-alarm/alarm-type-qualifier, }, { /sysrepo-ietf-alarms:create-or-update-alarm/resource, unit3.service }, { /sysrepo-ietf-alarms:create-or-update-alarm/severity, critical } } =============================================================================== /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:63: TEST CASE: systemd unit state monitoring (alarms) /home/ci/src/cesnet-gerrit-public/CzechLight/velia/tests/health_systemd-units.cpp:90: FATAL ERROR: REQUIRE( dataFromSysrepo(srSess, "/ietf-alarms:alarms/alarm-inventory", sysrepo::Datastore::Operational) == std::map<std::string, std::string>{ {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']", ""}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-id", "velia-alarms:systemd-unit-failure"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-qualifier", ""}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[1]", "unit1.service"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[2]", "unit2.service"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[3]", "unit3.service"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/will-clear", "true"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/severity-level[1]", "critical"}, {"/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/description", "The systemd service is considered in failed state."} } ) is NOT correct! values: REQUIRE( { "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']": "", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-id": "velia-alarms:systemd-unit-failure", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-qualifier": "", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/description": "The systemd service is considered in failed state.", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[1]": "unit1.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[2]": "unit2.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[3]": "unit3.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[4]": "unit1.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[5]": "unit2.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[6]": "unit3.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/severity-level[1]": "critical", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/will-clear": "true", } == { "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']": "", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-id": "velia-alarms:systemd-unit-failure", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/alarm-type-qualifier": "", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/description": "The systemd service is considered in failed state.", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[1]": "unit1.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[2]": "unit2.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/resource[3]": "unit3.service", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/severity-level[1]": "critical", "/alarm-type[alarm-type-id='velia-alarms:systemd-unit-failure'][alarm-type-qualifier='']/will-clear": "true", } ) From both the logger and the failed REQUIRE it is obvious that the units were registered twice. I think this can happen when the UnitNew signal comes right after the signal handler is registered but before the ListUnits call in the constructor. The fix is trivial: Check that the unit is not yet registered prior to storing it. However, and this bothers me, is that code that registers units received from ListUnits and the UnitNew signal handler run in different threads yet the code is the same: We run registerSystemdUnit method. But tsan does not pick up any possible data races when accessing std::map storing the units and their states. We also guard the access the other member (m_unitState) storing the last state of a systemd unit. After the PropertiesChanged signal handler is registered the current thread (reading state of the unit) and the sdbus++ thread (processing the signal) can both access and modify the m_unitState member. Fixes: ccd80c3d ("input: Systemd unit monitor input") Change-Id: I888cd896a4fc1b531a3a15f4c96075628d6c9e7b
Together with sysrepo, this software provides "general system management" of embedded devices. The target platform is anything that runs Linux with systemd. This runs in production on CzechLight SDN DWDM devices.
This component tracks the overal health state of the system, including various sensors, or the state of systemd
units. As an operator-friendly LED at the front panel of the appliance shows the aggregated health state.
Firmware can be updated via RAUC, and various aspects of the system's configuration can be adjusted. This includes a firewall, basic network settings, and authentication management.
For a full list, consult the yang/
directory in this repository.
ietf-access-control-list
, RFC 8519 (with deviations)ietf-hardware
, RFC 8348ietf-system
, RFC 7317 (partial support)ietf-interfaces
, RFC 8343 (generating config for systemd-networkd
, with extensions)ietf-routing
, RFC 8349 (see above)czechlight-system