ENH: Add rule to remove downtime #297
Let the patch and reboot action end just after starting a reboot, then use a sensor to detect when the machine is back before removing the downtime.
Also added the "EMPTY" and "DOWN" states.
ea43ded to 713875b
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #297      +/-   ##
==========================================
- Coverage   98.14%   98.12%   -0.02%
==========================================
  Files         126      126
  Lines        4575     4580       +5
  Branches      242      243       +1
==========================================
+ Hits         4490     4494       +4
- Misses         76       77       +1
  Partials        9        9
Update icinga downtime actions and tests to use IcingaObject enums
713875b to 2daac3b
lib/workflows/hv_patch_and_reboot.py
Outdated
finally:
    remove_downtime(
        icinga_account=icinga_account,
        object_type="Host",
        object_name=hypervisor_name,
    )
Can we leave this in and change finally to except, so the downtime is removed if either the patch or reboot command fails?
Should I only catch SSHException?
Yes, I think so. You may also have to re-raise the exception after removing the downtime.
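A minimal sketch of the pattern discussed above: catch only SSHException, remove the downtime, then re-raise so the failure still propagates. The SSHException class here is a local stand-in (the real one is presumably paramiko's, which is an assumption), and patch_and_reboot, run_commands, and the remove_downtime stub are hypothetical names, not the repository's actual code.

```python
class SSHException(Exception):
    """Local stand-in for the real SSH exception type (assumed, not imported)."""


removed = []  # records remove_downtime calls so the sketch can be checked


def remove_downtime(icinga_account, object_type, object_name):
    # Hypothetical stub: a real action would call the Icinga API here.
    removed.append((icinga_account, object_type, object_name))


def patch_and_reboot(icinga_account, hypervisor_name, run_commands):
    """Run the patch/reboot commands; clean up the downtime only on SSH failure."""
    try:
        run_commands(hypervisor_name)
    except SSHException:
        # The commands failed, so no reboot will bring the host back:
        # remove the downtime now instead of leaving it to the sensor rule.
        remove_downtime(
            icinga_account=icinga_account,
            object_type="Host",
            object_name=hypervisor_name,
        )
        # Bare raise re-raises the original exception with its traceback.
        raise
```

On success nothing is removed here, which matches the intent of the PR: the downtime stays in place until the sensor sees the machine come back.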
@@ -2,7 +2,7 @@
 class_name: HypervisorStateSensor
 entry_point: src/hypervisor_state_sensor.py
 description: Monitor state of Hypervisors
-poll_interval: 600
+poll_interval: 30
Can we increase the poll_interval back to 10 mins (600 seconds)?
rules/hv.remove.downtime.yaml
Outdated
criteria:
  trigger.previous_state:
    type: equals
    pattern: DOWN
This might need to be DOWN or DRAINED, so we catch any hypervisors that reboot faster than the polling interval.
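To illustrate the concern, here is a small sketch that samples a made-up state timeline at a fixed poll interval and lists the transitions a sensor would observe. The timeline, its timings, and the helper names are assumptions for illustration only; the state names are taken from this PR, but their exact semantics in the sensor are not confirmed here.

```python
from typing import List, Tuple

Timeline = List[Tuple[int, str]]  # (second at which the state begins, state name)


def state_at(timeline: Timeline, t: int) -> str:
    """Return the state in effect at second t (timeline sorted by start time)."""
    current = timeline[0][1]
    for start, state in timeline:
        if start <= t:
            current = state
    return current


def observed_transitions(
    timeline: Timeline, poll_interval: int, horizon: int
) -> List[Tuple[str, str]]:
    """List (previous, current) pairs seen when polling every poll_interval seconds."""
    samples = [state_at(timeline, t) for t in range(0, horizon + 1, poll_interval)]
    return [(prev, cur) for prev, cur in zip(samples, samples[1:]) if prev != cur]


def previous_state_matches(previous_state: str) -> bool:
    """Widened criterion from the review: accept DOWN or DRAINED."""
    return previous_state in {"DOWN", "DRAINED"}


# Hypothetical reboot: drained, down from 100 s, back at 400 s.
reboot: Timeline = [(0, "DRAINED"), (100, "DOWN"), (400, "REBOOTED")]

slow_poll = observed_transitions(reboot, poll_interval=600, horizon=1200)
fast_poll = observed_transitions(reboot, poll_interval=30, horizon=1200)
```

With the 600 second interval the DOWN state falls entirely between two polls, so the only transition observed starts from DRAINED; a criterion matching DOWN alone would never fire. The sketch models only the previous_state criterion, and any check on the current state is outside its scope.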
5aadc94 to 8a7496f
Changed criteria to be either DOWN or DRAINED
8a7496f to 6466ae4
Description:
This PR updates hypervisor_states.py to add states for all possible machines: REBOOTED, EMPTY, DOWN.