-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Prometheus metrics for LocalQueue #3673
Conversation
Hi @KPostOffice. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
Currently I've implemented this with just a boolean feature gate. I was having trouble figuring out how to pass namespace selector details down to the |
@@ -146,6 +146,9 @@ type ControllerMetrics struct { | |||
// metrics will be reported. | |||
// +optional | |||
EnableClusterQueueResources bool `json:"enableClusterQueueResources,omitempty"` | |||
|
|||
// +optional | |||
EnableLocalQueueMetrics bool `json:"enableLocalQueueMetrics,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to favor API rather than a feature gate? We don't guard other metrics by API. So, I don't see such a need, but let us know if there is something specific about them. If the concern is stability of the system due to potential bugs, then feature gate is enough, we can start from alpha. It would also allow us to simplify the code as feature gate status can be checked from any place, so no need to pass parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much agree, especially when it comes to passing parameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a comment about increasing cardinality and wanting to leave this behind a long term config field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, but in that case I would like to go via the KEP process. Pity the comment does not mention why cardinality is a problem - is it for usability (this could be solved by aggregation), or performance. Do you have some other references why cardinality might be a problem in k8s.
I assume we don't have many more LQs than namepaces, which also let me check what we do in the core k8s. I see that we have metrics depending on Namespace, example. However, in this case we use explicitly CounterOpts.Namespace. Maybe we could also do it this way? PTAL.
If you want this feature in 0.10 I think the only chance is a short KEP, don't change API, and guard it by Alpha feature gate (disabled by default). Then for second iteration of alpha investigate if we need the API switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The namespace
in the example you link isn't a K8 namespace from what I understand. It is the project namespace to avoid prometheus
metrics clashing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, are these metrics opt-in or enabled by default? If k8s core enables them by default I don't think we need to worry. I would like to better understand why cardinality is a problem basically
There are a handful that have graduated to stable and about a dozen that are alpha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y are you ok going via feature gate alpha in 0.10? We may still be likely to the API as in the KEP for 0.11, but this would be less committal as I'm not sure if the API is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the too delayed response.
I would recommend guarding these metrics by Config API due to high cardinality.
This obviously causes high cardinality, and the Prometheus and Grafana query performance will be slowed down especially the big cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but we will not be able to do this for 0.10. Let's update the KEP and revisit the knobs for 0.11 before Beta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with v0.11 due to the tighter schedule for v0.10, and this feature is still alpha.
pkg/metrics/metrics.go
Outdated
'status' can have the following values: | ||
- "active" means that the workloads are in the admission queue. | ||
- "inadmissible" means there was a failed admission attempt for these workloads and they won't be retried until cluster conditions, which could make this workload admissible, change`, | ||
}, []string{"local_queue", "namespace", "status"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is acceptable, but let me consider other options:
- "local_queue", "namespace" - as in the proposal
- "name", "namespace"
- "local_queue" - key "namespace/name"
I'm not in favor of (3) because maybe for some use-cases one wants to aggregate metrics by LQ name rather than full key.
My only slight preference for (2.) is that it is less redundant. It is already clear from the metrics name that we are talking LQs. This is not the case for the pending_workloads
metrics for CQs, so I think we don't need to follow the naming pattern for params strictly here. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with 2
pkg/cache/clusterqueue.go
Outdated
@@ -250,6 +250,11 @@ func (c *clusterQueue) updateQueueStatus() { | |||
if status != c.Status { | |||
c.Status = status | |||
metrics.ReportClusterQueueStatus(c.Name, c.Status) | |||
if lqMetrics { | |||
for _, lq := range c.localQueues { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This iteration might be adding unnecessary performance cost. What is the scenario that it needs calling here? Maybe we could move the call per LQ, when we update the specific LQ. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lq status is equal to the cq status. So when the cq status updates, all the cq's associated lqs should have their statuses updated as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall very nice to see this contribution, and I would like to include it in 0.10 if time allows. Left some comments about major things which draw my attention during initial pass. I would also like to see some integration tests - I think for most of the metrics we should be able to extend the tests were we check metrics for CQs.
@KPostOffice in the release note, please list all the metrics and their shortened description / purpose. |
/ok-to-test |
/retitle Implement Prometheus metrics for LocalQueue |
pkg/queue/cluster_queue.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move changes applied to this file to pkg/queue/local_queue.go
?
fa02f8b
to
671f425
Compare
@KPostOffice I think this PR still lacks integration tests, given this is early Alpha I would be willing to go ahead based on manual testing, but would like to fill in the gap as soon as possible (possibly yet before the release). WDYT? cc @tenzen-y |
I've added some integration tests to the PR, I've also changed the LQ status metric. It now is a direct reflection of the CRs active status. @mimowo Let me know if this is better than updating the status directly in the cache. It changes the values to |
Signed-off-by: Kevin <[email protected]>
Signed-off-by: Kevin <[email protected]>
Signed-off-by: Kevin <[email protected]>
Signed-off-by: Kevin <[email protected]>
Signed-off-by: Kevin <[email protected]>
Signed-off-by: Kevin <[email protected]>
Signed-off-by: Kevin <[email protected]>
ee72592
to
1fc5b75
Compare
/retest-required |
/retest |
Signed-off-by: Kevin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
Please follow up with the knob to enable / disable. However, this is good to me as is for 0.10 since this is alpha (disabled by default) anyway.
LGTM label has been added. Git tree hash: 99c8eff8fdca996d2d218932dd27e7126028efab
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: KPostOffice, mimowo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* add LocalQueue metrics (no feature gate) Signed-off-by: Kevin <[email protected]> * add all clear and report calls Signed-off-by: Kevin <[email protected]> * add feature gate Signed-off-by: Kevin <[email protected]> * cleanup todos and add more feature gates Signed-off-by: Kevin <[email protected]> * use feature gate instead of config Signed-off-by: Kevin <[email protected]> * cleanup Signed-off-by: Kevin <[email protected]> * add metrics checks to a test Signed-off-by: Kevin <[email protected]> * add lq metrics to cq integration test Signed-off-by: Kevin <[email protected]> * lint fix Signed-off-by: Kevin <[email protected]> * use name instead of local_queue Signed-off-by: Kevin <[email protected]> * update status metric description Signed-off-by: Kevin <[email protected]> * fix key name Signed-off-by: Kevin <[email protected]> * move registerLQ into metrics package Signed-off-by: Kevin <[email protected]> --------- Signed-off-by: Kevin <[email protected]>
/remove-kind api-change |
/release-note-edit
|
* add LocalQueue metrics (no feature gate) Signed-off-by: Kevin <[email protected]> * add all clear and report calls Signed-off-by: Kevin <[email protected]> * add feature gate Signed-off-by: Kevin <[email protected]> * cleanup todos and add more feature gates Signed-off-by: Kevin <[email protected]> * use feature gate instead of config Signed-off-by: Kevin <[email protected]> * cleanup Signed-off-by: Kevin <[email protected]> * add metrics checks to a test Signed-off-by: Kevin <[email protected]> * add lq metrics to cq integration test Signed-off-by: Kevin <[email protected]> * lint fix Signed-off-by: Kevin <[email protected]> * use name instead of local_queue Signed-off-by: Kevin <[email protected]> * update status metric description Signed-off-by: Kevin <[email protected]> * fix key name Signed-off-by: Kevin <[email protected]> * move registerLQ into metrics package Signed-off-by: Kevin <[email protected]> --------- Signed-off-by: Kevin <[email protected]>
* add LocalQueue metrics (no feature gate) Signed-off-by: Kevin <[email protected]> * add all clear and report calls Signed-off-by: Kevin <[email protected]> * add feature gate Signed-off-by: Kevin <[email protected]> * cleanup todos and add more feature gates Signed-off-by: Kevin <[email protected]> * use feature gate instead of config Signed-off-by: Kevin <[email protected]> * cleanup Signed-off-by: Kevin <[email protected]> * add metrics checks to a test Signed-off-by: Kevin <[email protected]> * add lq metrics to cq integration test Signed-off-by: Kevin <[email protected]> * lint fix Signed-off-by: Kevin <[email protected]> * use name instead of local_queue Signed-off-by: Kevin <[email protected]> * update status metric description Signed-off-by: Kevin <[email protected]> * fix key name Signed-off-by: Kevin <[email protected]> * move registerLQ into metrics package Signed-off-by: Kevin <[email protected]> --------- Signed-off-by: Kevin <[email protected]>
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Implementation of LQ metrics KEP
this replaces PR #3609
Which issue(s) this PR fixes:
Fixes #1833
Special notes for your reviewer:
I'm uncertain if I've updated all the metrics in the right places. I still need to write tests, but I figured I'd open the PR as I have it now in case anything is egregiously off.
Does this PR introduce a user-facing change?