Implement Prometheus metrics for LocalQueue #3673

Merged
merged 13 commits into from
Dec 5, 2024

Conversation

KPostOffice
Contributor

@KPostOffice KPostOffice commented Nov 27, 2024

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Implementation of LQ metrics KEP

this replaces PR #3609

Which issue(s) this PR fixes:

Fixes #1833

Special notes for your reviewer:

I'm uncertain if I've updated all the metrics in the right places. I still need to write tests, but I figured I'd open the PR as I have it now in case anything is egregiously off.

Does this PR introduce a user-facing change?

Introduce an alpha feature, behind the LocalQueueMetrics feature gate, which allows users to get the Prometheus LocalQueue metrics:
local_queue_pending_workloads
local_queue_quota_reserved_workloads_total
local_queue_quota_reserved_wait_time_seconds
local_queue_admitted_workloads_total
local_queue_admission_wait_time_seconds
local_queue_admission_checks_wait_time_seconds
local_queue_evicted_workloads_total
local_queue_reserving_active_workloads
local_queue_admitted_active_workloads
local_queue_status
local_queue_resource_reservation
local_queue_resource_usage

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Nov 27, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 27, 2024
@k8s-ci-robot
Contributor

Hi @KPostOffice. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.



netlify bot commented Nov 27, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 53d2d6a
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6751bca647f0ef000873f37b

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 27, 2024
@KPostOffice
Contributor Author

Currently I've implemented this with just a boolean feature gate. I was having trouble figuring out how to pass namespace selector details down to the scheduler, cache, and queue, and then also act on those selector values. I didn't want to introduce client calls to them, since I figured making network requests would pretty severely hurt performance in those packages.

@@ -146,6 +146,9 @@ type ControllerMetrics struct {
// metrics will be reported.
// +optional
EnableClusterQueueResources bool `json:"enableClusterQueueResources,omitempty"`

// +optional
EnableLocalQueueMetrics bool `json:"enableLocalQueueMetrics,omitempty"`
Contributor

@mimowo mimowo Nov 28, 2024

What is the reason to favor API rather than a feature gate? We don't guard other metrics by API. So, I don't see such a need, but let us know if there is something specific about them. If the concern is stability of the system due to potential bugs, then feature gate is enough, we can start from alpha. It would also allow us to simplify the code as feature gate status can be checked from any place, so no need to pass parameters.
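
The point about checking a gate "from any place" can be illustrated with a minimal, stdlib-only sketch. The registry, `SetEnabled`, `Enabled`, and `reportPending` are hypothetical names written for this illustration; they are not Kueue's actual feature-gate package. Only the gate name `LocalQueueMetrics` comes from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// Feature is the name of a feature gate.
type Feature string

// LocalQueueMetrics is the gate named in this PR's release note.
const LocalQueueMetrics Feature = "LocalQueueMetrics"

// gates is a hypothetical process-wide registry; an alpha gate defaults to off.
var (
	mu    sync.RWMutex
	gates = map[Feature]bool{LocalQueueMetrics: false}
)

// SetEnabled flips a gate, for example from a command-line flag at startup.
func SetEnabled(f Feature, on bool) {
	mu.Lock()
	defer mu.Unlock()
	gates[f] = on
}

// Enabled can be called from any package, so no parameter has to be
// threaded through the scheduler, cache, or queue constructors.
func Enabled(f Feature) bool {
	mu.RLock()
	defer mu.RUnlock()
	return gates[f]
}

// reportPending stands in for a metrics call site deep in the code base.
// It returns true when the LocalQueue series would be updated.
func reportPending(namespace, name string, pending int) bool {
	if !Enabled(LocalQueueMetrics) {
		return false
	}
	fmt.Printf("local_queue_pending_workloads{name=%q,namespace=%q} %d\n", name, namespace, pending)
	return true
}

func main() {
	reportPending("team-a", "main", 3) // gate off: nothing is reported
	SetEnabled(LocalQueueMetrics, true)
	reportPending("team-a", "main", 3) // gate on: the series is reported
}
```

The trade-off discussed in this thread still applies: a gate is a global on/off switch, whereas a config field could later carry finer-grained selection.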

Contributor

I very much agree, especially when it comes to passing parameters

Contributor Author

There was a comment about increasing cardinality and wanting to leave this behind a long term config field

Contributor

I see, but in that case I would like to go via the KEP process. It's a pity the comment does not mention why cardinality is a problem: is it for usability (this could be solved by aggregation) or for performance? Do you have any other references on why cardinality might be a problem in k8s?

I assume we don't have many more LQs than namespaces, which also led me to check what we do in core k8s. I see that we have metrics depending on Namespace (example). However, in this case we explicitly use CounterOpts.Namespace. Maybe we could also do it this way? PTAL.

If you want this feature in 0.10, I think the only chance is a short KEP, no API change, and an alpha feature gate (disabled by default) as the guard. Then, for the second iteration of alpha, investigate whether we need the API switch.

Contributor Author

The namespace in the example you link isn't a Kubernetes namespace, from what I understand. It is the project namespace, used to keep Prometheus metric names from clashing.
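
To make the distinction concrete, here is a small sketch. It assumes the Prometheus Go client's convention of joining the Namespace, Subsystem, and Name options with underscores to form the metric name; `fqName` is a stdlib-only stand-in written for illustration rather than the library function, and the `kueue` prefix is an assumption as well.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName joins the non-empty parts with underscores. An empty name yields
// an empty result, since a metric cannot consist of a prefix alone.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// The prefix becomes part of the metric NAME and is the same for every series.
	fmt.Println(fqName("", "kueue", "local_queue_pending_workloads"))

	// A Kubernetes namespace, by contrast, would be a LABEL VALUE that varies
	// per series, which is what drives cardinality.
	fmt.Printf("%s{namespace=%q}\n", fqName("", "kueue", "local_queue_pending_workloads"), "team-a")
}
```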

Contributor Author

> Interesting, are these metrics opt-in or enabled by default? If k8s core enables them by default I don't think we need to worry. I would like to better understand why cardinality is a problem basically

There are a handful that have graduated to stable and about a dozen that are alpha.

Contributor

@tenzen-y are you OK with going via an alpha feature gate in 0.10? We may still add the API as in the KEP for 0.11, but this would be less committal, as I'm not sure the API is needed.

Member

Sorry for the delayed response.
I would recommend guarding these metrics by the Config API due to high cardinality.
This obviously causes high cardinality, and Prometheus and Grafana query performance will slow down, especially in big clusters.
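
A rough back-of-the-envelope sketch of the cardinality concern: each distinct combination of label values is its own time series, so a per-LocalQueue metric grows with the number of LocalQueues. The cluster sizes below are made up for illustration, and `seriesCount` is a hypothetical helper.

```go
package main

import "fmt"

// seriesCount estimates the number of time series for one metric:
// one series per queue, multiplied by the number of values of each
// additional label.
func seriesCount(queues int, extraLabelValues ...int) int {
	n := queues
	for _, v := range extraLabelValues {
		n *= v
	}
	return n
}

func main() {
	// Hypothetical cluster: 20 ClusterQueues and 2000 LocalQueues.
	// A pending-workloads gauge has a "status" label with 2 values.
	fmt.Println("per-CQ series:", seriesCount(20, 2))
	fmt.Println("per-LQ series:", seriesCount(2000, 2))
}
```

Histogram metrics multiply further, because every bucket is a separate series.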

Contributor

Sure, but we will not be able to do this for 0.10. Let's update the KEP and revisit the knobs for 0.11 before Beta.

Member

I'm ok with v0.11 due to the tighter schedule for v0.10, and this feature is still alpha.

'status' can have the following values:
- "active" means that the workloads are in the admission queue.
- "inadmissible" means there was a failed admission attempt for these workloads and they won't be retried until cluster conditions, which could make this workload admissible, change`,
}, []string{"local_queue", "namespace", "status"},
Contributor

I think this is acceptable, but let me consider other options:

  1. "local_queue", "namespace" - as in the proposal
  2. "name", "namespace"
  3. "local_queue" - key "namespace/name"

I'm not in favor of (3) because maybe for some use-cases one wants to aggregate metrics by LQ name rather than full key.

I have a slight preference for (2), only because it is less redundant: it is already clear from the metric name that we are talking about LQs. This is not the case for the pending_workloads metric for CQs, so I think we don't need to follow the naming pattern for params strictly here. WDYT?

Contributor Author

I'm happy with 2
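
As an illustration of why separate labels (option 2) keep aggregation by queue name easy, here is a small sketch. The `lqLabels` type and `sumByName` helper are hypothetical; the aggregation is similar in spirit to a PromQL `sum by (name)`, which would be harder to express if the only label were a combined "namespace/name" key as in option (3).

```go
package main

import "fmt"

// lqLabels models option (2): separate "name" and "namespace" labels.
type lqLabels struct {
	Name      string
	Namespace string
}

// sumByName aggregates a gauge across namespaces, keyed by queue name only.
func sumByName(series map[lqLabels]int) map[string]int {
	out := make(map[string]int)
	for labels, value := range series {
		out[labels.Name] += value
	}
	return out
}

func main() {
	// Made-up sample data: pending workloads per LocalQueue.
	pending := map[lqLabels]int{
		{Name: "main", Namespace: "team-a"}:  3,
		{Name: "main", Namespace: "team-b"}:  2,
		{Name: "batch", Namespace: "team-a"}: 5,
	}
	byName := sumByName(pending)
	fmt.Println("main:", byName["main"], "batch:", byName["batch"])
}
```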

@@ -250,6 +250,11 @@ func (c *clusterQueue) updateQueueStatus() {
if status != c.Status {
c.Status = status
metrics.ReportClusterQueueStatus(c.Name, c.Status)
if lqMetrics {
for _, lq := range c.localQueues {
Contributor

This iteration might add unnecessary performance cost. In what scenario does it need to be called here? Maybe we could move the call to be per LQ, made when we update the specific LQ. PTAL.

Contributor Author

@KPostOffice KPostOffice Dec 2, 2024

The LQ status is equal to the CQ status, so when the CQ status updates, all of the CQ's associated LQs should have their statuses updated as well.

Contributor

@mimowo mimowo left a comment

Overall, very nice to see this contribution, and I would like to include it in 0.10 if time allows. I left some comments about the major things that drew my attention during the initial pass. I would also like to see some integration tests; I think for most of the metrics we should be able to extend the tests where we check metrics for CQs.

cc @tenzen-y @dgrove-oss @PBundyra

@mimowo
Contributor

mimowo commented Nov 28, 2024

@KPostOffice in the release note, please list all the metrics and their shortened description / purpose.

@mbobrovskyi
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 28, 2024
@tenzen-y
Member

/retitle Implement Prometheus metrics for LocalQueue

@k8s-ci-robot k8s-ci-robot changed the title from "Lq metrics" to "Implement Prometheus metrics for LocalQueue" Nov 28, 2024
Contributor

Can we move changes applied to this file to pkg/queue/local_queue.go?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 2, 2024
@mimowo
Contributor

mimowo commented Dec 3, 2024

@KPostOffice I think this PR still lacks integration tests. Given this is early alpha, I would be willing to go ahead based on manual testing, but I would like to fill the gap as soon as possible (possibly before the release). WDYT? cc @tenzen-y

@KPostOffice
Contributor Author

I've added some integration tests to the PR, and I've also changed the LQ status metric. It is now a direct reflection of the CR's active status. @mimowo, let me know if this is better than updating the status directly in the cache. It changes the values to True, False, Unknown. The only drawbacks are some lag relative to the internal status and losing the Terminating status, but it avoids looping over all LQs every time the CQ's status is updated.
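
One way to picture a status metric with the values True, False, and Unknown is a set of gauge series per LocalQueue, where the series for the current value is 1 and the others are 0. This is a common pattern for status gauges, sketched here as an assumption rather than as this PR's confirmed implementation; the `active` label name and the `statusSeries` helper are hypothetical.

```go
package main

import "fmt"

// ConditionStatus mirrors the three values mentioned above.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

var allStatuses = []ConditionStatus{ConditionTrue, ConditionFalse, ConditionUnknown}

// statusSeries returns the value of each status series for one LocalQueue:
// 1 for the current status and 0 for every other status.
func statusSeries(current ConditionStatus) map[ConditionStatus]float64 {
	out := make(map[ConditionStatus]float64, len(allStatuses))
	for _, s := range allStatuses {
		if s == current {
			out[s] = 1
		} else {
			out[s] = 0
		}
	}
	return out
}

func main() {
	values := statusSeries(ConditionFalse)
	for _, s := range allStatuses {
		fmt.Printf("local_queue_status{name=%q,namespace=%q,active=%q} %v\n", "main", "team-a", s, values[s])
	}
}
```

Because the value is derived from the object's own condition, it is computed once per LocalQueue update instead of once per LocalQueue on every ClusterQueue status change.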

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mimowo December 5, 2024 13:36
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 5, 2024
@KPostOffice
Contributor Author

/retest-required

@mimowo
Contributor

mimowo commented Dec 5, 2024

/retest
The integration test failed due to #3633

cmd/kueue/main.go (outdated, resolved)
Contributor

@mimowo mimowo left a comment

/lgtm
/approve
Please follow up with the knob to enable / disable. However, this looks good to me as is for 0.10, since this is alpha (disabled by default) anyway.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 99c8eff8fdca996d2d218932dd27e7126028efab

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KPostOffice, mimowo


@k8s-ci-robot k8s-ci-robot merged commit cb285cf into kubernetes-sigs:main Dec 5, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.10 milestone Dec 5, 2024
KPostOffice added a commit to KPostOffice/kueue that referenced this pull request Dec 10, 2024
* add LocalQueue metrics (no feature gate)

Signed-off-by: Kevin <[email protected]>

* add all clear and report calls

Signed-off-by: Kevin <[email protected]>

* add feature gate

Signed-off-by: Kevin <[email protected]>

* cleanup todos and add more feature gates

Signed-off-by: Kevin <[email protected]>

* use feature gate instead of config

Signed-off-by: Kevin <[email protected]>

* cleanup

Signed-off-by: Kevin <[email protected]>

* add metrics checks to a test

Signed-off-by: Kevin <[email protected]>

* add lq metrics to cq integration test

Signed-off-by: Kevin <[email protected]>

* lint fix

Signed-off-by: Kevin <[email protected]>

* use name instead of local_queue

Signed-off-by: Kevin <[email protected]>

* update status metric description

Signed-off-by: Kevin <[email protected]>

* fix key name

Signed-off-by: Kevin <[email protected]>

* move registerLQ into metrics package

Signed-off-by: Kevin <[email protected]>

---------

Signed-off-by: Kevin <[email protected]>
@mimowo
Contributor

mimowo commented Dec 11, 2024

/remove-kind api-change

@k8s-ci-robot k8s-ci-robot removed the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Dec 11, 2024
@mimowo
Contributor

mimowo commented Dec 11, 2024

/release-note-edit

Introduce an alpha feature, behind the LocalQueueMetrics feature gate, which allows users to get the Prometheus LocalQueue metrics:
local_queue_pending_workloads
local_queue_quota_reserved_workloads_total
local_queue_quota_reserved_wait_time_seconds
local_queue_admitted_workloads_total
local_queue_admission_wait_time_seconds
local_queue_admission_checks_wait_time_seconds
local_queue_evicted_workloads_total
local_queue_reserving_active_workloads
local_queue_admitted_active_workloads
local_queue_status
local_queue_resource_reservation
local_queue_resource_usage

KPostOffice added a commit to KPostOffice/kueue that referenced this pull request Dec 11, 2024
KPostOffice added a commit to KPostOffice/kueue that referenced this pull request Dec 11, 2024
@mimowo mimowo mentioned this pull request Dec 12, 2024
34 tasks
@tenzen-y tenzen-y mentioned this pull request Dec 13, 2024
3 tasks
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Local queues prometheus metrics
6 participants