Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance Metrics Labels to Reflect Affected Resources #1344

Open
Suraiya-Hameed opened this issue Sep 15, 2023 · 6 comments · May be fixed by #1673
Open

Enhance Metrics Labels to Reflect Affected Resources #1344

Suraiya-Hameed opened this issue Sep 15, 2023 · 6 comments · May be fixed by #1673
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@Suraiya-Hameed
Copy link
Contributor

Describe the solution you'd like
Metrics like total_node_publish_error and total_rotation_reconcile_error do emit labels such as namespace and pod, these labels currently represent the secrets-store-csi-driver itself. This isn't as helpful as it could be.

It would be beneficial if these metrics provided labels that represent the affected resources, not just the emitter of the error. Specifically, I'd like to see:

affected_namespace: The namespace of the resource that's experiencing the issue.
affected_pod: The specific pod that's facing the error.
affected_secretProviderClass: The SecretProviderClass that's causing the error.

This will improve the observability and debugging experience esp in large-scale deployments with multiple namespaces and resources.

Environment:

  • Secrets Store CSI Driver version: (use the image tag): v1.3.4
  • Kubernetes version: (use kubectl version):v1.25.6
@Suraiya-Hameed Suraiya-Hameed added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 15, 2023
@Suraiya-Hameed
Copy link
Contributor Author

Suraiya-Hameed commented Jan 23, 2024

I'm working on a PR to add 3 additional labels (pod_name, pod_namespace and secret_provider_class) to existing metrics.

Decision Point:
The key decision here is whether to update the existing metrics or introduce new metrics with the added labels. I'm wondering if adding labels to existing metrics will lead to backward compatibility issue.

@aramase Is there a specific considerations or preferences around this? Also please share any documentation or convention around adding new version of existing metrics, if that is the recommended route

@aramase
Copy link
Member

aramase commented Jan 25, 2024

I'm working on a PR to add 3 additional labels (pod_name, pod_namespace and secret_provider_class) to existing metrics.

I think pod_name and pod_namespace would be fine as additional labels. Why is SecretProviderClass required? How often in your deployment does the pod have greater than 1 volume referencing a different secret provider class?

@aramase Is there a specific considerations or preferences around this? Also please share any documentation or convention around adding new version of existing metrics, if that is the recommended route

I would recommend adding new labels to existing metrics. Adding new labels in the metric would now mean there are individual entries for each <pod_name>, <pod_namespace> in case of an error but if the aggregation is done with the old set of labels, there wouldn't be a difference.

@Suraiya-Hameed
Copy link
Contributor Author

Why is SecretProviderClass required? How often in your deployment does the pod have greater than 1 volume referencing a different secret provider class?

There are cases where secrets/certs needs to be fetched from multiple secrets stores, due to security practice around isolation and segmentation, multi-tenant environment etc.. and end up mounting multiple spc

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 24, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 24, 2024
@aramase
Copy link
Member

aramase commented Jun 4, 2024

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
None yet
4 participants