metrics: add utility to record metrics for sidecar #78

Open
wants to merge 1 commit into
base: main

@Nikhil-Ladha Nikhil-Ladha commented Nov 26, 2024

Enable Prometheus metrics for the sidecar using the csi-lib-utils metrics package. The HTTP server for metrics can be enabled by adding the httpEndpoint arg to the sidecar. The server records metrics for operations such as GetMetadataAllocated and GetMetadataDelta in the format
snapshot_metadata_controller_operation_total_seconds_bucket{driver_name="",operation_name="",operation_status="",target_snapshot="",base_snapshot="",le="0.1"}.
For GetMetadataAllocated operations, the "base_snapshot" value will be empty.
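
For readers unfamiliar with the `_bucket` suffix and the `le` label: Prometheus histograms are cumulative, so each bucket counts the observations less than or equal to its upper bound. The following is a minimal stdlib-only sketch of that counting rule. It is not code from this PR, and the bucket bounds and the `cumulativeBuckets` helper are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// metricName is the histogram bucket series named in the PR description.
const metricName = "snapshot_metadata_controller_operation_total_seconds_bucket"

// bucketBounds are illustrative upper bounds in seconds (an assumption;
// the bounds used by the sidecar may differ).
var bucketBounds = []float64{0.1, 0.25, 0.5, 1, 2.5}

// cumulativeBuckets returns, for each "le" bound, how many observed
// durations are less than or equal to that bound. The "+Inf" bucket
// always equals the total number of observations.
func cumulativeBuckets(observations []float64) map[string]int {
	counts := make(map[string]int, len(bucketBounds)+1)
	for _, bound := range bucketBounds {
		key := strconv.FormatFloat(bound, 'g', -1, 64)
		counts[key] = 0
		for _, o := range observations {
			if o <= bound {
				counts[key]++
			}
		}
	}
	counts["+Inf"] = len(observations)
	return counts
}

func main() {
	// Three hypothetical operation durations, in seconds.
	counts := cumulativeBuckets([]float64{0.05, 0.3, 3})
	for _, bound := range bucketBounds {
		le := strconv.FormatFloat(bound, 'g', -1, 64)
		fmt.Printf("%s{le=%q} %d\n", metricName, le, counts[le])
	}
	fmt.Printf("%s{le=%q} %d\n", metricName, "+Inf", counts["+Inf"])
}
```

In the real metric, the other labels (driver_name, operation_name, operation_status, target_snapshot, base_snapshot) select which histogram a duration is recorded in; the sketch omits them for brevity.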

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR enables recording prometheus metrics for the snapshot-metadata sidecar.

Which issue(s) this PR fixes:
Fixes #11

Does this PR introduce a user-facing change?:

Enable Prometheus metrics for the snapshot-metadata sidecar

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Nov 26, 2024

linux-foundation-easycla bot commented Nov 26, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Nikhil-Ladha / name: Nikhil Ladha (8331a93)

@k8s-ci-robot
Contributor

Welcome @Nikhil-Ladha!

It looks like this is your first PR to kubernetes-csi/external-snapshot-metadata 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-csi/external-snapshot-metadata has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @Nikhil-Ladha. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 26, 2024
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 26, 2024
@Nikhil-Ladha Nikhil-Ladha changed the title metrics: add basic common metric utilities metrics: add common metric utilities Nov 26, 2024
@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from 29ec31e to 187c10a Compare November 26, 2024 12:17
@Nikhil-Ladha
Author

/cc @Rakshith-R @iPraveenParihar

@k8s-ci-robot
Contributor

@Nikhil-Ladha: GitHub didn't allow me to request PR reviews from the following users: Rakshith-R, iPraveenParihar.

Note that only kubernetes-csi members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @Rakshith-R @iPraveenParihar


@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from 187c10a to 816790e Compare December 3, 2024 06:15
@Nikhil-Ladha
Author

Nikhil-Ladha commented Dec 5, 2024

Since the sidecar does not emit any operation metrics as of now, these are the only metrics reported by the server:

$ curl --http0.9 -X GET http://10.244.0.18:8080/metrics --output -
# HELP process_start_time_seconds [ALPHA] Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.73339252136e+09
# HELP snapshot_metadata_controller_operations_in_flight [ALPHA] Total number of operations in flight
# TYPE snapshot_metadata_controller_operations_in_flight gauge
snapshot_metadata_controller_operations_in_flight 0

P.S.: I have added unit tests to verify the other metrics utility functions. I hope that suffices for now.

@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from 816790e to f9be3e6 Compare December 7, 2024 06:18
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Dec 8, 2024
Contributor

@carlbraganza carlbraganza left a comment

Why don't you propose how basic timing of GetMetadataAllocated and GetMetadataDelta would be measured?

@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from f9be3e6 to 1d3a2a0 Compare December 17, 2024 05:56
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2024
@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from 1d3a2a0 to a3d4faf Compare December 20, 2024 07:57
@carlbraganza
Contributor

carlbraganza commented Dec 20, 2024

@Nikhil-Ladha I have a high level question: What was wrong with the original use of the github.com/kubernetes-csi/csi-lib-utils/metrics package? I'm sorry for being so slow in noticing this!

@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from a3d4faf to 00c3a4a Compare January 8, 2025 09:12
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Nikhil-Ladha
Once this PR has been reviewed and has the lgtm label, please assign carlbraganza for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 8, 2025
@Nikhil-Ladha
Author

@Nikhil-Ladha I have a high level question: What was wrong with the original use of the github.com/kubernetes-csi/csi-lib-utils/metrics package? I'm sorry for being so slow in noticing this!

@carlbraganza updated the PR based on our discussion offline, please take a look now.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 8, 2025
@Nikhil-Ladha Nikhil-Ladha changed the title metrics: add common metric utilities metrics: add utility to record metrics for sidecar Jan 8, 2025
Comment on lines 76 to 78
// Record metrics
s.config.Runtime.RecordMetricsWithLabels(opLabel, runtime.MetadataAllocatedOperationName, startTime, err)

Contributor

@carlbraganza carlbraganza Jan 9, 2025

This will not record the time taken for s.streamGetMetadataAllocatedResponse(ctx, stream, csiStream)! Also, this logic doesn't record earlier failures in the subroutine. The same is true for the Delta case.

I suggest the following changes to handle all cases with just one call to your record function:

  1. Change the method signature to make err a named return variable, e.g.
     func (s *Server) GetMetadataAllocated(req *api.GetMetadataAllocatedRequest, stream api.SnapshotMetadata_GetMetadataAllocatedServer) (err error) {
  2. Remove err from all auto-definitions - you will have to declare some variables explicitly where multiple auto-definitions occur, e.g.
     var (
         csiReq    *csi.GetMetadataAllocatedRequest
         csiStream csi.SnapshotMetadata_GetMetadataAllocatedClient
     )
  3. Make sure that the last statement assigns to err:
     err = s.streamGetMetadataAllocatedResponse(ctx, stream, csiStream)
  4. At the top of the function you can define something like this:
     defer func(startTime time.Time) {
         opLabel := map[string]string{
             runtime.LabelTargetSnapshotName: fmt.Sprintf("%s/%s", req.Namespace, req.SnapshotName),
         }
         s.config.Runtime.RecordMetricsWithLabels(opLabel, runtime.MetadataAllocatedOperationName, startTime, err)
     }(time.Now())
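
The pattern suggested above (a named err result plus one deferred recording call) can be tried in isolation. The sketch below is a standalone approximation, not the PR's code: the `recorder` type, its in-memory storage, and the `getMetadataAllocated` and `stream` functions are hypothetical stand-ins that only mirror the shape of the handler and of RecordMetricsWithLabels.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// recorded captures one call to the hypothetical recorder.
type recorded struct {
	operation string
	labels    map[string]string
	duration  time.Duration
	err       error
}

// recorder is a hypothetical in-memory stand-in for a metrics backend.
type recorder struct {
	calls []recorded
}

// RecordMetricsWithLabels mirrors the argument order used in the review
// comment; the implementation here only stores what it was given.
func (r *recorder) RecordMetricsWithLabels(labels map[string]string, op string, start time.Time, err error) {
	r.calls = append(r.calls, recorded{operation: op, labels: labels, duration: time.Since(start), err: err})
}

// stream stands in for the streaming subroutine and returns the error it is given.
func stream(e error) error { return e }

// getMetadataAllocated mimics the handler's shape: validate, then stream.
// err is a named result, so the deferred closure observes its final value
// on every return path, including early failures.
func getMetadataAllocated(r *recorder, namespace, snapshot string, streamErr error) (err error) {
	// time.Now() is evaluated here, when the defer statement executes.
	defer func(start time.Time) {
		labels := map[string]string{"target_snapshot": fmt.Sprintf("%s/%s", namespace, snapshot)}
		r.RecordMetricsWithLabels(labels, "GetMetadataAllocated", start, err)
	}(time.Now())

	if snapshot == "" {
		// An explicit return value is assigned to err before deferred calls run.
		return errors.New("snapshot name is required")
	}

	err = stream(streamErr)
	return err
}

func main() {
	r := &recorder{}
	_ = getMetadataAllocated(r, "default", "snap-1", nil)
	_ = getMetadataAllocated(r, "default", "", nil)
	for _, c := range r.calls {
		fmt.Printf("%s %v err=%v\n", c.operation, c.labels, c.err)
	}
}
```

Because the recording happens in the deferred closure, the measured duration also covers the streaming step, which addresses the first point of the review comment.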

Author

I purposefully didn't track failures in the other subroutines: if the operation hasn't even reached the CSI driver, what's the use of tracking metrics for it? As for the stream response, I wasn't sure whether we should be recording that.
Anyway, this seems fine as well and I will update the code accordingly.

One question, though: why do I need to do this?

Remove err from all auto-definitions - you will have to declare some variables explicitly where multiple auto-definitions occur. e.g.

The auto-definitions use the same named err variable, so we should still be able to track the err value when recording the metrics at the end. Can you please help me understand what difference that makes?

Contributor

Hmm... you may be right that it won't get redefined. I don't know the latest Go semantics, but some experiments in the playground show that it is much more restrictive than before. Please check! Also, the UTs should help ensure that existing behavior continues.
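
The redefinition question can be checked with a small standalone program. The sketch below is illustrative and not taken from the PR; all function names are hypothetical. It relies on two Go rules: `:=` at the top level of a function body reuses a named result declared in the signature, and an explicit `return x` assigns x to the named result before deferred functions run. A shadowed err only escapes the deferred closure when it is handled in a nested scope and never returned.

```go
package main

import (
	"errors"
	"fmt"
)

var errBoom = errors.New("boom")

// fail always returns errBoom.
func fail() (int, error) { return 0, errBoom }

// observeSameScope uses := at the top level of the function body. The named
// result err belongs to that scope, so := assigns to it rather than
// declaring a new variable, and a bare return reports the failure.
func observeSameScope(seen *error) (err error) {
	defer func() { *seen = err }()
	v, err := fail() // v is new; err is the named result
	_ = v
	return
}

// observeNested declares a shadowing err inside the if statement, but the
// explicit return copies it into the named result before the defer runs.
func observeNested(seen *error) (err error) {
	defer func() { *seen = err }()
	if v, err := fail(); err != nil {
		_ = v
		return err
	}
	return nil
}

// observeSwallowed handles the shadowing err without returning it, so the
// named result stays nil and the deferred closure sees no failure.
func observeSwallowed(seen *error) (err error) {
	defer func() { *seen = err }()
	if _, err := fail(); err != nil {
		fmt.Println("handled locally:", err)
	}
	return
}

func main() {
	var seen error
	fmt.Println(observeSameScope(&seen), seen)
	fmt.Println(observeNested(&seen), seen)
	fmt.Println(observeSwallowed(&seen), seen)
}
```

Under these rules, a deferred recording call sees whatever error the function actually returns, whether or not the auto-definitions are rewritten.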

Author

Updated the code accordingly, please take a look.
Also added UTs for the metrics wherever necessary.

@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from 00c3a4a to adcd2a3 Compare January 15, 2025 10:21
@Rakshith-R
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 17, 2025
@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from adcd2a3 to fd8d379 Compare January 17, 2025 12:41
Enable Prometheus metrics for the sidecar using the csi-lib-utils
metrics package. The httpServer for metrics can be enabled by adding
the httpEndpoint arg to the sidecar. The server will record metrics
for operations like GetMetadataAllocated, GetMetadataDelta in the format
`snapshot_metadata_controller_operation_total_seconds_bucket{driver_name="",operation_name="",operation_status="",target_snapshot="",base_snapshot="",le="0.1"}`.
For GetMetadataAllocated operations, the "base_snapshot" value will be empty.

Signed-off-by: Nikhil-Ladha <[email protected]>
@Nikhil-Ladha Nikhil-Ladha force-pushed the issues11/add-prometheus-metrics branch from fd8d379 to 8331a93 Compare January 17, 2025 12:43
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Expose Prometheus metrics from the sidecar server
4 participants