New RFE: Monitoring Kadalu Kubernetes Storage #25
Conversation
Signed-off-by: Aravinda Vishwanathapura <[email protected]>
@leelavg ^^
Thanks @aravindavk for the detailed design. I think we can divide this into two PRs: metrics, and alerts/events.
Yeah, agreed. Maybe multiple PRs are required for the metrics itself: one for the framework and other PRs for each metric type.
- *Storage Units Utilization*
- *Storage units/bricks CPU, Memory and Uptime metrics*
- *CSI Provisioner CPU, Memory and Uptime metrics*
- *CSI Node plugins CPU, Memory and Uptime metrics*
Any idea on how to deploy nodeplugin/exporter.py?
@vatsa287 I guess there's no separate deployment strategy for the nodeplugin; it'll be the same as the provisioner.
- However, as the role and containers in the pods are different, the same port mapping can be used.
To be frank, I'm not familiar with Prometheus workings/concepts yet, so please keep that in mind while addressing the comments.
I didn't review for grammatical errors; when you revisit, fix them if you feel like it 😅
General Queries:
- Will we be storing collected metrics until Prometheus performs a scrape?
- Are we targeting full implementation for 0.8.6 itself?
- I hope we'll be reusing the existing `exporter.py`
- Will this RFC be amended with nested structures w.r.t. metrics, or will that be documented as part of the implementation?
I might have more queries when I see the actual implementation; for now, addressing these will get me started on looking into Prometheus, thanks.
Kadalu Operator runs `kubectl get pods -n kadalu` to get the list of all the resources available in the Kadalu namespace. Additionally, it fetches the nodes list and all the Storage information from the ConfigMap. With this information, a few metrics will be derived as follows.
- Number of up CSI node plugins, by comparing the list of nodes with the list returned by the `get pods` command.
- Might need to adjust the metrics, or add a note, to account for taints & tolerations on nodes
Ack. To start with, we can show `up_node_plugins` or something similar.
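For illustration, a minimal sketch of how such an `up_node_plugins` count could be derived from the node list and the pods returned by `kubectl get pods -n kadalu`. The pod-name prefix and metric names below are assumptions, not part of the RFC.

[source,python]
----
import json
import subprocess


def kubectl_json(*args):
    """Run a kubectl command with -o json and return the parsed output."""
    out = subprocess.check_output(["kubectl", *args, "-o", "json"])
    return json.loads(out)


def up_node_plugins_metric():
    """Count running nodeplugin pods against the total number of nodes."""
    nodes = kubectl_json("get", "nodes")
    pods = kubectl_json("get", "pods", "-n", "kadalu")

    total_nodes = len(nodes["items"])
    running_plugins = sum(
        1
        for pod in pods["items"]
        # Assumed naming convention for nodeplugin pods.
        if pod["metadata"]["name"].startswith("kadalu-csi-nodeplugin")
        and pod["status"].get("phase") == "Running"
    )
    return {"up_node_plugins": running_plugins, "total_nodes": total_nodes}
----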
Metrics related to the resource counts.
- Number of Storages
- "Number of Storage Pools" might be a good phrase?
- Will this be just a number, or a nested structure differentiating `type` and `kadalu_format` etc.?
Will this be just a number, or a nested structure differentiating `type` and `kadalu_format` etc.?
Necessary labels should be present for Prometheus. With the JSON format, this need not be a separate metric; it can be derived from `len(metrics.storages)`.
"Number of Storage Pools" might be a good phrase?
Ack
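To make the `len(metrics.storages)` point concrete, here is a rough sketch of what the JSON payload might look like; the field names follow the discussion above, while the example values and metric names are made up.

[source,python]
----
# Illustrative JSON metrics payload; names and values are examples only.
metrics = {
    "storages": [
        {"name": "storage-pool-1", "type": "Replica3", "kadalu_format": "native"},
        {"name": "storage-pool-2", "type": "Replica1", "kadalu_format": "native"},
    ]
}

# The Storage Pool count is derived, not stored as a separate metric.
number_of_storage_pools = len(metrics["storages"])

# For Prometheus, the same data can carry labels, allowing per-type queries,
# e.g. kadalu_storage_pools{type="Replica3", kadalu_format="native"} 1
print(number_of_storage_pools)  # -> 2
----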
==== Health Metrics
Metrics related to the state of the resources. |
- Does this mean we'll make data available to the user from which the below states can be inferred?
- Same for the remaining "How to" questions
==== Events
A few Events can be derived from the collected metrics by comparing the latest data with the previously collected metrics. For example,
- Will we be storing "previously collected metrics" to derive the events?
Not all historical data, only the previous cycle's metrics. This need not be persistent; an Operator restart will start fresh (on Operator restart, a few events may get missed).
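A minimal in-memory sketch of this previous-cycle comparison, assuming a hypothetical `derive_events()` helper inside the Operator; nothing here is persisted, so a restart starts fresh as noted above, and the event names are placeholders.

[source,python]
----
# Previous cycle metrics kept only in memory; lost on Operator restart.
previous_metrics = None


def derive_events(current_metrics):
    """Compare the current cycle with the previous one and emit events."""
    global previous_metrics
    events = []
    if previous_metrics is not None:
        prev_pvs = previous_metrics.get("number_of_pvs", 0)
        curr_pvs = current_metrics.get("number_of_pvs", 0)
        if curr_pvs > prev_pvs:
            events.append({"type": "PVCreated", "count": curr_pvs - prev_pvs})
        elif curr_pvs < prev_pvs:
            events.append({"type": "PVDeleted", "count": prev_pvs - curr_pvs})
    previous_metrics = current_metrics
    return events
----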
- *Number of Storage pools*
- *Number of PVs*
- *Number of Storage Units/Bricks*
- *Operator Health* - Operator is running or not
- "Operator is running or not" with desired state ig?
- *Health of Metrics exporter*
- *CSI Provisioner Health*
- *CSI/Quotad health*
- *CSI/Mounts health* (Based on the expected number of Volumes in the ConfigMap and the number of mount processes). The Gluster client process will continue to run even if all the bricks are down; it waits for the brick processes and re-connects as soon as they are available. Detect this by doing a regular IO from the mount or by parsing the log files for `ENOTCONN` errors.
"regular IO from the mount"
- Please clarify which mount will be used for performing this op, the provisioner with some test dir or a new pod etc?
From the mount available in the CSI provisioner pod.
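A hedged sketch of the two detection approaches mentioned above, using the mount available in the CSI provisioner pod; the probe file name and the log path handling are assumptions for illustration.

[source,python]
----
import os


def mount_is_healthy(mount_path):
    """Do a small write on the mount; an OSError (e.g. ENOTCONN) means unhealthy."""
    probe = os.path.join(mount_path, ".kadalu_health_probe")
    try:
        with open(probe, "w") as probe_file:
            probe_file.write("ok")
        os.remove(probe)
        return True
    except OSError:
        return False


def log_has_enotconn(log_path):
    """Fallback: scan the Gluster client log for ENOTCONN errors."""
    try:
        with open(log_path) as log_file:
            return any("ENOTCONN" in line for line in log_file)
    except FileNotFoundError:
        return False
----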
[source,yaml]
----
annotations:
- The scrape interval is configurable by the user; would another annotation suffice here?
Prometheus is pull based, which means it calls the APIs and collects the metrics. A metric exporter should not have its own scrape interval: https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling
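As a sketch of the pull-based model, here is a minimal exporter using the `prometheus_client` library where collection happens only inside `collect()`, i.e. on each scrape, with no scrape interval of its own. The metric names, port, and `collect_metrics()` helper are placeholders, not the actual Kadalu exporter.

[source,python]
----
import time

from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY


def collect_metrics():
    # Placeholder for the actual collection (kubectl/ConfigMap based).
    return {"storage_pools": 2, "pvs": 10}


class KadaluCollector:
    """Collects metrics only when Prometheus scrapes the endpoint."""

    def collect(self):
        data = collect_metrics()  # runs once per scrape, no internal timer
        yield GaugeMetricFamily(
            "kadalu_storage_pools", "Number of Storage Pools",
            value=data["storage_pools"])
        yield GaugeMetricFamily(
            "kadalu_pvs", "Number of PVs", value=data["pvs"])


if __name__ == "__main__":
    REGISTRY.register(KadaluCollector())
    start_http_server(8050)  # port is an assumption
    while True:
        time.sleep(60)
----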
Yeah, I wrote it in one flow. I will review it once for grammatical errors.
Yes. Prometheus definitions will now move to
I will add some more details soon.
Thanks.
Thanks for the info. Will get going 😄.
Signed-off-by: Aravinda Vishwanathapura <[email protected]>