New RFE: Monitoring Kadalu Kubernetes Storage #25

Open
wants to merge 1 commit into
base: main

Conversation

aravindavk
Member

Signed-off-by: Aravinda Vishwanathapura <[email protected]>
@aravindavk aravindavk requested review from amarts and vatsa287 August 29, 2021 13:42
@aravindavk
Member Author

@leelavg ^^

Member

@vatsa287 vatsa287 left a comment

Thanks @aravindavk for the detailed design. I think we can divide this into two PRs: one for metrics, and one for alerts and events.

@aravindavk
Member Author

Thanks @aravindavk for the detailed design. I think we can divide this into two PRs: one for metrics, and one for alerts and events.

Yeah, agreed. Multiple PRs may be required for metrics itself: one for the framework and separate PRs for each metric type.

- *Storage Units Utilization*
- *Storage units/bricks CPU, Memory and Uptime metrics*
- *CSI Provisioner CPU, Memory and Uptime metrics*
- *CSI Node plugins CPU, Memory and Uptime metrics*
Member

Any idea on how to deploy nodeplugin/exporter.py?


@vatsa287 I guess there's no separate deployment strategy for the nodeplugin; it'll be the same as the provisioner.

  • However, as the roles and containers in the pods are different, the same port mapping can be used.


@leelavg leelavg left a comment

To be frank, I'm not familiar with Prometheus workings/concepts yet, so please consider that while addressing the comments.
I didn't review for grammatical errors; when you re-visit, fix them if you feel like it 😅

General Queries:

  • Will we be storing collected metrics until Prometheus performs a scrape?
  • Are we targeting the full implementation for 0.8.6 itself?
  • I hope we'll be reusing the existing exporter.py.
  • Will this RFC be amended with the nested structures w.r.t. metrics, or will they be documented as part of the implementation?

I might have more queries when I see the actual implementation; for now, addressing these will get me started looking into Prometheus, thanks.


Kadalu Operator runs `kubectl get pods -n kadalu` to get the list of all the resources available in the Kadalu namespace. Additionally, it fetches the nodes list and all the Storage information from the ConfigMap. With this information, a few metrics will be derived as follows.

- Number of Up CSI node plugins, by comparing the list of nodes with the list returned by the `get pods` command.
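The comparison described above can be sketched as follows. This is a minimal illustration; the pod-name prefix `kadalu-csi-nodeplugin` and the data shapes are assumptions for this sketch, not the actual Operator code:

```python
# Sketch: count nodes that have a running CSI node-plugin pod.
# Pod-name prefix and tuple shape are assumptions for illustration.

def up_node_plugins(nodes, pods):
    """nodes: list of node names; pods: list of (pod_name, node_name, phase)."""
    plugin_nodes = {
        node for (name, node, phase) in pods
        if name.startswith("kadalu-csi-nodeplugin") and phase == "Running"
    }
    # Only count plugin pods that are scheduled on known nodes
    return len(plugin_nodes & set(nodes))

nodes = ["node1", "node2", "node3"]
pods = [
    ("kadalu-csi-nodeplugin-abc", "node1", "Running"),
    ("kadalu-csi-nodeplugin-def", "node2", "Running"),
    ("kadalu-csi-provisioner-0", "node1", "Running"),
]
print(up_node_plugins(nodes, pods))  # prints 2: node3 has no running plugin
```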

  • Might need to adjust the metrics, or add a note, corresponding to taints & tolerations on nodes

Member Author

Ack. To start with, we can show `up_node_plugins` or something similar.


Metrics related to the resource counts.

- Number of Storages

  • "Number of Storage Pools" might be a better phrase?
  • Will this be just a number, or a nested structure differentiating type, kadalu_format, etc.?

Member Author

Will this be just a number or a nested structure differentiating type and kadalu_format etc?

Necessary labels should be present for Prometheus. With the JSON format, this need not be a separate metric; it can be derived from `len(metrics.storages)`.

Number of Storage Pools might be a good phrase?

Ack
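As a sketch of that derivation (the JSON shape shown here is illustrative, not the final exporter format):

```python
# Sketch: with metrics exported as JSON, the number of storage pools
# needs no separate metric; it falls out of the list length.
import json

payload = json.loads("""
{
  "storages": [
    {"name": "storage-pool-1", "type": "Replica3"},
    {"name": "storage-pool-2", "type": "Replica1"}
  ]
}
""")

num_storage_pools = len(payload["storages"])
print(num_storage_pools)  # prints 2
```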


==== Health Metrics

Metrics related to the state of the resources.

  • Does this mean we'll make data available to the user from which the below states can be inferred?
  • Same for the remaining "how to" questions


==== Events

A few Events can be derived from the collected metrics by comparing the latest data with the previously collected metrics. For example,

  • Will we be storing "previously collected metrics" to derive the events?

Member Author

Not all historical data, only the previous cycle's metrics. This need not be persistent; an Operator restart will start fresh (on Operator restart, a few events may get missed).
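A minimal sketch of that comparison, assuming hypothetical component-health snapshots (these names and shapes are illustrative, not the actual Operator data structures):

```python
# Sketch: keep only the previous cycle's snapshot in memory and emit
# an event whenever a component's state flips between cycles.

previous = {"operator": "Up", "csi_provisioner": "Up"}
current = {"operator": "Up", "csi_provisioner": "Down"}

def derive_events(prev, curr):
    events = []
    for component, state in curr.items():
        if prev.get(component) != state:
            events.append(f"{component} changed: {prev.get(component)} -> {state}")
    return events

print(derive_events(previous, current))
# prints ['csi_provisioner changed: Up -> Down']
```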

- *Number of Storage pools*
- *Number of PVs*
- *Number of Storage Units/Bricks*
- *Operator Health* - Operator is running or not

  • "Operator is running or not" compared with the desired state, I guess?

- *Health of Metrics exporter*
- *CSI Provisioner Health*
- *CSI/Quotad health*
- *CSI/Mounts health* (based on the expected number of Volumes in the ConfigMap and the number of mount processes). The Gluster client process will continue to run even if all the bricks are down; it waits for the brick processes and re-connects as soon as they are available. Detect this by doing regular IO from the mount or by parsing the log files for `ENOTCONN` errors.

"regular IO from the mount"

  • Please clarify which mount will be used for performing this op: the provisioner with some test dir, or a new pod, etc.?

Member Author

From the mount available in the CSI provisioner pod.
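A minimal sketch of such an IO-based probe from that mount (the probe filename and error handling are assumptions for illustration, not Kadalu's actual implementation):

```python
# Sketch: probe a Gluster mount by writing and removing a tiny file.
# ENOTCONN ("Transport endpoint is not connected") from the mount means
# the client process is alive but has lost all brick connections.
import errno
import os

def mount_healthy(mount_path):
    probe = os.path.join(mount_path, ".kadalu-health-probe")  # hypothetical name
    try:
        with open(probe, "w") as probe_file:
            probe_file.write("ok")
        os.remove(probe)
        return True
    except OSError as err:
        if err.errno == errno.ENOTCONN:
            # All bricks down: client waits for them to return
            pass
        return False
```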



[source,yaml]
----
annotations:

  • The scrape interval is configurable by the user, so would another annotation suffice here?

Member Author

Prometheus is pull based: it calls the exporter APIs and collects the metrics. Metric exporters should not have their own scrape interval: https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling
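For reference, the truncated annotation block quoted above typically looks like this when complete. The port and path values are illustrative; note that the `prometheus.io/*` annotations are a convention honored by common Prometheus scrape configurations, not a built-in Kubernetes feature:

[source,yaml]
----
annotations:
  prometheus.io/scrape: "true"   # let Prometheus discover this pod
  prometheus.io/port: "8050"     # illustrative metrics port
  prometheus.io/path: "/metrics" # illustrative metrics path
----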

@aravindavk
Member Author

To be frank, I'm not familiar with Prometheus workings/concepts yet, so please consider that while addressing the comments.
I didn't review for grammatical errors; when you re-visit, fix them if you feel like it 😅

Yeah, I wrote it in one flow. I will review it once for grammatical errors.

General Queries:

  • Will we be storing collected metrics until Prometheus performs a scrape?
    No. In future, the Operator may collect the metrics at a periodic interval and store two values (current and previous).
  • Are we targeting the full implementation for 0.8.6 itself?
    At least the framework and basic metrics. Advanced metrics, events and alerts are for the future.
  • I hope we'll be reusing the existing exporter.py

Yes. The Prometheus definitions will now move to operator/exporter. The code that collects metrics in CSI/exporter will still be used with csi/exporter, but it will export in JSON format.

  • Will this RFC be amended with the nested structures w.r.t. metrics, or will they be documented as part of the implementation?

I will add some more details soon.

I might have more queries when I see the actual implementation; for now, addressing these will get me started looking into Prometheus, thanks.

Thanks.

@leelavg

leelavg commented Aug 30, 2021

Thanks for the info. Will get going 😄.
