New RFE: Monitoring Kadalu Kubernetes Storage #25

Open
wants to merge 1 commit into
base: main

Conversation

aravindavk
Member

Signed-off-by: Aravinda Vishwanathapura <[email protected]>
@aravindavk aravindavk requested review from amarts and vatsa287 August 29, 2021 13:42
@aravindavk
Member Author

@leelavg ^^

Member

@vatsa287 vatsa287 left a comment

Thanks @aravindavk for the detailed design. I think we can divide this into two PRs: one for metrics, and one for alerts and events.

@aravindavk
Member Author

Thanks @aravindavk for the detailed design. I think we can divide this into two PRs: one for metrics, and one for alerts and events.

Yeah, agreed. Multiple PRs may be required for metrics itself: one for the framework and separate PRs for each metric type.

- *Storage Units Utilization*
- *Storage units/bricks CPU, Memory and Uptime metrics*
- *CSI Provisioner CPU, Memory and Uptime metrics*
- *CSI Node plugins CPU, Memory and Uptime metrics*
Member

Any idea on how to deploy nodeplugin/exporter.py?


@vatsa287 I guess there's no separate deployment strategy for the nodeplugin; it'll be the same as the provisioner.

  • However, as the roles and containers in the pods are different, the same port mapping can be used.


@leelavg leelavg left a comment

To be frank, I'm not familiar with Prometheus workings/concepts yet, so please consider that while addressing the comments.
I didn't review for grammatical errors; when you re-visit, fix them if you feel like it 😅

General Queries:

  • Will we be storing collected metrics until Prometheus performs a scrape?
  • Are we targeting the full implementation for 0.8.6 itself?
  • I hope we'll be reusing the existing exporter.py.
  • Will this RFC be amended with the nested structures w.r.t. metrics, or will they be documented as part of the implementation?

I might have more queries when I see the actual implementation; for now, addressing these will get me started looking into Prometheus, thanks.


Kadalu Operator runs `kubectl get pods -n kadalu` to get the list of all the resources available in the Kadalu namespace. Additionally, it fetches the nodes list and all the Storage information from the ConfigMap. With this information, a few metrics will be derived as follows.

- Number of Up CSI node plugins, by comparing the list of nodes with the list returned by the `get pods` command.
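The comparison described above can be sketched as follows. This is a minimal illustration; the pod-name prefix `kadalu-csi-nodeplugin` and the data shapes are assumptions for this sketch, not the actual Operator code:

```python
# Sketch: count nodes that have a running CSI node-plugin pod.
# Pod-name prefix and tuple shape are assumptions for illustration.

def up_node_plugins(nodes, pods):
    """nodes: list of node names; pods: list of (pod_name, node_name, phase)."""
    plugin_nodes = {
        node for (name, node, phase) in pods
        if name.startswith("kadalu-csi-nodeplugin") and phase == "Running"
    }
    # Only count plugin pods that are scheduled on known nodes
    return len(plugin_nodes & set(nodes))

nodes = ["node1", "node2", "node3"]
pods = [
    ("kadalu-csi-nodeplugin-abc", "node1", "Running"),
    ("kadalu-csi-nodeplugin-def", "node2", "Running"),
    ("kadalu-csi-provisioner-0", "node1", "Running"),
]
print(up_node_plugins(nodes, pods))  # prints 2: node3 has no running plugin
```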

  • Might need to adjust the metrics, or add a note, corresponding to taints & tolerations on nodes

Member Author

Ack. To start with, we can show `up_node_plugins` or something similar.


Metrics related to the resource counts.

- Number of Storages

  • "Number of Storage Pools" might be a better phrase?
  • Will this be just a number, or a nested structure differentiating type, kadalu_format, etc.?

Member Author

Will this be just a number or a nested structure differentiating type and kadalu_format etc?

Necessary labels should be present for Prometheus. With the JSON format, this need not be a separate metric; it can be derived from `len(metrics.storages)`.

Number of Storage Pools might be a good phrase?

Ack
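As a sketch of that derivation (the JSON shape shown here is illustrative, not the final exporter format):

```python
# Sketch: with metrics exported as JSON, the number of storage pools
# needs no separate metric; it falls out of the list length.
import json

payload = json.loads("""
{
  "storages": [
    {"name": "storage-pool-1", "type": "Replica3"},
    {"name": "storage-pool-2", "type": "Replica1"}
  ]
}
""")

num_storage_pools = len(payload["storages"])
print(num_storage_pools)  # prints 2
```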


==== Health Metrics

Metrics related to the state of the resources.

  • Does this mean we'll make data available to the user from which the below states can be inferred?
  • Same for the remaining "how to" questions


==== Events

A few Events can be derived from the collected metrics by comparing the latest data with the previously collected metrics. For example,

  • Will we be storing "previously collected metrics" to derive the events?

Member Author

Not all historical data, only the previous cycle's metrics. This need not be persistent; an Operator restart will start fresh (on Operator restart, a few events may get missed).
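A minimal sketch of that comparison, assuming hypothetical component-health snapshots (these names and shapes are illustrative, not the actual Operator data structures):

```python
# Sketch: keep only the previous cycle's snapshot in memory and emit
# an event whenever a component's state flips between cycles.

previous = {"operator": "Up", "csi_provisioner": "Up"}
current = {"operator": "Up", "csi_provisioner": "Down"}

def derive_events(prev, curr):
    events = []
    for component, state in curr.items():
        if prev.get(component) != state:
            events.append(f"{component} changed: {prev.get(component)} -> {state}")
    return events

print(derive_events(previous, current))
# prints ['csi_provisioner changed: Up -> Down']
```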

- *Number of Storage pools*
- *Number of PVs*
- *Number of Storage Units/Bricks*
- *Operator Health* - Operator is running or not

  • "Operator is running or not" compared with the desired state, I guess?

- *Health of Metrics exporter*
- *CSI Provisioner Health*
- *CSI/Quotad health*
- *CSI/Mounts health* (based on the expected number of Volumes in the ConfigMap and the number of mount processes). The Gluster client process will continue to run even if all the bricks are down; it waits for the brick processes and re-connects as soon as they are available. Detect this by doing regular IO from the mount or by parsing the log files for `ENOTCONN` errors.

"regular IO from the mount"

  • Please clarify which mount will be used for performing this op: the provisioner with some test dir, or a new pod, etc.?

Member Author

From the mount available in the CSI provisioner pod.
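A minimal sketch of such an IO-based probe from that mount (the probe filename and error handling are assumptions for illustration, not Kadalu's actual implementation):

```python
# Sketch: probe a Gluster mount by writing and removing a tiny file.
# ENOTCONN ("Transport endpoint is not connected") from the mount means
# the client process is alive but has lost all brick connections.
import errno
import os

def mount_healthy(mount_path):
    probe = os.path.join(mount_path, ".kadalu-health-probe")  # hypothetical name
    try:
        with open(probe, "w") as probe_file:
            probe_file.write("ok")
        os.remove(probe)
        return True
    except OSError as err:
        if err.errno == errno.ENOTCONN:
            # All bricks down: client waits for them to return
            pass
        return False
```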



[source,yaml]
----
annotations:

  • The scrape interval is configurable by the user, so would another annotation suffice here?

Member Author

Prometheus is pull based: it calls the exporter APIs and collects the metrics. Metric exporters should not have their own scrape interval: https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling
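For reference, the truncated annotation block quoted above typically looks like this when complete. The port and path values are illustrative; note that the `prometheus.io/*` annotations are a convention honored by common Prometheus scrape configurations, not a built-in Kubernetes feature:

[source,yaml]
----
annotations:
  prometheus.io/scrape: "true"   # let Prometheus discover this pod
  prometheus.io/port: "8050"     # illustrative metrics port
  prometheus.io/path: "/metrics" # illustrative metrics path
----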

@aravindavk
Member Author

To be frank, I'm not familiar with Prometheus workings/concepts yet, so please consider that while addressing the comments.
I didn't review for grammatical errors; when you re-visit, fix them if you feel like it 😅

Yeah, I wrote it in one flow. I will review it once for grammatical errors.

General Queries:

  • Will we be storing collected metrics until Prometheus performs a scrape?
    No. In future, the Operator may collect the metrics at a periodic interval and store two values (current and previous).
  • Are we targeting the full implementation for 0.8.6 itself?
    At least the framework and basic metrics. Advanced metrics, events and alerts are for the future.
  • I hope we'll be reusing the existing exporter.py

Yes. The Prometheus definitions will now move to operator/exporter. The code that collects metrics in CSI/exporter will still be used with csi/exporter, but it will export in JSON format.

  • Will this RFC be amended with the nested structures w.r.t. metrics, or will they be documented as part of the implementation?

I will add some more details soon.

I might have more queries when I see the actual implementation; for now, addressing these will get me started looking into Prometheus, thanks.

Thanks.

@leelavg

leelavg commented Aug 30, 2021

Thanks for the info. Will get going 😄.
