New RFE: Monitoring Kadalu Kubernetes Storage #25

Open: wants to merge 1 commit into base main
text/0008-kadalu-kubernetes-storage-monitoring.adoc: 153 additions, 0 deletions
---
start_date: 2021-08-29
rfc_pr: https://github.com/kadalu/rfcs/pull/0000
status: SUBMITTED
available_since: (leave this empty)
---

= Monitoring Kadalu Kubernetes Storage

== Summary

Kadalu Storage pods are distributed across the nodes, so it is difficult to collect metrics from all the Pods and aggregate the details into something meaningful. Even though each Pod can export its own metrics and set the annotations required to be discovered by the Prometheus server, it is difficult to add Prometheus rules and alerts without full knowledge of the Gluster cluster. It is also difficult to fetch these metrics when a non-Prometheus application wants to monitor the cluster.

The proposed solution collects the metrics from all the Pods managed by the Kadalu Operator and exposes the *Metrics* and *Events* through a single API.

== Authors

- Aravinda Vishwanathapura <[email protected]>


== Motivation

- Business logic in one place - if the metrics are exported by individual Pods, the business logic for making decisions has to be added to the Prometheus server.
- Get all metrics in one place.
- Events and Alerts - the Kadalu Operator can generate events and alerts based on the collected metrics.
- The Kadalu Operator gets more power - it can make better decisions for internal maintenance activities depending on the cluster state, for example a smart rolling upgrade based on heal-pending and server Pod health metrics.
- Supports non-Prometheus consumers.
- Additional information exported for non-Prometheus API users - for example, the list of configured Volume options, version information etc.
- External monitoring integration is easy - since all metrics are available in a single place, it is easy to integrate them with external cloud-hosted monitoring solutions.

== Detailed design

The Kadalu Operator runs `kubectl get pods -n kadalu` to get the list of all resources available in the Kadalu namespace. Additionally, it fetches the list of nodes and all the Storage information from the ConfigMap. From this information, a few metrics are derived as follows.

- Number of Up CSI node plugins, by comparing the list of nodes with the list returned by the `get pods` command.

Review comment: Might need to adjust the metric, or add a note, to account for taints and tolerations on nodes.

Author reply: Ack. To start with we can show `up_node_plugins` or something.

- Pod health based on the "Running" state - Up if all the containers in the Pod are up, Partial if only some of the containers are up, and Down if no container is Running. The Operator also knows the expected containers or processes, so based on the state of the Pod it marks the state of the respective processes as well. For example, if a server Pod is down, the server, self-heal daemon and exporter states are all marked as down. The type of Storage is fetched from the ConfigMap; the self-heal daemon is marked as down only if the Volume type is Replicate or Disperse. (See the sketch after this list.)
- Pod Uptime in seconds.
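
A minimal sketch of how the Operator could derive these values from `kubectl get ... -o json` output is shown below; the helper names, the `kadalu-csi-nodeplugin` name prefix and the `up_node_plugins` metric name are illustrative assumptions, not part of the final design.

[source,python]
----
# Illustrative sketch only: derive node-plugin and Pod health states from
# `kubectl get ... -o json` output. Names such as `up_node_plugins` and the
# nodeplugin Pod name prefix are placeholders.
import json
import subprocess


def kubectl_get_json(resource, namespace=None):
    """Run `kubectl get <resource> -o json` and return the parsed output."""
    cmd = ["kubectl", "get", resource, "-o", "json"]
    if namespace:
        cmd += ["-n", namespace]
    return json.loads(subprocess.check_output(cmd))


def pod_health(pod):
    """Up if all containers are Running, Partial if only some, else Down."""
    statuses = pod["status"].get("containerStatuses", [])
    running = [s for s in statuses if s.get("state", {}).get("running")]
    if statuses and len(running) == len(statuses):
        return "Up"
    return "Partial" if running else "Down"


def node_plugin_metrics():
    nodes = kubectl_get_json("nodes")["items"]
    pods = kubectl_get_json("pods", namespace="kadalu")["items"]
    plugins = [p for p in pods
               if p["metadata"]["name"].startswith("kadalu-csi-nodeplugin")]
    up = sum(1 for p in plugins if pod_health(p) == "Up")
    # Nodes with taints may legitimately not run a node plugin (see the
    # review note above), so "up" is reported alongside the raw node count.
    return {"nodes": len(nodes), "up_node_plugins": up}
----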

Instead of exporting the metrics keyed by Pod name, the Kadalu Operator organizes them into a proper hierarchy. For example, metrics from the server Pods are added to the respective brick in the Volume info.

From the collected list of Pod IPs, call the HTTP API (`GET /_api/metrics`) for each IP and collect the metrics (only if the Pod state is Running). If any IP produces connection-refused errors, mark that Pod's health as Down/Unknown.
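
The collection step could look roughly like the sketch below; only the `/_api/metrics` path comes from this proposal, while the port, timeout and helper name are placeholder assumptions.

[source,python]
----
# Illustrative sketch of the Operator-side collection loop. Only the
# /_api/metrics path is from this proposal; port and timeout are placeholders.
import json
import urllib.error
import urllib.request


def collect_pod_metrics(pod_ips, port=8000, timeout=5):
    """Call GET /_api/metrics on each Running Pod IP and aggregate the results."""
    collected = {}
    for ip in pod_ips:
        url = "http://%s:%d/_api/metrics" % (ip, port)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                collected[ip] = json.loads(resp.read())
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: mark this Pod as Down/Unknown.
            collected[ip] = {"health": "Down/Unknown"}
    return collected
----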

Once the metrics are collected from all the Pods, post-process them and re-expose them under a single API.

.Prometheus Metrics
----
GET /metrics
----

.JSON output
----
GET /metrics.json
----

=== Types of Metrics

==== Count Metrics

Metrics related to the resource counts.

- Number of Storages

Review comment: "Number of Storage Pools" might be a better phrase. Will this be just a number, or a nested structure differentiating type, kadalu_format etc.?

Author reply: The necessary labels should be present for Prometheus. With the JSON format this need not be a separate metric; it can be derived from `len(metrics.storages)`. Ack on the "Number of Storage Pools" phrasing.

- Number of PVs
- Number of Pods in the Kadalu namespace.
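
As noted in the reply above, the Prometheus form of these counters would carry labels. Below is a minimal sketch using the `prometheus_client` library; the metric names, label names and the shape of the parsed ConfigMap data are assumptions for illustration.

[source,python]
----
# Illustrative sketch: count metrics with labels so that storage type,
# kadalu_format etc. can be told apart. All names here are placeholders.
from collections import Counter

from prometheus_client import Gauge

STORAGE_POOLS = Gauge("kadalu_storage_pools",
                      "Number of Kadalu storage pools",
                      ["type", "kadalu_format"])
PV_COUNT = Gauge("kadalu_pv_count", "Number of PVs", ["storage_pool"])


def update_counts(storages):
    """`storages` is assumed to be the Storage info parsed from the ConfigMap."""
    groups = Counter((pool["type"], pool.get("kadalu_format", "native"))
                     for pool in storages)
    for (ptype, pformat), count in groups.items():
        STORAGE_POOLS.labels(type=ptype, kadalu_format=pformat).set(count)
    for pool in storages:
        PV_COUNT.labels(storage_pool=pool["name"]).set(len(pool.get("pvs", [])))
----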

==== Health Metrics

Metrics related to the state of the resources.
Review comment: Does this mean we'll make data available to the user from which the states below can be inferred? Same for the remaining "How to" questions.

- How to make sure that the Kadalu Operator is running successfully?
- How to know if the CSI Provisioner has stopped processing further requests?
- How to know if the self-heal daemon has stopped working?
- How to know if a resource has restarted?

==== Utilization Metrics

Metrics related to resource utilization, such as Storage, CPU, Memory, inodes etc.

- How to know which Storage pool is getting full?
- How to know if any resource is consuming too much CPU or Memory?

==== Performance Metrics

Metrics related to Storage I/O performance, provisioning and mount performance.

- How to measure whether Storage performance has improved or degraded?
- How quickly is a PV provisioned?

==== Events

A few Events can be derived by comparing the latest data with the previously collected metrics, for example (a small sketch of this comparison follows the list):

Review comment: Will we be storing "previously collected metrics" to derive the events?

Author reply: Not all historical data, only the previous cycle's metrics. This need not be persistent; an Operator restart will start fresh (so a few events may get missed around a restart).


- *Restarted Event* - If the current uptime of a resource is less than the previous one.
- *Upgraded Event* - If the current version is greater than the previous one.
- *Option Change* - If the list of Volume options is different from the previously collected options data.
- *New Storage added* - When the latest list of metrics shows a new Storage compared to previously collected data.
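
A minimal sketch of this comparison, keeping only the previous cycle in memory as described in the reply above; the event and field names are illustrative.

[source,python]
----
# Illustrative sketch: derive events by comparing the previous collection
# cycle with the current one. Only the previous cycle is kept in memory, so
# an Operator restart starts fresh and may miss a few events.
def derive_events(previous, current):
    events = []
    for name, cur in current.items():
        prev = previous.get(name)
        if prev is None:
            events.append({"type": "new_storage_added", "resource": name})
            continue
        if cur.get("uptime_seconds", 0) < prev.get("uptime_seconds", 0):
            events.append({"type": "restarted", "resource": name})
        if cur.get("version") and cur.get("version") != prev.get("version"):
            # Version changed since the previous cycle; treated as an upgrade.
            events.append({"type": "upgraded", "resource": name})
        if cur.get("options") != prev.get("options"):
            events.append({"type": "option_change", "resource": name})
    return events
----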

These events will be exposed via the `GET /events` URL. They also help to understand the storage better when viewed along with the historical metrics, for example when a performance degradation is observed in a chart and an associated event says that an option was changed, one replica server Pod was down, or the cluster was upgraded to a new version.

==== Alerts

Not all events are useful; some can be very noisy. For example, an event for every utilization change is very noisy, but becomes important when it meets certain criteria (a small sketch follows this list):

- When utilization crosses 70%.
- When performance improves or degrades.
- When a Storage is down.
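
Since the Operator generates the alerts itself, the filtering can be a simple threshold check, as in the sketch below; the 70% figure comes from this proposal, while the field and alert names are placeholders.

[source,python]
----
# Illustrative sketch: promote noisy metrics/events to alerts only when they
# cross a threshold. The 70% utilization threshold is from this proposal.
UTILIZATION_THRESHOLD = 70.0


def derive_alerts(metrics):
    alerts = []
    for pool in metrics.get("storages", []):
        if pool.get("utilization_percent", 0) >= UTILIZATION_THRESHOLD:
            alerts.append({"type": "utilization_high", "storage": pool["name"]})
        if pool.get("health") == "Down":
            alerts.append({"type": "storage_down", "storage": pool["name"]})
    return alerts
----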

=== List of Metrics

- *Number of Storage pools*
- *Number of PVs*
- *Number of Storage Units/Bricks*
- *Operator Health* - whether the Operator is running or not

Review comment: "Operator is running or not" compared against the desired state, I guess?

- *Storage Units Health* - Health of Brick processes, `1 for Up, and 0 for Down`
- *Self Heal daemon Health*
- *Health of Metrics exporter*
- *CSI Provisioner Health*
- *CSI/Quotad health*
- *CSI/Mounts health* (based on the expected number of Volumes in the ConfigMap and the number of mount processes). The Gluster client process continues to run even if all the bricks are down; it waits for the brick processes and reconnects as soon as they are available. Detect this by doing a regular IO from the mount or by parsing the log files for `ENOTCONN` errors (see the sketch after this list).

Review comment: "regular IO from the mount" - please clarify which mount will be used for performing this op, the provisioner with some test dir or a new pod etc.?

Author reply: From the mount available in the CSI provisioner pod.

- *CSI/Storage and PV utilization*
- *Storage Units Utilization*
- *Storage units/bricks CPU, Memory and Uptime metrics*
- *CSI Provisioner CPU, Memory and Uptime metrics*
- *CSI Node plugins CPU, Memory and Uptime metrics*

Review comment (Member): Any idea on how to deploy nodeplugin/exporter.py?

Review comment: @vatsa287 I guess there's no separate deployment strategy for the nodeplugin; it'll be the same as the provisioner. Even though the roles and containers in the pods are different, the same port mapping can be used.

- *Heal Pending metrics* - Run the `glfsheal` command and get the heal-pending count per Volume.
- *CSI/Node plugin mounts health*
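
Referenced from the *CSI/Mounts health* item above: a rough sketch of the mount health check, assuming it runs against the mount available inside the CSI provisioner Pod; the paths and the log-scanning approach are illustrative.

[source,python]
----
# Illustrative sketch of the mount health check: perform a small IO on the
# mount inside the CSI provisioner Pod, and fall back to scanning the tail of
# the client log for ENOTCONN errors. Paths are placeholders.
import os


def mount_is_healthy(mount_path, log_path=None, log_tail_bytes=65536):
    try:
        # statvfs exercises the mount; it may fail (for example with ENOTCONN,
        # "Transport endpoint is not connected") when the client has lost its
        # brick connections.
        os.statvfs(mount_path)
    except OSError:
        return False
    if log_path and os.path.exists(log_path):
        with open(log_path, "rb") as logf:
            logf.seek(0, os.SEEK_END)
            logf.seek(max(0, logf.tell() - log_tail_bytes))
            if b"ENOTCONN" in logf.read():
                return False
    return True
----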

Advanced metrics can be introduced in future versions:

- *Performance metrics* - Using Gluster Volume profile and other tools available.

=== Implementation

Every Kadalu container (including the Operator) will have an HTTP server process that exposes one API, `/_api/metrics`. This API need not be exposed outside the cluster; only the Operator needs access to it.
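
A stdlib-only sketch of what this internal exporter could look like; only the `/_api/metrics` path is from this proposal, while the port and the payload contents are placeholders.

[source,python]
----
# Illustrative sketch of the per-container internal exporter. Only the
# /_api/metrics path is from this proposal; port and payload are placeholders.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_local_metrics():
    # Placeholder: each container would report its own uptime, health,
    # version and process details here.
    return {"uptime_seconds": 0, "health": "Up"}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/_api/metrics":
            self.send_error(404)
            return
        payload = json.dumps(collect_local_metrics()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
----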

The Operator container will have an HTTP server process that exposes the following two new APIs along with `/_api/metrics`.

----
GET /metrics
GET /metrics.json
----

Prometheus annotations are required only in the Operator Pod YAML file, as follows.

[source,yaml]
----
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8000"
----

Review comment: The scrape interval is configurable by the user; would another annotation suffice here?

Author reply: Prometheus is pull based, which means it calls the APIs and collects the metrics. Metric exporters should not have their own scrape interval: https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling

=== Health subcommand for `kubectl-kadalu`

Consume the `GET /metrics.json` API and present the information as required.
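
A rough sketch of such a subcommand, assuming the Operator's metrics endpoint is reachable at a placeholder in-cluster address:

[source,python]
----
# Illustrative sketch of a `kubectl kadalu health` subcommand consuming
# GET /metrics.json. The Operator endpoint address is a placeholder.
import json
import urllib.request


def show_health(operator_url="http://kadalu-operator.kadalu:8000"):
    with urllib.request.urlopen(operator_url + "/metrics.json") as resp:
        metrics = json.loads(resp.read())
    print("Storage pools: %d" % len(metrics.get("storages", [])))
    for pool in metrics.get("storages", []):
        print("  %-20s health=%s utilization=%s%%" % (
            pool.get("name", "?"),
            pool.get("health", "Unknown"),
            pool.get("utilization_percent", "?")))


if __name__ == "__main__":
    show_health()
----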