From 9aaa09236dc4fcbf5f9310dd651ce1823da7744c Mon Sep 17 00:00:00 2001
From: Naga Ravi Chaitanya Elluri
Date: Fri, 22 May 2020 01:15:58 -0400
Subject: [PATCH] Separate operations based on distribution

This commit:
- Adds a high-level config option called distribution to be able to run
  operations which are specific to OpenShift in addition to kube, as
  mentioned in https://github.com/openshift-scale/cerberus/issues/42,
  like the inspect_components mode, which enables the user to collect
  logs/events related to the failed component using the oc inspect command.
- Updates the readme to add info about setting the namespaces to monitor
  in the config depending on the distribution, as the defaults assume
  OpenShift. It also adds blogs and other useful resources related to
  Cerberus.
---
 README.md          | 25 +++++++++++++++++--------
 config/config.yaml |  1 +
 start_cerberus.py  | 12 +++++++++---
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 74232f3..5828f5b 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # Cerberus
-Guardian of Kubernetes Clusters
+Guardian of Kubernetes and OpenShift Clusters
 
-Cerberus watches the Kubernetes/OpenShift clusters for dead nodes, system component failures and exposes a go or no-go signal which can be consumed by other workload generators or applications in the cluster and act accordingly.
+Cerberus watches the Kubernetes/OpenShift clusters for dead nodes, system component failures/health and exposes a go or no-go signal which can be consumed by other workload generators or applications in the cluster and act accordingly.
 
 ### Workflow
 ![Cerberus workflow](media/cerberus-workflow.png)
@@ -18,6 +18,7 @@ Set the supported components to monitor and the tunings like number of iteration
 
 ```
 cerberus:
+    distribution: openshift  # Distribution can be kubernetes or openshift
     kubeconfig_path: ~/.kube/config  # Path to kubeconfig
     watch_nodes: True  # Set to True for the cerberus to monitor the cluster nodes
     watch_cluster_operators: True  # Set to True for cerberus to monitor cluster operators. Parameter is optional, will set to True if not specified
@@ -52,7 +53,9 @@ tunings:
     sleep_time: 60  # Sleep duration between each iteration
     daemon_mode: True  # Iterations are set to infinity which means that the cerberus will monitor the resources forever
 ```
-NOTE: The current implementation can monitor only one cluster from one host. It can be used to monitor multiple clusters provided multiple instances of Cerberus are launched on different hosts.
+**NOTE**: The current implementation can monitor only one cluster from one host. It can be used to monitor multiple clusters provided multiple instances of Cerberus are launched on different hosts.
+
+**NOTE**: The components especially the namespaces needs to be changed depending on the distribution i.e Kubernetes or OpenShift. The default specified in the config assumes that the distribution is OpenShift.
 
 #### Run
 ```
@@ -73,9 +76,9 @@ $ podman pull quay.io/openshift-scale/cerberus
 $ podman run --name=cerberus --net=host -v :/root/.kube/config:Z -v :/root/cerberus/config/config.yaml:Z -d quay.io/openshift-scale/cerberus:latest
 $ podman logs -f cerberus
 ```
-The go/no-go signal ( True or False ) gets published at http://:8080. Note that the cerberus will only support ipv4 for the time being.
+The go/no-go signal ( True or False ) gets published at http://``:8080. Note that the cerberus will only support ipv4 for the time being.
 
-NOTE: The report is generated at /root/cerberus/cerberus.report inside the container, it can mounted to a directory on the host in case we want to capture it.
+**NOTE**: The report is generated at /root/cerberus/cerberus.report inside the container, it can mounted to a directory on the host in case we want to capture it.
 
 #### Report
 The report is generated in the run directory and it contains the information about each check/monitored component status per iteration with timestamps. It also displays information about the components in case of failure. For example:
@@ -130,8 +133,11 @@ The user has the option to enable/disable the slack integration ( disabled by de
 When the cerberus is configured to run in the daemon mode, it will continuosly monitor the components specified, runs a simple http server at http://0.0.0.0:8080 and publishes the signal i.e True or False depending on the components status. The tools can consume the signal and act accordingly.
 
 #### Node Problem Detector
-[node-problem-detector](https://github.com/kubernetes/node-problem-detector) aims to make various node problems visible to the upstream layers in cluster management stack
+[node-problem-detector](https://github.com/kubernetes/node-problem-detector) aims to make various node problems visible to the upstream layers in cluster management stack.
+
 ##### Installation
+Please follow the instructions in the [installation](https://github.com/kubernetes/node-problem-detector#installation) section to setup Node Problem Detector on Kubernetes. The following instructions are setting it up on OpenShift:
+
 1. Create `openshift-node-problem-detector` namespace [ns.yaml](https://github.com/openshift/node-problem-detector-operator/blob/master/deploy/ns.yaml) with `oc create -f ns.yaml`
 2. Add cluster role with `oc adm policy add-cluster-role-to-user system:node-problem-detector -z default -n openshift-node-problem-detector`
 3. Add security context constraints with `oc adm policy add-scc-to-user privileged system:serviceaccount:openshift-node-problem-detector:default
@@ -157,6 +163,7 @@ Following are the components of Kubernetes/OpenShift that Cerberus can monitor t
 Component | Description | Working
 ------------------------ | ---------------------------------------------------------------------------------------------------| ------------------------- |
 Nodes | Watches all the nodes including masters, workers as well as nodes created using custom MachineSets | :heavy_check_mark: |
+Namespaces | Watches the pods including the containers running inside the pods in the specified namespaces | :heavy_check_mark: |
 Etcd | Watches the status of the Etcd member pods | :heavy_check_mark: |
 OpenShift ApiServer | Watches the OpenShift Apiserver pods | :heavy_check_mark: |
 Kube ApiServer | Watches the Kube APiServer pods | :heavy_check_mark: |
@@ -168,7 +175,9 @@ Ingress | Watches Routers
 Openshift SDN | Watches SDN pods | :heavy_check_mark: |
 OVNKubernetes | Watches OVN pods | :heavy_check_mark: |
 Cluster Operators | Watches all Cluster Operators | :heavy_check_mark: |
-Master Nodes Schedule | Watches schedule of Master Nodes | :heavy_check_mark: |
+Master Nodes Schedule | Watches schedule of Master Nodes | :heavy_check_mark: |
 
 
-NOTE: It supports monitoring pods in any namespaces specified in the config, the watch is enabled for system components mentioned above by default as they are critical for running the operations on Kubernetes/OpenShift clusters.
+**NOTE**: It supports monitoring pods in any namespaces specified in the config, the watch is enabled for system components mentioned above by default as they are critical for running the operations on Kubernetes/OpenShift clusters.
+### Blogs and other useful resources
+- https://www.openshift.com/blog/openshift-scale-ci-part-4-introduction-to-cerberus-guardian-of-kubernetes/openshift-clouds
diff --git a/config/config.yaml b/config/config.yaml
index 7bc4689..999c27c 100644
--- a/config/config.yaml
+++ b/config/config.yaml
@@ -1,4 +1,5 @@
 cerberus:
+    distribution: openshift  # Distribution can be kubernetes or openshift
     kubeconfig_path: ~/.kube/config  # Path to kubeconfig
     watch_nodes: True  # Set to True for the cerberus to monitor the cluster nodes
     watch_cluster_operators: True  # Set to True for cerberus to monitor cluster operators
diff --git a/start_cerberus.py b/start_cerberus.py
index 29e70f6..12ae686 100644
--- a/start_cerberus.py
+++ b/start_cerberus.py
@@ -31,6 +31,7 @@ def main(cfg):
     if os.path.isfile(cfg):
         with open(cfg, 'r') as f:
             config = yaml.full_load(f)
+        distribution = config["cerberus"].get("distribution", "openshift")
         kubeconfig_path = config["cerberus"].get("kubeconfig_path", "")
         watch_nodes = config["cerberus"].get("watch_nodes", False)
         watch_cluster_operators = config["cerberus"].get("watch_cluster_operators", False)
@@ -73,8 +74,9 @@ def main(cfg):
         # Create slack WebCleint when slack intergation has been enabled
         if slack_integration:
             slack_integration = slackcli.initialize_slack_client()
-
-        if inspect_components:
+
+        # Run inspection only when the distribution is openshift
+        if distribution.lower() == "openshift" and inspect_components:
             logging.info("Detailed inspection of failed components has been enabled")
             inspect.delete_inspect_directory()
 
@@ -185,8 +187,12 @@ def main(cfg):
                                                 failed_operators, watch_namespaces_status,
                                                 failed_pods_components)
 
-            if inspect_components:
+            # Run inspection only when the distribution is openshift
+            if distribution.lower() == "openshift" and inspect_components:
                 inspect.inspect_components(failed_pods_components)
+            elif distribution.lower() == "kubernetes" and inspect_components:
+                logging.info("Skipping the failed component inspection as inspect_components is specific to OpenShift")
+
 
             cerberus_status = watch_nodes_status and watch_namespaces_status \
                 and watch_cluster_operators_status
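
Reviewer note: the distribution check in this patch appears twice in `main()`. As a rough illustration only, the same gate could be expressed as one small helper. The sketch below is not part of the patch; `should_inspect` is a hypothetical name, and it assumes `inspect_components` is read from the `cerberus` section of the config (the patch does not show where that value comes from). Unlike the patch, which logs only when the distribution is exactly `kubernetes`, this sketch logs for any non-OpenShift value.

```python
import logging


def should_inspect(config: dict) -> bool:
    """Hypothetical helper: True only when inspection is requested on OpenShift."""
    cerberus_cfg = config.get("cerberus", {})
    # Same default as the patch: a missing key is treated as "openshift".
    distribution = str(cerberus_cfg.get("distribution", "openshift")).lower()
    # Assumption: inspect_components lives under the "cerberus" section.
    inspect_components = bool(cerberus_cfg.get("inspect_components", False))

    if not inspect_components:
        return False
    if distribution == "openshift":
        return True
    # Any other distribution: inspection relies on `oc inspect`, so skip it.
    logging.info("Skipping the failed component inspection as "
                 "inspect_components is specific to OpenShift")
    return False
```

With a helper like this, both call sites would reduce to `if should_inspect(config):`, which keeps the case-insensitive comparison and the default in one place.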