Deploying node health checks by using the Node Health Check Operator - Working with nodes | Nodes

About the Node Health Check Operator
- Understanding the Node Health Check Operator workflow
- About how node health checks prevent conflicts with machine health checks
Installing the Node Health Check Operator by using the web console
Installing the Node Health Check Operator by using the CLI
Gathering data about the Node Health Check Operator
Additional resources

Use the Node Health Check Operator to identify unhealthy nodes. The Operator uses the Self Node Remediation Operator to remediate the unhealthy nodes.

Additional resources

Remediating nodes with the Self Node Remediation Operator

About the Node Health Check Operator

The Node Health Check Operator detects the health of the nodes in a cluster. The NodeHealthCheck controller creates the NodeHealthCheck custom resource (CR), which defines a set of criteria and thresholds to determine the health of a node.

The Node Health Check Operator also installs the Self Node Remediation Operator as a default remediation provider.

When the Node Health Check Operator detects an unhealthy node, it creates a remediation CR that triggers the remediation provider. For example, the controller creates the SelfNodeRemediation CR, which triggers the Self Node Remediation Operator to remediate the unhealthy node.

The NodeHealthCheck CR resembles the following YAML file:

apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample
spec:
  minHealthy: 51% (1)
  pauseRequests: (2)
    - <pause-test-cluster>
  remediationTemplate: (3)
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    name: self-node-remediation-resource-deletion-template
    namespace: openshift-operators
    kind: SelfNodeRemediationTemplate
  selector: (4)
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions: (5)
    - type: Ready
      status: "False"
      duration: 300s (6)
    - type: Ready
      status: Unknown
      duration: 300s (6)

1 Specifies the amount of healthy nodes(in percentage or number) required for a remediation provider to concurrently remediate nodes in the targeted pool. If the number of healthy nodes equals to or exceeds the limit set by minHealthy, remediation occurs. The default value is 51%.

Prevents any new remediation from starting, while allowing any ongoing remediations to persist. The default value is empty. However, you can enter an array of strings that identify the cause of pausing the remediation. For example, pause-test-cluster.

During the upgrade process, nodes in the cluster might become temporarily unavailable and get identified as unhealthy. In the case of worker nodes, when the Operator detects that the cluster is upgrading, it stops remediating new unhealthy nodes to prevent such nodes from rebooting.

3 Specifies a remediation template from the remediation provider. For example, from the Self Node Remediation Operator.

4 Specifies a selector that matches labels or expressions that you want to check. The default value is empty, which selects all nodes.

5 Specifies a list of the conditions that determine whether a node is considered unhealthy.

6 Specifies the timeout duration for a node condition. If a condition is met for the duration of the timeout, the node will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy node.

Understanding the Node Health Check Operator workflow

When a node is identified as unhealthy, the Node Health Check Operator checks how many other nodes are unhealthy. If the number of healthy nodes exceeds the amount that is specified in the minHealthy field of the NodeHealthCheck CR, the controller creates a remediation CR from the details that are provided in the external remediation template by the remediation provider. After remediation, the kubelet updates the node’s health status.

When the node turns healthy, the controller deletes the external remediation template.

About how node health checks prevent conflicts with machine health checks

When both, node health checks and machine health checks are deployed, the node health check avoids conflict with the machine health check.

OpenShift Container Platform deploys machine-api-termination-handler as the default MachineHealthCheck resource.

The following list summarizes the system behavior when node health checks and machine health checks are deployed:

If only the default machine health check exists, the node health check continues to identify unhealthy nodes. However, the node health check ignores unhealthy nodes in a Terminating state. The default machine health check handles the unhealthy nodes with a Terminating state.
Example log message
```
INFO MHCChecker	ignoring unhealthy Node, it is terminating and will be handled by MHC	{"NodeName": "node-1.example.com"}
```
If the default machine health check is modified (for example, the unhealthyConditions is Ready), or if additional machine health checks are created, the node health check is disabled.
Example log message
```
INFO controllers.NodeHealthCheck disabling NHC in order to avoid conflict with custom MHCs configured in the cluster {"NodeHealthCheck": "/nhc-worker-default"}
```

When, again, only the default machine health check exists, the node health check is re-enabled.

Example log message

INFO controllers.NodeHealthCheck re-enabling NHC, no conflicting MHC configured in the cluster {"NodeHealthCheck": "/nhc-worker-default"}

Installing the Node Health Check Operator by using the web console

You can use the OpenShift Container Platform web console to install the Node Health Check Operator.

Prerequisites

Procedure

In the OpenShift Container Platform web console, navigate to Operators → OperatorHub.
Search for the Node Health Check Operator, then click Install.
Keep the default selection of Installation mode and namespace to ensure that the Operator will be installed to the openshift-operators namespace.
Click Install.

Verification

To confirm that the installation is successful:

Navigate to the Operators → Installed Operators page.
Check that the Operator is installed in the openshift-operators namespace and that its status is Succeeded.

If the Operator is not installed successfully:

Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
Navigate to the Workloads → Pods page and check the logs in any pods in the openshift-operators project that are reporting issues.

Installing the Node Health Check Operator by using the CLI

You can use the OpenShift CLI (oc) to install the Node Health Check Operator.

To install the Operator in your own namespace, follow the steps in the procedure.

To install the Operator in the openshift-operators namespace, skip to step 3 of the procedure because the steps to create a new Namespace custom resource (CR) and an OperatorGroup CR are not required.

Prerequisites

Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.

Procedure

Create a Namespace custom resource (CR) for the Node Health Check Operator:
1. Define the Namespace CR and save the YAML file, for example, node-health-check-namespace.yaml:
  apiVersion: v1 kind: Namespace metadata: name: node-health-check
2. To create the Namespace CR, run the following command:
  $ oc create -f node-health-check-namespace.yaml

Create an OperatorGroup CR:

Define the OperatorGroup CR and save the YAML file, for example, node-health-check-operator-group.yaml:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: node-health-check-operator
  namespace: node-health-check

To create the OperatorGroup CR, run the following command:
```
$ oc create -f node-health-check-operator-group.yaml
```

Create a Subscription CR:

Define the Subscription CR and save the YAML file, for example, node-health-check-subscription.yaml:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
    name: node-health-check-operator
    namespace: node-health-check (1)
spec:
    channel: stable (2)
    installPlanApproval: Manual (3)
    name: node-healthcheck-operator
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    package: node-healthcheck-operator

1	Specify the `Namespace` where you want to install the Node Health Check Operator. To install the Node Health Check Operator in the `openshift-operators` namespace, specify `openshift-operators` in the `Subscription` CR.
2	Specify the channel name for your subscription. To upgrade to the latest version of the Node Health Check Operator, you must manually change the channel name for your subscription from `candidate` to `stable`.
3	Set the approval strategy to Manual in case your specified version is superseded by a later version in the catalog. This plan prevents an automatic upgrade to a later version and requires manual approval before the starting CSV can complete the installation.

To create the Subscription CR, run the following command:
```
$ oc create -f node-health-check-subscription.yaml
```

Verification

Verify that the installation succeeded by inspecting the CSV resource:

$ oc get csv -n openshift-operators

Example output

NAME                              DISPLAY                     VERSION  REPLACES PHASE
node-healthcheck-operator.v0.2.0. Node Health Check Operator  0.2.0             Succeeded

Verify that the Node Health Check Operator is up and running:

$ oc get deploy -n openshift-operators

Example output

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
node-health-check-operator-controller-manager  1/1     1            1           10d

Gathering data about the Node Health Check Operator

To collect debugging information about the Node Health Check Operator, use the must-gather tool. For information about the must-gather image for the Node Health Check Operator, see Gathering data about specific features.

Additional resources

Changing the update channel for an Operator
The Node Health Check Operator is supported in a restricted network environment. For more information, see Using Operator Lifecycle Manager on restricted networks.