This is a cache of https://docs.openshift.com/container-platform/4.15/hosted_control_planes/hcp_high_availability/hcp-recovering-etcd-cluster.html. It is a snapshot of the page at 2024-10-14T11:04:32.225+0000.
Recovering a failing <strong>etcd</strong> cluster - High availability for hosted control planes | Hosted control planes | OpenShift Container Platform 4.15
×

In a highly available control plane, three etcd pods run as a part of a stateful set in an etcd cluster. To recover an etcd cluster, identify unhealthy etcd pods by checking the etcd cluster health.

Checking the status of an etcd cluster

You can check the status of the etcd cluster health by logging into any etcd pod.

Procedure
  1. Log in to an etcd pod by entering the following command:

    $ oc rsh -n openshift-etcd -c etcd <etcd_pod_name>
  2. Print the health status of an etcd cluster by entering the following command:

    sh-4.4# etcdctl endpoint status -w table
    Example output
    +------------------------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |          ENDPOINT            |       ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +------------------------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | https://192.168.1xxx.20:2379 | 8fxxxxxxxxxx    |  3.5.12 |  123 MB |     false |      false |        10 |     180156 |             180156 |        |
    | https://192.168.1xxx.21:2379 | a5xxxxxxxxxx    |  3.5.12 |  122 MB |     false |      false |        10 |     180156 |             180156 |        |
    | https://192.168.1xxx.22:2379 | 7cxxxxxxxxxx    |  3.5.12 |  124 MB |      true |      false |        10 |     180156 |             180156 |        |
    +-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Recovering a failing etcd pod

Each etcd pod of a 3-node cluster has its own persistent volume claim (PVC) to store its data. An etcd pod might fail because of corrupted or missing data. You can recover a failing etcd pod and its PVC.

Procedure
  1. To confirm that the etcd pod is failing, enter the following command:

    $ oc get pods -l app=etcd -n openshift-etcd
    Example output
    NAME     READY   STATUS             RESTARTS     AGE
    etcd-0   2/2     Running            0            64m
    etcd-1   2/2     Running            0            45m
    etcd-2   1/2     CrashLoopBackOff   1 (5s ago)   64m

    The failing etcd pod might have the CrashLoopBackOff or Error status.

  2. Delete the failing pod and its PVC by entering the following command:

    $ oc delete pods etcd-2 -n openshift-etcd
Verification
  • Verify that a new etcd pod is up and running by entering the following command:

    $ oc get pods -l app=etcd -n openshift-etcd
    Example output
    NAME     READY   STATUS    RESTARTS   AGE
    etcd-0   2/2     Running   0          67m
    etcd-1   2/2     Running   0          48m
    etcd-2   2/2     Running   0          2m2s