Replacing a failed master host

- Removing a failed master host from the etcd cluster
- Adding the member back to the cluster
  - Adding a master host back to the etcd cluster
  - Generating etcd certificates and adding the member to the cluster

This document describes the process to replace a single etcd member. This procedure assumes that there is still an etcd quorum in the cluster.

If you have lost the majority of your master hosts, leading to etcd quorum loss, then you must follow the disaster recovery procedure to recover from lost master hosts instead of this procedure.
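
The quorum assumption comes down to simple arithmetic: an etcd cluster of n members keeps quorum only while at least floor(n/2) + 1 members are healthy. The following is a minimal sketch of that check. The function names quorum_size and has_quorum are hypothetical helpers for illustration and are not part of OpenShift or etcd tooling.

```shell
# Hypothetical helpers for illustration; not part of OpenShift or etcd.

# quorum_size N: smallest number of members that forms a majority of N.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL HEALTHY: succeeds if HEALTHY members are a majority of TOTAL.
has_quorum() {
    [ "$2" -ge "$(quorum_size "$1")" ]
}

# Example: a three-member cluster that has lost one master host.
if has_quorum 3 2; then
    echo "quorum intact: this procedure applies"
else
    echo "quorum lost: use the disaster recovery procedure"
fi
```

With three master hosts, losing one leaves two healthy members, which still satisfies the majority of two. Losing two does not.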

If the control plane certificates are not valid on the member being replaced, then you must follow the procedure to recover from expired control plane certificates instead of this procedure.

To replace a single master host:

Remove the member from the etcd cluster.
If the etcd certificates for the master host are valid, then add the member back to the etcd cluster.
If there are no etcd certificates for the master host or they are no longer valid, then generate etcd certificates and add the member to the etcd cluster.
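
The choice between the last two steps can be sketched as a small check on the certificate file. This is only an illustration: the function name choose_procedure is hypothetical, the certificate path is an argument you must supply for your own host, and the check covers only existence and expiry (using the openssl x509 -checkend option), not other reasons a certificate might be invalid.

```shell
# Hypothetical helper for illustration; not part of the OpenShift scripts.
# choose_procedure CERT_PATH
#   Prints "add-back" if the certificate file exists and has not expired,
#   otherwise prints "generate-certs".
# Assumption: expiry is the only validity criterion checked here.
choose_procedure() {
    cert="$1"
    if [ -f "$cert" ] && openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
        echo "add-back"          # certificate exists and has not expired
    else
        echo "generate-certs"    # certificate missing, unreadable, or expired
    fi
}
```

A result of add-back corresponds to adding the member back to the etcd cluster. A result of generate-certs corresponds to generating etcd certificates first.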

Removing a failed master host from the etcd cluster

Follow these steps to remove a failed master host from the etcd cluster.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have SSH access to an active master host.

Procedure

View the list of Pods associated with etcd.

In a terminal that has access to the cluster, run the following command:

$ oc get pods -n openshift-etcd
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-128-73.us-east-2.compute.internal    2/2     Running   0          15h
etcd-member-ip-10-0-147-172.us-east-2.compute.internal   2/2     Running   7          122m
etcd-member-ip-10-0-171-108.us-east-2.compute.internal   2/2     Running   0          15h
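
When the list is long, a small filter can help spot the Pod that needs attention. The sketch below reads the command output on standard input and prints the names of Pods that are not Running or that have restarted more than a given number of times. The function name flag_etcd_pods and the restart threshold are assumptions for illustration; a restart count is only a hint, not proof of failure.

```shell
# Hypothetical helper for illustration.
# flag_etcd_pods [MAX_RESTARTS]
#   Reads `oc get pods` output on stdin (header on the first line) and prints
#   the NAME of each Pod whose STATUS is not Running or whose RESTARTS count
#   exceeds MAX_RESTARTS (default 0).
flag_etcd_pods() {
    max_restarts="${1:-0}"
    awk -v max="$max_restarts" \
        'NR > 1 && ($3 != "Running" || $4 + 0 > max + 0) { print $1 }'
}
```

For example, piping the output above through the filter (oc get pods -n openshift-etcd | flag_etcd_pods) prints only the Pod with seven restarts.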

Access an active master host.

Run the etcd-member-remove.sh script and pass in the name of the etcd member to remove:

[core@ip-10-0-128-73 ~]$ sudo -E /usr/local/bin/etcd-member-remove.sh etcd-member-ip-10-0-147-172.us-east-2.compute.internal
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd client certs already backed up and available ./assets/backup/
Member 23e4736df4451b32 removed from cluster 6e25bab1bb556673
etcd member etcd-member-ip-10-0-147-172.us-east-2.compute.internal with 23e4736df4451b32 successfully removed..

Verify that the etcd member has been successfully removed from the cluster:

Connect to the running etcd container:

[core@ip-10-0-128-73 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it "$id" /bin/sh

In the etcd container, export the variables needed for connecting to etcd:

sh-4.2# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt') ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')

In the etcd container, execute etcdctl member list and verify that the removed member is no longer listed:

sh-4.2# etcdctl member list -w table

+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
|        ID        | STATUS  |                          NAME                          |                        PEER ADDRS                        |       CLIENT ADDRS        |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
| 29e461db6be4eaaa | started | etcd-member-ip-10-0-128-73.us-east-2.compute.internal  | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.128.73:2379  |
|  cbe982c74cbb42f | started | etcd-member-ip-10-0-171-108.us-east-2.compute.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 | https://10.0.171.108:2379 |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
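
Instead of reading the table by eye, the check can be scripted. The sketch below is a hypothetical helper, member_listed, that reads the table output on standard input and succeeds only if the given name appears exactly in the NAME column. It assumes the pipe-delimited layout shown above.

```shell
# Hypothetical helper for illustration.
# member_listed NAME
#   Reads `etcdctl member list -w table` output on stdin.
#   Exit status 0 if NAME appears in the NAME column, 1 otherwise.
member_listed() {
    awk -F'|' -v name="$1" '
        NF >= 4 {
            n = $4                          # NAME is the third visible column
            gsub(/^[ \t]+|[ \t]+$/, "", n)  # trim surrounding padding
            if (n == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

In the etcd container, a removal could then be confirmed with a command such as: etcdctl member list -w table | member_listed etcd-member-ip-10-0-147-172.us-east-2.compute.internal || echo "member removed"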

Adding the member back to the cluster

After you have removed the member from the etcd cluster, use one of the following procedures to add the member to the cluster:

If the etcd certificates for the master host are valid, then add the member back to the etcd cluster.
If there are no etcd certificates for the master host or they are no longer valid, then generate etcd certificates and add the member to the etcd cluster.

Adding a master host back to the etcd cluster

Follow these steps to add a master host back to the etcd cluster. This procedure assumes that you previously removed the master host from the cluster and that its etcd dependencies, such as TLS certificates and DNS, are valid.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have SSH access to the master host to add to the etcd cluster.
You have the IP address of an existing active etcd member.

Procedure

Access the master host to add to the etcd cluster.

You must run this procedure on the master host that is being added to the etcd cluster.

Run the etcd-member-add.sh script and pass in two parameters:

the IP address of an existing etcd member
the name of the etcd member to add

[core@ip-10-0-147-172 ~]$ sudo -E /usr/local/bin/etcd-member-add.sh \
10.0.128.73 \ (1)
etcd-member-ip-10-0-147-172.us-east-2.compute.internal (2)

Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
etcd.conf backup upready exists ./assets/backup/etcd.conf
Stopping etcd..
Waiting for etcd-member to stop
etcd data-dir backup found ./assets/backup/etcd..
Updating etcd membership..
Removing etcd data_dir /var/lib/etcd..

ETCD_NAME="etcd-member-ip-10-0-147-172.us-east-2.compute.internal"
ETCD_INITIAL_CLUSTER="etcd-member-ip-10-0-147-172.us-east-2.compute.internal=https://etcd-1.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-171-108.us-east-2.compute.internal=https://etcd-2.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-128-73.us-east-2.compute.internal=https://etcd-0.clustername.devcluster.openshift.com:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-1.clustername.devcluster.openshift.com:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Member  1e42c7070decd39 added to cluster 6e25bab1bb556673
Starting etcd..

1	The IP address of an active etcd member. This is not the IP address of the member that you are adding.
2	The name of the etcd member to add.

Verify that the etcd member has been successfully added to the etcd cluster:

Connect to the running etcd container:

[core@ip-10-0-147-172 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it "$id" /bin/sh

In the etcd container, export the variables needed for connecting to etcd:

sh-4.2# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt') ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')

In the etcd container, execute etcdctl member list and verify that the new member is listed:

sh-4.2# etcdctl member list -w table

+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
|        ID        | STATUS  |                          NAME                          |                        PEER ADDRS                        |       CLIENT ADDRS        |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+
| 29e461db6be4eaaa | started | etcd-member-ip-10-0-128-73.us-east-2.compute.internal  | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.128.73:2379  |
|  cbe982c74cbb42f | started | etcd-member-ip-10-0-147-172.us-east-2.compute.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | https://10.0.147.172:2379 |
| a752f80bcb0da3e8 | started | etcd-member-ip-10-0-171-108.us-east-2.compute.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 | https://10.0.171.108:2379 |
+------------------+---------+--------------------------------------------------------+----------------------------------------------------------+---------------------------+

It may take up to 10 minutes for the new member to start.
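
Because the member can take several minutes to start, it may be convenient to poll rather than rerun commands by hand. The following is a minimal sketch of a generic polling loop. The function name wait_until is hypothetical, and it assumes that the command being polled exits with a non-zero status until the desired state is reached and that sleep is available in the shell where you run it.

```shell
# Hypothetical helper for illustration.
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
#   Runs COMMAND repeatedly until it succeeds (returns 0) or until roughly
#   TIMEOUT_SECONDS have been spent sleeping (returns 1).
wait_until() {
    timeout="$1"
    interval="$2"
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$(( elapsed + interval ))
    done
    return 0
}
```

For example, wait_until 600 15 etcdctl endpoint health --cluster would retry every 15 seconds for about 10 minutes, assuming the health command returns a non-zero status while any endpoint is unhealthy.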

In the etcd container, execute etcdctl endpoint health and verify that the new member is healthy:

sh-4.2# etcdctl endpoint health --cluster
https://10.0.128.73:2379 is healthy: successfully committed proposal: took = 4.5576ms
https://10.0.147.172:2379 is healthy: successfully committed proposal: took = 5.1521ms
https://10.0.171.108:2379 is healthy: successfully committed proposal: took = 4.2631ms
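
To pick out problem endpoints from this output automatically, the lines can be filtered on the word that follows "is". The sketch below uses a hypothetical helper, unhealthy_endpoints. It assumes that an unhealthy endpoint is reported on a line of the form "<endpoint> is unhealthy: ...", which mirrors the healthy lines shown above but is not shown in this document.

```shell
# Hypothetical helper for illustration.
# unhealthy_endpoints
#   Reads `etcdctl endpoint health --cluster` output on stdin and prints the
#   endpoint URL of every status line that does not report "healthy:".
# Assumption: status lines look like "<url> is <state>: <details>".
unhealthy_endpoints() {
    awk '$2 == "is" && $3 != "healthy:" { print $1 }'
}
```

An empty result from etcdctl endpoint health --cluster 2>&1 | unhealthy_endpoints suggests that every listed endpoint reported healthy.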

Verify that the new member is in the list of Pods associated with etcd and that its status is Running.

In a terminal that has access to the cluster, run the following command:

$ oc get pods -n openshift-etcd
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-128-73.us-east-2.compute.internal    2/2     Running   0          15h
etcd-member-ip-10-0-147-172.us-east-2.compute.internal   2/2     Running   7          122m
etcd-member-ip-10-0-171-108.us-east-2.compute.internal   2/2     Running   0          15h

Generating etcd certificates and adding the member to the cluster

If the node is new or the etcd certificates on the node are no longer valid, you must generate the etcd certificates before you can add the member to the etcd cluster.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have SSH access to the new master host to add to the etcd cluster.
You have SSH access to one of the healthy master hosts.
You have the IP address of one of the healthy master hosts.

Procedure

Set up a temporary etcd certificate signer service on one of the healthy master nodes.

Access one of the healthy master nodes and log in to your cluster as a cluster-admin user using the following command.

[core@ip-10-0-143-125 ~]$ sudo oc login https://localhost:6443
Authentication required for https://localhost:6443 (openshift)
Username: kubeadmin
Password:
Login successful.

Obtain the pull specification for the kube-etcd-signer-server image.

[core@ip-10-0-143-125 ~]$ export KUBE_ETCD_SIGNER_SERVER=$(sudo oc adm release info --image-for kube-etcd-signer-server --registry-config=/var/lib/kubelet/config.json)

Run the tokenize-signer.sh script.

Be sure to pass in the -E flag to sudo so that environment variables are properly passed to the script.

[core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/tokenize-signer.sh ip-10-0-143-125 (1)
Populating template /usr/local/share/openshift-recovery/template/kube-etcd-cert-signer.yaml.template
Populating template ./assets/tmp/kube-etcd-cert-signer.yaml.stage1
Tokenized template now ready: ./assets/manifests/kube-etcd-cert-signer.yaml

1	The host name of the healthy master, where the signer should be deployed.

Create the signer Pod using the file that was generated.

[core@ip-10-0-143-125 ~]$ sudo oc create -f assets/manifests/kube-etcd-cert-signer.yaml
pod/etcd-signer created

Verify that the signer is listening on this master node.

[core@ip-10-0-143-125 ~]$ ss -ltn | grep 9943
LISTEN   0         128                       *:9943                   *:*
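
A plain grep for 9943 also matches unrelated ports that contain those digits, such as 19943. The sketch below shows a stricter check that compares only the port part of the local address. The function name port_is_listening is hypothetical, and the sketch assumes the column layout shown in the output above, where the local address is the fourth field.

```shell
# Hypothetical helper for illustration.
# port_is_listening PORT
#   Reads `ss -ltn` output on stdin. Exit status 0 if some LISTEN socket has
#   PORT as the port of its local address, 1 otherwise.
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")   # port is the text after the last colon
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage on the master node would be: ss -ltn | port_is_listening 9943 && echo "signer is listening"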

Add the new master host to the etcd cluster.

Access the new master host to be added to the cluster, and log in to your cluster as a cluster-admin user using the following command.

[core@ip-10-0-156-255 ~]$ sudo oc login https://localhost:6443
Authentication required for https://localhost:6443 (openshift)
Username: kubeadmin
Password:
Login successful.

Export two environment variables that are required by the etcd-member-recover.sh script.

[core@ip-10-0-156-255 ~]$ export SETUP_ETCD_ENVIRONMENT=$(sudo oc adm release info --image-for machine-config-operator --registry-config=/var/lib/kubelet/config.json)

[core@ip-10-0-156-255 ~]$ export KUBE_CLIENT_AGENT=$(sudo oc adm release info --image-for kube-client-agent --registry-config=/var/lib/kubelet/config.json)
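
An export of the form VAR=$(command) still succeeds when the command fails, which leaves the variable empty. Before running the recovery script, it can be worth confirming that both variables are set. The following is a minimal sketch; the function name require_env is hypothetical and not part of the OpenShift scripts.

```shell
# Hypothetical helper for illustration.
# require_env NAME [NAME...]
#   Succeeds only if every named shell variable is set to a non-empty value.
#   Pass variable names only; the names are expanded with eval.
require_env() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing or empty: $name" >&2
            return 1
        fi
    done
    return 0
}
```

For example: require_env SETUP_ETCD_ENVIRONMENT KUBE_CLIENT_AGENT && echo "environment ready"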

Run the etcd-member-recover.sh script.

Be sure to pass in the -E flag to sudo so that environment variables are properly passed to the script.

[core@ip-10-0-156-255 ~]$ sudo -E /usr/local/bin/etcd-member-recover.sh 10.0.143.125 etcd-member-ip-10-0-156-255.ec2.internal (1)
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
etcd.conf backup upready exists ./assets/backup/etcd.conf
Trying to backup etcd client certs..
etcd client certs already backed up and available ./assets/backup/
Stopping etcd..
Waiting for etcd-member to stop
etcd data-dir backup found ./assets/backup/etcd..
etcd TLS certificate backups found in ./assets/backup..
Removing etcd certs..
Populating template /usr/local/share/openshift-recovery/template/etcd-generate-certs.yaml.template
Populating template ./assets/tmp/etcd-generate-certs.stage1
Populating template ./assets/tmp/etcd-generate-certs.stage2
Starting etcd client cert recovery agent..
Waiting for certs to generate..
Waiting for certs to generate..
Waiting for certs to generate..
Waiting for certs to generate..
Stopping cert recover..
Waiting for generate-certs to stop
Patching etcd-member manifest..
Updating etcd membership..
Member 249a4b9a790b3719 added to cluster 807ae3bffc8d69ca

ETCD_NAME="etcd-member-ip-10-0-156-255.ec2.internal"
ETCD_INITIAL_CLUSTER="etcd-member-ip-10-0-143-125.ec2.internal=https://etcd-0.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-156-255.ec2.internal=https://etcd-1.clustername.devcluster.openshift.com:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-1.clustername.devcluster.openshift.com:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Starting etcd..

1	Specify both the IP address of the healthy master where the signer server is running, and the etcd name of the new member.

Verify that the new master host has been added to the etcd member list.

Access the healthy master and connect to the running etcd container.

[core@ip-10-0-143-125 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it "$id" /bin/sh

In the etcd container, export variables needed for connecting to etcd.

sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name '*peer*crt') ETCDCTL_KEY=$(find /etc/ssl/ -name '*peer*key')

In the etcd container, execute etcdctl member list and verify that the new member is listed.

sh-4.3# etcdctl member list -w table

+------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+
|        ID        | STATUS  |                   NAME                   |                        PEER ADDRS                        |       CLIENT ADDRS        |
+------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+
|  cbe982c74cbb42f | started | etcd-member-ip-10-0-156-255.ec2.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | https://10.0.156.255:2379 |
| 249a4b9a790b3719 | started | etcd-member-ip-10-0-143-125.ec2.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 | https://10.0.143.125:2379 |
+------------------+---------+------------------------------------------+----------------------------------------------------------+---------------------------+

It may take up to 20 minutes for the new member to start.

After the new member is added, remove the signer Pod because it is no longer needed.

In a terminal that has access to the cluster, run the following command:
```
$ oc delete pod -n openshift-config etcd-signer
```
