Restoring to a previous cluster state

Restoring to a previous cluster state

To restore the cluster to a previous state, you must have previously backed up etcd data by creating a snapshot. You will use this snapshot to restore the cluster state.

You can use a saved etcd snapshot to restore back to a previous cluster state.

Prerequisites

Access to the cluster as a user with the cluster-admin role.
SSH access to master hosts.
A backed-up etcd snapshot.

You must use the same etcd snapshot file on all master hosts in the cluster.

Procedure

Prepare each master host in your cluster to be restored.

You should run the restore script on all of your master hosts within a short period of time so that the cluster members come up at about the same time and form a quorum. For this reason, it is recommended to stage each master host in a separate terminal, so that the restore script can then be started quickly on each.
1. Copy the etcd snapshot file to a master host.
  
  This procedure assumes that you have copied a snapshot file called snapshot.db to the /home/core/ directory of your master host.
2. Access the master host.
3. Set the INITIAL_CLUSTER variable to the list of members in the format of <name>=<url>. This variable will be passed to the restore script and must be exactly the same for each member.
  [core@ip-10-0-143-125 ~]$ export INITIAL_CLUSTER="etcd-member-ip-10-0-143-125.ec2.internal=https://etcd-0.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-35-108.ec2.internal=https://etcd-1.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-10-16.ec2.internal=https://etcd-2.clustername.devcluster.openshift.com:2380"
4. If the cluster-wide proxy is enabled, be sure that you have exported the NO_proxy, HTTP_proxy, and HTTPS_proxy environment variables.
  
  You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpproxy, httpsproxy, and noproxy fields have values set.
5. Repeat these steps on your other master hosts, each in a separate terminal. Be sure to use the same etcd snapshot file on each master host.

Run the restore script on all of your master hosts.

Start the etcd-snapshot-restore.sh script on your first master host. Pass in two parameters: the path to the snapshot file and list of members, which is defined by the INITIAL_CLUSTER variable.

[core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/etcd-snapshot-restore.sh /home/core/snapshot.db $INITIAL_CLUSTER
Creating asset directory ./assets
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
Stopping all static pods..
..stopping kube-scheduler-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
..stopping etcd-member.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
Stopping all containers..
bd44e4bc942276eb1a6d4b48ecd9f5fe95570f54aa9c6b16939fa2d9b679e1ea
d88defb9da5ae623592b81619e3690faeb4fa645440e71c029812cb960ff586f
3920ced20723064a379739c4a586f909497a7b6705a5b3cf367d9b930f23a5f1
d470f7a2d962c90f3a21bcc021970bde96bc8908f317ec70f1c21720b322c25c
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Restoring etcd member etcd-member-ip-10-0-143-125.ec2.internal from snapshot..
2019-05-15 19:03:34.647589 I | pkg/netutil: resolving etcd-0.clustername.devcluster.openshift.com:2380 to 10.0.143.125:2380
2019-05-15 19:03:34.883545 I | mvcc: restore compact to 361491
2019-05-15 19:03:34.915679 I | etcdserver/membership: added member cbe982c74cbb42f [https://etcd-0.clustername.devcluster.openshift.com:2380] to cluster 807ae3bffc8d69ca
Starting static pods..
..starting kube-scheduler-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
..starting etcd-member.yaml
Starting kubelet..

Once the restore starts, run the script on your other master hosts.

Verify that the Machine Configs have been applied.

In a terminal that has access to the cluster as a cluster-admin user, run the following command.
```
$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING
master   rendered-master-50e7e00374e80b767fcc922bdfbc522b   True      False
```
When the snapshot has been applied, the currentConfig of the master will match the ID from when the etcd snapshot was taken. The currentConfig name for masters is in the format rendered-master-<currentConfig>.

Verify that all master hosts have started and joined the cluster.

Access a master host and connect to the running etcd container.

[core@ip-10-0-143-125 ~] id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh

In the etcd container, export variables needed for connecting to etcd.

sh-4.2# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name *peer*crt) ETCDCTL_KEY=$(find /etc/ssl/ -name *peer*key)

In the etcd container, execute etcdctl member list and verify that the three members show as started.

sh-4.2#  etcdctl member list -w table

+------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
|        ID        | STATUS  |                   NAME                   |                            PEER ADDRS                            |       CLIENT ADDRS        |
+------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
| 29e461db6be4eaaa | started | etcd-member-ip-10-0-164-170.ec2.internal | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.164.170:2379 |
|  cbe982c74cbb42f | started | etcd-member-ip-10-0-143-125.ec2.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | https://10.0.143.125:2379 |
| a752f80bcb0da3e8 | started |   etcd-member-ip-10-0-156-2.ec2.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 |   https://10.0.156.2:2379 |
+------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+

It may take up to 10 minutes for each new member to start.