You can back up and restore etcd for hosted control planes in an on-premise environment to fix failures.
By backing up and restoring etcd on a hosted cluster, you can fix failures, such as corrupted or missing data in an etcd member of a three-node cluster. If members of the etcd cluster lose data or have a CrashLoopBackOff status, this approach helps prevent an etcd quorum loss.
You have installed the oc and jq binaries.
Management cluster prerequisites:
A valid StorageClass resource is configured in the management cluster.
You have cluster-admin access to the management cluster.
You have access to online storage that is compatible with OpenShift ADP cloud storage providers, such as Amazon Web Services (AWS) S3, Microsoft Azure, Google Cloud, or MinIO. If you use S3 for backup storage, ensure that IAM roles and policies are configured. For more information, see "Configuring Amazon Web Services".
Hosted control plane pods are accessible and functioning properly.
You have access to the openshift-adp subscription through a CatalogSource object.
Hosted cluster service publishing strategy prerequisites:
The APIServer service must have a fixed hostname. Otherwise, the restore process fails and nodes cannot rejoin the cluster. For hosted control planes on AWS, the APIServer service can also use a Route service publishing strategy with a fixed hostname.
For production environments, it is strongly recommended to configure all services with fixed hostnames. By having fixed hostnames, you can ensure full service continuity and DNS consistency during the restore process on a different management cluster.
When you restore a hosted cluster to a different management cluster, every service in the hosted cluster must be configured with a fixed hostname in its servicePublishingStrategy specification. This requirement applies to all platforms: AWS, Agent, OKD Virtualization, and OpenStack.
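As an illustration of a fixed hostname, the APIServer entry in a HostedCluster specification might look like the following fragment. This is a sketch only: the hostname value is an example placeholder, and on AWS the Route strategy shown here is one of the supported options mentioned above.

```yaml
# Example fragment of a HostedCluster spec (hostname is a placeholder)
spec:
  services:
  - service: APIServer
    servicePublishingStrategy:
      type: Route
      route:
        hostname: api.example.com
```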
Restoring a hosted cluster to a different management cluster is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
For hosted control planes on AWS, the OIDC provider configuration must be accessible so that any necessary fixes can be completed after the restore process. See the following procedure for more information about applying any necessary fixes.
For hosted control planes on bare metal, the InfraEnv resource must reside in a different namespace from the hosted control plane namespace. Do not delete the InfraEnv resource during the backup or restore process.
After you back up the hosted cluster, you must back up workloads in the data cluster and then destroy the original hosted cluster so that the restore process can begin.
Set up your environment variables:
Set up environment variables for your hosted cluster by entering the following commands, replacing values as necessary:
$ CLUSTER_NAME=my-cluster
$ HOSTED_CLUSTER_NAMESPACE=clusters
$ CONTROL_PLANE_NAMESPACE="${HOSTED_CLUSTER_NAMESPACE}-${CLUSTER_NAME}"
Pause reconciliation of the hosted cluster by entering the following command, replacing values as necessary:
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
-p '{"spec":{"pausedUntil":"true"}}' --type=merge
Take a snapshot of etcd by using one of the following methods:
Use a previously backed-up snapshot of etcd.
If you have an available etcd pod, take a snapshot from the active etcd pod by completing the following steps:
List etcd pods by entering the following command:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
Take a snapshot of the pod database and save it locally to your machine by entering the following commands:
$ ETCD_POD=etcd-0
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- \
env ETCDCTL_API=3 /usr/bin/etcdctl \
--cacert /etc/etcd/tls/etcd-ca/ca.crt \
--cert /etc/etcd/tls/client/etcd-client.crt \
--key /etc/etcd/tls/client/etcd-client.key \
--endpoints=https://localhost:2379 \
snapshot save /var/lib/snapshot.db
Verify that the snapshot is successful by entering the following command:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- \
env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status \
/var/lib/snapshot.db
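If you prefer a machine-readable check, etcdctl can also emit the snapshot status as JSON, which you can parse with jq (listed in the prerequisites). The following is an optional sketch; the JSON values shown are illustrative sample data, and the oc command in the comment assumes the variables set earlier in this procedure.

```shell
# Optional machine-readable snapshot check. Capture the JSON status, e.g.:
#   STATUS_JSON=$(oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- \
#     env ETCDCTL_API=3 /usr/bin/etcdctl -w json snapshot status /var/lib/snapshot.db)
# Illustrative sample of the JSON that etcdctl returns:
STATUS_JSON='{"hash":3787919143,"revision":12345,"totalKey":1500,"totalSize":20480}'

# Extract the revision and assert that it is nonzero; a zero or missing
# revision suggests an empty or unreadable snapshot.
REVISION=$(echo "${STATUS_JSON}" | jq -r '.revision')
[ "${REVISION}" -gt 0 ] && echo "snapshot looks valid (revision ${REVISION})"
```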
Make a local copy of the snapshot by entering the following command:
$ oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/snapshot.db \
/tmp/etcd.snapshot.db
Make a copy of the snapshot database from etcd persistent storage:
List etcd pods by entering the following command:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
Find a pod that is running and set its name as the value of the ETCD_POD variable, for example ETCD_POD=etcd-0. Then copy its snapshot database by entering the following command:
$ oc cp -c etcd \
${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/data/member/snap/db \
/tmp/etcd.snapshot.db
Scale down the etcd statefulset by entering the following command:
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=0
Delete the volumes for the second and third etcd members by entering the following command:
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pvc/data-etcd-2
Create a pod to access the first etcd member’s data:
Get the etcd image by entering the following command:
$ ETCD_IMAGE=$(oc get -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd \
-o jsonpath='{ .spec.template.spec.containers[0].image }')
Create a pod that allows access to etcd data:
$ cat << EOF | oc apply -n ${CONTROL_PLANE_NAMESPACE} -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: etcd-data
spec:
replicas: 1
selector:
matchLabels:
app: etcd-data
template:
metadata:
labels:
app: etcd-data
spec:
containers:
- name: access
image: $ETCD_IMAGE
volumeMounts:
- name: data
mountPath: /var/lib
command:
- /usr/bin/bash
args:
- -c
- |-
while true; do
sleep 1000
done
volumes:
- name: data
persistentVolumeClaim:
claimName: data-etcd-0
EOF
Check the status of the etcd-data pod and wait for it to be running by entering the following command:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd-data
Get the name of the etcd-data pod by entering the following command:
$ DATA_POD=$(oc get -n ${CONTROL_PLANE_NAMESPACE} pods --no-headers \
-l app=etcd-data -o name | cut -d/ -f2)
Copy an etcd snapshot into the pod by entering the following command:
$ oc cp /tmp/etcd.snapshot.db \
${CONTROL_PLANE_NAMESPACE}/${DATA_POD}:/var/lib/restored.snap.db
Remove old data from the etcd-data pod by entering the following commands:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm -rf /var/lib/data
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- mkdir -p /var/lib/data
Restore the etcd snapshot by entering the following command:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- \
etcdutl snapshot restore /var/lib/restored.snap.db \
--data-dir=/var/lib/data --skip-hash-check \
--name etcd-0 \
--initial-cluster-token=etcd-cluster \
--initial-cluster etcd-0=https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-1=https://etcd-1.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-2=https://etcd-2.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380 \
--initial-advertise-peer-urls https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380
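The long --initial-cluster value follows a regular pattern, so it can also be generated with a short shell loop rather than typed by hand. The following sketch assumes the CONTROL_PLANE_NAMESPACE variable set earlier in this procedure; the value assigned below is only an example for illustration.

```shell
# Build the etcd --initial-cluster peer list for members 0-2.
# The namespace value is an example; in the procedure above it is
# derived from HOSTED_CLUSTER_NAMESPACE and CLUSTER_NAME.
CONTROL_PLANE_NAMESPACE=clusters-my-cluster
PEERS=""
for i in 0 1 2; do
  # Prepend a comma only when PEERS is already non-empty.
  PEERS="${PEERS}${PEERS:+,}etcd-${i}=https://etcd-${i}.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380"
done
echo "${PEERS}"
```

You can then pass "${PEERS}" as the --initial-cluster argument instead of writing the three entries inline.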
Remove the temporary etcd snapshot from the pod by entering the following command:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- \
rm /var/lib/restored.snap.db
Delete the data access deployment by entering the following command:
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} deployment/etcd-data
Scale up the etcd cluster by entering the following command:
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=3
Wait for the etcd member pods to return and report as available by entering the following command:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd -w
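If you want a non-interactive alternative to watching with -w, you can poll the Ready condition of the etcd pods instead. The helper below contains the counting logic; the oc loop in the comment is a sketch that assumes the variables set earlier in this procedure (a simpler option, if it fits your workflow, is oc wait with --for=condition=Ready).

```shell
# Count how many space-separated status values are exactly "True".
# Loop it around oc until all three members report Ready, e.g.:
#   until [ "$(count_ready "$(oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd \
#       -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}')")" -eq 3 ]; do
#     sleep 10
#   done
count_ready() {
  printf '%s\n' $1 | grep -cx 'True'
}

# Demonstration on sample status strings:
count_ready "True True True"
```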
Restore reconciliation of the hosted cluster by entering the following command:
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
-p '{"spec":{"pausedUntil":null}}' --type=merge
Manually roll out the hosted cluster by entering the following command:
$ oc annotate hostedcluster -n \
<hosted_cluster_namespace> <hosted_cluster_name> \
hypershift.openshift.io/restart-date=$(date --iso-8601=seconds)
At this point, the Multus admission controller and network node identity pods do not start yet.
Delete the pods for the second and third members of etcd and their PVCs by entering the following commands:
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pod/etcd-1 --wait=false
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-2 pod/etcd-2 --wait=false
Manually roll out the hosted cluster again by entering the following command:
$ oc annotate hostedcluster -n \
<hosted_cluster_namespace> <hosted_cluster_name> \
hypershift.openshift.io/restart-date=$(date --iso-8601=seconds) \
--overwrite
After a few minutes, the control plane pods start running.
If your hosted cluster is on AWS and you need to apply OIDC fixes after the restore process, enter the following command:
$ hcp fix dr-oidc-iam --hc-name <hosted_cluster_name> --hc-namespace <hosted_cluster_namespace> --aws-creds ~/.aws/credentials
This command regenerates the OIDC configuration in S3 if the OIDC configuration was deleted.