You can back up and restore etcd on a hosted cluster in an on-premise environment to fix failures, such as corrupted or missing data in an etcd member of a three-node cluster. If multiple members of the etcd cluster encounter data loss or have a CrashLoopBackOff status, this approach helps prevent an etcd quorum loss.
This procedure requires API downtime.
The oc and jq binaries have been installed.
First, set up environment variables for your hosted cluster by entering the following commands, replacing values as necessary:
$ CLUSTER_NAME=my-cluster
$ HOSTED_CLUSTER_NAMESPACE=clusters
$ CONTROL_PLANE_NAMESPACE="${HOSTED_CLUSTER_NAMESPACE}-${CLUSTER_NAME}"
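Optionally, verify that the control plane namespace that these variables resolve to exists before you continue, for example:
$ oc get namespace ${CONTROL_PLANE_NAMESPACE}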
Pause reconciliation of the hosted cluster by entering the following command, replacing values as necessary:
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
-p '{"spec":{"pausedUntil":"true"}}' --type=merge
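Optionally, confirm that reconciliation is paused by checking the pausedUntil field of the hosted cluster, for example:
$ oc get -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
  -o jsonpath='{.spec.pausedUntil}'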
Next, take a snapshot of etcd by using one of the following methods:
Use a previously backed-up snapshot of etcd.
If you have an available etcd pod, take a snapshot from the active etcd pod by completing the following steps:
List etcd pods by entering the following command:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
Set the name of a running etcd pod and take a snapshot of its database by entering the following commands:
$ ETCD_pod=etcd-0
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_pod} -- \
env ETCDCTL_API=3 /usr/bin/etcdctl \
--cacert /etc/etcd/tls/etcd-ca/ca.crt \
--cert /etc/etcd/tls/client/etcd-client.crt \
--key /etc/etcd/tls/client/etcd-client.key \
--endpoints=https://localhost:2379 \
snapshot save /var/lib/snapshot.db
Verify that the snapshot was successful by entering the following command:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_pod} -- \
env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status \
/var/lib/snapshot.db
Make a local copy of the snapshot by entering the following command:
$ oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_pod}:/var/lib/snapshot.db \
/tmp/etcd.snapshot.db
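Optionally, confirm that the local copy was written by checking the file size, for example:
$ ls -lh /tmp/etcd.snapshot.db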
Alternatively, make a copy of the snapshot database from etcd persistent storage by completing the following steps:
List etcd pods by entering the following command:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
Find a pod that is running and set its name as the value of ETCD_pod, for example:
$ ETCD_pod=etcd-0
Copy its snapshot database by entering the following command:
$ oc cp -c etcd \
${CONTROL_PLANE_NAMESPACE}/${ETCD_pod}:/var/lib/data/member/snap/db \
/tmp/etcd.snapshot.db
Next, scale down the etcd statefulset by entering the following command:
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=0
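Optionally, wait for the etcd pods to terminate before you continue, for example:
$ oc wait -n ${CONTROL_PLANE_NAMESPACE} --for=delete pod -l app=etcd --timeout=120s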
Delete the volumes for the second and third etcd members by entering the following command:
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pvc/data-etcd-2
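Optionally, list the persistent volume claims in the control plane namespace to confirm that the two claims were deleted, for example:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pvc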
Create a pod to access the first etcd member’s data:
Get the etcd image by entering the following command:
$ ETCD_IMAGE=$(oc get -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd \
-o jsonpath='{ .spec.template.spec.containers[0].image }')
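Optionally, print the variable to confirm that it contains a valid image reference, for example:
$ echo ${ETCD_IMAGE}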
Create a pod that allows access to etcd data:
$ cat << EOF | oc apply -n ${CONTROL_PLANE_NAMESPACE} -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-data
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd-data
  template:
    metadata:
      labels:
        app: etcd-data
    spec:
      containers:
      - name: access
        image: $ETCD_IMAGE
        volumeMounts:
        - name: data
          mountPath: /var/lib
        command:
        - /usr/bin/bash
        args:
        - -c
        - |-
          while true; do
            sleep 1000
          done
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-etcd-0
EOF
Check the status of the etcd-data pod and wait for it to be running by entering the following command:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd-data
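Alternatively, you can wait for the deployment to report as available, for example:
$ oc wait -n ${CONTROL_PLANE_NAMESPACE} --for=condition=Available \
  deployment/etcd-data --timeout=120s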
Get the name of the etcd-data pod by entering the following command:
$ DATA_pod=$(oc get -n ${CONTROL_PLANE_NAMESPACE} pods --no-headers \
-l app=etcd-data -o name | cut -d/ -f2)
Copy an etcd snapshot into the pod by entering the following command:
$ oc cp /tmp/etcd.snapshot.db \
${CONTROL_PLANE_NAMESPACE}/${DATA_pod}:/var/lib/restored.snap.db
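Optionally, verify that the snapshot file is present in the pod before you restore it, for example:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_pod} -- ls -l /var/lib/restored.snap.db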
Remove old data from the etcd-data pod by entering the following commands:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_pod} -- rm -rf /var/lib/data
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_pod} -- mkdir -p /var/lib/data
Restore the etcd snapshot by entering the following command:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_pod} -- \
etcdutl snapshot restore /var/lib/restored.snap.db \
--data-dir=/var/lib/data --skip-hash-check \
--name etcd-0 \
--initial-cluster-token=etcd-cluster \
--initial-cluster etcd-0=https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-1=https://etcd-1.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-2=https://etcd-2.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380 \
--initial-advertise-peer-urls https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380
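Optionally, confirm that the restore created the member data directory by listing its contents, for example:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_pod} -- ls /var/lib/data/member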
Remove the temporary etcd snapshot from the pod by entering the following command:
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_pod} -- \
rm /var/lib/restored.snap.db
Delete the data access deployment by entering the following command:
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} deployment/etcd-data
Scale up the etcd cluster by entering the following command:
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=3
Wait for the etcd member pods to return and report as available by entering the following command:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd -w
Restore reconciliation of the hosted cluster by entering the following command:
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
-p '{"spec":{"pausedUntil":null}}' --type=merge
Manually roll out the hosted cluster by entering the following command:
$ oc annotate hostedcluster -n ${HOSTED_CLUSTER_NAMESPACE} ${CLUSTER_NAME} \
  hypershift.openshift.io/restart-date=$(date --iso-8601=seconds)
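Optionally, confirm that the restart annotation was applied, for example:
$ oc get hostedcluster -n ${HOSTED_CLUSTER_NAMESPACE} ${CLUSTER_NAME} -o yaml | grep restart-date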
At this point, the Multus admission controller and network node identity pods do not start yet.
Delete the pods for the second and third members of etcd and their PVCs by entering the following commands:
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pod/etcd-1 --wait=false
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-2 pod/etcd-2 --wait=false
Manually roll out the hosted cluster again by entering the following command:
$ oc annotate hostedcluster -n ${HOSTED_CLUSTER_NAMESPACE} ${CLUSTER_NAME} \
  hypershift.openshift.io/restart-date=$(date --iso-8601=seconds) \
  --overwrite
After a few minutes, the control plane pods start running.
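To verify that the control plane recovered, you can list all pods in the control plane namespace and confirm that they are running, for example:
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods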