This is a cache of https://docs.okd.io/latest/hosted_control_planes/hcp_high_availability/hcp-disaster-recovery-oadp.html. It is a snapshot of the page at 2024-11-23T19:04:27.675+0000.
Disaster recovery for a hosted cluster by using OADP - High availability for hosted control planes | Hosted control planes | OKD 4
×

You can use the OpenShift API for Data Protection (OADP) Operator to perform disaster recovery on Amazon Web Services (AWS) and bare metal.

The disaster recovery process with OpenShift API for Data Protection (OADP) involves the following steps:

  1. Preparing your platform, such as Amazon Web Services or bare metal, to use OADP

  2. Backing up the data plane workload

  3. Backing up the control plane workload

  4. Restoring a hosted cluster by using OADP

Prerequisites

You must meet the following prerequisites on the management cluster:

  • You installed the OADP Operator.

  • You created a storage class.

  • You have access to the cluster with cluster-admin privileges.

  • You have access to the OADP subscription through a catalog source.

  • You have access to a cloud storage provider that is compatible with OADP, such as S3, Microsoft Azure, Google Cloud Platform, or MinIO.

  • In a disconnected environment, you have access to a self-hosted storage provider, for example Red Hat OpenShift Data Foundation or MinIO, that is compatible with OADP.

  • Your hosted control planes pods are up and running.

Preparing AWS to use OADP

To perform disaster recovery for a hosted cluster, you can use OpenShift API for Data Protection (OADP) on Amazon Web Services (AWS) S3 compatible storage. After creating the DataProtectionApplication object, new velero deployment and node-agent pods are created in the openshift-adp namespace.

To prepare AWS to use OADP, see "Configuring the OpenShift API for Data Protection with Multicloud Object Gateway".

Next steps
  • Backing up the data plane workload

  • Backing up the control plane workload

Preparing bare metal to use OADP

To perform disaster recovery for a hosted cluster, you can use OpenShift API for Data Protection (OADP) on bare metal. After creating the DataProtectionApplication object, new velero deployment and node-agent pods are created in the openshift-adp namespace.

To prepare bare metal to use OADP, see "Configuring the OpenShift API for Data Protection with AWS S3 compatible storage".

Next steps
  • Backing up the data plane workload

  • Backing up the control plane workload

Backing up the data plane workload

If the data plane workload is not important, you can skip this procedure. To back up the data plane workload by using the OADP Operator, see "Backing up applications".

Additional resources
Next steps
  • Restoring a hosted cluster by using OADP

Backing up the control plane workload

You can back up the control plane workload by creating the Backup custom resource (CR).

To monitor and observe the backup process, see "Observing the backup and restore process".

Procedure
  1. Pause the reconciliation of the HostedCluster resource by running the following command:

    $ oc --kubeconfig <management_cluster_kubeconfig_file> \
      patch hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
      --type json -p '[{"op": "add", "path": "/spec/pausedUntil", "value": "true"}]'
  2. Pause the reconciliation of the NodePool resource by running the following command:

    $ oc --kubeconfig <management_cluster_kubeconfig_file> \
      patch nodepool -n <hosted_cluster_namespace> <node_pool_name> \
      --type json -p '[{"op": "add", "path": "/spec/pausedUntil", "value": "true"}]'
  3. Pause the reconciliation of the AgentCluster resource by running the following command:

    $ oc --kubeconfig <management_cluster_kubeconfig_file> \
      annotate agentcluster -n <hosted_control_plane_namespace>  \
      cluster.x-k8s.io/paused=true --all'
  4. Pause the reconciliation of the AgentMachine resource by running the following command:

    $ oc --kubeconfig <management_cluster_kubeconfig_file> \
      annotate agentmachine -n <hosted_control_plane_namespace>  \
      cluster.x-k8s.io/paused=true --all'
  5. Annotate the HostedCluster resource to prevent the deletion of the hosted control plane namespace by running the following command:

    $ oc --kubeconfig <management_cluster_kubeconfig_file> \
      annotate hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
      hypershift.openshift.io/skip-delete-hosted-controlplane-namespace=true
  6. Create a YAML file that defines the Backup CR:

    Example backup-control-plane.yaml file
    apiVersion: velero.io/v1
    kind: Backup
    metadata:
      name: <backup_resource_name> (1)
      namespace: openshift-adp
      labels:
        velero.io/storage-location: default
    spec:
      hooks: {}
      includedNamespaces: (2)
      - <hosted_cluster_namespace> (3)
      - <hosted_control_plane_namespace> (4)
      includedResources:
      - sa
      - role
      - rolebinding
      - pod
      - pvc
      - pv
      - bmh
      - configmap
      - infraenv (5)
      - priorityclasses
      - pdb
      - agents
      - hostedcluster
      - nodepool
      - secrets
      - services
      - deployments
      - hostedcontrolplane
      - cluster
      - agentcluster
      - agentmachinetemplate
      - agentmachine
      - machinedeployment
      - machineset
      - machine
      excludedResources: []
      storageLocation: default
      ttl: 2h0m0s
      snapshotMoveData: true (6)
      datamover: "velero" (6)
      defaultVolumesToFsBackup: true (7)
    1 Replace backup_resource_name with the name of your Backup resource.
    2 Selects specific namespaces to back up objects from them. You must include your hosted cluster namespace and the hosted control plane namespace.
    3 Replace <hosted_cluster_namespace> with the name of the hosted cluster namespace, for example, clusters.
    4 Replace <hosted_control_plane_namespace> with the name of the hosted control plane namespace, for example, clusters-hosted.
    5 You must create the infraenv resource in a separate namespace. Do not delete the infraenv resource during the backup process.
    6 Enables the CSI volume snapshots and uploads the control plane workload automatically to the cloud storage.
    7 Sets the fs-backup backing up method for persistent volumes (PVs) as default. This setting is useful when you use a combination of Container Storage Interface (CSI) volume snapshots and the fs-backup method.

    If you want to use CSI volume snapshots, you must add the backup.velero.io/backup-volumes-excludes=<pv_name> annotation to your PVs.

  7. Apply the Backup CR by running the following command:

    $ oc apply -f backup-control-plane.yaml
Verification
  • Verify if the value of the status.phase is Completed by running the following command:

    $ oc get backup <backup_resource_name> -n openshift-adp -o jsonpath='{.status.phase}'
Next steps
  • Restoring a hosted cluster by using OADP

Restoring a hosted cluster by using OADP

You can restore the hosted cluster by creating the Restore custom resource (CR).

  • If you are using an in-place update, InfraEnv does not need spare nodes. You need to re-provision the worker nodes from the new management cluster.

  • If you are using a replace update, you need some spare nodes for InfraEnv to deploy the worker nodes.

After you back up your hosted cluster, you must destroy it to initiate the restoring process. To initiate node provisioning, you must back up workloads in the data plane before deleting the hosted cluster.

Prerequisites

To monitor and observe the backup process, see "Observing the backup and restore process".

Procedure
  1. Verify that no pods and persistent volume claims (PVCs) are present in the hosted control plane namespace by running the following command:

    $ oc get pod pvc -n <hosted_control_plane_namespace>
    Expected output
    No resources found
  2. Create a YAML file that defines the Restore CR:

    Example restore-hosted-cluster.yaml file
    apiVersion: velero.io/v1
    kind: Restore
    metadata:
      name: <restore_resource_name> (1)
      namespace: openshift-adp
    spec:
      backupName: <backup_resource_name> (2)
      restorePVs: true (3)
      existingResourcePolicy: update (4)
      excludedResources:
      - nodes
      - events
      - events.events.k8s.io
      - backups.velero.io
      - restores.velero.io
      - resticrepositories.velero.io
    1 Replace <restore_resource_name> with the name of your Restore resource.
    2 Replace <backup_resource_name> with the name of your Backup resource.
    3 Initiates the recovery of persistent volumes (PVs) and its pods.
    4 Ensures that the existing objects are overwritten with the backed up content.

    You must create the infraenv resource in a separate namespace. Do not delete the infraenv resource during the restore process. The infraenv resource is mandatory for the new nodes to be reprovisioned.

  3. Apply the Restore CR by running the following command:

    $ oc apply -f restore-hosted-cluster.yaml
  4. Verify if the value of the status.phase is Completed by running the following command:

    $ oc get hostedcluster <hosted_cluster_name> -n <hosted_cluster_namespace> -o jsonpath='{.status.phase}'
  5. After the restore process is complete, start the reconciliation of the HostedCluster and NodePool resources that you paused during backing up of the control plane workload:

    1. Start the reconciliation of the HostedCluster resource by running the following command:

      $ oc --kubeconfig <management_cluster_kubeconfig_file> \
        patch hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
        --type json -p '[{"op": "add", "path": "/spec/pausedUntil", "value": "false"}]'
    2. Start the reconciliation of the NodePool resource by running the following command:

      $ oc --kubeconfig <management_cluster_kubeconfig_file> \
        patch nodepool -n <hosted_cluster_namespace> <node_pool_name> \
        --type json -p '[{"op": "add", "path": "/spec/pausedUntil", "value": "false"}]'
  6. Start the reconciliation of the Agent provider resources that you paused during backing up of the control plane workload:

    1. Start the reconciliation of the AgentCluster resource by running the following command:

      $ oc --kubeconfig <management_cluster_kubeconfig_file> \
        annotate agentcluster -n <hosted_control_plane_namespace>  \
        cluster.x-k8s.io/paused- --overwrite=true --all
    2. Start the reconciliation of the AgentMachine resource by running the following command:

      $ oc --kubeconfig <management_cluster_kubeconfig_file> \
        annotate agentmachine -n <hosted_control_plane_namespace>  \
        cluster.x-k8s.io/paused- --overwrite=true --all
  7. Remove the hypershift.openshift.io/skip-delete-hosted-controlplane-namespace- annotation in the HostedCluster resource to avoid manually deleting the hosted control plane namespace by running the following command:

    $ oc --kubeconfig <management_cluster_kubeconfig_file> \
      annotate hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
      hypershift.openshift.io/skip-delete-hosted-controlplane-namespace- \
      --overwrite=true --all
  8. Scale the NodePool resource to the desired number of replicas by running the following command:

    $ oc --kubeconfig <management_cluster_kubeconfig_file> \
      scale nodepool -n <hosted_cluster_namespace> <node_pool_name> \
      --replicas <replica_count> (1)
    1 Replace <replica_count> by an integer value, for example, 3.

Observing the backup and restore process

When using OpenShift API for Data Protection (OADP) to backup and restore a hosted cluster, you can monitor and observe the process.

Procedure
  1. Observe the backup process by running the following command:

    $ watch "oc get backup -n openshift-adp <backup_resource_name> -o jsonpath='{.status}'"
  2. Observe the restore process by running the following command:

    $ watch "oc get restore -n openshift-adp <backup_resource_name> -o jsonpath='{.status}'"
  3. Observe the Velero logs by running the following command:

    $ oc logs -n openshift-adp -ldeploy=velero -f
  4. Observe the progress of all of the OADP objects by running the following command:

    $ watch "echo BackupRepositories:;echo;oc get backuprepositories.velero.io -A;echo; echo BackupStorageLocations: ;echo; oc get backupstoragelocations.velero.io -A;echo;echo DataUploads: ;echo;oc get datauploads.velero.io -A;echo;echo DataDownloads: ;echo;oc get datadownloads.velero.io -n openshift-adp; echo;echo VolumeSnapshotLocations: ;echo;oc get volumesnapshotlocations.velero.io -A;echo;echo Backups:;echo;oc get backup -A; echo;echo Restores:;echo;oc get restore -A"

Using the velero CLI to describe the Backup and Restore resources

When using OpenShift API for Data Protection, you can get more details of the Backup and Restore resources by using the velero command-line interface (CLI).

Procedure
  1. Create an alias to use the velero CLI from a container by running the following command:

    $ alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero'
  2. Get details of your Restore custom resource (CR) by running the following command:

    $ velero restore describe <restore_resource_name> --details (1)
    1 Replace <restore_resource_name> with the name of your Restore resource.
  3. Get details of your Backup CR by running the following command:

    $ velero restore describe <backup_resource_name> --details (1)
    1 Replace <backup_resource_name> with the name of your Backup resource.