Disaster recovery

About disaster recovery methods
- Metro-DR
- Regional-DR
Defining applications for disaster recovery
- Best practices when defining an RHACM-managed VM
- Best practices when defining an RHACM-discovered virtual machine
VM behavior during disaster recovery scenarios
Metro-DR for Red Hat OpenShift Data Foundation

OpenShift Virtualization supports using disaster recovery (DR) solutions to ensure that your environment can recover after a site outage. To use these methods, you must plan your OpenShift Virtualization deployment in advance.

About disaster recovery methods

For an overview of disaster recovery (DR) concepts, architecture, and planning considerations, see the Red Hat OpenShift Virtualization disaster recovery guide in the Red Hat Knowledgebase.

The two primary DR methods for OpenShift Virtualization are Metropolitan Disaster Recovery (Metro-DR) and Regional-DR.

Metro-DR

Metro-DR uses synchronous replication. It writes to storage at both the primary and secondary sites so that the data is always synchronized between sites. Because the storage provider is responsible for ensuring that the synchronization succeeds, the environment must meet the throughput and latency requirements of the storage provider.

Regional-DR

Regional-DR uses asynchronous replication. The data in the primary site is synchronized with the secondary site at regular intervals. For this type of replication, you can have a higher latency connection between the primary and secondary sites.

Regional-DR is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Defining applications for disaster recovery

Define applications for disaster recovery by using VMs that Red Hat Advanced Cluster Management (RHACM) manages or discovers.

Best practices when defining an RHACM-managed VM

An RHACM-managed application that includes a VM must be created by using a GitOps workflow and by creating an RHACM application or ApplicationSet.

There are several actions you can take to improve your experience and chance of success when defining an RHACM-managed VM.

Use a PVC and populator to define storage for the VM

Because data volumes create persistent volume claims (PVCs) implicitly, data volumes and VMs with data volume templates do not fit as neatly into the GitOps model.

Use the import method when choosing a population source for your VM disk

Use the import method to work around limitations in Regional-DR that prevent you from protecting VMs that use cloned PVCs.

Select a RHEL image from the software catalog to use the import method. Red Hat recommends using a specific version of the image rather than a floating tag for consistent results. The KubeVirt community maintains container disks for other operating systems in a Quay repository.

Use `pullMethod: node`

Use the pod pullMethod: node when creating a data volume from a registry source to take advantage of the OpenShift Container Platform pull secret, which is required to pull container images from the Red Hat registry.

Best practices when defining an RHACM-discovered virtual machine

You can configure any VM in the cluster that is not an RHACM-managed application as an RHACM-discovered application. This includes VMs imported by using the Migration Toolkit for Virtualization (MTV), VMs created by using the OpenShift Virtualization web console, or VMs created by any other means, such as the CLI.

There are several actions you can take to improve your experience and chance of success when defining an RHACM-discovered VM.

Protect the VM when using MTV, the OpenShift Virtualization web console, or a custom VM

Because automatic labeling is not currently available, the application owner must manually label the components of the VM application when using MTV, the OpenShift Virtualization web console, or a custom VM.

After creating the VM, apply a common label to the following resources associated with the VM: VirtualMachine, DataVolume, PersistentVolumeClaim, Service, Route, Secret, and configmap. Do not label virtual machine instances (VMIs) or pods since OpenShift Virtualization creates and manages these automatically.

Include more than the `VirtualMachine` object in the VM

Working VMs typically also contain data volumes, persistent volume claims (PVCs), services, routes, secrets, configmap objects, and VirtualMachineSnapshot objects.

Include the VM as part of a larger logical application

This includes other pod-based workloads and VMs.

VM behavior during disaster recovery scenarios

VMs typically act similarly to pod-based workloads during both relocate and failover disaster recovery flows.

Relocate

Use relocate to move an application from the primary environment to the secondary environment when the primary environment is still accessible. During relocate, the VM is gracefully terminated, any unreplicated data is synchronized to the secondary environment, and the VM starts in the secondary environment.

Becauase the terminates gracefully, there is no data loss in this scenario. Therefore, the VM operating system does not need to perform crash recovery.

Failover

Use failover when there is a critical failure in the primary environment that makes it impractical or impossible to use relocation to move the workload to a secondary environment. When failover is executed, the storage is fenced from the primary environment, the I/O to the VM disks is abruptly halted, and the VM restarts in the secondary environment using the replicated data.

You should expect data loss due to failover. The extent of loss depends on whether you use Metro-DR, which uses synchronous replication, or Regional-DR, which uses asynchronous replication. Because Regional-DR uses snapshot-based replication intervals, the window of data loss is proportional to the replication interval length. When the VM restarts, the operating system might perform crash recovery.

Metro-DR for Red Hat OpenShift Data Foundation

OpenShift Virtualization supports the Metro-DR solution for OpenShift Data Foundation, which provides two-way synchronous data replication between managed OpenShift Virtualization clusters installed on primary and secondary sites. This solution combines Red Hat Advanced Cluster Management (RHACM), Red Hat Ceph Storage, and OpenShift Data Foundation components.

Use this solution during a site disaster to failover applications from the primary to the secondary site, and relocate the applications back to the primary site after restoring the disaster site.

This synchronous solution is only available to metropolitan distance data centers with a 10-millisecond latency or less.

For more information about using the Metro-DR solution for OpenShift Data Foundation with OpenShift Virtualization, see the Red Hat Knowledgebase or IBM’s OpenShift Data Foundation Metro-DR documentation.

Additional resources

Configuring OpenShift Data Foundation Disaster Recovery for OpenShift Workloads