This is a cache of https://docs.openshift.com/container-platform/4.16/backup_and_restore/application_backup_and_restore/troubleshooting.html. It is a snapshot of the page at 2024-11-27T10:26:59.237+0000.
Troubleshooting - OADP Application backup and restore | Backup and restore | OpenShift Container Platform 4.16
×

You can debug Velero custom resources (CRs) by using the OpenShift cli tool or the Velero cli tool. The Velero cli tool provides more detailed logs and information.

You can collect logs and CR information by using the must-gather tool.

You can obtain the Velero cli tool by:

  • Downloading the Velero cli tool

  • Accessing the Velero binary in the Velero deployment in the cluster

Downloading the Velero cli tool

You can download and install the Velero cli tool by following the instructions on the Velero documentation page.

The page includes instructions for:

  • macOS by using Homebrew

  • GitHub

  • Windows by using Chocolatey

Prerequisites
  • You have access to a Kubernetes cluster, v1.16 or later, with DNS and container networking enabled.

  • You have installed kubectl locally.

Procedure
  1. Open a browser and navigate to "Install the cli" on the Velero website.

  2. Follow the appropriate procedure for macOS, GitHub, or Windows.

  3. Download the Velero version appropriate for your version of OADP and OpenShift Container Platform.

OADP-Velero-OpenShift Container Platform version relationship

OADP version Velero version OpenShift Container Platform version

1.1.0

1.9

4.9 and later

1.1.1

1.9

4.9 and later

1.1.2

1.9

4.9 and later

1.1.3

1.9

4.9 and later

1.1.4

1.9

4.9 and later

1.1.5

1.9

4.9 and later

1.1.6

1.9

4.11 and later

1.1.7

1.9

4.11 and later

1.2.0

1.11

4.11 and later

1.2.1

1.11

4.11 and later

1.2.2

1.11

4.11 and later

1.2.3

1.11

4.11 and later

1.3.0

1.12

4.10 - 4.15

1.3.1

1.12

4.10 - 4.15

1.3.2

1.12

4.10 - 4.15

1.4.0

1.14

4.14 and later

1.4.1

1.14

4.14 and later

Accessing the Velero binary in the Velero deployment in the cluster

You can use a shell command to access the Velero binary in the Velero deployment in the cluster.

Prerequisites
  • Your DataProtectionApplication custom resource has a status of Reconcile complete.

Procedure
  • Enter the following command to set the needed alias:

    $ alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero'

Debugging Velero resources with the OpenShift cli tool

You can debug a failed backup or restore by checking Velero custom resources (CRs) and the Velero pod log with the OpenShift cli tool.

Velero CRs

Use the oc describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:

$ oc describe <velero_cr> <cr_name>

Velero pod logs

Use the oc logs command to retrieve the Velero pod logs:

$ oc logs pod/<velero>

Velero pod debug logs

You can specify the Velero log level in the DataProtectionApplication resource as shown in the following example.

This option is available starting from OADP 1.0.3.

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: velero-sample
spec:
  configuration:
    velero:
      logLevel: warning

The following logLevel values are available:

  • trace

  • debug

  • info

  • warning

  • error

  • fatal

  • panic

It is recommended to use debug for most logs.

Debugging Velero resources with the Velero cli tool

You can debug Backup and Restore custom resources (CRs) and retrieve logs with the Velero cli tool.

The Velero cli tool provides more detailed information than the OpenShift cli tool.

Syntax

Use the oc exec command to run a Velero cli command:

$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
  <backup_restore_cr> <command> <cr_name>
Example
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
  backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql

Help option

Use the velero --help option to list all Velero cli commands:

$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
  --help

Describe command

Use the velero describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:

$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
  <backup_restore_cr> describe <cr_name>
Example
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
  backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql

The following types of restore errors and warnings are shown in the output of a velero describe request:

  • Velero: A list of messages related to the operation of Velero itself, for example, messages related to connecting to the cloud, reading a backup file, and so on

  • Cluster: A list of messages related to backing up or restoring cluster-scoped resources

  • Namespaces: A list of list of messages related to backing up or restoring resources stored in namespaces

One or more errors in one of these categories results in a Restore operation receiving the status of PartiallyFailed and not Completed. Warnings do not lead to a change in the completion status.

  • For resource-specific errors, that is, Cluster and Namespaces errors, the restore describe --details output includes a resource list that lists all resources that Velero succeeded in restoring. For any resource that has such an error, check to see if the resource is actually in the cluster.

  • If there are Velero errors, but no resource-specific errors, in the output of a describe command, it is possible that the restore completed without any actual problems in restoring workloads, but carefully validate post-restore applications.

    For example, if the output contains PodVolumeRestore or node agent-related errors, check the status of PodVolumeRestores and DataDownloads. If none of these are failed or still running, then volume data might have been fully restored.

Logs command

Use the velero logs command to retrieve the logs of a Backup or Restore CR:

$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
  <backup_restore_cr> logs <cr_name>
Example
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
  restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf

Pods crash or restart due to lack of memory or CPU

If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources.

Additional resources

Setting resource requests for a Velero pod

You can use the configuration.velero.podConfig.resourceAllocations specification field in the oadp_v1alpha1_dpa.yaml file to set specific resource requests for a Velero pod.

Procedure
  • Set the cpu and memory resource requests in the YAML file:

    Example Velero file
    apiVersion: oadp.openshift.io/v1alpha1
    kind: DataProtectionApplication
    ...
    configuration:
      velero:
        podConfig:
          resourceAllocations: (1)
            requests:
              cpu: 200m
              memory: 256Mi
    1 The resourceAllocations listed are for average usage.

Setting resource requests for a Restic pod

You can use the configuration.restic.podConfig.resourceAllocations specification field to set specific resource requests for a Restic pod.

Procedure
  • Set the cpu and memory resource requests in the YAML file:

    Example Restic file
    apiVersion: oadp.openshift.io/v1alpha1
    kind: DataProtectionApplication
    ...
    configuration:
      restic:
        podConfig:
          resourceAllocations: (1)
            requests:
              cpu: 1000m
              memory: 16Gi
    1 The resourceAllocations listed are for average usage.

The values for the resource request fields must follow the same format as Kubernetes resource requirements. Also, if you do not specify configuration.velero.podConfig.resourceAllocations or configuration.restic.podConfig.resourceAllocations, the default resources specification for a Velero pod or a Restic pod is as follows:

requests:
  cpu: 500m
  memory: 128Mi

PodVolumeRestore fails to complete when StorageClass is NFS

The restore operation fails when there is more than one volume during a NFS restore by using Restic or Kopia. PodVolumeRestore either fails with the following error or keeps trying to restore before finally failing.

Error message
Velero: pod volume restore failed: data path restore failed: \
Failed to run kopia restore: Failed to copy snapshot data to the target: \
restore error: copy file: error creating file: \
open /host_pods/b4d...6/volumes/kubernetes.io~nfs/pvc-53...4e5/userdata/base/13493/2681: \
no such file or directory
Cause

The NFS mount path is not unique for the two volumes to restore. As a result, the velero lock files use the same file on the NFS server during the restore, causing the PodVolumeRestore to fail.

Solution

You can resolve this issue by setting up a unique pathPattern for each volume, while defining the StorageClass for nfs-subdir-external-provisioner in the deploy/class.yaml file. Use the following nfs-subdir-external-provisioner StorageClass example:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: k8s-sigs.io/nfs-subdir-external-provisioner
parameters:
  pathPattern: "${.PVC.namespace}/${.PVC.annotations.nfs.io/storage-path}" (1)
  onDelete: delete
1 Specifies a template for creating a directory path by using PVC metadata such as labels, annotations, name, or namespace. To specify metadata, use ${.PVC.<metadata>}. For example, to name a folder: <pvc-namespace>-<pvc-name>, use ${.PVC.namespace}-${.PVC.name} as pathPattern.

Issues with Velero and admission webhooks

Velero has limited abilities to resolve admission webhook issues during a restore. If you have workloads with admission webhooks, you might need to use an additional Velero plugin or make changes to how you restore the workload.

Typically, workloads with admission webhooks require you to create a resource of a specific kind first. This is especially true if your workload has child resources because admission webhooks typically block child resources.

For example, creating or restoring a top-level object such as service.serving.knative.dev typically creates child resources automatically. If you do this first, you will not need to use Velero to create and restore these resources. This avoids the problem of child resources being blocked by an admission webhook that Velero might use.

Restoring workarounds for Velero backups that use admission webhooks

This section describes the additional steps required to restore resources for several types of Velero backups that use admission webhooks.

Restoring Knative resources

You might encounter problems using Velero to back up Knative resources that use admission webhooks.

You can avoid such problems by restoring the top level Service resource first whenever you back up and restore Knative resources that use admission webhooks.

Procedure
  • Restore the top level service.serving.knavtive.dev Service resource:

    $ velero restore <restore_name> \
      --from-backup=<backup_name> --include-resources \
      service.serving.knavtive.dev

Restoring IBM AppConnect resources

If you experience issues when you use Velero to a restore an IBM® AppConnect resource that has an admission webhook, you can run the checks in this procedure.

Procedure
  1. Check if you have any mutating admission plugins of kind: MutatingWebhookConfiguration in the cluster:

    $ oc get mutatingwebhookconfigurations
  2. Examine the YAML file of each kind: MutatingWebhookConfiguration to ensure that none of its rules block creation of the objects that are experiencing issues. For more information, see the official Kubernetes documentation.

  3. Check that any spec.version in type: Configuration.appconnect.ibm.com/v1beta1 used at backup time is supported by the installed Operator.

OADP plugins known issues

The following section describes known issues in OpenShift API for Data Protection (OADP) plugins:

Velero plugin panics during imagestream backups due to a missing secret

When the backup and the Backup Storage Location (BSL) are managed outside the scope of the Data Protection Application (DPA), the OADP controller, meaning the DPA reconciliation does not create the relevant oadp-<bsl_name>-<bsl_provider>-registry-secret.

When the backup is run, the OpenShift Velero plugin panics on the imagestream backup, with the following panic error:

024-02-27T10:46:50.028951744Z time="2024-02-27T10:46:50Z" level=error msg="Error backing up item"
backup=openshift-adp/<backup name> error="error executing custom action (groupResource=imagestreams.image.openshift.io,
namespace=<BSL Name>, name=postgres): rpc error: code = Aborted desc = plugin panicked:
runtime error: index out of range with length 1, stack trace: goroutine 94…
Workaround to avoid the panic error

To avoid the Velero plugin panic error, perform the following steps:

  1. Label the custom BSL with the relevant label:

    $ oc label BackupStorageLocation <bsl_name> app.kubernetes.io/component=bsl
  2. After the BSL is labeled, wait until the DPA reconciles.

    You can force the reconciliation by making any minor change to the DPA itself.

  3. When the DPA reconciles, confirm that the relevant oadp-<bsl_name>-<bsl_provider>-registry-secret has been created and that the correct registry data has been populated into it:

    $ oc -n openshift-adp get secret/oadp-<bsl_name>-<bsl_provider>-registry-secret -o json | jq -r '.data'

OpenShift ADP Controller segmentation fault

If you configure a DPA with both cloudstorage and restic enabled, the openshift-adp-controller-manager pod crashes and restarts indefinitely until the pod fails with a crash loop segmentation fault.

You can have either velero or cloudstorage defined, because they are mutually exclusive fields.

  • If you have both velero and cloudstorage defined, the openshift-adp-controller-manager fails.

  • If you have neither velero nor cloudstorage defined, the openshift-adp-controller-manager fails.

For more information about this issue, see OADP-1054.

OpenShift ADP Controller segmentation fault workaround

You must define either velero or cloudstorage when you configure a DPA. If you define both APIs in your DPA, the openshift-adp-controller-manager pod fails with a crash loop segmentation fault.

Velero plugins returning "received EOF, stopping recv loop" message

Velero plugins are started as separate processes. After the Velero operation has completed, either successfully or not, they exit. Receiving a received EOF, stopping recv loop message in the debug logs indicates that a plugin operation has completed. It does not mean that an error has occurred.

Installation issues

You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application.

Backup storage contains invalid directories

The Velero pod log displays the error message, Backup storage contains invalid top-level directories.

Cause

The object storage contains top-level directories that are not Velero directories.

Solution

If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the spec.backupLocations.velero.objectStorage.prefix parameter in the DataProtectionApplication manifest.

Incorrect AWS credentials

The oadp-aws-registry pod log displays the error message, InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.

The Velero pod log displays the error message, NoCredentialProviders: no valid providers in chain.

Cause

The credentials-velero file used to create the Secret object is incorrectly formatted.

Solution

Ensure that the credentials-velero file is correctly formatted, as in the following example:

Example credentials-velero file
[default] (1)
aws_access_key_id=AKIAIOSFODNN7EXAMPLE (2)
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
1 AWS default profile.
2 Do not enclose the values with quotation marks (", ').

OADP Operator issues

The OpenShift API for Data Protection (OADP) Operator might encounter issues caused by problems it is not able to resolve.

OADP Operator fails silently

The S3 buckets of an OADP Operator might be empty, but when you run the command oc get po -n <OADP_Operator_namespace>, you see that the Operator has a status of Running. In such a case, the Operator is said to have failed silently because it incorrectly reports that it is running.

Cause

The problem is caused when cloud credentials provide insufficient permissions.

Solution

Retrieve a list of backup storage locations (BSLs) and check the manifest of each BSL for credential issues.

Procedure
  1. Run one of the following commands to retrieve a list of BSLs:

    1. Using the OpenShift cli:

      $ oc get backupstoragelocation -A
    2. Using the Velero cli:

      $ velero backup-location get -n <OADP_Operator_namespace>
  2. Using the list of BSLs, run the following command to display the manifest of each BSL, and examine each manifest for an error.

    $ oc get backupstoragelocation -n <namespace> -o yaml
Example result
apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: BackupStorageLocation
  metadata:
    creationTimestamp: "2023-11-03T19:49:04Z"
    generation: 9703
    name: example-dpa-1
    namespace: openshift-adp-operator
    ownerReferences:
    - apiVersion: oadp.openshift.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: DataProtectionApplication
      name: example-dpa
      uid: 0beeeaff-0287-4f32-bcb1-2e3c921b6e82
    resourceVersion: "24273698"
    uid: ba37cd15-cf17-4f7d-bf03-8af8655cea83
  spec:
    config:
      enableSharedConfig: "true"
      region: us-west-2
    credential:
      key: credentials
      name: cloud-credentials
    default: true
    objectStorage:
      bucket: example-oadp-operator
      prefix: example
    provider: aws
  status:
    lastValidationTime: "2023-11-10T22:06:46Z"
    message: "BackupStorageLocation \"example-dpa-1\" is unavailable: rpc
      error: code = Unknown desc = WebIdentityErr: failed to retrieve credentials\ncaused
      by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus
      code: 403, request id: d3f2e099-70a0-467b-997e-ff62345e3b54"
    phase: Unavailable
kind: List
metadata:
  resourceVersion: ""

OADP timeouts

Extending a timeout allows complex or resource-intensive processes to complete successfully without premature termination. This configuration can reduce the likelihood of errors, retries, or failures.

Ensure that you balance timeout extensions in a logical manner so that you do not configure excessively long timeouts that might hide underlying issues in the process. Carefully consider and monitor an appropriate timeout value that meets the needs of the process and the overall system performance.

The following are various OADP timeouts, with instructions of how and when to implement these parameters:

Restic timeout

The spec.configuration.nodeAgent.timeout parameter defines the Restic timeout. The default value is 1h.

Use the Restic timeout parameter in the nodeAgent section for the following scenarios:

  • For Restic backups with total PV data usage that is greater than 500GB.

  • If backups are timing out with the following error:

    level=error msg="Error backing up item" backup=velero/monitoring error="timed out waiting for all PodVolumeBackups to complete"
Procedure
  • Edit the values in the spec.configuration.nodeAgent.timeout block of the DataProtectionApplication custom resource (CR) manifest, as shown in the following example:

    apiVersion: oadp.openshift.io/v1alpha1
    kind: DataProtectionApplication
    metadata:
     name: <dpa_name>
    spec:
      configuration:
        nodeAgent:
          enable: true
          uploaderType: restic
          timeout: 1h
    # ...

Velero resource timeout

resourceTimeout defines how long to wait for several Velero resources before timeout occurs, such as Velero custom resource definition (CRD) availability, volumeSnapshot deletion, and repository availability. The default is 10m.

Use the resourceTimeout for the following scenarios:

  • For backups with total PV data usage that is greater than 1TB. This parameter is used as a timeout value when Velero tries to clean up or delete the Container Storage Interface (CSI) snapshots, before marking the backup as complete.

    • A sub-task of this cleanup tries to patch VSC and this timeout can be used for that task.

  • To create or ensure a backup repository is ready for filesystem based backups for Restic or Kopia.

  • To check if the Velero CRD is available in the cluster before restoring the custom resource (CR) or resource from the backup.

Procedure
  • Edit the values in the spec.configuration.velero.resourceTimeout block of the DataProtectionApplication CR manifest, as in the following example:

    apiVersion: oadp.openshift.io/v1alpha1
    kind: DataProtectionApplication
    metadata:
     name: <dpa_name>
    spec:
      configuration:
        velero:
          resourceTimeout: 10m
    # ...

Data Mover timeout

timeout is a user-supplied timeout to complete VolumeSnapshotBackup and VolumeSnapshotRestore. The default value is 10m.

Use the Data Mover timeout for the following scenarios:

  • If creation of VolumeSnapshotBackups (VSBs) and VolumeSnapshotRestores (VSRs), times out after 10 minutes.

  • For large scale environments with total PV data usage that is greater than 500GB. Set the timeout for 1h.

  • With the VolumeSnapshotMover (VSM) plugin.

  • Only with OADP 1.1.x.

Procedure
  • Edit the values in the spec.features.dataMover.timeout block of the DataProtectionApplication CR manifest, as in the following example:

    apiVersion: oadp.openshift.io/v1alpha1
    kind: DataProtectionApplication
    metadata:
     name: <dpa_name>
    spec:
      features:
        dataMover:
          timeout: 10m
    # ...

CSI snapshot timeout

CSISnapshotTimeout specifies the time during creation to wait until the CSI VolumeSnapshot status becomes ReadyToUse, before returning error as timeout. The default value is 10m.

Use the CSISnapshotTimeout for the following scenarios:

  • With the CSI plugin.

  • For very large storage volumes that may take longer than 10 minutes to snapshot. Adjust this timeout if timeouts are found in the logs.

Typically, the default value for CSISnapshotTimeout does not require adjustment, because the default setting can accommodate large storage volumes.

Procedure
  • Edit the values in the spec.csiSnapshotTimeout block of the Backup CR manifest, as in the following example:

    apiVersion: velero.io/v1
    kind: Backup
    metadata:
     name: <backup_name>
    spec:
     csiSnapshotTimeout: 10m
    # ...

Velero default item operation timeout

defaultItemOperationTimeout defines how long to wait on asynchronous BackupItemActions and RestoreItemActions to complete before timing out. The default value is 1h.

Use the defaultItemOperationTimeout for the following scenarios:

  • Only with Data Mover 1.2.x.

  • To specify the amount of time a particular backup or restore should wait for the Asynchronous actions to complete. In the context of OADP features, this value is used for the Asynchronous actions involved in the Container Storage Interface (CSI) Data Mover feature.

  • When defaultItemOperationTimeout is defined in the Data Protection Application (DPA) using the defaultItemOperationTimeout, it applies to both backup and restore operations. You can use itemOperationTimeout to define only the backup or only the restore of those CRs, as described in the following "Item operation timeout - restore", and "Item operation timeout - backup" sections.

Procedure
  • Edit the values in the spec.configuration.velero.defaultItemOperationTimeout block of the DataProtectionApplication CR manifest, as in the following example:

    apiVersion: oadp.openshift.io/v1alpha1
    kind: DataProtectionApplication
    metadata:
     name: <dpa_name>
    spec:
      configuration:
        velero:
          defaultItemOperationTimeout: 1h
    # ...

Item operation timeout - restore

ItemOperationTimeout specifies the time that is used to wait for RestoreItemAction operations. The default value is 1h.

Use the restore ItemOperationTimeout for the following scenarios:

  • Only with Data Mover 1.2.x.

  • For Data Mover uploads and downloads to or from the BackupStorageLocation. If the restore action is not completed when the timeout is reached, it will be marked as failed. If Data Mover operations are failing due to timeout issues, because of large storage volume sizes, then this timeout setting may need to be increased.

Procedure
  • Edit the values in the Restore.spec.itemOperationTimeout block of the Restore CR manifest, as in the following example:

    apiVersion: velero.io/v1
    kind: Restore
    metadata:
     name: <restore_name>
    spec:
     itemOperationTimeout: 1h
    # ...

Item operation timeout - backup

ItemOperationTimeout specifies the time used to wait for asynchronous BackupItemAction operations. The default value is 1h.

Use the backup ItemOperationTimeout for the following scenarios:

  • Only with Data Mover 1.2.x.

  • For Data Mover uploads and downloads to or from the BackupStorageLocation. If the backup action is not completed when the timeout is reached, it will be marked as failed. If Data Mover operations are failing due to timeout issues, because of large storage volume sizes, then this timeout setting may need to be increased.

Procedure
  • Edit the values in the Backup.spec.itemOperationTimeout block of the Backup CR manifest, as in the following example:

    apiVersion: velero.io/v1
    kind: Backup
    metadata:
     name: <backup_name>
    spec:
     itemOperationTimeout: 1h
    # ...

Backup and Restore CR issues

You might encounter these common issues with Backup and Restore custom resources (CRs).

Backup CR cannot retrieve volume

The Backup CR displays the error message, InvalidVolume.NotFound: The volume ‘vol-xxxx’ does not exist.

Cause

The persistent volume (PV) and the snapshot locations are in different regions.

Solution
  1. Edit the value of the spec.snapshotLocations.velero.config.region key in the DataProtectionApplication manifest so that the snapshot location is in the same region as the PV.

  2. Create a new Backup CR.

Backup CR status remains in progress

The status of a Backup CR remains in the InProgress phase and does not complete.

Cause

If a backup is interrupted, it cannot be resumed.

Solution
  1. Retrieve the details of the Backup CR:

    $ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
      backup describe <backup>
  2. Delete the Backup CR:

    $ oc delete backup <backup> -n openshift-adp

    You do not need to clean up the backup location because a Backup CR in progress has not uploaded files to object storage.

  3. Create a new Backup CR.

  4. View the Velero backup details

    $ velero backup describe <backup-name> --details

Backup CR status remains in PartiallyFailed

The status of a Backup CR without Restic in use remains in the PartiallyFailed phase and does not complete. A snapshot of the affiliated PVC is not created.

Cause

If the backup is created based on the CSI snapshot class, but the label is missing, CSI snapshot plugin fails to create a snapshot. As a result, the Velero pod logs an error similar to the following:

time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq
Solution
  1. Delete the Backup CR:

    $ oc delete backup <backup> -n openshift-adp
  2. If required, clean up the stored data on the BackupStorageLocation to free up space.

  3. Apply label velero.io/csi-volumesnapshot-class=true to the VolumeSnapshotClass object:

    $ oc label volumesnapshotclass/<snapclass_name> velero.io/csi-volumesnapshot-class=true
  4. Create a new Backup CR.

Restic issues

You might encounter these issues when you back up applications with Restic.

Restic permission error for NFS data volumes with root_squash enabled

The Restic pod log displays the error message: controller=pod-volume-backup error="fork/exec/usr/bin/restic: permission denied".

Cause

If your NFS data volumes have root_squash enabled, Restic maps to nfsnobody and does not have permission to create backups.

Solution

You can resolve this issue by creating a supplemental group for Restic and adding the group ID to the DataProtectionApplication manifest:

  1. Create a supplemental group for Restic on the NFS data volume.

  2. Set the setgid bit on the NFS directories so that group ownership is inherited.

  3. Add the spec.configuration.nodeAgent.supplementalGroups parameter and the group ID to the DataProtectionApplication manifest, as shown in the following example:

    apiVersion: oadp.openshift.io/v1alpha1
    kind: DataProtectionApplication
    # ...
    spec:
      configuration:
        nodeAgent:
          enable: true
          uploaderType: restic
          supplementalGroups:
          - <group_id> (1)
    # ...
    1 Specify the supplemental group ID.
  4. Wait for the Restic pods to restart so that the changes are applied.

Restic Backup CR cannot be recreated after bucket is emptied

If you create a Restic Backup CR for a namespace, empty the object storage bucket, and then recreate the Backup CR for the same namespace, the recreated Backup CR fails.

The velero pod log displays the following error message: stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?.

Cause

Velero does not recreate or update the Restic repository from the ResticRepository manifest if the Restic directories are deleted from object storage. See Velero issue 4421 for more information.

Solution
  • Remove the related Restic repository from the namespace by running the following command:

    $ oc delete resticrepository openshift-adp <name_of_the_restic_repository>

    In the following error log, mysql-persistent is the problematic Restic repository. The name of the repository appears in italics for clarity.

     time="2021-12-29T18:29:14Z" level=info msg="1 errors
     encountered backup up item" backup=velero/backup65
     logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds
     time="2021-12-29T18:29:14Z" level=error msg="Error backing up item"
     backup=velero/backup65 error="pod volume backup failed: error running
     restic backup, stderr=Fatal: unable to open config file: Stat: The
     specified key does not exist.\nIs there a repository at the following
     location?\ns3:http://minio-minio.apps.mayap-oadp-
     veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/
     restic/mysql-persistent\n: exit status 1" error.file="/remote-source/
     src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184"
     error.function="github.com/vmware-tanzu/velero/
     pkg/restic.(*backupper).BackupPodVolumes"
     logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds

Restic restore partially failing on OCP 4.14 due to changed PSA policy

OpenShift Container Platform 4.14 enforces a Pod Security Admission (PSA) policy that can hinder the readiness of pods during a Restic restore process. 

If a SecurityContextConstraints (SCC) resource is not found when a pod is created, and the PSA policy on the pod is not set up to meet the required standards, pod admission is denied. 

This issue arises due to the resource restore order of Velero.

Sample error
\"level=error\" in line#2273: time=\"2023-06-12T06:50:04Z\"
level=error msg=\"error restoring mysql-869f9f44f6-tp5lv: pods\\\
"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity\\\
"restricted:v1.24\\\": privil eged (container \\\"mysql\\\
" must not set securityContext.privileged=true),
allowPrivilegeEscalation != false (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\
"RuntimeDefault\\\" or \\\"Localhost\\\")\" logSource=\"/remote-source/velero/app/pkg/restore/restore.go:1388\" restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n
velero container contains \"level=error\" in line#2447: time=\"2023-06-12T06:50:05Z\"
level=error msg=\"Namespace todolist-mariadb,
resource restore error: error restoring pods/todolist-mariadb/mysql-869f9f44f6-tp5lv: pods \\\
"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity \\\"restricted:v1.24\\\": privileged (container \\\
"mysql\\\" must not set securityContext.privileged=true),
allowPrivilegeEscalation != false (containers \\\
"restic-wait\\\",\\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\
"RuntimeDefault\\\" or \\\"Localhost\\\")\"
logSource=\"/remote-source/velero/app/pkg/controller/restore_controller.go:510\"
restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n]",
Solution
  1. In your DPA custom resource (CR), check or set the restore-resource-priorities field on the Velero server to ensure that securitycontextconstraints is listed in order before pods in the list of resources:

    $ oc get dpa -o yaml
    Example DPA CR
    # ...
    configuration:
      restic:
        enable: true
      velero:
        args:
          restore-resource-priorities: 'securitycontextconstraints,customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io' (1)
        defaultPlugins:
        - gcp
        - openshift
    1 If you have an existing restore resource priority list, ensure you combine that existing list with the complete list.
  2. Ensure that the security standards for the application pods are aligned, as provided in Fixing PodSecurity Admission warnings for deployments, to prevent deployment warnings. If the application is not aligned with security standards, an error can occur regardless of the SCC. 

This solution is temporary, and ongoing discussions are in progress to address it. 

Using the must-gather tool

You can collect logs, metrics, and information about OADP custom resources by using the must-gather tool.

The must-gather data must be attached to all customer cases.

You can run the must-gather tool with the following data collection options:

  • Full must-gather data collection collects Prometheus metrics, pod logs, and Velero CR information for all namespaces where the OADP Operator is installed.

  • Essential must-gather data collection collects pod logs and Velero CR information for a specific duration of time, for example, one hour or 24 hours. Prometheus metrics and duplicate logs are not included.

  • must-gather data collection with timeout. Data collection can take a long time if there are many failed Backup CRs. You can improve performance by setting a timeout value.

  • Prometheus metrics data dump downloads an archive file containing the metrics data collected by Prometheus.

Prerequisites
  • You must be logged in to the OpenShift Container Platform cluster as a user with the cluster-admin role.

  • You must have the OpenShift cli (oc) installed.

  • You must use Red Hat Enterprise Linux (RHEL) 8.x with OADP 1.2.

  • You must use Red Hat Enterprise Linux (RHEL) 9.x with OADP 1.3.

Procedure
  1. Navigate to the directory where you want to store the must-gather data.

  2. Run the oc adm must-gather command for one of the following data collection options:

    • Full must-gather data collection, including Prometheus metrics:

      1. For OADP 1.2, run the following command:

        $ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.2
      2. For OADP 1.3, run the following command:

        $ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3

        The data is saved as must-gather/must-gather.tar.gz. You can upload this file to a support case on the Red Hat Customer Portal.

    • Essential must-gather data collection, without Prometheus metrics, for a specific time duration:

      $ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.1 \
        -- /usr/bin/gather_<time>_essential (1)
      1 Specify the time in hours. Allowed values are 1h, 6h, 24h, 72h, or all, for example, gather_1h_essential or gather_all_essential.
    • must-gather data collection with timeout:

      $ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.1 \
        -- /usr/bin/gather_with_timeout <timeout> (1)
      1 Specify a timeout value in seconds.
    • Prometheus metrics data dump:

      1. For OADP 1.2, run the following command:

        $ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.2 -- /usr/bin/gather_metrics_dump
      2. For OADP 1.3, run the following command:

        $ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 -- /usr/bin/gather_metrics_dump

        This operation can take a long time. The data is saved as must-gather/metrics/prom_data.tar.gz.

Additional resources

Using must-gather with insecure TLS connections

If a custom CA certificate is used, the must-gather pod fails to grab the output for velero logs/describe. To use the must-gather tool with insecure TLS connections, you can pass the gather_without_tls flag to the must-gather command.

Procedure
  • Pass the gather_without_tls flag, with value set to true, to the must-gather tool by using the following command:

$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 -- /usr/bin/gather_without_tls <true/false>

By default, the flag value is set to false. Set the value to true to allow insecure TLS connections.

Combining options when using the must-gather tool

Currently, it is not possible to combine must-gather scripts, for example specifying a timeout threshold while permitting insecure TLS connections. In some situations, you can get around this limitation by setting up internal variables on the must-gather command line, such as the following example:

oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 -- skip_tls=true /usr/bin/gather_with_timeout <timeout_value_in_seconds>

In this example, set the skip_tls variable before running the gather_with_timeout script. The result is a combination of gather_with_timeout and gather_without_tls.

The only other variables that you can specify this way are the following:

  • logs_since, with a default value of 72h

  • request_timeout, with a default value of 0s

If DataProtectionApplication custom resource (CR) is configured with s3Url and insecureSkipTLS: true, the CR does not collect the necessary logs because of a missing CA certificate. To collect those logs, run the must-gather command with the following option:

$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 -- /usr/bin/gather_without_tls true

OADP Monitoring

The OpenShift Container Platform provides a monitoring stack that allows users and administrators to effectively monitor and manage their clusters, as well as monitor and analyze the workload performance of user applications and services running on the clusters, including receiving alerts if an event occurs.

Additional resources

OADP monitoring setup

The OADP Operator leverages an OpenShift User Workload Monitoring provided by the OpenShift Monitoring Stack for retrieving metrics from the Velero service endpoint. The monitoring stack allows creating user-defined Alerting Rules or querying metrics by using the OpenShift Metrics query front end.

With enabled User Workload Monitoring, it is possible to configure and use any Prometheus-compatible third-party UI, such as Grafana, to visualize Velero metrics.

Monitoring metrics requires enabling monitoring for the user-defined projects and creating a ServiceMonitor resource to scrape those metrics from the already enabled OADP service endpoint that resides in the openshift-adp namespace.

Prerequisites
  • You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.

  • You have created a cluster monitoring config map.

Procedure
  1. Edit the cluster-monitoring-config ConfigMap object in the openshift-monitoring namespace:

    $ oc edit configmap cluster-monitoring-config -n openshift-monitoring
  2. Add or enable the enableUserWorkload option in the data section’s config.yaml field:

    apiVersion: v1
    data:
      config.yaml: |
        enableUserWorkload: true (1)
    kind: ConfigMap
    metadata:
    # ...
    1 Add this option or set to true
  3. Wait a short period of time to verify the User Workload Monitoring Setup by checking if the following components are up and running in the openshift-user-workload-monitoring namespace:

    $ oc get pods -n openshift-user-workload-monitoring
    Example output
    NAME                                   READY   STATUS    RESTARTS   AGE
    prometheus-operator-6844b4b99c-b57j9   2/2     Running   0          43s
    prometheus-user-workload-0             5/5     Running   0          32s
    prometheus-user-workload-1             5/5     Running   0          32s
    thanos-ruler-user-workload-0           3/3     Running   0          32s
    thanos-ruler-user-workload-1           3/3     Running   0          32s
  4. Verify the existence of the user-workload-monitoring-config ConfigMap in the openshift-user-workload-monitoring. If it exists, skip the remaining steps in this procedure.

    $ oc get configmap user-workload-monitoring-config -n openshift-user-workload-monitoring
    Example output
    Error from server (NotFound): configmaps "user-workload-monitoring-config" not found
  5. Create a user-workload-monitoring-config ConfigMap object for the User Workload Monitoring, and save it under the 2_configure_user_workload_monitoring.yaml file name:

    Example output
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
  6. Apply the 2_configure_user_workload_monitoring.yaml file:

    $ oc apply -f 2_configure_user_workload_monitoring.yaml
    configmap/user-workload-monitoring-config created

Creating OADP service monitor

OADP provides an openshift-adp-velero-metrics-svc service which is created when the DPA is configured. The service monitor used by the user workload monitoring must point to the defined service.

Get details about the service by running the following commands:

Procedure
  1. Ensure the openshift-adp-velero-metrics-svc service exists. It should contain app.kubernetes.io/name=velero label, which will be used as selector for the ServiceMonitor object.

    $ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero
    Example output
    NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
    openshift-adp-velero-metrics-svc   ClusterIP   172.30.38.244   <none>        8085/TCP   1h
  2. Create a ServiceMonitor YAML file that matches the existing service label, and save the file as 3_create_oadp_service_monitor.yaml. The service monitor is created in the openshift-adp namespace where the openshift-adp-velero-metrics-svc service resides.

    Example ServiceMonitor object
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        app: oadp-service-monitor
      name: oadp-service-monitor
      namespace: openshift-adp
    spec:
      endpoints:
      - interval: 30s
        path: /metrics
        targetPort: 8085
        scheme: http
      selector:
        matchLabels:
          app.kubernetes.io/name: "velero"
  3. Apply the 3_create_oadp_service_monitor.yaml file:

    $ oc apply -f 3_create_oadp_service_monitor.yaml
    Example output
    servicemonitor.monitoring.coreos.com/oadp-service-monitor created
Verification
  • Confirm that the new service monitor is in an Up state by using the Administrator perspective of the OpenShift Container Platform web console:

    1. Navigate to the ObserveTargets page.

    2. Ensure the Filter is unselected or that the User source is selected and type openshift-adp in the Text search field.

    3. Verify that the status for the Status for the service monitor is Up.

      OADP metrics targets
      Figure 1. OADP metrics targets

Creating an alerting rule

The OpenShift Container Platform monitoring stack allows to receive Alerts configured using Alerting Rules. To create an Alerting rule for the OADP project, use one of the Metrics which are scraped with the user workload monitoring.

Procedure
  1. Create a PrometheusRule YAML file with the sample OADPBackupFailing alert and save it as 4_create_oadp_alert_rule.yaml.

    Sample OADPBackupFailing alert
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: sample-oadp-alert
      namespace: openshift-adp
    spec:
      groups:
      - name: sample-oadp-backup-alert
        rules:
        - alert: OADPBackupFailing
          annotations:
            description: 'OADP had {{$value | humanize}} backup failures over the last 2 hours.'
            summary: OADP has issues creating backups
          expr: |
            increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
          for: 5m
          labels:
            severity: warning

    In this sample, the Alert displays under the following conditions:

    • There is an increase of new failing backups during the 2 last hours that is greater than 0 and the state persists for at least 5 minutes.

    • If the time of the first increase is less than 5 minutes, the Alert will be in a Pending state, after which it will turn into a Firing state.

  2. Apply the 4_create_oadp_alert_rule.yaml file, which creates the PrometheusRule object in the openshift-adp namespace:

    $ oc apply -f 4_create_oadp_alert_rule.yaml
    Example output
    prometheusrule.monitoring.coreos.com/sample-oadp-alert created
Verification
  • After the Alert is triggered, you can view it in the following ways:

    • In the Developer perspective, select the Observe menu.

    • In the Administrator perspective under the ObserveAlerting menu, select User in the Filter box. Otherwise, by default only the Platform Alerts are displayed.

      OADP backup failing alert
      Figure 2. OADP backup failing alert
Additional resources

List of available metrics

These are the list of metrics provided by the OADP together with their Types.

Metric name Description Type

kopia_content_cache_hit_bytes

Number of bytes retrieved from the cache

Counter

kopia_content_cache_hit_count

Number of times content was retrieved from the cache

Counter

kopia_content_cache_malformed

Number of times malformed content was read from the cache

Counter

kopia_content_cache_miss_count

Number of times content was not found in the cache and fetched

Counter

kopia_content_cache_missed_bytes

Number of bytes retrieved from the underlying storage

Counter

kopia_content_cache_miss_error_count

Number of times content could not be found in the underlying storage

Counter

kopia_content_cache_store_error_count

Number of times content could not be saved in the cache

Counter

kopia_content_get_bytes

Number of bytes retrieved using GetContent()

Counter

kopia_content_get_count

Number of times GetContent() was called

Counter

kopia_content_get_error_count

Number of times GetContent() was called and the result was an error

Counter

kopia_content_get_not_found_count

Number of times GetContent() was called and the result was not found

Counter

kopia_content_write_bytes

Number of bytes passed to WriteContent()

Counter

kopia_content_write_count

Number of times WriteContent() was called

Counter

velero_backup_attempt_total

Total number of attempted backups

Counter

velero_backup_deletion_attempt_total

Total number of attempted backup deletions

Counter

velero_backup_deletion_failure_total

Total number of failed backup deletions

Counter

velero_backup_deletion_success_total

Total number of successful backup deletions

Counter

velero_backup_duration_seconds

Time taken to complete backup, in seconds

Histogram

velero_backup_failure_total

Total number of failed backups

Counter

velero_backup_items_errors

Total number of errors encountered during backup

Gauge

velero_backup_items_total

Total number of items backed up

Gauge

velero_backup_last_status

Last status of the backup. A value of 1 is success, 0.

Gauge

velero_backup_last_successful_timestamp

Last time a backup ran successfully, Unix timestamp in seconds

Gauge

velero_backup_partial_failure_total

Total number of partially failed backups

Counter

velero_backup_success_total

Total number of successful backups

Counter

velero_backup_tarball_size_bytes

Size, in bytes, of a backup

Gauge

velero_backup_total

Current number of existent backups

Gauge

velero_backup_validation_failure_total

Total number of validation failed backups

Counter

velero_backup_warning_total

Total number of warned backups

Counter

velero_csi_snapshot_attempt_total

Total number of CSI attempted volume snapshots

Counter

velero_csi_snapshot_failure_total

Total number of CSI failed volume snapshots

Counter

velero_csi_snapshot_success_total

Total number of CSI successful volume snapshots

Counter

velero_restore_attempt_total

Total number of attempted restores

Counter

velero_restore_failed_total

Total number of failed restores

Counter

velero_restore_partial_failure_total

Total number of partially failed restores

Counter

velero_restore_success_total

Total number of successful restores

Counter

velero_restore_total

Current number of existent restores

Gauge

velero_restore_validation_failed_total

Total number of failed restores failing validations

Counter

velero_volume_snapshot_attempt_total

Total number of attempted volume snapshots

Counter

velero_volume_snapshot_failure_total

Total number of failed volume snapshots

Counter

velero_volume_snapshot_success_total

Total number of successful volume snapshots

Counter

Viewing metrics using the Observe UI

You can view metrics in the OpenShift Container Platform web console from the Administrator or Developer perspective, which must have access to the openshift-adp project.

Procedure
  • Navigate to the ObserveMetrics page:

    • If you are using the Developer perspective, follow these steps:

      1. Select Custom query, or click on the Show PromQL link.

      2. Type the query and click Enter.

    • If you are using the Administrator perspective, type the expression in the text field and select Run Queries.

      OADP metrics query
      Figure 3. OADP metrics query