Troubleshooting Operator issues

Operators are a method of packaging, deploying, and managing a Red Hat OpenShift Service on AWS application. They act like an extension of the software vendor’s engineering team, watching over a Red Hat OpenShift Service on AWS environment and using its current state to make decisions in real time. Operators are designed to handle upgrades seamlessly, react to failures automatically, and not take shortcuts, such as skipping a software backup process to save time.

Red Hat OpenShift Service on AWS includes a default set of Operators that are required for proper functioning of the cluster. These default Operators are managed by the Cluster Version Operator (CVO).

As a cluster administrator, you can install application Operators from OperatorHub by using the Red Hat OpenShift Service on AWS web console or the CLI. You can then subscribe the Operator to one or more namespaces to make it available for developers on your cluster. Application Operators are managed by Operator Lifecycle Manager (OLM).

If you experience Operator issues, verify Operator subscription status. Check Operator pod health across the cluster and gather Operator logs for diagnosis.

Operator subscription condition types

Subscriptions can report the following condition types:

Table 1. Subscription condition types
Condition	Description
`CatalogSourcesUnhealthy`	Some or all of the catalog sources to be used in resolution are unhealthy.
`InstallPlanMissing`	An install plan for a subscription is missing.
`InstallPlanPending`	An install plan for a subscription is pending installation.
`InstallPlanFailed`	An install plan for a subscription has failed.
`ResolutionFailed`	The dependency resolution for a subscription has failed.

Default Red Hat OpenShift Service on AWS cluster Operators are managed by the Cluster Version Operator (CVO) and they do not have a Subscription object. Application Operators are managed by Operator Lifecycle Manager (OLM) and they have a Subscription object.

Additional resources

Catalog health requirements

Viewing Operator subscription status by using the CLI

You can view Operator subscription status by using the CLI.

Prerequisites

You have access to the cluster as a user with the dedicated-admin role.
You have installed the OpenShift CLI (oc).

Procedure

List Operator subscriptions:
```
$ oc get subs -n <operator_namespace>
```

Use the oc describe command to inspect a Subscription resource:

```
$ oc describe sub <subscription_name> -n <operator_namespace>
```

In the command output, find the Conditions section, which reports the status of each Operator subscription condition type. In the following example, the CatalogSourcesUnhealthy condition type has a status of False because all available catalog sources are healthy:

Example output

```
Name:         cluster-logging
Namespace:    openshift-logging
Labels:       operators.coreos.com/cluster-logging.openshift-logging=
Annotations:  <none>
API Version:  operators.coreos.com/v1alpha1
Kind:         Subscription
# ...
Conditions:
   Last Transition Time:  2019-07-29T13:42:57Z
   Message:               all available catalogsources are healthy
   Reason:                AllCatalogSourcesHealthy
   Status:                False
   Type:                  CatalogSourcesUnhealthy
# ...
```
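
When a subscription reports several conditions, it can help to print only the ones that are active. The following is a minimal sketch, not an official tool: `active_conditions` is a hypothetical helper name, and it assumes that a status of True marks an active problem, which holds for the condition types in Table 1 because each one names a problem state. The canned input, including the made-up `ExampleReason` value, is there only so that the sketch runs without a cluster.

```shell
# Hypothetical helper (not part of oc): reads "TYPE<TAB>STATUS<TAB>REASON"
# lines on stdin and prints the conditions whose status is True.
active_conditions() {
  awk -F'\t' '$2 == "True" { printf "%s (%s)\n", $1, $3 }'
}

# Canned input so the sketch runs without a cluster.
# "ExampleReason" is a made-up value for illustration.
printf 'CatalogSourcesUnhealthy\tFalse\tAllCatalogSourcesHealthy\nResolutionFailed\tTrue\tExampleReason\n' |
  active_conditions

# Against a cluster, the input could come from a JSONPath query such as:
#   oc get sub <subscription_name> -n <operator_namespace> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' |
#     active_conditions
```

With the canned input, only the `ResolutionFailed` line is printed. An empty result suggests that no condition is currently active.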

Viewing Operator catalog source status by using the CLI

You can view the status of an Operator catalog source by using the CLI.

Prerequisites

You have access to the cluster as a user with the dedicated-admin role.
You have installed the OpenShift CLI (oc).

Procedure

List the catalog sources in a namespace. For example, you can check the openshift-marketplace namespace, which is used for cluster-wide catalog sources:

```
$ oc get catalogsources -n openshift-marketplace
```

Example output

```
NAME                  DISPLAY               TYPE   PUBLISHER     AGE
certified-operators   Certified Operators   grpc   Red Hat       55m
community-operators   Community Operators   grpc   Red Hat       55m
example-catalog       Example Catalog       grpc   Example Org   2m25s
redhat-marketplace    Red Hat Marketplace   grpc   Red Hat       55m
redhat-operators      Red Hat Operators     grpc   Red Hat       55m
```

Use the oc describe command to get more details and status about a catalog source:

```
$ oc describe catalogsource example-catalog -n openshift-marketplace
```

Example output

```
Name:         example-catalog
Namespace:    openshift-marketplace
Labels:       <none>
Annotations:  operatorframework.io/managed-by: marketplace-operator
              target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}
API Version:  operators.coreos.com/v1alpha1
Kind:         CatalogSource
# ...
Status:
  Connection State:
    Address:              example-catalog.openshift-marketplace.svc:50051
    Last Connect:         2021-09-09T17:07:35Z
    Last Observed State:  TRANSIENT_FAILURE
  Registry Service:
    Created At:         2021-09-09T17:05:45Z
    Port:               50051
    Protocol:           grpc
    Service Name:       example-catalog
    Service Namespace:  openshift-marketplace
# ...
```

In the preceding example output, the last observed state is TRANSIENT_FAILURE. This state indicates that there is a problem establishing a connection for the catalog source.

List the pods in the namespace where your catalog source was created:

```
$ oc get pods -n openshift-marketplace
```

Example output

```
NAME                                    READY   STATUS             RESTARTS   AGE
certified-operators-cv9nn               1/1     Running            0          36m
community-operators-6v8lp               1/1     Running            0          36m
marketplace-operator-86bfc75f9b-jkgbc   1/1     Running            0          42m
example-catalog-bwt8z                   0/1     ImagePullBackOff   0          3m55s
redhat-marketplace-57p8c                1/1     Running            0          36m
redhat-operators-smxx8                  1/1     Running            0          36m
```

When a catalog source is created in a namespace, a pod for the catalog source is created in that namespace. In the preceding example output, the status for the example-catalog-bwt8z pod is ImagePullBackOff. This status indicates that there is an issue pulling the catalog source’s index image.

Use the oc describe command to inspect a pod for more detailed information:

```
$ oc describe pod example-catalog-bwt8z -n openshift-marketplace
```

Example output

```
Name:         example-catalog-bwt8z
Namespace:    openshift-marketplace
Priority:     0
Node:         ci-ln-jyryyg2-f76d1-ggdbq-worker-b-vsxjd/10.0.128.2
...
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       48s                default-scheduler  Successfully assigned openshift-marketplace/example-catalog-bwt8z to ci-ln-jyryyg2-f76d1-ggdbq-worker-b-vsxjd
  Normal   AddedInterface  47s                multus             Add eth0 [10.131.0.40/23] from openshift-sdn
  Normal   BackOff         20s (x2 over 46s)  kubelet            Back-off pulling image "quay.io/example-org/example-catalog:v1"
  Warning  Failed          20s (x2 over 46s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling         8s (x3 over 47s)   kubelet            Pulling image "quay.io/example-org/example-catalog:v1"
  Warning  Failed          8s (x3 over 47s)   kubelet            Failed to pull image "quay.io/example-org/example-catalog:v1": rpc error: code = Unknown desc = reading manifest v1 in quay.io/example-org/example-catalog: unauthorized: access to the requested resource is not authorized
  Warning  Failed          8s (x3 over 47s)   kubelet            Error: ErrImagePull
```

In the preceding example output, the error messages indicate that the catalog source’s index image is failing to pull successfully because of an authorization issue. For example, the index image might be stored in a registry that requires login credentials.
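
To check the connection state of every catalog source in a namespace at once, the describe step in this procedure can be approximated with a JSONPath query and a small filter. This is a sketch under stated assumptions: `unready_catalogs` is a hypothetical helper name, and it treats any last observed state other than READY, including a missing state, as worth investigating. The canned input reuses names from the example output so that the sketch runs without a cluster.

```shell
# Hypothetical helper (not part of oc): reads "NAME<TAB>LAST_OBSERVED_STATE"
# lines on stdin and prints catalog sources whose state is not READY.
unready_catalogs() {
  awk -F'\t' 'NF > 0 && $2 != "READY" {
    print $1 ": " ($2 == "" ? "<no state reported>" : $2)
  }'
}

# Canned input so the sketch runs without a cluster.
printf 'redhat-operators\tREADY\nexample-catalog\tTRANSIENT_FAILURE\n' | unready_catalogs

# Against a cluster, the input could come from a query such as:
#   oc get catalogsources -n openshift-marketplace \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.connectionState.lastObservedState}{"\n"}{end}' |
#     unready_catalogs
```

For any catalog source that the filter prints, continue with the pod inspection steps in this procedure to find the underlying cause.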

Additional resources

gRPC documentation: States of Connectivity

Querying Operator pod status

You can list Operator pods within a cluster and their status. You can also collect a detailed Operator pod summary.

Prerequisites

You have access to the cluster as a user with the dedicated-admin role.
Your API service is still functional.
You have installed the OpenShift CLI (oc).

Procedure

List the Operators running in the cluster. The output includes Operator version, availability, and uptime information:
```
$ oc get clusteroperators
```
List Operator pods running in the Operator’s namespace, plus pod status, restarts, and age:
```
$ oc get pod -n <operator_namespace>
```

Output a detailed Operator pod summary:

```
$ oc describe pod <operator_pod_name> -n <operator_namespace>
```
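
On a busy cluster, the listings in this procedure can be long. The following sketch filters them down to the rows that likely need attention. Both function names are hypothetical helpers, not oc features. The sketch assumes the default column order of `oc get pods` (NAME, READY, STATUS) and `oc get clusteroperators` (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED), with the header row suppressed by `--no-headers`. The cluster Operator row in the canned input is made up for illustration.

```shell
# Hypothetical helper: reads `oc get pods --no-headers` style rows on stdin
# and prints NAME, READY, and STATUS for pods that are not fully ready.
unhealthy_pods() {
  awk 'NF >= 3 {
    if ($3 == "Completed") next          # finished pods are not a problem
    split($2, ready, "/")                # READY looks like "1/1" or "0/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1 "\t" $2 "\t" $3
  }'
}

# Hypothetical helper: reads `oc get clusteroperators --no-headers` style rows
# on stdin and prints Operators that are unavailable or degraded.
unhealthy_cluster_operators() {
  awk 'NF >= 5 && ($3 != "True" || $5 == "True") {
    print $1 "\tavailable=" $3 "\tdegraded=" $5
  }'
}

# Canned input so the sketch runs without a cluster.
printf '%s\n' \
  'marketplace-operator-86bfc75f9b-jkgbc   1/1     Running            0          42m' \
  'example-catalog-bwt8z                   0/1     ImagePullBackOff   0          3m55s' |
  unhealthy_pods

# Made-up cluster Operator row for illustration only.
printf '%s\n' 'example-operator   0.0.0   False   False   True   5m' |
  unhealthy_cluster_operators

# Against a cluster:
#   oc get pods -n <operator_namespace> --no-headers | unhealthy_pods
#   oc get clusteroperators --no-headers | unhealthy_cluster_operators
```

Pass any pod name that the first filter prints to the `oc describe pod` command in this procedure for a detailed summary.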

Gathering Operator logs

If you experience Operator issues, you can gather detailed diagnostic information from Operator pod logs.

Prerequisites

You have access to the cluster as a user with the dedicated-admin role.
Your API service is still functional.
You have installed the OpenShift CLI (oc).
You have the fully qualified domain names of the control plane machines.

Procedure

List the Operator pods that are running in the Operator’s namespace, plus the pod status, restarts, and age:
```
$ oc get pods -n <operator_namespace>
```
Review logs for an Operator pod:
```
$ oc logs pod/<pod_name> -n <operator_namespace>
```
If an Operator pod has multiple containers, the preceding command will produce an error that includes the name of each container. Query logs from an individual container:
```
$ oc logs pod/<operator_pod_name> -c <container_name> -n <operator_namespace>
```

If the API is not functional, review Operator pod and container logs on each control plane node by using SSH instead. Replace <master-node>.<cluster_name>.<base_domain> with appropriate values.

List pods on each control plane node:

```
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl pods
```

For any Operator pods not showing a Ready status, inspect the pod’s status in detail. Replace <operator_pod_id> with the Operator pod’s ID listed in the output of the preceding command:
```
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl inspectp <operator_pod_id>
```

List containers related to an Operator pod:

```
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl ps --pod=<operator_pod_id>
```

For any Operator container not showing a Ready status, inspect the container’s status in detail. Replace <container_id> with a container ID listed in the output of the preceding command:
```
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl inspect <container_id>
```

Review the logs for any Operator containers not showing a Ready status. Replace <container_id> with a container ID listed in the output of the preceding command:

```
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl logs -f <container_id>
```

Red Hat OpenShift Service on AWS cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes by using SSH is not recommended. Before attempting to collect diagnostic data over SSH, review whether the data collected by running oc adm must-gather and other oc commands is sufficient instead. However, if the Red Hat OpenShift Service on AWS API is not available, or the kubelet is not functioning properly on the target node, oc operations are affected. In such situations, you can access nodes by using ssh core@<node>.<cluster_name>.<base_domain>.
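
When the API is functional and an Operator pod has multiple containers, the per-container log query from this procedure can be wrapped in a loop. The following is a minimal sketch, assuming that `oc` is on the PATH and that you are logged in to the cluster. `collect_pod_logs` is a hypothetical helper name, and the sketch covers only the regular containers listed under `.spec.containers`, not init containers.

```shell
# Hypothetical helper (not part of oc): prints the logs of every regular
# container in one pod, each preceded by a separator line.
# Usage: collect_pod_logs <operator_pod_name> <operator_namespace>
collect_pod_logs() {
  pod=$1
  ns=$2
  # The JSONPath query yields a space-separated list of container names.
  for container in $(oc get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].name}'); do
    echo "===== $pod/$container ====="
    oc logs "pod/$pod" -c "$container" -n "$ns"
  done
}

# Example invocation (requires cluster access):
#   collect_pod_logs <operator_pod_name> <operator_namespace> > operator-pod.log
```

Redirecting the output to a file, as in the commented invocation, keeps the logs of all containers together for later review.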