Troubleshooting Cluster API clusters

Managing machines with the Cluster API is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Use the information in this section to understand and recover from issues you might encounter. Generally, troubleshooting steps for problems with the Cluster API are similar to those for problems with the Machine API.

The Cluster CAPI Operator and its operands are provisioned in the openshift-cluster-api namespace, whereas the Machine API uses the openshift-machine-api namespace. When using oc commands that reference a namespace, be sure to reference the correct one.
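For example, to confirm which namespace each set of components runs in, you can list the pods in both namespaces:

    $ oc get pods -n openshift-cluster-api

    $ oc get pods -n openshift-machine-api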

Referencing the intended objects when using the CLI

For clusters that use the Cluster API, OpenShift CLI (oc) commands prioritize Cluster API objects over Machine API objects.

This behavior impacts any oc command that acts upon any object that is represented in both the Cluster API and the Machine API. This explanation uses the oc delete machine command, which deletes a machine, as an example.

Cause

When you run an oc command, oc communicates with the Kube API server to determine which objects to act upon. When a resource name matches more than one installed custom resource definition (CRD), the Kube API server uses the first CRD that it encounters in alphabetical order.

CRDs for Cluster API objects are in the cluster.x-k8s.io group, while CRDs for Machine API objects are in the machine.openshift.io group. Because the letter c precedes the letter m alphabetically, the Kube API server matches on the Cluster API object CRD. As a result, the oc command acts upon Cluster API objects.
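You can inspect both groups of CRDs directly. For example, the api-resources subcommand lists the resources that each group defines:

    $ oc api-resources --api-group=cluster.x-k8s.io

    $ oc api-resources --api-group=machine.openshift.io

Both groups define a machines resource, which is why the short name machine is ambiguous.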

Consequence

Due to this behavior, the following unintended outcomes can occur on a cluster that uses the Cluster API:

  • For namespaces that contain both types of objects, commands such as oc get machine return only Cluster API objects.

  • For namespaces that contain only Machine API objects, commands such as oc get machine return no results.

Workaround

You can ensure that oc commands act on the type of objects you intend by using the corresponding fully qualified name.

Prerequisites
  • You have access to the cluster using an account with cluster-admin permissions.

  • You have installed the OpenShift CLI (oc).

Procedure
  • To delete a Machine API machine, use the fully qualified name machine.machine.openshift.io when running the oc delete machine command:

    $ oc delete machine.machine.openshift.io <machine_name>
  • To delete a Cluster API machine, use the fully qualified name machine.cluster.x-k8s.io when running the oc delete machine command:

    $ oc delete machine.cluster.x-k8s.io <machine_name>
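The same fully qualified names work with other oc commands that act on these objects. For example, to list each type of machine explicitly:

    $ oc get machines.machine.openshift.io -n openshift-machine-api

    $ oc get machines.cluster.x-k8s.io -n openshift-cluster-api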

Duplicated machine set and machine resources

On clusters that support migrating Machine API resources to Cluster API resources, some resources seem to have duplicate instances in the output of OpenShift CLI (oc) commands that list resources and in the OKD web console.

Cause

When you install an OKD cluster that uses the default configuration options, the installation program provisions the following infrastructure resources in the openshift-machine-api namespace:

  • One control plane machine set that manages three control plane machines.

  • One or more compute machine sets that manage three compute machines.

  • One machine health check that manages spot instances.

  • Compute machines that are created according to the compute machine set specifications.

On clusters that support migrating Machine API resources to Cluster API resources, a two-way synchronization controller creates the following Cluster API resources in the openshift-cluster-api namespace:

  • One cluster resource.

  • One provider-specific infrastructure cluster resource.

  • One or more machine templates that correspond to compute machine sets.

  • One or more compute machine sets that manage three compute machines.

  • Compute machines that are created according to the machine template and compute machine set specifications.

  • Infrastructure machines that correspond to compute machines.

These Cluster API resources have the same names as their counterparts in the openshift-machine-api namespace.

Consequence

Due to this behavior, instances of machine set and machine resources that seem to be duplicates appear in the output of oc commands that list resources and in the OKD web console.

Workaround

Although the resources have the same names as their counterparts in the other namespace, only the resources that use the current authoritative API are active. The synchronization controller creates and maintains the corresponding resources that do not use the current authoritative API in an unprovisioned (Paused) state to prevent unintended reconciliation.
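To confirm that a resource is the paused, nonauthoritative copy, you can inspect its conditions. The following sketch assumes that the synchronization controller reports a Paused condition in the .status.conditions list of the resource:

    $ oc get machineset.cluster.x-k8s.io <machine_set_name> \
        -n openshift-cluster-api \
        -o jsonpath='{.status.conditions[?(@.type=="Paused")].status}{"\n"}'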

Result

Only one of each resource that seems to be a duplicate is active at a time. The inactive nonauthoritative resources do not impact functionality.

Do not delete a nonauthoritative resource unless you intend to delete the corresponding authoritative resource as well.

When you delete a nonauthoritative resource, the synchronization controller also deletes the corresponding authoritative resource. For more information, see "Unexpected resource deletion behavior".
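Before deleting a resource that looks like a duplicate, you can check which API is authoritative for it. A minimal check, assuming the synchronization controller populates the .status.authoritativeAPI field on synchronized resources:

    $ oc get machine.machine.openshift.io <machine_name> \
        -n openshift-machine-api \
        -o jsonpath='{.status.authoritativeAPI}{"\n"}'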

Troubleshooting resource migration

When you migrate a resource to use a different authoritative API, you might encounter issues during the migration process. You might also notice unexpected behavior due to differences between the Cluster API and the Machine API.

Authoritative API types of compute machines

The authoritative API of a compute machine depends on the values of the .spec.authoritativeAPI and .spec.template.spec.authoritativeAPI fields in the Machine API compute machine set that creates it.

Table 1. Interaction of authoritativeAPI fields when creating compute machines

  .spec.authoritativeAPI value  .spec.template.spec.authoritativeAPI value  authoritativeAPI value for new compute machines
  ClusterAPI                    ClusterAPI                                  ClusterAPI
  ClusterAPI                    MachineAPI                                  ClusterAPI
  MachineAPI                    MachineAPI                                  MachineAPI
  MachineAPI                    ClusterAPI                                  ClusterAPI

When the .spec.authoritativeAPI value is ClusterAPI, the Machine API machine set is not authoritative and the .spec.template.spec.authoritativeAPI value is not used. As a result, the only combination that creates a compute machine with the Machine API as authoritative is where the .spec.authoritativeAPI and .spec.template.spec.authoritativeAPI values are MachineAPI.
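To predict which API new machines from a given Machine API compute machine set will use, you can read both fields directly:

    $ oc get machineset.machine.openshift.io <machine_set_name> \
        -n openshift-machine-api \
        -o jsonpath='{.spec.authoritativeAPI}{" "}{.spec.template.spec.authoritativeAPI}{"\n"}'

Only the output MachineAPI MachineAPI results in new compute machines that use the Machine API as authoritative.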

Unexpected machine counts after scaling

On clusters that support migrating resources between the Machine API and the Cluster API, users might experience unexpected behavior when scaling the number of compute machines. The output of the oc get command for a compute machine set that does not use the current authoritative API might contain inaccurate values in the CURRENT, READY, and AVAILABLE columns.

Cause

The values that populate the CURRENT, READY, and AVAILABLE columns originate in the .status stanza of a compute machine set. The two-way synchronization controller that handles resource conversion between authoritative API types does not currently synchronize values in the .status stanza.

The value in the DESIRED column reflects the .spec.replicas value of a compute machine set. The two-way synchronization controller synchronizes values in the .spec stanza.
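You can observe the difference directly by comparing the synchronized .spec.replicas value with the unsynchronized .status.replicas value on the nonauthoritative machine set:

    $ oc get machineset.machine.openshift.io <machine_set_name> \
        -n openshift-machine-api \
        -o jsonpath='{.spec.replicas} {.status.replicas}{"\n"}'

On a nonauthoritative machine set, the first value is kept in sync with the authoritative resource, while the second value can be stale.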

Consequence

Users can expect to see the following behavior when scaling migrated machine sets:

  1. Start with a compute machine set with existing machines.

  2. Migrate the machine set to use a different authoritative API.

  3. Scale up the now-authoritative machine set by setting a larger value in the .spec.replicas field.

  4. The machine set creates machines with the current authoritative API to satisfy the number of requested replicas.

  5. Scale the authoritative machine set down such that one of the following conditions causes the deletion of machines that do not use the current authoritative API:

    • The total number of replicas requested is fewer than the number of machines that do not use the current authoritative API.

    • The machine deletion policy for the machine set selects machines that do not use the current authoritative API.

  6. Check the status of the nonauthoritative compute machine set by running the oc get command.

    • The value in the DESIRED column in the output reflects the .spec.replicas value.

    • The values in the CURRENT, READY, and AVAILABLE columns reflect the original number of replicas that existed before scaling the machine set.

Workaround

To verify that a scale-down operation successfully deleted the compute machines that do not use the current authoritative API, run the oc get command that lists the nonauthoritative compute machines.
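For example, if the Machine API is not authoritative, you can list the remaining Machine API compute machines for the machine set. This sketch assumes the standard machine.openshift.io/cluster-api-machineset label that Machine API machine sets apply to the machines they create:

    $ oc get machines.machine.openshift.io -n openshift-machine-api \
        -l machine.openshift.io/cluster-api-machineset=<machine_set_name>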

Result

If the scale-down operation succeeded, the count in the output of the oc get command for the nonauthoritative compute machines reflects the .spec.replicas value of the machine set.

Incomplete synchronization of labels and annotations

The label and annotation synchronization behavior differs between the Machine API and the Cluster API. In some cases, these differences cause the two-way synchronization controller to overwrite labels on a Cluster API machine during migration.

Cause

With the Machine API, changes to machine set labels and annotations do not propagate to existing machines and nodes. These changes only apply to machines deployed after the update.

With the Cluster API, changes to machine set labels and annotations propagate to existing machines and nodes. When the authoritative API for a machine set changes from Machine API to Cluster API, its labels propagate to the Cluster API machines that it manages. The propagation happens before the Cluster API machine is marked as authoritative.

Consequence

The two-way synchronization controller overwrites any propagated labels and annotations with the earlier values, leading to an inconsistency. This outcome occurs only when a label or annotation is removed; updating existing labels or annotations, or adding new ones, does not cause the inconsistency.

Workaround

There is no workaround for this issue. For more information, see OCPBUGS-54333.

Unsupported configuration options

The Machine API does not support all configuration options for the Cluster API. Some Machine API configurations cannot migrate to the Cluster API. Additional configuration options might be supported in a future release.

Attempting to use the following configurations might cause a migration to fail or result in errors.

This list might not be exhaustive.

General limitations
  • Machine API compute machines cannot migrate to the Cluster API unless the NodeDeletionTimeout field uses the Cluster API default value of 10s. A check for this value is sketched after this list.

  • OKD does not support using the following Cluster API fields in the spec.template.spec stanza of a machine set or the spec stanza of a machine:

    • version

    • readinessGates

  • The Machine API does not support using the following Cluster API drain configuration options:

    • nodeDrainTimeout

    • nodeVolumeDetachTimeout

    • nodeDeletionTimeout

  • The Cluster API does not support propagating labels or taints from machines to nodes.
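As a check for the NodeDeletionTimeout limitation noted above, you can read the field from the Cluster API copies of your machines. This sketch assumes the value appears as spec.nodeDeletionTimeout on the synchronized Cluster API machine resources:

    $ oc get machines.cluster.x-k8s.io -n openshift-cluster-api \
        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeDeletionTimeout}{"\n"}{end}'

A machine that reports a value other than 10s cannot migrate until the field is reset to the default.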

Amazon Web Services (AWS) limitations
  • Machine API compute machines cannot use AWS load balancers.

  • The Machine API does not support using the following Amazon EC2 Instance Metadata Service (IMDS) configuration options:

    • httpEndpoint

    • httpPutResponseHopLimit

    • instanceMetadataTags

    If you migrate a Cluster API machine template that uses IMDS configuration options to a Machine API compute machine set, expect the following behaviors:

    • Any machines that the migrated Machine API machine set creates will not have these fields. The underlying instances will not use these settings.

    • Any existing machines that the migrated machine set manages will retain these fields. The underlying instances will continue to use these settings.

  • OKD does not support using the following AWS machine template fields:

    • spec.ami.eksLookupType

    • spec.cloudInit

    • spec.ignition.proxy

    • spec.ignition.tls

    • spec.imageLookupBaseOS

    • spec.imageLookupFormat

    • spec.imageLookupOrg

    • spec.networkInterfaces

    • spec.privateDNSName

    • spec.securityGroupOverrides

    • spec.uncompressedUserData

  • The Cluster API does not support orphaning a nonroot EBS volume when its underlying AWS EC2 instance is removed. When an instance is terminated, the Cluster API removes all dependent volumes.

  • When migrating a Machine API resource to the Cluster API, the ignition version is hard-coded and might not match the user data secret that is passed through.