Hosted control planes release notes

With this release, hosted control planes for OKD 4.22 is available. Hosted control planes for OKD 4.22 supports multicluster engine for Kubernetes Operator version 2.12.

Notable technical changes

Review the following notable technical changes introduced in this release.

Hosted control planes on IBM Z in a disconnected environment now available: With this release, hosted control planes on IBM Z® in a disconnected environment is a General Availability feature. In earlier versions, it was a Technology Preview feature.

Fixed issues

The following issues are fixed for this release:

Before this update, services in the hosted control plane namespace, such as the aws-ebs-csi-driver-controller-metrics service, used the service-ca annotation (service.beta.openshift.io/serving-cert-secret-name) to generate TLS certificates. As a consequence, control plane services incorrectly depended on the OpenShift Service CA Operator in the hosted cluster for certificate generation, which weakened the security boundary between the control plane and the hosted cluster. With this release, the Control Plane Operator creates and manages TLS certificates for the aws-ebs-csi-driver-controller-metrics service directly, signed by the hosted control plane root CA, eliminating the dependency on the OpenShift Service CA Operator. The implementation checks for service-ca annotations to ensure a smooth upgrade path from older deployments. As a result, control plane isolation and certificate lifecycle management are improved. (OCPBUGS-34662)
Before this update, the HyperShift Operator metrics collector validated the proxy CA bundle certificates on every metrics collection cycle. As a consequence, when a certificate in the CA bundle expired, repeated proxy ca bundle is invalid messages were posted in the HyperShift Operator logs without identifying the hosted cluster, making it difficult to diagnose the cluster with the invalid proxy CA certificate. With this release, certificate validation is moved to the HostedCluster reconcile loop, and a new ValidProxyConfiguration condition is added to the HostedCluster API. The metrics collector now reads the validation result from the condition instead of directly performing validation. As a result, the metrics collector no longer posts repeated messages in the logs, and affected clusters can be identified. (OCPBUGS-55151)
Before this update, KubeVirt virtual machines (VMs) used in node pools were not configured with an external eviction strategy. As a consequence, the Cluster API Provider for KubeVirt controller did not detect eviction requests during node drains on the underlying infrastructure cluster. Node drains were not coordinated properly, the hosted control plane function was disrupted, and pods failed when node pool VMs were shut down. With this release, KubeVirt VMs are configured with the external eviction strategy in the VM template specification. As a result, the Cluster API Provider for KubeVirt can detect eviction events and coordinate the draining of hosted cluster nodes during infrastructure operations. For VMs that support live migration, the Cluster API Provider for KubeVirt skips the drain process and allows the VMs to be migrated without disruption. (OCPBUGS-58397)
Before this update, when you removed the additionalTrustBundle field from the HostedCluster specification, the additionalTrustBundle certificate was not removed from the user-ca-bundle config map. As a consequence, it appeared that the additionalTrustBundle certificate was not removed from the hosted clusters. With this release, the reconciliation logic ensures that the user-ca-bundle config map is deleted from the hosted cluster when you delete the additionalTrustBundle field. As a result, when you delete the additionalTrustBundle field from the HostedCluster specification, the certificate is removed, improving security and consistency. (OCPBUGS-60707)
Before this update, the control plane deployments related to Cluster API (cluster-api and capi-provider) in the hosted control plane namespace lacked finalizers. As a consequence, if these deployments were deleted before the HostedCluster resource was deleted, the controller pods stopped running before they could process finalizers on their managed Cluster API resources (Machine objects, MachineDeployment objects, and platform-specific infrastructure objects), leading to orphaned cloud resources such as EC2 instances, VMs, disks, and load balancers. With this update, the HyperShift Operator adds a finalizer, hypershift.openshift.io/component-finalizer, to the cluster-api and capi-provider deployments. The finalizer is removed only after the underlying infrastructure resources have been deleted during HostedCluster teardown. As a result, accidental deletion of these deployments is blocked until Cluster API resources are properly cleaned up, preventing orphaned cloud resources. (OCPBUGS-63452)
Before this update, when the ValidAWSIdentityProvider condition was copied from the control plane to the hosted cluster, the logic preserved the earlier status if the new condition was Unknown. As a consequence, when the earlier condition was True and the new condition was Unknown, the update was skipped. With this release, the condition on the hosted cluster correctly reflects the current health of the AWS Identity Provider referenced in the cloud credentials. (OCPBUGS-66325)
Before this update, the Cluster Network Operator failed to recognize KubeVirt as a supported platform for hosted control planes with dual-stack networking. As a consequence, on deployments of hosted control planes on OpenShift Virtualization with dual-stack networking, the Cluster Network Operator deployment failed. With this release, the Cluster Network Operator recognizes KubeVirt as a supported platform for hosted control planes with dual-stack networking. As a result, deploying hosted control planes on OpenShift Virtualization with IPv4/IPv6 dual-stack networking succeeds. (OCPBUGS-66417)
Before this update, the cluster autoscaler did not include the hypershift.openshift.io/nodepool-globalps-enabled label in its --balancing-ignore-label list. As a consequence, when the autoscaler balanced node groups, it treated nodes with and without this label as belonging to different groups, causing uneven scaling across nodes in the same NodePool object. With this update, the hypershift.openshift.io/nodepool-globalps-enabled label is added to the balancing ignore list of the autoscaler. As a result, the autoscaler distributes new nodes evenly across node groups regardless of the Global Pull Secret eligibility label. (OCPBUGS-73817)
Before this update, when you created a hosted cluster that used a NodePort publishing strategy, a port outside the Kubernetes service node port range, such as 10000, was silently accepted during cluster creation. As a consequence, the cluster installation stalled with only 3 pods in the hosted cluster namespace, because the Control Plane Operator rejected the port for being outside the acceptable range of 30000-32767. The failure occurred late, after resources were already provisioned. With this release, the NodePort.Port value is validated early against the configured ServiceNodePortRange parameter of the cluster. Invalid values are rejected upfront with a clear message that indicates the allowed range. As a result, you receive an immediate validation error when you specify a NodePort that is outside the acceptable range, and you avoid stuck cluster installations. (OCPBUGS-65842)
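As a rough illustration of this kind of early check, the following Python sketch validates a requested node port against a range written in the 30000-32767 form. The function name and the error text are hypothetical and are not taken from the HyperShift source. Only the default range comes from the note above, and special cases such as an unset port are not modeled.

```python
def validate_node_port(port: int, service_node_port_range: str = "30000-32767") -> None:
    """Raise ValueError if port is outside the inclusive 'low-high' range.

    Illustrative only: the real validation runs in the operator, not in Python.
    """
    low_text, high_text = service_node_port_range.split("-", 1)
    low, high = int(low_text), int(high_text)
    if not low <= port <= high:
        raise ValueError(
            f"node port {port} is outside the allowed range {low}-{high}"
        )
```

Calling validate_node_port(10000) raises an error that names the allowed range, which mirrors the upfront rejection that the fix introduces.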
Before this update, the hypershift.openshift.io/nodepool-globalps-enabled label was applied to nodes by the Hosted Cluster Config Operator globalps controller, which discovered eligible nodes by querying MachineSet objects and Machine objects during its periodic reconciliation. As a consequence, when a new Replace node joined the cluster, the global-pull-secret-syncer DaemonSet pod could not schedule on it until the next reconcile cycle of the globalps controller, causing a delay of up to 15 minutes. With this update, the label is set directly on Cluster API Machine objects by the HyperShift Operator during MachineDeployment reconciliation, so it propagates to nodes at creation time by using the Hosted Cluster Config Operator Node controller. As a result, new Replace nodes on AWS are immediately eligible for the global-pull-secret-syncer DaemonSet, eliminating the scheduling delay. (OCPBUGS-77966)
Before this update, the ignition server deployment computed registry overrides by performing live HTTP registry connectivity checks (LookupMappedImage/GetMetadata) during every Control Plane Operator reconciliation. As a consequence, network conditions caused the --registry-overrides argument and MIRRORED_RELEASE_IMAGE environment variable to return different values on each reconciliation, triggering constant deployment regenerations and pod restarts. With this update, the ignition server deployment uses the static registry overrides from the HostedCluster specification instead of performing live registry lookups at deploy time. The ignition server already resolves per-image mirrors at runtime by using its own override logic. As a result, ignition server deployments remain stable with consistent configuration, eliminating unnecessary pod restarts. (OCPBUGS-60185)
Before this update, when the allowedCIDRBlocks parameter was removed from the HostedCluster specification, the LoadBalancerSourceRanges field on the external router LoadBalancer service was not cleared. As a consequence, stale Classless Inter-Domain Routing (CIDR) restrictions remained on the router service after the administrator removed the access restrictions, and continued to block traffic that should have been allowed. With this update, the reconciliation logic always sets the LoadBalancerSourceRanges field on the external router service to match the current allowedCIDRBlocks value, including clearing it when the list is empty. As a result, removing the allowedCIDRBlocks parameter from the HostedCluster specification correctly removes the CIDR restrictions from the router service. (OCPBUGS-69761)
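The difference between updating a field only when a value is present and always setting it can be sketched in a few lines of Python. This is a simplified model, not the operator code: the function name is hypothetical, and the service is represented as a plain dictionary that uses the Kubernetes loadBalancerSourceRanges field name.

```python
def reconcile_source_ranges(service_spec: dict, allowed_cidr_blocks) -> dict:
    """Make the service source ranges match the desired CIDR list.

    The assignment is unconditional, so an empty or missing list clears any
    stale ranges instead of leaving the previous value in place.
    """
    service_spec["loadBalancerSourceRanges"] = list(allowed_cidr_blocks or [])
    return service_spec
```

A version that assigned the field only when allowed_cidr_blocks was non-empty would reproduce the stale-restriction behavior described above.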
Before this update, the HostedControlPlane controller set the HostedControlPlaneAvailable condition to True after the Kubernetes API server was reachable, without verifying that all control plane components had finished rolling out. As a consequence, customers could interact with the cluster before components such as the kube-controller-manager, oauth-server, or kube-scheduler were fully ready, which could lead to failures or unexpected behavior. With this update, the controller now lists all control plane component resources in the hosted control plane namespace and verifies that each has its Available condition set to True before setting the HostedControlPlaneAvailable condition to True. If any components are not yet available, the condition reports the ComponentsNotAvailable reason with a message listing the pending components. After the cluster reaches the available state, later component rollouts, such as during upgrades, do not flip the condition back to False. As a result, the hosted control plane now only reports Available=True after all control plane components have completed their initial rollout, ensuring a more reliable user experience. (OCPBUGS-74648)
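The gating behavior can be modeled with a short Python sketch. It covers only the component check and the rule that the condition does not flip back after the initial rollout. The function name, the input shape, and the AsExpected reason string are assumptions made for illustration; only the ComponentsNotAvailable reason comes from the note above.

```python
from typing import Dict, List, Tuple


def hosted_control_plane_available(
    component_available: Dict[str, bool],
    previously_available: bool,
) -> Tuple[bool, str, List[str]]:
    """Return (status, reason, pending components) for the condition."""
    # After the initial rollout has completed, later rollouts such as
    # upgrades do not flip the condition back to False.
    if previously_available:
        return True, "AsExpected", []
    pending = sorted(name for name, ok in component_available.items() if not ok)
    if pending:
        # The pending component names would be listed in the condition message.
        return False, "ComponentsNotAvailable", pending
    return True, "AsExpected", []
```

For example, if oauth-server has not finished rolling out on a new cluster, the sketch returns a False status with the ComponentsNotAvailable reason and oauth-server in the pending list.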
Before this update, the Hosted Cluster Config Operator contained logic that modified the openshift-controller-manager-config config map to disable the serviceaccount-pull-secrets controller when the managementState parameter of the image registry was set to Removed. In OKD 4.20 and later, Control Plane Operator v2 started managing this config map, but the Hosted Cluster Config Operator continued modifying it on every reconciliation cycle. As a consequence, the openshift-controller-manager-config config map was updated by Hosted Cluster Config Operator every minute, which triggered the openshift-controller-manager file observer to detect changes and restart pods. This behavior caused constant openshift-controller-manager pod restarts. With this release, the OpenShift Controller Manager config update logic is removed from the Hosted Cluster Config Operator because Control Plane Operator v2 manages the openshift-controller-manager-config config map. As a result, the openshift-controller-manager pods no longer experience unnecessary restarts. (OCPBUGS-74931)
Before this update, during the backup and restore process with OADP, the token secret was deleted before the NodePool object was restored. Then, the NodePool controller created a token secret without the ignition-reached annotation. Because nodes were already running, they did not contact the ignition endpoint again, so the annotation was never set back. As a consequence, the ReachedIgnitionEndpoint condition stayed False, blocking machine health check creation and disabling auto-repair for the restored node pools. With this release, when the HostedCluster object has the hypershift.openshift.io/restored-from-backup annotation set by the OADP plugin, the token secret is created with the ignition-reached=True parameter, preserving the condition across the restore process. As a result, after a backup and restore process, node pools correctly report ReachedIgnitionEndpoint=True so that the machine health check and auto-repair work as expected. (OCPBUGS-77621)
Before this update, when deploying hosted clusters with a 4.21 or later payload, the HyperShift Operator used hard-coded quay.io image references for the Cluster API manager and platform-specific Cluster API provider containers. These hard-coded images bypassed the standard release payload image lookup, which respects ImageContentSourcePolicies (ICSPs) and ImageDigestMirrorSets (IDMSs). As a consequence, in disconnected or mirrored environments, Cluster API images were always pulled directly from quay.io even when registry overrides were configured, causing image pull failures and preventing cluster creation. With this update, the backward-compatible Cluster API image references are resolved by looking up the component from a pinned 4.20.10 release payload through the standard release image provider, which correctly follows registry override configuration. As a result, Cluster API images are pulled from the correct mirror registry in disconnected environments. For this fix to work, the 4.20.10 release payload and its Cluster API component images must be mirrored to the target registry. (OCPBUGS-74247)
Before this update, requests from the Kubernetes API server bootstrap container were denied by a validating admission policy that restricts feature gate changes to a specific user. As a consequence, the bootstrap container was unable to apply feature gate changes, causing control plane issues. With this release, a dedicated identity is created for the Kubernetes API server bootstrap container and is allow-listed in the policy. As a result, the bootstrap container can apply feature gate changes without being denied by the validating admission policy. (OCPBUGS-50603)
Before this update, when a predicate of a Control Plane Operator v2 component evaluated to false, the framework tried to look up and clean up the associated resource by using the cached client. For resource types not installed on the management cluster, such as the SecretProviderClass custom resource definition of the Secrets Store CSI driver, this caused the cached client to create an informer that retried list and watch actions indefinitely, blocking all control plane reconciliation. As a consequence, hosted cluster creation failed on management clusters that did not have the Secrets Store CSI driver custom resource definition installed. With this update, the Control Plane Operator probes whether a resource type is accessible on the management cluster before trying to interact with it. If the custom resource definition is not installed or the operator lacks role-based access permission, the operation is skipped gracefully and the result is cached. As a result, hosted cluster creation succeeds even when optional custom resource definitions such as the Secrets Store CSI driver are not present on the management cluster. (OCPBUGS-65687)
Before this update, when using a custom API server DNS name with external DNS, the kubeconfig secret contained an incorrect port. As a consequence, connections to the API server failed with reset errors. With this update, the kubeconfig uses the correct port for the configured DNS setup. As a result, external DNS connections work as expected. (OCPBUGS-72258)
Before this update, a race condition in VolumeSnapshot processing, in which a snapshot was deleted after it was listed but before it was retrieved, was treated as an unrecoverable error that ended the processing of the remaining snapshots. As a consequence, about 25% of scheduled backups failed intermittently and were marked as PartiallyFailed with missing etcd PVC data. With this release, deleted snapshots are gracefully skipped instead of being treated as unrecoverable errors, so that the remaining snapshots are processed normally. As a result, backups are completed successfully even when snapshot cleanup races with plugin processing. (OCPBUGS-75913)
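The skip-on-not-found pattern that this fix describes can be sketched as follows. The sketch is generic Python, not the OADP plugin code. NotFoundError, process_snapshots, and the two callbacks are hypothetical names that stand in for the API error and the plugin logic.

```python
class NotFoundError(Exception):
    """Stand-in for an API 'not found' response."""


def process_snapshots(names, get_snapshot, handle):
    """Process each listed snapshot and return the names that were skipped."""
    skipped = []
    for name in names:
        try:
            snapshot = get_snapshot(name)
        except NotFoundError:
            # The snapshot was deleted after it was listed. Skip it and
            # continue with the remaining snapshots instead of aborting.
            skipped.append(name)
            continue
        handle(snapshot)
    return skipped
```

Any other exception still propagates, so only the deletion race is tolerated.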
Before this update, when the scale-from-zero feature was enabled in AWS and a node pool used the InPlace node upgrade type with autoscaling set to min=0, the scale-from-zero implementation did not support the InPlace upgrade strategy. The original implementation used a machine deployment controller approach that only worked with the Replace upgrade strategy. As a consequence, new workloads did not trigger node pool scale-up from zero when using the InPlace upgrade type, preventing nodes from being created even when pods were pending. With this release, the scale-from-zero implementation uses a generic provider pattern that works with all upgrade types. As a result, node pools that use the InPlace upgrade type can scale up from zero when workload demands require additional capacity. The autoscaler correctly provisions nodes regardless of the upgrade strategy. (OCPBUGS-70320)

Technology Preview features status

Some features in this release are currently in Technology Preview. These experimental features are not intended for production use. Note the following scope of support on the Red Hat Customer Portal for these features:

Technology Preview Features Support Scope

In the following table, features are marked with the following statuses:

Not Available
Technology Preview
General Availability
Deprecated
Removed

For IBM Power and IBM Z, the following exceptions apply:

For version 4.20 and later, you must run the control plane on machine types that are based on 64-bit x86 architecture or s390x architecture, and node pools on IBM Power or IBM Z.
For version 4.19 and earlier, you must run the control plane on machine types that are based on 64-bit x86 architecture, and node pools on IBM Power or IBM Z.

Table 1. Hosted control planes GA and TP tracker
Feature	4.20	4.21	4.22
Hosted control planes for OKD using non-bare-metal agent machines	Technology Preview	Technology Preview	Technology Preview
Hosted control planes for OKD on OpenStack	Technology Preview	Technology Preview	Technology Preview
Custom taints and tolerations	Technology Preview	Technology Preview	Technology Preview
NVIDIA GPU devices on hosted control planes for OKD Virtualization	Technology Preview	Technology Preview	Technology Preview
Hosted control planes for OKD Virtualization on IBM Z [1]	Not Available	Technology Preview	General Availability
Hosted control planes on IBM Z in a disconnected environment	Technology Preview	Technology Preview	General Availability
Hosted control planes for OKD on Microsoft Azure	Not Available	Not Available	Technology Preview

[1] Hosted control planes for OKD Virtualization on IBM Z is supported as Technology Preview starting with OKD 4.21, multicluster engine for Kubernetes Operator 2.11, and Red Hat Advanced Cluster Management (RHACM) 2.16. Creating hosted control planes with external infrastructure is not supported.

Known issues

This section includes several known issues for OKD 4.22.

If the annotation and the ManagedCluster resource name do not match, the multicluster engine for Kubernetes Operator console displays the cluster as Pending import. The cluster cannot be used by the multicluster engine Operator. The same issue happens when there is no annotation and the ManagedCluster name does not match the Infra-ID value of the HostedCluster resource.
When you use the multicluster engine for Kubernetes Operator console to add a new node pool to an existing hosted cluster, the same version of OKD might appear more than once in the list of options. You can select any instance in the list for the version that you want.
When a node pool is scaled down to 0 workers, the list of hosts in the console still shows nodes in a Ready state. You can verify the number of nodes in two ways:
- In the console, go to the node pool and verify that it has 0 nodes.
- On the command-line interface, run the following commands:
  - Verify that 0 nodes are in the node pool by running the following command:
    
    $ oc get nodepool -A
  - Verify that 0 nodes are in the cluster by running the following command:
    
    $ oc get nodes --kubeconfig <hosted_cluster_kubeconfig_file>
  - Verify that 0 agents are reported as bound to the cluster by running the following command:
    
    $ oc get agents -A
When you create a hosted cluster in an environment that uses a dual-stack network, you might encounter pods stuck in the ContainerCreating state. This issue occurs because the openshift-service-ca-operator resource cannot generate the metrics-tls secret that the DNS pods need for DNS resolution. As a result, the pods cannot resolve the Kubernetes API server. To resolve this issue, configure the DNS server settings for a dual-stack network.
If you created a hosted cluster in the same namespace as its managed cluster, detaching the managed hosted cluster deletes everything in the managed cluster namespace including the hosted cluster. The following situations can create a hosted cluster in the same namespace as its managed cluster:
- You created a hosted cluster on the Agent platform through the multicluster engine for Kubernetes Operator console by using the default hosted cluster namespace.
- You created a hosted cluster through the command-line interface or API by specifying the hosted cluster namespace to be the same as the hosted cluster name.
When you use the console or API to specify an IPv6 address for the spec.services.servicePublishingStrategy.nodePort.address field of a hosted cluster, a full IPv6 address with 8 hextets is required. For example, instead of specifying 2620:52:0:1306::30, you need to specify 2620:52:0:1306:0:0:0:30.
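If you need to convert a compressed IPv6 address to the full form, a small helper such as the following Python sketch can do it with the standard ipaddress module. The full_ipv6 name is hypothetical. The helper writes all 8 hextets without leading zeros, which matches the form of the example above. The built-in exploded attribute is an alternative, but it pads every hextet to four digits.

```python
import ipaddress


def full_ipv6(address: str) -> str:
    """Expand '::' so that all 8 hextets are written out, without leading zeros."""
    packed = ipaddress.IPv6Address(address).packed
    hextets = [int.from_bytes(packed[i:i + 2], "big") for i in range(0, 16, 2)]
    return ":".join(format(hextet, "x") for hextet in hextets)


print(full_ipv6("2620:52:0:1306::30"))  # 2620:52:0:1306:0:0:0:30
```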
In hosted control planes on OKD Virtualization, if you store all hosted cluster information in a shared namespace and then back up and restore a hosted cluster, you might unintentionally change other hosted clusters. To avoid this issue, back up and restore only hosted clusters that use labels, or avoid storing all hosted cluster information in a shared namespace.
For version 4.21, hosted control planes pins all Cluster API images to the 4.20.10-multi release image for compatibility reasons. Hosted control planes pins the images when Cluster API deployments are generated. The 4.20.10-multi image must always be mirrored and available in order for the Cluster API to work with hosted control planes version 4.21.
Intermittent egress IP outages occur when a hosted cluster uses the following combined settings:
- The service publishing strategy for the Konnectivity service is set to Route.
- The management cluster uses Virtual Router Redundancy Protocol (VRRP) VIP for ingress.