cpu-load-balancing.crio.io: "disable" cpu-quota.crio.io: "disable" irq-load-balancing.crio.io: "disable"
The following sections describe the various OKD components and configurations that you use to configure and deploy clusters to run telco core workloads.
No reference design updates in this release
CPU partitioning allows for the separation of sensitive workloads from generic purposes, auxiliary processes, interrupts, and driver work queues to achieve improved performance and latency.
The operating system needs a certain amount of CPU to perform all the support tasks including kernel networking.
A system with just user plane networking applications (DPDK) needs at least one Core (2 hyperthreads when enabled) reserved for the operating system and the infrastructure components.
A system with Hyper-Threading enabled must always put all core sibling threads to the same pool of CPUs.
The set of reserved and isolated cores must include all CPU cores.
Core 0 of each NUMA node must be included in the reserved CPU set.
Isolated cores might be impacted by interrupts. The following annotations must be attached to the pod if guaranteed QoS pods require full use of the CPU:
cpu-load-balancing.crio.io: "disable" cpu-quota.crio.io: "disable" irq-load-balancing.crio.io: "disable"
When per-pod power management is enabled with PerformanceProfile.workloadHints.perPodPowerManagement
the following annotations must also be attached to the pod if guaranteed QoS pods require full use of the CPU:
cpu-c-states.crio.io: "disable" cpu-freq-governor.crio.io: "performance"
The minimum reserved capacity (systemReserved
) required can be found by following the guidance in "Which amount of CPU and memory are recommended to reserve for the system in OpenShift 4 nodes?"
The actual required reserved CPU capacity depends on the cluster configuration and workload attributes.
This reserved CPU value must be rounded up to a full core (2 hyper-thread) alignment.
Changes to the CPU partitioning will drain and reboot the nodes in the MCP.
The reserved CPUs reduce the pod density, as the reserved CPUs are removed from the allocatable capacity of the OpenShift node.
The real-time workload hint should be enabled if the workload is real-time capable.
Hardware without Interrupt Request (IRQ) affinity support will impact isolated CPUs. To ensure that pods with guaranteed CPU QoS have full use of allocated CPU, all hardware in the server must support IRQ affinity.
OVS dynamically manages its cpuset
configuration to adapt to network traffic needs.
You do not need to reserve additional CPUs for handling high network throughput on the primary CNI.
If workloads running on the cluster require cgroups v1, you can configure nodes to use cgroups v1 as part of the initial cluster deployment. For more information, see "Enabling Linux cgroup v1 during installation".
Telco core cloud-native functions (CNFs) typically require a service mesh implementation.
Specific service mesh features and performance requirements are dependent on the application. The selection of service mesh implementation and configuration is outside the scope of this documentation. You must account for the impact of service mesh on cluster resource usage and performance, including additional latency introduced in pod networking, in your implementation. |
Telco core validation is now extended with bonding, MACVLAN, IPVLAN and SR-IOV networking scenarios.
The cluster is configured in dual-stack IP configuration (IPv4 and IPv6).
The validated physical network configuration consists of two dual-port NICs. One NIC is shared among the primary CNI (OVN-Kubernetes) and IPVLAN and MACVLAN traffic, the second NIC is dedicated to SR-IOV VF-based Pod traffic.
A Linux bonding interface (bond0
) is created in an active-active LACP IEEE 802.3ad
configuration with the two NIC ports attached.
The top-of-rack networking equipment must support and be configured for multi-chassis link aggregation (mLAG) technology. |
VLAN interfaces are created on top of bond0
, including for the primary CNI.
Bond and VLAN interfaces are created at install time during network configuration.
Apart from the VLAN (VLAN0
) used by the primary CNI, the other VLANS can be created on Day 2 using the Kubernetes NMState Operator.
MACVLAN and IPVLAN interfaces are created with their corresponding CNIs. They do not share the same base interface.
SR-IOV VFs are managed by the SR-IOV Network Operator. The following diagram provides an overview of SR-IOV NIC sharing:
No reference design updates in this release
The Cluster Network Operator (CNO) deploys and manages the cluster network components including the default OVN-Kubernetes network plugin during OKD cluster installation. It allows configuring primary interface MTU settings, OVN gateway modes to use node routing tables for pod egress, and additional secondary networks such as MACVLAN.
OVN-Kubernetes is required for IPv6 support.
Large MTU cluster support requires connected network equipment to be set to the same or larger value.
MACVLAN and IPVLAN cannot co-locate on the same main interface due to their reliance on the same underlying kernel mechanism, specifically the rx_handler
.
This handler allows a third-party module to process incoming packets before the host processes them, and only one such handler can be registered per network interface.
Since both MACVLAN and IPVLAN need to register their own rx_handler
to function, they conflict and cannot coexist on the same interface.
See ipvlan/ipvlan_main.c#L82 and net/macvlan.c#L1260 for details.
Alternative NIC configurations include splitting the shared NIC into multiple NICs or using a single dual-port NIC.
Splitting the shared NIC into multiple NICs or using a single dual-port NIC has not been validated with the telco core reference design. |
Single-stack IP cluster not validated.
Pod egress traffic is handled by kernel routing table with the routingViaHost
option. Appropriate static routes must be configured in the host.
In OKD 4.17, frr-k8s
is now the default and fully supported Border Gateway Protocol (BGP) backend.
The deprecated frr
BGP mode is still available.
You should upgrade clusters to use the frr-k8s
backend.
MetalLB is a load-balancer implementation that uses standard routing protocols for bare-metal clusters. It enables a Kubernetes service to get an external IP address which is also added to the host network for the cluster.
Some use cases might require features not available in MetalLB, for example stateful load balancing. Where necessary, use an external third party load balancer. Selection and configuration of an external load balancer is outside the scope of this document. When you use an external third party load balancer, ensure that it meets all performance and resource utilization requirements. |
Stateful load balancing is not supported by MetalLB. An alternate load balancer implementation must be used if this is a requirement for workload CNFs.
The networking infrastructure must ensure that the external IP address is routable from clients to the host network for the cluster.
MetalLB is used in BGP mode only for core use case models.
For core use models, MetalLB is supported with only the OVN-Kubernetes network provider used in local gateway mode. See routingViaHost
in the "Cluster Network Operator" section.
BGP configuration in MetalLB varies depending on the requirements of the network and peers.
Address pools can be configured as needed, allowing variation in addresses, aggregation length, auto assignment, and other relevant parameters.
MetalLB uses BGP for announcing routes only.
Only the transmitInterval
and minimumTtl
parameters are relevant in this mode.
Other parameters in the BFD profile should remain close to the default settings. Shorter values might lead to errors and impact performance.
No reference design updates in this release
SR-IOV enables physical network interfaces (PFs) to be divided into multiple virtual functions (VFs). VFs can then be assigned to multiple pods to achieve higher throughput performance while keeping the pods isolated. The SR-IOV Network Operator provisions and manages SR-IOV CNI, network device plugin, and other components of the SR-IOV stack.
Supported network interface controllers are listed in "Supported devices".
The SR-IOV Network Operator automatically enables IOMMU on the kernel command line.
SR-IOV VFs do not receive link state updates from PF. If link down detection is needed, it must be done at the protocol level.
MultiNetworkPolicy
CRs can be applied to netdevice
networks only.
This is because the implementation uses the iptables
tool, which cannot manage vfio
interfaces.
SR-IOV interfaces in vfio
mode are typically used to enable additional secondary networks for applications that require high throughput or low latency.
If you exclude the SriovOperatorConfig
CR from your deployment, the CR will not be created automatically.
NICs that do not support firmware updates under secure boot or kernel lock-down must be pre-configured with enough VFs enabled to support the number of VFs needed by the application workload.
The SR-IOV Network Operator plugin for these NICs might need to be disabled using the undocumented |
No reference design updates in this release
The NMState Operator provides a Kubernetes API for performing network configurations across cluster nodes.
Not applicable
The initial networking configuration is applied using NMStateConfig
content in the installation CRs. The NMState Operator is used only when needed for network updates.
When SR-IOV virtual functions are used for host networking, the NMState Operator using NodeNetworkConfigurationPolicy
is used to configure those VF interfaces, for example, VLANs and the MTU.
Cluster Logging Operator 6.0 is new in this release. Update your existing implementation to adapt to the new version of the API.
The Cluster Logging Operator enables collection and shipping of logs off the node for remote archival and analysis. The reference configuration ships audit and infrastructure logs to a remote archive by using Kafka.
Not applicable
The impact of cluster CPU use is based on the number or size of logs generated and the amount of log filtering configured.
The reference configuration does not include shipping of application logs. Inclusion of application logs in the configuration requires evaluation of the application logging rate and sufficient additional CPU resources allocated to the reserved set.
No reference design updates in this release
Use the Performance Profile to configure clusters with high power mode, low power mode, or mixed mode. The choice of power mode depends on the characteristics of the workloads running on the cluster, particularly how sensitive they are to latency.
Power configuration relies on appropriate BIOS configuration, for example, enabling C-states and P-states. Configuration varies between hardware vendors.
Latency: To ensure that latency-sensitive workloads meet their requirements, you will need either a high-power configuration or a per-pod power management configuration. Per-pod power management is only available for Guaranteed
QoS Pods with dedicated pinned CPUs.
Cloud native storage services can be provided by multiple solutions including OpenShift Data Foundation from Red Hat or third parties.
No reference design updates in this release
Red Hat OpenShift Data Foundation is a software-defined storage service for containers. For Telco core clusters, storage support is provided by OpenShift Data Foundation storage services running externally to the application workload cluster.
In an IPv4/IPv6 dual-stack networking environment, OpenShift Data Foundation uses IPv4 addressing. For more information, see Support OpenShift dual stack with OpenShift Data Foundation using IPv4.
OpenShift Data Foundation network traffic should be isolated from other traffic on a dedicated network, for example, by using VLAN isolation.
Other storage solutions can be used to provide persistent storage for core clusters.
The configuration and integration of these solutions is outside the scope of the telco core RDS. Integration of the storage solution into the core cluster must include correct sizing and performance analysis to ensure the storage meets overall performance and resource utilization requirements. |
The following sections describe the various OKD components and configurations that you use to configure the hub cluster with Red Hat Advanced Cluster Management (RHACM).
No reference design updates in this release
Red Hat Advanced Cluster Management (RHACM) provides Multi Cluster Engine (MCE) installation and ongoing lifecycle management functionality for deployed clusters.
You manage cluster configuration and upgrades declaratively by applying Policy
custom resources (CRs) to clusters during maintenance windows.
You apply policies with the RHACM policy controller as managed by Topology Aware Lifecycle Manager (TALM).
When installing managed clusters, RHACM applies labels and initial ignition configuration to individual nodes in support of custom disk partitioning, allocation of roles, and allocation to machine config pools.
You define these configurations with SiteConfig
or ClusterInstance
CRs.
Size your cluster according to the limits specified in Sizing your cluster.
RHACM scaling limits are described in Performance and scalability.
Use RHACM policy hub-side templating to better scale cluster configuration. You can significantly reduce the number of policies by using a single group policy or small number of general group policies where the group and per-cluster values are substituted into templates.
Cluster specific configuration: managed clusters typically have some number of configuration values that are specific to the individual cluster.
These configurations should be managed using RHACM policy hub-side templating with values pulled from ConfigMap
CRs based on the cluster name.
No reference design updates in this release
Topology Aware Lifecycle Manager (TALM) is an Operator that runs only on the hub cluster for managing how changes including cluster and Operator upgrades, configuration, and so on are rolled out to the network.
TALM supports concurrent cluster deployment in batches of 400.
Precaching and backup features are for single-node OpenShift clusters only.
Only policies that have the ran.openshift.io/ztp-deploy-wave
annotation are automatically applied by TALM during initial cluster installation.
You can create further ClusterGroupUpgrade
CRs to control the policies that TALM remediates.
No reference design updates in this release
GitOps and GitOps ZTP plugins provide a GitOps-based infrastructure for managing cluster deployment and configuration.
Cluster definitions and configurations are maintained as a declarative state in Git.
You can apply ClusterInstance
CRs to the hub cluster where the SiteConfig
Operator renders them as installation CRs.
Alternatively, you can use the GitOps ZTP plugin to generate installation CRs directly from SiteConfig
CRs.
The GitOps ZTP plugin supports automatic wrapping of configuration CRs in policies based on PolicyGenTemplate
CRs.
You can deploy and manage multiple versions of OKD on managed clusters using the baseline reference configuration CRs. You can use custom CRs alongside the baseline CRs. To maintain multiple per-version policies simultaneously, use Git to manage the versions of the source CRs and policy CRs ( Keep reference CRs and custom CRs under different directories. Doing this allows you to patch and update the reference CRs by simple replacement of all directory contents without touching the custom CRs. |
300 SiteConfig
CRs per ArgoCD application.
You can use multiple applications to achieve the maximum number of clusters supported by a single hub cluster.
Content in the /source-crs
folder in Git overrides content provided in the GitOps ZTP plugin container.
Git takes precedence in the search path.
Add the /source-crs
folder in the same directory as the kustomization.yaml
file, which includes the PolicyGenTemplate
as a generator.
Alternative locations for the |
The extraManifestPath
field of the SiteConfig
CR is deprecated from OKD 4.15 and later.
Use the new extraManifests.searchPaths
field instead.
For multi-node cluster upgrades, you can pause MachineConfigPool
(MCP
) CRs during maintenance windows by setting the paused
field to true
.
You can increase the number of nodes per MCP
updated simultaneously by configuring the maxUnavailable
setting in the MCP
CR.
The MaxUnavailable
field defines the percentage of nodes in the pool that can be simultaneously unavailable during a MachineConfig
update.
Set maxUnavailable
to the maximum tolerable value.
This reduces the number of reboots in a cluster during upgrades which results in shorter upgrade times.
When you finally unpause the MCP
CR, all the changed configurations are applied with a single reboot.
During cluster installation, you can pause custom MCP
CRs by setting the paused
field to true
and setting maxUnavailable
to 100% to improve installation times.
To avoid confusion or unintentional overwriting of files when updating content, use unique and distinguishable names for user-provided CRs in the /source-crs
folder and extra manifests in Git.
The SiteConfig
CR allows multiple extra-manifest paths. When files with the same name are found in multiple directory paths, the last file found takes precedence.
This allows you to put the full set of version-specific Day 0 manifests (extra-manifests) in Git and reference them from the SiteConfig
CR.
With this feature, you can deploy multiple OKD versions to managed clusters simultaneously.
No reference design updates in this release
You can install telco core clusters with the Agent-based installer (ABI) on bare-metal servers without requiring additional servers or virtual machines for managing the installation. ABI supports installations in disconnected environments. With ABI, you install clusters by using declarative custom resources (CRs).
Agent-based installer is an optional component. The recommended installation method is by using Red Hat Advanced Cluster Management or multicluster engine for Kubernetes Operator. |
You need to have a disconnected mirror registry with all required content mirrored to do Agent-based installs in a disconnected environment.
Networking configuration should be applied as NMState
custom resources (CRs) during cluster installation.
No reference design updates in this release
The Cluster Monitoring Operator (CMO) is included by default in OKD and provides monitoring (metrics, dashboards, and alerting) for the platform components and optionally user projects as well.
The default handling of pod CPU and memory metrics is based on upstream Kubernetes |
Monitoring configuration must enable the dedicated service monitor feature for accurate representation of pod metrics
You configure the Prometheus retention period. The value used is a tradeoff between operational requirements for maintaining historical data on the cluster against CPU and storage resources. Longer retention periods increase the need for storage and require additional CPU to manage the indexing of data.
No reference design updates in this release
The scheduler is a cluster-wide component responsible for selecting the right node for a given workload. It is a core part of the platform and does not require any specific configuration in the common deployment scenarios. However, there are few specific use cases described in the following section. NUMA-aware scheduling can be enabled through the NUMA Resources Operator. For more information, see "Scheduling NUMA-aware workloads".
The default scheduler does not understand the NUMA locality of workloads. It only knows about the sum of all free resources on a worker node. This might cause workloads to be rejected when scheduled to a node with the topology manager policy set to single-numa-node
or restricted
.
For example, consider a pod requesting 6 CPUs and being scheduled to an empty node that has 4 CPUs per NUMA node. The total allocatable capacity of the node is 8 CPUs and the scheduler will place the pod there. The node local admission will fail, however, as there are only 4 CPUs available in each of the NUMA nodes.
All clusters with multi-NUMA nodes are required to use the NUMA Resources Operator. Use the machineConfigPoolSelector
field in the KubeletConfig
CR to select all nodes where NUMA aligned scheduling is needed.
All machine config pools must have consistent hardware configuration for example all nodes are expected to have the same NUMA zone count.
Pods might require annotations for correct scheduling and isolation. For more information on annotations, see "CPU partitioning and performance tuning".
You can configure SR-IOV virtual function NUMA affinity to be ignored during scheduling by using the excludeTopology
field in SriovNetworkNodePolicy
CR.
Container mount namespace encapsulation and kdump are now available in the telco core RDS.
Container mount namespace encapsulation creates a container mount namespace that reduces system mount scanning and is visible to kubelet and CRI-O.
kdump is an optional configuration that is enabled by default that captures debug information when a kernel panic occurs. The reference CRs which enable kdump include an increased memory reservation based on the set of drivers and kernel modules included in the reference configuration.
Use of kdump and container mount namespace encapsulation is made available through additional kernel modules. You should analyze these modules to determine impact on CPU load, system performance, and ability to meet required KPIs.
Install the following kernel modules with MachineConfig
CRs.
These modules provide extended kernel functionality to cloud-native functions (CNFs).
sctp
ip_gre
ip6_tables
ip6t_REJECT
ip6table_filter
ip6table_mangle
iptable_filter
iptable_mangle
iptable_nat
xt_multiport
xt_owner
xt_REDIRECT
xt_statistic
xt_TCPMSS
Secure boot is now recommended for cluster hosts configured with the telco core reference design.
Enabling secure boot is the recommended configuration.
When secure boot is enabled, only signed kernel modules are loaded by the kernel. Out-of-tree drivers are not supported. |
No reference design updates in this release
Telco core clusters are expected to be installed in networks without direct access to the internet. All container images needed to install, configure, and operator the cluster must be available in a disconnected registry. This includes OKD images, Day 2 Operator Lifecycle Manager (OLM) Operator images, and application workload images.
A unique name is required for all custom CatalogSources. Do not reuse the default catalog names.
A valid time source must be configured as part of cluster installation.
Secure boot host firmware setting is now recommended for telco core clusters. For more information, see "Host firmware and boot loader configuration".
You should harden clusters against multiple attack vectors. In OKD, there is no single component or feature responsible for securing a cluster. Use the following security-oriented features and configurations to secure your clusters:
SecurityContextConstraints (SCC): All workload pods should be run with restricted-v2
or restricted
SCC.
Seccomp: All pods should be run with the RuntimeDefault
(or stronger) seccomp profile.
Rootless DPDK pods: Many user-plane networking (DPDK) CNFs require pods to run with root privileges. With this feature, a conformant DPDK pod can be run without requiring root privileges. Rootless DPDK pods create a tap device in a rootless pod that injects traffic from a DPDK application to the kernel.
Storage: The storage network should be isolated and non-routable to other cluster networks. See the "Storage" section for additional details.
Rootless DPDK pods requires the following additional configuration steps:
Configure the TAP plugin with the container_t
SELinux context.
Enable the container_use_devices
SELinux boolean on the hosts.
For rootless DPDK pod support, the SELinux boolean container_use_devices
must be enabled on the host for the TAP device to be created. This introduces a security risk that is acceptable for short to mid-term use. Other solutions will be explored.