This is a cache of https://docs.okd.io/latest/scalability_and_performance/telco_ref_design_specs/core/telco-core-ref-design-components.html. It is a snapshot of the page at 2024-11-05T19:15:12.607+0000.
Core reference design components - Reference design specifications | Scalability and performance | OKD 4
×

CPU partitioning and performance tuning

New in this release
  • No reference design updates in this release

Description

CPU partitioning allows for the separation of sensitive workloads from generic purposes, auxiliary processes, interrupts, and driver work queues to achieve improved performance and latency.

Limits and requirements
  • The operating system needs a certain amount of CPU to perform all the support tasks including kernel networking.

    • A system with just user plane networking applications (DPDK) needs at least one Core (2 hyperthreads when enabled) reserved for the operating system and the infrastructure components.

  • A system with Hyper-Threading enabled must always put all core sibling threads to the same pool of CPUs.

  • The set of reserved and isolated cores must include all CPU cores.

  • Core 0 of each NUMA node must be included in the reserved CPU set.

  • Isolated cores might be impacted by interrupts. The following annotations must be attached to the pod if guaranteed QoS pods require full use of the CPU:

    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
    irq-load-balancing.crio.io: "disable"
  • When per-pod power management is enabled with PerformanceProfile.workloadHints.perPodPowerManagement the following annotations must also be attached to the pod if guaranteed QoS pods require full use of the CPU:

    cpu-c-states.crio.io: "disable"
    cpu-freq-governor.crio.io: "performance"
Engineering considerations
  • The minimum reserved capacity (systemReserved) required can be found by following the guidance in "Which amount of CPU and memory are recommended to reserve for the system in OpenShift 4 nodes?"

  • The actual required reserved CPU capacity depends on the cluster configuration and workload attributes.

  • This reserved CPU value must be rounded up to a full core (2 hyper-thread) alignment.

  • Changes to the CPU partitioning will drain and reboot the nodes in the MCP.

  • The reserved CPUs reduce the pod density, as the reserved CPUs are removed from the allocatable capacity of the OpenShift node.

  • The real-time workload hint should be enabled if the workload is real-time capable.

  • Hardware without Interrupt Request (IRQ) affinity support will impact isolated CPUs. To ensure that pods with guaranteed CPU QoS have full use of allocated CPU, all hardware in the server must support IRQ affinity.

  • OVS dynamically manages its cpuset configuration to adapt to network traffic needs. You do not need to reserve additional CPUs for handling high network throughput on the primary CNI.

  • If workloads running on the cluster require cgroups v1, you can configure nodes to use cgroups v1 as part of the initial cluster deployment. For more information, see "Enabling Linux cgroup v1 during installation".

Service Mesh

Description

Telco core cloud-native functions (CNFs) typically require a service mesh implementation.

Specific service mesh features and performance requirements are dependent on the application. The selection of service mesh implementation and configuration is outside the scope of this documentation. You must account for the impact of service mesh on cluster resource usage and performance, including additional latency introduced in pod networking, in your implementation.

Additional resources

Networking

New in this release
  • Telco core validation is now extended with bonding, MACVLAN, IPVLAN and SR-IOV networking scenarios.

Description
  • The cluster is configured in dual-stack IP configuration (IPv4 and IPv6).

  • The validated physical network configuration consists of two dual-port NICs. One NIC is shared among the primary CNI (OVN-Kubernetes) and IPVLAN and MACVLAN traffic, the second NIC is dedicated to SR-IOV VF-based Pod traffic.

  • A Linux bonding interface (bond0) is created in an active-active LACP IEEE 802.3ad configuration with the two NIC ports attached.

    The top-of-rack networking equipment must support and be configured for multi-chassis link aggregation (mLAG) technology.

  • VLAN interfaces are created on top of bond0, including for the primary CNI.

  • Bond and VLAN interfaces are created at install time during network configuration. Apart from the VLAN (VLAN0) used by the primary CNI, the other VLANS can be created on Day 2 using the Kubernetes NMState Operator.

  • MACVLAN and IPVLAN interfaces are created with their corresponding CNIs. They do not share the same base interface.

  • SR-IOV VFs are managed by the SR-IOV Network Operator. The following diagram provides an overview of SR-IOV NIC sharing:

    Simplified SR-IOV NIC sharing configuration
    Figure 1. SR-IOV NIC sharing
Additional resources

Cluster Network Operator

New in this release
  • No reference design updates in this release

Description

The Cluster Network Operator (CNO) deploys and manages the cluster network components including the default OVN-Kubernetes network plugin during OKD cluster installation. It allows configuring primary interface MTU settings, OVN gateway modes to use node routing tables for pod egress, and additional secondary networks such as MACVLAN.

Limits and requirements
  • OVN-Kubernetes is required for IPv6 support.

  • Large MTU cluster support requires connected network equipment to be set to the same or larger value.

  • MACVLAN and IPVLAN cannot co-locate on the same main interface due to their reliance on the same underlying kernel mechanism, specifically the rx_handler. This handler allows a third-party module to process incoming packets before the host processes them, and only one such handler can be registered per network interface. Since both MACVLAN and IPVLAN need to register their own rx_handler to function, they conflict and cannot coexist on the same interface. See ipvlan/ipvlan_main.c#L82 and net/macvlan.c#L1260 for details.

  • Alternative NIC configurations include splitting the shared NIC into multiple NICs or using a single dual-port NIC.

    Splitting the shared NIC into multiple NICs or using a single dual-port NIC has not been validated with the telco core reference design.

  • Single-stack IP cluster not validated.

Engineering considerations
  • Pod egress traffic is handled by kernel routing table with the routingViaHost option. Appropriate static routes must be configured in the host.

Additional resources

Load balancer

New in this release
  • In OKD 4.17, frr-k8s is now the default and fully supported Border Gateway Protocol (BGP) backend. The deprecated frr BGP mode is still available. You should upgrade clusters to use the frr-k8s backend.

Description

MetalLB is a load-balancer implementation that uses standard routing protocols for bare-metal clusters. It enables a Kubernetes service to get an external IP address which is also added to the host network for the cluster.

Some use cases might require features not available in MetalLB, for example stateful load balancing. Where necessary, use an external third party load balancer. Selection and configuration of an external load balancer is outside the scope of this document. When you use an external third party load balancer, ensure that it meets all performance and resource utilization requirements.

Limits and requirements
  • Stateful load balancing is not supported by MetalLB. An alternate load balancer implementation must be used if this is a requirement for workload CNFs.

  • The networking infrastructure must ensure that the external IP address is routable from clients to the host network for the cluster.

Engineering considerations
  • MetalLB is used in BGP mode only for core use case models.

  • For core use models, MetalLB is supported with only the OVN-Kubernetes network provider used in local gateway mode. See routingViaHost in the "Cluster Network Operator" section.

  • BGP configuration in MetalLB varies depending on the requirements of the network and peers.

  • Address pools can be configured as needed, allowing variation in addresses, aggregation length, auto assignment, and other relevant parameters.

  • MetalLB uses BGP for announcing routes only. Only the transmitInterval and minimumTtl parameters are relevant in this mode. Other parameters in the BFD profile should remain close to the default settings. Shorter values might lead to errors and impact performance.

SR-IOV

New in this release
  • No reference design updates in this release

Description

SR-IOV enables physical network interfaces (PFs) to be divided into multiple virtual functions (VFs). VFs can then be assigned to multiple pods to achieve higher throughput performance while keeping the pods isolated. The SR-IOV Network Operator provisions and manages SR-IOV CNI, network device plugin, and other components of the SR-IOV stack.

Limits and requirements
  • Supported network interface controllers are listed in "Supported devices".

  • The SR-IOV Network Operator automatically enables IOMMU on the kernel command line.

  • SR-IOV VFs do not receive link state updates from PF. If link down detection is needed, it must be done at the protocol level.

  • MultiNetworkPolicy CRs can be applied to netdevice networks only. This is because the implementation uses the iptables tool, which cannot manage vfio interfaces.

Engineering considerations
  • SR-IOV interfaces in vfio mode are typically used to enable additional secondary networks for applications that require high throughput or low latency.

  • If you exclude the SriovOperatorConfig CR from your deployment, the CR will not be created automatically.

  • NICs that do not support firmware updates under secure boot or kernel lock-down must be pre-configured with enough VFs enabled to support the number of VFs needed by the application workload.

    The SR-IOV Network Operator plugin for these NICs might need to be disabled using the undocumented disablePlugins option.

NMState Operator

New in this release
  • No reference design updates in this release

Description

The NMState Operator provides a Kubernetes API for performing network configurations across cluster nodes.

Limits and requirements

Not applicable

Engineering considerations
  • The initial networking configuration is applied using NMStateConfig content in the installation CRs. The NMState Operator is used only when needed for network updates.

  • When SR-IOV virtual functions are used for host networking, the NMState Operator using NodeNetworkConfigurationPolicy is used to configure those VF interfaces, for example, VLANs and the MTU.

Logging

New in this release
  • Cluster Logging Operator 6.0 is new in this release. Update your existing implementation to adapt to the new version of the API.

Description

The Cluster Logging Operator enables collection and shipping of logs off the node for remote archival and analysis. The reference configuration ships audit and infrastructure logs to a remote archive by using Kafka.

Limits and requirements

Not applicable

Engineering considerations
  • The impact of cluster CPU use is based on the number or size of logs generated and the amount of log filtering configured.

  • The reference configuration does not include shipping of application logs. Inclusion of application logs in the configuration requires evaluation of the application logging rate and sufficient additional CPU resources allocated to the reserved set.

Additional resources

Power Management

New in this release
  • No reference design updates in this release

Description

Use the Performance Profile to configure clusters with high power mode, low power mode, or mixed mode. The choice of power mode depends on the characteristics of the workloads running on the cluster, particularly how sensitive they are to latency.

Limits and requirements
  • Power configuration relies on appropriate BIOS configuration, for example, enabling C-states and P-states. Configuration varies between hardware vendors.

Engineering considerations
  • Latency: To ensure that latency-sensitive workloads meet their requirements, you will need either a high-power configuration or a per-pod power management configuration. Per-pod power management is only available for Guaranteed QoS Pods with dedicated pinned CPUs.

Storage

Cloud native storage services can be provided by multiple solutions including OpenShift Data Foundation from Red Hat or third parties.

OpenShift Data Foundation

New in this release
  • No reference design updates in this release

Description

Red Hat OpenShift Data Foundation is a software-defined storage service for containers. For Telco core clusters, storage support is provided by OpenShift Data Foundation storage services running externally to the application workload cluster.

Limits and requirements
Engineering considerations
  • OpenShift Data Foundation network traffic should be isolated from other traffic on a dedicated network, for example, by using VLAN isolation.

  • Other storage solutions can be used to provide persistent storage for core clusters.

    The configuration and integration of these solutions is outside the scope of the telco core RDS. Integration of the storage solution into the core cluster must include correct sizing and performance analysis to ensure the storage meets overall performance and resource utilization requirements.

Additional resources

Telco core deployment components

The following sections describe the various OKD components and configurations that you use to configure the hub cluster with Red Hat Advanced Cluster Management (RHACM).

Red Hat Advanced Cluster Management

New in this release
  • No reference design updates in this release

Description

Red Hat Advanced Cluster Management (RHACM) provides Multi Cluster Engine (MCE) installation and ongoing lifecycle management functionality for deployed clusters. You manage cluster configuration and upgrades declaratively by applying Policy custom resources (CRs) to clusters during maintenance windows.

You apply policies with the RHACM policy controller as managed by Topology Aware Lifecycle Manager (TALM).

When installing managed clusters, RHACM applies labels and initial ignition configuration to individual nodes in support of custom disk partitioning, allocation of roles, and allocation to machine config pools. You define these configurations with SiteConfig or ClusterInstance CRs.

Limits and requirements
Engineering considerations
  • Use RHACM policy hub-side templating to better scale cluster configuration. You can significantly reduce the number of policies by using a single group policy or small number of general group policies where the group and per-cluster values are substituted into templates.

  • Cluster specific configuration: managed clusters typically have some number of configuration values that are specific to the individual cluster. These configurations should be managed using RHACM policy hub-side templating with values pulled from ConfigMap CRs based on the cluster name.

Topology Aware Lifecycle Manager

New in this release
  • No reference design updates in this release

Description

Topology Aware Lifecycle Manager (TALM) is an Operator that runs only on the hub cluster for managing how changes including cluster and Operator upgrades, configuration, and so on are rolled out to the network.

Limits and requirements
  • TALM supports concurrent cluster deployment in batches of 400.

  • Precaching and backup features are for single-node OpenShift clusters only.

Engineering considerations
  • Only policies that have the ran.openshift.io/ztp-deploy-wave annotation are automatically applied by TALM during initial cluster installation.

  • You can create further ClusterGroupUpgrade CRs to control the policies that TALM remediates.

GitOps and GitOps ZTP plugins

New in this release
  • No reference design updates in this release

Description

GitOps and GitOps ZTP plugins provide a GitOps-based infrastructure for managing cluster deployment and configuration. Cluster definitions and configurations are maintained as a declarative state in Git. You can apply ClusterInstance CRs to the hub cluster where the SiteConfig Operator renders them as installation CRs. Alternatively, you can use the GitOps ZTP plugin to generate installation CRs directly from SiteConfig CRs. The GitOps ZTP plugin supports automatic wrapping of configuration CRs in policies based on PolicyGenTemplate CRs.

You can deploy and manage multiple versions of OKD on managed clusters using the baseline reference configuration CRs. You can use custom CRs alongside the baseline CRs.

To maintain multiple per-version policies simultaneously, use Git to manage the versions of the source CRs and policy CRs (PolicyGenTemplate or PolicyGenerator).

Keep reference CRs and custom CRs under different directories. Doing this allows you to patch and update the reference CRs by simple replacement of all directory contents without touching the custom CRs.

Limits
  • 300 SiteConfig CRs per ArgoCD application. You can use multiple applications to achieve the maximum number of clusters supported by a single hub cluster.

  • Content in the /source-crs folder in Git overrides content provided in the GitOps ZTP plugin container. Git takes precedence in the search path.

  • Add the /source-crs folder in the same directory as the kustomization.yaml file, which includes the PolicyGenTemplate as a generator.

    Alternative locations for the /source-crs directory are not supported in this context.

  • The extraManifestPath field of the SiteConfig CR is deprecated from OKD 4.15 and later. Use the new extraManifests.searchPaths field instead.

Engineering considerations
  • For multi-node cluster upgrades, you can pause MachineConfigPool (MCP) CRs during maintenance windows by setting the paused field to true. You can increase the number of nodes per MCP updated simultaneously by configuring the maxUnavailable setting in the MCP CR. The MaxUnavailable field defines the percentage of nodes in the pool that can be simultaneously unavailable during a MachineConfig update. Set maxUnavailable to the maximum tolerable value. This reduces the number of reboots in a cluster during upgrades which results in shorter upgrade times. When you finally unpause the MCP CR, all the changed configurations are applied with a single reboot.

  • During cluster installation, you can pause custom MCP CRs by setting the paused field to true and setting maxUnavailable to 100% to improve installation times.

  • To avoid confusion or unintentional overwriting of files when updating content, use unique and distinguishable names for user-provided CRs in the /source-crs folder and extra manifests in Git.

  • The SiteConfig CR allows multiple extra-manifest paths. When files with the same name are found in multiple directory paths, the last file found takes precedence. This allows you to put the full set of version-specific Day 0 manifests (extra-manifests) in Git and reference them from the SiteConfig CR. With this feature, you can deploy multiple OKD versions to managed clusters simultaneously.

Agent-based installer

New in this release
  • No reference design updates in this release

Description

You can install telco core clusters with the Agent-based installer (ABI) on bare-metal servers without requiring additional servers or virtual machines for managing the installation. ABI supports installations in disconnected environments. With ABI, you install clusters by using declarative custom resources (CRs).

Agent-based installer is an optional component. The recommended installation method is by using Red Hat Advanced Cluster Management or multicluster engine for Kubernetes Operator.

Limits and requirements
  • You need to have a disconnected mirror registry with all required content mirrored to do Agent-based installs in a disconnected environment.

Engineering considerations
  • Networking configuration should be applied as NMState custom resources (CRs) during cluster installation.

Monitoring

New in this release
  • No reference design updates in this release

Description

The Cluster Monitoring Operator (CMO) is included by default in OKD and provides monitoring (metrics, dashboards, and alerting) for the platform components and optionally user projects as well.

The default handling of pod CPU and memory metrics is based on upstream Kubernetes cAdvisor and makes a tradeoff that prefers handling of stale data over metric accuracy. This leads to spiky data that will create false triggers of alerts over user-specified thresholds. OpenShift supports an opt-in dedicated service monitor feature creating an additional set of pod CPU and memory metrics that do not suffer from the spiky behavior. For additional information, see Dedicated Service Monitors - Questions and Answers.

Limits and requirements
  • Monitoring configuration must enable the dedicated service monitor feature for accurate representation of pod metrics

Engineering considerations
  • You configure the Prometheus retention period. The value used is a tradeoff between operational requirements for maintaining historical data on the cluster against CPU and storage resources. Longer retention periods increase the need for storage and require additional CPU to manage the indexing of data.

Additional resources

Scheduling

New in this release
  • No reference design updates in this release

Description
  • The scheduler is a cluster-wide component responsible for selecting the right node for a given workload. It is a core part of the platform and does not require any specific configuration in the common deployment scenarios. However, there are few specific use cases described in the following section. NUMA-aware scheduling can be enabled through the NUMA Resources Operator. For more information, see "Scheduling NUMA-aware workloads".

Limits and requirements
  • The default scheduler does not understand the NUMA locality of workloads. It only knows about the sum of all free resources on a worker node. This might cause workloads to be rejected when scheduled to a node with the topology manager policy set to single-numa-node or restricted.

    • For example, consider a pod requesting 6 CPUs and being scheduled to an empty node that has 4 CPUs per NUMA node. The total allocatable capacity of the node is 8 CPUs and the scheduler will place the pod there. The node local admission will fail, however, as there are only 4 CPUs available in each of the NUMA nodes.

    • All clusters with multi-NUMA nodes are required to use the NUMA Resources Operator. Use the machineConfigPoolSelector field in the KubeletConfig CR to select all nodes where NUMA aligned scheduling is needed.

  • All machine config pools must have consistent hardware configuration for example all nodes are expected to have the same NUMA zone count.

Engineering considerations
  • Pods might require annotations for correct scheduling and isolation. For more information on annotations, see "CPU partitioning and performance tuning".

  • You can configure SR-IOV virtual function NUMA affinity to be ignored during scheduling by using the excludeTopology field in SriovNetworkNodePolicy CR.

Node configuration

New in this release
  • Container mount namespace encapsulation and kdump are now available in the telco core RDS.

Description
  • Container mount namespace encapsulation creates a container mount namespace that reduces system mount scanning and is visible to kubelet and CRI-O.

  • kdump is an optional configuration that is enabled by default that captures debug information when a kernel panic occurs. The reference CRs which enable kdump include an increased memory reservation based on the set of drivers and kernel modules included in the reference configuration.

Limits and requirements
  • Use of kdump and container mount namespace encapsulation is made available through additional kernel modules. You should analyze these modules to determine impact on CPU load, system performance, and ability to meet required KPIs.

Engineering considerations
  • Install the following kernel modules with MachineConfig CRs. These modules provide extended kernel functionality to cloud-native functions (CNFs).

    • sctp

    • ip_gre

    • ip6_tables

    • ip6t_REJECT

    • ip6table_filter

    • ip6table_mangle

    • iptable_filter

    • iptable_mangle

    • iptable_nat

    • xt_multiport

    • xt_owner

    • xt_REDIRECT

    • xt_statistic

    • xt_TCPMSS

Host firmware and boot loader configuration

New in this release
  • Secure boot is now recommended for cluster hosts configured with the telco core reference design.

Engineering considerations
  • Enabling secure boot is the recommended configuration.

    When secure boot is enabled, only signed kernel modules are loaded by the kernel. Out-of-tree drivers are not supported.

Disconnected environment

New in this release
  • No reference design updates in this release

Description

Telco core clusters are expected to be installed in networks without direct access to the internet. All container images needed to install, configure, and operator the cluster must be available in a disconnected registry. This includes OKD images, Day 2 Operator Lifecycle Manager (OLM) Operator images, and application workload images.

Limits and requirements
  • A unique name is required for all custom CatalogSources. Do not reuse the default catalog names.

  • A valid time source must be configured as part of cluster installation.

Security

New in this release
  • Secure boot host firmware setting is now recommended for telco core clusters. For more information, see "Host firmware and boot loader configuration".

Description

You should harden clusters against multiple attack vectors. In OKD, there is no single component or feature responsible for securing a cluster. Use the following security-oriented features and configurations to secure your clusters:

  • SecurityContextConstraints (SCC): All workload pods should be run with restricted-v2 or restricted SCC.

  • Seccomp: All pods should be run with the RuntimeDefault (or stronger) seccomp profile.

  • Rootless DPDK pods: Many user-plane networking (DPDK) CNFs require pods to run with root privileges. With this feature, a conformant DPDK pod can be run without requiring root privileges. Rootless DPDK pods create a tap device in a rootless pod that injects traffic from a DPDK application to the kernel.

  • Storage: The storage network should be isolated and non-routable to other cluster networks. See the "Storage" section for additional details.

Limits and requirements
  • Rootless DPDK pods requires the following additional configuration steps:

    • Configure the TAP plugin with the container_t SELinux context.

    • Enable the container_use_devices SELinux boolean on the hosts.

Engineering considerations
  • For rootless DPDK pod support, the SELinux boolean container_use_devices must be enabled on the host for the TAP device to be created. This introduces a security risk that is acceptable for short to mid-term use. Other solutions will be explored.

Scalability

New in this release
  • No reference design updates in this release

Limits and requirements
  • Cluster should scale to at least 120 nodes.