Using remote worker node at the network edge - Remote worker nodes on the network edge | Nodes

Adding remote worker nodes
Network separation with remote worker nodes
Power loss on remote worker nodes
Latency spikes or temporary reduction in throughput to remote workers
Remote worker node strategies

You can configure OpenShift Container Platform clusters with nodes located at your network edge. In this topic, they are called remote worker nodes. A typical cluster with remote worker nodes combines on-premise master and worker nodes with worker nodes in other locations that connect to the cluster. This topic is intended to provide guidance on best practices for using remote worker nodes and does not contain specific configuration details.

There are multiple use cases across different industries, such as telecommunications, retail, manufacturing, and government, for using a deployment pattern with remote worker nodes. For example, you can separate and isolate your projects and workloads by combining the remote worker nodes into Kubernetes zones.

However, having remote worker nodes can introduce higher latency, intermittent loss of network connectivity, and other issues. Among the challenges in a cluster with remote worker node are:

Network separation: The OpenShift Container Platform control plane and the remote worker nodes must be able communicate with each other. Because of the distance between the control plane and the remote worker nodes, network issues could prevent this communication. See Network separation with remote worker nodes for information on how OpenShift Container Platform responds to network separation and for methods to diminish the impact to your cluster.
Power outage: Because the control plane and remote worker nodes are in separate locations, a power outage at the remote location or at any point between the two can negatively impact your cluster. See Power loss on remote worker nodes for information on how OpenShift Container Platform responds to a node losing power and for methods to diminish the impact to your cluster.
Latency spikes or temporary reduction in throughput: As with any network, any changes in network conditions between your cluster and the remote worker nodes can negatively impact your cluster. OpenShift Container Platform offers multiple worker latency profiles that let you control the reaction of the cluster to latency issues.

Note the following limitations when planning a cluster with remote worker nodes:

OpenShift Container Platform does not support remote worker nodes that use a different cloud provider than the on-premise cluster uses.
Moving workloads from one Kubernetes zone to a different Kubernetes zone can be problematic due to system and environment issues, such as a specific type of memory not being available in a different zone.
Proxies and firewalls can present additional limitations that are beyond the scope of this document. See the relevant OpenShift Container Platform documentation for how to address such limitations, such as Configuring your firewall.
You are responsible for configuring and maintaining L2/L3-level network connectivity between the control plane and the network-edge nodes.

Adding remote worker nodes

Adding remote worker nodes to a cluster involves some additional considerations.

You must ensure that a route or a default gateway is in place to route traffic between the control plane and every remote worker node.
You must place the Ingress VIP on the control plane.
Adding remote worker nodes with user-provisioned infrastructure is identical to adding other worker nodes.
To add remote worker nodes to an installer-provisioned cluster at install time, specify the subnet for each worker node in the install-config.yaml file before installation. There are no additional settings required for the DHCP server. You must use virtual media, because the remote worker nodes will not have access to the local provisioning network.
To add remote worker nodes to an installer-provisioned cluster deployed with a provisioning network, ensure that virtualMediaViaExternalNetwork flag is set to true in the install-config.yaml file so that it will add the nodes using virtual media. Remote worker nodes will not have access to the local provisioning network. They must be deployed with virtual media rather than PXE. Additionally, specify each subnet for each group of remote worker nodes and the control plane nodes in the DHCP server.

Additional resources

Network separation with remote worker nodes

All nodes send heartbeats to the Kubernetes Controller Manager Operator (kube controller) in the OpenShift Container Platform cluster every 10 seconds. If the cluster does not receive heartbeats from a node, OpenShift Container Platform responds using several default mechanisms.

OpenShift Container Platform is designed to be resilient to network partitions and other disruptions. You can mitigate some of the more common disruptions, such as interruptions from software upgrades, network splits, and routing issues. Mitigation strategies include ensuring that pods on remote worker nodes request the correct amount of CPU and memory resources, configuring an appropriate replication policy, using redundancy across zones, and using Pod Disruption Budgets on workloads.

If the kube controller loses contact with a node after a configured period, the node controller on the control plane updates the node health to Unhealthy and marks the node Ready condition as Unknown. In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to the node and schedules pods on the node for eviction after five minutes, by default.

If a workload controller, such as a Deployment object or StatefulSet object, is directing traffic to pods on the unhealthy node and other nodes can reach the cluster, OpenShift Container Platform routes the traffic away from the pods on the node. Nodes that cannot reach the cluster do not get updated with the new traffic routing. As a result, the workloads on those nodes might continue to attempt to reach the unhealthy node.

You can mitigate the effects of connection loss by:

using daemon sets to create pods that tolerate the taints
using static pods that automatically restart if a node goes down
using Kubernetes zones to control pod eviction
configuring pod tolerations to delay or avoid pod eviction
configuring the kubelet to control the timing of when it marks nodes as unhealthy.

For more information on using these objects in a cluster with remote worker nodes, see About remote worker node strategies.

Power loss on remote worker nodes

If a remote worker node loses power or restarts ungracefully, OpenShift Container Platform responds using several default mechanisms.

If the Kubernetes Controller Manager Operator (kube controller) loses contact with a node after a configured period, the control plane updates the node health to Unhealthy and marks the node Ready condition as Unknown. In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to the node and schedules pods on the node for eviction after five minutes, by default.

On the node, the pods must be restarted when the node recovers power and reconnects with the control plane.

If you want the pods to restart immediately upon restart, use static pods.

After the node restarts, the kubelet also restarts and attempts to restart the pods that were scheduled on the node. If the connection to the control plane takes longer than the default five minutes, the control plane cannot update the node health and remove the node.kubernetes.io/unreachable taint. On the node, the kubelet terminates any running pods. When these conditions are cleared, the scheduler can start scheduling pods to that node.

You can mitigate the effects of power loss by:

using daemon sets to create pods that tolerate the taints
using static pods that automatically restart with a node
configuring pods tolerations to delay or avoid pod eviction
configuring the kubelet to control the timing of when the node controller marks nodes as unhealthy.

For more information on using these objects in a cluster with remote worker nodes, see About remote worker node strategies.

Latency spikes or temporary reduction in throughput to remote workers

If the cluster administrator has performed latency tests for platform verification, they can discover the need to adjust the operation of the cluster to ensure stability in cases of high latency. The cluster administrator needs to change only one parameter, recorded in a file, which controls four parameters affecting how supervisory processes read status and interpret the health of the cluster. Changing only the one parameter provides cluster tuning in an easy, supportable manner.

The Kubelet process provides the starting point for monitoring cluster health. The Kubelet sets status values for all nodes in the OpenShift Container Platform cluster. The Kubernetes Controller Manager (kube controller) reads the status values every 10 seconds, by default. If the kube controller cannot read a node status value, it loses contact with that node after a configured period. The default behavior is:

The node controller on the control plane updates the node health to Unhealthy and marks the node Ready condition`Unknown`.
In response, the scheduler stops scheduling pods to that node.
The Node Lifecycle Controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to the node and schedules any pods on the node for eviction after five minutes, by default.

This behavior can cause problems if your network is prone to latency issues, especially if you have nodes at the network edge. In some cases, the Kubernetes Controller Manager might not receive an update from a healthy node due to network latency. The Kubelet evicts pods from the node even though the node is healthy.

To avoid this problem, you can use worker latency profiles to adjust the frequency that the Kubelet and the Kubernetes Controller Manager wait for status updates before taking action. These adjustments help to ensure that your cluster runs properly if network latency between the control plane and the worker nodes is not optimal.

These worker latency profiles contain three sets of parameters that are predefined with carefully tuned values to control the reaction of the cluster to increased latency. There is no need to experimentally find the best values manually.

You can configure worker latency profiles when installing a cluster or at any time you notice increased latency in your cluster network.

Additional resources

Improving cluster stability in high latency environments using worker latency profiles

Remote worker node strategies

If you use remote worker nodes, consider which objects to use to run your applications.

It is recommended to use daemon sets or static pods based on the behavior you want in the event of network issues or power loss. In addition, you can use Kubernetes zones and tolerations to control or avoid pod evictions if the control plane cannot reach remote worker nodes.

Daemon sets: Daemon sets are the best approach to managing pods on remote worker nodes for the following reasons:

Daemon sets do not typically need rescheduling behavior. If a node disconnects from the cluster, pods on the node can continue to run. OpenShift Container Platform does not change the state of daemon set pods, and leaves the pods in the state they last reported. For example, if a daemon set pod is in the Running state, when a node stops communicating, the pod keeps running and is assumed to be running by OpenShift Container Platform.

Daemon set pods, by default, are created with NoExecute tolerations for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints with no tolerationSeconds value. These default values ensure that daemon set pods are never evicted if the control plane cannot reach a node. For example:

Tolerations added to daemon set pods by default

  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
    - key: node.kubernetes.io/disk-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/memory-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/pid-pressure
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/unschedulable
      operator: Exists
      effect: NoSchedule

Daemon sets can use labels to ensure that a workload runs on a matching worker node.
You can use an OpenShift Container Platform service endpoint to load balance daemon set pods.

Daemon sets do not schedule pods after a reboot of the node if OpenShift Container Platform cannot reach the node.

Static pods: If you want pods restart if a node reboots, after a power loss for example, consider static pods. The kubelet on a node automatically restarts static pods as node restarts.

Static pods cannot use secrets and config maps.

Kubernetes zones: Kubernetes zones can slow down the rate or, in some cases, completely stop pod evictions.

When the control plane cannot reach a node, the node controller, by default, applies node.kubernetes.io/unreachable taints and evicts pods at a rate of 0.1 nodes per second. However, in a cluster that uses Kubernetes zones, pod eviction behavior is altered.

If a zone is fully disrupted, where all nodes in the zone have a Ready condition that is False or Unknown, the control plane does not apply the node.kubernetes.io/unreachable taint to the nodes in that zone.

For partially disrupted zones, where more than 55% of the nodes have a False or Unknown condition, the pod eviction rate is reduced to 0.01 nodes per second. Nodes in smaller clusters, with fewer than 50 nodes, are not tainted. Your cluster must have more than three zones for these behavior to take effect.

You assign a node to a specific zone by applying the topology.kubernetes.io/region label in the node specification.

Sample node labels for Kubernetes zones

kind: Node
apiVersion: v1
metadata:
  labels:
    topology.kubernetes.io/region=east

KubeletConfig objects

You can adjust the amount of time that the kubelet checks the state of each node.

To set the interval that affects the timing of when the on-premise node controller marks nodes with the Unhealthy or Unreachable condition, create a KubeletConfig object that contains the node-status-update-frequency and node-status-report-frequency parameters.

The kubelet on each node determines the node status as defined by the node-status-update-frequency setting and reports that status to the cluster based on the node-status-report-frequency setting. By default, the kubelet determines the pod status every 10 seconds and reports the status every minute. However, if the node state changes, the kubelet reports the change to the cluster immediately. OpenShift Container Platform uses the node-status-report-frequency setting only when the Node Lease feature gate is enabled, which is the default state in OpenShift Container Platform clusters. If the Node Lease feature gate is disabled, the node reports its status based on the node-status-update-frequency setting.

Example kubelet config

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: disable-cpu-units
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker (1)
  kubeletConfig:
    node-status-update-frequency: (2)
      - "10s"
    node-status-report-frequency: (3)
      - "1m"

1	Specify the type of node type to which this `KubeletConfig` object applies using the label from the `MachineConfig` object.
2	Specify the frequency that the kubelet checks the status of a node associated with this `MachineConfig` object. The default value is `10s`. If you change this default, the `node-status-report-frequency` value is changed to the same value.
3	Specify the frequency that the kubelet reports the status of a node associated with this `MachineConfig` object. The default value is `1m`.

The node-status-update-frequency parameter works with the node-monitor-grace-period parameter.

The node-monitor-grace-period parameter specifies how long OpenShift Container Platform waits after a node associated with a MachineConfig object is marked Unhealthy if the controller manager does not receive the node heartbeat. Workloads on the node continue to run after this time. If the remote worker node rejoins the cluster after node-monitor-grace-period expires, pods continue to run. New pods can be scheduled to that node. The node-monitor-grace-period interval is 40s. The node-status-update-frequency value must be lower than the node-monitor-grace-period value.

Modifying the node-monitor-grace-period parameter is not supported.

Tolerations: You can use pod tolerations to mitigate the effects if the on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to a node it cannot reach.

A taint with the NoExecute effect affects pods that are running on the node in the following ways:

Pods that do not tolerate the taint are queued for eviction.
Pods that tolerate the taint without specifying a tolerationSeconds value in their toleration specification remain bound forever.
Pods that tolerate the taint with a specified tolerationSeconds value remain bound for the specified amount of time. After the time elapses, the pods are queued for eviction.

Unless tolerations are explicitly set, Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, meaning that pods remain bound for 5 minutes if either of these taints is detected.

You can delay or avoid pod eviction by configuring pods tolerations with the NoExecute effect for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints.

Example toleration in a pod spec

...
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute" (1)
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute" (2)
  tolerationSeconds: 600 (3)
...

1	The `NoExecute` effect without `tolerationSeconds` lets pods remain forever if the control plane cannot reach the node.
2	The `NoExecute` effect with `tolerationSeconds`: 600 lets pods remain for 10 minutes if the control plane marks the node as `Unhealthy`.
3	You can specify your own `tolerationSeconds` value.

Other types of OpenShift Container Platform objects: You can use replica sets, deployments, and replication controllers. The scheduler can reschedule these pods onto other nodes after the node is disconnected for five minutes. Rescheduling onto other nodes can be beneficial for some workloads, such as REST APIs, where an administrator can guarantee a specific number of pods are running and accessible.

When working with remote worker nodes, rescheduling pods on different nodes might not be acceptable if remote worker nodes are intended to be reserved for specific functions.

stateful sets do not get restarted when there is an outage. The pods remain in the terminating state until the control plane can acknowledge that the pods are terminated.

To avoid scheduling a to a node that does not have access to the same type of persistent storage, OpenShift Container Platform cannot migrate pods that require persistent volumes to other zones in the case of network separation.

Additional resources

For more information on Daemonesets, see DaemonSets.
For more information on taints and tolerations, see Controlling pod placement using node taints.
For more information on configuring KubeletConfig objects, see Creating a KubeletConfig CRD.
For more information on replica sets, see ReplicaSets.
For more information on deployments, see Deployments.
For more information on replication controllers, see Replication controllers.
For more information on the controller manager, see Kubernetes Controller Manager Operator.

Using remote worker nodes at the network edge

Adding remote worker nodes

Network separation with remote worker nodes

Power loss on remote worker nodes

Latency spikes or temporary reduction in throughput to remote workers

Remote worker node strategies