The Poison Pill Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator uses the MachineHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the MachineHealthCheck resource creates the PoisonPillRemediation custom resource (CR), which triggers the Poison Pill Operator.
The Poison Pill Operator minimizes downtime for stateful applications and restores compute capacity if transient failures occur. You can use this Operator regardless of the management interface, such as IPMI or an API to provision a node, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure.
The Poison Pill Operator creates the PoisonPillConfig CR with the name poison-pill-config in the Poison Pill Operator’s namespace. You can edit this CR. However, you cannot create a new CR for the Poison Pill Operator.
A change in the PoisonPillConfig CR re-creates the Poison Pill daemon set.
The PoisonPillConfig CR resembles the following YAML file:
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillConfig
metadata:
  name: poison-pill-config
  namespace: openshift-operators
spec:
  safeTimeToAssumeNodeRebootedSeconds: 180 # (1)
  watchdogFilePath: /test/watchdog1 # (2)
  isSoftwareRebootEnabled: true # (3)
  apiServerTimeout: 15s # (4)
  apiCheckInterval: 5s # (5)
  maxApiErrorThreshold: 3 # (6)
  peerApiServerTimeout: 5s # (7)
  peerDialTimeout: 5s # (8)
  peerRequestTimeout: 5s # (9)
  peerUpdateInterval: 15m # (10)
(1) Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
(2) Specify the file path of the watchdog device in the nodes. If a watchdog device is unavailable, the PoisonPillConfig CR uses a software reboot.
(3) Specify whether to enable software reboot of the unhealthy nodes. By default, the value of isSoftwareRebootEnabled is set to true. To disable the software reboot, set the parameter value to false.
(4) Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation.
(5) Specify the frequency to check connectivity with each API server.
(6) Specify a threshold value. After reaching this threshold, the node starts contacting its peers.
(7) Specify the timeout duration to connect with the peer API server.
(8) Specify the timeout duration for establishing a connection with the peer.
(9) Specify the timeout duration to get a response from the peer.
(10) Specify the frequency to update peer information, such as IP address.
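The role of maxApiErrorThreshold can be illustrated with a small simulation. The following Python sketch is not the Operator's implementation. The function name checks_until_peer_contact is hypothetical, and the sketch assumes that failed API server checks are counted consecutively and that a successful check resets the count, which the description above does not confirm.

```python
def checks_until_peer_contact(results, max_api_error_threshold=3):
    """Return the 1-based index of the check at which the node would start
    contacting its peers, or None if the threshold is never reached.

    results: iterable of booleans, one per API server check
             (True = check succeeded, False = check failed or timed out).
    Assumption: failures are counted consecutively and reset on success.
    """
    errors = 0
    for index, ok in enumerate(results, start=1):
        if ok:
            errors = 0  # a healthy check clears the error count (assumed)
        else:
            errors += 1
            if errors >= max_api_error_threshold:
                return index
    return None


# Three failures in a row reach the sample threshold of 3.
print(checks_until_peer_contact([False, False, False]))        # 3
# A success in between resets the count, so the threshold is not reached.
print(checks_until_peer_contact([False, True, False, False]))  # None
```

In this model, each entry in results corresponds to one check performed every apiCheckInterval.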
Watchdog devices can be any of the following:
- Independently powered hardware devices
- Hardware devices that share power with the hosts they control
- Virtual devices implemented in software, or softdog
Hardware watchdog and softdog devices have electronic or software timers, respectively. These watchdog devices are used to ensure that the machine enters a safe state when an error condition is detected. The cluster is required to repeatedly reset the watchdog timer to prove that it is in a healthy state. This timer might elapse due to fault conditions, such as deadlocks, CPU starvation, and loss of network or disk access. If the timer expires, the watchdog device assumes that a fault has occurred and the device triggers a forced reset of the node.
Hardware watchdog devices are more reliable than softdog devices.
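The reset-or-expire behavior of a watchdog timer can be modeled in a few lines. The following Python sketch is a toy model only: the class name SimulatedWatchdog and the 60-second timeout are assumptions chosen for illustration, and the model takes explicit clock values instead of talking to a real watchdog device.

```python
class SimulatedWatchdog:
    """Toy model of a watchdog timer driven by explicit clock values."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_reset = 0.0

    def reset(self, now):
        """Called by a healthy node to prove that it is still alive."""
        self.last_reset = now

    def expired(self, now):
        """True once the timer has run out without a reset.

        On a real device, this is the point at which a forced reset
        of the node would be triggered.
        """
        return now - self.last_reset >= self.timeout_seconds


wd = SimulatedWatchdog(timeout_seconds=60)  # hypothetical timeout
wd.reset(now=30)            # the node resets the timer at t=30
print(wd.expired(now=80))   # False: only 50 seconds since the last reset
print(wd.expired(now=90))   # True: 60 seconds elapsed without a reset
```

A node that stalls because of a deadlock or CPU starvation stops calling reset, so the timer eventually expires regardless of what the node itself is doing.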
The Poison Pill Operator determines the remediation strategy based on the watchdog devices that are present.
If a hardware watchdog device is configured and available, the Operator uses it for remediation. If a hardware watchdog device is not configured, the Operator enables and uses a softdog device for remediation.
If neither type of watchdog device is supported, either by the system or by the configuration, the Operator remediates nodes by using software reboot.
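The order of preference described above can be summarized as a simple decision function. This Python sketch is an illustration, not the Operator's code: the function name choose_remediation and its two boolean inputs are hypothetical, and the sketch does not model the isSoftwareRebootEnabled setting.

```python
def choose_remediation(hardware_watchdog_available, softdog_supported):
    """Pick a remediation strategy in the order of preference described above."""
    if hardware_watchdog_available:
        return "hardware watchdog"
    if softdog_supported:
        return "softdog"
    # Neither type of watchdog device is usable: fall back to software reboot.
    return "software reboot"


print(choose_remediation(True, True))    # hardware watchdog
print(choose_remediation(False, True))   # softdog
print(choose_remediation(False, False))  # software reboot
```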