This topic describes setting up high availability for pods and services on your OKD cluster.
IP failover manages a pool of Virtual IP (VIP) addresses on a set of nodes. Every VIP in the set will be serviced by a node selected from the set. As long as a single node is available, the VIPs will be served. There is no way to explicitly distribute the VIPs over the nodes, so there may be nodes with no VIPs and other nodes with many VIPs. If there is only one node, all VIPs will be on it.
The VIPs must be routable from outside the cluster.
IP failover monitors a port on each VIP to determine whether the port is reachable on the node. If the port is not reachable, the VIP will not be assigned to the node. If the port is set to 0, this check is suppressed.
The check script does the needed testing.
IP failover uses Keepalived to host a set of externally accessible VIP addresses on a set of hosts. Each VIP is only serviced by a single host at a time. Keepalived uses the VRRP protocol to determine which host (from the set of hosts) will service which VIP. If a host becomes unavailable or if the service that Keepalived is watching does not respond, the VIP is switched to another host from the set. Thus, a VIP is always serviced as long as a host is available.
When a host running Keepalived passes the check script, the host can enter the MASTER state based on its priority and the priority of the current MASTER, as determined by the preemption strategy.
The administrator can provide a script via the --notify-script= option, which is called whenever the state changes. Keepalived is in MASTER state when it is servicing the VIP, in BACKUP state when another node is servicing the VIP, or in FAULT state when the check script fails. The notify script is called with the new state whenever the state changes.
OKD supports the creation of IP failover deployment configurations by running the oc adm ipfailover command. The IP failover deployment configuration specifies the set of VIP addresses and the set of nodes on which to service them. A cluster can have multiple IP failover deployment configurations, with each managing its own set of unique VIP addresses. Each node in the IP failover configuration runs an IP failover pod, and this pod runs Keepalived.
When using VIPs to access a pod with host networking (e.g. a router), the application pod should be running on all nodes that are running the ipfailover pods. This enables any of the ipfailover nodes to become the master and service the VIPs when needed. If application pods are not running on all nodes with ipfailover, either some ipfailover nodes will never service the VIPs or some application pods will never receive any traffic. Use the same selector and replication count, for both ipfailover and the application pods, to avoid this mismatch.
While using VIPs to access a service, any of the nodes can be in the ipfailover set of nodes, since the service is reachable on all nodes (no matter where the application pod is running). Any of the ipfailover nodes can become master at any time. The service can either use external IPs and a service port or it can use a nodePort.
When using external IPs in the service definition, the VIPs are set to the external IPs and the ipfailover monitoring port is set to the service port. A nodePort is open on every node in the cluster and the service will load balance traffic from whatever node currently supports the VIP. In this case, the ipfailover monitoring port is set to the nodePort in the service definition.
Setting up a nodePort is a privileged operation.
Even though a service VIP is highly available, performance can still be affected. Keepalived makes sure that each of the VIPs is serviced by some node in the configuration, and several VIPs may end up on the same node even when other nodes have none. Strategies that externally load balance across a set of VIPs may be thwarted when ipfailover puts multiple VIPs on the same node.
When you use ingressIP, you can set up ipfailover to have the same VIP range as the ingressIP range. You can also disable the monitoring port. In this case, all the VIPs will appear on the same node in the cluster. Any user can set up a service with an ingressIP and have it highly available.
There are a maximum of 255 VIPs in the cluster.
Use the oc adm ipfailover command with suitable options to create an ipfailover deployment configuration.
Currently, ipfailover is not compatible with cloud infrastructures. For AWS, an Elastic Load Balancer (ELB) can be used to make OKD highly available, using the AWS console.
As an administrator, you can configure ipfailover on an entire cluster, or on a subset of nodes, as defined by the label selector. You can also configure multiple IP failover deployment configurations in your cluster, where each one is independent of the others. The oc adm ipfailover command creates an ipfailover deployment configuration which ensures that a failover pod runs on each of the nodes matching the constraints or the label used. This pod runs Keepalived, which uses VRRP (Virtual Router Redundancy Protocol) among all the Keepalived daemons to ensure that the service on the watched port is available, and if it is not, Keepalived will automatically float the VIPs.
For production use, make sure to use a --selector=<label> that matches at least two nodes. Also, set a --replicas=<n> value that matches the number of nodes for the given labeled selector.
The oc adm ipfailover command includes command line options that set environment variables that control Keepalived. The environment variables start with OPENSHIFT_HA_* and they can be changed as needed.
For example, the command below will create an IP failover configuration on a selection of nodes labeled router=us-west-ha (on 4 nodes with 7 virtual IPs monitoring a service listening on port 80, such as the router process).
$ oc adm ipfailover --selector="router=us-west-ha" \
    --virtual-ips="1.2.3.4,10.1.1.100-104,5.6.7.8" \
    --watch-port=80 --replicas=4 --create
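To see why the --virtual-ips value above amounts to 7 addresses, the following sketch expands such a list into individual VIPs. The expand_vips helper is hypothetical and not part of the oc tooling; it assumes that a range only varies the last octet, as in 10.1.1.100-104.

```shell
# expand_vips LIST: print one address per line for a --virtual-ips style list.
# Assumption: ranges only vary the last octet (for example 10.1.1.100-104).
expand_vips() {
    old_ifs=$IFS
    IFS=,
    set -- $1                      # split the list on commas
    IFS=$old_ifs
    for item in "$@"; do
        case $item in
            *-*)
                prefix=${item%.*}  # first three octets, e.g. 10.1.1
                range=${item##*.}  # last octet range, e.g. 100-104
                i=${range%-*}
                last=${range#*-}
                while [ "$i" -le "$last" ]; do
                    echo "$prefix.$i"
                    i=$((i + 1))
                done
                ;;
            *)
                echo "$item"       # a single address
                ;;
        esac
    done
}

expand_vips "1.2.3.4,10.1.1.100-104,5.6.7.8"
```

The list yields one single address, five addresses from the range, and one more single address.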
Keepalived manages a set of virtual IP addresses. The administrator must make sure that all these addresses:
Are accessible on the configured hosts from outside the cluster.
Are not used for any other purpose within the cluster.
Keepalived on each node determines whether the needed service is running. If it is, VIPs are supported and Keepalived participates in the negotiation to determine which node will serve the VIP. For a node to participate, the service must be listening on the watch port on a VIP or the check must be disabled.
Each VIP in the set may end up being served by a different node.
Keepalived monitors the health of the application by periodically running an optional user-supplied check script. For example, the script can test a web server by issuing a request and verifying the response.
The script is provided through the --check-script=<script> option to the oc adm ipfailover command. The script must exit with 0 for PASS or 1 for FAIL.
By default, the check is done every two seconds, but can be changed using the --check-interval=<seconds> option.
When a check script is not provided, a simple default script is run that tests the TCP connection. This default test is suppressed when the monitor port is 0.
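Because the check script must exit with exactly 0 or 1, a custom probe that can return other codes needs to be normalized. The following is a minimal sketch of that idea; run_check is a hypothetical helper and the probe shown is only a placeholder for a real health test.

```shell
# run_check COMMAND [ARGS...]: run a probe and map its result onto the two
# exit codes described above (0 = PASS, 1 = FAIL), whatever the probe returns.
run_check() {
    if "$@" >/dev/null 2>&1; then
        return 0
    fi
    return 1
}

# Placeholder probe (an assumption): substitute any command whose success
# means that the application is healthy.
if run_check test -d /; then
    echo "PASS"
else
    echo "FAIL"
fi
```

In a real check script, the final status of run_check would be passed to exit.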
For each VIP, keepalived keeps the state of the node. The VIP on the node may be in MASTER, BACKUP, or FAULT state. All VIPs on the node that are not in the FAULT state participate in the negotiation to decide which will be MASTER for the VIP. All of the losers enter the BACKUP state. When the check script on the MASTER fails, the VIP enters the FAULT state and triggers a renegotiation. When the BACKUP fails, the VIP enters the FAULT state. When the check script passes again on a VIP in the FAULT state, it exits FAULT and negotiates for MASTER. The resulting state is either MASTER or BACKUP.
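The transitions described above can be summarized as a small decision function. This is a simplified model for illustration only, not how Keepalived is implemented; next_state and its two arguments are hypothetical names.

```shell
# next_state CHECK ELECTION
#   CHECK:    "pass" or "fail" (result of the check script on this node)
#   ELECTION: "won" or "lost"  (outcome of the negotiation for this VIP)
# Prints the resulting state for the VIP on this node.
next_state() {
    if [ "$1" = "fail" ]; then
        echo FAULT      # a failing check always leads to FAULT
    elif [ "$2" = "won" ]; then
        echo MASTER     # healthy and won the negotiation
    else
        echo BACKUP     # healthy but lost the negotiation
    fi
}

next_state pass won
next_state pass lost
next_state fail won
```

The three calls print MASTER, BACKUP, and FAULT, in that order.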
The administrator can provide an optional notify script, which is called whenever the state changes. Keepalived passes the following three parameters to the script:
$1 - "GROUP"|"INSTANCE"
$2 - Name of the group or instance
$3 - The new state ("MASTER"|"BACKUP"|"FAULT")
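As an illustration of how a notify script might consume these three parameters, here is a minimal sketch. The notify function and the instance name example_vip_1 are hypothetical; a real script would typically log or alert instead of printing.

```shell
# notify KIND NAME STATE: handle a state change reported by Keepalived.
notify() {
    kind=$1     # "GROUP" or "INSTANCE"
    name=$2     # name of the group or instance
    state=$3    # "MASTER", "BACKUP", or "FAULT"
    case $state in
        MASTER|BACKUP|FAULT)
            echo "$kind $name entered $state"
            ;;
        *)
            echo "unexpected state: $state" >&2
            return 1
            ;;
    esac
}

# Hypothetical invocation, using a made-up instance name.
notify INSTANCE example_vip_1 MASTER
```

In a standalone script file, the body would read "$1", "$2", and "$3" directly.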
These scripts run in the IP failover pod and use the pod’s file system, not the host file system. The options require the full path to the script. The administrator must make the script available in the pod to extract the results from running the notify script. The recommended approach for providing the scripts is to use a ConfigMap.
The full path names of the check and notify scripts are added to the keepalived configuration file, /etc/keepalived/keepalived.conf, which is loaded every time keepalived starts. The scripts can be added to the pod with a ConfigMap as follows.
Create the desired script and create a ConfigMap to hold it. The script has no input arguments and must return 0 for OK and 1 for FAIL.
The check script, mycheckscript.sh:
#!/bin/bash
# Whatever tests are needed
# E.g., send request and verify response
exit 0
Create the ConfigMap:
$ oc create configmap mycustomcheck --from-file=mycheckscript.sh
There are two approaches to adding the script to the pod: use oc commands or edit the deployment configuration. In both cases, the defaultMode for the mounted configMap files must allow execution. A value of 0755 (493 decimal) is typical.
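The two notations refer to the same permission bits. As a quick check, the shell printf utility can convert between them:

```shell
# 0755 is octal (rwxr-xr-x); the same value written in decimal is 493.
printf '%d\n' 0755   # octal to decimal
printf '%o\n' 493    # decimal to octal
```

This is why a JSON source, which has no octal literals, uses 493 while YAML can use 0755.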
Using oc commands:
$ oc set env dc/ipf-ha-router \
    OPENSHIFT_HA_CHECK_SCRIPT=/etc/keepalive/mycheckscript.sh
$ oc set volume dc/ipf-ha-router --add --overwrite \
    --name=config-volume \
    --mount-path=/etc/keepalive \
    --source='{"configMap": { "name": "mycustomcheck", "defaultMode": 493}}'
Editing the ipf-ha-router deployment configuration:
Use oc edit dc ipf-ha-router to edit the router deployment configuration with a text editor.
...
    spec:
      containers:
      - env:
        - name: OPENSHIFT_HA_CHECK_SCRIPT (1)
          value: /etc/keepalive/mycheckscript.sh
...
        volumeMounts: (2)
        - mountPath: /etc/keepalive
          name: config-volume
      dnsPolicy: ClusterFirst
...
      volumes: (3)
      - configMap:
          defaultMode: 0755 (4)
          name: mycustomcheck
        name: config-volume
...
1 | In the spec.container.env field, add the OPENSHIFT_HA_CHECK_SCRIPT environment variable to point to the mounted script file. |
2 | Add the spec.container.volumeMounts field to create the mount point. |
3 | Add a new spec.volumes field to mention the ConfigMap. |
4 | This sets execute permission on the files. When read back, it will be displayed in decimal (493). |
Save the changes and exit the editor. This restarts ipf-ha-router.
When a host leaves the FAULT state by passing the check script, the host becomes a BACKUP if it has lower priority than the host currently in the MASTER state. However, if it has a higher priority, the preemption strategy determines its role in the cluster.
The nopreempt strategy does not move MASTER from the lower priority host to the higher priority host. With preempt 300, the default, keepalived waits the specified 300 seconds and moves MASTER to the higher priority host.
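The difference between the two strategies can be sketched as a decision function. This is a simplified model of the behavior described above, not Keepalived's actual logic; should_preempt, its arguments, and the priority values are hypothetical.

```shell
# should_preempt STRATEGY NEW_PRIO MASTER_PRIO SECONDS_HEALTHY
# Prints "yes" if the recovered host would take over MASTER, otherwise "no".
should_preempt() {
    strategy=$1; new_prio=$2; master_prio=$3; healthy_for=$4
    # A host that does not outrank the current MASTER never takes over.
    if [ "$new_prio" -le "$master_prio" ]; then
        echo no
        return 0
    fi
    case $strategy in
        nopreempt)
            echo no                          # MASTER stays where it is
            ;;
        "preempt "*)
            delay=${strategy#"preempt "}     # e.g. 300
            if [ "$healthy_for" -ge "$delay" ]; then
                echo yes
            else
                echo no                      # still inside the delay window
            fi
            ;;
        *)
            echo "unknown strategy: $strategy" >&2
            return 1
            ;;
    esac
}

should_preempt nopreempt 110 100 600
should_preempt "preempt 300" 110 100 120
should_preempt "preempt 300" 110 100 300
```

The three calls print no, no, and yes: the higher priority host only takes over under preempt 300, and only once the 300 seconds have elapsed.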
To specify preemption:
When creating ipfailover, using the --preemption-strategy option:
$ oc adm ipfailover --preemption-strategy=nopreempt \
    ...
Setting the variable using the oc set env command:
$ oc set env dc/ipf-ha-router \
--overwrite=true \
OPENSHIFT_HA_PREEMPTION=nopreempt
Using oc edit dc ipf-ha-router to edit the router deployment configuration:
...
    spec:
      containers:
      - env:
        - name: OPENSHIFT_HA_PREEMPTION (1)
          value: nopreempt
...
OKD’s IP failover internally uses keepalived.
Ensure that multicast is enabled on the nodes labeled above and they can accept network traffic for 224.0.0.18 (the VRRP multicast IP address).
Before starting the keepalived daemon, the startup script verifies the iptables rule that allows multicast traffic to flow. If there is no such rule, the startup script creates a new rule and adds it to the IP tables configuration. Where this new rule gets added to the IP tables configuration depends on the --iptables-chain= option. If there is an --iptables-chain= option specified, the rule gets added to the specified chain in the option. Otherwise, the rule is added to the INPUT chain.
The iptables rule can be removed after the last keepalived daemon terminates. The rule is not automatically removed.
You can manually manage the iptables rule on each of the nodes. It only gets created when none is present (as long as ipfailover is not created with the --iptables-chain="" option).
You must ensure that the manually added rules persist after a system restart. Be careful, since every keepalived daemon uses VRRP over multicast 224.0.0.18 to negotiate with its peers. There must be a different VRRP-id (in the range 1..255) for each VIP.
$ for node in openshift-node-{5,6,7,8,9}; do
    ssh $node <<EOF

export interface=${interface:-"eth0"}
echo "Check multicast enabled ... ";
ip addr show $interface | grep -i MULTICAST

echo "Check multicast groups ... "
ip maddr show $interface | grep 224.0.0

EOF
done;
Option | Variable Name | Default | Notes |
---|---|---|---|
--watch-port | | 80 | The ipfailover pod tries to open a TCP connection to this port on each VIP. If connection is established, the service is considered to be running. If this port is set to 0, the test always passes. |
 | | | The interface name for ipfailover to use, to send VRRP traffic. By default, |
--replicas | | 2 | Number of replicas to create. This must match |
--virtual-ips | OPENSHIFT_HA_VIRTUAL_IPS | | The list of IP address ranges to replicate. This must be provided. (For example, 1.2.3.4-6,1.2.3.9.) See this discussion for more details. |
--vrrp-id-offset | | 0 | See VRRP ID Offset discussion for more details. |
--iptables-chain | | INPUT | The name of the iptables chain, to automatically add an |
--check-script | OPENSHIFT_HA_CHECK_SCRIPT | | Full path name in the pod file system of a script that is periodically run to verify the application is operating. See this discussion for more details. |
--check-interval | | 2 | The period, in seconds, that the check script is run. |
--notify-script | | | Full path name in the pod file system of a script that is run whenever the state changes. See this discussion for more details. |
--preemption-strategy | OPENSHIFT_HA_PREEMPTION | preempt 300 | Strategy for handling a new higher priority host. See the VRRP Preemption section for more details. |
Each ipfailover pod managed by the ipfailover deployment configuration (1 pod per node/replica) runs a keepalived daemon. As more ipfailover deployment configurations are configured, more pods are created and more daemons join into the common VRRP negotiation. This negotiation is done by all the keepalived daemons and it determines which nodes will service which VIPs.
Internally, keepalived assigns a unique vrrp-id to each VIP. The negotiation uses this set of vrrp-ids; when a decision is made, the VIP corresponding to the winning vrrp-id is serviced on the winning node.
Therefore, for every VIP defined in the ipfailover deployment configuration, the ipfailover pod must assign a corresponding vrrp-id. This is done by starting at --vrrp-id-offset and sequentially assigning the vrrp-ids to the list of VIPs. The vrrp-ids may have values in the range 1..255.
When there are multiple ipfailover deployment configurations, care must be taken to specify --vrrp-id-offset so that there is room to increase the number of VIPs in the deployment configuration and none of the vrrp-id ranges overlap.
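A quick way to reason about overlapping ranges is to treat each configuration as a block of consecutive vrrp-ids that begins at its offset. The sketch below checks two such blocks for a collision; ranges_overlap is a hypothetical helper, and the assumption that a configuration consumes exactly one id per VIP comes from the description above.

```shell
# ranges_overlap OFFSET_A COUNT_A OFFSET_B COUNT_B
# Each configuration is assumed to use COUNT consecutive vrrp-ids beginning
# at its offset. Returns 0 (true) if the two blocks share any id.
ranges_overlap() {
    end_a=$(( $1 + $2 ))    # first id past block A
    end_b=$(( $3 + $4 ))    # first id past block B
    [ "$1" -lt "$end_b" ] && [ "$3" -lt "$end_a" ]
}

# Two configurations with 5 VIPs each, at offsets 0 and 10.
if ranges_overlap 0 5 10 5; then echo overlap; else echo ok; fi

# Growing the first configuration to 12 VIPs runs into the second block.
if ranges_overlap 0 12 10 5; then echo overlap; else echo ok; fi
```

The first check prints ok and the second prints overlap, which shows why the offset should leave headroom for adding VIPs later.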
The following example describes how to set up highly-available router and geo-cache network services with IP failover on a set of nodes.
Label the nodes that will be used for the services. This step can be optional if you run the services on all the nodes in your OKD cluster and will use VIPs that can float within all nodes in the cluster.
The following example defines a label for nodes that are servicing traffic in the US west geography ha-svc-nodes=geo-us-west:
$ oc label nodes openshift-node-{5,6,7,8,9} "ha-svc-nodes=geo-us-west"
Create the service account. You can create a dedicated ipfailover service account or, when using a router (depending on your environment policies), reuse the router service account created previously.
The following example creates a new service account with the name ipfailover in the default namespace:
$ oc create serviceaccount ipfailover -n default
Add the ipfailover service account in the default namespace to the privileged SCC:
$ oc adm policy add-scc-to-user privileged system:serviceaccount:default:ipfailover
Start the router and the geo-cache services.
Since the ipfailover runs on all nodes from step 1, it is recommended to also run the router/service on all the step 1 nodes.
Start the router with the nodes matching the labels used in the first step. The following example runs five instances using the ipfailover service account:
$ oc adm router ha-router-us-west --replicas=5 \
    --selector="ha-svc-nodes=geo-us-west" \
    --labels="ha-svc-nodes=geo-us-west" \
    --service-account=ipfailover
Run the geo-cache service with a replica on each of the nodes. See an example configuration for running a geo-cache service.
Make sure that you replace the myimages/geo-cache Docker image referenced in the file with your intended image. Change the number of replicas to the number of nodes in the geo-cache label. Check that the label matches the one used in the first step.
$ oc create -n <namespace> -f ./examples/geo-cache.json
Configure ipfailover for the router and geo-cache services. Each has its own VIPs and both use the same nodes labeled with ha-svc-nodes=geo-us-west in the first step. Ensure that the number of replicas matches the number of nodes listed in the label setup, in the first step.
The router, geo-cache, and ipfailover all create deployment configurations, and all must have different names.
Specify the VIPs and the port number that ipfailover should monitor on the desired instances.
The ipfailover command for the router:
$ oc adm ipfailover ipf-ha-router-us-west \
    --replicas=5 --watch-port=80 \
    --selector="ha-svc-nodes=geo-us-west" \
    --virtual-ips="10.245.2.101-105" \
    --iptables-chain="INPUT" \
    --service-account=ipfailover --create
The following is the oc adm ipfailover command for the geo-cache service that is listening on port 9736. Since there are two ipfailover deployment configurations, the --vrrp-id-offset must be set so that each VIP gets its own offset. In this case, setting a value of 10 means that the ipf-ha-router-us-west can have a maximum of 10 VIPs (0-9) since ipf-ha-geo-cache is starting at 10.
$ oc adm ipfailover ipf-ha-geo-cache \
    --replicas=5 --watch-port=9736 \
    --selector="ha-svc-nodes=geo-us-west" \
    --virtual-ips=10.245.3.101-105 \
    --vrrp-id-offset=10 \
    --service-account=ipfailover --create
In the commands above, there are ipfailover, router, and geo-cache pods on each node. The set of VIPs for each ipfailover configuration must not overlap and they must not be used elsewhere in the external or cloud environments. The five VIP addresses in each example, 10.245.{2,3}.101-105, are served by the two ipfailover deployment configurations. IP failover dynamically selects which address is served on which node.
The administrator sets up external DNS to point to the VIP addresses knowing that all the router VIPs point to the same router, and all the geo-cache VIPs point to the same geo-cache service. As long as one node remains running, all the VIP addresses are served.
Deploy the ipfailover router to monitor postgresql listening on node port 32439 and the external IP address, as defined in the postgresql-ingress service:
$ oc adm ipfailover ipf-ha-postgresql \
    --replicas=1 \ (1)
    --selector="app-type=postgresql" \ (2)
    --virtual-ips=10.9.54.100 \ (3)
    --watch-port=32439 \ (4)
    --service-account=ipfailover --create
1 | Specifies the number of instances to deploy. |
2 | Restricts where the ipfailover is deployed. |
3 | Virtual IP address to monitor. |
4 | Port on which ipfailover will monitor on each node. |
The default deployment strategy for the IP failover service is to recreate the deployment. In order to dynamically update the VIPs for a highly available routing service with minimal or no downtime, you must:
Update the IP failover service deployment configuration to use a rolling update strategy, and
Update the OPENSHIFT_HA_VIRTUAL_IPS environment variable with the updated list or sets of virtual IP addresses.
The following example shows how to dynamically update the deployment strategy and the virtual IP addresses:
Consider an IP failover configuration that was created using the following:
$ oc adm ipfailover ipf-ha-router-us-west \
    --replicas=5 --watch-port=80 \
    --selector="ha-svc-nodes=geo-us-west" \
    --virtual-ips="10.245.2.101-105" \
    --service-account=ipfailover --create
Edit the deployment configuration:
$ oc edit dc/ipf-ha-router-us-west
Update the spec.strategy.type field from Recreate to Rolling:
spec:
  replicas: 5
  selector:
    ha-svc-nodes: geo-us-west
  strategy:
    recreateParams:
      timeoutSeconds: 600
    resources: {}
    type: Rolling (1)
1 | Set to Rolling. |
Update the OPENSHIFT_HA_VIRTUAL_IPS environment variable to contain the additional virtual IP addresses:
- name: OPENSHIFT_HA_VIRTUAL_IPS
  value: 10.245.2.101-105,10.245.2.110,10.245.2.201-205 (1)
1 | 10.245.2.110,10.245.2.201-205 have been added to the list. |
Update the external DNS to match the set of VIPs.
The user can assign VIPs as ExternalIPs in a service. Keepalived makes sure that each VIP is served on some node in the ipfailover configuration. When a request arrives on the node, the service, which is running on all nodes in the cluster, load balances the request among the service’s endpoints.
The NodePorts can be set to the ipfailover watch port so that keepalived can check the application is running. The NodePort is exposed on all nodes in the cluster, therefore it is available to keepalived on all ipfailover nodes.
In non-cloud clusters, ipfailover and ingressIP to a service can be combined. The result is high availability services for users that create services using ingressIP.
The approach is to specify an ingressIPNetworkCIDR range and then use the same range in creating the ipfailover configuration.
Since ipfailover can support up to a maximum of 255 VIPs for the entire cluster, the ingressIPNetworkCIDR needs to be /24 or smaller.
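The size constraint follows from how many addresses a CIDR block spans. As a rough sketch, the hypothetical cidr_size helper below computes the IPv4 address count for a given prefix length, so it can be compared against the 255 VIP limit.

```shell
# cidr_size PREFIX_LEN: number of IPv4 addresses in a block of that prefix length.
cidr_size() {
    echo $(( 1 << (32 - $1) ))
}

cidr_size 23   # 512
cidr_size 24   # 256
cidr_size 26   # 64
```

A /23 spans 512 addresses, far beyond what ipfailover can manage, while a /24 spans 256 and longer prefixes such as /26 span fewer.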