Before you update the cluster - Day 2 operations for telco core CNF clusters | Edge computing

Pausing worker nodes before the update
Backup the etcd database before you proceed with the update
- Backing up etcd data
- Creating a single etcd backup
Checking the cluster health

Before you start the cluster update, you must pause worker nodes, back up the etcd database, and do a final cluster health check before proceeding.

Pausing worker nodes before the update

You must pause the worker nodes before you proceed with the update. In the following example, there are 2 mcp groups, mcp-1 and mcp-2. You patch the spec.paused field to true for each of these MachineConfigPool groups.

Procedure

Patch the mcp CRs to pause the nodes and drain and remove the pods from those nodes by running the following command:

$ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":true}}'

$ oc patch mcp/mcp-2 --type merge --patch '{"spec":{"paused":true}}'

Get the status of the paused mcp groups:

$ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker

Example output

MCP     Paused
---     ------
master  false
mcp-1   true
mcp-2   true

The default control plane and worker mcp groups are not changed during an update.

Backup the etcd database before you proceed with the update

You must backup the etcd database before you proceed with the update.

Backing up etcd data

Follow these steps to back up etcd data by creating an etcd snapshot and backing up the resources for the static pods. This backup can be saved and used at a later time if you need to restore etcd.

Only save a backup from a single control plane host. Do not take a backup from each control plane host in the cluster.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have checked whether the cluster-wide proxy is enabled.

You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpproxy, httpsproxy, and noproxy fields have values set.

Procedure

Start a debug session as root for a control plane node:
```
$ oc debug --as-root node/<node_name>
```
Change your root directory to /host in the debug shell:
```
sh-4.4# chroot /host
```

If the cluster-wide proxy is enabled, export the NO_proxy, HTTP_proxy, and HTTPS_proxy environment variables by running the following commands:

$ export HTTP_proxy=http://<your_proxy.example.com>:8080

$ export HTTPS_proxy=https://<your_proxy.example.com>:8080

$ export NO_proxy=<example.com>

Run the cluster-backup.sh script in the debug shell and pass in the location to save the backup to.

The cluster-backup.sh script is maintained as a component of the etcd Cluster Operator and is a wrapper around the etcdctl snapshot save command.

sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup

Example script output

found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-6
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-7
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-3
ede95fe6b88b87ba86a03c15e669fb4aa5bf0991c180d3c6895ce72eaade54a1
etcdctl version: 3.4.14
API version: 3.4
{"level":"info","ts":1624647639.0188997,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"/home/core/assets/backup/snapshot_2021-06-25_190035.db.part"}
{"level":"info","ts":"2021-06-25T19:00:39.030Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1624647639.0301006,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://10.0.0.5:2379"}
{"level":"info","ts":"2021-06-25T19:00:40.215Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1624647640.6032252,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://10.0.0.5:2379","size":"114 MB","took":1.584090459}
{"level":"info","ts":1624647640.6047094,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/home/core/assets/backup/snapshot_2021-06-25_190035.db"}
Snapshot saved at /home/core/assets/backup/snapshot_2021-06-25_190035.db
{"hash":3866667823,"revision":31407,"totalKey":12828,"totalSize":114446336}
snapshot db and kube resources are successfully saved to /home/core/assets/backup

In this example, two files are created in the /home/core/assets/backup/ directory on the control plane host:

snapshot_<datetimestamp>.db: This file is the etcd snapshot. The cluster-backup.sh script confirms its validity.

static_kuberesources_<datetimestamp>.tar.gz: This file contains the resources for the static pods. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.

If etcd encryption is enabled, it is recommended to store this second file separately from the etcd snapshot for security reasons. However, this file is required to restore from the etcd snapshot.

Keep in mind that etcd encryption only encrypts values, not keys. This means that resource types, namespaces, and object names are unencrypted.

Creating a single etcd backup

Follow these steps to create a single etcd backup by creating and applying a custom resource (CR).

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have access to the OpenShift CLI (oc).

Procedure

If dynamically-provisioned storage is available, complete the following steps to create a single automated etcd backup:
1. Create a persistent volume claim (PVC) named etcd-backup-pvc.yaml with contents such as the following example:
  kind: PersistentVolumeClaim apiVersion: v1 metadata: name: etcd-backup-pvc namespace: openshift-etcd spec: accessModes: - ReadWriteOnce resources: requests: storage: 200Gi (1) volumeMode: Filesystem
  1 The amount of storage available to the PVC. Adjust this value for your requirements.
2. Apply the PVC by running the following command:
  $ oc apply -f etcd-backup-pvc.yaml
3. Verify the creation of the PVC by running the following command:
  $ oc get pvc
  Example output
  NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE etcd-backup-pvc Bound 51s
  Dynamic PVCs stay in the Pending state until they are mounted.
4. Create a CR file named etcd-single-backup.yaml with contents such as the following example:
  apiVersion: operator.openshift.io/v1alpha1 kind: EtcdBackup metadata: name: etcd-single-backup namespace: openshift-etcd spec: pvcName: etcd-backup-pvc (1)
  1 The name of the PVC to save the backup to. Adjust this value according to your environment.
5. Apply the CR to start a single backup:
  $ oc apply -f etcd-single-backup.yaml

If dynamically-provisioned storage is not available, complete the following steps to create a single automated etcd backup:

Create a StorageClass CR file named etcd-backup-local-storage.yaml with the following contents:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: etcd-backup-local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: Immediate

Apply the StorageClass CR by running the following command:
```
$ oc apply -f etcd-backup-local-storage.yaml
```

Create a PV named etcd-backup-pv-fs.yaml with contents such as the following example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: etcd-backup-pv-fs
spec:
  capacity:
    storage: 100Gi (1)
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: etcd-backup-local-storage
  local:
    path: /mnt
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
      - key: kubernetes.io/hostname
         operator: In
         values:
         - <example_master_node> (2)

1	The amount of storage available to the PV. Adjust this value for your requirements.
2	Replace this value with the node to attach this PV to.

Verify the creation of the PV by running the following command:

$ oc get pv

Example output

NAME                    CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS                REASON   AGE
etcd-backup-pv-fs       100Gi      RWO            Retain           Available           etcd-backup-local-storage            10s

Create a PVC named etcd-backup-pvc.yaml with contents such as the following example:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: etcd-backup-pvc
  namespace: openshift-etcd
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi (1)

1	The amount of storage available to the PVC. Adjust this value for your requirements.

Apply the PVC by running the following command:
```
$ oc apply -f etcd-backup-pvc.yaml
```
Create a CR file named etcd-single-backup.yaml with contents such as the following example:
```
apiVersion: operator.openshift.io/v1alpha1
kind: EtcdBackup
metadata:
  name: etcd-single-backup
  namespace: openshift-etcd
spec:
  pvcName: etcd-backup-pvc (1)
```
1 The name of the persistent volume claim (PVC) to save the backup to. Adjust this value according to your environment.
Apply the CR to start a single backup:
```
$ oc apply -f etcd-single-backup.yaml
```

Additional resources

Backing up etcd

Checking the cluster health

You should check the cluster health often during the update. Check for the node status, cluster Operators status and failed pods.

Procedure

Check the status of the cluster Operators by running the following command:

$ oc get co

Example output

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.34   True        False         False      4d22h
baremetal                                  4.14.34   True        False         False      4d22h
cloud-controller-manager                   4.14.34   True        False         False      4d23h
cloud-credential                           4.14.34   True        False         False      4d23h
cluster-autoscaler                         4.14.34   True        False         False      4d22h
config-operator                            4.14.34   True        False         False      4d22h
console                                    4.14.34   True        False         False      4d22h
...
service-ca                                 4.14.34   True        False         False      4d22h
storage                                    4.14.34   True        False         False      4d22h

Check the status of the cluster nodes:

$ oc get nodes

Example output

NAME           STATUS   ROLES                  AGE     VERSION
ctrl-plane-0   Ready    control-plane,master   4d22h   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   4d22h   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   4d22h   v1.27.15+6147456
worker-0       Ready    mcp-1,worker           4d22h   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           4d22h   v1.27.15+6147456

Check that there are no in-progress or failed pods. There should be no pods returned when you run the following command.
```
$ oc get po -A | grep -E -iv 'running|complete'
```

Before you update the telco core CNF cluster

Pausing worker nodes before the update

Backup the etcd database before you proceed with the update

Backing up etcd data

Creating a single etcd backup

Checking the cluster health