Migrating etcd Data - Upgrading a Cluster | Installation and Configuration

Overview
Before You Begin
Running the Automated Migration Playbook
Running the Migration Manually
Recovering from Migration Issues

Overview

While etcd in OpenShift Container Platform was updated from etcd v2 to v3 in a previous release, OpenShift Container Platform continued using an etcd v2 data model and API for both new and upgraded clusters.

Starting with OpenShift Container Platform 3.6, new installations began using the v3 data model as well, providing improved performance and scalability. For existing clusters that upgraded to OpenShift Container Platform 3.6, however, the etcd data can be migrated from v2 to v3 using the post-upgrade steps below.

OpenShift Container Platform 3.6 does not enforce or require the migration from etcd data model v2 to v3. However, the migration is required before upgrading to OpenShift Container Platform 3.7. See Recommended Practices for OpenShift Container Platform etcd Hosts for more details.

The etcd v2 to v3 data migration is performed offline, which means that all etcd members and master services are stopped during the migration. Large clusters with up to 600 MiB of etcd data can expect a 10 to 15 minute outage of the API, web console, and controllers.

This migration process performs the following steps:

Stop the master API and controller services
Perform an etcd backup on all etcd members
Perform a migration on the first etcd host
Remove etcd data from any remaining etcd hosts
Perform an etcd scaleup operation adding additional etcd hosts one by one
Re-introduce TTL information on specific keys
Reconfigure the masters for etcd v3 storage
Start the master API and controller services

Before You Begin

You can only begin the etcd data migration process after upgrading to OpenShift Container Platform 3.6, as previous versions are not compatible with etcd v3 storage. Additionally, the upgrade to OpenShift Container Platform 3.6 reconfigures cluster DNS services to run on every node, rather than on the masters, which ensures that, even when master services are taken down, existing pods continue to function as expected.

The migration process is currently supported only on clusters that have etcd hosts specifically defined in their inventory file. It therefore cannot be used for clusters that use the embedded etcd, which runs as part of the master process. Support for migrating embedded etcd installations will be added in a future release.

Running the Automated Migration Playbook

If the migration playbooks fail before the masters are reconfigured to support etcd v3 storage, you must roll back the migration process. Contact support for more assistance.

A migration playbook is provided to automate all aspects of the process; this is the preferred method for performing the migration. You must have access to your existing inventory file with both masters and etcd hosts defined in their separate groups.

In order to perform the migration on Red Hat Enterprise Linux Atomic Host, you must be running Atomic Host 7.4 or later.

The migration can only be performed using openshift-ansible version 3.6.173.0.21 or later. Ensure you have the latest version of the openshift-ansible packages installed:
```
# yum upgrade openshift-ansible\*
```
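
The minimum-version requirement above can be checked before running the playbook. The following is a minimal sketch, not part of the product: `version_at_least` is a hypothetical helper name, and it assumes a `sort` that supports the `-V` (version sort) option, as GNU coreutils does.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Assumes GNU sort, whose -V option orders dotted version strings.
version_at_least() {
  # After a version sort, the smaller version is on the first line.
  # $1 >= $2 exactly when that first line equals $2.
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, on an RPM-based host, `version_at_least "$(rpm -q --qf '%{VERSION}' openshift-ansible)" 3.6.173.0.21 || echo "openshift-ansible is too old"` prints a warning when the installed package does not meet the requirement.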

Run the migrate.yml playbook using your inventory file:

# ansible-playbook [-i /path/to/inventory] \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml

Running the Migration Manually

The following procedure describes the steps required to successfully migrate the cluster (implemented as part of the Ansible etcd migration playbook).

Create an etcd backup. See Backup and Restore for steps.

Stop masters and wait for etcd convergence:

Stop all master services:

# systemctl stop atomic-openshift-master-api atomic-openshift-master-controllers

Before the migration can proceed, the etcd cluster must be healthy and raft indices of all etcd members must differ by one unit at most. At the same time, all etcd members and master daemons must be stopped.

To check that the etcd cluster is healthy, run:

# etcdctl <certificate_details> <endpoint> cluster-health (1)
member 2a3d833935d9d076 is healthy: got healthy result from https://etcd-test-1:2379
member a83a3258059fee18 is healthy: got healthy result from https://etcd-test-2:2379
member 22a9f2ddf18fee5f is healthy: got healthy result from https://etcd-test-3:2379
cluster is healthy

1	For `<certificate_details>`, see Backup and Restore for an example of how to set certificate flags.

To check the difference between raft indices, run:

# ETCDCTL_API=3 etcdctl <certificate_details> <endpoint> -w table endpoint status
+------------------+------------------+---------+---------+-----------+-----------+------------+
|     ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------+------------------+---------+---------+-----------+-----------+------------+
| etcd-test-1:2379 | 2a3d833935d9d076 | 3.1.9   | 25 kB   | false     |       415 |        995 |
| etcd-test-2:2379 | a83a3258059fee18 | 3.1.9   | 25 kB   | true      |       415 |        995 |
| etcd-test-3:2379 | 22a9f2ddf18fee5f | 3.1.9   | 25 kB   | false     |       415 |        995 |
+------------------+------------------+---------+---------+-----------+-----------+------------+

If the minimum and maximum raft indices across all etcd members differ by more than one unit, wait a minute and run the command again.
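
The raft index comparison can be scripted rather than read off the table by eye. The following is a minimal sketch under stated assumptions: `raft_index_spread` is a hypothetical helper name, and it assumes the table layout shown above, where RAFT INDEX is the last column of each data row.

```shell
# Hypothetical helper: reads `etcdctl -w table endpoint status` output on
# stdin and prints the difference between the largest and smallest RAFT INDEX.
# Exits non-zero if no data rows are found.
raft_index_spread() {
  awk -F'|' '
    # Data rows start with "|" and end with a numeric RAFT INDEX column.
    # With "|" as separator, the last real column is field NF-1.
    /^[|]/ && $(NF-1) ~ /^ *[0-9]+ *$/ {
      idx = $(NF-1) + 0
      if (!seen || idx < min) min = idx
      if (!seen || idx > max) max = idx
      seen = 1
    }
    END { if (!seen) exit 1; print max - min }
  '
}
```

Piping the `endpoint status` command above into `raft_index_spread` prints a single number; a value of 0 or 1 means that the members have converged.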

Migrate and scale up etcd:

The migration should not be run repeatedly, as new v2 data can overwrite v3 data that has already been migrated.

Stop etcd on all etcd hosts:
```
# systemctl stop etcd
```
Run the following command (with the etcd daemon stopped) on your first etcd host to perform the migration:
```
# ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd
```
The --data-dir target can be in a different location depending on the deployment. For example, embedded etcd operates over the /var/lib/origin/openshift.local.etcd directory, and etcd run as a system container operates over the /var/lib/etcd/etcd.etcd directory.

When complete, the migration responds with the following message if successful:
```
finished transforming keys
```
If there is no v2 data, it responds with:
```
no v2 keys to migrate
```
On each remaining etcd host, move the existing member directory to a backup location:
```
$ mv /var/lib/etcd/member /var/lib/etcd/member.old
```

Create a new cluster on the first host:

# echo "ETCD_FORCE_NEW_CLUSTER=true" >> /etc/etcd/etcd.conf
# systemctl start etcd
# sed -i '/ETCD_FORCE_NEW_CLUSTER=true/d' /etc/etcd/etcd.conf
# systemctl restart etcd

Scale up additional etcd hosts by following the Adding Additional etcd Members documentation.

When the etcdctl migrate command is run without the --no-ttl option, TTL keys are migrated as well. Given that the TTL keys in v2 data are replaced with leases in v3 data, you must attach leases to all migrated TTL keys (with the etcd daemon running).

After your etcd cluster is back online with all members, re-introduce the TTL information by running the following on the first master:

$ oc adm migrate etcd-ttl --etcd-address=https://<ip_address>:2379 \
    --cacert=/etc/origin/master/master.etcd-ca.crt \
    --cert=/etc/origin/master/master.etcd-client.crt \
    --key=/etc/origin/master/master.etcd-client.key \
    --ttl-keys-prefix '/kubernetes.io/events' \
    --lease-duration 1h
$ oc adm migrate etcd-ttl --etcd-address=https://<ip_address>:2379 \
    --cacert=/etc/origin/master/master.etcd-ca.crt \
    --cert=/etc/origin/master/master.etcd-client.crt \
    --key=/etc/origin/master/master.etcd-client.key \
    --ttl-keys-prefix '/kubernetes.io/masterleases' \
    --lease-duration 10s
$ oc adm migrate etcd-ttl --etcd-address=https://<ip_address>:2379 \
    --cacert=/etc/origin/master/master.etcd-ca.crt \
    --cert=/etc/origin/master/master.etcd-client.crt \
    --key=/etc/origin/master/master.etcd-client.key \
    --ttl-keys-prefix '/openshift.io/oauth/accesstokens' \
    --lease-duration 86400s
$ oc adm migrate etcd-ttl --etcd-address=https://<ip_address>:2379 \
    --cacert=/etc/origin/master/master.etcd-ca.crt \
    --cert=/etc/origin/master/master.etcd-client.crt \
    --key=/etc/origin/master/master.etcd-client.key \
    --ttl-keys-prefix '/openshift.io/oauth/authorizetokens' \
    --lease-duration 500s
$ oc adm migrate etcd-ttl --etcd-address=https://<ip_address>:2379 \
    --cacert=/etc/origin/master/master.etcd-ca.crt \
    --cert=/etc/origin/master/master.etcd-client.crt \
    --key=/etc/origin/master/master.etcd-client.key \
    --ttl-keys-prefix '/openshift.io/leases/controllers' \
    --lease-duration 10s
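
Because the five commands above differ only in the key prefix and lease duration, they can be driven from a single list. The following is a minimal sketch, not the documented procedure: `migrate_ttl_keys` and the `OC` override variable are hypothetical names introduced here, and the certificate paths are the ones used in the commands above.

```shell
# Hypothetical helper: runs `oc adm migrate etcd-ttl` once for each
# prefix/lease pair listed above. $1 is the etcd address, for example
# https://<ip_address>:2379.
# OC is an assumption of this sketch: set OC="echo oc" for a dry run
# that prints the commands instead of executing them.
migrate_ttl_keys() {
  while read -r prefix lease; do
    # stdin is redirected so the command cannot consume the list below.
    ${OC:-oc} adm migrate etcd-ttl --etcd-address="$1" \
      --cacert=/etc/origin/master/master.etcd-ca.crt \
      --cert=/etc/origin/master/master.etcd-client.crt \
      --key=/etc/origin/master/master.etcd-client.key \
      --ttl-keys-prefix "$prefix" \
      --lease-duration "$lease" </dev/null || return 1
  done <<'EOF'
/kubernetes.io/events 1h
/kubernetes.io/masterleases 10s
/openshift.io/oauth/accesstokens 86400s
/openshift.io/oauth/authorizetokens 500s
/openshift.io/leases/controllers 10s
EOF
}
```

The loop stops at the first failing command, so a partial run is visible from the exit status rather than silently skipped.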

Reconfigure the master:

After the migration is complete, the master configuration file (the /etc/origin/master/master-config.yaml file by default) must be updated so the master daemons can use the new storage back end:
```
kubernetesMasterConfig:
  apiServerArguments:
    storage-backend:
    - etcd3
    storage-media-type:
    - application/vnd.kubernetes.protobuf
```
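
Before restarting the services, it can help to confirm that the edit took effect on each master. The following is a minimal sketch: `uses_etcd3_backend` is a hypothetical helper name, and it assumes the value appears on the line directly after the `storage-backend:` key, as in the fragment above.

```shell
# Hypothetical helper: succeeds when the master configuration file at $1
# lists etcd3 on the line directly following the storage-backend key.
uses_etcd3_backend() {
  grep -A1 -E '^[[:space:]]*storage-backend:' "$1" \
    | grep -q -E '^[[:space:]]*-[[:space:]]*etcd3[[:space:]]*$'
}
```

For example, `uses_etcd3_backend /etc/origin/master/master-config.yaml || echo "storage-backend is not etcd3"` reports a master that still needs the change.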

Restart the master services:

# systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers

Recovering from Migration Issues

If you discover problems after the migration has completed, you may wish to restore from a backup:

Stop the master services:

# systemctl stop atomic-openshift-master-api atomic-openshift-master-controllers

Remove the storage-backend and storage-media-type keys from the kubernetesMasterConfig.apiServerArguments section in the master configuration file on each master:
```
kubernetesMasterConfig:
  apiServerArguments:
   ...
```
Restore from backups that were taken prior to the migration, located in a timestamped directory under /var/lib/etcd, such as:
```
/var/lib/etcd/openshift-backup-pre-migration20170825135732
```
Use the procedure described in Cluster Restore for Multiple-member etcd Clusters or Cluster Restore for Single-member etcd Clusters.

Restart the master services:

# systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers