Replacing a failed bare-metal control plane node without BMC credentials

If a control plane node on your bare-metal cluster has failed and cannot be recovered, and you installed your cluster without providing baseboard management controller (BMC) credentials, you must take extra steps to replace the failed node with a new one.

Prerequisites

  • You have identified the unhealthy bare metal etcd member.

  • You have verified that either the machine is not running or the node is not ready.

  • You have access to the cluster as a user with the cluster-admin role.

  • You have taken an etcd backup in case you encounter any issues.

  • You have downloaded and installed the coreos-installer CLI.

  • Your cluster does not have a control plane machine set. You can check for MachineSet and ControlPlaneMachineSet resources by running the following command:

    $ oc get machinesets,controlplanemachinesets -n openshift-machine-api

    The output should list only MachineSet resources for the compute (worker) nodes. If a ControlPlaneMachineSet resource exists for the control plane, do not use this procedure.

Removing the unhealthy etcd member

Begin removing the failed control plane node by first removing the unhealthy etcd member.

Procedure
  1. List etcd pods by running the following command and make note of a pod that is not on the affected node:

    $ oc -n openshift-etcd get pods -l k8s-app=etcd -o wide
    Example output
    etcd-openshift-control-plane-0   5/5   Running   11   3h56m   192.168.10.9    openshift-control-plane-0  <none>           <none>
    etcd-openshift-control-plane-1   5/5   Running   0    3h54m   192.168.10.10   openshift-control-plane-1   <none>           <none>
    etcd-openshift-control-plane-2   5/5   Running   0    3h58m   192.168.10.11   openshift-control-plane-2   <none>           <none>
  2. Connect to a running etcd container by running the following command:

    $ oc rsh -n openshift-etcd <etcd_pod>

    Replace <etcd_pod> with the name of an etcd pod associated with one of the healthy nodes.

    Example command
    $ oc rsh -n openshift-etcd etcd-openshift-control-plane-0
  3. View the etcd member list by running the following command. Make note of the ID and the name of the unhealthy etcd member because these values are required later.

    sh-4.2# etcdctl member list -w table
    Example output
    +------------------+---------+------------------------------+---------------------------+---------------------------+
    |        ID        | STATUS  |             NAME             |        PEER ADDRS         |       CLIENT ADDRS        |
    +------------------+---------+------------------------------+---------------------------+---------------------------+
    | 6fc1e7c9db35841d | started | openshift-control-plane-2    | https://10.0.131.183:2380 | https://10.0.131.183:2379 |
    | 757b6793e2408b6c | started | openshift-control-plane-1    | https://10.0.164.97:2380  | https://10.0.164.97:2379  |
    | ca8c2990a0aa29d1 | started | openshift-control-plane-0    | https://10.0.154.204:2380 | https://10.0.154.204:2379 |
    +------------------+---------+------------------------------+---------------------------+---------------------------+

    The etcdctl endpoint health command will list the removed member until the replacement is complete and the new member is added.

  4. Remove the unhealthy etcd member by running the following command:

    sh-4.2# etcdctl member remove <unhealthy_member_id>

    Replace <unhealthy_member_id> with the ID of the etcd member on the unhealthy node.

    Example command
    sh-4.2# etcdctl member remove 6fc1e7c9db35841d
    Example output
    Member 6fc1e7c9db35841d removed from cluster b23536c33f2cdd1b
  5. View the member list again by running the following command and verify that the member was removed:

    sh-4.2# etcdctl member list -w table
    Example output
    +------------------+---------+------------------------------+---------------------------+---------------------------+
    |        ID        | STATUS  |             NAME             |        PEER ADDRS         |       CLIENT ADDRS        |
    +------------------+---------+------------------------------+---------------------------+---------------------------+
    | 757b6793e2408b6c | started | openshift-control-plane-1    | https://10.0.164.97:2380  | https://10.0.164.97:2379  |
    | ca8c2990a0aa29d1 | started | openshift-control-plane-0    | https://10.0.154.204:2380 | https://10.0.154.204:2379 |
    +------------------+---------+------------------------------+---------------------------+---------------------------+

    After you remove the member, the cluster might be unreachable for a short time while the remaining etcd instances reboot.

  6. Exit the rsh session into the etcd pod by running the following command:

    sh-4.2# exit
  7. Turn off the etcd quorum guard by running the following command:

    $ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableetcd": true}}}'

    This command ensures that you can successfully re-create secrets and roll out the static pods.

  8. List the secrets for the removed, unhealthy etcd member by running the following command:

    $ oc get secrets -n openshift-etcd | grep <node_name>

    Replace <node_name> with the name of the failed node whose etcd member you removed.

    Example command
    $ oc get secrets -n openshift-etcd | grep openshift-control-plane-2
    Example output
    etcd-peer-openshift-control-plane-2             kubernetes.io/tls   2   134m
    etcd-serving-metrics-openshift-control-plane-2  kubernetes.io/tls   2   134m
    etcd-serving-openshift-control-plane-2          kubernetes.io/tls   2   134m
  9. Delete the secrets associated with the affected node that was removed:

    1. Delete the peer secret by running the following command:

      $ oc delete secret -n openshift-etcd etcd-peer-<node_name>

      Replace <node_name> with the name of the affected node.

    2. Delete the serving secret by running the following command:

      $ oc delete secret -n openshift-etcd etcd-serving-<node_name>

      Replace <node_name> with the name of the affected node.

    3. Delete the metrics secret by running the following command:

      $ oc delete secret -n openshift-etcd etcd-serving-metrics-<node_name>

      Replace <node_name> with the name of the affected node.
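
    Example commands, for the openshift-control-plane-2 node used in the earlier examples:

    $ oc delete secret -n openshift-etcd etcd-peer-openshift-control-plane-2
    $ oc delete secret -n openshift-etcd etcd-serving-openshift-control-plane-2
    $ oc delete secret -n openshift-etcd etcd-serving-metrics-openshift-control-plane-2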

Deleting the machine of the unhealthy etcd member

Finish removing the failed control plane node by deleting the machine of the unhealthy etcd member.

Procedure
  1. Ensure that the Bare Metal Operator is available by running the following command:

    $ oc get clusteroperator baremetal
    Example output
    NAME        VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    baremetal   4.20.0    True        False         False      3d15h
  2. Save the BareMetalHost object of the affected node to a file for later use by running the following command:

    $ oc get -n openshift-machine-api bmh <node_name> -o yaml > bmh_affected.yaml

    Replace <node_name> with the name of the affected node, which usually matches the associated BareMetalHost name.
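
    Example command
    $ oc get -n openshift-machine-api bmh openshift-control-plane-2 -o yaml > bmh_affected.yaml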

  3. View the YAML file of the saved BareMetalHost object by running the following command, and ensure the content is correct:

    $ cat bmh_affected.yaml
  4. Remove the affected BareMetalHost object by running the following command:

    $ oc delete -n openshift-machine-api bmh <node_name>

    Replace <node_name> with the name of the affected node.
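
    Example command
    $ oc delete -n openshift-machine-api bmh openshift-control-plane-2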

  5. List all machines by running the following command and identify the machine associated with the affected node:

    $ oc get machines -n openshift-machine-api -o wide
    Example output
    NAME                            PHASE    TYPE  REGION  ZONE  AGE    NODE                       PROVIDERID                                                                                             STATE
    examplecluster-control-plane-0  Running                      3h11m  openshift-control-plane-0  baremetalhost:///openshift-machine-api/openshift-control-plane-0/da1ebe11-3ff2-41c5-b099-0aa41222964e  externally provisioned
    examplecluster-control-plane-1  Running                      3h11m  openshift-control-plane-1  baremetalhost:///openshift-machine-api/openshift-control-plane-1/d9f9acbc-329c-475e-8d81-03b20280a3e1  externally provisioned
    examplecluster-control-plane-2  Running                      3h11m  openshift-control-plane-2  baremetalhost:///openshift-machine-api/openshift-control-plane-2/3354bdac-61d8-410f-be5b-6a395b056135  externally provisioned
    examplecluster-compute-0        Running                      165m   openshift-compute-0        baremetalhost:///openshift-machine-api/openshift-compute-0/3d685b81-7410-4bb3-80ec-13a31858241f        provisioned
    examplecluster-compute-1        Running                      165m   openshift-compute-1        baremetalhost:///openshift-machine-api/openshift-compute-1/0fdae6eb-2066-4241-91dc-e7ea72ab13b9        provisioned
  6. Delete the machine of the unhealthy member by running the following command:

    $ oc delete machine -n openshift-machine-api <machine_name>

    Replace <machine_name> with the machine name associated with the affected node.

    Example command
    $ oc delete machine -n openshift-machine-api examplecluster-control-plane-2

    After you remove the BareMetalHost and Machine objects, the machine controller automatically deletes the Node object.

  7. If the machine deletion is delayed for any reason, or the command hangs, force the deletion by removing the finalizer field from the Machine object.

    Do not interrupt the machine deletion by pressing Ctrl+C. Allow the command to run to completion, and open a new terminal window to edit the Machine object and remove the finalizer field.

    1. On a new terminal window, edit the machine configuration by running the following command:

      $ oc edit machine -n openshift-machine-api examplecluster-control-plane-2
    2. Delete the following fields in the Machine custom resource, and then save the updated file:

      finalizers:
      - machine.machine.openshift.io
      Example output
      machine.machine.openshift.io/examplecluster-control-plane-2 edited

Verifying that the failed node was deleted

Before proceeding to create a replacement control plane node, verify that the failed node was successfully deleted.

Procedure
  1. Verify that the machine was deleted by running the following command:

    $ oc get machines -n openshift-machine-api -o wide
    Example output
    NAME                              PHASE     TYPE   REGION   ZONE   AGE     NODE                                 PROVIDERID                                                                                       STATE
    examplecluster-control-plane-0    Running                          3h11m   openshift-control-plane-0   baremetalhost:///openshift-machine-api/openshift-control-plane-0/da1ebe11-3ff2-41c5-b099-0aa41222964e   externally provisioned
    examplecluster-control-plane-1    Running                          3h11m   openshift-control-plane-1   baremetalhost:///openshift-machine-api/openshift-control-plane-1/d9f9acbc-329c-475e-8d81-03b20280a3e1   externally provisioned
    examplecluster-compute-0          Running                          165m    openshift-compute-0         baremetalhost:///openshift-machine-api/openshift-compute-0/3d685b81-7410-4bb3-80ec-13a31858241f         provisioned
    examplecluster-compute-1          Running                          165m    openshift-compute-1         baremetalhost:///openshift-machine-api/openshift-compute-1/0fdae6eb-2066-4241-91dc-e7ea72ab13b9         provisioned
  2. Verify that the node has been deleted by running the following command:

    $ oc get nodes
    Example output
     NAME                        STATUS   ROLES    AGE     VERSION
     openshift-control-plane-0   Ready    master   3h24m   v1.33.4
     openshift-control-plane-1   Ready    master   3h24m   v1.33.4
     openshift-compute-0         Ready    worker   176m    v1.33.4
     openshift-compute-1         Ready    worker   176m    v1.33.4
  3. Wait for all of the cluster Operators to complete rolling out changes. Run the following command to monitor the progress:

    $ watch oc get co

Creating the new control plane node

Begin creating the new control plane node by creating a BareMetalHost object and node.

Procedure
  1. Edit the bmh_affected.yaml file that you previously saved:

    1. Remove the following metadata items from the file:

      • creationTimestamp

      • generation

      • resourceVersion

      • uid

    2. Remove the status section of the file.

    The resulting file should resemble the following example:

    Example bmh_affected.yaml file
    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      labels:
        installer.openshift.io/role: control-plane
      name: openshift-control-plane-2
      namespace: openshift-machine-api
    spec:
      automatedCleaningMode: disabled
      bmc:
        address:
        credentialsName:
        disableCertificateVerification: true
      bootMACAddress: ab:cd:ef:ab:cd:ef
      bootMode: UEFI
      externallyProvisioned: true
      online: true
      rootDeviceHints:
        deviceName: /dev/disk/by-path/pci-0000:04:00.0-nvme-1
      userData:
        name: master-user-data-managed
        namespace: openshift-machine-api
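
    If you prefer to script these edits instead of editing the file by hand, an expression like the following removes the same fields. This is a sketch that assumes the Go-based yq v4 CLI, which this procedure also uses in a later step:

    $ yq -i 'del(.metadata.creationTimestamp, .metadata.generation, .metadata.resourceVersion, .metadata.uid, .status)' bmh_affected.yaml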
  2. Create the BareMetalHost object using the bmh_affected.yaml file by running the following command:

    $ oc create -f bmh_affected.yaml

    The following warning is expected upon creation of the BareMetalHost object:

    Warning: metadata.finalizers: "baremetalhost.metal3.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
  3. Extract the control plane ignition secret by running the following command:

    $ oc extract secret/master-user-data-managed \
        -n openshift-machine-api \
        --keys=userData \
        --to=- \
        | sed '/^userData/d' > new_controlplane.ign

    This command also removes the starting userData line of the ignition secret.
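
    Optionally, verify that the extracted file parses as valid JSON before you use it, for example with jq:

    $ jq empty new_controlplane.ign && echo "new_controlplane.ign is valid JSON"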

  4. Create an Nmstate YAML file named new_controlplane_nmstate.yaml for the new node’s network configuration, using the following example for reference:

    Example Nmstate YAML file
    interfaces:
      - name: eno1
        type: ethernet
        state: up
        mac-address: "ab:cd:ef:01:02:03"
        ipv4:
          enabled: true
          address:
            - ip: 192.168.20.11
              prefix-length: 24
          dhcp: false
        ipv6:
          enabled: false
    dns-resolver:
      config:
        search:
          - iso.sterling.home
        server:
          - 192.168.20.8
    routes:
      config:
      - destination: 0.0.0.0/0
        metric: 100
        next-hop-address: 192.168.20.1
        next-hop-interface: eno1
        table-id: 254

    If you installed your cluster using the Agent-based Installer, you can use the failed node’s networkConfig section in the agent-config.yaml file from the original cluster deployment as a starting point for the new control plane node’s Nmstate file. For example, the following command extracts the networkConfig section for the first control plane node:

    $ cat agent-config-iso.yaml | yq .hosts[0].networkConfig > new_controlplane_nmstate.yaml
  5. Create the customized Fedora CoreOS (FCOS) live ISO by running the following command:

    $ coreos-installer iso customize rhcos-live.x86_64.iso \
        --dest-ignition new_controlplane.ign \
        --network-nmstate new_controlplane_nmstate.yaml \
        --dest-device /dev/disk/by-path/<device_path> \
        -f

    Replace <device_path> with the path to the destination disk that the operating system is installed to when the node boots the live ISO.
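
    Example command, reusing the disk path from the rootDeviceHints value in the bmh_affected.yaml example shown earlier. Substitute the ISO file name and disk path that match your environment:

    $ coreos-installer iso customize rhcos-live.x86_64.iso \
        --dest-ignition new_controlplane.ign \
        --network-nmstate new_controlplane_nmstate.yaml \
        --dest-device /dev/disk/by-path/pci-0000:04:00.0-nvme-1 \
        -f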

  6. Boot the new control plane node with the customized FCOS live ISO.

  7. Approve the certificate signing requests (CSRs) to join the new node to the cluster, as in the following example.
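
    The CSR names differ in every cluster; list the pending requests and approve them by name. Expect at least two rounds of approvals (client and then serving certificates) for the new node:

    $ oc get csr
    $ oc adm certificate approve <csr_name>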

Linking the node, bare metal host, and machine together

Continue creating the new control plane node by creating a machine and then linking it with the new BareMetalHost object and node.

Procedure
  1. Get the providerID for control plane nodes by running the following command:

    $ oc get -n openshift-machine-api baremetalhost -l installer.openshift.io/role=control-plane -ojson | jq -r '.items[] | "baremetalhost:///openshift-machine-api/" + .metadata.name + "/" + .metadata.uid'
    Example output
    baremetalhost:///openshift-machine-api/master-00/6214c5cf-c798-4168-8c78-1ff1a3cd2cb4
    baremetalhost:///openshift-machine-api/master-01/58fb60bd-b2a6-4ff3-a88d-208c33abf954
    baremetalhost:///openshift-machine-api/master-02/dc5a94f3-625b-43f6-ab5a-7cc4fc79f105
  2. Get cluster information for labels by running the following command:

    $ oc get machine -n openshift-machine-api \
        -l machine.openshift.io/cluster-api-machine-role=master \
        -L machine.openshift.io/cluster-api-cluster
    Example output
    NAME                           PHASE   TYPE  REGION  ZONE  AGE  CLUSTER-API-CLUSTER
    ci-op-jcp3s7wx-ng5sd-master-0  Running                     10h  ci-op-jcp3s7wx-ng5sd
    ci-op-jcp3s7wx-ng5sd-master-1  Running                     10h  ci-op-jcp3s7wx-ng5sd
    ci-op-jcp3s7wx-ng5sd-master-2  Running                     10h  ci-op-jcp3s7wx-ng5sd
  3. Create a Machine object for the new control plane node by creating a YAML file similar to the following:

    apiVersion: machine.openshift.io/v1beta1
    kind: Machine
    metadata:
      annotations:
        metal3.io/BareMetalHost: openshift-machine-api/<new_control_plane_machine> (1)
      finalizers:
        - machine.machine.openshift.io
      labels:
        machine.openshift.io/cluster-api-cluster: <cluster_api_cluster> (2)
        machine.openshift.io/cluster-api-machine-role: master
        machine.openshift.io/cluster-api-machine-type: master
      name: <new_control_plane_machine> (1)
      namespace: openshift-machine-api
    spec:
      metadata: {}
      providerID: <provider_id> (3)
      providerSpec:
        value:
          apiVersion: baremetal.cluster.k8s.io/v1alpha1
          hostSelector: {}
          image:
            checksum: ""
            url: ""
          kind: BareMetalMachineProviderSpec
          userData:
            name: master-user-data-managed

    where:

    <new_control_plane_machine>

    Specifies the name of the new machine, which can be the same as the previously deleted machine name.

    <cluster_api_cluster>

    Specifies the CLUSTER-API-CLUSTER value for the other control plane machines, shown in the output of the previous step.

    <provider_id>

    Specifies the providerID value of the new bare metal host, shown in the output of an earlier step.
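
    After you save the file, create the Machine object from it. The file name new_machine.yaml is illustrative; use the name under which you saved the file:

    $ oc create -f new_machine.yaml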

    The following warning is expected:

    Warning: metadata.finalizers: "machine.machine.openshift.io": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
  4. Link the new control plane node and Machine object to the BareMetalHost object by performing the following steps in a single bash shell session:

    1. Define the NEW_NODE_NAME variable by running the following command:

      $ NEW_NODE_NAME=<new_node_name>

      Replace <new_node_name> with the name of the new control plane node.

    2. Define the NEW_MACHINE_NAME variable by running the following command:

      $ NEW_MACHINE_NAME=<new_machine_name>

      Replace <new_machine_name> with the name of the new machine.
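
      Example commands, using the node and machine names from the earlier examples:

      $ NEW_NODE_NAME=openshift-control-plane-2
      $ NEW_MACHINE_NAME=examplecluster-control-plane-2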

    3. Define the BMH_UID by running the following commands to extract it from the new node’s BareMetalHost object:

      $ BMH_UID=$(oc get -n openshift-machine-api bmh $NEW_NODE_NAME -ojson | jq -r .metadata.uid)
      $ echo $BMH_UID
    4. Patch the consumerRef object into the bare metal host by running the following command:

      $ oc patch -n openshift-machine-api bmh $NEW_NODE_NAME --type merge --patch '{"spec":{"consumerRef":{"apiVersion":"machine.openshift.io/v1beta1","kind":"Machine","name":"'$NEW_MACHINE_NAME'","namespace":"openshift-machine-api"}}}'
    5. Patch the providerID value into the new node by running the following command:

      $ oc patch node $NEW_NODE_NAME --type merge --patch '{"spec":{"providerID":"baremetalhost:///openshift-machine-api/'$NEW_NODE_NAME'/'$BMH_UID'"}}'
    6. Review the providerID values by running the following command:

      $ oc get node -l node-role.kubernetes.io/control-plane -ojson | jq -r '.items[] | .metadata.name + "  " + .spec.providerID'
  5. Set the BareMetalHost object’s poweredOn status to true by running the following command:

    $ oc patch -n openshift-machine-api bmh $NEW_NODE_NAME --subresource status --type json -p '[{"op":"replace","path":"/status/poweredOn","value":true}]'
  6. Review the BareMetalHost object’s poweredOn status by running the following command:

    $ oc get bmh -n openshift-machine-api -ojson | jq -r '.items[] | .metadata.name + "   PoweredOn:" +  (.status.poweredOn | tostring)'
  7. Review the BareMetalHost object’s provisioning state by running the following command:

    $ oc get bmh -n openshift-machine-api -ojson | jq -r '.items[] | .metadata.name + "   ProvisioningState:" +  .status.provisioning.state'

    If the provisioning state is not unmanaged, change the provisioning state by running the following command:

    $ oc patch -n openshift-machine-api bmh $NEW_NODE_NAME --subresource status --type json -p '[{"op":"replace","path":"/status/provisioning/state","value":"unmanaged"}]'
  8. Set the machine’s state to Provisioned by running the following command:

    $ oc patch -n openshift-machine-api machine $NEW_MACHINE_NAME --subresource status --type json -p '[{"op":"replace","path":"/status/phase","value":"Provisioned"}]'

Adding the new etcd member

Finish adding the new control plane node by adding the new etcd member to the cluster.

Procedure
  1. Add the new etcd member to the cluster by performing the following steps in a single bash shell session:

    1. Find the IP of the new control plane node by running the following command:

      $ oc get nodes -owide -l node-role.kubernetes.io/control-plane

      Make note of the node’s IP address for later use.
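
      For example, the following command prints only the internal IP address of the new node. Replace <new_node_name> with the name of the new control plane node:

      $ oc get node <new_node_name> -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}'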

    2. List the etcd pods by running the following command:

      $ oc get -n openshift-etcd pods -l k8s-app=etcd -o wide
    3. Connect to one of the running etcd pods by running the following command. Do not choose the pod on the new node; it is expected to be in a CrashLoopBackOff state until the new member is added.

      $ oc rsh -n openshift-etcd <running_pod>

      Replace <running_pod> with the name of a running pod shown in the previous step.

    4. View the etcd member list by running the following command:

      sh-4.2# etcdctl member list -w table
    5. Add the new control plane etcd member by running the following command:

      sh-4.2# etcdctl member add <new_node> --peer-urls="https://<ip_address>:2380"

      where:

      <new_node>

      Specifies the name of the new control plane node.

      <ip_address>

      Specifies the IP address of the new node.
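
      Example command, using the example node name and an illustrative IP address taken from the Nmstate example earlier in this procedure:

      sh-4.2# etcdctl member add openshift-control-plane-2 --peer-urls="https://192.168.20.11:2380"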

    6. Exit the rsh shell by running the following command:

      sh-4.2# exit
  2. Force an etcd redeployment by running the following command:

    $ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
  3. Turn the etcd quorum guard back on by running the following command:

    $ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null}}'
  4. Monitor the cluster Operator rollout by running the following command:

    $ watch oc get co