$ oc new-project validate $ oc new-app cakephp-mysql-example
This topic contains steps to verify the overall health of the OKD cluster and the various components, as well as describing the intended behavior.
Knowing the verification process for the various components is the first step to troubleshooting issues. If experiencing issues, you can use the checks provided in this section to diagnose any problems.
To verify the end-to-end functionality of an OKD cluster, build and deploy an example application.
Create a new project named validate
, as well as an example application from the cakephp-mysql-example
template:
$ oc new-project validate $ oc new-app cakephp-mysql-example
You can check the logs to follow the build:
$ oc logs -f bc/cakephp-mysql-example
Once the build is complete, two pods should be running: a database and an application:
$ oc get pods NAME READY STATUS RESTARTS AGE cakephp-mysql-example-1-build 0/1 Completed 0 1m cakephp-mysql-example-2-247xm 1/1 Running 0 39s mysql-1-hbk46 1/1 Running 0 1m
Visit the application URL. The Cake PHP framework welcome page should be
visible. The URL should have the following format
cakephp-mysql-example-validate.<app_domain>
.
Once the functionality has been verified, the validate
project can be
deleted:
$ oc delete project validate
All resources within the project will be deleted as well.
You can integrate OKD with Prometheus to create visuals and alerts to help diagnose any environment issues before they arise. These issues can include if a node goes down, if a pod is consuming too much CPU or memory, and more.
See the Prometheus on OpenShift Container Platform section in the Installation and configuration guide for more information.
Prometheus on OKD is a Technology Preview feature only. |
To verify that the cluster is up and running, connect to a master instance, and run the following:
$ oc get nodes NAME STATUS AGE VERSION ocp-infra-node-1clj Ready 1h v1.6.1+5115d708d7 ocp-infra-node-86qr Ready 1h v1.6.1+5115d708d7 ocp-infra-node-g8qw Ready 1h v1.6.1+5115d708d7 ocp-master-94zd Ready,SchedulingDisabled 1h v1.6.1+5115d708d7 ocp-master-gjkm Ready,SchedulingDisabled 1h v1.6.1+5115d708d7 ocp-master-wc8w Ready,SchedulingDisabled 1h v1.6.1+5115d708d7 ocp-node-c5dg Ready 1h v1.6.1+5115d708d7 ocp-node-ghxn Ready 1h v1.6.1+5115d708d7 ocp-node-w135 Ready 1h v1.6.1+5115d708d7
The above cluster example consists of three master hosts, three infrastructure node hosts, and three node hosts. All of them are running. All hosts in the cluster should be visible in this output.
The Ready
status means that master hosts can communicate with node hosts and
that the nodes are ready to run pods (excluding the nodes in which scheduling is
disabled).
Before you run etcd commands, source the etcd.conf file:
# source /etc/etcd/etcd.conf
You can check the basic etcd health status from any master instance with the
etcdctl
command:
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \ --ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS cluster-health member 59df5107484b84df is healthy: got healthy result from https://10.156.0.5:2379 member 6df7221a03f65299 is healthy: got healthy result from https://10.156.0.6:2379 member fea6dfedf3eecfa3 is healthy: got healthy result from https://10.156.0.9:2379 cluster is healthy
However, to get more information about etcd hosts, including the associated master host:
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \ --ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS member list 295750b7103123e0: name=ocp-master-zh8d peerURLs=https://10.156.0.7:2380 clientURLs=https://10.156.0.7:2379 isLeader=true b097a72f2610aea5: name=ocp-master-qcg3 peerURLs=https://10.156.0.11:2380 clientURLs=https://10.156.0.11:2379 isLeader=false fea6dfedf3eecfa3: name=ocp-master-j338 peerURLs=https://10.156.0.9:2380 clientURLs=https://10.156.0.9:2379 isLeader=false
All etcd hosts should contain the master host name if the etcd cluster is co-located with master services, or all etcd instances should be visible if etcd is running separately.
|
To check if a router service is running:
$ oc -n default get deploymentconfigs/router NAME REVISION DESIRED CURRENT TRIGGERED BY router 1 3 3 config
The values in the DESIRED
and CURRENT
columns should match the number of
nodes hosts.
Use the same command to check the registry status:
$ oc -n default get deploymentconfigs/docker-registry NAME REVISION DESIRED CURRENT TRIGGERED BY docker-registry 1 3 3 config
Multiple running instances of the container registry require backend storage supporting writes by multiple processes. If the chosen infrastructure provider does not contain this ability, running a single instance of a container registry is acceptable. |
To verify that all pods are running and on which hosts:
$ oc -n default get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE docker-registry-1-54nhl 1/1 Running 0 2d 172.16.2.3 ocp-infra-node-tl47 docker-registry-1-jsm2t 1/1 Running 0 2d 172.16.8.2 ocp-infra-node-62rc docker-registry-1-qbt4g 1/1 Running 0 2d 172.16.14.3 ocp-infra-node-xrtz registry-console-2-gbhcz 1/1 Running 0 2d 172.16.8.4 ocp-infra-node-62rc router-1-6zhf8 1/1 Running 0 2d 10.156.0.4 ocp-infra-node-62rc router-1-ffq4g 1/1 Running 0 2d 10.156.0.10 ocp-infra-node-tl47 router-1-zqxbl 1/1 Running 0 2d 10.156.0.8 ocp-infra-node-xrtz
If OKD is using an external container registry, the internal registry service does not need to be running. |
Network connectivity has two main networking layers: the cluster network for node interaction, and the software defined network (SDN) for pod interaction. OKD supports multiple network configurations, often optimized for a specific infrastructure provider.
Due to the complexity of networking, not all verification scenarios are covered in this section. |
etcd and master hosts
Master services keep their state synchronized using the etcd key-value store.
Communication between master and etcd services is important, whether those
etcd services are collocated on master hosts, or running on hosts designated
only for the etcd service. This communication happens on TCP ports 2379
and
2380
. See the
Host
health section for methods to check this communication.
SkyDNS
SkyDNS
provides name resolution of local services running in OKD.
This service uses TCP
and UDP
port 8053
.
To verify the name resolution:
$ dig +short docker-registry.default.svc.cluster.local 172.30.150.7
If the answer matches the output of the following, SkyDNS
service is working correctly:
$ oc get svc/docker-registry -n default NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE docker-registry 172.30.150.7 <none> 5000/TCP 3d
API service and web console
Both the API service and web console share the same port, usually TCP
8443
or 443
, depending on the setup. This port needs to be available within the
cluster and to everyone who needs to work with the deployed environment. The
URLs under which this port is reachable may differ for internal cluster and for
external clients.
In the following example, the https://internal-master.example.com:443
URL is
used by the internal cluster, and the https://master.example.com:443
URL is
used by external clients. On any node host:
$ curl https://internal-master.example.com:443/version { "major": "1", "minor": "6", "gitVersion": "v1.6.1+5115d708d7", "gitCommit": "fff65cf", "gitTreeState": "clean", "buildDate": "2017-10-11T22:44:25Z", "goVersion": "go1.7.6", "compiler": "gc", "platform": "linux/amd64" }
This must be reachable from client’s network:
$ curl -k https://master.example.com:443/healthz ok
The SDN connecting pod communication on nodes uses UDP
port 4789
by default.
To verify node host functionality, create a new application. The following example ensures the node reaches the docker registry, which is running on an infrastructure node:
Create a new project:
$ oc new-project sdn-test
Deploy an httpd application:
$ oc new-app centos/httpd-24-centos7~https://github.com/sclorg/httpd-ex
Wait until the build is complete:
$ oc get pods NAME READY STATUS RESTARTS AGE httpd-ex-1-205hz 1/1 Running 0 34s httpd-ex-1-build 0/1 Completed 0 1m
Connect to the running pod:
$ oc rsh po/<pod-name>
For example:
$ oc rsh po/httpd-ex-1-205hz
Check the healthz
path of the internal registry service:
$ curl -kv https://docker-registry.default.svc.cluster.local:5000/healthz * About to connect() to docker-registry.default.svc.cluster.locl port 5000 (#0) * Trying 172.30.150.7... * Connected to docker-registry.default.svc.cluster.local (172.30.150.7) port 5000 (#0) * Initializing NSS with certpath: sql:/etc/pki/nssdb * skipping SSL peer certificate verification * SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 * Server certificate: * subject: CN=172.30.150.7 * start date: Nov 30 17:21:51 2017 GMT * expire date: Nov 30 17:21:52 2019 GMT * common name: 172.30.150.7 * issuer: CN=openshift-signer@1512059618 > GET /healthz HTTP/1.1 > User-Agent: curl/7.29.0 > Host: docker-registry.default.svc.cluster.local:5000 > Accept: */* > < HTTP/1.1 200 OK < Cache-Control: no-cache < Date: Mon, 04 Dec 2017 16:26:49 GMT < Content-Length: 0 < Content-Type: text/plain; charset=utf-8 < * Connection #0 to host docker-registry.default.svc.cluster.local left intact sh-4.2$ *exit*
The HTTP/1.1 200 OK
response means the node is correctly connecting.
Clean up the test project:
$ oc delete project sdn-test project "sdn-test" deleted
The node host is listening on TCP
port 10250
. This port needs to be
reachable by all master hosts on any node, and if monitoring is deployed in the
cluster, the infrastructure nodes must have access to this port on all instances
as well. Broken communication on this port can be detected with the following
command:
$ oc get nodes NAME STATUS AGE VERSION ocp-infra-node-1clj Ready 4d v1.6.1+5115d708d7 ocp-infra-node-86qr Ready 4d v1.6.1+5115d708d7 ocp-infra-node-g8qw Ready 4d v1.6.1+5115d708d7 ocp-master-94zd Ready,SchedulingDisabled 4d v1.6.1+5115d708d7 ocp-master-gjkm Ready,SchedulingDisabled 4d v1.6.1+5115d708d7 ocp-master-wc8w Ready,SchedulingDisabled 4d v1.6.1+5115d708d7 ocp-node-c5dg Ready 4d v1.6.1+5115d708d7 ocp-node-ghxn Ready 4d v1.6.1+5115d708d7 ocp-node-w135 NotReady 4d v1.6.1+5115d708d7
In the output above, the node service on the ocp-node-w135
node is
not reachable by the master services, which is represented by its NotReady
status.
The last service is the router, which is responsible for routing connections
to the correct services running in the OKD cluster. Routers listen
on TCP
ports 80
and 443
on infrastructure nodes for ingress traffic.
Before routers can start working, DNS must be configured:
$ dig *.apps.example.com ; <<>> DiG 9.11.1-P3-RedHat-9.11.1-8.P3.fc27 <<>> *.apps.example.com ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45790 ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 4096 ;; QUESTION SECTION: ;*.apps.example.com. IN A ;; ANSWER SECTION: *.apps.example.com. 3571 IN CNAME apps.example.com. apps.example.com. 3561 IN A 35.xx.xx.92 ;; Query time: 0 msec ;; SERVER: 127.0.0.1#53(127.0.0.1) ;; WHEN: Tue Dec 05 16:03:52 CET 2017 ;; MSG SIZE rcvd: 105
The IP address, in this case 35.xx.xx.92
, should be pointing to the load
balancer distributing ingress traffic to all infrastructure nodes. To verify the
functionality of the routers, check the registry service once more, but this
time from outside the cluster:
$ curl -kv https://docker-registry-default.apps.example.com/healthz * Trying 35.xx.xx.92... * TCP_NODELAY set * Connected to docker-registry-default.apps.example.com (35.xx.xx.92) port 443 (#0) ... < HTTP/2 200 < cache-control: no-cache < content-type: text/plain; charset=utf-8 < content-length: 0 < date: Tue, 05 Dec 2017 15:13:27 GMT < * Connection #0 to host docker-registry-default.apps.example.com left intact
Master instances need at least 40 GB of hard disk space for the /var
directory. Check the disk usage of a master host using the df
command:
$ df -hT Filesystem Type Size Used Avail Use% Mounted on /dev/sda1 xfs 45G 2.8G 43G 7% / devtmpfs devtmpfs 3.6G 0 3.6G 0% /dev tmpfs tmpfs 3.6G 0 3.6G 0% /dev/shm tmpfs tmpfs 3.6G 63M 3.6G 2% /run tmpfs tmpfs 3.6G 0 3.6G 0% /sys/fs/cgroup tmpfs tmpfs 732M 0 732M 0% /run/user/1000 tmpfs tmpfs 732M 0 732M 0% /run/user/0
Node instances need at least 15 GB space for the /var
directory, and at least
another 15 GB for Docker storage (/var/lib/docker
in this case). Depending on
the size of the cluster and the amount of ephemeral storage desired for pods, a
separate partition should be created for
/var/lib/origin/openshift.local.volumes
on the nodes.
$ df -hT Filesystem Type Size Used Avail Use% Mounted on /dev/sda1 xfs 25G 2.4G 23G 10% / devtmpfs devtmpfs 3.6G 0 3.6G 0% /dev tmpfs tmpfs 3.6G 0 3.6G 0% /dev/shm tmpfs tmpfs 3.6G 147M 3.5G 4% /run tmpfs tmpfs 3.6G 0 3.6G 0% /sys/fs/cgroup /dev/sdb xfs 25G 2.7G 23G 11% /var/lib/docker /dev/sdc xfs 50G 33M 50G 1% /var/lib/origin/openshift.local.volumes tmpfs tmpfs 732M 0 732M 0% /run/user/1000
Persistent storage for pods should be handled outside of the instances running the OKD cluster. Persistent volumes for pods can be provisioned by the infrastructure provider, or with the use of container native storage or container ready storage.
Docker Storage can be backed by one of two options. The first is a thin pool logical volume with device mapper, the second, since Red Hat Enterprise Linux version 7.4, is an overlay2 file system. The overlay2 file system is generally recommended due to the ease of setup and increased performance.
The Docker storage disk is mounted as /var/lib/docker
and formatted with xfs
file system. Docker storage is configured to use overlay2 filesystem:
$ cat /etc/sysconfig/docker-storage DOCKER_STORAGE_OPTIONS='--storage-driver overlay2'
To verify this storage driver is used by Docker:
# docker info Containers: 4 Running: 4 Paused: 0 Stopped: 0 Images: 4 Server Version: 1.12.6 Storage Driver: overlay2 Backing Filesystem: xfs Logging Driver: journald Cgroup Driver: systemd Plugins: Volume: local Network: overlay host bridge null Authorization: rhel-push-plugin Swarm: inactive Runtimes: docker-runc runc Default Runtime: docker-runc Security Options: seccomp selinux Kernel Version: 3.10.0-693.11.1.el7.x86_64 Operating System: Employee SKU OSType: linux Architecture: x86_64 Number of Docker Hooks: 3 CPUs: 2 Total Memory: 7.147 GiB Name: ocp-infra-node-1clj ID: T7T6:IQTG:WTUX:7BRU:5FI4:XUL5:PAAM:4SLW:NWKL:WU2V:NQOW:JPHC Docker Root Dir: /var/lib/docker Debug Mode (client): false Debug Mode (server): false Registry: https://registry.access.redhat.com/v1/ WARNING: bridge-nf-call-iptables is disabled WARNING: bridge-nf-call-ip6tables is disabled Insecure Registries: 127.0.0.0/8 Registries: registry.access.redhat.com (secure), registry.access.redhat.com (secure), docker.io (secure)
The OpenShift API service, atomic-openshift-master-api.service
, runs on all
master instances. To see the status of the service:
$ systemctl status atomic-openshift-master-api.service ● atomic-openshift-master-api.service - Atomic OpenShift Master API Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled) Active: active (running) since Thu 2017-11-30 11:40:19 EST; 5 days ago Docs: https://github.com/openshift/origin Main PID: 30047 (openshift) Memory: 65.0M CGroup: /system.slice/atomic-openshift-master-api.service └─30047 /usr/bin/openshift start master api --config=/etc/origin/master/ma... Dec 06 09:15:49 ocp-master-94zd atomic-openshift-master-api[30047]: I1206 09:15:49.85... Dec 06 09:15:50 ocp-master-94zd atomic-openshift-master-api[30047]: I1206 09:15:50.96... Dec 06 09:15:52 ocp-master-94zd atomic-openshift-master-api[30047]: I1206 09:15:52.34...
The API service exposes a health check, which can be queried externally with:
$ curl -k https://master.example.com/healthz ok
The OKD controller service,
atomic-openshift-master-controllers.service
, is available across all master
hosts. The service runs in active/passive mode, meaning it should only be
running on one master at any time.
The OKD controllers execute a procedure to choose which host runs
the service. The current running value is stored in an annotation in a special
configmap
stored in the kube-system
project.
Verify the master host running the atomic-openshift-master-controllers
service as a cluster-admin
user:
$ oc get -n kube-system cm openshift-master-controllers -o yaml apiVersion: v1 kind: ConfigMap metadata: annotations: control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"master-ose-master-0.example.com-10.19.115.212-dnwrtcl4","leaseDurationSeconds":15,"acquireTime":"2018-02-17T18:16:54Z","renewTime":"2018-02-19T13:50:33Z","leaderTransitions":16}' creationTimestamp: 2018-02-02T10:30:04Z name: openshift-master-controllers namespace: kube-system resourceVersion: "17349662" selfLink: /api/v1/namespaces/kube-system/configmaps/openshift-master-controllers uid: 08636843-0804-11e8-8580-fa163eb934f0
The command outputs the current master controller in the
control-plane.alpha.kubernetes.io/leader
annotation, within the
holderIdentity
property as:
master-<hostname>-<ip>-<8_random_characters>
Find the hostname of the master host by filtering the output using the following:
$ oc get -n kube-system cm openshift-master-controllers -o json | jq -r '.metadata.annotations[] | fromjson.holderIdentity | match("^master-(.*)-[0-9.]*-[0-9a-z]{8}$") | .captures[0].string' ose-master-0.example.com
Verifying the maximum transmission unit (MTU) prevents a possible networking misconfiguration that can masquerade as an SSL certificate issue.
When a packet is larger than the MTU size that is transmitted over HTTP, the physical network router is able to break the packet into multiple packets to transmit the data. However, when a packet is larger than the MTU size is that transmitted over HTTPS, the router is forced to drop the packet.
Installation produces certificates that provide secure connections to multiple components that include:
master hosts
node hosts
infrastructure nodes
registry
router
These certificates can be found within the /etc/origin/master
directory for
the master nodes and /etc/origin/node
directory for the infra and app nodes.
After installation, you can verify connectivity to the
REGISTRY_OPENSHIFT_SERVER_ADDR
using the process outlined in the
Network
connectivity section.
From a master host, get the HTTPS address:
$ oc get dc docker-registry -o jsonpath='{.spec.template.spec.containers[].env[?(@.name=="OPENSHIFT_DEFAULT_REGISTRY")].value}{"\n"}' docker-registry.default.svc:5000
The above gives the output of docker-registry.default.svc:5000
.
Append /healthz
to the value given above, use it to check on all hosts
(master, infrastructure, node):
$ curl -v https://docker-registry.default.svc:5000/healthz * About to connect() to docker-registry.default.svc port 5000 (#0) * Trying 172.30.11.171... * Connected to docker-registry.default.svc (172.30.11.171) port 5000 (#0) * Initializing NSS with certpath: sql:/etc/pki/nssdb * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none * SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 * Server certificate: * subject: CN=172.30.11.171 * start date: Oct 18 05:30:10 2017 GMT * expire date: Oct 18 05:30:11 2019 GMT * common name: 172.30.11.171 * issuer: CN=openshift-signer@1508303629 > GET /healthz HTTP/1.1 > User-Agent: curl/7.29.0 > Host: docker-registry.default.svc:5000 > Accept: */* > < HTTP/1.1 200 OK < Cache-Control: no-cache < Date: Tue, 24 Oct 2017 19:42:35 GMT < Content-Length: 0 < Content-Type: text/plain; charset=utf-8 < * Connection #0 to host docker-registry.default.svc left intact
The above example output shows the MTU size being used to ensure the SSL connection is correct. The attempt to connect is successful, followed by connectivity being established and completes with initializing the NSS with the certpath and all the server certificate information regarding the docker-registry.
An improper MTU size results in a timeout:
$ curl -v https://docker-registry.default.svc:5000/healthz * About to connect() to docker-registry.default.svc port 5000 (#0) * Trying 172.30.11.171... * Connected to docker-registry.default.svc (172.30.11.171) port 5000 (#0) * Initializing NSS with certpath: sql:/etc/pki/nssdb
The above example shows that the connection is established, but it cannot finish
initializing NSS with certpath. The issue deals with improper MTU size set
within the /etc/origin/node/node-config.yaml
file.
To fix this issue, adjust the MTU size within the
/etc/origin/node/node-config.yaml
to 50 bytes smaller than the MTU size being
used by the OpenShift SDN Ethernet device.
View the MTU size of the desired Ethernet device (i.e. eth0
):
$ ip link show eth0 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000 link/ether fa:16:3e:92:6a:86 brd ff:ff:ff:ff:ff:ff
The above shows MTU set to 1500.
To change the MTU size, modify the /etc/origin/node/node-config.yaml
file
and set a value that is 50 bytes smaller than output provided by the ip
command.
For example, if the MTU size is set to 1500, adjust the MTU size to 1450 within
the /etc/origin/node/node-config.yaml
file:
networkConfig:
mtu: 1450
Save the changes and reboot the node:
You must change the MTU size on all masters and nodes that are part of the OKD SDN. Also, the MTU size of the tun0 interface must be the same across all nodes that are part of the cluster. |
Once the node is back online, confirm the issue no longer exists by re-running
the original curl
command.
$ curl -v https://docker-registry.default.svc:5000/healthz
If the timeout persists, continue to adjust the MTU size in increments of 50 bytes and repeat the process.