$ oc adm new-project logging --node-selector=""
The Elasticsearch, Fluentd, and Kibana (EFK) stack aggregates logs from nodes and applications running inside your OKD installation. Once deployed it uses Fluentd to aggregate event logs from all nodes, projects, and pods into Elasticsearch (ES). It also provides a centralized Kibana web UI where users and administrators can create rich visualizations and dashboards with the aggregated data.
Fluentd bulk uploads logs to an index, in JSON format, then Elasticsearch routes your search requests to the appropriate shards.
The general procedure for installing an aggregate logging stack in OKD is described in Aggregating Container Logs. There are some important things to keep in mind while going through the installation guide:
In order for the logging pods to spread evenly across your cluster, an empty node selector should be used.
$ oc adm new-project logging --node-selector=""
In conjunction with node labeling, which is done later, this controls pod placement across the logging project. You can now create the logging project.
$ oc project logging
Elasticsearch (ES) should be deployed with a cluster size of at least three for
resiliency to node failures. This is specified by setting the
openshift_logging_es_cluster_size
parameter in the inventory host file.
Refer to Ansible Variables for a full list of parameters.
If you do not have an existing Kibana installation, you can use
kibana.example.com as a value to openshift_logging_kibana_hostname
.
Installation can take some time depending on whether the images were already retrieved from the registry or not, and on the size of your cluster.
Inside the logging namespace, you can check your deployment with oc get all
.
$ oc get all NAME REVISION REPLICAS TRIGGERED BY logging-curator 1 1 logging-es-6cvk237t 1 1 logging-es-e5x4t4ai 1 1 logging-es-xmwvnorv 1 1 logging-kibana 1 1 NAME DESIRED CURRENT AGE logging-curator-1 1 1 3d logging-es-6cvk237t-1 1 1 3d logging-es-e5x4t4ai-1 1 1 3d logging-es-xmwvnorv-1 1 1 3d logging-kibana-1 1 1 3d NAME HOST/PORT PATH SERVICE TERMINATION LABELS logging-kibana kibana.example.com logging-kibana reencrypt component=support,logging-infra=support,provider=openshift logging-kibana-ops kibana-ops.example.com logging-kibana-ops reencrypt component=support,logging-infra=support,provider=openshift NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE logging-es 172.24.155.177 <none> 9200/TCP 3d logging-es-cluster None <none> 9300/TCP 3d logging-es-ops 172.27.197.57 <none> 9200/TCP 3d logging-es-ops-cluster None <none> 9300/TCP 3d logging-kibana 172.27.224.55 <none> 443/TCP 3d logging-kibana-ops 172.25.117.77 <none> 443/TCP 3d NAME READY STATUS RESTARTS AGE logging-curator-1-6s7wy 1/1 Running 0 3d logging-deployer-un6ut 0/1 Completed 0 3d logging-es-6cvk237t-1-cnpw3 1/1 Running 0 3d logging-es-e5x4t4ai-1-v933h 1/1 Running 0 3d logging-es-xmwvnorv-1-adr5x 1/1 Running 0 3d logging-fluentd-156xn 1/1 Running 0 3d logging-fluentd-40biz 1/1 Running 0 3d logging-fluentd-8k847 1/1 Running 0 3d
You should end up with a similar setup to the following.
$ oc get pods -o wide NAME READY STATUS RESTARTS AGE NODE logging-curator-1-6s7wy 1/1 Running 0 3d ip-172-31-24-239.us-west-2.compute.internal logging-deployer-un6ut 0/1 Completed 0 3d ip-172-31-6-152.us-west-2.compute.internal logging-es-6cvk237t-1-cnpw3 1/1 Running 0 3d ip-172-31-24-238.us-west-2.compute.internal logging-es-e5x4t4ai-1-v933h 1/1 Running 0 3d ip-172-31-24-235.us-west-2.compute.internal logging-es-xmwvnorv-1-adr5x 1/1 Running 0 3d ip-172-31-24-233.us-west-2.compute.internal logging-fluentd-156xn 1/1 Running 0 3d ip-172-31-24-241.us-west-2.compute.internal logging-fluentd-40biz 1/1 Running 0 3d ip-172-31-24-236.us-west-2.compute.internal logging-fluentd-8k847 1/1 Running 0 3d ip-172-31-24-237.us-west-2.compute.internal logging-fluentd-9a3qx 1/1 Running 0 3d ip-172-31-24-231.us-west-2.compute.internal logging-fluentd-abvgj 1/1 Running 0 3d ip-172-31-24-228.us-west-2.compute.internal logging-fluentd-bh74n 1/1 Running 0 3d ip-172-31-24-238.us-west-2.compute.internal ... ...
By default the amount of RAM allocated to each ES instance is 8GB.
openshift_logging_es_memory_limit
is the parameter used in the openshift-ansible
host inventory file.
Keep in mind that half of this value will be passed to the individual
elasticsearch pods java processes
heap
size.
At 100 nodes or more, it is recommended to first pre-pull the logging images
from docker pull registry.access.redhat.com/openshift3/logging-fluentd:v3.7
.
After deploying the logging infrastructure pods (Elasticsearch, Kibana, and
Curator), node labeling should be done in steps of 20 nodes at a time. For
example:
Using a simple loop:
$ while read node; do oc label nodes $node logging-infra-fluentd=true; done < 20_fluentd.lst
The following also works:
$ oc label nodes 10.10.0.{100..119} logging-infra-fluentd=true
Labeling nodes in groups paces the DaemonSets used by OpenShift logging, helping to avoid contention on shared resources such as the image registry.
Check for the occurence of any "CrashLoopBackOff | ImagePullFailed | Error" issues.
|
Rate-limiting
In Red Hat Enterprise Linux (RHEL) 7 the systemd-journald.socket unit creates /dev/log during the boot process, and then passes input to systemd-journald.service. Every syslog() call goes to the journal.
Rsyslog uses the imjournal module as a default input mode for journal files. Refer to Interaction of rsyslog and journal for detailed information about this topic.
A simple test harness was developed, which uses
logger across the cluster nodes to make
entries of different sizes at different rates in the system log. During testing
simulations under a default Red Hat Enterprise Linux (RHEL) 7 installation with
systemd-219-19.el7.x86_64
at certain logging rates (approximately 40 log lines
per second), we encountered the default rate limit of rsyslogd
. After
adjusting these limits, entries stopped being written to journald due to local
journal file corruption.
This issue is resolved in
later versions of systemd.
Scaling up
As you scale up your project, the default logging environment might need some adjustments. After updating to systemd-219-22.el7.x86_64, we added:
$IMUXSockRateLimitInterval 0 $IMJournalRatelimitInterval 0
to /etc/rsyslog.conf and:
# Disable rate limiting RateLimitInterval=1s RateLimitBurst=10000 Storage=volatile Compress=no MaxRetentionSec=30s
to /etc/systemd/journald.conf.
Now, restart the services.
$ systemctl restart systemd-journald.service $ systemctl restart rsyslog.service
These settings account for the bursty nature of uploading in bulk.
After removing the rate limit, you may see increased CPU utilization on the system logging daemons as it processes any messages that would have previously been throttled.
Rsyslog is configured (see ratelimit.interval, ratelimit.burst) to rate-limit entries read from the journal at 10,000 messages in 300 seconds. A good rule of thumb is to ensure that the rsyslog rate-limits account for the systemd-journald rate-limits.
If you do not indicate the desired scale at first deployment, the least
disruptive way of adjusting your cluster is by re-running the Ansible logging playbook
after updating the inventory file with an updated openshift_logging_es_cluster_size
value.
parameter. Refer to the
Performing
Administrative Elasticsearch Operations section for more in-depth information.
A highly-available Elasticsearch environment requires at least three Elasticsearch nodes,
each on a different host, and setting the |
An Elasticsearch index is a collection of shards and their corresponding replicas. This is how ES implements high availability internally, therefore there is little need to use hardware based mirroring RAID variants. RAID 0 can still be used to increase overall disk performance.
Every search request needs to hit a copy of every shard in the index. Each ES instance requires its own individual storage, but an OKD deployment can only provide volumes shared by all of its pods, which again means that Elasticsearch shouldn’t be implemented with a single node.
A persistent volume should be added to each Elasticsearch deployment configuration so that we have one volume per replica shard. On OKD this is often achieved through Persistent Volume Claims
1 volume per shard
1 volume per replica shard
The PVCs must be named based on the openshift_logging_es_pvc_prefix setting. Refer to Persistent Elasticsearch Storage for more details.
Below are capacity planning guidelines for OKD aggregate logging. Example scenario
Assumptions:
Which application: Apache
Bytes per line: 256
Lines per second load on application: 1
Raw text data → JSON
Baseline (256 characters per second → 15KB/min)
Logging Infra Pods | Storage Throughput |
---|---|
3 es 1 kibana 1 curator 1 fluentd |
6 pods total: 90000 x 1440 = 128,6 MB/day |
3 es 1 kibana 1 curator 11 fluentd |
16 pods total: 240000 x 1440 = 345,6 MB/day |
3 es 1 kibana 1 curator 20 fluentd |
25 pods total: 375000 x 1440 = 540 MB/day |
Calculating total logging throughput and disk space required for your logging environment requires knowledge of your application. For example, if one of your applications on average logs 10 lines-per-second, each 256 bytes-per-line, calculate per-application throughput and disk space as follows:
(bytes-per-line * (lines-per-second) = 2560 bytes per app per second (2560) * (number-of-pods-per-node,100) = 256,000 bytes per second per node 256k * (number-of-nodes) = total logging throughput per cluster
Fluentd ships any logs from systemd journal and /var/lib/docker/containers/ to Elasticsearch. Learn more.
Local SSD drives are recommended in order to achieve the best performance. In Red Hat Enterprise Linux (RHEL) 7, the deadline IO scheduler is the default for all block devices except SATA disks. For SATA disks, the default IO scheduler is cfq.
Sizing storage for ES is greatly dependent on how you optimize your indices. Therefore, consider how much data you need in advance and that you are aggregating application log data. Some Elasticsearch users have found that it is necessary to keep absolute storage consumption around 50% and below 70% at all times. This helps to avoid Elasticsearch becoming unresponsive during large merge operations.