Understanding cluster logging alerts

All logging collector alerts are listed in the Alerting UI of the OpenShift Container Platform web console.

Viewing logging collector alerts

Alerts are shown in the OpenShift Container Platform web console, on the Alerts tab of the Alerting UI. Alerts are in one of the following states:

Firing. The alert condition has been true for the duration of the timeout. Click the Options menu at the end of the firing alert to view more information or silence the alert.
Pending. The alert condition is currently true, but the timeout has not been reached.
Not Firing. The alert is not currently triggered.
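
The three states can be read as a decision on whether the alert condition holds and for how long it has held. The following Python sketch is illustrative only: the names `AlertState` and `alert_state` and the parameters are hypothetical, and the code is not taken from the Alerting UI or from Prometheus.

```python
from enum import Enum


class AlertState(Enum):
    NOT_FIRING = "Not Firing"
    PENDING = "Pending"
    FIRING = "Firing"


def alert_state(condition_true: bool, seconds_true: float, timeout_seconds: float) -> AlertState:
    """Classify an alert from its condition and how long the condition has held.

    Hypothetical helper: assumes the timeout is expressed in seconds and that
    an alert fires once the condition has held for at least the full timeout.
    """
    if not condition_true:
        return AlertState.NOT_FIRING
    if seconds_true >= timeout_seconds:
        return AlertState.FIRING
    return AlertState.PENDING


# A condition that has held for 3 minutes against a 10-minute timeout is still pending.
print(alert_state(True, 180, 600).value)  # Pending
```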

Procedure

To view cluster logging and other OpenShift Container Platform alerts:

In the OpenShift Container Platform console, click Monitoring → Alerting.
Click the Alerts tab. The alerts are listed, based on the filters selected.

Additional resources

For more information on the Alerting UI, see Managing cluster alerts.

About logging collector alerts

The following alerts are generated by the logging collector. You can view these alerts in the OpenShift Container Platform web console, on the Alerts tab of the Alerting UI.

Table 1. Fluentd Prometheus alerts
Alert	Message	Description	Severity
`FluentdErrorsHigh`	`In the last minute, <value> errors reported by fluentd <instance>.`	Fluentd is reporting more errors than the specified threshold (default: 10).	Critical
`FluentdNodeDown`	`Prometheus could not scrape fluentd <instance> for more than 10m.`	Prometheus could not scrape a specific Fluentd instance.	Critical
`FluentdQueueLengthBurst`	`In the last minute, fluentd <instance> buffer queue length increased more than 32. Current value is <value>.`	Fluentd is reporting that it is overwhelmed.	Warning
`FluentdQueueLengthIncreasing`	`In the last 12h, fluentd <instance> buffer queue length constantly increased more than 1. Current value is <value>.`	Fluentd is reporting queue usage issues.	Critical
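
To make the thresholds in the table concrete, the following Python sketch checks two of the conditions and fills in the message templates. It is an assumption-laden illustration: the function names, parameters, and the instance name `fluentd-abc12` are hypothetical, and the actual alerts are evaluated by Prometheus rules that are not shown here.

```python
# Message templates copied from the table above; <value> and <instance>
# are turned into format fields.
ERRORS_HIGH_MSG = "In the last minute, {value} errors reported by fluentd {instance}."
QUEUE_BURST_MSG = (
    "In the last minute, fluentd {instance} buffer queue length "
    "increased more than 32. Current value is {value}."
)


def errors_high(errors_last_minute: int, threshold: int = 10) -> bool:
    """True when the error count exceeds the threshold (default 10, per the table)."""
    return errors_last_minute > threshold


def queue_length_burst(length_one_minute_ago: int, current_length: int, max_increase: int = 32) -> bool:
    """True when the buffer queue grew by more than max_increase in the last minute."""
    return current_length - length_one_minute_ago > max_increase


def render(template: str, instance: str, value: int) -> str:
    """Fill an alert message template with an instance name and a value."""
    return template.format(instance=instance, value=value)


if errors_high(14):
    # "fluentd-abc12" is a made-up instance name for illustration.
    print(render(ERRORS_HIGH_MSG, "fluentd-abc12", 14))
```

With 14 errors in the last minute, the check passes and the sketch prints the `FluentdErrorsHigh` message for the hypothetical instance.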

About Elasticsearch alerting rules

You can view these alerting rules in Prometheus.

Alert	Description	Severity
ElasticsearchClusterNotHealthy	Cluster health status has been RED for at least 2m. Cluster does not accept writes, shards may be missing or master node hasn’t been elected yet.	critical
ElasticsearchClusterNotHealthy	Cluster health status has been YELLOW for at least 20m. Some shard replicas are not allocated.	warning
ElasticsearchBulkRequestsRejectionJumps	High Bulk Rejection Ratio at node in cluster. This node may not be keeping up with the indexing speed.	warning
ElasticsearchNodeDiskWatermarkReached	Disk Low Watermark Reached at node in cluster. Shards cannot be allocated to this node anymore. You should consider adding more disk space to the node.	alert
ElasticsearchNodeDiskWatermarkReached	Disk High Watermark Reached at node in cluster. Some shards will be re-allocated to different nodes if possible. Make sure more disk space is added to the node or drop old indices allocated to this node.	high
ElasticsearchJVMHeapUseHigh	JVM Heap usage on the node in cluster is <value>	alert
AggregatedloggingSystemCPUHigh	System CPU usage on the node in cluster is <value>	alert
ElasticsearchProcessCPUHigh	ES process CPU usage on the node in cluster is <value>	alert
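
The two `ElasticsearchClusterNotHealthy` rows differ only in the health status and in how long that status must persist. A minimal Python sketch of that mapping follows; the function name `cluster_health_severity` and its parameters are hypothetical, and the actual rule is evaluated by Prometheus rather than by code like this.

```python
from typing import Optional


def cluster_health_severity(status: str, minutes_in_status: float) -> Optional[str]:
    """Map a cluster health status and its duration to an alert severity.

    Durations come from the table above: RED for at least 2m is critical,
    YELLOW for at least 20m is a warning. Returns None when no alert applies.
    """
    status = status.upper()
    if status == "RED" and minutes_in_status >= 2:
        return "critical"
    if status == "YELLOW" and minutes_in_status >= 20:
        return "warning"
    return None


print(cluster_health_severity("yellow", 25))  # warning
```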
