Understanding cluster logging alerts

All of the logging collector alerts are listed in the Alerting UI of the OpenShift Container Platform web console.

Viewing logging collector alerts

Alerts are shown in the OpenShift Container Platform web console, on the Alerts tab of the Alerting UI. Alerts are in one of the following states:

Firing. The alert condition has been true for the duration of the timeout. Click the Options menu at the end of the firing alert to view more information or silence the alert.
Pending. The alert condition is currently true, but the timeout has not been reached.
Not Firing. The alert is not currently triggered.
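
The relationship between the three states can be sketched in a few lines of Python. This is an illustration only, not how the Alerting UI or Prometheus is implemented; the function name `alert_state` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def alert_state(active_since: Optional[datetime],
                now: datetime,
                timeout: timedelta) -> str:
    """Classify an alert as 'Not Firing', 'Pending', or 'Firing'.

    active_since: when the alert condition became true, or None if the
                  condition is currently false (hypothetical input).
    timeout:      how long the condition must hold before the alert fires.
    """
    if active_since is None:
        # The condition is not true right now.
        return "Not Firing"
    if now - active_since >= timeout:
        # The condition has held for the full timeout.
        return "Firing"
    # The condition is true, but the timeout has not been reached yet.
    return "Pending"
```

For example, with a 10-minute timeout, a condition that became true 5 minutes ago is Pending, and one that became true 15 minutes ago is Firing.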

Procedure

To view cluster logging and other OpenShift Container Platform alerts:

In the OpenShift Container Platform console, click Monitoring → Alerting.
Click the Alerts tab. The alerts are listed, based on the filters selected.

Additional resources

For more information on the Alerting UI, see Managing alerts.

About logging collector alerts

The following alerts are generated by the logging collector. You can view these alerts in the OpenShift Container Platform web console, on the Alerts tab of the Alerting UI.

Table 1. Fluentd Prometheus alerts
Alert	Message	Description	Severity
`FluentDHighErrorRate`	`<value> of records have resulted in an error by fluentd <instance>.`	The number of Fluentd output errors is high, by default more than 10 in the previous 15 minutes.	Warning
`FluentdNodeDown`	`Prometheus could not scrape fluentd <instance> for more than 10m.`	Fluentd is reporting that Prometheus could not scrape a specific Fluentd instance.	Critical
`FluentdQueueLengthBurst`	`In the last minute, fluentd <instance> buffer queue length increased more than 32. Current value is <value>.`	Fluentd is reporting that it cannot keep up with the data being indexed.	Warning
`FluentdQueueLengthIncreasing`	`In the last 12h, fluentd <instance> buffer queue length constantly increased more than 1. Current value is <value>.`	Fluentd is reporting that the queue size is increasing.	Critical
`FluentDVeryHighErrorRate`	`<value> of records have resulted in an error by fluentd <instance>.`	The number of Fluentd output errors is very high, by default more than 25 in the previous 15 minutes.	Critical
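
As a rough illustration of how the two error-rate alerts in the table relate, the following Python sketch maps an error value for the previous 15 minutes to the most severe matching alert. The thresholds 10 and 25 are the defaults from the table; the function name `fluentd_error_alert` is hypothetical, and the sketch assumes the value is expressed in the same unit the alert rules use. The actual rules are evaluated by Prometheus, not by code like this.

```python
from typing import Optional, Tuple

# Default thresholds taken from the table above.
HIGH_ERROR_THRESHOLD = 10       # FluentDHighErrorRate (Warning)
VERY_HIGH_ERROR_THRESHOLD = 25  # FluentDVeryHighErrorRate (Critical)


def fluentd_error_alert(error_value: float) -> Optional[Tuple[str, str]]:
    """Return (alert name, severity) for the most severe matching alert.

    error_value: Fluentd output errors over the previous 15 minutes
                 (hypothetical input, same unit as the thresholds).
    Returns None when neither threshold is exceeded.
    """
    if error_value > VERY_HIGH_ERROR_THRESHOLD:
        return ("FluentDVeryHighErrorRate", "Critical")
    if error_value > HIGH_ERROR_THRESHOLD:
        return ("FluentDHighErrorRate", "Warning")
    return None
```

Both thresholds are strict ("more than"), so a value of exactly 10 or exactly 25 does not cross into the next level.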

About Elasticsearch alerting rules

You can view these alerting rules in Prometheus.

Alert	Description	Severity
ElasticsearchClusterNotHealthy	The cluster health status has been RED for at least 2 minutes. The cluster does not accept writes, shards may be missing, or the master node hasn’t been elected yet.	critical
ElasticsearchClusterNotHealthy	The cluster health status has been YELLOW for at least 20 minutes. Some shard replicas are not allocated.	warning
ElasticsearchDiskSpaceRunningLow	The cluster is expected to be out of disk space within the next 6 hours.	critical
ElasticsearchHighFileDescriptorUsage	The cluster is predicted to be out of file descriptors within the next hour.	warning
ElasticsearchJVMHeapUseHigh	The JVM Heap usage on the specified node is high.	alert
ElasticsearchNodeDiskWatermarkReached	The specified node has hit the low watermark due to low free disk space. Shards cannot be allocated to this node anymore. Consider adding more disk space to the node.	info
ElasticsearchNodeDiskWatermarkReached	The specified node has hit the high watermark due to low free disk space. Some shards will be re-allocated to different nodes if possible. Add more disk space to the node or drop old indices allocated to this node.	warning
ElasticsearchNodeDiskWatermarkReached	The specified node has hit the flood watermark due to low free disk space. A read-only block is enforced on every index that has a shard allocated on this node. The index block must be manually released when the disk use falls below the high watermark.	critical
ElasticsearchJVMHeapUseHigh	The JVM Heap usage on the specified node is too high.	alert
ElasticsearchWriteRequestsRejectionJumps	Elasticsearch is experiencing an increase in write rejections on the specified node. This node might not be keeping up with the indexing speed.	warning
AggregatedloggingSystemCPUHigh	The CPU used by the system on the specified node is too high.	alert
ElasticsearchProcessCPUHigh	The CPU used by Elasticsearch on the specified node is too high.	alert
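
The three ElasticsearchNodeDiskWatermarkReached rows differ only in which watermark the node has reached. The following Python sketch shows that escalation. The percentages are an assumption: they are the stock Elasticsearch defaults for the low, high, and flood-stage watermarks (85%, 90%, and 95% disk used), and a given cluster logging deployment may be configured differently. The function name `watermark_reached` is hypothetical.

```python
from typing import Optional, Tuple

# (watermark, assumed threshold in percent disk used, severity from the table),
# ordered from most to least severe so the first match wins.
WATERMARKS = [
    ("flood", 95.0, "critical"),
    ("high", 90.0, "warning"),
    ("low", 85.0, "info"),
]


def watermark_reached(disk_used_percent: float) -> Optional[Tuple[str, str]]:
    """Return (watermark, severity) for the highest watermark reached.

    Treats a watermark as reached when disk use is at or above the
    assumed threshold; returns None when no watermark is reached.
    """
    for name, threshold, severity in WATERMARKS:
        if disk_used_percent >= threshold:
            return (name, severity)
    return None
```

Under these assumed thresholds, a node at 92% disk use has reached the high watermark (warning), and a node at 97% has reached the flood watermark (critical).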
