$ curl http://<example_app_endpoint>/metrics
Get familiar with the OKD monitoring concepts and terms. Learn about how you can improve performance and scale of your cluster, store and record data, manage metrics and alerts, and more.
You can optimize the performance and scale of your clusters. You can configure the default monitoring stack by performing any of the following actions:
Control the placement and distribution of monitoring components:
Use node selectors to move components to specific nodes.
Assign tolerations to enable moving components to tainted nodes.
Use pod topology spread constraints.
Set the body size limit for metrics scraping.
Manage CPU and memory resources.
Use metrics collection profiles.
By using the nodeSelector
constraint with labeled nodes, you can move any of the monitoring stack components to specific nodes.
By doing so, you can control the placement and distribution of the monitoring components across a cluster.
By controlling placement and distribution of monitoring components, you can optimize system resource use, improve performance, and separate workloads based on specific requirements or policies.
If you move monitoring components by using node selector constraints, be aware that other constraints to control pod scheduling might exist for a cluster:
Topology spread constraints might be in place to control pod placement.
Hard anti-affinity rules are in place for Prometheus, Alertmanager, and other monitoring components to ensure that multiple pods for these components are always spread across different nodes and are therefore always highly available.
When scheduling pods onto nodes, the pod scheduler tries to satisfy all existing constraints when determining pod placement. That is, all constraints compound when the pod scheduler determines which pods will be placed on which nodes.
Therefore, if you configure a node selector constraint but existing constraints cannot all be satisfied, the pod scheduler cannot match all constraints and will not schedule a pod for placement onto a node.
To maintain resilience and high availability for monitoring components, ensure that enough nodes are available and match all constraints when you configure a node selector constraint to move a component.
You can use pod topology spread constraints to control how the monitoring pods are spread across a network topology when OKD pods are deployed in multiple availability zones.
Pod topology spread constraints are suitable for controlling pod scheduling within hierarchical topologies in which nodes are spread across different infrastructure levels, such as regions and zones within those regions. Additionally, by being able to schedule pods in different zones, you can improve network latency in certain scenarios.
You can configure pod topology spread constraints for all the pods deployed by the Cluster Monitoring Operator to control how pod replicas are scheduled to nodes across zones. This ensures that the pods are highly available and run more efficiently, because workloads are spread across nodes in different data centers or hierarchical infrastructure zones.
You can configure resource limits and requests for the following core platform monitoring components:
Alertmanager
kube-state-metrics
monitoring-plugin
node-exporter
openshift-state-metrics
Prometheus
Metrics Server
Prometheus Operator and its admission webhook service
Telemeter Client
Thanos Querier
You can configure resource limits and requests for the following components that monitor user-defined projects:
Alertmanager
Prometheus
Thanos Ruler
By defining the resource limits, you limit a container’s resource usage, which prevents the container from exceeding the specified maximum values for CPU and memory resources.
By defining the resource requests, you specify that a container can be scheduled only on a node that has enough CPU and memory resources available to match the requested resources.
Metrics collection profile is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope. |
By default, Prometheus collects metrics exposed by all default metrics targets in OKD components. However, you might want Prometheus to collect fewer metrics from a cluster in certain scenarios:
If cluster administrators require only alert, telemetry, and console metrics and do not require other metrics to be available.
If a cluster increases in size, and the increased size of the default metrics data collected now requires a significant increase in CPU and memory resources.
You can use a metrics collection profile to collect either the default amount of metrics data or a minimal amount of metrics data. When you collect minimal metrics data, basic monitoring features such as alerting continue to work. At the same time, the CPU and memory resources required by Prometheus decrease.
You can enable one of two metrics collection profiles:
full: Prometheus collects metrics data exposed by all platform components. This setting is the default.
minimal: Prometheus collects only the metrics data required for platform alerts, recording rules, telemetry, and console dashboards.
You can store and record data to help you protect the data and use them for troubleshooting. You can configure the default monitoring stack by performing any of the following actions:
Configure persistent storage:
Protect your metrics and alerting data from data loss by storing them in a persistent volume (PV). As a result, they can survive pods being restarted or recreated.
Avoid getting duplicate notifications and losing silences for alerts when the Alertmanager pods are restarted.
Modify the retention time and size for Prometheus and Thanos Ruler metrics data.
Configure logging to help you troubleshoot issues with your cluster:
Configure audit logs for Metrics Server.
Set log levels for monitoring.
Enable the query logging for Prometheus and Thanos Querier.
By default, Prometheus retains metrics data for the following durations:
Core platform monitoring: 15 days
Monitoring for user-defined projects: 24 hours
You can modify the retention time for the Prometheus instance to change how soon the data is deleted. You can also set the maximum amount of disk space the retained metrics data uses. If the data reaches this size limit, Prometheus deletes the oldest data first until the disk space used is again below the limit.
Note the following behaviors of these data retention settings:
The size-based retention policy applies to all data block directories in the /prometheus
directory, including persistent blocks, write-ahead log (WAL) data, and m-mapped chunks.
Data in the /wal
and /head_chunks
directories counts toward the retention size limit, but Prometheus never purges data from these directories based on size- or time-based retention policies.
Thus, if you set a retention size limit lower than the maximum size set for the /wal
and /head_chunks
directories, you have configured the system not to retain any data blocks in the /prometheus
data directories.
The size-based retention policy is applied only when Prometheus cuts a new data block, which occurs every two hours after the WAL contains at least three hours of data.
If you do not explicitly define values for either retention
or retentionSize
, retention time defaults to 15 days for core platform monitoring and 24 hours for user-defined project monitoring. Retention size is not set.
If you define values for both retention
and retentionSize
, both values apply.
If any data blocks exceed the defined retention time or the defined size limit, Prometheus purges these data blocks.
If you define a value for retentionSize
and do not define retention
, only the retentionSize
value applies.
If you do not define a value for retentionSize
and only define a value for retention
, only the retention
value applies.
If you set the retentionSize
or retention
value to 0
, the default settings apply. The default settings set retention time to 15 days for core platform monitoring and 24 hours for user-defined project monitoring. By default, retention size is not set.
Data compaction occurs every two hours. Therefore, a persistent volume (PV) might fill up before compaction, potentially exceeding the |
In OKD 4, cluster components are monitored by scraping metrics exposed through service endpoints. You can also configure metrics collection for user-defined projects. Metrics enable you to monitor how cluster components and your own workloads are performing.
You can define the metrics that you want to provide for your own workloads by using Prometheus client libraries at the application level.
In OKD, metrics are exposed through an HTTP service endpoint under the /metrics
canonical name. You can list all available metrics for a service by running a curl
query against http://<endpoint>/metrics
. For instance, you can expose a route to the prometheus-example-app
example application and then run the following to view all of its available metrics:
$ curl http://<example_app_endpoint>/metrics
# HELP http_requests_total Count of all HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 4
http_requests_total{code="404",method="get"} 2
# HELP version Version information about this binary
# TYPE version gauge
version{version="v0.1.0"} 1
Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id
attribute is unbound because it has an infinite number of possible values.
Every assigned key-value pair has a unique time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.
Cluster administrators can use the following measures to control the impact of unbound metrics attributes in user-defined projects:
Limit the number of samples that can be accepted per target scrape in user-defined projects
Limit the number of scraped labels, the length of label names, and the length of label values
Configure the intervals between consecutive scrapes and between Prometheus rule evaluations
Create alerts that fire when a scrape sample threshold is reached or when the target cannot be scraped
Limiting scrape samples can help prevent the issues caused by adding many unbound attributes to labels. Developers can also prevent the underlying cause by limiting the number of unbound attributes that they define for metrics. Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations. |
If you manage multiple OKD clusters and use the remote write feature to send metrics data from these clusters to an external storage location, you can add cluster ID labels to identify the metrics data coming from different clusters. You can then query these labels to identify the source cluster for a metric and distinguish that data from similar metrics data sent by other clusters.
This way, if you manage many clusters for multiple customers and send metrics data to a single centralized storage system, you can use cluster ID labels to query metrics for a particular cluster or customer.
Creating and using cluster ID labels involves three general steps:
Configuring the write relabel settings for remote write storage.
Adding cluster ID labels to the metrics.
Querying these labels to identify the source cluster or customer for a metric.
OKD provides a set of monitoring dashboards that help you understand the state of cluster components and user-defined workloads.
Use the Administrator perspective to access dashboards for the core OKD components, including the following items:
API performance
etcd
Kubernetes compute resources
Kubernetes network resources
Prometheus
USE method dashboards relating to cluster and node performance
Node performance metrics
In the OKD, the Alerting UI enables you to manage alerts, silences, and alerting rules.
Alerting rules. Alerting rules contain a set of conditions that outline a particular state within a cluster. Alerts are triggered when those conditions are true. An alerting rule can be assigned a severity that defines how the alerts are routed.
Alerts. An alert is fired when the conditions defined in an alerting rule are true. Alerts provide a notification that a set of circumstances are apparent within an OKD cluster.
Silences. A silence can be applied to an alert to prevent notifications from being sent when the conditions for an alert are true. You can mute an alert after the initial notification, while you work on resolving the issue.
The alerts, silences, and alerting rules that are available in the Alerting UI relate to the projects that you have access to. For example, if you are logged in as a user with the |
You can create a silence for an alert in the OKD web console in both the Administrator and Developer perspectives. After you create a silence, you will not receive notifications about an alert when the alert fires.
Creating silences is useful in scenarios where you have received an initial alert notification, and you do not want to receive further notifications during the time in which you resolve the underlying issue causing the alert to fire.
When creating a silence, you must specify whether it becomes active immediately or at a later time. You must also set a duration period after which the silence expires.
After you create silences, you can view, edit, and expire them.
When you create silences, they are replicated across Alertmanager pods. However, if you do not configure persistent storage for Alertmanager, silences might be lost. This can happen, for example, if all Alertmanager pods restart at the same time. |
The OKD monitoring includes a large set of default alerting rules for platform metrics. As a cluster administrator, you can customize this set of rules in two ways:
Modify the settings for existing platform alerting rules by adjusting thresholds or by adding and modifying labels.
For example, you can change the severity
label for an alert from warning
to critical
to help you route and triage issues flagged by an alert.
Define and add new custom alerting rules by constructing a query expression based on core platform metrics in the openshift-monitoring
namespace.
New alerting rules must be based on the default OKD monitoring metrics.
You must create the AlertingRule
and AlertRelabelConfig
objects in the openshift-monitoring
namespace.
You can only add and modify alerting rules. You cannot create new recording rules or modify existing recording rules.
If you modify existing platform alerting rules by using an AlertRelabelConfig
object, your modifications are not reflected in the Prometheus alerts API.
Therefore, any dropped alerts still appear in the OKD web console even though they are no longer forwarded to Alertmanager.
Additionally, any modifications to alerts, such as a changed severity
label, do not appear in the web console.
If you customize core platform alerting rules to meet your organization’s specific needs, follow these guidelines to help ensure that the customized rules are efficient and effective.
Minimize the number of new rules. Create only rules that are essential to your specific requirements. By minimizing the number of rules, you create a more manageable and focused alerting system in your monitoring environment.
Focus on symptoms rather than causes. Create rules that notify users of symptoms instead of underlying causes. This approach ensures that users are promptly notified of a relevant symptom so that they can investigate the root cause after an alert has triggered. This tactic also significantly reduces the overall number of rules you need to create.
Plan and assess your needs before implementing changes. First, decide what symptoms are important and what actions you want users to take if these symptoms occur. Then, assess existing rules and decide if you can modify any of them to meet your needs instead of creating entirely new rules for each symptom. By modifying existing rules and creating new ones judiciously, you help to streamline your alerting system.
Provide clear alert messaging. When you create alert messages, describe the symptom, possible causes, and recommended actions. Include unambiguous, concise explanations along with troubleshooting steps or links to more information. Doing so helps users quickly assess the situation and respond appropriately.
Include severity levels. Assign severity levels to your rules to indicate how a user needs to react when a symptom occurs and triggers an alert. For example, classifying an alert as Critical signals that an individual or a critical response team needs to respond immediately. By defining severity levels, you help users know how to respond to an alert and help ensure that the most urgent issues receive prompt attention.
In OKD, you can create alerting rules for user-defined projects. Those alerting rules will trigger alerts based on the values of the chosen metrics.
If you create alerting rules for a user-defined project, consider the following key behaviors and important limitations when you define the new rules:
A user-defined alerting rule can include metrics exposed by its own project in addition to the default metrics from core platform monitoring. You cannot include metrics from another user-defined project.
For example, an alerting rule for the ns1
user-defined project can use metrics exposed by the ns1
project in addition to core platform metrics, such as CPU and memory metrics.
However, the rule cannot include metrics from a different ns2
user-defined project.
By default, when you create an alerting rule, the namespace
label is enforced on it even if a rule with the same name exists in another project. To create alerting rules that are not bound to their project of origin, see "Creating cross-project alerting rules for user-defined projects".
To reduce latency and to minimize the load on core platform monitoring components, you can add the openshift.io/prometheus-rule-evaluation-scope: leaf-prometheus
label to a rule.
This label forces only the Prometheus instance deployed in the openshift-user-workload-monitoring
project to evaluate the alerting rule and prevents the Thanos Ruler instance from doing so.
If an alerting rule has this label, your alerting rule can use only those metrics exposed by your user-defined project. Alerting rules you create based on default platform metrics might not trigger alerts. |
In OKD, you can view, edit, and remove alerting rules in user-defined projects.
The default alerting rules are used specifically for the OKD cluster.
Some alerting rules intentionally have identical names. They send alerts about the same event with different thresholds, different severity, or both.
Inhibition rules prevent notifications for lower severity alerts that are firing when a higher severity alert is also firing.
You can optimize alerting for your own projects by considering the following recommendations when creating alerting rules:
Minimize the number of alerting rules that you create for your project. Create alerting rules that notify you of conditions that impact you. It is more difficult to notice relevant alerts if you generate many alerts for conditions that do not impact you.
Create alerting rules for symptoms instead of causes. Create alerting rules that notify you of conditions regardless of the underlying cause. The cause can then be investigated. You will need many more alerting rules if each relates only to a specific cause. Some causes are then likely to be missed.
Plan before you write your alerting rules. Determine what symptoms are important to you and what actions you want to take if they occur. Then build an alerting rule for each symptom.
Provide clear alert messaging. State the symptom and recommended actions in the alert message.
Include severity levels in your alerting rules. The severity of an alert depends on how you need to react if the reported symptom occurs. For example, a critical alert should be triggered if a symptom requires immediate attention by an individual or a critical response team.
You can filter the alerts, silences, and alerting rules that are displayed in the Alerting UI. This section provides a description of each of the available filtering options.
In the Administrator perspective, the Alerts page in the Alerting UI provides details about alerts relating to default OKD and user-defined projects. The page includes a summary of severity, state, and source for each alert. The time at which an alert went into its current state is also shown.
You can filter by alert state, severity, and source. By default, only Platform alerts that are Firing are displayed. The following describes each alert filtering option:
State filters:
Firing. The alert is firing because the alert condition is true and the optional for
duration has passed. The alert continues to fire while the condition remains true.
Pending. The alert is active but is waiting for the duration that is specified in the alerting rule before it fires.
Silenced. The alert is now silenced for a defined time period. Silences temporarily mute alerts based on a set of label selectors that you define. Notifications are not sent for alerts that match all the listed values or regular expressions.
Severity filters:
Critical. The condition that triggered the alert could have a critical impact. The alert requires immediate attention when fired and is typically paged to an individual or to a critical response team.
Warning. The alert provides a warning notification about something that might require attention to prevent a problem from occurring. Warnings are typically routed to a ticketing system for non-immediate review.
Info. The alert is provided for informational purposes only.
None. The alert has no defined severity.
You can also create custom severity definitions for alerts relating to user-defined projects.
Source filters:
Platform. Platform-level alerts relate only to default OKD projects. These projects provide core OKD functionality.
User. User alerts relate to user-defined projects. These alerts are user-created and are customizable. User-defined workload monitoring can be enabled postinstallation to provide observability into your own workloads.
In the Administrator perspective, the Silences page in the Alerting UI provides details about silences applied to alerts in default OKD and user-defined projects. The page includes a summary of the state of each silence and the time at which a silence ends.
You can filter by silence state. By default, only Active and Pending silences are displayed. The following describes each silence state filter option:
State filters:
Active. The silence is active and the alert will be muted until the silence is expired.
Pending. The silence has been scheduled and it is not yet active.
Expired. The silence has expired and notifications will be sent if the conditions for an alert are true.
In the Administrator perspective, the Alerting rules page in the Alerting UI provides details about alerting rules relating to default OKD and user-defined projects. The page includes a summary of the state, severity, and source for each alerting rule.
You can filter alerting rules by alert state, severity, and source. By default, only Platform alerting rules are displayed. The following describes each alerting rule filtering option:
Alert state filters:
Firing. The alert is firing because the alert condition is true and the optional for
duration has passed. The alert continues to fire while the condition remains true.
Pending. The alert is active but is waiting for the duration that is specified in the alerting rule before it fires.
Silenced. The alert is now silenced for a defined time period. Silences temporarily mute alerts based on a set of label selectors that you define. Notifications are not sent for alerts that match all the listed values or regular expressions.
Not Firing. The alert is not firing.
Severity filters:
Critical. The conditions defined in the alerting rule could have a critical impact. When true, these conditions require immediate attention. Alerts relating to the rule are typically paged to an individual or to a critical response team.
Warning. The conditions defined in the alerting rule might require attention to prevent a problem from occurring. Alerts relating to the rule are typically routed to a ticketing system for non-immediate review.
Info. The alerting rule provides informational alerts only.
None. The alerting rule has no defined severity.
You can also create custom severity definitions for alerting rules relating to user-defined projects.
Source filters:
Platform. Platform-level alerting rules relate only to default OKD projects. These projects provide core OKD functionality.
User. User-defined workload alerting rules relate to user-defined projects. These alerting rules are user-created and are customizable. User-defined workload monitoring can be enabled postinstallation to provide observability into your own workloads.
In the Developer perspective, the Alerts page in the Alerting UI provides a combined view of alerts and silences relating to the selected project. A link to the governing alerting rule is provided for each displayed alert.
In this view, you can filter by alert state and severity. By default, all alerts in the selected project are displayed if you have permission to access the project. These filters are the same as those described for the Administrator perspective.
As a cluster administrator, you can enable alert routing for user-defined projects.
With this feature, you can allow users with the alert-routing-edit
cluster role to configure alert notification routing and receivers for user-defined projects.
These notifications are routed by the default Alertmanager instance or, if enabled, an optional Alertmanager instance dedicated to user-defined monitoring.
Users can then create and configure user-defined alert routing by creating or editing the AlertmanagerConfig
objects for their user-defined projects without the help of an administrator.
After a user has defined alert routing for a user-defined project, user-defined alert notifications are routed as follows:
To the alertmanager-main
pods in the openshift-monitoring
namespace if using the default platform Alertmanager instance.
To the alertmanager-user-workload
pods in the openshift-user-workload-monitoring
namespace if you have enabled a separate instance of Alertmanager for user-defined projects.
Review the following limitations of alert routing for user-defined projects:
|
In OKD 4, firing alerts can be viewed in the Alerting UI. Alerts are not configured by default to be sent to any notification systems. You can configure OKD to send alerts to the following receiver types:
PagerDuty
Webhook
Slack
Microsoft Teams
Routing alerts to receivers enables you to send timely notifications to the appropriate teams when failures occur. For example, critical alerts require immediate attention and are typically paged to an individual or a critical response team. Alerts that provide non-critical warning notifications might instead be routed to a ticketing system for non-immediate review.
OKD monitoring includes a watchdog alert that fires continuously. Alertmanager repeatedly sends watchdog alert notifications to configured notification providers. The provider is usually configured to notify an administrator when it stops receiving the watchdog alert. This mechanism helps you quickly identify any communication issues between Alertmanager and the notification provider.