Common monitoring configuration scenarios - Monitoring | Observability

Configuring core platform monitoring: Postinstallation steps
Configuring monitoring for user-defined projects: Getting started

After OKD is installed, core platform monitoring components immediately begin collecting metrics, which you can query and view. The default in-cluster monitoring stack includes the core platform Prometheus instance that collects metrics from your cluster and the core Alertmanager instance that routes alerts, among other components. Depending on who will use the monitoring stack and for what purposes, as a cluster administrator, you can further configure these monitoring components to suit the needs of different users in various scenarios.

In addition to core platform monitoring, you can also optionally enable monitoring for user-defined projects for user workload monitoring. Users can then monitor their own services and workloads without the need for an additional monitoring solution.

Configuring core platform monitoring: Postinstallation steps

After OKD is installed, cluster administrators typically configure core platform monitoring to suit their needs. These activities include setting up storage and configuring options for Prometheus, Alertmanager, and other monitoring components.

By default, in a newly installed OKD system, users can query and view collected metrics. You need only configure an alert receiver if you want users to receive alert notifications. Any other configuration options listed here are optional.

Create the cluster-monitoring-config configmap object if it does not exist.
Configure alert receivers so that Alertmanager can send alerts to an external notification system such as email, Slack, or PagerDuty.
Configure notifications for default platform alerts.

For shorter term data retention, configure persistent storage for Prometheus and Alertmanager to store metrics and alert data. Specify the metrics data retention parameters for Prometheus and Thanos Ruler.

In multi-node clusters, you must configure persistent storage for Prometheus, Alertmanager, and Thanos Ruler to ensure high availability.
By default, in a newly installed OKD system, the monitoring ClusterOperator resource reports a PrometheusDataPersistenceNotConfigured status message to remind you that storage is not configured.

For longer term data retention, configure the remote write feature to enable Prometheus to send ingested metrics to remote systems for storage.

Be sure to add cluster ID labels to metrics for use with your remote write storage configuration.
Assign monitoring cluster roles to any non-administrator users that need to access certain monitoring features.
Assign tolerations to monitoring stack components so that administrators can move them to tainted nodes.
Set the body size limit for metrics collection to help avoid situations in which Prometheus consumes excessive amounts of memory when scraped targets return a response that contains a large amount of data.
Modify or create alerting rules for your cluster. These rules specify the conditions that trigger alerts, such as high CPU or memory usage, network latency, and so forth.
Specify resource limits and requests for monitoring components to ensure that the containers that run monitoring components have enough CPU and memory resources.

With the monitoring stack configured to suit your needs, Prometheus collects metrics from the specified services and stores these metrics according to your settings. You can go to the Observe pages in the OKD web console to view and query collected metrics, manage alerts, identify performance bottlenecks, and scale resources as needed:

View dashboards to visualize collected metrics, troubleshoot alerts, and monitor other information about your cluster.
Query collected metrics by creating PromQL queries or using predefined queries.

Configuring monitoring for user-defined projects: Getting started

As a cluster administrator, you can optionally enable monitoring for user-defined projects in addition to core platform monitoring. Non-administrator users such as developers can then monitor their own projects outside of core platform monitoring.

Cluster administrators typically complete the following activities to configure user-defined projects so that users can view collected metrics, query these metrics, and receive alerts for their own projects:

Enable user-defined projects.
Assign the monitoring-rules-view, monitoring-rules-edit, or monitoring-edit cluster roles to grant non-administrator users permissions to monitor user-defined projects.
Assign the user-workload-monitoring-config-edit role to grant non-administrator users permission to configure user-defined projects.
Enable alert routing for user-defined projects so that developers and other users can configure custom alerts and alert routing for their projects.
If needed, configure alert routing for user-defined projects to use an optional Alertmanager instance dedicated for use only by user-defined projects.
Configure alert receivers for user-defined projects.
Configure notifications for user-defined alerts.

After monitoring for user-defined projects is enabled and configured, developers and other non-administrator users can then perform the following activities to set up and use monitoring for their own projects:

Deploy and monitor services.
Create and manage alerting rules.
Receive and manage alerts for their projects.
If granted the user-workload-monitoring-config-edit role, configure alert routing.
Use the OKD web console to view dashboards.
Query the collected metrics by creating PromQL queries or using predefined queries.