Configuring the monitoring stack

Maintenance and support for monitoring
- Support considerations for monitoring
- Support version matrix for monitoring components
Configuring the monitoring stack
Configurable monitoring components
Using node selectors to move monitoring components
- How node selectors work with other constraints
- Moving monitoring components to different nodes
Assigning tolerations to monitoring components
Managing CPU and memory resources for monitoring components
- About specifying limits and requests for monitoring components
- Specifying limits and requests for monitoring components
Configuring persistent storage
Configuring remote write storage
Adding cluster ID labels to metrics
- Creating cluster ID labels for metrics
Controlling the impact of unbound metrics attributes in user-defined projects
- Setting scrape sample and label limits for user-defined projects
Configuring external Alertmanager instances
Configuring secrets for Alertmanager
- Adding a secret to the Alertmanager configuration
Attaching additional labels to your time series and alerts
Using pod topology spread constraints for monitoring
- Configuring pod topology spread constraints
Setting log levels for monitoring components
Enabling the query log file for Prometheus

This section explains what configuration is supported, shows how to configure the monitoring stack for user-defined projects, and demonstrates several common configuration scenarios.

Not all configuration parameters for the monitoring stack are exposed. Only the parameters and fields listed in the Config map reference for the Cluster Monitoring Operator are supported for configuration.

Red Hat OpenShift Service on AWS version	Prometheus Operator	Prometheus	Metrics Server	Alertmanager	kube-state-metrics agent	monitoring-plugin	node-exporter agent	Thanos
4.17	0.75.2	2.53.1	0.7.1	0.27.0	2.13.0	1.0.0	1.8.2	0.35.1
4.16	0.73.2	2.52.0	0.7.1	0.26.0	2.12.0	1.0.0	1.8.0	0.35.0
4.15	0.70.0	2.48.0	0.6.4	0.26.0	2.10.1	1.0.0	1.7.0	0.32.5
4.14	0.67.1	2.46.0	N/A	0.25.0	2.9.2	1.0.0	1.6.1	0.30.2
4.13	0.63.0	2.42.0	N/A	0.25.0	2.8.1	N/A	1.5.0	0.30.2
4.12	0.60.1	2.39.1	N/A	0.24.0	2.6.0	N/A	1.4.0	0.28.1

1	Defines the Prometheus component and the subsequent lines define its configuration.
2	Configures a 24-hour data retention period for the Prometheus instance that monitors user-defined projects.
3	Defines a minimum CPU request of 200 millicores for the Prometheus container.
4	Defines a minimum memory request of 2 GiB for the Prometheus container.
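
The callouts above describe a config map along the following lines. This is a minimal sketch, not the original listing: the `openshift-user-workload-monitoring` namespace is an assumption, and the numbered comments map each field to its callout.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  # Assumed namespace for user workload monitoring.
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:            # (1) component being configured
      retention: 24h       # (2) keep metrics data for 24 hours
      resources:
        requests:
          cpu: 200m        # (3) minimum CPU request
          memory: 2Gi      # (4) minimum memory request
```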

Component	user-workload-monitoring-config config map key
Alertmanager	`alertmanager`
Prometheus Operator	`prometheusOperator`
Prometheus	`prometheus`
Thanos Ruler	`thanosRuler`

1	Substitute `<component>` with the appropriate monitoring stack component name.
2	Substitute `<node-label-1>` with the label you added to the node.
3	Optional: Specify additional labels. If you specify additional labels, the pods for the component are only scheduled on the nodes that contain all of the specified labels.
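
To make the callouts above concrete, the following fragment sketches the `config.yaml` content of the `user-workload-monitoring-config` config map. The component (`thanosRuler`) and both node labels are hypothetical choices for illustration.

```yaml
# Content of data.config.yaml (sketch).
thanosRuler:                    # (1) component to move
  nodeSelector:
    monitoring: user-workload   # (2) hypothetical label added to the target nodes
    disktype: ssd               # (3) optional extra label; a node must carry both
```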

1	Specify the component for user-defined monitoring for which you want to configure the PVC.
2	Specify an existing storage class. If a storage class is not specified, the default storage class is used.
3	Specify the amount of required storage.
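
A persistent volume claim configuration matching the callouts above might look like the following sketch of the `config.yaml` content. The storage class name and the storage size are hypothetical values, and the block assumes that the `volumeClaimTemplate` field is available for the chosen component.

```yaml
# Content of data.config.yaml (sketch).
prometheus:                       # (1) component that gets the PVC
  volumeClaimTemplate:
    spec:
      storageClassName: gp3-csi   # (2) hypothetical existing storage class
      resources:
        requests:
          storage: 40Gi           # (3) hypothetical amount of storage
```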

1	The retention time: a number directly followed by `ms` (milliseconds), `s` (seconds), `m` (minutes), `h` (hours), `d` (days), `w` (weeks), or `y` (years). You can also combine time values for specific times, such as `1h30m15s`.
2	The retention size: a number directly followed by `B` (bytes), `KB` (kilobytes), `MB` (megabytes), `GB` (gigabytes), `TB` (terabytes), `PB` (petabytes), or `EB` (exabytes).
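
As a sketch of the two settings described above, the following `config.yaml` fragment sets both a retention time and a retention size for Prometheus. The values are illustrative only.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  retention: 24h        # (1) retention time: number followed by a time unit
  retentionSize: 10GB   # (2) retention size: number followed by a size unit
```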

1	The URL of the remote write endpoint.
2	The authentication method and credentials for the endpoint. Currently supported authentication methods are AWS Signature Version 4, authentication using an HTTP `Authorization` request header, basic authentication, OAuth 2.0, and TLS client. See Supported remote write authentication settings below for sample configurations of supported authentication methods.
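
The overall shape of a remote write entry, as described by the two callouts above, can be sketched as follows. The endpoint URL is hypothetical, and the authentication block is left as a comment because its content depends on the method you choose.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://remote-write-endpoint.example.com"   # (1) hypothetical endpoint URL
    # (2) Authentication settings go here, for example a sigv4,
    #     authorization, basicAuth, oauth2, or tlsConfig block.
```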

1	The AWS region.
2	The name of the `Secret` object containing the AWS API access credentials.
3	The key that contains the AWS API access key in the specified `Secret` object.
4	The key that contains the AWS API secret key in the specified `Secret` object.
5	The name of the AWS profile that is being used to authenticate.
6	The unique identifier for the Amazon Resource Name (ARN) assigned to your role.
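
A remote write entry using AWS Signature Version 4 might be laid out as in this sketch. The endpoint URL and the `sigv4-credentials` secret name and keys are hypothetical; values in angle brackets are placeholders to substitute.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://authorization.example.com/api/write"   # hypothetical endpoint
    sigv4:
      region: <aws_region>            # (1) AWS region
      accessKey:
        name: sigv4-credentials       # (2) hypothetical Secret name
        key: access                   # (3) key holding the AWS API access key
      secretKey:
        name: sigv4-credentials       # (2) same Secret object
        key: secret                   # (4) key holding the AWS API secret key
      profile: <aws_profile_name>     # (5) AWS profile used to authenticate
      roleArn: <aws_role_arn>         # (6) ARN assigned to your role
```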

1	The name of the `Secret` object that contains the authentication credentials.
2	The key that contains the username in the specified `Secret` object.
3	The key that contains the password in the specified `Secret` object.
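
For basic authentication, the three callouts above correspond to a `basicAuth` block roughly like this sketch. The endpoint URL, the `rw-basic-auth` secret name, and its key names are hypothetical.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://basicauth.example.com/api/write"   # hypothetical endpoint
    basicAuth:
      username:
        name: rw-basic-auth   # (1) hypothetical Secret name
        key: user             # (2) key holding the username
      password:
        name: rw-basic-auth   # (1) same Secret object
        key: password         # (3) key holding the password
```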

1	The authentication type of the request. The default value is `Bearer`.
2	The name of the `Secret` object that contains the authentication credentials.
3	The key that contains the authentication token in the specified `Secret` object.
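
Authentication through an `Authorization` request header with a bearer token could be configured as in the following sketch. The endpoint URL, the `rw-bearer-auth` secret name, and the `token` key are hypothetical.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://authorization.example.com/api/write"   # hypothetical endpoint
    authorization:
      type: Bearer              # (1) authentication type; Bearer is the default
      credentials:
        name: rw-bearer-auth    # (2) hypothetical Secret name
        key: token              # (3) key holding the authentication token
```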

1	The username.
2	The password.
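
The two callouts above most likely annotate a `Secret` object that stores basic authentication credentials. A sketch under that assumption follows; the secret name, key names, and namespace are hypothetical and must match whatever the `basicAuth` settings reference.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: rw-basic-auth                            # hypothetical Secret name
  namespace: openshift-user-workload-monitoring  # assumed namespace
stringData:
  user: <basic_username>       # (1) the username
  password: <basic_password>   # (2) the password
type: Opaque
```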

1	The OAuth 2.0 ID.
2	The OAuth 2.0 secret.
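
Assuming these callouts annotate a `Secret` object that holds the OAuth 2.0 client credentials, it could be sketched as follows. The secret name, key names, and namespace are hypothetical.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: oauth2-credentials                       # hypothetical Secret name
  namespace: openshift-user-workload-monitoring  # assumed namespace
stringData:
  id: <oauth2_id>           # (1) the OAuth 2.0 ID
  secret: <oauth2_secret>   # (2) the OAuth 2.0 secret
type: Opaque
```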

1	The number of samples to buffer per shard before they are dropped from the queue.
2	The minimum number of shards.
3	The maximum number of shards.
4	The maximum number of samples per send.
5	The maximum time for a sample to wait in the buffer.
6	The initial time to wait before retrying a failed request. The time doubles with every retry, up to the `maxBackoff` time.
7	The maximum time to wait before retrying a failed request.
8	Set this parameter to `true` to retry a request after receiving a 429 status code from the remote write storage.
9	Samples that are older than the `sampleAgeLimit` value are dropped from the queue. If the value is undefined or set to `0s`, the parameter is ignored.
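
The nine queue parameters above map onto a `queueConfig` block such as the one in this sketch. The endpoint URL is hypothetical and the values are illustrative, not tuning recommendations.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://remote-write-endpoint.example.com"   # hypothetical endpoint
    queueConfig:
      capacity: 10000           # (1) samples buffered per shard
      minShards: 1              # (2) minimum number of shards
      maxShards: 50             # (3) maximum number of shards
      maxSamplesPerSend: 2000   # (4) maximum samples per send
      batchSendDeadline: 5s     # (5) maximum wait time in the buffer
      minBackoff: 30ms          # (6) initial retry delay
      maxBackoff: 5s            # (7) maximum retry delay
      retryOnRateLimit: false   # (8) set to true to retry after a 429 response
      sampleAgeLimit: 0s        # (9) 0s means the age limit is ignored
```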

1	The system initially applies a temporary cluster ID source label named `__tmp_openshift_cluster_id__`. This temporary label gets replaced by the cluster ID label name that you specify.
2	Specify the name of the cluster ID label for metrics sent to remote write storage. If you use a label name that already exists for a metric, the existing value of that label is overwritten with the cluster ID. For the label name, do not use `__tmp_openshift_cluster_id__`. The final relabeling step removes labels that use this name.
3	The `replace` write relabel action replaces the temporary label with the target label for outgoing metrics. This action is the default and is applied if no action is specified.
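
The relabeling described by these callouts can be sketched as the following `config.yaml` fragment. The endpoint URL and the target label name `cluster_id` are hypothetical choices.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://remote-write-endpoint.example.com"   # hypothetical endpoint
    writeRelabelConfigs:
    - sourceLabels:
      - __tmp_openshift_cluster_id__   # (1) temporary source label
      targetLabel: cluster_id          # (2) hypothetical cluster ID label name
      action: replace                  # (3) default action, shown explicitly
```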

1	The name of the corresponding `Secret` object. Note that `clientId` can alternatively refer to a `ConfigMap` object, although `clientSecret` must refer to a `Secret` object.
2	The key that contains the OAuth 2.0 credentials in the specified `Secret` object.
3	The URL used to fetch a token with the specified `clientId` and `clientSecret`.
4	The OAuth 2.0 scopes for the authorization request. These scopes limit what data the tokens can access.
5	The OAuth 2.0 authorization request parameters required for the authorization server.
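
An OAuth 2.0 remote write configuration matching the callouts above might be laid out as in this sketch. The endpoint and token URLs, the `oauth2-credentials` secret name, and its key names are hypothetical; values in angle brackets are placeholders.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://test.example.com/api/write"   # hypothetical endpoint
    oauth2:
      clientId:
        secret:
          name: oauth2-credentials   # (1) hypothetical Secret name
          key: id                    # (2) key holding the client ID
      clientSecret:
        name: oauth2-credentials     # (1) same Secret object
        key: secret                  # (2) key holding the client secret
      tokenUrl: https://example.com/oauth2/token   # (3) hypothetical token URL
      scopes:                        # (4) scopes for the authorization request
      - <scope_1>
      - <scope_2>
      endpointParams:                # (5) extra authorization request parameters
        param1: <parameter_1>
        param2: <parameter_2>
```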

1	The CA certificate in the Prometheus container with which to validate the server certificate.
2	The client certificate for authentication with the server.
3	The client key.
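
Assuming these callouts annotate a `Secret` object that bundles the TLS material, it could be sketched as follows. The secret name, key names, namespace, and the `Opaque` type are assumptions; values under `data` must be base64-encoded.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mtls-bundle                              # hypothetical Secret name
  namespace: openshift-user-workload-monitoring  # assumed namespace
data:
  ca.crt: <base64_ca_cert>           # (1) CA certificate used to validate the server
  client.crt: <base64_client_cert>   # (2) client certificate
  client.key: <base64_client_key>    # (3) client key
type: Opaque
```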

1	The name of the corresponding `Secret` object that contains the TLS authentication credentials. Note that `ca` and `cert` can alternatively refer to a `ConfigMap` object, though `keySecret` must refer to a `Secret` object.
2	The key in the specified `Secret` object that contains the CA certificate for the endpoint.
3	The key in the specified `Secret` object that contains the client certificate for the endpoint.
4	The key in the specified `Secret` object that contains the client key secret.
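
The TLS client settings described above correspond to a `tlsConfig` block roughly like this sketch. The endpoint URL, the `mtls-bundle` secret name, and its key names are hypothetical.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://remote-write-endpoint.example.com"   # hypothetical endpoint
    tlsConfig:
      ca:
        secret:
          name: mtls-bundle   # (1) hypothetical Secret name
          key: ca.crt         # (2) key holding the CA certificate
      cert:
        secret:
          name: mtls-bundle   # (1) same Secret object
          key: client.crt     # (3) key holding the client certificate
      keySecret:
        name: mtls-bundle     # (1) same Secret object
        key: client.key       # (4) key holding the client key
```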

1	Add a list of write relabel configurations for metrics that you want to send to the remote endpoint.
2	Substitute the label configuration for the metrics sent to the remote write endpoint.
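
As one possible write relabel configuration, the following sketch forwards only a single metric to the remote endpoint. The endpoint URL and the metric name `my_metric` are hypothetical.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  remoteWrite:
  - url: "https://remote-write-endpoint.example.com"   # hypothetical endpoint
    writeRelabelConfigs:          # (1) list of write relabel configurations
    - sourceLabels: [__name__]    # (2) match on the metric name
      regex: 'my_metric'          #     hypothetical metric to keep
      action: keep                #     drop every other metric
```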

1	Specifies the maximum number of labels per scrape. The default value is `0`, which specifies no limit.
2	Specifies the maximum length in characters of a label name. The default value is `0`, which specifies no limit.
3	Specifies the maximum length in characters of a label value. The default value is `0`, which specifies no limit.
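
The three label limits above can be set together, as in this sketch of the `config.yaml` content. The numeric values are illustrative only.

```yaml
# Content of data.config.yaml (sketch).
prometheus:
  enforcedLabelLimit: 500              # (1) maximum number of labels per scrape
  enforcedLabelNameLengthLimit: 50     # (2) maximum label name length, in characters
  enforcedLabelValueLengthLimit: 600   # (3) maximum label value length, in characters
```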

1	This section contains the secrets to be mounted into Alertmanager. The secrets must be located within the same namespace as the Alertmanager object.
2	The name of the `Secret` object that contains authentication credentials for the receiver. If you add multiple secrets, place each one on a new line.
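
Mounting secrets into Alertmanager, as the callouts above describe, might look like the following sketch. Both secret names are hypothetical and must exist in the same namespace as the Alertmanager object.

```yaml
# Content of data.config.yaml (sketch).
alertmanager:
  secrets:                      # (1) secrets to mount into Alertmanager
  - test-secret-basic-auth      # (2) hypothetical Secret name
  - test-secret-api-token       # (2) each additional secret on its own line
```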

1	Specify the name of the component for which you want to set up pod topology spread constraints.
2	Specify a numeric value for `maxSkew`, which defines the degree to which pods are allowed to be unevenly distributed.
3	Specify a key of node labels for `topologyKey`. Nodes that have a label with this key and identical values are considered to be in the same topology. The scheduler tries to put a balanced number of pods into each domain.
4	Specify a value for `whenUnsatisfiable`. Available options are `DoNotSchedule` and `ScheduleAnyway`. Specify `DoNotSchedule` if you want the `maxSkew` value to define the maximum difference allowed between the number of matching pods in the target topology and the global minimum. Specify `ScheduleAnyway` if you want the scheduler to still schedule the pod but to give higher priority to nodes that might reduce the skew.
5	Specify `labelSelector` to find matching pods. Pods that match this label selector are counted to determine the number of pods in their corresponding topology domain.
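
The five settings above fit together as in this sketch for Thanos Ruler. The choice of component, the zone topology key, and the `app.kubernetes.io/name: thanos-ruler` pod label are assumptions made for illustration.

```yaml
# Content of data.config.yaml (sketch).
thanosRuler:                                 # (1) component to spread
  topologySpreadConstraints:
  - maxSkew: 1                               # (2) allowed imbalance between domains
    topologyKey: topology.kubernetes.io/zone # (3) node label key defining a domain
    whenUnsatisfiable: ScheduleAnyway        # (4) or DoNotSchedule
    labelSelector:                           # (5) pods counted per domain
      matchLabels:
        app.kubernetes.io/name: thanos-ruler # assumed pod label
```

With `ScheduleAnyway`, the constraint acts as a scheduling preference; `DoNotSchedule` makes it a hard requirement.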

1	The monitoring stack component for which you are setting a log level. For user workload monitoring, available component values are `alertmanager`, `prometheus`, `prometheusOperator`, and `thanosRuler`.
2	The log level to apply to the component. The available values are `error`, `warn`, `info`, and `debug`. The default value is `info`.
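
A log level setting for a single component can be sketched as follows. The choice of `prometheus` and `debug` is illustrative.

```yaml
# Content of data.config.yaml (sketch).
prometheus:         # (1) component whose log level is set
  logLevel: debug   # (2) one of error, warn, info, or debug
```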

Maintenance and support for monitoring

Support considerations for monitoring

Support version matrix for monitoring components

Configuring the monitoring stack

Configurable monitoring components

Using node selectors to move monitoring components

How node selectors work with other constraints

Moving monitoring components to different nodes

Assigning tolerations to monitoring components

Managing CPU and memory resources for monitoring components

About specifying limits and requests for monitoring components

Specifying limits and requests for monitoring components

Configuring persistent storage

Persistent storage prerequisites

Configuring a persistent volume claim

Modifying the retention time and size for Prometheus metrics data

Modifying the retention time for Thanos Ruler metrics data

Configuring remote write storage

Supported remote write authentication settings

Example remote write authentication settings

Sample YAML for AWS Signature Version 4 authentication

Sample YAML for Basic authentication

Sample YAML for authentication with a bearer token using a Secret Object

Sample YAML for OAuth 2.0 authentication

Sample YAML for TLS client authentication

Example remote write queue configuration

Adding cluster ID labels to metrics

Creating cluster ID labels for metrics

Controlling the impact of unbound metrics attributes in user-defined projects

Setting scrape sample and label limits for user-defined projects

Configuring external Alertmanager instances

Configuring secrets for Alertmanager

Adding a secret to the Alertmanager configuration

Attaching additional labels to your time series and alerts

Using pod topology spread constraints for monitoring

Configuring pod topology spread constraints

Setting log levels for monitoring components

Enabling the query log file for Prometheus
