Scaling the Cluster Monitoring Operator

OKD exposes metrics that the Cluster Monitoring Operator collects and stores in the Prometheus-based monitoring stack. As an administrator, you can view dashboards for system resources, containers, and component metrics in the OKD web console by navigating to Observe → Dashboards.

Prometheus database storage requirements

Red Hat performed various tests for different scale sizes.

The Prometheus storage requirements below are not prescriptive and should be used as a reference. Higher resource consumption might be observed in your cluster depending on workload activity and resource density, including the number of pods, containers, routes, or other resources exposing metrics collected by Prometheus.

Table 1. Prometheus database storage requirements based on the number of nodes and pods in the cluster
Number of Nodes	Number of pods (2 containers per pod)	Prometheus storage growth per day	Prometheus storage growth per 15 days	Network (per tsdb chunk)
50	1800	6.3 GB	94 GB	16 MB
100	3600	13 GB	195 GB	26 MB
150	5400	19 GB	283 GB	36 MB
200	7200	25 GB	375 GB	46 MB

Approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.

The above calculation is for the default OKD Cluster Monitoring Operator.

CPU utilization has a minor impact. The ratio is approximately 1 core out of 40 per 50 nodes and 1800 pods.
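
The figures in Table 1 can be turned into a rough estimate for other retention periods. The following is a minimal sketch, not an official sizing tool. It assumes that storage grows linearly with retention time, which matches the table only approximately (the 15-day column is roughly 15 times the daily column). The function name estimate_storage_gb is hypothetical. Because the table values already include the 20 percent overhead, the sketch adds none.

```shell
#!/bin/sh
# Rough Prometheus storage estimate based on the daily growth column of Table 1.
# Assumption: growth is linear in retention time. The table values already
# include the ~20 percent overhead, so no extra margin is added here.

# estimate_storage_gb NODES RETENTION_DAYS   (hypothetical helper)
estimate_storage_gb() {
  nodes=$1
  days=$2
  # Pick the smallest tested cluster size that covers the node count.
  if   [ "$nodes" -le 50 ];  then daily=6.3
  elif [ "$nodes" -le 100 ]; then daily=13
  elif [ "$nodes" -le 150 ]; then daily=19
  elif [ "$nodes" -le 200 ]; then daily=25
  else
    echo "node count $nodes is outside the tested range (up to 200)" >&2
    return 1
  fi
  # Daily growth in GB multiplied by the retention period in days.
  awk -v d="$daily" -v n="$days" 'BEGIN { printf "%.1f\n", d * n }'
}

estimate_storage_gb 100 15   # prints 195.0
estimate_storage_gb 200 30   # prints 750.0
```

Treat the result as a lower bound for planning: as noted above, clusters with higher workload activity or resource density can consume more.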

Recommendations for OKD

Use at least two infrastructure (infra) nodes.
Use at least three openshift-container-storage nodes with solid-state (SSD) or non-volatile memory express (NVMe) drives.

Configuring cluster monitoring

You can increase the storage capacity for the Prometheus component in the cluster monitoring stack.

Procedure

To increase the storage capacity for Prometheus:

Create a YAML configuration file, cluster-monitoring-config.yaml. For example:

apiVersion: v1
kind: ConfigMap
data:
  config.yaml: |
    prometheusK8s:
      retention: {{PROMETHEUS_RETENTION_PERIOD}} (1)
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: {{STORAGE_CLASS}} (2)
          resources:
            requests:
              storage: {{PROMETHEUS_STORAGE_SIZE}} (3)
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: {{STORAGE_CLASS}} (2)
          resources:
            requests:
              storage: {{ALERTMANAGER_STORAGE_SIZE}} (4)
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring

1	The default value of Prometheus retention is `PROMETHEUS_RETENTION_PERIOD=15d`. Specify the retention time as a number followed by one of these suffixes: s, m, h, d.
2	The storage class for your cluster.
3	A typical value is `PROMETHEUS_STORAGE_SIZE=2000Gi`. Storage values can be a plain integer or a fixed-point number using one of these suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
4	A typical value is `ALERTMANAGER_STORAGE_SIZE=20Gi`. Storage values can be a plain integer or a fixed-point number using one of these suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.

Add values for the retention period, storage class, and storage sizes.
Save the file.

Apply the changes by running:

$ oc create -f cluster-monitoring-config.yaml
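
The file-creation steps of this procedure can also be scripted. The following sketch writes cluster-monitoring-config.yaml from a shell here-document with the placeholders filled in. The retention period and storage sizes are the default and typical values quoted in the callouts. The storage class name example-storage-class is a hypothetical placeholder and must be replaced with a storage class that exists in your cluster.

```shell
#!/bin/sh
# Values substituted into the manifest. STORAGE_CLASS is a hypothetical name;
# the other three are the default or typical values from the callouts.
PROMETHEUS_RETENTION_PERIOD=15d
STORAGE_CLASS=example-storage-class
PROMETHEUS_STORAGE_SIZE=2000Gi
ALERTMANAGER_STORAGE_SIZE=20Gi

# Write the config map manifest. The unquoted EOF delimiter lets the shell
# expand the variables inside the here-document.
cat > cluster-monitoring-config.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: ${PROMETHEUS_RETENTION_PERIOD}
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: ${STORAGE_CLASS}
          resources:
            requests:
              storage: ${PROMETHEUS_STORAGE_SIZE}
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: ${STORAGE_CLASS}
          resources:
            requests:
              storage: ${ALERTMANAGER_STORAGE_SIZE}
EOF
```

The generated file contains no callout markers or unresolved placeholders, so it can be passed directly to the oc create -f command shown in the last step.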