Visualizing power monitoring metrics - Power Monitoring | Observability

Power monitoring dashboards overview
Accessing power monitoring dashboards as a cluster administrator
Accessing power monitoring dashboards as a developer
Power monitoring metrics overview
Additional resources

Power monitoring is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

You can visualize power monitoring metrics in the OKD web console by accessing power monitoring dashboards or by exploring Metrics under the Observe tab.

Power monitoring dashboards overview

There are two types of power monitoring dashboards. Both provide different levels of details around power consumption metrics for a single cluster:

Power Monitoring / Overview dashboard

With this dashboard, you can observe the following information:

An aggregated view of CPU architecture and its power source (rapl-sysfs, rapl-msr, or estimator) along with total nodes with this configuration
Total energy consumption by a cluster in the last 24 hours (measured in kilowatt-hour)
The amount of power consumed by the top 10 namespaces in a cluster in the last 24 hours
Detailed node information, such as its CPU architecture and component power source

These features allow you to effectively monitor the energy consumption of the cluster without needing to investigate each namespace separately.

Ensure that the Components Source column does not display estimator as the power source.

Figure 1. The Detailed Node Information table with rapl-sysfs as the component power source

If Kepler is unable to obtain hardware power consumption metrics, the Components Source column displays estimator as the power source, which is not supported in Technology Preview. If that happens, then the values from the nodes are not accurate.

Power Monitoring / Namespace dashboard

This dashboard allows you to view metrics by namespace and pod. You can observe the following information:

The power consumption metrics, such as consumption in DRAM and PKG
The energy consumption metrics in the last hour, such as consumption in DRAM and PKG for core and uncore components

This feature allows you to investigate key peaks and easily identify the primary root causes of high consumption.

Accessing power monitoring dashboards as a cluster administrator

You can access power monitoring dashboards from the Administrator perspective of the OKD web console.

Prerequisites

You have access to the OKD web console.
You are logged in as a user with the cluster-admin role.
You have installed the Power monitoring Operator.
You have deployed Kepler in your cluster.
You have enabled monitoring for user-defined projects.

Procedure

In the Administrator perspective of the web console, go to Observe → Dashboards.
From the Dashboard drop-down list, select the power monitoring dashboard you want to see:
- Power Monitoring / Overview
- Power Monitoring / Namespace

Accessing power monitoring dashboards as a developer

You can access power monitoring dashboards from the Developer perspective of the OKD web console.

Prerequisites

You have access to the OKD web console.
You have access to the cluster as a developer or as a user.
You have installed the Power monitoring Operator.
You have deployed Kepler in your cluster.
You have enabled monitoring for user-defined projects.
You have view permissions for the namespace openshift-power-monitoring, the namespace where Kepler is deployed to.

Procedure

In the Developer perspective of the web console, go to Observe → Dashboard.
From the Dashboard drop-down list, select the power monitoring dashboard you want to see:
- Power Monitoring / Overview

Power monitoring metrics overview

The Power monitoring Operator exposes the following metrics, which you can view by using the OKD web console under the Observe → Metrics tab.

This list of exposed metrics is not definitive. Metrics might be added or removed in future releases.

Table 1. Power monitoring Operator metrics
Metric name	Description
`kepler_container_joules_total`	The aggregated package or socket energy consumption of CPU, DRAM, and other host components by a container.
`kepler_container_core_joules_total`	The total energy consumption across CPU cores used by a container. If the system has access to `RAPL_` metrics, this metric reflects the proportional container energy consumption of the RAPL Power Plan 0 (PP0), which is the energy consumed by all CPU cores in the socket.
`kepler_container_dram_joules_total`	The total energy consumption of DRAM by a container.
`kepler_container_uncore_joules_total`	The cumulative energy consumption by uncore components used by a container. The number of components might vary depending on the system. The uncore metric is processor model-specific and might not be available on some server CPUs.
`kepler_container_package_joules_total`	The cumulative energy consumed by the CPU socket used by a container. It includes all core and uncore components.
`kepler_container_other_joules_total`	The cumulative energy consumption of host components, excluding CPU and DRAM, used by a container. Generally, this metric is the energy consumption of ACPI hosts.
`kepler_container_bpf_cpu_time_us_total`	The total CPU time used by the container that utilizes the BPF tracing.
`kepler_container_cpu_cycles_total`	The total CPU cycles used by the container that utilizes hardware counters. CPU cycles is a metric directly related to CPU frequency. On systems where processors run at a fixed frequency, CPU cycles and total CPU time are roughly equivalent. On systems where processors run at varying frequencies, CPU cycles and total CPU time have different values.
`kepler_container_cpu_instructions_total`	The total CPU instructions used by the container that utilizes hardware counters. CPU instructions is a metric that accounts how the CPU is used.
`kepler_container_cache_miss_total`	The total cache miss that occurs for a container that uses hardware counters.
`kepler_container_cgroupfs_cpu_usage_us_total`	The total CPU time used by a container reading from control group statistics.
`kepler_container_cgroupfs_memory_usage_bytes_total`	The total memory in bytes used by a container reading from control group statistics.
`kepler_container_cgroupfs_system_cpu_usage_us_total`	The total CPU time in kernel space used by the container reading from control group statistics.
`kepler_container_cgroupfs_user_cpu_usage_us_total`	The total CPU time in user space used by a container reading from control group statistics.
`kepler_container_bpf_net_tx_irq_total`	The total number of packets transmitted to network cards of a container that uses the BPF tracing.
`kepler_container_bpf_net_rx_irq_total`	The total number of packets received from network cards of a container that uses the BPF tracing.
`kepler_container_bpf_block_irq_total`	The total number of block I/O calls of a container that uses the BPF tracing.
`kepler_node_info`	The node metadata, such as the node CPU architecture.
`kepler_node_core_joules_total`	The total energy consumption across CPU cores used by all containers running on a node and operating system.
`kepler_node_uncore_joules_total`	The cumulative energy consumption by uncore components used by all containers running on the node and operating system. The number of components might vary depending on the system.
`kepler_node_dram_joules_total`	The total energy consumption of DRAM by all containers running on the node and operating system.
`kepler_node_package_joules_total`	The cumulative energy consumed by the CPU socket used by all containers running on the node and operating system. It includes all core and uncore components.
`kepler_node_other_host_components_joules_total`	The cumulative energy consumption of host components, excluding CPU and DRAM, used by all containers running on the node and operating system. Generally, this metric is the energy consumption of ACPI hosts.
`kepler_node_platform_joules_total`	The total energy consumption of the host. Generally, this metric is the host energy consumption from Redfish BMC or ACPI.
`kepler_node_energy_stat`	Multiple metrics from nodes labeled with container resource utilization control group metrics that are used in the model server.
`kepler_node_accelerator_intel_qat`	The utilization of the accelerator Intel QAT on a certain node. If the system contains Intel QATs, Kepler can calculate the utilization of the node’s QATs through telemetry.

Additional resources

Enabling monitoring for user-defined projects