Monitoring CPU Usage in Kubernetes with Prometheus

Dec 01, 2025 · Programming

Keywords: Kubernetes | Prometheus | CPU usage

Abstract: This article explains how to calculate CPU usage accurately for containers in a Kubernetes cluster using Prometheus metrics. It describes a common pitfall that produces usage values above 1, gives PromQL queries for cluster-level and per-pod CPU usage, and explains what each part of those queries does.

Introduction

In Kubernetes environments, monitoring container CPU usage is crucial for resource optimization and performance tuning. Prometheus, a widely used monitoring tool, typically collects metrics such as <span class="code">container_cpu_usage_seconds_total</span> and <span class="code">process_cpu_seconds_total</span>. However, combining these metrics naively can produce inaccurate results, for example a CPU usage ratio greater than 1 (that is, above 100%).

Problem Background

A common first attempt is to divide the increment of <span class="code">container_cpu_usage_seconds_total{id="/"}</span> by the increment of <span class="code">process_cpu_seconds_total</span>. The two metrics measure different things: the former is the cumulative CPU time of the root cgroup, which covers every process on the node, while the latter is the CPU time consumed by the single process that exposes the metric. As a result, the increment of <span class="code">container_cpu_usage_seconds_total</span> is often larger than that of <span class="code">process_cpu_seconds_total</span>, and the calculated CPU usage exceeds 1, which is impossible for a true utilization fraction.
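A small numeric sketch makes the mismatch concrete. The counter samples below are hypothetical values chosen only for illustration; they are not taken from a real cluster.

```python
# Hypothetical counter samples taken 60 seconds apart (illustrative values only).
WINDOW_SECONDS = 60.0

# container_cpu_usage_seconds_total{id="/"}: CPU time of every process on the node.
node_cpu_start, node_cpu_end = 10_000.0, 10_090.0

# process_cpu_seconds_total: CPU time of the single process exposing the metric.
proc_cpu_start, proc_cpu_end = 500.0, 503.0


def increase(start: float, end: float) -> float:
    """Counter increment over the window (assumes no counter reset)."""
    return end - start


node_delta = increase(node_cpu_start, node_cpu_end)  # CPU-seconds used by the node
proc_delta = increase(proc_cpu_start, proc_cpu_end)  # CPU-seconds used by one process

# The naive ratio compares two unrelated quantities, so it is not bounded by 1.
naive_ratio = node_delta / proc_delta

# Dividing by elapsed wall-clock time instead yields the number of cores in use.
cores_in_use = node_delta / WINDOW_SECONDS

print(f"naive ratio: {naive_ratio}, cores in use: {cores_in_use}")
```

With these numbers the naive ratio is 30, while the node is actually using 1.5 cores. Only the second figure can be turned into a utilization by dividing by the node's core count.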

Solution

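To compute CPU usage correctly, the following queries, taken from the best answer, can be used. First, at the cluster level: <span class="code">sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) / sum (machine_cpu_cores) * 100</span>. This divides the CPU-seconds consumed per second across all nodes by the total number of cores and multiplies by 100 to express the result as a percentage. Second, for each pod: <span class="code">sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod_name)</span>. This returns the number of CPU cores each pod is using, not a percentage. On clusters where the label is exposed as <span class="code">pod</span> rather than <span class="code">pod_name</span> (Kubernetes 1.16 and later), group by <span class="code">pod</span> instead.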
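The two queries can be sent to the Prometheus HTTP API, whose instant-query endpoint is <span class="code">/api/v1/query</span>. The following is a minimal sketch using only the Python standard library. The base URL, the helper names, and the canned response are assumptions made for illustration; no network call is performed here.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# The two queries from the text above, unchanged.
CLUSTER_CPU_PERCENT = (
    'sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) '
    "/ sum (machine_cpu_cores) * 100"
)
POD_CPU_CORES = (
    'sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod_name)'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str, label: str = "pod_name") -> Dict[str, float]:
    """Map each series in an instant-vector response to its float value.

    Series without the given label (such as the cluster-level total)
    are stored under the empty string.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values: Dict[str, float] = {}
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        values[series["metric"].get(label, "")] = float(raw_value)
    return values


# Canned response in the shape Prometheus returns for an instant vector.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod_name": "web-1"}, "value": [1700000000, "0.25"]},
            {"metric": {"pod_name": "db-0"}, "value": [1700000000, "1.5"]},
        ],
    },
})

# Assumed address of a local Prometheus server.
url = instant_query_url("http://localhost:9090/", POD_CPU_CORES)
per_pod_cores = parse_vector(SAMPLE_BODY)
print(url)
print(per_pod_cores)
```

In a real setup the URL would be fetched with an HTTP client and the response body passed to <span class="code">parse_vector</span>. For the cluster-level query the result has no labels, so the single value appears under the empty key.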

Query Details

In these queries, the <span class="code">rate</span> function computes the per-second average rate of increase of a counter over the specified time window; a 1-minute window is used here to smooth the data. Applied to a CPU-seconds counter, the result is the number of CPU cores in use. <span class="code">container_cpu_usage_seconds_total{id="/"}</span> selects the series for the root cgroup of each node, which accounts for all CPU time on that node, while <span class="code">machine_cpu_cores</span> provides each node's CPU core count for normalization. For pod-level queries, the condition <span class="code">image!=""</span> keeps only series that belong to an actual container image and drops the aggregate cgroup series, which would otherwise be counted twice. <span class="code">by (pod_name)</span> then groups the sum so that there is one result per pod. The complete solution on GitHub contains further details.
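The behavior of <span class="code">rate</span> on a counter can be approximated in a few lines. This is a simplified sketch, not Prometheus's actual implementation: it handles counter resets but omits the extrapolation to window boundaries that Prometheus performs. The function name and sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second average increase of a counter across the given samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    total_increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The counter dropped, so it was reset (for example, a restart);
            # treat the new value as the increase since the reset.
            total_increase += current
        else:
            total_increase += current - previous
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return total_increase / elapsed


# Hypothetical CPU-seconds counter scraped every 15 seconds over one minute.
cpu_samples = [(0.0, 100.0), (15.0, 107.5), (30.0, 115.0), (45.0, 122.5), (60.0, 130.0)]
cores_used = simple_rate(cpu_samples)

# Normalizing by an assumed core count of 2 gives a percentage.
percent_used = cores_used / 2 * 100
print(cores_used, percent_used)
```

Here the counter grows by 30 CPU-seconds in 60 seconds, so the rate is 0.5 cores, or 25% of a two-core machine.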

Comprehensive Analysis

These queries avoid the abnormal values produced by the original method: because CPU time is divided by elapsed time and by the number of cores, the cluster-level result stays between 0 and 100. The queries can also be used directly in dashboard tools such as Grafana. In practice, choose the time window to suit the workload: a shorter window reacts faster to load spikes, while a longer window gives a smoother curve. In either case, the window must contain at least two scraped samples for <span class="code">rate</span> to return a value.
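One way to make the window adjustable is to generate the query from a parameter. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the check that the window spans at least twice the scrape interval is a conservative guard so that <span class="code">rate</span> normally sees two samples.

```python
def cluster_cpu_query(window_seconds: int, scrape_interval_seconds: int) -> str:
    """Return the cluster-level CPU percentage query for a given rate window."""
    if window_seconds < 2 * scrape_interval_seconds:
        raise ValueError(
            "window should cover at least two scrape intervals "
            "so that rate() has two samples to work with"
        )
    # Doubled braces produce literal { and } in the f-string.
    return (
        f'sum (rate (container_cpu_usage_seconds_total{{id="/"}}[{window_seconds}s])) '
        "/ sum (machine_cpu_cores) * 100"
    )


# Example: a 5-minute window with an assumed 15-second scrape interval.
smooth_query = cluster_cpu_query(300, 15)
print(smooth_query)
```

A dashboard could expose the window as a variable and regenerate the query in the same way.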

Conclusion

This article described how to calculate CPU usage accurately in Kubernetes with Prometheus, from the cluster level down to individual pods. The key points are to apply <span class="code">rate</span> to <span class="code">container_cpu_usage_seconds_total</span> and to normalize by <span class="code">machine_cpu_cores</span>, rather than dividing by <span class="code">process_cpu_seconds_total</span>. Developers are advised to use the queries above and to refer to the referenced solution for further detail.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.