Monitoring CPU Usage in Kubernetes with Prometheus

Keywords: Kubernetes | Prometheus | CPU usage

Abstract: This article discusses how to accurately calculate CPU usage for containers in a Kubernetes cluster using Prometheus metrics. It addresses a common pitfall, provides queries for cluster-level and per-pod CPU usage, and explains what each part of those queries does.

Introduction

In Kubernetes environments, monitoring container CPU usage is crucial for resource optimization and performance tuning. Prometheus, a common monitoring tool, exposes metrics such as container_cpu_usage_seconds_total and process_cpu_seconds_total. Combining these metrics naively, however, can give inaccurate results, for example a usage ratio greater than 1 (that is, above 100%).

Problem Background

A common first attempt is to divide the increase of container_cpu_usage_seconds_total{id="/"} by the increase of process_cpu_seconds_total. The two metrics do not measure comparable things, however. container_cpu_usage_seconds_total{id="/"} accumulates the CPU time consumed under the root cgroup, summed across all cores of the node, whereas process_cpu_seconds_total reports only the CPU time of the single process that exposes the metric. The first increase is therefore usually much larger than the second, and the resulting ratio exceeds 1, which is meaningless as a usage figure.
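The mismatch can be illustrated with a small arithmetic sketch. All numbers below are hypothetical and chosen only to show the effect on an assumed 4-core node; they are not measurements.

```python
# Hypothetical figures for a 4-core node observed over a 60-second window.
WINDOW_SECONDS = 60.0
CORES = 4

# Increase of container_cpu_usage_seconds_total{id="/"}: CPU seconds used by
# everything under the root cgroup, summed over all cores (assumed value).
container_cpu_increase = 150.0

# Increase of process_cpu_seconds_total: CPU seconds used by the one process
# that exposes the metric (assumed value).
process_cpu_increase = 3.0

# Naive ratio: compares node-wide CPU time with one process's CPU time.
naive_ratio = container_cpu_increase / process_cpu_increase

# Normalizing by the CPU time the node could have spent gives a bounded value.
normalized = container_cpu_increase / (WINDOW_SECONDS * CORES)

if __name__ == "__main__":
    print(f"naive ratio:      {naive_ratio:.1f}")
    print(f"normalized usage: {normalized:.1%}")
```

With these assumed inputs the naive ratio is far above 1, while dividing by the available CPU time (window length times core count) stays between 0 and 1.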

Solution

The following queries, taken from the best answer to the original question, avoid this problem. First, at the cluster level, CPU usage can be calculated with sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) / sum (machine_cpu_cores) * 100, which divides the CPU time consumed per second by the number of cores and expresses the result as a percentage. Second, for each pod, sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod_name) returns the number of CPU cores the pod is using on average. Note that Kubernetes 1.16 and later expose this label as pod rather than pod_name, so the grouping label may need to be adjusted.
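As a minimal sketch, the two queries can be sent to the Prometheus HTTP API through its instant-query endpoint, /api/v1/query. The server address below is an assumption (a local Prometheus on port 9090), and the helper name instant_query_url is hypothetical; the code only builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode

# Assumption: Prometheus is reachable at this address; adjust for your setup.
PROMETHEUS_URL = "http://localhost:9090"

# Cluster-level CPU usage as a percentage of all cores.
CLUSTER_CPU_PERCENT = (
    'sum(rate(container_cpu_usage_seconds_total{id="/"}[1m]))'
    " / sum(machine_cpu_cores) * 100"
)

# Per-pod CPU usage in cores; use "pod" instead of "pod_name" on newer clusters.
POD_CPU_CORES = (
    'sum(rate(container_cpu_usage_seconds_total{image!=""}[1m])) by (pod_name)'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the instant-query endpoint with the query URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    print(instant_query_url(PROMETHEUS_URL, CLUSTER_CPU_PERCENT))
    print(instant_query_url(PROMETHEUS_URL, POD_CPU_CORES))
```

The printed URLs can be fetched with any HTTP client. URL-encoding matters here because the queries contain braces, quotes, and spaces.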

Query Details

In these queries, the rate function computes the average per-second rate of increase of a counter over the specified time window, here 1 minute, which smooths out short spikes. The selector container_cpu_usage_seconds_total{id="/"} picks the series for the root cgroup, which covers all CPU time consumed on the node, while machine_cpu_cores reports the number of CPU cores on each node and is used for normalization. In the pod-level query, the condition image!="" keeps only series that belong to an actual container image and drops the aggregate cgroup series, which would otherwise be counted twice; by (pod_name) then groups the sum so that one value is returned per pod. The complete solution on GitHub offers further detail on these queries.
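The idea behind rate and the normalization by core count can be sketched in a few lines. This is a deliberately simplified illustration, not Prometheus's implementation: the real rate function also handles counter resets and extrapolates to the window boundaries, both of which are ignored here. The function names and sample values are hypothetical.

```python
def simple_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Simplification: uses only the first and last sample and assumes the
    counter never resets inside the window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_percent(cores_in_use, total_cores):
    """Convert 'CPU seconds per second' (cores in use) into a percentage."""
    return cores_in_use / total_cores * 100


if __name__ == "__main__":
    # Hypothetical samples of container_cpu_usage_seconds_total{id="/"}.
    window = [(0, 1000.0), (30, 1045.0), (60, 1090.0)]
    used = simple_rate(window)          # 90 CPU seconds over 60 seconds
    print(f"cores in use: {used}")
    print(f"usage on a 4-core node: {cpu_percent(used, 4)}%")
```

A counter that grows by 90 CPU seconds in 60 seconds of wall time corresponds to 1.5 cores in use, which is 37.5% of an assumed 4-core node.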

Comprehensive Analysis

These queries avoid the abnormal values produced by the original method: the cluster-level result stays between 0 and 100 percent because CPU time consumed is compared with the CPU capacity actually available. The queries can also be used directly in dashboards such as Grafana. In practice, the time window should be chosen to suit the workload and the scrape interval: rate needs at least two samples inside the window, a short window reacts quickly but is noisy, and a longer window gives a smoother curve at the cost of hiding brief spikes.
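When the per-pod query is consumed by a script rather than a dashboard, the JSON returned by an instant query has to be unpacked first. The sketch below assumes the standard instant-vector response layout of the Prometheus HTTP API, in which each sample value arrives as a string. The sample payload, pod names, and helper names are invented for illustration.

```python
import json

# Hypothetical response body for the per-pod query (pod names are made up).
SAMPLE_RESPONSE = """
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"pod_name": "web-1"}, "value": [1700000000.0, "0.25"]},
  {"metric": {"pod_name": "db-0"}, "value": [1700000000.0, "1.5"]}
]}}
"""


def usage_by_pod(response_text, label="pod_name"):
    """Map each pod label value to its CPU usage in cores."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query did not succeed")
    return {
        series["metric"].get(label, "<none>"): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def over_threshold(usage, cores_limit):
    """Return the pods whose usage exceeds cores_limit, sorted by name."""
    return sorted(pod for pod, cores in usage.items() if cores > cores_limit)


if __name__ == "__main__":
    usage = usage_by_pod(SAMPLE_RESPONSE)
    print(usage)
    print(over_threshold(usage, 1.0))
```

On clusters that expose the pod label instead of pod_name, pass label="pod" to usage_by_pod.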

Conclusion

This article has shown how to calculate CPU usage accurately in Kubernetes with Prometheus. Instead of dividing two unrelated counters, CPU usage is derived from the rate of container_cpu_usage_seconds_total, normalized by machine_cpu_cores at the cluster level and grouped by pod at the pod level. Developers are advised to start from the queries given here and adapt the labels and time windows to their own clusters.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.