Debugging Kubernetes Nodes in 'Not Ready' State

Keywords: Kubernetes | Node Debugging | Not Ready State

Abstract: This article is a guide to troubleshooting Kubernetes nodes stuck in the 'Not Ready' state. It covers a systematic debugging approach: node status inspection via kubectl describe, kubelet log analysis, and system service verification. Based on practical operational experience, the guide addresses common causes such as network connectivity failures, resource pressure, and certificate authentication problems, with code examples and step-by-step instructions.

Node Status Diagnosis Fundamentals

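When a Kubernetes node shows 'Not Ready' (displayed as NotReady in the output of kubectl get nodes), it helps to understand how node health is checked. The kubelet on each node periodically reports node status to the control plane, including resource availability and a set of condition states. If the Ready condition is False, or if the control plane stops receiving reports and marks the condition Unknown, the node is shown as NotReady.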

Core Debugging Steps

Use the kubectl describe nodes command to obtain detailed node information, focusing on the following sections:

Conditions:
  Type              Status
  ----              ------
  OutOfDisk         False
  MemoryPressure    False
  DiskPressure      False
  Ready             True
Capacity:
 cpu:       2
 memory:    2052588Ki
 pods:      110
Allocatable:
 cpu:       2
 memory:    1950188Ki
 pods:      110

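The Conditions section lists the node's health conditions. In the healthy example above, Ready is True. If Ready is False or Unknown, further investigation is required; the Reason and Message columns of that condition, omitted from the example, usually name the cause. OutOfDisk appears only on older Kubernetes releases; current releases no longer report it and add PIDPressure.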
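On a larger cluster it can help to list only the affected nodes before describing them one by one. The following is a minimal sketch, not a definitive tool: the function name list_not_ready is hypothetical, and it assumes the default column layout of kubectl get nodes with headers suppressed.

```shell
# Print the name and status of every node whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Assumes the default column order: NAME STATUS ROLES AGE VERSION.
list_not_ready() {
  awk '$2 !~ /^Ready(,|$)/ { print $1, $2 }'
}

# Usage on a machine with cluster access:
#   kubectl get nodes --no-headers | list_not_ready
# Then print the conditions of one affected node, one per line:
#   kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
```

Cordoned but healthy nodes (Ready,SchedulingDisabled) are deliberately skipped, while NotReady,SchedulingDisabled nodes are still listed.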

Kubelet Log Analysis

Connect to the problematic node via SSH and check the kubelet service status and log output:

systemctl status kubelet
journalctl -u kubelet

Common errors include certificate authentication issues, network connection failures, and resource shortages. Logs typically provide detailed error messages pointing to the root cause.
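The full kubelet journal can be long, so filtering it for typical failure keywords is often a quicker first pass. This is a sketch under stated assumptions: the function name filter_kubelet_errors and the keyword list are illustrative choices, not part of any Kubernetes tooling.

```shell
# Keep only log lines that mention common failure keywords (case-insensitive).
# The keyword list is an assumption for illustration; extend it as needed.
filter_kubelet_errors() {
  grep -iE 'error|fail|x509|certificate|unauthorized|connection refused|timeout|no space left'
}

# Usage on the affected node:
#   journalctl -u kubelet --since "30 minutes ago" --no-pager | filter_kubelet_errors | tail -n 50
```

Limiting the time window with --since keeps the output focused on the period when the node changed state.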

System Service Verification

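Ensure the services the node depends on are running. The example checks Docker; on nodes that use containerd or CRI-O as the container runtime, check that service instead: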

systemctl status docker
systemctl status kubelet

If a service has failed or is inactive, restart it:

systemctl daemon-reload
systemctl restart kubelet

Resource Monitoring and Verification

Check the node's resource usage to ensure no resource pressure exists:

df -h  # Check disk space
free -m  # Check memory usage
top  # Check CPU and memory utilization

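Pay special attention to space usage under /var. The kubelet keeps its working data in /var/lib/kubelet by default, and container runtimes typically store images and container data under /var/lib as well (for example /var/lib/docker or /var/lib/containerd), so a full /var commonly triggers DiskPressure.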
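The disk check above can be turned into a simple threshold test. The sketch below is an illustration only: the function name disk_over_threshold is hypothetical, and the 85% default is an arbitrary example value, not a Kubernetes eviction setting.

```shell
# Print mount points whose usage is at or above a threshold (default 85%).
# Reads POSIX-format `df -P` output on stdin.
# The 85% default is an illustrative assumption, not a Kubernetes setting.
disk_over_threshold() {
  awk -v limit="${1:-85}" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit + 0) print $6, $5 }'
}

# Usage on the affected node:
#   df -P /var /var/lib/kubelet | disk_over_threshold 85
# Largest directories directly under /var (GNU du and sort):
#   du -xh --max-depth=1 /var 2>/dev/null | sort -h | tail -n 10
```

Mount points containing spaces are not handled, which is usually acceptable for the system paths checked here.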

Network Connectivity Testing

Even when nodes can ping each other, Kubernetes-specific network requirements still need to be verified:

ping <control-plane-endpoint>
telnet <api-server-ip> 6443

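Ensure the node can reach the Kubernetes API server, which is a prerequisite for the kubelet to function. Port 6443 is the API server's default secure port; substitute your own if the cluster uses a different one.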
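Where telnet is not installed, the same TCP check can be done with bash alone. This is a minimal sketch assuming bash and the coreutils timeout command are available on the node; the function name check_tcp is hypothetical.

```shell
# Return 0 if a TCP connection to host ($1) and port ($2) succeeds within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and the coreutils `timeout` command.
check_tcp() {
  timeout 3 bash -c 'exec 3<>/dev/tcp/$0/$1' "$1" "$2" 2>/dev/null
}

# Usage (the address is a placeholder):
#   check_tcp <api-server-ip> 6443 && echo "API server port reachable" || echo "cannot reach API server port"
# If the port is open, an HTTPS-level probe may also help:
#   curl -k https://<api-server-ip>:6443/healthz
```

The curl probe returns a useful answer only if the cluster permits unauthenticated access to the health endpoint, which is an assumption about your configuration. A successful TCP connection with a failing HTTPS request points toward TLS or authentication problems rather than routing or firewall rules.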

System Component Inspection

Verify the operational status of Kubernetes system components:

kubectl get pods -n kube-system

Check if core components like network plugins and DNS services are running normally, as failures in these components can also cause node status abnormalities.
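To spot failing system pods quickly, the pod list can be filtered by its STATUS column. The following sketch assumes the default column layout of kubectl get pods with headers suppressed; the function name list_unhealthy_pods is hypothetical.

```shell
# Print kube-system pods that are neither Running nor Completed.
# Reads `kubectl get pods -n kube-system --no-headers` output on stdin.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
list_unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage on a machine with cluster access:
#   kubectl get pods -n kube-system --no-headers | list_unhealthy_pods
# Show only the system pods scheduled on the affected node:
#   kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name>
```

A pod can be Running while its containers are not ready (for example READY 0/1), so the READY column is still worth a manual look.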

Fault Recovery Process

Take appropriate remediation measures based on diagnostic results:

For certificate issues, regenerate or update node certificates
For resource shortages, clean up space or add resources
For service abnormalities, restart relevant services
For network problems, check network configuration and firewall rules
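Before regenerating certificates, it is worth confirming that expiry is actually the problem. The sketch below uses openssl x509 -checkend; the function name cert_valid_for_days is hypothetical, and the certificate path in the usage comments is the kubeadm default, which is an assumption about how the node was provisioned.

```shell
# Return 0 if the certificate in file $1 stays valid for at least $2 more days (default 7).
# Uses `openssl x509 -checkend`, which takes the window in seconds.
cert_valid_for_days() {
  openssl x509 -checkend "$(( ${2:-7} * 86400 ))" -noout -in "$1" >/dev/null
}

# Usage on a kubeadm-provisioned node (adjust the path for other installers):
#   cert_valid_for_days /var/lib/kubelet/pki/kubelet-client-current.pem 7 \
#     || echo "kubelet client certificate is expired or expires within 7 days"
# Print the expiry date directly:
#   openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem
```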

After remediation, recheck node status:

kubectl get nodes
kubectl describe node <node-name>
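A node may take a short while to report Ready after a fix, so polling is more reliable than a single check. This is a minimal sketch: the function name wait_for_ready, the KUBECTL override variable, and the default of 12 attempts at 10-second intervals are all illustrative assumptions.

```shell
# Poll the node's Ready condition until it is "True" or the attempts run out.
# KUBECTL can be overridden (for example, to point at a wrapper); it defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_ready() {
  node=$1; attempts=${2:-12}; delay=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Read only the status field of the condition whose type is Ready.
    status=$("$KUBECTL" get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    if [ "$status" = "True" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: wait_for_ready <node-name> 12 10 && echo "node is Ready"
# kubectl also provides a built-in equivalent:
#   kubectl wait --for=condition=Ready node/<node-name> --timeout=120s
```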
