Keywords: Kubernetes | Node Debugging | Not Ready State
Abstract: This article provides a comprehensive guide for troubleshooting Kubernetes nodes stuck in 'Not Ready' state. It covers systematic debugging approaches including node status inspection via kubectl describe, kubelet log analysis, and system service verification. Based on practical operational experience, the guide addresses common issues like network connectivity, resource pressure, and certificate authentication problems with detailed code examples and step-by-step instructions.
Node Status Diagnosis Fundamentals
When Kubernetes nodes display 'Not Ready' status, it's essential to understand the node health checking mechanism. Kubernetes relies on kubelet to periodically report node status to the control plane, including resource availability and various condition states.
Core Debugging Steps
Use the kubectl describe nodes command to obtain detailed node information, focusing on the following sections:
Conditions:
Type Status
---- ------
OutOfDisk False
MemoryPressure False
DiskPressure False
Ready True
Capacity:
cpu: 2
memory: 2052588Ki
pods: 110
Allocatable:
cpu: 2
memory: 1950188Ki
pods: 110
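The Conditions section reports the node's health signals. If Ready is False or Unknown, the node is shown as NotReady and needs further investigation; Unknown usually means the control plane has stopped receiving status updates from the kubelet. The full output also includes Reason and Message columns (omitted in the abbreviated listing above), which often name the cause directly. Newer Kubernetes releases no longer report OutOfDisk and add conditions such as PIDPressure and NetworkUnavailable.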
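On a cluster with many nodes, it can help to list only the affected ones before running describe on each. The following is a minimal sketch that filters the output of kubectl get nodes; the function name not_ready_nodes is illustrative (not part of kubectl), and it assumes the default column layout in which the second column is STATUS.

```shell
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print the name and status of every node whose first STATUS component is
# not "Ready". Assumes the default column order: NAME STATUS ROLES AGE VERSION.
not_ready_nodes() {
    awk '{
        # STATUS can be a comma-separated list, e.g. "Ready,SchedulingDisabled"
        split($2, parts, ",")
        if (parts[1] != "Ready") print $1, $2
    }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```

The second usage line reads the Ready condition of a single node directly, which is convenient in scripts.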
Kubelet Log Analysis
Connect to the problematic node via SSH and check the kubelet service status and log output:
systemctl status kubelet
journalctl -u kubelet
Common errors include certificate authentication issues, network connection failures, and resource shortages. Logs typically provide detailed error messages pointing to the root cause.
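Kubelet logs can be long, so narrowing them to the failure classes mentioned above is often a useful first pass. The sketch below is one way to do that; the function name kubelet_errors and its keyword list are assumptions chosen for illustration, not an exhaustive catalogue of kubelet messages.

```shell
# kubelet_errors: keep only log lines that match common failure keywords
# (certificates, connectivity, resource pressure). The keyword list is an
# illustrative assumption and should be adjusted to the environment.
kubelet_errors() {
    grep -iE 'x509|certificate|connection refused|i/o timeout|no route to host|eviction|out of memory|no space left'
}

# Usage (on the affected node):
#   journalctl -u kubelet --since "1 hour ago" --no-pager | kubelet_errors | tail -n 50
```

If the filter returns nothing, reading the unfiltered tail of the log is still worthwhile, since the cause may use wording the keyword list does not cover.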
System Service Verification
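Confirm that the kubelet and the container runtime are running. The example uses Docker; on nodes that run containerd or another runtime, check that service instead: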
systemctl status docker
systemctl status kubelet
If a service has failed or is inactive, reload the unit files and restart it:
systemctl daemon-reload
systemctl restart kubelet
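The status checks above can be wrapped in a small loop so that several services are checked in one pass. This is a sketch under stated assumptions: check_services is a hypothetical helper name, and the SYSTEMCTL variable exists only so the command can be swapped out for a dry run.

```shell
# check_services: report whether each named systemd unit is active.
# Returns non-zero if any unit is not active.
# SYSTEMCTL may be overridden (e.g. for dry runs); it defaults to systemctl.
check_services() {
    rc=0
    for svc in "$@"; do
        if "${SYSTEMCTL:-systemctl}" is-active --quiet "$svc"; then
            echo "$svc: active"
        else
            echo "$svc: NOT active"
            rc=1
        fi
    done
    return "$rc"
}

# Usage: check_services kubelet docker   # substitute containerd if that is the runtime
```

Reporting only, rather than restarting automatically, keeps the helper safe to run repeatedly while the cause is still being investigated.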
Resource Monitoring and Verification
Check the node's resource usage to ensure no resource pressure exists:
df -h # Check disk space
free -m # Check memory usage
top # Check CPU and memory utilization
Pay special attention to space usage under /var, which by default holds kubelet state (/var/lib/kubelet), container runtime data (for example /var/lib/docker), and logs.
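To make the disk check repeatable, the df output can be filtered for file systems that are close to full. The sketch below assumes the POSIX output format of df -P; the function name full_filesystems and the default threshold of 85 percent are arbitrary illustrative choices, not values taken from Kubernetes.

```shell
# full_filesystems: read `df -P` output on stdin and print the mount point and
# usage of every file system at or above the given percentage.
# The default limit of 85 is an arbitrary example value.
full_filesystems() {
    awk -v limit="${1:-85}" 'NR > 1 {
        use = $5                 # Capacity column, e.g. "90%"
        sub(/%/, "", use)        # strip the percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage: df -P /var | full_filesystems 85
```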
Network Connectivity Testing
Even when nodes can ping each other, Kubernetes-specific network requirements still need to be verified, in particular TCP access from the node to the API server port:
ping <control-plane-endpoint>
telnet <api-server-ip> 6443
Ensure nodes can access the Kubernetes API server, which is a prerequisite for kubelet to function properly.
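Minimal node images do not always ship telnet. As an alternative, the sketch below tests the TCP connection using bash's /dev/tcp pseudo-device together with the coreutils timeout command; it assumes both bash and timeout are installed, and check_port is a hypothetical helper name.

```shell
# check_port HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and coreutils `timeout`, so neither
# telnet nor nc needs to be installed on the node.
check_port() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage: check_port <api-server-ip> 6443 && echo reachable || echo unreachable
```

A successful connection only shows that the port is reachable; authentication or certificate problems would still have to be diagnosed from the kubelet logs.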
System Component Inspection
Verify the operational status of Kubernetes system components:
kubectl get pods -n kube-system
Check if core components like network plugins and DNS services are running normally, as failures in these components can also cause node status abnormalities.
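When kube-system contains many pods, filtering for the ones that are not healthy makes problems easier to spot. The following sketch assumes the default column layout of kubectl get pods; unhealthy_pods is an illustrative name, not a kubectl feature.

```shell
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
# print the name and status of pods that are neither Running nor Completed.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage: kubectl get pods -n kube-system --no-headers | unhealthy_pods
```

This filter looks only at the STATUS column, so a pod that is Running but not fully ready (for example READY 0/1) would not be listed and should be checked separately.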
Fault Recovery Process
Take appropriate remediation measures based on diagnostic results:
- For certificate issues, regenerate or update node certificates
- For resource shortages, clean up space or add resources
- For service abnormalities, restart relevant services
- For network problems, check network configuration and firewall rules
After remediation, recheck node status:
kubectl get nodes
kubectl describe node <node-name>