Kubernetes Troubleshooting Guide: From Red Pods to Healthy Clusters
📅 Published: June 2026
⏱️ Estimated Reading Time: 25 minutes
🏷️ Tags: Kubernetes, Troubleshooting, Debugging, K8s Issues, Pod Failures
Introduction: The Kubernetes Debugging Mindset
Kubernetes is powerful, but when something goes wrong, the error messages can be cryptic. A pod stuck in CrashLoopBackOff tells you something is wrong, but not why. A service that won't connect gives no obvious indication of the problem.
Successful Kubernetes troubleshooting follows a systematic approach:
Identify the symptom – What isn't working?
Check the obvious – Are pods running? Is the service there?
Narrow down – Which component is failing?
Gather data – Logs, events, describe output
Find the root cause – What actually broke?
Fix and verify – Make the change, confirm it works
This guide covers the most common Kubernetes problems and how to solve them.
Part 1: The Essential Debugging Commands
# View pod status
kubectl get pods
kubectl get pods -n namespace
kubectl get pods -w                      # Watch in real-time

# Detailed pod information
kubectl describe pod my-pod
kubectl describe pod my-pod -n namespace

# View logs
kubectl logs my-pod
kubectl logs my-pod -c container-name    # Multi-container pod
kubectl logs my-pod --previous           # Previous crashed instance
kubectl logs -l app=myapp                # All pods with label

# Execute commands inside pod
kubectl exec -it my-pod -- /bin/bash
kubectl exec my-pod -- ls -la

# Port forward to pod
kubectl port-forward pod/my-pod 8080:80

# View events cluster-wide
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -n namespace

# Resource usage
kubectl top nodes
kubectl top pods
Part 2: Pod States and What They Mean
| State | Meaning | What to Do |
|---|---|---|
| Pending | Pod accepted, waiting for node or containers | Check node resources, PVC binding, image pull |
| ContainerCreating | Pod is starting, pulling images | Usually normal. If stuck, check network/registry |
| Running | Pod is running | Good state unless app has issues |
| CrashLoopBackOff | Container starts then crashes repeatedly | Check logs, check command, check health probes |
| ImagePullBackOff | Cannot pull container image | Check image name, registry access, imagePullSecrets |
| ErrImagePull | Failed to pull image | Same as ImagePullBackOff |
| CreateContainerConfigError | Config issue (missing ConfigMap/Secret) | Check ConfigMap and Secret references |
| Completed | Container exited with 0 (batch job) | Normal for jobs. Delete if not needed |
| OOMKilled | Container killed due to memory limit | Increase memory limit or fix memory leak |
| Evicted | Pod removed due to resource pressure | Check node resources, adjust requests/limits |
Part 3: Pod Troubleshooting
Problem 1: Pod Stuck in Pending
Symptoms:
kubectl get pods
NAME     READY   STATUS    RESTARTS   AGE
my-pod   0/1     Pending   0          5m
Investigation:
# Describe pod to see events
kubectl describe pod my-pod

# Common events:
# - "0/1 nodes are available: 1 Insufficient cpu"
# - "0/1 nodes are available: 1 node(s) had taint"
# - "pod has unbound immediate PersistentVolumeClaims"
Common Causes and Fixes:
Cause 1: Insufficient resources
Events:
Type     Reason            Message
Warning  FailedScheduling  0/3 nodes available: insufficient cpu
Fix: Reduce resource requests or add nodes.
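As a minimal sketch, right-sizing the requests in the container spec might look like this (the values are placeholders; base them on actual usage from kubectl top pods):

resources:
  requests:
    cpu: "250m"        # request only what the app typically needs
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"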
Cause 2: Node taints
Events:
Warning  FailedScheduling  0/3 nodes available: 3 node(s) had taint
Fix: Add toleration to pod:
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
Cause 3: Unbound PersistentVolumeClaim
Events:
Warning  FailedScheduling  pod has unbound immediate PersistentVolumeClaims
Fix: Check PVC status:
kubectl get pvc
kubectl describe pvc my-pvc
Problem 2: Pod in CrashLoopBackOff
Symptoms:
kubectl get pods
NAME     READY   STATUS             RESTARTS   AGE
my-pod   0/1     CrashLoopBackOff   5          10m
Investigation:
# Check current logs
kubectl logs my-pod

# Check logs from previous crashed instance
kubectl logs my-pod --previous

# Describe pod for events
kubectl describe pod my-pod
Common Causes and Fixes:
Cause 1: Application error on startup
kubectl logs my-pod --previous
# Error: Cannot find module '/app/server.js'
Fix: Check Dockerfile, ensure files are copied correctly.
Cause 2: Missing environment variable or ConfigMap
kubectl logs my-pod --previous
# Error: DATABASE_URL environment variable not set
Fix: Add missing environment variable or ConfigMap.
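A hedged sketch of wiring the missing variable from a ConfigMap (DATABASE_URL comes from the error above; app-config and its key are assumed names):

env:
- name: DATABASE_URL
  valueFrom:
    configMapKeyRef:
      name: app-config       # assumed ConfigMap name
      key: database-url      # assumed key inside it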
Cause 3: Command or arguments incorrect
# Check pod spec
kubectl get pod my-pod -o yaml | grep -A5 command
Fix: Correct the command or arguments.
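For reference, command and args in the pod spec override the image's ENTRYPOINT and CMD respectively; a minimal sketch (the image tag, node, and server.js are illustrative):

containers:
- name: app
  image: myapp:1.0           # illustrative tag
  command: ["node"]          # replaces the image ENTRYPOINT
  args: ["server.js"]        # replaces the image CMD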
Cause 4: OOMKilled (exit code 137)
kubectl describe pod my-pod | grep -A5 State
# State:      Terminated
#   Reason:   OOMKilled
Fix: Increase memory limit or fix memory leak.
Cause 5: Liveness probe failing
kubectl describe pod my-pod | grep -A10 Liveness
Fix: Adjust probe settings (initialDelaySeconds, periodSeconds, timeoutSeconds).
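A sketch of a more forgiving probe configuration (the /healthz path and port are assumptions; tune the timings to your app's real startup time):

livenessProbe:
  httpGet:
    path: /healthz            # assumed health endpoint
    port: 8080
  initialDelaySeconds: 30     # give the app time to boot before the first probe
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3         # restart only after repeated failures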
Problem 3: ImagePullBackOff
Symptoms:
kubectl get pods
NAME     READY   STATUS             RESTARTS   AGE
my-pod   0/1     ImagePullBackOff   0          2m
Investigation:
kubectl describe pod my-pod
# Events:
#   Failed to pull image "myapp:latest": rpc error: code = NotFound
Common Causes and Fixes:
Cause 1: Wrong image name
Fix: Check and correct image name in pod spec.
Cause 2: Image doesn't exist in registry
Fix: Build and push the image, or use correct tag.
Cause 3: Private registry needs authentication
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.io \
  --docker-username=user \
  --docker-password=pass

# Add to pod spec
spec:
  imagePullSecrets:
  - name: regcred
Cause 4: Docker Hub rate limit exceeded
Fix: Use authenticated pulls or different registry.
Problem 4: CreateContainerConfigError
Symptoms:
kubectl get pods
NAME     READY   STATUS                       RESTARTS   AGE
my-pod   0/1     CreateContainerConfigError   0          1m
Investigation:
kubectl describe pod my-pod
# Events:
#   Error: configmap "app-config" not found
Common Causes and Fixes:
Cause 1: ConfigMap doesn't exist
Fix: Create the ConfigMap or fix the reference.
Cause 2: Secret doesn't exist
Fix: Create the Secret or fix the reference.
Cause 3: ConfigMap key doesn't exist
Fix: Check ConfigMap content and reference correct key.
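For instance, creating the missing ConfigMap from the example above (app-config and the key/value are placeholders):

kubectl create configmap app-config \
  --from-literal=database-url=postgres://db:5432/app

# Verify the keys the pod references actually exist
kubectl get configmap app-config -o yaml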
Part 4: Service Troubleshooting
Problem 5: Cannot Connect to Service
Symptoms:
Connection refused or timeout when accessing service
curl from another pod fails
Investigation:
# Check service exists
kubectl get svc
# NAME         TYPE        CLUSTER-IP   PORT(S)
# my-service   ClusterIP   10.96.0.1    8080/TCP

# Check endpoints (should have pod IPs)
kubectl get endpoints my-service
# NAME         ENDPOINTS
# my-service   10.244.1.5:8080,10.244.2.3:8080

# If endpoints are empty, selector doesn't match
kubectl describe svc my-service | grep Selector
# Selector: app=myapp

# Check pod labels
kubectl get pods --show-labels

# Test connectivity from test pod
kubectl run test --image=busybox -it --rm -- /bin/sh
wget -O- http://my-service:8080
Common Causes and Fixes:
Cause 1: Selector doesn't match pod labels
Fix: Update service selector or pod labels.
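A side-by-side sketch of a selector and labels that agree (app: myapp is a placeholder):

# Service
spec:
  selector:
    app: myapp        # must match the pod labels exactly

# Pod template
metadata:
  labels:
    app: myapp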
Cause 2: Wrong target port
# Check service ports
kubectl get svc my-service -o yaml | grep -A5 ports
Fix: Correct targetPort to match containerPort.
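To make the relationship concrete, a sketch of matching ports (8080 is a placeholder):

# Service
ports:
- port: 8080          # port clients connect to
  targetPort: 8080    # must match the containerPort below

# Pod spec
containers:
- name: app
  ports:
  - containerPort: 8080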
Cause 3: Pods not ready
kubectl get pods
# Pods in 0/1 Ready state don't get traffic

Fix: Check pod logs for startup issues.
Cause 4: Network policy blocking traffic
kubectl get networkpolicies
kubectl describe networkpolicy default-deny
Fix: Add ingress rule to allow traffic.
Cause 5: Service type is ClusterIP (can't access from outside)
kubectl get svc
# TYPE: ClusterIP

Fix: Use NodePort or LoadBalancer for external access.
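A minimal NodePort sketch for external access (my-service, app=myapp, and the ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  selector:
    app: myapp
  ports:
  - port: 8080
    targetPort: 8080
    nodePort: 30080   # optional; must fall in the default 30000-32767 range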
Part 5: Node Troubleshooting
Problem 6: Node Not Ready
Symptoms:
kubectl get nodes
NAME       STATUS     ROLES    AGE
worker-1   NotReady   <none>   10d
Investigation:
# Describe node for details
kubectl describe node worker-1

# Check node conditions
# Conditions:
#   Ready:            Unknown
#   MemoryPressure:   False
#   DiskPressure:     True
#   PIDPressure:      False
Common Causes and Fixes:
Cause 1: Disk pressure
# SSH to node
ssh worker-1

# Check disk space
df -h

# Clean up
docker system prune -a

# Delete Evicted pods (with --all-namespaces, column 1 is the namespace
# and column 2 is the pod name, so pass both to kubectl)
kubectl get pods --all-namespaces | grep Evicted | \
  awk '{print $2, "-n", $1}' | xargs -L1 kubectl delete pod
Cause 2: Kubelet not running
# On node
systemctl status kubelet
journalctl -u kubelet -n 50
Fix: systemctl restart kubelet
Cause 3: Node unreachable from control plane
# From master
ping worker-1
Fix: Check network connectivity, firewall rules.
Cause 4: Memory pressure
free -h
Fix: Reduce pod memory usage or add more memory.
Problem 7: Node Out of Memory (OOM)
Symptoms:
Pods being evicted
kubectl top nodes shows high memory usage
Investigation:
# Check memory usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory

# Check pod limits
kubectl get pods -o custom-columns=NAME:.metadata.name,MEMORY_LIMIT:.spec.containers[0].resources.limits.memory

# Check node memory pressure
kubectl describe node worker-1 | grep -A5 Conditions
Fixes:
Increase pod memory limits
Add more nodes to cluster
Identify and fix memory leaks in applications
Use vertical-pod-autoscaler to auto-adjust memory (see the sketch below)
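Assuming the Vertical Pod Autoscaler components are installed in the cluster, a sketch of a VPA object targeting a hypothetical Deployment named myapp:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"   # let VPA evict and resize pods automatically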
Part 6: Ingress Troubleshooting
Problem 8: Ingress Not Routing Traffic
Symptoms:
curl to Ingress host returns 404 or connection refused
Browser shows "No healthy upstream"
Investigation:
# Check Ingress exists
kubectl get ingress
# NAME         CLASS   HOSTS         ADDRESS
# my-ingress   nginx   example.com   10.0.0.1

# Describe Ingress
kubectl describe ingress my-ingress

# Check Ingress Controller logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
Common Causes and Fixes:
Cause 1: Service doesn't exist or has no endpoints
kubectl get endpoints my-service
# Should show pod IPs

Fix: Check service selector matches pod labels.
Cause 2: Wrong service port
# Ingress rule
backend:
  service:
    name: my-service
    port:
      number: 80   # Must match service port
Fix: Correct the port number.
Cause 3: TLS secret missing
spec:
  tls:
  - hosts:
    - example.com
    secretName: example-tls   # Secret must exist
Fix: Create TLS secret or remove tls section.
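Creating the secret from an existing certificate and key might look like this (tls.crt and tls.key are assumed file names):

kubectl create secret tls example-tls \
  --cert=tls.crt \
  --key=tls.key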
Cause 4: Host header doesn't match
curl -H "Host: example.com" http://ingress-ip
Fix: Use correct hostname or configure default backend.
Part 7: Storage Troubleshooting
Problem 9: PVC Stuck in Pending
Symptoms:
kubectl get pvc
NAME     STATUS    VOLUME   CAPACITY
my-pvc   Pending
Investigation:
kubectl describe pvc my-pvc
# Events:
#   FailedBinding: no persistent volumes available for this claim
Common Causes and Fixes:
Cause 1: No matching PV for static provisioning
Fix: Create a PV that matches the PVC requirements:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /mnt/data
Cause 2: Storage Class doesn't support dynamic provisioning
kubectl get storageclass
# PROVISIONER column should not be empty

Fix: Install CSI driver or create PV manually.
Cause 3: StorageClass default not set
kubectl get storageclass
# standard (default)   kubernetes.io/aws-ebs

Fix: Set default class:

kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Cause 4: No nodes available in topology zone
Fix: Use WaitForFirstConsumer volumeBindingMode.
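A sketch of a StorageClass with delayed binding (the name and provisioner are assumptions; use the CSI driver your cluster actually runs):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-wffc          # hypothetical name
provisioner: ebs.csi.aws.com   # assumed provisioner
volumeBindingMode: WaitForFirstConsumer   # bind only once a pod is scheduled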
Part 8: Network Policy Troubleshooting
Problem 10: Network Policy Blocking Traffic
Symptoms:
Pods cannot communicate after applying Network Policies
kubectl exec connectivity tests fail
Investigation:
# List all Network Policies
kubectl get netpol

# Check default deny policy
kubectl get netpol -o yaml | grep -A10 default-deny

# Test connectivity from source pod
kubectl exec source-pod -- ping target-pod-ip
kubectl exec source-pod -- wget -O- http://target-service
Common Fixes:
Fix 1: Add allow rule for namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: my-namespace
Fix 2: Add allow rule for specific pod
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
Part 9: Logging and Debugging Tools
Ephemeral Debug Container
Kubernetes 1.23+ supports ephemeral containers for debugging:
kubectl debug -it my-pod --image=busybox --target=my-container
Debugging with Netshoot
# Run netshoot pod in same namespace
kubectl run tmp-shell --rm -it --image nicolaka/netshoot -- /bin/bash

# Test connectivity
dig my-service
curl my-service:8080
Accessing Pod Filesystem
# Copy files from pod
kubectl cp my-pod:/var/log/app.log ./app.log

# Copy files to pod
kubectl cp ./config.json my-pod:/app/config.json
Part 10: Quick Troubleshooting Decision Tree
Pod not running
├── Check status: kubectl get pods
│
├── Pending
│ ├── Check events: kubectl describe pod
│ ├── Insufficient CPU/memory → Add nodes or reduce requests
│ ├── PVC not bound → Check storage
│ └── Node taints → Add tolerations
│
├── ImagePullBackOff / ErrImagePull
│ ├── Check image name
│ ├── Check registry access
│ └── Add imagePullSecrets
│
├── CrashLoopBackOff
│ ├── Check logs: kubectl logs --previous
│ ├── Application error → Fix app
│ ├── Missing config → Add ConfigMap/Secret
│ └── OOMKilled → Increase memory limit
│
├── CreateContainerConfigError
│ └── Missing ConfigMap or Secret → Create it
│
└── Running but not responding
├── Check readiness probe
├── Check service endpoints
    └── Check network policy

Quick Reference Card
| Problem | First Command | Most Common Fix |
|---|---|---|
| Pod pending | kubectl describe pod | Reduce resource requests |
| CrashLoopBackOff | kubectl logs --previous | Fix app error or add ConfigMap |
| ImagePullBackOff | kubectl describe pod | Correct image name or add secret |
| Service not working | kubectl get endpoints | Fix selector or port |
| Node not ready | kubectl describe node | Free disk space or restart kubelet |
| Ingress 404 | kubectl get ingress | Check service endpoints |
| PVC pending | kubectl describe pvc | Create PV or add StorageClass |
Learn More
Practice Kubernetes troubleshooting with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/