Kubernetes Troubleshooting Guide: From Red Pods to Healthy Clusters
📅 Published: June 2026
⏱️ Estimated Reading Time: 25 minutes
🏷️ Tags: Kubernetes, Troubleshooting, Debugging, K8s Issues, Pod Failures
Introduction: The Kubernetes Debugging Mindset
Kubernetes is powerful, but when something goes wrong, the error messages can be cryptic. A pod stuck in CrashLoopBackOff tells you something is wrong, but not why. A service that won't connect gives no obvious indication of the problem.
Successful Kubernetes troubleshooting follows a systematic approach:
Identify the symptom – What isn't working?
Check the obvious – Are pods running? Is the service there?
Narrow down – Which component is failing?
Gather data – Logs, events, describe output
Find the root cause – What actually broke?
Fix and verify – Make the change, confirm it works
This guide covers the most common Kubernetes problems and how to solve them.
Part 1: The Essential Debugging Commands
# View pod status
kubectl get pods
kubectl get pods -n namespace
kubectl get pods -w                      # Watch in real-time

# Detailed pod information
kubectl describe pod my-pod
kubectl describe pod my-pod -n namespace

# View logs
kubectl logs my-pod
kubectl logs my-pod -c container-name    # Multi-container pod
kubectl logs my-pod --previous           # Previous crashed instance
kubectl logs -l app=myapp                # All pods with label

# Execute commands inside pod
kubectl exec -it my-pod -- /bin/bash
kubectl exec my-pod -- ls -la

# Port forward to pod
kubectl port-forward pod/my-pod 8080:80

# View events cluster-wide
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -n namespace

# Resource usage
kubectl top nodes
kubectl top pods
Part 2: Pod States and What They Mean
| State | Meaning | What to Do |
|---|---|---|
| Pending | Pod accepted, waiting for node or containers | Check node resources, PVC binding, image pull |
| ContainerCreating | Pod is starting, pulling images | Usually normal. If stuck, check network/registry |
| Running | Pod is running | Good state unless app has issues |
| CrashLoopBackOff | Container starts then crashes repeatedly | Check logs, check command, check health probes |
| ImagePullBackOff | Cannot pull container image | Check image name, registry access, imagePullSecrets |
| ErrImagePull | Failed to pull image | Same as ImagePullBackOff |
| CreateContainerConfigError | Config issue (missing ConfigMap/Secret) | Check ConfigMap and Secret references |
| Completed | Container exited with 0 (batch job) | Normal for jobs. Delete if not needed |
| OOMKilled | Container killed due to memory limit | Increase memory limit or fix memory leak |
| Evicted | Pod removed due to resource pressure | Check node resources, adjust requests/limits |
Part 3: Pod Troubleshooting
Problem 1: Pod Stuck in Pending
Symptoms:
kubectl get pods
NAME     READY   STATUS    RESTARTS   AGE
my-pod   0/1     Pending   0          5m
Investigation:
# Describe pod to see events
kubectl describe pod my-pod

# Common events:
# - "0/1 nodes are available: 1 Insufficient cpu"
# - "0/1 nodes are available: 1 node(s) had taint"
# - "pod has unbound immediate PersistentVolumeClaims"
Common Causes and Fixes:
Cause 1: Insufficient resources
Events:
Type     Reason            Message
Warning  FailedScheduling  0/3 nodes available: insufficient cpu
Fix: Reduce resource requests or add nodes.
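As a minimal sketch, right-sizing the requests in the container spec might look like this (the values are placeholders; base them on actual usage from kubectl top pods):

resources:
  requests:
    cpu: "250m"        # request only what the app typically needs
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"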
Cause 2: Node taints
Events:
Warning  FailedScheduling  0/3 nodes available: 3 node(s) had taint
Fix: Add toleration to pod:
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
Cause 3: Unbound PersistentVolumeClaim
Events:
Warning  FailedScheduling  pod has unbound immediate PersistentVolumeClaims
Fix: Check PVC status:
kubectl get pvc
kubectl describe pvc my-pvc
Problem 2: Pod in CrashLoopBackOff
Symptoms:
kubectl get pods
NAME     READY   STATUS             RESTARTS   AGE
my-pod   0/1     CrashLoopBackOff   5          10m
Investigation:
# Check current logs
kubectl logs my-pod

# Check logs from previous crashed instance
kubectl logs my-pod --previous

# Describe pod for events
kubectl describe pod my-pod
Common Causes and Fixes:
Cause 1: Application error on startup
kubectl logs my-pod --previous
# Error: Cannot find module '/app/server.js'
Fix: Check Dockerfile, ensure files are copied correctly.
Cause 2: Missing environment variable or ConfigMap
kubectl logs my-pod --previous
# Error: DATABASE_URL environment variable not set
Fix: Add missing environment variable or ConfigMap.
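A hedged sketch of wiring the missing variable from a ConfigMap (DATABASE_URL comes from the error above; app-config and its key are assumed names):

env:
- name: DATABASE_URL
  valueFrom:
    configMapKeyRef:
      name: app-config       # assumed ConfigMap name
      key: database-url      # assumed key inside it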
Cause 3: Command or arguments incorrect
# Check pod spec
kubectl get pod my-pod -o yaml | grep -A5 command
Fix: Correct the command or arguments.
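For reference, command and args in the pod spec override the image's ENTRYPOINT and CMD respectively; a minimal sketch (the image tag, node, and server.js are illustrative):

containers:
- name: app
  image: myapp:1.0           # illustrative tag
  command: ["node"]          # replaces the image ENTRYPOINT
  args: ["server.js"]        # replaces the image CMD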
Cause 4: OOMKilled (exit code 137)
kubectl describe pod my-pod | grep -A5 State
# State:      Terminated
#   Reason:   OOMKilled
Fix: Increase memory limit or fix memory leak.
Cause 5: Liveness probe failing
kubectl describe pod my-pod | grep -A10 Liveness
Fix: Adjust probe settings (initialDelaySeconds, periodSeconds, timeoutSeconds).
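A sketch of a more forgiving probe configuration (the /healthz path and port are assumptions; tune the timings to your app's real startup time):

livenessProbe:
  httpGet:
    path: /healthz            # assumed health endpoint
    port: 8080
  initialDelaySeconds: 30     # give the app time to boot before the first probe
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3         # restart only after repeated failures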
Problem 3: ImagePullBackOff
Symptoms:
kubectl get pods
NAME     READY   STATUS             RESTARTS   AGE
my-pod   0/1     ImagePullBackOff   0          2m
Investigation:
kubectl describe pod my-pod
# Events:
#   Failed to pull image "myapp:latest": rpc error: code = NotFound
Common Causes and Fixes:
Cause 1: Wrong image name
Fix: Check and correct image name in pod spec.
Cause 2: Image doesn't exist in registry
Fix: Build and push the image, or use correct tag.
Cause 3: Private registry needs authentication
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.io \
  --docker-username=user \
  --docker-password=pass

# Add to pod spec
spec:
  imagePullSecrets:
  - name: regcred
Cause 4: Docker Hub rate limit exceeded
Fix: Use authenticated pulls or different registry.
Problem 4: CreateContainerConfigError
Symptoms:
kubectl get pods
NAME     READY   STATUS                       RESTARTS   AGE
my-pod   0/1     CreateContainerConfigError   0          1m
Investigation:
kubectl describe pod my-pod
# Events:
#   Error: configmap "app-config" not found
Common Causes and Fixes:
Cause 1: ConfigMap doesn't exist
Fix: Create the ConfigMap or fix the reference.
Cause 2: Secret doesn't exist
Fix: Create the Secret or fix the reference.
Cause 3: ConfigMap key doesn't exist
Fix: Check ConfigMap content and reference correct key.
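For instance, creating the missing ConfigMap from the example above (app-config and the key/value are placeholders):

kubectl create configmap app-config \
  --from-literal=database-url=postgres://db:5432/app

# Verify the keys the pod references actually exist
kubectl get configmap app-config -o yaml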
Part 4: Service Troubleshooting
Problem 5: Cannot Connect to Service
Symptoms:
Connection refused or timeout when accessing service
curl from another pod fails
Investigation:
# Check service exists
kubectl get svc
# NAME         TYPE        CLUSTER-IP   PORT(S)
# my-service   ClusterIP   10.96.0.1    8080/TCP

# Check endpoints (should have pod IPs)
kubectl get endpoints my-service
# NAME         ENDPOINTS
# my-service   10.244.1.5:8080,10.244.2.3:8080

# If endpoints are empty, selector doesn't match
kubectl describe svc my-service | grep Selector
# Selector: app=myapp

# Check pod labels
kubectl get pods --show-labels

# Test connectivity from test pod
kubectl run test --image=busybox -it --rm -- /bin/sh
wget -O- http://my-service:8080
Common Causes and Fixes:
Cause 1: Selector doesn't match pod labels
Fix: Update service selector or pod labels.
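A side-by-side sketch of a selector and labels that agree (app: myapp is a placeholder):

# Service
spec:
  selector:
    app: myapp        # must match the pod labels exactly

# Pod template
metadata:
  labels:
    app: myapp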
Cause 2: Wrong target port
# Check service ports
kubectl get svc my-service -o yaml | grep -A5 ports
Fix: Correct targetPort to match containerPort.
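To make the relationship concrete, a sketch of matching ports (8080 is a placeholder):

# Service
ports:
- port: 8080          # port clients connect to
  targetPort: 8080    # must match the containerPort below

# Pod spec
containers:
- name: app
  ports:
  - containerPort: 8080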
Cause 3: Pods not ready
kubectl get pods
# Pods in 0/1 Ready state don't get traffic

Fix: Check pod logs for startup issues.
Cause 4: Network policy blocking traffic
kubectl get networkpolicies
kubectl describe networkpolicy default-deny
Fix: Add ingress rule to allow traffic.
Cause 5: Service type is ClusterIP (can't access from outside)
kubectl get svc
# TYPE: ClusterIP

Fix: Use NodePort or LoadBalancer for external access.
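A minimal NodePort sketch for external access (my-service, app=myapp, and the ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  selector:
    app: myapp
  ports:
  - port: 8080
    targetPort: 8080
    nodePort: 30080   # optional; must fall in the default 30000-32767 range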
Part 5: Node Troubleshooting
Problem 6: Node Not Ready
Symptoms:
kubectl get nodes
NAME       STATUS     ROLES    AGE
worker-1   NotReady   <none>   10d
Investigation:
# Describe node for details
kubectl describe node worker-1

# Check node conditions
# Conditions:
#   Ready:            Unknown
#   MemoryPressure:   False
#   DiskPressure:     True
#   PIDPressure:      False
Common Causes and Fixes:
Cause 1: Disk pressure
# SSH to node
ssh worker-1

# Check disk space
df -h

# Clean up
docker system prune -a

# Delete Evicted pods (with --all-namespaces, column 1 is the namespace
# and column 2 is the pod name, so pass both to kubectl)
kubectl get pods --all-namespaces | grep Evicted | \
  awk '{print $2, "-n", $1}' | xargs -L1 kubectl delete pod
Cause 2: Kubelet not running
# On node
systemctl status kubelet
journalctl -u kubelet -n 50
Fix: systemctl restart kubelet
Cause 3: Node unreachable from control plane
# From master
ping worker-1
Fix: Check network connectivity, firewall rules.
Cause 4: Memory pressure
free -h
Fix: Reduce pod memory usage or add more memory.
Problem 7: Node Out of Memory (OOM)
Symptoms:
Pods being evicted
kubectl top nodes shows high memory usage
Investigation:
# Check memory usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory

# Check pod limits
kubectl get pods -o custom-columns=NAME:.metadata.name,MEMORY_LIMIT:.spec.containers[0].resources.limits.memory

# Check node memory pressure
kubectl describe node worker-1 | grep -A5 Conditions
Fixes:
Increase pod memory limits
Add more nodes to cluster
Identify and fix memory leaks in applications
Use vertical-pod-autoscaler to auto-adjust memory (see the sketch below)
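Assuming the Vertical Pod Autoscaler components are installed in the cluster, a sketch of a VPA object targeting a hypothetical Deployment named myapp:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"   # let VPA evict and resize pods automatically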
Part 6: Ingress Troubleshooting
Problem 8: Ingress Not Routing Traffic
Symptoms:
curl to Ingress host returns 404 or connection refused
Browser shows "No healthy upstream"
Investigation:
# Check Ingress exists
kubectl get ingress
# NAME         CLASS   HOSTS         ADDRESS
# my-ingress   nginx   example.com   10.0.0.1

# Describe Ingress
kubectl describe ingress my-ingress

# Check Ingress Controller logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
Common Causes and Fixes:
Cause 1: Service doesn't exist or has no endpoints
kubectl get endpoints my-service
# Should show pod IPs

Fix: Check service selector matches pod labels.
Cause 2: Wrong service port
# Ingress rule
backend:
  service:
    name: my-service
    port:
      number: 80   # Must match service port
Fix: Correct the port number.
Cause 3: TLS secret missing
spec:
  tls:
  - hosts:
    - example.com
    secretName: example-tls   # Secret must exist
Fix: Create TLS secret or remove tls section.
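Creating the secret from an existing certificate and key might look like this (tls.crt and tls.key are assumed file names):

kubectl create secret tls example-tls \
  --cert=tls.crt \
  --key=tls.key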
Cause 4: Host header doesn't match
curl -H "Host: example.com" http://ingress-ip
Fix: Use correct hostname or configure default backend.
Part 7: Storage Troubleshooting
Problem 9: PVC Stuck in Pending
Symptoms:
kubectl get pvc
NAME     STATUS    VOLUME   CAPACITY
my-pvc   Pending
Investigation:
kubectl describe pvc my-pvc
# Events:
#   FailedBinding: no persistent volumes available for this claim
Common Causes and Fixes:
Cause 1: No matching PV for static provisioning
Fix: Create a PV that matches the PVC requirements:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /mnt/data
Cause 2: Storage Class doesn't support dynamic provisioning
kubectl get storageclass
# PROVISIONER column should not be empty

Fix: Install CSI driver or create PV manually.
Cause 3: StorageClass default not set
kubectl get storageclass
# standard (default)   kubernetes.io/aws-ebs

Fix: Set default class:

kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Cause 4: No nodes available in topology zone
Fix: Use WaitForFirstConsumer volumeBindingMode.
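A sketch of a StorageClass with delayed binding (the name and provisioner are assumptions; use the CSI driver your cluster actually runs):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-wffc          # hypothetical name
provisioner: ebs.csi.aws.com   # assumed provisioner
volumeBindingMode: WaitForFirstConsumer   # bind only once a pod is scheduled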
Part 8: Network Policy Troubleshooting
Problem 10: Network Policy Blocking Traffic
Symptoms:
Pods cannot communicate after applying Network Policies
kubectl exec connectivity tests fail
Investigation:
# List all Network Policies
kubectl get netpol

# Check default deny policy
kubectl get netpol -o yaml | grep -A10 default-deny

# Test connectivity from source pod
kubectl exec source-pod -- ping target-pod-ip
kubectl exec source-pod -- wget -O- http://target-service
Common Fixes:
Fix 1: Add allow rule for namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: my-namespace
Fix 2: Add allow rule for specific pod
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
Part 9: Logging and Debugging Tools
Ephemeral Debug Container
Kubernetes 1.23+ supports ephemeral containers for debugging:
kubectl debug -it my-pod --image=busybox --target=my-container
Debugging with Netshoot
# Run netshoot pod in same namespace
kubectl run tmp-shell --rm -it --image nicolaka/netshoot -- /bin/bash

# Test connectivity
dig my-service
curl my-service:8080
Accessing Pod Filesystem
# Copy files from pod
kubectl cp my-pod:/var/log/app.log ./app.log

# Copy files to pod
kubectl cp ./config.json my-pod:/app/config.json
Part 10: Quick Troubleshooting Decision Tree
Pod not running
├── Check status: kubectl get pods
│
├── Pending
│ ├── Check events: kubectl describe pod
│ ├── Insufficient CPU/memory → Add nodes or reduce requests
│ ├── PVC not bound → Check storage
│ └── Node taints → Add tolerations
│
├── ImagePullBackOff / ErrImagePull
│ ├── Check image name
│ ├── Check registry access
│ └── Add imagePullSecrets
│
├── CrashLoopBackOff
│ ├── Check logs: kubectl logs --previous
│ ├── Application error → Fix app
│ ├── Missing config → Add ConfigMap/Secret
│ └── OOMKilled → Increase memory limit
│
├── CreateContainerConfigError
│ └── Missing ConfigMap or Secret → Create it
│
└── Running but not responding
├── Check readiness probe
├── Check service endpoints
    └── Check network policy

Quick Reference Card
| Problem | First Command | Most Common Fix |
|---|---|---|
| Pod pending | kubectl describe pod | Reduce resource requests |
| CrashLoopBackOff | kubectl logs --previous | Fix app error or add ConfigMap |
| ImagePullBackOff | kubectl describe pod | Correct image name or add secret |
| Service not working | kubectl get endpoints | Fix selector or port |
| Node not ready | kubectl describe node | Free disk space or restart kubelet |
| Ingress 404 | kubectl get ingress | Check service endpoints |
| PVC pending | kubectl describe pvc | Create PV or add StorageClass |
Learn More
Practice Kubernetes troubleshooting with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/