Kubernetes Troubleshooting Guide: From Red Pods to Healthy Clusters

📅 Published: June 2026
⏱️ Estimated Reading Time: 25 minutes
🏷️ Tags: Kubernetes, Troubleshooting, Debugging, K8s Issues, Pod Failures


Introduction: The Kubernetes Debugging Mindset

Kubernetes is powerful, but when something goes wrong, the error messages can be cryptic. A pod stuck in CrashLoopBackOff tells you something is wrong, but not why. A service that won't connect gives no obvious indication of the problem.

Successful Kubernetes troubleshooting follows a systematic approach:

  1. Identify the symptom – What isn't working?

  2. Check the obvious – Are pods running? Is the service there?

  3. Narrow down – Which component is failing?

  4. Gather data – Logs, events, describe output

  5. Find the root cause – What actually broke?

  6. Fix and verify – Make the change, confirm it works

This guide covers the most common Kubernetes problems and how to solve them.
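
Steps 2 to 4 of the approach above can be scripted so the same data is collected every time. The sketch below is a hypothetical helper: the name triage_pod and the 50-line log tail are assumptions for illustration, and it only wraps standard kubectl commands.

```shell
# Hypothetical helper: gather first-pass data for one pod.
# Usage: triage_pod <pod> [namespace]
triage_pod() {
  pod="$1"
  ns="${2:-default}"

  echo "== status =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== events for this pod, oldest first =="
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp

  echo "== current logs (last 50 lines) =="
  kubectl logs "$pod" -n "$ns" --tail=50

  echo "== logs from the previous container instance =="
  # --previous fails when the container has never restarted, so tolerate that.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null \
    || echo "(no previous instance)"
}
```

Run it as triage_pod my-pod my-namespace and read the output top to bottom before changing anything.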


Part 1: The Essential Debugging Commands

bash
# View pod status
kubectl get pods
kubectl get pods -n namespace
kubectl get pods -w  # Watch in real-time

# Detailed pod information
kubectl describe pod my-pod
kubectl describe pod my-pod -n namespace

# View logs
kubectl logs my-pod
kubectl logs my-pod -c container-name  # Multi-container pod
kubectl logs my-pod --previous          # Previous crashed instance
kubectl logs -l app=myapp               # All pods with label

# Execute commands inside pod
kubectl exec -it my-pod -- /bin/bash   # use /bin/sh if the image has no bash
kubectl exec my-pod -- ls -la

# Port forward to pod
kubectl port-forward pod/my-pod 8080:80

# View events (current namespace, a specific namespace, or cluster-wide)
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -n namespace
kubectl get events -A --sort-by='.lastTimestamp'

# Resource usage (requires metrics-server in the cluster)
kubectl top nodes
kubectl top pods

Part 2: Pod States and What They Mean

State | Meaning | What to Do
Pending | Pod accepted, waiting for node or containers | Check node resources, PVC binding, image pull
ContainerCreating | Pod is starting, pulling images | Usually normal. If stuck, check network/registry
Running | Pod is running | Good state unless app has issues
CrashLoopBackOff | Container starts then crashes repeatedly | Check logs, check command, check health probes
ImagePullBackOff | Cannot pull container image | Check image name, registry access, imagePullSecrets
ErrImagePull | Failed to pull image | Same as ImagePullBackOff
CreateContainerConfigError | Config issue (missing ConfigMap/Secret) | Check ConfigMap and Secret references
Completed | Container exited with 0 (batch job) | Normal for jobs. Delete if not needed
OOMKilled | Container killed due to memory limit | Increase memory limit or fix memory leak
Evicted | Pod removed due to resource pressure | Check node resources, adjust requests/limits

Part 3: Pod Troubleshooting

Problem 1: Pod Stuck in Pending

Symptoms:

bash
kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
my-pod    0/1     Pending   0          5m

Investigation:

bash
# Describe pod to see events
kubectl describe pod my-pod

# Common events:
# - "0/1 nodes are available: 1 Insufficient cpu"
# - "0/1 nodes are available: 1 node(s) had taint"
# - "pod has unbound immediate PersistentVolumeClaims"

Common Causes and Fixes:

Cause 1: Insufficient resources

text
Events:
  Type     Reason            Message
  Warning  FailedScheduling  0/3 nodes available: insufficient cpu

Fix: Reduce resource requests or add nodes.

Cause 2: Node taints

text
Events:
  Warning  FailedScheduling  0/3 nodes available: 3 node(s) had taint

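Fix: Add a toleration to the pod spec that matches the taint's key, value, and effect (kubectl describe node lists the node's taints):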

yaml
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"

Cause 3: Unbound PersistentVolumeClaim

text
Events:
  Warning  FailedScheduling  pod has unbound immediate PersistentVolumeClaims

Fix: Check PVC status:

bash
kubectl get pvc
kubectl describe pvc my-pvc
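
The three causes above can be told apart from the scheduler's event text alone. The following is a minimal sketch, assuming the message substrings shown in the examples above; classify_pending is a hypothetical name, and any other scheduler message falls through to "unknown".

```shell
# Hypothetical helper: map scheduler event text to the causes listed above.
# Reads `kubectl describe pod <pod>` output (or any event text) on stdin.
classify_pending() {
  events="$(cat)"
  found=0
  if printf '%s\n' "$events" | grep -qi "insufficient"; then
    echo "resources: reduce requests or add nodes"
    found=1
  fi
  if printf '%s\n' "$events" | grep -qi "taint"; then
    echo "taints: add a matching toleration"
    found=1
  fi
  if printf '%s\n' "$events" | grep -q "unbound immediate PersistentVolumeClaims"; then
    echo "storage: inspect the PVC with kubectl describe pvc"
    found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo "unknown: read the events manually"
  fi
}

# Example (needs a cluster):
#   kubectl describe pod my-pod | classify_pending
```

A single scheduler message can name several reasons at once, so the function prints every cause it recognizes instead of stopping at the first match.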

Problem 2: Pod in CrashLoopBackOff

Symptoms:

bash
kubectl get pods
NAME      READY   STATUS             RESTARTS   AGE
my-pod    0/1     CrashLoopBackOff   5          10m

Investigation:

bash
# Check current logs
kubectl logs my-pod

# Check logs from previous crashed instance
kubectl logs my-pod --previous

# Describe pod for events
kubectl describe pod my-pod

Common Causes and Fixes:

Cause 1: Application error on startup

bash
kubectl logs my-pod --previous
# Error: Cannot find module '/app/server.js'

Fix: Check Dockerfile, ensure files are copied correctly.

Cause 2: Missing environment variable or ConfigMap

bash
kubectl logs my-pod --previous
# Error: DATABASE_URL environment variable not set

Fix: Add missing environment variable or ConfigMap.

Cause 3: Command or arguments incorrect

bash
# Check the command and args in the pod spec
kubectl get pod my-pod -o yaml | grep -A5 command

Fix: Correct the command or arguments.

Cause 4: OOMKilled (exit code 137)

bash
kubectl describe pod my-pod | grep -A5 State
#   State:          Terminated
#     Reason:       OOMKilled

Fix: Increase memory limit or fix memory leak.
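
Exit code 137 follows the usual convention that codes above 128 mean "terminated by signal (code minus 128)": 137 is 128 + 9, and signal 9 is SIGKILL, which is what the kernel OOM killer sends. The sketch below decodes an exit code that way; explain_exit_code is a hypothetical helper name.

```shell
# Hypothetical helper: interpret a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "application exited with error code $code"
  fi
}

# Example (needs a cluster): read the last exit code of the first container.
#   code="$(kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
#   explain_exit_code "$code"
```

Signal 9 alone does not prove an out-of-memory kill; the Reason: OOMKilled line in the describe output is what confirms it.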

Cause 5: Liveness probe failing

bash
kubectl describe pod my-pod | grep -A10 Liveness

Fix: Adjust probe settings (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold), or add a startupProbe for applications that are slow to start.


Problem 3: ImagePullBackOff

Symptoms:

bash
kubectl get pods
NAME      READY   STATUS             RESTARTS   AGE
my-pod    0/1     ImagePullBackOff   0          2m

Investigation:

bash
kubectl describe pod my-pod
# Events:
#   Failed to pull image "myapp:latest": rpc error: code = NotFound

Common Causes and Fixes:

Cause 1: Wrong image name
Fix: Check and correct image name in pod spec.

Cause 2: Image doesn't exist in registry
Fix: Build and push the image, or use correct tag.

Cause 3: Private registry needs authentication

bash
# Create image pull secret (in the same namespace as the pod)
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.io \
  --docker-username=user \
  --docker-password=pass

Then reference the secret in the pod spec:

yaml
spec:
  imagePullSecrets:
  - name: regcred

Cause 4: Docker Hub rate limit exceeded
Fix: Use authenticated pulls or different registry.


Problem 4: CreateContainerConfigError

Symptoms:

bash
kubectl get pods
NAME      READY   STATUS                        RESTARTS   AGE
my-pod    0/1     CreateContainerConfigError    0          1m

Investigation:

bash
kubectl describe pod my-pod
# Events:
#   Error: configmap "app-config" not found

Common Causes and Fixes:

Cause 1: ConfigMap doesn't exist
Fix: Create the ConfigMap or fix the reference.

Cause 2: Secret doesn't exist
Fix: Create the Secret or fix the reference.

Cause 3: ConfigMap key doesn't exist
Fix: Check ConfigMap content and reference correct key.
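
To confirm whether a referenced key exists before editing the pod spec, the keys of a ConfigMap can be listed and compared against the name the pod expects. This is a minimal sketch; configmap_has_key is a hypothetical helper, and it only looks at the data field (not binaryData).

```shell
# Hypothetical helper: succeed if ConfigMap <name> contains key <key>.
# Usage: configmap_has_key <name> <key> [namespace]
configmap_has_key() {
  name="$1"
  key="$2"
  ns="${3:-default}"
  # The go-template prints one key from .data per line;
  # grep -x -F then requires an exact, whole-line match.
  kubectl get configmap "$name" -n "$ns" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
    | grep -qxF -- "$key"
}

# Example (needs a cluster):
#   configmap_has_key app-config DATABASE_URL || echo "key is missing"
```

The exact match matters because key names are case-sensitive, and a near miss such as database_url produces the same CreateContainerConfigError as a missing ConfigMap.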


Part 4: Service Troubleshooting

Problem 5: Cannot Connect to Service

Symptoms:

  • Connection refused or timeout when accessing service

  • curl from another pod fails

Investigation:

bash
# Check service exists
kubectl get svc
# NAME         TYPE        CLUSTER-IP     PORT(S)
# my-service   ClusterIP   10.96.0.1      8080/TCP

# Check endpoints (should have pod IPs)
kubectl get endpoints my-service
# NAME         ENDPOINTS
# my-service   10.244.1.5:8080,10.244.2.3:8080

# If endpoints are empty, selector doesn't match
kubectl describe svc my-service | grep Selector
# Selector: app=myapp

# Check pod labels
kubectl get pods --show-labels

# Test connectivity from a temporary pod
kubectl run test --image=busybox -it --rm --restart=Never -- /bin/sh
# Then, inside the pod's shell:
wget -O- http://my-service:8080

Common Causes and Fixes:

Cause 1: Selector doesn't match pod labels
Fix: Update service selector or pod labels.
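
Comparing the selector and the labels by eye is error-prone, so one option is to feed the service's own selector back into kubectl get pods and see what it matches. The sketch below does that; pods_for_service is a hypothetical helper name, and the go-template is just one way to render the selector map as key=value pairs.

```shell
# Hypothetical helper: list the pods a Service's selector actually matches.
# Usage: pods_for_service <service> [namespace]
pods_for_service() {
  svc="$1"
  ns="${2:-default}"
  # Render the selector map as "key=value,key=value," ...
  sel="$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')"
  # ... then drop the trailing comma.
  sel="${sel%,}"
  if [ -z "$sel" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$ns" -l "$sel" -o wide
}

# Example (needs a cluster):
#   pods_for_service my-service
```

If this prints no pods, the selector matches nothing, which is why the endpoints list is empty.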

Cause 2: Wrong target port

bash
# Check service ports
kubectl get svc my-service -o yaml | grep -A5 ports

Fix: Correct targetPort to match containerPort.

Cause 3: Pods not ready

bash
kubectl get pods
# Pods in 0/1 Ready state don't get traffic

Fix: Check pod logs for startup issues.

Cause 4: Network policy blocking traffic

bash
kubectl get networkpolicies
kubectl describe networkpolicy default-deny

Fix: Add ingress rule to allow traffic.

Cause 5: Service type is ClusterIP (can't access from outside)

bash
kubectl get svc
# TYPE: ClusterIP

Fix: Use NodePort, LoadBalancer, or an Ingress for external access. For a quick one-off check, kubectl port-forward svc/my-service 8080:8080 also works.


Part 5: Node Troubleshooting

Problem 6: Node Not Ready

Symptoms:

bash
kubectl get nodes
NAME       STATUS     ROLES    AGE
worker-1   NotReady   <none>   10d

Investigation:

bash
# Describe node for details
kubectl describe node worker-1

# Check node conditions
# Conditions:
#   Ready: Unknown
#   MemoryPressure: False
#   DiskPressure: True
#   PIDPressure: False

Common Causes and Fixes:

Cause 1: Disk pressure

bash
# SSH to node
ssh worker-1

# Check disk space
df -h

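# Clean up unused images on the node
docker system prune -a   # nodes that still use the Docker runtime
crictl rmi --prune       # containerd or CRI-O nodes (Kubernetes 1.24+ no longer includes dockershim)

# From your workstation: delete evicted pods in every namespace
kubectl get pods --all-namespaces --no-headers | awk '$4 == "Evicted" {print $1, $2}' | while read -r ns pod; do kubectl delete pod -n "$ns" "$pod"; done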

Cause 2: Kubelet not running

bash
# On node
systemctl status kubelet
journalctl -u kubelet -n 50

Fix: systemctl restart kubelet

Cause 3: Node unreachable from control plane

bash
# From a control plane node
ping worker-1

Fix: Check network connectivity, firewall rules.

Cause 4: Memory pressure

bash
free -h

Fix: Reduce pod memory usage or add more memory.


Problem 7: Node Out of Memory (OOM)

Symptoms:

  • Pods being evicted

  • kubectl top nodes shows high memory usage

Investigation:

bash
# Check memory usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory

# Check pod limits
kubectl get pods -o custom-columns='NAME:.metadata.name,MEMORY_LIMIT:.spec.containers[0].resources.limits.memory'

# Check node memory pressure
kubectl describe node worker-1 | grep -A5 Conditions

Fixes:

  1. Set pod memory requests and limits that reflect real usage, so the scheduler does not overcommit the node

  2. Add more nodes to cluster

  3. Identify and fix memory leaks in applications

  4. Use vertical-pod-autoscaler to auto-adjust memory


Part 6: Ingress Troubleshooting

Problem 8: Ingress Not Routing Traffic

Symptoms:

  • curl to Ingress host returns 404 or connection refused

  • Browser shows "No healthy upstream"

Investigation:

bash
# Check Ingress exists
kubectl get ingress
# NAME         CLASS   HOSTS              ADDRESS
# my-ingress   nginx   example.com        10.0.0.1

# Describe Ingress
kubectl describe ingress my-ingress

# Check Ingress Controller logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller

Common Causes and Fixes:

Cause 1: Service doesn't exist or has no endpoints

bash
kubectl get endpoints my-service
# Should show pod IPs

Fix: Check service selector matches pod labels.

Cause 2: Wrong service port

yaml
# Ingress rule
backend:
  service:
    name: my-service
    port:
      number: 80  # Must match service port

Fix: Correct the port number.

Cause 3: TLS secret missing

yaml
spec:
  tls:
  - hosts:
    - example.com
    secretName: example-tls  # Secret must exist

Fix: Create TLS secret or remove tls section.

Cause 4: Host header doesn't match

bash
curl -H "Host: example.com" http://ingress-ip

Fix: Use correct hostname or configure default backend.
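
An alternative to setting the Host header by hand is curl's --resolve option, which pins the hostname to the Ingress IP for a single request and leaves DNS and /etc/hosts untouched. The sketch below prints only the HTTP status code; ingress_status is a hypothetical helper name, and the 5-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: print the HTTP status an Ingress returns for <host>
# when the request is sent to <ip>, without touching DNS or /etc/hosts.
# Usage: ingress_status <host> <ip> [path]
ingress_status() {
  host="$1"
  ip="$2"
  path="${3:-/}"
  # --resolve pins host:80 to the given IP, so curl sends the normal Host header.
  curl -s -o /dev/null -w '%{http_code}\n' \
    --max-time 5 \
    --resolve "$host:80:$ip" \
    "http://$host$path"
}

# Example (needs a reachable Ingress):
#   ingress_status example.com 10.0.0.1 /
```

With ingress-nginx, a 404 here typically means no rule matched the host or path, while a 503 usually means a rule matched but the backend service has no ready endpoints.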


Part 7: Storage Troubleshooting

Problem 9: PVC Stuck in Pending

Symptoms:

bash
kubectl get pvc
NAME      STATUS    VOLUME   CAPACITY
my-pvc    Pending

Investigation:

bash
kubectl describe pvc my-pvc
# Events:
#   FailedBinding: no persistent volumes available for this claim

Common Causes and Fixes:

Cause 1: No matching PV for static provisioning
Fix: Create a PV that matches the PVC requirements:

yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data

Cause 2: Storage Class doesn't support dynamic provisioning

bash
kubectl get storageclass
# A PROVISIONER of kubernetes.io/no-provisioner means the class cannot create volumes dynamically

Fix: Install CSI driver or create PV manually.

Cause 3: StorageClass default not set

bash
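kubectl get storageclass
# The default class is marked "(default)" after its name, for example:
# standard (default)   kubernetes.io/aws-ebs
# If no class is marked, PVCs without a storageClassName stay Pending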

Fix: Set default class: kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Cause 4: No nodes available in topology zone
Fix: Set volumeBindingMode: WaitForFirstConsumer on the StorageClass so the volume is provisioned in the zone where the pod is scheduled.


Part 8: Network Policy Troubleshooting

Problem 10: Network Policy Blocking Traffic

Symptoms:

  • Pods cannot communicate after applying Network Policies

  • kubectl exec connectivity tests fail

Investigation:

bash
# List all Network Policies
kubectl get netpol

# Check default deny policy
kubectl get netpol -o yaml | grep -A10 default-deny

# Test connectivity from source pod
kubectl exec source-pod -- ping target-pod-ip
kubectl exec source-pod -- wget -O- http://target-service

Common Fixes:

Fix 1: Add allow rule for namespace

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: my-namespace  # label set automatically on every namespace (Kubernetes 1.21+)

Fix 2: Add allow rule for specific pod

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
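
After applying an allow rule, it helps to run the same probe from a pod that should be allowed and from one that should not. The following is a minimal sketch, assuming the source image ships wget (busybox does); probe_from_pod is a hypothetical helper name, and the 3-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: try an HTTP request from inside <source-pod>.
# Assumes the source image ships wget (busybox and many base images do).
# Usage: probe_from_pod <source-pod> <url> [namespace]
probe_from_pod() {
  src="$1"
  url="$2"
  ns="${3:-default}"
  # -T 3 gives up after three seconds so a dropped connection does not hang.
  if kubectl exec "$src" -n "$ns" -- wget -q -T 3 -O /dev/null "$url"; then
    echo "reachable: $src -> $url"
  else
    echo "blocked or unreachable: $src -> $url"
    return 1
  fi
}

# Example (needs a cluster):
#   probe_from_pod frontend-pod http://backend-service:8080
```

A failure here does not by itself prove the policy is the cause, since a wrong port or an unready target produces the same result, so check the service endpoints as well.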

Part 9: Logging and Debugging Tools

Ephemeral Debug Container

Ephemeral containers (beta in Kubernetes 1.23, stable since 1.25) let you attach a debugging container to a running pod:

bash
kubectl debug -it my-pod --image=busybox --target=my-container

Debugging with Netshoot

bash
# Run netshoot pod in same namespace
kubectl run tmp-shell --rm -it --image nicolaka/netshoot -- /bin/bash

# Test connectivity
dig my-service
curl my-service:8080

Accessing Pod Filesystem

bash
# Copy files from pod (kubectl cp requires the tar binary in the container image)
kubectl cp my-pod:/var/log/app.log ./app.log

# Copy files to pod
kubectl cp ./config.json my-pod:/app/config.json

Part 10: Quick Troubleshooting Decision Tree

text
Pod not running
├── Check status: kubectl get pods
│
├── Pending
│   ├── Check events: kubectl describe pod
│   ├── Insufficient CPU/memory → Add nodes or reduce requests
│   ├── PVC not bound → Check storage
│   └── Node taints → Add tolerations
│
├── ImagePullBackOff / ErrImagePull
│   ├── Check image name
│   ├── Check registry access
│   └── Add imagePullSecrets
│
├── CrashLoopBackOff
│   ├── Check logs: kubectl logs --previous
│   ├── Application error → Fix app
│   ├── Missing config → Add ConfigMap/Secret
│   └── OOMKilled → Increase memory limit
│
├── CreateContainerConfigError
│   └── Missing ConfigMap or Secret → Create it
│
└── Running but not responding
    ├── Check readiness probe
    ├── Check service endpoints
    └── Check network policy
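
The tree above can also be written as a lookup from the STATUS column to the first command to run. This is only a sketch of the same branches; next_step is a hypothetical helper name, and <pod> and <service> are placeholders to fill in.

```shell
# Hypothetical helper: map a STATUS value from `kubectl get pods`
# to the first command the tree above suggests.
next_step() {
  case "$1" in
    Pending|CreateContainerConfigError)
      echo "kubectl describe pod <pod>" ;;
    ImagePullBackOff|ErrImagePull)
      echo "kubectl describe pod <pod>  # then check image name, registry access, imagePullSecrets" ;;
    CrashLoopBackOff)
      echo "kubectl logs <pod> --previous" ;;
    Running)
      echo "kubectl get endpoints <service>  # then check readiness probe and network policy" ;;
    *)
      echo "kubectl describe pod <pod>  # status not covered by the tree" ;;
  esac
}

# Example:
#   next_step CrashLoopBackOff
```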

Quick Reference Card

Problem | First Command | Most Common Fix
Pod pending | kubectl describe pod | Reduce resource requests
CrashLoopBackOff | kubectl logs --previous | Fix app error or add ConfigMap
ImagePullBackOff | kubectl describe pod | Correct image name or add secret
Service not working | kubectl get endpoints | Fix selector or port
Node not ready | kubectl describe node | Free disk space or restart kubelet
Ingress 404 | kubectl get ingress | Check service endpoints
PVC pending | kubectl describe pvc | Create PV or add StorageClass

Learn More

Practice Kubernetes troubleshooting with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/
