
Kubernetes Monitoring & Logging

Monitoring & Logging: Metrics Server, Prometheus Basics, and Centralized Logging

📅 Published: May 2026
⏱️ Estimated Reading Time: 18 minutes
🏷️ Tags: Kubernetes Monitoring, Prometheus, Metrics Server, Centralized Logging, EFK Stack


Introduction: The Observability Challenge

Kubernetes is dynamic. Pods come and go. Nodes scale up and down. Applications restart. In this environment, traditional monitoring approaches fail. You cannot SSH into a Pod that no longer exists to check its logs. You cannot monitor static IP addresses when Pods get new IPs every restart.

Kubernetes requires a different approach to observability. You need tools that understand the dynamic nature of containers and can collect, store, and query metrics and logs across ephemeral resources.

This guide covers the essential monitoring and logging tools for Kubernetes:

  • Metrics Server – Basic resource metrics for HPA and kubectl top

  • Prometheus – Powerful metrics collection and alerting

  • Centralized Logging – Aggregating logs from all Pods to a central location


Part 1: Metrics Server

What is Metrics Server?

Metrics Server is a cluster-wide aggregator of resource usage metrics. It collects CPU and memory usage from Pods and Nodes and makes them available through the Kubernetes API.

Think of Metrics Server as a lightweight sensor network. It doesn't store historical data. It doesn't provide advanced analytics. It just tells you what is happening right now.

What Metrics Server Provides

  • kubectl top pod – Current CPU/memory usage per Pod

  • kubectl top node – Current CPU/memory usage per Node

  • Horizontal Pod Autoscaler (HPA) data source (see the example below)

  • Vertical Pod Autoscaler (VPA) data source
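
Once Metrics Server is running, its resource metrics can drive a CPU-based HPA straight from the command line. A minimal sketch, assuming a Deployment named web already exists:

bash
# Create a CPU-based HPA backed by Metrics Server metrics
kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10

# Inspect it; TARGETS shows <unknown> if Metrics Server is not running
kubectl get hpa web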

Installing Metrics Server

bash
# Download the latest release
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify installation
kubectl get deployment metrics-server -n kube-system

# Check API availability
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .

Metrics Server Configuration (Optional)

On clusters with a custom CNI or kubelet certificate issues, you may need to modify the Deployment's container arguments:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - args:
        - --kubelet-insecure-tls   # Only for testing, not production
        - --kubelet-preferred-address-types=InternalIP
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
        name: metrics-server
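
Rather than editing the manifest by hand, you can add the extra argument with a JSON patch; a sketch that appends --kubelet-insecure-tls to the existing args (testing only):

bash
# Append the kubelet TLS flag to the metrics-server container args
kubectl -n kube-system patch deployment metrics-server --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

# Wait for the patched Pods to roll out
kubectl -n kube-system rollout status deployment metrics-server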

Using Metrics Server

bash
# View node resource usage
kubectl top nodes
kubectl top nodes --sort-by=cpu
kubectl top nodes --sort-by=memory

# View pod resource usage
kubectl top pods
kubectl top pods --all-namespaces
kubectl top pods -l app=nginx

# View pod resource usage in a namespace
kubectl top pods -n kube-system

# Sort pods by memory usage
kubectl top pods --sort-by=memory

Metrics Server Limitations

  • No long-term storage (only current values)

  • No custom metrics

  • No alerting

  • No historical trends

For advanced monitoring, you need Prometheus.


Part 2: Prometheus Basics

What is Prometheus?

Prometheus is an open-source monitoring system that scrapes metrics from targets at regular intervals, stores them in a time-series database, and provides powerful querying and alerting capabilities.

Think of Prometheus as a historian for your cluster. It records what happened, when it happened, and allows you to ask complex questions about your system's behavior over time.

Prometheus Architecture

text
┌─────────────────────────────────────────────────────────────────┐
│                        Prometheus Server                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │   Scraper   │───▶│  TSDB       │───▶│  Query      │          │
│  │  (Pull)     │    │  (Storage)  │    │  Engine     │          │
│  └─────────────┘    └─────────────┘    └──────┬──────┘          │
│         │                                     │                  │
│         ▼                                     ▼                  │
│  ┌─────────────┐                      ┌─────────────┐           │
│  │  Service    │                      │  Alerting   │           │
│  │  Discovery  │                      │  Rules      │           │
│  └─────────────┘                      └──────┬──────┘           │
└──────────────────────────────────────────────┼──────────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │ Alertmanager│
                                        └──────┬──────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │  Email,     │
                                        │  Slack,     │
                                        │  PagerDuty  │
                                        └─────────────┘

Installing Prometheus with kube-prometheus-stack

The easiest way to deploy Prometheus is with the kube-prometheus-stack Helm chart.

bash
# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin

# Verify installation
kubectl get pods -n monitoring
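
The chart also installs the Prometheus Operator and its custom resources; listing them is a quick sanity check that the stack is wired up (exact resource names can vary between chart versions):

bash
# Operator-managed custom resources created by the chart
kubectl get prometheus,alertmanager -n monitoring
kubectl get servicemonitors -n monitoring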

Prometheus Configuration (scrape_configs)

yaml
# prometheus-config.yaml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules

scrape_configs:
# Scrape Kubernetes nodes (cAdvisor)
- job_name: kubernetes-nodes
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)

# Scrape Kubernetes pods
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
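
The relabel rules above keep only Pods annotated with prometheus.io/scrape: "true" and rewrite the scrape path and port from the other annotations. You can confirm what Prometheus actually discovered through its HTTP API (assuming a local port-forward to port 9090, as shown under "Prometheus Commands" below):

bash
# List active scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'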

PromQL (Prometheus Query Language)

PromQL is the query language for extracting data from Prometheus.

promql
# Basic queries
# Current CPU usage of a specific pod
container_cpu_usage_seconds_total{pod="nginx-abc123"}

# Rate of CPU usage per second
rate(container_cpu_usage_seconds_total[5m])

# Memory usage in bytes
container_memory_working_set_bytes

# 95th percentile request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
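
The same expressions can be run outside the UI: Prometheus exposes an HTTP query API, which is handy for scripting. A small sketch, again assuming a local port-forward to port 9090:

bash
# Instant query: per-pod CPU rate over the last 5 minutes
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)' | jq .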

Common PromQL Queries

promql
# CPU usage by pod (top 10)
topk(10, sum(rate(container_cpu_usage_seconds_total[5m])) by (pod))

# Memory usage by node
sum(container_memory_working_set_bytes) by (node)

# Pod restarts in last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

# Pods with high memory usage (>500MB)
sum(container_memory_working_set_bytes) by (pod) > 500000000

# Failed pods count
sum(kube_pod_status_phase{phase="Failed"})

# API server request latency
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, resource))

Prometheus Alerts

yaml
# alert-rules.yaml
groups:
- name: pod-alerts
  rules:
  # Alert when a pod has been restarting frequently
  - alert: PodFrequentlyRestarting
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} is restarting frequently"
      description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times in the last 15 minutes."

  # Alert when CPU usage exceeds threshold
  - alert: HighCPUUsage
    expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) > 0.8
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage on pod {{ $labels.pod }}"
      description: "Pod {{ $labels.pod }} is using {{ $value }} cores on average over 5 minutes."

  # Alert when memory usage exceeds threshold
  - alert: HighMemoryUsage
    expr: sum(container_memory_working_set_bytes) by (pod) > 1e9
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on pod {{ $labels.pod }}"
      description: "Pod {{ $labels.pod }} is using {{ $value | humanize }} bytes of memory."

  # Alert when pod is in CrashLoopBackOff
  - alert: CrashLoopBackOff
    expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.pod }} is in CrashLoopBackOff"
      description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been in CrashLoopBackOff for 5 minutes."

# Node alerts
- name: node-alerts
  rules:
  - alert: NodeHighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} has high memory usage"
      description: "Node {{ $labels.instance }} memory usage is above 90% (current: {{ $value }}%)."

  - alert: NodeDiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} has low disk space"
      description: "Node {{ $labels.instance }} has only {{ $value }}% disk space remaining on {{ $labels.mountpoint }}."

Adding Prometheus Annotations to Applications

For Prometheus to scrape your application, add annotations to your Pod or Service:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
  - name: app
    image: myapp:latest
    ports:
    - containerPort: 8080
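
Before expecting Prometheus to scrape the Pod, it is worth confirming that the endpoint actually serves metrics in the Prometheus text format:

bash
# Forward the metrics port and peek at the output
kubectl port-forward pod/my-app 8080:8080 &
curl -s http://localhost:8080/metrics | head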

Prometheus Commands

bash
# Port forward to access Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090

# Access Prometheus at http://localhost:9090

# Port forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Access Grafana at http://localhost:3000 (admin/admin)

Part 3: Centralized Logging

The Logging Challenge

In Kubernetes, logs are ephemeral. When a Pod is deleted, its logs disappear. When a Node fails, logs on that node are lost. When you have 100 replicas of a service, which Pod's logs do you check?

Centralized logging solves this by collecting logs from all Pods and storing them in a single location.
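
Kubernetes does ship basic log access through kubectl, which works for one Pod but quickly breaks down at scale:

bash
# Built-in log access: fine for a single Pod, painful across 100 replicas
kubectl logs my-pod
kubectl logs my-pod --previous        # logs from the previous restart
kubectl logs -l app=nginx --tail=50   # logs from Pods matching a label
kubectl logs -f deployment/web        # follow logs (one Pod of the Deployment)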

Common Logging Architectures

| Architecture | Components                      | Best For                                  |
|--------------|---------------------------------|-------------------------------------------|
| EFK Stack    | Elasticsearch, Fluentd, Kibana  | General purpose, large deployments        |
| Loki         | Loki, Promtail, Grafana         | Cost-effective, cloud-native              |
| ELK Stack    | Elasticsearch, Logstash, Kibana | Traditional deployments, complex parsing  |

This guide focuses on the EFK stack (Elasticsearch, Fluentd, Kibana), one of the most widely used Kubernetes logging solutions.

EFK Architecture

text
┌─────────────────────────────────────────────────────────────────┐
│                         Kubernetes Cluster                        │
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │    Pod A    │    │    Pod B    │    │    Pod C    │          │
│  │   logs      │    │   logs      │    │   logs      │          │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘          │
│         │                  │                  │                  │
│         └──────────────────┼──────────────────┘                  │
│                            │                                     │
│                     ┌──────▼──────┐                              │
│                     │   Fluentd   │  (DaemonSet on each node)    │
│                     │   (Agent)   │                              │
│                     └──────┬──────┘                              │
└────────────────────────────┼──────────────────────────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Elasticsearch  │
                    │   (Cluster)     │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │     Kibana      │
                    │   (Dashboard)   │
                    └─────────────────┘

Installing EFK Stack

bash
# Add Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --create-namespace \
  --set replicas=3 \
  --set minimumMasterNodes=2

# Install Kibana
helm install kibana elastic/kibana \
  --namespace logging \
  --set elasticsearchHosts=http://elasticsearch-master:9200

# Add Fluent Helm repository and install Fluentd
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluentd fluent/fluentd \
  --namespace logging \
  --set elasticsearch.host=elasticsearch-master \
  --set elasticsearch.port=9200

Fluentd Configuration

yaml
# fluentd-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    # Input: Read logs from all containers
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    # Parse Kubernetes metadata
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    # Add Kubernetes labels to log record
    <filter kubernetes.**>
      @type record_transformer
      enable_ruby
      <record>
        namespace ${record.dig('kubernetes', 'namespace_name')}
        pod_name ${record.dig('kubernetes', 'pod_name')}
        container_name ${record.dig('kubernetes', 'container_name')}
        labels ${record.dig('kubernetes', 'labels').to_json}
      </record>
    </filter>

    # Output: Send to Elasticsearch
    <match **>
      @type elasticsearch
      host elasticsearch-master
      port 9200
      logstash_format true
      logstash_prefix kubernetes
      flush_interval 5s
    </match>
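
After applying the ConfigMap, check that the Fluentd DaemonSet covers every node and is shipping without errors (the label selector below is an assumption; it may differ by chart version):

bash
# Apply the config and verify the DaemonSet
kubectl apply -f fluentd-config.yaml
kubectl get daemonset -n logging
kubectl logs -n logging -l app.kubernetes.io/name=fluentd --tail=20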

Kibana Queries

Common Kibana queries for Kubernetes logs:

json
// Find logs from a specific pod
kubernetes.pod_name: "nginx-abc123"

// Find error logs from the production namespace
kubernetes.namespace_name: "production" AND (log: "error" OR log: "exception")

// Find logs that mention millisecond timings from the api app (a crude latency filter)
log: "*ms" AND kubernetes.labels.app: "api"

// Restrict to the last 15 minutes (usually done via the time picker instead)
kubernetes.pod_name: "my-app" AND @timestamp:[now-15m TO now]

// Exclude health check logs
NOT log: "/health" AND kubernetes.container_name: "app"

Real-World Scenarios

Scenario 1: Setting up HPA with Custom Metrics

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
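
Note that a custom metric like http_requests_per_second is not served by Metrics Server: the HPA reads it from the custom.metrics.k8s.io API, which an adapter such as prometheus-adapter must provide. A hedged install sketch, reusing the Prometheus Service from the stack above:

bash
# Expose Prometheus metrics through the custom metrics API
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-kube-prometheus-prometheus.monitoring.svc \
  --set prometheus.port=9090

# Verify the API is registered
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .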

Scenario 2: Alerting on High Error Rate

yaml
# prometheus-alert.yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) 
    / 
    sum(rate(http_requests_total[5m])) 
    * 100 > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value }}% over the last 5 minutes."

Scenario 3: Centralized Logging for Debugging

bash
# Search logs across all Pods for a specific error
# The elastic/kibana chart typically names the Service <release>-kibana
kubectl port-forward -n logging svc/kibana-kibana 5601

# Access Kibana at http://localhost:5601

# In Kibana, search for:
# Error connecting to database
# This searches all Pods across all namespaces

Summary

| Tool           | Purpose                    | Data Retention                | Query Language        |
|----------------|----------------------------|-------------------------------|-----------------------|
| Metrics Server | Basic resource metrics     | None (current only)           | kubectl top           |
| Prometheus     | Advanced metrics, alerting | Configurable (days to years)  | PromQL                |
| Grafana        | Visualization              | N/A                           | N/A                   |
| EFK Stack      | Centralized logging        | Configurable (days to months) | Kibana query language |

Best Practices

Metrics Server

  • Required for HPA and kubectl top

  • Deploy on all clusters

Prometheus

  • Use kube-prometheus-stack for production

  • Set appropriate retention based on your needs (15-30 days typical)

  • Use recording rules for expensive queries

  • Set up Alertmanager for notifications

Logging

  • Use Fluentd or Fluent Bit as log collector

  • Retain logs long enough to meet your compliance requirements (30+ days is a common baseline)

  • Sample high-volume logs to reduce storage


Practice Questions

  1. What is the difference between Metrics Server and Prometheus?

  2. How does Prometheus discover targets to scrape in Kubernetes?

  3. Why are logs in Kubernetes considered ephemeral?

  4. How would you find all error logs across 50 Pods without centralized logging?

  5. What PromQL query would show the memory usage trend of a specific Pod over time?


Learn More

Practice Kubernetes monitoring and logging with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/
