Monitoring & Logging: Metrics Server, Prometheus Basics, and Centralized Logging
📅 Published: May 2026
⏱️ Estimated Reading Time: 18 minutes
🏷️ Tags: Kubernetes Monitoring, Prometheus, Metrics Server, Centralized Logging, EFK Stack
Introduction: The Observability Challenge
Kubernetes is dynamic. Pods come and go. Nodes scale up and down. Applications restart. In this environment, traditional monitoring approaches fail. You cannot SSH into a Pod that no longer exists to check its logs. You cannot monitor by static IP address when Pods get a new IP on every restart.
Kubernetes requires a different approach to observability. You need tools that understand the dynamic nature of containers and can collect, store, and query metrics and logs across ephemeral resources.
This guide covers the essential monitoring and logging tools for Kubernetes:
Metrics Server – Basic resource metrics for HPA and kubectl top
Prometheus – Powerful metrics collection and alerting
Centralized Logging – Aggregating logs from all Pods to a central location
Part 1: Metrics Server
What is Metrics Server?
Metrics Server is a cluster-wide aggregator of resource usage metrics. It collects CPU and memory usage from Pods and Nodes and makes them available through the Kubernetes API.
Think of Metrics Server as a lightweight sensor network. It doesn't store historical data. It doesn't provide advanced analytics. It just tells you what is happening right now.
What Metrics Server Provides
kubectl top pod – Current CPU/memory usage per Pod
kubectl top node – Current CPU/memory usage per Node
Horizontal Pod Autoscaler (HPA) data source (see the sketch after this list)
Vertical Pod Autoscaler (VPA) data source
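Because Metrics Server feeds the HPA, a plain CPU-based autoscaler works as soon as it is installed, with no extra components. A minimal sketch, assuming a Deployment named web-deployment already exists (the names and threshold are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-cpu-hpa                 # hypothetical name for illustration
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment            # assumes this Deployment exists
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70      # scale out when average CPU crosses 70% of requests
```

Resource-type metrics like CPU and memory come straight from Metrics Server; the custom-metrics HPA in Scenario 1 later in this guide needs an additional adapter.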
Installing Metrics Server
```bash
# Download the latest release
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify installation
kubectl get deployment metrics-server -n kube-system

# Check API availability
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .
```
Metrics Server Configuration (Optional)
For clusters with custom CNI or certificate issues, you may need to modify the deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: metrics-server
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
        args:
        - --kubelet-insecure-tls                        # Only for testing, not production
        - --kubelet-preferred-address-types=InternalIP
```
Using Metrics Server
```bash
# View node resource usage
kubectl top nodes
kubectl top nodes --sort-by=cpu
kubectl top nodes --sort-by=memory

# View pod resource usage
kubectl top pods
kubectl top pods --all-namespaces
kubectl top pods -l app=nginx

# View pod resource usage in a namespace
kubectl top pods -n kube-system

# Sort pods by memory usage
kubectl top pods --sort-by=memory
```
Metrics Server Limitations
No long-term storage (only current values)
No custom metrics
No alerting
No historical trends
For advanced monitoring, you need Prometheus.
Part 2: Prometheus Basics
What is Prometheus?
Prometheus is an open-source monitoring system that scrapes metrics from targets at regular intervals, stores them in a time-series database, and provides powerful querying and alerting capabilities.
Think of Prometheus as a historian for your cluster. It records what happened, when it happened, and allows you to ask complex questions about your system's behavior over time.
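Concretely, every scrape target exposes a plain-text /metrics endpoint that Prometheus pulls on a schedule. A minimal sketch of what that output typically looks like (the metric names and values are illustrative):

```
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 28114944
```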
Prometheus Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Prometheus Server │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Scraper │───▶│ TSDB │───▶│ Query │ │
│ │ (Pull) │ │ (Storage) │ │ Engine │ │
│ └─────────────┘ └─────────────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Service │ │ Alert │ │
│ │ Discovery │ │ Manager │ │
│ └─────────────┘ └──────┬──────┘ │
└──────────────────────────────────────────────┼──────────────────┘
│
▼
┌─────────────┐
│ Alertmanager│
└──────┬──────┘
│
▼
┌─────────────┐
│ Email, │
│ Slack, │
│ PagerDuty │
└─────────────┘

Installing Prometheus with kube-prometheus-stack
The easiest way to deploy Prometheus is with the kube-prometheus-stack Helm chart.
```bash
# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin

# Verify installation
kubectl get pods -n monitoring
```
Prometheus Configuration (scrape_configs)
```yaml
# prometheus-config.yaml
global:
  scrape_interval: 15s        # How often to scrape targets
  evaluation_interval: 15s    # How often to evaluate rules

scrape_configs:
  # Scrape Kubernetes nodes (cAdvisor)
  - job_name: kubernetes-nodes
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Scrape Kubernetes pods
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```
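Note that if you deployed Prometheus through kube-prometheus-stack, you normally do not edit scrape_configs by hand; the Prometheus Operator generates them from ServiceMonitor objects. A minimal sketch, assuming a Service labelled app: my-app that exposes a port named metrics (both names are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: prometheus        # must match the Helm release name so the operator picks it up
spec:
  selector:
    matchLabels:
      app: my-app              # assumed Service label
  namespaceSelector:
    matchNames:
      - default                # namespace where the Service lives
  endpoints:
    - port: metrics            # assumed named port on the Service
      interval: 15s
```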
PromQL (Prometheus Query Language)
PromQL is the query language for extracting data from Prometheus.
```promql
# Current CPU usage of a specific pod
container_cpu_usage_seconds_total{pod="nginx-abc123"}

# Rate of CPU usage per second
rate(container_cpu_usage_seconds_total[5m])

# Memory usage in bytes
container_memory_working_set_bytes

# 95th percentile request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
Common PromQL Queries
```promql
# CPU usage by pod (top 10)
topk(10, sum(rate(container_cpu_usage_seconds_total[5m])) by (pod))

# Memory usage by node
sum(container_memory_working_set_bytes) by (node)

# Pod restarts in last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

# Pods with high memory usage (>500MB)
sum(container_memory_working_set_bytes) by (pod) > 500000000

# Failed pods count
sum(kube_pod_status_phase{phase="Failed"})

# API server request latency (99th percentile)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, resource))
```
Prometheus Alerts
```yaml
# alert-rules.yaml
groups:
  - name: pod-alerts
    rules:
      # Alert when a pod has been restarting frequently
      - alert: PodFrequentlyRestarting
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting frequently"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times in the last 15 minutes."

      # Alert when CPU usage exceeds threshold
      - alert: HighCPUUsage
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) > 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on pod {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} is using {{ $value }} cores on average over 5 minutes."

      # Alert when memory usage exceeds threshold
      - alert: HighMemoryUsage
        expr: sum(container_memory_working_set_bytes) by (pod) > 1e9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on pod {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} is using {{ $value | humanize }} bytes of memory."

      # Alert when pod is in CrashLoopBackOff
      - alert: CrashLoopBackOff
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is in CrashLoopBackOff"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been in CrashLoopBackOff for 5 minutes."

  # Node alerts
  - name: node-alerts
    rules:
      - alert: NodeHighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} has high memory usage"
          description: "Node {{ $labels.node }} memory usage is above 90% (current: {{ $value }}%)"

      - alert: NodeDiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has low disk space"
          description: "Node {{ $labels.node }} has only {{ $value }}% disk space remaining."
```
Adding Prometheus Annotations to Applications
For Prometheus to scrape your application, add annotations to your Pod or Service:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
  - name: app
    image: myapp:latest
    ports:
    - containerPort: 8080
```
Prometheus Commands
```bash
# Port forward to access Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090
# Access Prometheus at http://localhost:9090

# Port forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Access Grafana at http://localhost:3000 (admin/admin)
```
Part 3: Centralized Logging
The Logging Challenge
In Kubernetes, logs are ephemeral. When a Pod is deleted, its logs disappear. When a Node fails, logs on that node are lost. When you have 100 replicas of a service, which Pod's logs do you check?
Centralized logging solves this by collecting logs from all Pods and storing them in a single location.
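Built-in kubectl log access shows the limits: it works one Pod (or label selector) at a time, and only while the Pod object still exists. For example:

```bash
# Logs from a single running Pod
kubectl logs nginx-abc123

# Logs from the previous container instance (only while the Pod object still exists)
kubectl logs nginx-abc123 --previous

# Logs from Pods matching a label, prefixed with the Pod name
kubectl logs -l app=nginx --prefix
```

Once a Pod is deleted, these commands return nothing, which is exactly the gap a centralized log pipeline fills.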
Common Logging Architectures
| Architecture | Components | Best For |
|---|---|---|
| EFK Stack | Elasticsearch, Fluentd, Kibana | General purpose, large deployments |
| Loki | Loki, Promtail, Grafana | Cost-effective, cloud-native |
| ELK Stack | Elasticsearch, Logstash, Kibana | Traditional deployments, complex parsing |
This guide focuses on the EFK stack (Elasticsearch, Fluentd, Kibana), the most common Kubernetes logging solution.
EFK Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Pod A │ │ Pod B │ │ Pod C │ │
│ │ logs │ │ logs │ │ logs │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ Fluentd │ (DaemonSet on each node) │
│ │ (Agent) │ │
│ └──────┬──────┘ │
└────────────────────────────┼──────────────────────────────────────┘
│
▼
┌─────────────────┐
│ Elasticsearch │
│ (Cluster) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Kibana │
│ (Dashboard) │
└─────────────────┘

Installing EFK Stack
```bash
# Add Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --create-namespace \
  --set replicas=3 \
  --set minimumMasterNodes=2

# Install Kibana
helm install kibana elastic/kibana \
  --namespace logging \
  --set elasticsearchHosts=http://elasticsearch-master:9200

# Install Fluentd
helm install fluentd fluent/fluentd \
  --namespace logging \
  --set elasticsearch.host=elasticsearch-master \
  --set elasticsearch.port=9200
```
Fluentd Configuration
```yaml
# fluentd-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    # Input: Read logs from all containers
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    # Parse Kubernetes metadata
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    # Add Kubernetes labels to log record
    <filter kubernetes.**>
      @type record_transformer
      enable_ruby
      <record>
        namespace ${record.dig('kubernetes', 'namespace_name')}
        pod_name ${record.dig('kubernetes', 'pod_name')}
        container_name ${record.dig('kubernetes', 'container_name')}
        labels ${record.dig('kubernetes', 'labels').to_json}
      </record>
    </filter>

    # Output: Send to Elasticsearch
    <match **>
      @type elasticsearch
      host elasticsearch-master
      port 9200
      logstash_format true
      logstash_prefix kubernetes
      flush_interval 5s
    </match>
```
Kibana Queries
Common Kibana queries for Kubernetes logs:
```
// Find logs from a specific pod
kubernetes.pod_name: "nginx-abc123"

// Find error logs from the production namespace
kubernetes.namespace_name: "production" AND (log: "error" OR log: "exception")

// Find logs with high latency
log: "*ms" AND kubernetes.labels.app: "api"

// Find logs from the last 15 minutes
kubernetes.pod_name: "my-app" AND @timestamp: now-15m

// Exclude health check logs
NOT log: "/health" AND kubernetes.container_name: "app"
```
Real-World Scenarios
Scenario 1: Setting up HPA with Custom Metrics
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
```
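The http_requests_per_second metric above is not built into Kubernetes: Pods-type HPA metrics are served through the custom metrics API, typically by an adapter such as prometheus-adapter reading from Prometheus. A rough sketch of the adapter rule that could expose it (the metric and label names are assumptions carried over from the example):

```yaml
# prometheus-adapter rule: derive http_requests_per_second from a Prometheus counter
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # assumes the app exports this counter
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```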
Scenario 2: Alerting on High Error Rate
```yaml
# prometheus-alert.yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) * 100 > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value }}% over the last 5 minutes."
```
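The rule only fires the alert; Alertmanager decides where the notification goes. A minimal routing sketch, assuming a Slack incoming webhook (the channel name and webhook URL are placeholders):

```yaml
# alertmanager.yaml (routing sketch; receiver names and values are illustrative)
route:
  receiver: slack-critical
  group_by: [alertname, namespace]
  routes:
    - matchers: ['severity="critical"']
      receiver: slack-critical
receivers:
  - name: slack-critical
    slack_configs:
      - channel: "#alerts"                                       # placeholder channel
        api_url: "https://hooks.slack.com/services/REPLACE_ME"   # placeholder webhook URL
        send_resolved: true
```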
Scenario 3: Centralized Logging for Debugging
```bash
# Search logs across all Pods for a specific error
kubectl port-forward -n logging svc/kibana 5601
# Access Kibana at http://localhost:5601

# In Kibana, search for:
#   Error connecting to database
# This searches all Pods across all namespaces
```
Summary
| Tool | Purpose | Data Retention | Query Language |
|---|---|---|---|
| Metrics Server | Basic resource metrics | None (current only) | kubectl top |
| Prometheus | Advanced metrics, alerting | Configurable (days to years) | PromQL |
| Grafana | Visualization | N/A | N/A |
| EFK Stack | Centralized logging | Configurable (days to months) | Kibana query language |
Best Practices
Metrics Server
Required for HPA and kubectl top
Deploy on all clusters
Prometheus
Use kube-prometheus-stack for production
Set appropriate retention based on your needs (15-30 days typical)
Use recording rules for expensive queries (see the sketch after this list)
Set up Alertmanager for notifications
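For retention, kube-prometheus-stack exposes it through the Helm values (prometheus.prometheusSpec.retention, e.g. 15d). For the recording-rules point above, the idea is to precompute an expensive expression under a new metric name so dashboards and alerts query the cheap, pre-aggregated series instead. A minimal sketch, with an illustrative rule name:

```yaml
groups:
  - name: recording-rules
    rules:
      # Precompute per-pod CPU rate; dashboards and alerts can query the recorded series directly
      - record: pod:container_cpu_usage_seconds:rate5m   # illustrative name (level:metric:operation convention)
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
```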
Logging
Use Fluentd or Fluent Bit as log collector
Store logs for at least 30 days for compliance
Sample high-volume logs to reduce storage
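One common way to act on the last point is to drop known-noisy lines (health and readiness probes) in the collector before they reach Elasticsearch; true statistical sampling generally requires an extra Fluentd plugin, so treat this as the simpler alternative. A minimal sketch using Fluentd's built-in grep filter (the log-line pattern is an assumption about your access-log format):

```
# Drop health-check noise before it is shipped to Elasticsearch
<filter kubernetes.**>
  @type grep
  <exclude>
    key log
    pattern /GET \/(health|healthz|readyz)/
  </exclude>
</filter>
```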
Practice Questions
What is the difference between Metrics Server and Prometheus?
How does Prometheus discover targets to scrape in Kubernetes?
Why are logs in Kubernetes considered ephemeral?
How would you find all error logs across 50 Pods without centralized logging?
What PromQL query would show the memory usage trend of a specific Pod over time?
Learn More
Practice Kubernetes monitoring and logging with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/