Prometheus + Grafana Setup: Complete Monitoring Stack

📅 Published: June 2026
⏱️ Estimated Reading Time: 22 minutes
🏷️ Tags: Prometheus, Grafana, Monitoring, Metrics, Observability, DevOps

Introduction: Why Prometheus and Grafana?

Monitoring is essential for understanding how your applications and infrastructure are performing. You need to know when something is wrong, what caused it, and how to fix it.

Prometheus and Grafana together form one of the most widely used open-source monitoring stacks:

Prometheus collects and stores metrics (CPU usage, memory, request rates, error counts)
Grafana visualizes those metrics in dashboards and charts

Think of Prometheus as the database that stores your metrics, and Grafana as the visualization layer that helps you understand them.

Why this stack is popular:

Open source and free
Huge ecosystem of exporters and integrations
Powerful query language (PromQL)
Beautiful, customizable dashboards
Active community and extensive documentation

Part 1: What is Prometheus?

Core Concepts

Concept	Description	Example
Metric	A measurement of a system	`cpu_usage_percent`
Time Series	Metric values over time	`cpu_usage_percent{instance="server1"}`
Labels	Key-value pairs for filtering	`method="GET", status="200"`
Scrape	Pulling metrics from a target	Every 15 seconds
Target	A source of metrics (app, server)	`localhost:9090/metrics`

How Prometheus Works

Pull model: Prometheus scrapes (pulls) metrics from targets over HTTP instead of waiting for targets to push them
Service discovery: Automatically finds targets to scrape (Kubernetes, AWS, file-based)
Time series database: Stores metrics efficiently with labels
PromQL: Powerful query language to analyze metrics
Alertmanager: Handles alerts and sends notifications
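
To make the pull model concrete, here is a minimal Python sketch of the plain-text format a target serves on its /metrics endpoint. The function names (`render_metric`, `escape_label_value`) and the sample values are illustrative assumptions; a real application would normally use an official client library instead of formatting text by hand.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample: one counter with two labels.
print(render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "status": "200"}, 1027)],
))
```

Prometheus fetches this text on every scrape and stores each line as one sample of a time series identified by the metric name plus its labels.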

Metric Types

Type	Description	Example
Counter	Only increases (requests, errors)	`http_requests_total`
Gauge	Can go up or down (CPU, memory)	`cpu_temperature_celsius`
Histogram	Distribution of values (latency)	`request_duration_seconds`
Summary	Similar to histogram, calculates quantiles client-side	`request_duration_seconds`
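
The difference between the types is easier to see in code. The sketch below models a counter, a gauge, and a histogram in plain Python. The class names and bucket boundaries are assumptions for illustration and do not mirror any client library's API.

```python
import math


class Counter:
    """Monotonic value: it only goes up (and restarts from zero with the process)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in either direction."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in cumulative buckets and tracks their sum and count."""

    def __init__(self, upper_bounds):
        # The +Inf bucket always exists and ends up equal to the total count.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Buckets are cumulative: increment every bucket whose bound covers the value.
        for index, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[index] += 1


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.7, 2.0):
    latency.observe(seconds)
print(list(zip(latency.upper_bounds, latency.bucket_counts)))
```

Each bucket count corresponds to one `_bucket` series with an `le` label, which is what `histogram_quantile()` later works from.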

Part 2: Installing Prometheus

Option 1: Docker (Quick Start)

# Create prometheus.yml configuration
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
EOF

# Run Prometheus
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus

# Access Prometheus UI at http://localhost:9090

Option 2: Kubernetes (kube-prometheus-stack)

The easiest way to deploy Prometheus in Kubernetes:

# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin

# Check installation
kubectl get pods -n monitoring

Option 3: Linux (Binary)

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar -xzf prometheus-2.51.0.linux-amd64.tar.gz
cd prometheus-2.51.0.linux-amd64

# Run Prometheus
./prometheus --config.file=prometheus.yml

Part 3: Prometheus Configuration (prometheus.yml)

Basic Configuration

global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules
  scrape_timeout: 10s       # Timeout for each scrape

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Rule files for alerts
rule_files:
  - "alerts/*.yml"

# Scrape configurations
scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Scrape node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  
  # Scrape with file-based service discovery
  - job_name: 'dynamic-targets'
    file_sd_configs:
      - files:
          - 'targets/*.json'
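
For the file-based discovery job above, each matched file holds a JSON list of target groups, where every group has a `targets` list and optional `labels`. Here is a minimal sketch of a script that generates such a file. The path `targets/web.json`, the host names, and the label values are hypothetical.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Atomically write a file_sd target file so Prometheus never reads a partial file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # The .tmp suffix keeps the temporary file out of the 'targets/*.json' glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # readable by the user Prometheus runs as
    os.replace(tmp_path, path)


if __name__ == "__main__":
    write_targets(
        "targets/web.json",
        [
            {
                "targets": ["web-1:8080", "web-2:8080"],
                "labels": {"environment": "production", "team": "platform"},
            }
        ],
    )
```

Prometheus watches the matched files and picks up changes without a restart, so a script like this can be run from cron or a deployment hook.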

Scrape Configuration Examples

Scrape a static target:

- job_name: 'my-app'
  static_configs:
    - targets: ['app-server:8080']
      labels:
        environment: 'production'
        team: 'platform'

Scrape with basic authentication:

- job_name: 'protected-app'
  static_configs:
    - targets: ['api.example.com:9090']
  basic_auth:
    username: 'prometheus'
    password: 'secret'

Scrape with Kubernetes service discovery:

- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: (.+):(?:\d+);(\d+)
      replacement: $1:$2
      target_label: __address__
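
Relabeling is easier to follow when you trace it by hand. The sketch below imitates the `keep` and `replace` actions in Python: source label values are joined with `;`, the regex must match the whole joined string, and `$1`-style references are filled in from capture groups. It is a simplified model, not Prometheus's implementation. The function names are assumptions, and Python's `re` module stands in for the RE2 engine Prometheus uses.

```python
import re


def joined_value(labels, source_labels, separator=";"):
    """Concatenate the source label values; missing labels count as empty strings."""
    return separator.join(labels.get(name, "") for name in source_labels)


def relabel_keep(labels, source_labels, regex):
    """Return True if the target survives a 'keep' rule."""
    return re.fullmatch(regex, joined_value(labels, source_labels)) is not None


def relabel_replace(labels, source_labels, regex, replacement, target_label):
    """Apply a 'replace' rule and return the new label set."""
    match = re.fullmatch(regex, joined_value(labels, source_labels))
    if match is None:
        return dict(labels)  # no match: labels are left untouched
    # Convert $1-style references into Python's \g<1> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


pod = {
    "__address__": "10.0.0.5:3000",
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "8080",
}
if relabel_keep(pod, ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"], "true"):
    pod = relabel_replace(
        pod,
        ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
        r"(.+):(?:\d+);(\d+)",
        "$1:$2",
        "__address__",
    )
print(pod["__address__"])
```

Note that with this regex an address without a port does not match, so the annotation port is only applied to pods that declare a container port.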

Part 4: Exporters (Collecting Metrics)

Exporters are agents that collect metrics from systems and expose them in Prometheus format.

Node Exporter (System Metrics)

# Run Node Exporter with Docker
docker run -d \
  --name node-exporter \
  --network host \
  --pid host \
  prom/node-exporter

# Or with Docker Compose
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  node-exporter:
    image: prom/node-exporter
    network_mode: host
    pid: host
    restart: unless-stopped
EOF

Metrics collected: CPU, memory, disk, network, load average, filesystem

cAdvisor (Container Metrics)

docker run -d \
  --name cadvisor \
  --network host \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor

Metrics collected: Container CPU, memory, network, filesystem usage

Blackbox Exporter (Probe Endpoints)

docker run -d \
  --name blackbox-exporter \
  -p 9115:9115 \
  prom/blackbox-exporter

Configuration:

- job_name: 'http-probes'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://example.com
        - https://google.com
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115
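
The relabeling above moves the original target into a `target` URL parameter and points the scrape at the exporter itself. As a rough sketch, this is the URL Prometheus ends up requesting for each target. The helper name `probe_url` is an assumption for illustration.

```python
from urllib.parse import urlencode


def probe_url(exporter_address, module, target):
    """Build the URL that is scraped after the blackbox relabeling rules run."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter_address}/probe?{query}"


for site in ("https://example.com", "https://google.com"):
    print(probe_url("blackbox-exporter:9115", "http_2xx", site))
```

Requesting such a URL by hand (for example with curl) is a quick way to check a probe module before wiring it into Prometheus.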

Common Exporters

Exporter	Purpose	Port
node_exporter	System metrics	9100
cadvisor	Container metrics	8080
blackbox_exporter	Endpoint probing	9115
mysqld_exporter	MySQL metrics	9104
postgres_exporter	PostgreSQL metrics	9187
redis_exporter	Redis metrics	9121
nginx_exporter	Nginx metrics	9113
cloudwatch_exporter	AWS metrics	9106

Part 5: PromQL (Query Language)

Basic Queries

# Per-second rate of user-mode CPU time, averaged over the last 5 minutes (one series per CPU core)
rate(node_cpu_seconds_total{mode="user"}[5m])

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# HTTP request rate
rate(http_requests_total[5m])

# 95th percentile request latency, aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Count of errors in last hour
increase(http_errors_total[1h])
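
It helps to know roughly what `rate()` computes. The sketch below is a simplified Python model: it adds up the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. The real function also extrapolates to the edges of the window, which this sketch deliberately leaves out. The function name and sample data are assumptions.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, value) pairs, sorted by time.
    Returns None when there are not enough samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. process restart): assume it restarted from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    return increase / elapsed


# Hypothetical scrapes every 15 seconds, with a restart before the fourth sample.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(scrapes))
```

This is also why `rate()` and `increase()` should only be applied to counters: on a gauge, every decrease would be misread as a reset.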

Useful Functions

Function	Purpose	Example
`rate()`	Per-second average rate of increase over the window	`rate(counter[5m])`
`increase()`	Total increase over time	`increase(counter[1h])`
`sum()`	Sum of values	`sum(cpu_usage) by (instance)`
`avg()`	Average of values	`avg(memory_usage) by (pod)`
`max()`	Maximum value	`max(latency_seconds)`
`topk()`	Top K values	`topk(10, cpu_usage)`
`histogram_quantile()`	Quantiles from histogram	`histogram_quantile(0.99, rate(latency_bucket[5m]))`
`label_replace()`	Modify labels	`label_replace(metric, "new", "$1", "old", "(.*)")`
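
Of these functions, `histogram_quantile()` is the least intuitive, because it estimates a quantile from bucket counts rather than from raw observations. The sketch below shows the core idea in Python: find the bucket that contains the requested rank, then interpolate linearly inside it. It is a simplified model with assumed names and sample data, not the exact Prometheus algorithm.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) pairs that includes
    the (math.inf, total_count) bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds.
latency_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(histogram_quantile(0.5, latency_buckets))
```

Because of the interpolation, the result is only as precise as the bucket boundaries, so it pays to choose buckets close to the latencies you care about.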

Common Queries for Dashboards

# CPU usage by instance
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Memory usage by instance
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage by mount point
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Pod restarts per namespace
sum(kube_pod_container_status_restarts_total) by (namespace)

# Failed pods
sum(kube_pod_status_phase{phase="Failed"}) by (namespace)

# API request rate by endpoint
sum(rate(http_requests_total[5m])) by (endpoint, method)

# Error rate percentage
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

Part 6: Alerting with Alertmanager

Prometheus Alert Rules

Create alerts.yml:

groups:
  - name: instance-alerts
    rules:
      # Alert when instance is down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."

  - name: cpu-alerts
    rules:
      # Alert when CPU usage is high
      - alert: HighCPUUsage
        expr: (100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for more than 10 minutes."

  - name: memory-alerts
    rules:
      # Alert when memory usage is high
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% for more than 5 minutes."

  - name: disk-alerts
    rules:
      # Alert when disk space is low
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value }}% disk space remaining on {{ $labels.mountpoint }}"
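
The `for` field in these rules is what keeps short spikes from paging anyone: an alert is pending while its expression is true, and only fires once it has been true for the whole duration. Here is a small Python sketch of that state machine. The function name and the evaluation timeline are assumptions for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true) pairs in time order.
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# Hypothetical timeline: evaluated every 60 seconds, with `for: 5m` (300 seconds).
timeline = [(0, False), (60, True), (120, True), (180, False)]
timeline += [(t, True) for t in range(240, 600, 60)]
for (timestamp, _), state in zip(timeline, alert_states(timeline, 300)):
    print(timestamp, state)
```

The brief recovery at 180 seconds resets the timer, which is why a flapping condition never reaches the firing state.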

Alertmanager Configuration

Create alertmanager.yml:

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        title: 'Prometheus Alert'
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'your-routing-key'
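
The `group_by` setting in the route decides which alerts are bundled into one notification. The Python sketch below shows the grouping step only. It ignores the `group_wait`, `group_interval`, and `repeat_interval` timers, and the function name and alert data are assumptions.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bundle alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((label, alert["labels"].get(label, "")) for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts.
firing = [
    {"labels": {"alertname": "InstanceDown", "instance": "web-1"}},
    {"labels": {"alertname": "InstanceDown", "instance": "web-2"}},
    {"labels": {"alertname": "HighCPUUsage", "instance": "web-1"}},
]
for key, members in group_alerts(firing, ["alertname"]).items():
    print(key, len(members))
```

With `group_by: ['alertname']`, an outage that takes down many instances produces one notification listing all of them instead of one message per instance.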

Part 7: Installing Grafana

Option 1: Docker

docker run -d \
  --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana

# Access at http://localhost:3000 (admin/admin)

Option 2: Kubernetes (with kube-prometheus-stack)

Grafana is included in the kube-prometheus-stack. Access it with:

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Option 3: Linux (Binary)

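# Download Grafana
wget https://dl.grafana.com/oss/release/grafana-10.4.0.linux-amd64.tar.gz
tar -xzf grafana-10.4.0.linux-amd64.tar.gz
# The extracted directory is named grafana-10.4.0 or grafana-v10.4.0, depending on the release
cd grafana-*10.4.0/

# Start Grafana (the older grafana-server binary is deprecated in Grafana 10)
./bin/grafana server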

Part 8: Configuring Grafana

Step 1: Add Prometheus Data Source

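1. Log into Grafana (admin/admin)
2. Go to Connections → Data sources → Add new data source (Configuration → Data Sources in Grafana 9 and earlier)
3. Select Prometheus
4. Configure:
- Name: Prometheus
- URL: http://prometheus:9090 (or http://localhost:9090 for local)
- Access: Server (default), if your Grafana version shows this option
5. Click Save & Test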
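
The same data source can be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request so you can inspect it. The function name is an assumption, and the admin/admin credentials and URLs are the defaults used in this guide.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) the request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # "proxy" is the API name for Server access
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_datasource_request(
    "http://localhost:3000", "admin", "admin", "http://prometheus:9090"
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

For repeatable setups, Grafana's file-based provisioning achieves the same result without any API calls.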

Step 2: Import Dashboard

1. Go to Dashboards → Import
2. Enter dashboard ID:
- 1860 for Node Exporter dashboard
- 6417 for Kubernetes cluster monitoring
- 193 for Docker monitoring
3. Click Load
4. Select Prometheus data source
5. Click Import

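Step 3: Create a Custom Dashboard

You can build panels in the UI or define the dashboard as JSON. The payload below uses the shape accepted by Grafana's HTTP API (POST /api/dashboards/db):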

{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) by (instance) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
      }
    ]
  }
}
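
Writing `gridPos` values by hand gets tedious once a dashboard has more than a few panels. Grafana lays panels out on a grid that is 24 columns wide, so positions can be computed. This Python sketch generates panels like the ones above. The function name is an assumption, as is the `timeseries` panel type, which is the current replacement for the older graph panel.

```python
import json


def layout_panels(panel_specs, width=12, height=8, columns=24):
    """Place panels left to right, wrapping to a new row when the grid is full.

    panel_specs: list of (title, promql_expression) pairs.
    """
    per_row = columns // width
    panels = []
    for index, (title, expr) in enumerate(panel_specs):
        panels.append({
            "title": title,
            "type": "timeseries",
            "targets": [{"expr": expr, "legendFormat": "{{ instance }}"}],
            "gridPos": {
                "h": height,
                "w": width,
                "x": (index % per_row) * width,
                "y": (index // per_row) * height,
            },
        })
    return panels


dashboard = {
    "dashboard": {
        "title": "System Overview",
        "panels": layout_panels([
            ("CPU Usage", '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)'),
            ("Memory Usage", "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"),
        ]),
    }
}
print(json.dumps(dashboard, indent=2))
```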

Part 9: Docker Compose Full Stack

version: '3.8'

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts/alerts.yml  # matches rule_files: "alerts/*.yml"
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter
    container_name: node-exporter
    network_mode: host
    pid: host
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

volumes:
  grafana-data:

Note: containers on the Compose network reach each other by service name, so a prometheus.yml used with this stack should target alertmanager:9093 and cadvisor:8080 rather than localhost. node-exporter runs on the host network, so Prometheus has to reach it through the host's address on port 9100.

Part 10: Common Dashboards and IDs

Dashboard	ID	Description
Node Exporter	1860	System metrics (CPU, memory, disk, network)
Kubernetes Cluster	6417	Kubernetes cluster monitoring
Docker Monitoring	193	Docker container metrics
Prometheus Stats	3662	Prometheus self-monitoring
MySQL	7362	MySQL database metrics
PostgreSQL	9628	PostgreSQL metrics
Nginx	12708	Nginx web server metrics
Redis	11835	Redis cache metrics
Blackbox Exporter	13659	Endpoint monitoring

Part 11: Best Practices

Metric Naming

Use snake_case: http_requests_total, not httpRequestsTotal
Include units in name: request_duration_seconds, memory_usage_bytes
Use _total suffix for counters
Leave _bucket, _sum, and _count to the client library, which generates these series for histograms automatically

Label Usage

Keep label cardinality low (avoid user IDs, email addresses)
Use labels for filtering dimensions (environment, region, service)
Common labels: job, instance, environment, service, version
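
To see why cardinality matters, remember that every distinct combination of label values is its own time series. A quick back-of-the-envelope calculation in Python makes the growth visible. The label names and counts below are made-up numbers for illustration.

```python
import math


def estimated_series(label_value_counts):
    """Worst-case number of time series for one metric.

    label_value_counts maps each label name to its number of distinct values.
    """
    return math.prod(label_value_counts.values())


bounded = {"method": 5, "status": 10, "instance": 20}
print(estimated_series(bounded))

# Adding a per-user label multiplies everything by the number of users.
unbounded = dict(bounded, user_id=100_000)
print(estimated_series(unbounded))
```

The first metric tops out at a thousand series, while the second can reach a hundred million, which is far more than a single Prometheus server handles comfortably.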

Retention and Performance

# Prometheus storage settings
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
--storage.tsdb.wal-compression
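
To pick sensible retention values, the Prometheus storage documentation gives a rough capacity formula: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes after compression). Here it is as a small Python helper. The function name and the ingestion rate in the example are assumptions.

```python
def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate: retention_seconds * ingestion rate * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical server ingesting 10,000 samples per second with 30 days of retention.
estimate = needed_disk_bytes(30, 10_000)
print(f"{estimate / 1e9:.2f} GB")
```

You can read the actual ingestion rate of a running server from its own metrics, for example with `rate(prometheus_tsdb_head_samples_appended_total[5m])`.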

Alert Design Principles

Alert on symptoms, not causes
Include runbook links in annotations
Set appropriate `for` durations to avoid flapping
Group related alerts

Grafana Commands Cheat Sheet

# Docker
docker run -d -p 3000:3000 grafana/grafana

# Kubernetes
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Install a plugin with the CLI
grafana-cli plugins install grafana-piechart-panel

Summary

Component	Purpose	Default Port
Prometheus	Metrics storage and query	9090
Node Exporter	System metrics	9100
cAdvisor	Container metrics	8080
Alertmanager	Alert routing	9093
Grafana	Visualization	3000

The Prometheus + Grafana stack gives you complete visibility into your systems. Start with node-exporter and Prometheus, then add Grafana for dashboards. Add alerting last, once you understand what normal looks like.

Learn More

Practice Prometheus and Grafana with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/
