Prometheus + Grafana Setup: Complete Monitoring Stack
📅 Published: June 2026
⏱️ Estimated Reading Time: 22 minutes
🏷️ Tags: Prometheus, Grafana, Monitoring, Metrics, Observability, DevOps
Introduction: Why Prometheus and Grafana?
Monitoring is essential for understanding how your applications and infrastructure are performing. You need to know when something is wrong, what caused it, and how to fix it.
Prometheus and Grafana together form one of the most widely used open-source monitoring stacks:
Prometheus collects and stores metrics (CPU usage, memory, request rates, error counts)
Grafana visualizes those metrics in dashboards and charts
Think of Prometheus as the database that stores your metrics, and Grafana as the visualization layer that helps you understand them.
Why this stack is popular:
Open source and free
Huge ecosystem of exporters and integrations
Powerful query language (PromQL)
Beautiful, customizable dashboards
Active community and extensive documentation
Part 1: What is Prometheus?
Core Concepts
| Concept | Description | Example |
|---|---|---|
| Metric | A measurement of a system | cpu_usage_percent |
| Time Series | Metric values over time | cpu_usage_percent{instance="server1"} |
| Labels | Key-value pairs for filtering | method="GET", status="200" |
| Scrape | Pulling metrics from a target | Every 15 seconds |
| Target | A source of metrics (app, server) | localhost:9090/metrics |
How Prometheus Works
Pull model: Prometheus scrapes (pulls) metrics from targets on a schedule instead of waiting for targets to push them
Service discovery: Automatically finds targets to scrape (Kubernetes, AWS, file-based)
Time series database: Stores metrics efficiently with labels
PromQL: Powerful query language to analyze metrics
Alertmanager: Handles alerts and sends notifications
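To make the pull model concrete, here is a minimal sketch of what a target looks like from Prometheus's side: an HTTP endpoint that returns the current metric values as plain text. It uses only the Python standard library. The counter values, the metric name and port 8000 are assumptions made for illustration; a real application would normally use an official client library instead of formatting the text by hand, and this sketch does not escape special characters in label values.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real app would update these per request.
REQUEST_COUNTS = {("GET", "200"): 42, ("POST", "500"): 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current values whenever Prometheus scrapes /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve on an arbitrary port (blocks until interrupted):
# HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

With the server running, a scrape job pointing at port 8000 would collect one time series per method/status pair on every scrape.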
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Only increases (requests, errors) | http_requests_total |
| Gauge | Can go up or down (CPU, memory) | cpu_temperature_celsius |
| Histogram | Distribution of values (latency) | request_duration_seconds |
| Summary | Similar to histogram, calculates quantiles client-side | request_duration_seconds |
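The differences between the types are easier to see in code. The classes below are a simplified model of counter, gauge and histogram semantics, not the API of any Prometheus client library; the class and method names are chosen for illustration only.

```python
import bisect


class Counter:
    """Monotonic counter: it can only be incremented."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Gauge: a value that can be set to anything at any time."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Histogram with cumulative buckets plus a running sum and count."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Find the first bucket whose upper bound (le) is >= value ...
        first = bisect.bisect_left(self.upper_bounds, value)
        # ... and increment it and every larger bucket (buckets are cumulative).
        for i in range(first, len(self.upper_bounds)):
            self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1
```

Because buckets are cumulative, the last (+Inf) bucket always equals the total number of observations, which is what makes quantile estimation on the server side possible.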
Part 2: Installing Prometheus
Option 1: Docker (Quick Start)
# Create prometheus.yml configuration
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
EOF

# Run Prometheus
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus

# Access Prometheus UI at http://localhost:9090
Option 2: Kubernetes (kube-prometheus-stack)
The easiest way to deploy Prometheus in Kubernetes:
# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin

# Check installation
kubectl get pods -n monitoring
Option 3: Linux (Binary)
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar -xzf prometheus-2.51.0.linux-amd64.tar.gz
cd prometheus-2.51.0.linux-amd64

# Run Prometheus
./prometheus --config.file=prometheus.yml
Part 3: Prometheus Configuration (prometheus.yml)
Basic Configuration
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules
  scrape_timeout: 10s       # Timeout for each scrape

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Rule files for alerts
rule_files:
  - "alerts/*.yml"

# Scrape configurations
scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # Scrape with file-based service discovery
  - job_name: 'dynamic-targets'
    file_sd_configs:
      - files:
          - 'targets/*.json'
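The file-based service discovery job above reads target lists from JSON files, which Prometheus reloads when they change. The sketch below shows one way such a file could be generated. The function name, host names and label values are hypothetical; the JSON shape (a list of groups, each with "targets" and optional "labels") is the format file_sd expects.

```python
import json
import os
import tempfile


def write_targets_file(path, hosts, port, labels):
    """Write one file_sd target group as JSON, replacing the file atomically."""
    groups = [{"targets": [f"{h}:{port}" for h in hosts], "labels": labels}]
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file first so Prometheus never reads a half-written file.
    # The .tmp suffix keeps it from matching the targets/*.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    os.replace(tmp_path, path)
    return groups
```

For example, write_targets_file("targets/web.json", ["web-1", "web-2"], 9100, {"environment": "production"}) would register two node exporter targets without touching prometheus.yml or restarting Prometheus.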
Scrape Configuration Examples
Scrape a static target:
- job_name: 'my-app'
  static_configs:
    - targets: ['app-server:8080']
      labels:
        environment: 'production'
        team: 'platform'
Scrape with basic authentication:
- job_name: 'protected-app'
  static_configs:
    - targets: ['api.example.com:9090']
  basic_auth:
    username: 'prometheus'
    password: 'secret'
Scrape with Kubernetes service discovery:
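- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only pods annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    # Rewrite the scrape address to use the port from prometheus.io/port.
    # The port in __address__ is optional, so pods without a declared
    # container port are still rewritten.
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__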
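Relabeling rules are easier to reason about when you can trace them by hand. The sketch below imitates a single "replace" step: the source label values are joined with ";", the regex must match the whole joined string, and the capture groups fill in the replacement. It uses Python's re module as a stand-in for the RE2 engine Prometheus uses, and the function name is made up for this example. The pattern shown is the one from the upstream Prometheus Kubernetes example configuration, which also accepts an address without a port.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label,
                    separator=";"):
    """Apply one 'replace' relabel step to a dict of labels."""
    # Prometheus joins the source label values with the separator ...
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # ... and matches the regex against the whole string (fully anchored).
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels are left untouched
    # Translate $1-style references into Python's \g<1> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


PORT_ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_port"
pod = {"__address__": "10.0.0.5:8080", PORT_ANNOTATION: "9102"}
rewritten = relabel_replace(
    pod,
    ["__address__", PORT_ANNOTATION],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
    "__address__",
)
print(rewritten["__address__"])  # 10.0.0.5:9102
```

If the pod has no port annotation, the joined string ends in ";" with no digits after it, the regex does not match, and the address is left as discovered.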
Part 4: Exporters (Collecting Metrics)
Exporters are agents that collect metrics from systems and expose them in Prometheus format.
Node Exporter (System Metrics)
# Run Node Exporter with Docker
docker run -d \
  --name node-exporter \
  --network host \
  --pid host \
  prom/node-exporter

# Or with Docker Compose
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  node-exporter:
    image: prom/node-exporter
    network_mode: host
    pid: host
    restart: unless-stopped
EOF
Metrics collected: CPU, memory, disk, network, load average, filesystem
cAdvisor (Container Metrics)
docker run -d \
  --name cadvisor \
  --network host \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor
Metrics collected: Container CPU, memory, network, filesystem usage
Blackbox Exporter (Probe Endpoints)
docker run -d \
  --name blackbox-exporter \
  -p 9115:9115 \
  prom/blackbox-exporter
Configuration:
- job_name: 'http-probes'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://example.com
        - https://google.com
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115
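The three relabel steps in this job are what turn "a list of URLs" into "requests to the Blackbox Exporter". The sketch below traces them for one target and prints the URL Prometheus would end up scraping. It is a hand-written walk-through, not Prometheus code, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def blackbox_scrape_url(target, module="http_2xx",
                        exporter="blackbox-exporter:9115"):
    """Trace the three relabel steps for a single probe target."""
    # Before relabeling: the target is the address, params become __param_* labels.
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the probe target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the 'instance' label on the results.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the Blackbox Exporter.
    labels["__address__"] = exporter
    query = urlencode({"module": labels["__param_module"],
                       "target": labels["__param_target"]})
    url = "http://" + labels["__address__"] + "/probe?" + query
    return url, labels["instance"]


url, instance = blackbox_scrape_url("https://example.com")
print(url)
print(instance)
```

The resulting metrics carry instance="https://example.com", so dashboards and alerts refer to the probed site and not to the exporter itself.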
Common Exporters
| Exporter | Purpose | Port |
|---|---|---|
| node_exporter | System metrics | 9100 |
| cadvisor | Container metrics | 8080 |
| blackbox_exporter | Endpoint probing | 9115 |
| mysqld_exporter | MySQL metrics | 9104 |
| postgres_exporter | PostgreSQL metrics | 9187 |
| redis_exporter | Redis metrics | 9121 |
| nginx_exporter | Nginx metrics | 9113 |
| cloudwatch_exporter | AWS metrics | 9106 |
Part 5: PromQL (Query Language)
Basic Queries
# Per-second rate of CPU time spent in user mode, averaged over the last 5 minutes
rate(node_cpu_seconds_total{mode="user"}[5m])

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# HTTP request rate
rate(http_requests_total[5m])

# 95th percentile request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Count of errors in the last hour
increase(http_errors_total[1h])
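The percentile query deserves a closer look, because histogram_quantile() does not see individual request durations. It only sees cumulative bucket counts and interpolates linearly inside the bucket where the requested rank falls. The function below is a simplified re-implementation of that idea for illustration; it assumes the lowest bucket starts at 0 and ignores several edge cases the real function handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The quantile lies above the highest finite bound,
                # so that bound is the best available answer.
                return prev_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


latency = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(histogram_quantile(0.5, latency))
print(histogram_quantile(0.95, latency))
```

The second call shows a practical consequence: once the quantile falls into the +Inf bucket the result is capped at the largest finite bound, so bucket boundaries should extend past the latencies you care about.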
Useful Functions
| Function | Purpose | Example |
|---|---|---|
| rate() | Per-second average over time | rate(counter[5m]) |
| increase() | Total increase over time | increase(counter[1h]) |
| sum() | Sum of values | sum(cpu_usage) by (instance) |
| avg() | Average of values | avg(memory_usage) by (pod) |
| max() | Maximum value | max(latency_seconds) |
| topk() | Top K values | topk(10, cpu_usage) |
| histogram_quantile() | Quantiles from histogram | histogram_quantile(0.99, rate(latency_bucket[5m])) |
| label_replace() | Modify labels | label_replace(metric, "new", "$1", "old", "(.*)") |
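Since rate() and increase() appear in almost every query, it helps to see what they compute. The sketch below works on a list of (timestamp, value) samples from one counter and treats any drop in value as a restart. It is a simplification: the real functions also extrapolate to the edges of the time window, so their results can differ slightly from this arithmetic.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the process restarted and the counter
        # began again from 0, so the whole current value counts as increase.
        total += (cur - prev) if cur >= prev else cur
    return total


def counter_rate(samples):
    """Per-second average increase over the sampled window."""
    if len(samples) < 2:
        return None  # a single sample is not enough to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Five scrapes, 15 seconds apart; the counter resets before the fourth.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(counter_increase(scrapes))
print(counter_rate(scrapes))
```

This is also why rate() should only be applied to counters: on a gauge, every legitimate decrease would be misread as a reset.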
Common Queries for Dashboards
# CPU usage by instance
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Memory usage by instance
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage by mount point
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Pod restarts per namespace
sum(kube_pod_container_status_restarts_total) by (namespace)

# Failed pods
sum(kube_pod_status_phase{phase="Failed"}) by (namespace)

# API request rate by endpoint
sum(rate(http_requests_total[5m])) by (endpoint, method)

# Error rate percentage
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
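Any of these queries can also be run outside the UI through the Prometheus HTTP API, which is handy for scripts and health checks. The sketch below builds an instant query against the /api/v1/query endpoint with the standard library. The base URL is an assumption (a local Prometheus on port 9090), and the second function needs a reachable server to return anything.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def run_instant_query(base_url, promql, timeout=10):
    """Run the query and return the result series (needs a live server)."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


print(instant_query_url("http://localhost:9090", "up == 0"))
```

Calling run_instant_query("http://localhost:9090", "up == 0") would return one entry per target that is currently down, or an empty list when everything is up.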
Part 6: Alerting with Alertmanager
Prometheus Alert Rules
Create alerts.yml:
groups:
  - name: instance-alerts
    rules:
      # Alert when instance is down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."

  - name: cpu-alerts
    rules:
      # Alert when CPU usage is high
      - alert: HighCPUUsage
        expr: (100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for more than 10 minutes."

  - name: memory-alerts
    rules:
      # Alert when memory usage is high
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% for more than 5 minutes."

  - name: disk-alerts
    rules:
      # Alert when disk space is low
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value }}% disk space remaining on {{ $labels.mountpoint }}"
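The "for" field in these rules is what separates a brief blip from a real alert. While the expression is true but has not yet been true for the full duration, the alert is pending; it only fires once the duration has elapsed, and a single false evaluation resets the clock. The sketch below models that behaviour for a series of rule evaluations. It is a simplified model with a hypothetical function name, assuming evaluations happen at a fixed interval.

```python
def alert_states(condition_by_eval, for_seconds, eval_interval=15):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for i, condition_true in enumerate(condition_by_eval):
        now = i * eval_interval
        if not condition_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Expression is true twice, recovers once, then stays true.
history = [True, True, False, True, True, True]
print(alert_states(history, for_seconds=30))
```

In this run the first two true evaluations never fire because the recovery in between resets the timer; only the final run of three reaches the 30 second threshold.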
Alertmanager Configuration
Create alertmanager.yml:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        title: 'Prometheus Alert'
        # Double quotes so YAML turns \n into a real line break
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'your-routing-key'
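The group_by setting in the route decides how many notifications a burst of alerts turns into: alerts that share the same values for the listed labels are batched into one notification. The sketch below imitates that bucketing for a few example alerts, represented here as plain label dictionaries. It only illustrates the grouping key; the timing settings (group_wait, group_interval, repeat_interval) are not modelled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "instance": "web-1"},
    {"alertname": "InstanceDown", "instance": "web-2"},
    {"alertname": "HighCPUUsage", "instance": "web-1"},
]
by_name = group_alerts(alerts, ["alertname"])
print(len(by_name))  # 2 notifications instead of 3
```

Adding "instance" to group_by would produce three separate notifications for the same alerts, which is more precise but noisier during a large outage.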
Part 7: Installing Grafana
Option 1: Docker
docker run -d \
  --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana

# Access at http://localhost:3000 (admin/admin)
Option 2: Kubernetes (with kube-prometheus-stack)
Grafana is included in the kube-prometheus-stack. Access it with:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Option 3: Linux (Binary)
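# Download Grafana
wget https://dl.grafana.com/oss/release/grafana-10.4.0.linux-amd64.tar.gz
tar -xzf grafana-10.4.0.linux-amd64.tar.gz
cd grafana-v10.4.0   # extracted directory name; run ls if yours differs

# Start Grafana (the older grafana-server binary is deprecated in Grafana 10)
./bin/grafana server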
Part 8: Configuring Grafana
Step 1: Add Prometheus Data Source
Log into Grafana (admin/admin)
Go to Configuration → Data Sources → Add data source (in Grafana 10 and later: Connections → Data sources)
Select Prometheus
Configure:
Name: Prometheus
URL: http://prometheus:9090 (or http://localhost:9090 for local)
Access: Server
Click Save & Test
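The same data source can be registered without clicking through the UI, using Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request so it can be inspected before sending. The URLs and the admin/admin credentials mirror the defaults used in this guide and are assumptions about your setup; "proxy" is the API value that corresponds to the Server access mode.

```python
import base64
import json
from urllib.request import Request


def datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) the request that registers Prometheus."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # shown as "Server" access in the UI
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


req = datasource_request(
    "http://localhost:3000", "admin", "admin", "http://prometheus:9090"
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen() would send it to a running Grafana instance. For repeatable setups, Grafana's file-based provisioning is usually a better fit than one-off API calls.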
Step 2: Import Dashboard
Go to Dashboards → Import
Enter dashboard ID:
1860 for Node Exporter dashboard
6417 for Kubernetes cluster monitoring
315 for Docker monitoring
Click Load
Select Prometheus data source
Click Import
Step 3: Create a Custom Dashboard
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) by (instance) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
      }
    ]
  }
}
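Writing dashboard JSON by hand gets tedious once there are more than a couple of panels, mostly because of the gridPos arithmetic. The sketch below generates the same structure from a list of (title, query) pairs and places panels left to right on Grafana's 24-unit-wide grid. The helper names are made up for this example, and the "timeseries" panel type is an assumption you can change through the panel_type parameter.

```python
import json


def panel(title, expr, x, y, width=12, height=8, panel_type="timeseries"):
    """One panel with a single PromQL target."""
    return {
        "title": title,
        "type": panel_type,
        "targets": [{"expr": expr, "legendFormat": "{{ instance }}"}],
        "gridPos": {"h": height, "w": width, "x": x, "y": y},
    }


def dashboard(title, queries, columns=2, height=8):
    """Lay panels out left to right, wrapping to a new row when full."""
    width = 24 // columns
    panels = [
        panel(name, expr,
              x=(i % columns) * width,
              y=(i // columns) * height,
              width=width, height=height)
        for i, (name, expr) in enumerate(queries)
    ]
    return {"dashboard": {"title": title, "panels": panels}}


overview = dashboard("System Overview", [
    ("CPU Usage",
     '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)'),
    ("Memory Usage",
     "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"),
])
print(json.dumps(overview, indent=2))
```

json.dumps takes care of escaping the quotes inside the PromQL expressions, which is one of the easier mistakes to make when editing the JSON manually.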
Part 9: Docker Compose Full Stack
version: '3.8'

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Mounted under alerts/ so it matches rule_files: "alerts/*.yml" in prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts/alerts.yml
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter
    container_name: node-exporter
    network_mode: host
    pid: host
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

volumes:
  grafana-data:
Part 10: Common Dashboards and IDs
| Dashboard | ID | Description |
|---|---|---|
| Node Exporter | 1860 | System metrics (CPU, memory, disk, network) |
| Kubernetes Cluster | 6417 | Kubernetes cluster monitoring |
| Docker Monitoring | 315 | Docker container metrics |
| Prometheus Stats | 3662 | Prometheus self-monitoring |
| MySQL | 7362 | MySQL database metrics |
| PostgreSQL | 9628 | PostgreSQL metrics |
| Nginx | 12708 | Nginx web server metrics |
| Redis | 11835 | Redis cache metrics |
| Blackbox Exporter | 13659 | Endpoint monitoring |
Part 11: Best Practices
Metric Naming
Use snake_case: http_requests_total, not httpRequestsTotal
Include units in name: request_duration_seconds, memory_usage_bytes
Use _total suffix for counters
Use _bucket, _sum, _count for histograms
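Naming conventions like these are easy to check automatically, for example in a code review script. The sketch below is a small, deliberately incomplete linter: it checks snake_case, the _total suffix for counters, and warns when a plain metric uses a suffix that histograms and summaries generate on their own. The function name and message wording are made up for this example, and unit suffixes are not checked.

```python
import re

SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")
GENERATED_SUFFIXES = ("_bucket", "_sum", "_count")


def naming_problems(name, metric_type):
    """Return a list of naming issues for one metric (empty means it looks fine)."""
    problems = []
    if not SNAKE_CASE.fullmatch(name):
        problems.append("not snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end in _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total is meant for counters")
    if name.endswith(GENERATED_SUFFIXES):
        problems.append("suffix is generated by histograms and summaries")
    return problems


print(naming_problems("httpRequestsTotal", "counter"))
print(naming_problems("http_requests_total", "counter"))
```

Note the last check: when you define a histogram you name only the base metric (request_duration_seconds), and the _bucket, _sum and _count series are created for you.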
Label Usage
Keep label cardinality low (avoid user IDs, email addresses)
Use labels for filtering dimensions (environment, region, service)
Common labels: job, instance, environment, service, version
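The reason cardinality matters is that every distinct combination of label values becomes its own time series. A quick upper bound is the product of the number of values each label can take. The sketch below computes that bound; the label names and counts are invented for illustration, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(len(values) for values in label_values.values())


labels = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "400", "404", "500", "503"],
    "instance": [f"app-{i}" for i in range(10)],
}
print(worst_case_series(labels))  # 4 * 5 * 10 = 200

# Adding a per-user label multiplies everything by the number of users.
labels["user_id"] = [str(i) for i in range(10_000)]
print(worst_case_series(labels))  # 2,000,000
```

A single unbounded label turns a metric that costs 200 series into one that can cost millions, which is why user IDs and email addresses belong in logs, not labels.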
Retention and Performance
# Prometheus storage settings
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
--storage.tsdb.wal-compression
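To pick sensible retention values it helps to estimate disk usage first. The Prometheus documentation gives a rough formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies that formula; the series count and scrape interval are example numbers, and 2 bytes per sample is used as a conservative assumption.

```python
def samples_per_second(active_series, scrape_interval_seconds):
    """Each active series produces one sample per scrape."""
    return active_series / scrape_interval_seconds


def required_disk_bytes(retention_days, ingest_rate, bytes_per_sample=2.0):
    """Rough capacity estimate: retention * ingestion rate * sample size."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * ingest_rate * bytes_per_sample


# Example: 100,000 active series scraped every 15 seconds, kept for 30 days.
rate = samples_per_second(100_000, 15)
estimate = required_disk_bytes(30, rate)
print(f"{estimate / 1e9:.1f} GB")
```

Under these assumptions the estimate is comfortably below a 50GB size limit, so the time-based retention would be the one that takes effect first.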
Alert Design Principles
Alert on symptoms, not causes
Include runbook links in annotations
Set appropriate "for" durations to avoid flapping
Group related alerts
Grafana Commands Cheat Sheet
# Docker
docker run -d -p 3000:3000 grafana/grafana

# Kubernetes
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Install a plugin from the CLI
grafana-cli plugins install grafana-piechart-panel
Summary
| Component | Purpose | Default Port |
|---|---|---|
| Prometheus | Metrics storage and query | 9090 |
| Node Exporter | System metrics | 9100 |
| cAdvisor | Container metrics | 8080 |
| Alertmanager | Alert routing | 9093 |
| Grafana | Visualization | 3000 |
The Prometheus + Grafana stack gives you complete visibility into your systems. Start with node-exporter and Prometheus, then add Grafana for dashboards. Add alerting last, once you understand what normal looks like.
Learn More
Practice Prometheus and Grafana with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/