Lesson 1: Introduction to Monitoring
Monitoring a Kubernetes cluster is essential for maintaining reliability, performance, and security. This vast topic deserves its own dedicated course, but this crash course provides a solid foundation for beginners.
Why Monitor Kubernetes?
- Early Problem Detection: Catch issues before they become critical failures
- Resource Optimization: Identify underutilized or overloaded resources
- Capacity Planning: Make data-driven decisions about scaling
- Performance Troubleshooting: Diagnose bottlenecks and slowdowns
- Cost Management: Track resource consumption and optimize spending
- SLA Compliance: Ensure your applications meet service level agreements
What to Monitor in Kubernetes
Infrastructure Metrics
- Node-level: CPU, memory, disk, network
- Cluster-level: Total capacity and usage
- etcd: Database health and performance
- Control plane: API server, scheduler, controller manager
Application Metrics
- Pod metrics: Container CPU, memory, restarts
- Deployment health: Replica status, rollout progress
- Service performance: Request rates, latency, errors
- Custom metrics: Application-specific KPIs
The Four Golden Signals
1. Latency
The time it takes to service a request. Track both successful and failed requests separately.
Example: HTTP response time: 95th percentile = 200ms, 99th percentile = 500ms
2. Traffic
The demand on your system, measured in requests per second or similar metrics.
Example: HTTP requests/sec = 1,500 req/s
3. Errors
The rate of requests that fail, either explicitly or implicitly.
Example: HTTP 5xx errors = 0.1% (1 in 1000 requests)
4. Saturation
How "full" your service is, measuring the utilization of constrained resources.
Example: CPU usage = 75%, Memory usage = 60%, Disk I/O = 40%
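To make the four signals concrete, here is a minimal Python sketch that derives all four from a window of raw request data. The request records, CPU figures, and the crude index-based percentile are made up for illustration; production systems would compute these from real telemetry.

```python
# Hypothetical request records: (duration_ms, http_status) observed in a
# one-second window. All numbers are invented for this example.
requests = [(120, 200), (180, 200), (95, 200), (450, 500), (210, 200),
            (160, 200), (300, 200), (140, 200), (510, 500), (175, 200)]

durations = sorted(d for d, _ in requests)

# Latency: a crude index-based p95 (fine for a sketch; real systems use
# histograms or proper interpolation)
p95 = durations[int(0.95 * len(durations)) - 1]

# Traffic: requests observed in the window
traffic = len(requests)

# Errors: fraction of responses with a 5xx status
error_rate = sum(1 for _, s in requests if s >= 500) / len(requests)

# Saturation: utilization of a constrained resource (made-up CPU numbers)
cpu_busy_seconds, window_seconds = 0.75, 1.0
saturation = cpu_busy_seconds / window_seconds

print(p95, traffic, error_rate, saturation)
```
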
Monitoring vs. Logging vs. Tracing
| Type | Purpose | Tools | Data Format |
|---|---|---|---|
| Monitoring | Track metrics over time, alerting | Prometheus, Grafana | Time-series numerical data |
| Logging | Record discrete events, debugging | ELK Stack, Loki, Fluentd | Text logs, structured logs (JSON) |
| Tracing | Track request flow across services | Jaeger, Zipkin, OpenTelemetry | Distributed traces, spans |
Monitoring Architecture Overview
Typical Kubernetes Monitoring Stack
Data Collection Layer:
- Metrics Server (basic resource metrics for kubectl top)
- Prometheus (comprehensive metrics collection)
- Node Exporter (node-level metrics)
- kube-state-metrics (Kubernetes object metrics)
Storage Layer:
- Prometheus TSDB (time-series database)
Visualization Layer:
- Grafana (dashboards and graphs)
Alerting Layer:
- Alertmanager (alert routing and notifications)
Each layer feeds the next: exporters expose metrics, Prometheus scrapes and stores them, Grafana queries Prometheus for dashboards, and Alertmanager handles notifications.
Lesson 2: Prometheus Metrics Collection
Prometheus is the de facto standard for monitoring Kubernetes clusters. It is an open-source monitoring system built around a time-series database that collects, stores, and queries metrics.
What is Prometheus?
- Pull-based Model: Prometheus scrapes metrics from targets at regular intervals
- Multi-dimensional Data: Metrics are identified by name and key-value pairs (labels)
- PromQL: Powerful query language for extracting insights
- Service Discovery: Automatically discovers targets in Kubernetes
- No External Dependencies: Single binary, local storage
- Alerting: Built-in alerting with Alertmanager
Installing Prometheus on Kubernetes
# Install using Helm (recommended)
# Add Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create monitoring namespace
kubectl create namespace monitoring
# Install Prometheus stack (includes Grafana, Alertmanager, exporters)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
# Check installation
kubectl get pods -n monitoring
# NAME READY STATUS
# prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running
# prometheus-grafana-xxxxx 3/3 Running
# prometheus-kube-state-metrics-xxxxx 1/1 Running
# prometheus-prometheus-node-exporter-xxxxx 1/1 Running
How Prometheus Collects Metrics
Step 1: Service Discovery
Prometheus discovers targets to scrape using Kubernetes API. It finds pods, services, and nodes based on configured service discovery rules.
Step 2: Scraping
At regular intervals (commonly 15-30 seconds, depending on the configured scrape_interval), Prometheus makes an HTTP request to each target's /metrics endpoint.
Step 3: Parsing
Prometheus parses the metrics in text-based Prometheus format and applies configured relabeling rules.
Step 4: Storage
Metrics are stored in Prometheus's time-series database (TSDB) with compression and efficient indexing.
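The scrape-and-parse steps above can be sketched in a few lines of Python. A real scrape would HTTP GET each target's /metrics endpoint; here we parse a sample payload directly. This simplified parser ignores edge cases (escaped quotes, commas inside label values) that the real exposition format allows.

```python
# Sample payload in the Prometheus text exposition format
sample = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 45678
http_requests_total{method="GET",status="500"} 12
node_memory_MemAvailable_bytes 2147483648
"""

def parse_metrics(text):
    """Parse metric lines into (name, labels, value) tuples (simplified)."""
    metrics = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            name, label_str = name_part.split("{", 1)
            labels = dict(kv.split("=", 1)
                          for kv in label_str.rstrip("}").split(","))
            labels = {k: v.strip('"') for k, v in labels.items()}
        else:
            name, labels = name_part, {}
        metrics.append((name, labels, float(value)))
    return metrics

for metric in parse_metrics(sample):
    print(metric)
```
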
Understanding Prometheus Metrics
# Example metrics from /metrics endpoint
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 157890.45
node_cpu_seconds_total{cpu="0",mode="system"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 23456.78
# HELP node_memory_MemAvailable_bytes Memory available in bytes
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2147483648
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 45678
http_requests_total{method="POST",endpoint="/api/users",status="201"} 1234
http_requests_total{method="GET",endpoint="/api/users",status="500"} 12
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Cumulative value that only increases (or resets to zero) | Total HTTP requests, total errors |
| Gauge | Value that can go up and down | Current memory usage, number of pods |
| Histogram | Samples observations and counts them in buckets | Request duration buckets |
| Summary | Similar to histogram, provides quantiles | Request duration quantiles (p50, p95, p99) |
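The counter-vs-gauge distinction in the table is the most important one to internalize. These toy classes (not the real prometheus_client library, which provides production versions of all four types) make the semantics explicit:

```python
class Counter:
    """Cumulative; only increases, or resets to zero on process restart."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Point-in-time value; free to move up and down."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

requests_total = Counter()
for _ in range(3):
    requests_total.inc()   # e.g. one increment per handled request

pods_running = Gauge()
pods_running.set(12)       # e.g. current pod count
pods_running.set(9)        # a gauge can decrease, unlike a counter

print(requests_total.value, pods_running.value)
```

Histograms and Summaries build on these primitives by bucketing or summarizing observations, which is what enables queries like histogram_quantile shown below.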
PromQL Basics
# Basic query: Get current CPU usage
node_cpu_seconds_total
# Filter by label
node_cpu_seconds_total{mode="idle"}
# Rate of change (for counters)
rate(http_requests_total[5m])
# Aggregate by label
sum(rate(http_requests_total[5m])) by (status)
# Calculate error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Get 95th percentile latency
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/
node_memory_MemTotal_bytes * 100
Key Metrics to Monitor
Node Metrics (from node-exporter)
- node_cpu_seconds_total - CPU usage by mode
- node_memory_MemAvailable_bytes - Available memory
- node_disk_io_time_seconds_total - Disk I/O time
- node_network_receive_bytes_total - Network traffic
- node_filesystem_avail_bytes - Available disk space
Kubernetes Metrics (from kube-state-metrics)
- kube_pod_status_phase - Pod phase (Running, Pending, Failed)
- kube_pod_container_status_restarts_total - Container restarts
- kube_deployment_status_replicas - Deployment replica count
- kube_node_status_condition - Node conditions (Ready, MemoryPressure)
- kube_persistentvolumeclaim_status_phase - PVC status
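The error-rate expression from the PromQL examples above divides the rate of 5xx requests by the rate of all requests. The same arithmetic can be sketched in plain Python, using made-up per-second rates keyed by the "status" label:

```python
import re

# Hypothetical per-second request rates by status label, as
# rate(http_requests_total[5m]) might return them
rates = {"200": 1450.0, "201": 40.0, "404": 8.5, "500": 1.2, "503": 0.3}

# Equivalent of:
#   sum(rate(http_requests_total{status=~"5.."}[5m]))
#   /
#   sum(rate(http_requests_total[5m]))
five_xx = sum(v for s, v in rates.items() if re.fullmatch(r"5..", s))
total = sum(rates.values())
error_rate = five_xx / total

print(round(error_rate, 4))
```
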
Lesson 3: Grafana Visualization
Grafana is the leading open-source platform for monitoring and observability. It transforms raw metrics into beautiful, actionable dashboards.
What is Grafana?
- Multiple Data Sources: Prometheus, Loki, Elasticsearch, MySQL, and more
- Rich Visualizations: Graphs, heatmaps, histograms, tables, gauges
- Dashboard Management: Create, share, and organize dashboards
- Templating: Dynamic dashboards with variables
- Alerting: Visual alerts and notifications
- Community Dashboards: Import ready-made dashboards from grafana.com
Accessing Grafana
# If installed with kube-prometheus-stack, get the password
kubectl get secret -n monitoring prometheus-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode ; echo
# Port forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open browser to http://localhost:3000
# Username: admin
# Password: (from command above)
# Or expose via LoadBalancer (cloud environments)
kubectl patch svc prometheus-grafana -n monitoring \
-p '{"spec": {"type": "LoadBalancer"}}'
Creating Your First Dashboard
Step 1: Add Data Source
Navigate to Configuration → Data Sources → Add data source → Select Prometheus
URL: http://prometheus-kube-prometheus-prometheus.monitoring:9090
Click "Save & Test" to verify connection
Step 2: Create New Dashboard
Click the "+" icon → Create → Dashboard → Add new panel
Step 3: Define Metric Query
In the query editor, enter your PromQL query. Example:
rate(container_cpu_usage_seconds_total[5m])
Step 4: Customize Visualization
Choose visualization type (Time series, Gauge, Bar chart, etc.)
Set units (e.g., bytes, percent, seconds)
Configure thresholds and colors
Step 5: Save Panel
Click "Apply" to add panel to dashboard
Click "Save" icon to save the entire dashboard
Common Dashboard Panels
CPU Usage Panel
Query:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
by (pod, namespace) * 100
Visualization: Time series graph
Unit: Percent (0-100)
Legend: {{namespace}}/{{pod}}
Memory Usage Panel
Query:
sum(container_memory_working_set_bytes{container!=""})
by (pod, namespace) / 1024 / 1024
Visualization: Time series graph
Unit: MiB (Mebibytes)
Note: Dividing by 1024 twice converts bytes to mebibytes for better readability
Pod Status Panel
Query:
sum(kube_pod_status_phase{phase=~"Running|Pending|Failed"})
by (phase)
Visualization: Stat or Gauge
Thresholds: Green (Running), Yellow (Pending), Red (Failed)
Using Public Dashboards
# Popular Kubernetes Dashboards on grafana.com:
# Kubernetes Cluster Monitoring (via Prometheus)
Dashboard ID: 315
https://grafana.com/grafana/dashboards/315
# Kubernetes / Compute Resources / Cluster
Dashboard ID: 7249
https://grafana.com/grafana/dashboards/7249
# Node Exporter Full
Dashboard ID: 1860
https://grafana.com/grafana/dashboards/1860
# Kubernetes / Pods
Dashboard ID: 6417
https://grafana.com/grafana/dashboards/6417
Importing a Public Dashboard
Method 1: Import by ID
1. Click "+" → Import
2. Enter Dashboard ID (e.g., 315)
3. Click "Load"
4. Select Prometheus data source
5. Click "Import"
Method 2: Import JSON
1. Download JSON from grafana.com
2. Click "+" → Import
3. Upload JSON file or paste JSON
4. Select Prometheus data source
5. Click "Import"
Dashboard Best Practices
- Organize by Rows: Group related panels into rows for better organization
- Use Variables: Create dashboard variables for namespace, pod name, etc. to make dashboards reusable
- Set Appropriate Time Ranges: Default to last 6 hours or 24 hours for operational dashboards
- Add Descriptions: Document what each panel shows for team members
- Use Consistent Units: Stick to standard units (MiB/GiB for memory, % for CPU)
- Avoid Over-crowding: 8-12 panels per dashboard is optimal
- Move Panels Easily: Drag and drop panels to reorganize your dashboard layout
Lesson 4: Best Practices & Alerting
Effective monitoring requires more than just collecting metrics. You need actionable alerts, proper retention policies, and standardized practices.
Essential Dashboards for Kubernetes
1. Cluster Overview Dashboard
Purpose: High-level cluster health at a glance
Key Metrics:
- Total nodes and their status (Ready/NotReady)
- Total pods by phase (Running/Pending/Failed)
- Cluster-wide CPU and memory utilization
- API server request rate and latency
- etcd health and performance
2. Node Dashboard
Purpose: Individual node resource monitoring
Key Metrics:
- CPU usage per node
- Memory usage per node
- Disk I/O and space
- Network traffic (in/out)
- System load and processes
3. Workload Dashboard
Purpose: Application and pod performance
Key Metrics:
- Pod CPU and memory usage by namespace
- Container restart counts
- Deployment replica status
- Pod network traffic
- Pod state transitions
Introduction to Alerting
- Prometheus evaluates alerting rules at regular intervals
- When conditions are met, alert enters "Pending" state
- After configured duration, alert moves to "Firing" state
- Prometheus sends alert to Alertmanager
- Alertmanager groups, deduplicates, and routes alerts
- Notifications sent via configured channels (email, Slack, PagerDuty)
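The pending-to-firing transition in the flow above is what filters out brief spikes: an alert only fires after its condition has held for the rule's `for` duration. A small Python sketch of that state machine (evaluation interval and duration mirror the `interval: 30s` and `for: 5m` values used in the rules below):

```python
FOR_DURATION = 300  # seconds, as in `for: 5m`

def evaluate(samples, interval=30):
    """samples: condition result (True/False) at each evaluation tick.
    Returns the alert state after each tick."""
    state, pending_since, history = "inactive", None, []
    for i, cond in enumerate(samples):
        now = i * interval
        if not cond:
            # Condition cleared: the alert resets entirely
            state, pending_since = "inactive", None
        elif state == "inactive":
            # Condition just became true: start the `for` clock
            state, pending_since = "pending", now
        elif state == "pending" and now - pending_since >= FOR_DURATION:
            # Condition held long enough: notify Alertmanager
            state = "firing"
        history.append(state)
    return history

# Condition true for 12 consecutive 30s evaluations (6 minutes):
# pending for the first 5 minutes, then firing
print(evaluate([True] * 12))
```

Note how a single False sample resets the clock, so a condition that flaps every few minutes never reaches "firing".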
Common Alert Rules
# Alert Rules YAML Configuration
# File: prometheus-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: monitoring
data:
  alerts.yml: |
    groups:
      - name: kubernetes-nodes
        interval: 30s
        rules:
          # Alert when node is down
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "Node has been unreachable for more than 5 minutes"
          # Alert on high CPU usage
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for more than 10 minutes"
          # Alert on high memory usage
          - alert: HighMemoryUsage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85%"
      - name: kubernetes-pods
        interval: 30s
        rules:
          # Alert when pods are crashing
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
              description: "Pod has restarted {{ $value }} times in the last 15 minutes"
          # Alert when pods are pending too long
          - alert: PodPendingTooLong
            expr: kube_pod_status_phase{phase="Pending"} == 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} stuck in Pending"
              description: "Pod has been in Pending state for more than 15 minutes"
          # Alert on deployment replica mismatch
          - alert: DeploymentReplicasMismatch
            expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch"
              description: "Desired replicas != available replicas for more than 10 minutes"
Alert Severity Levels
| Severity | When to Use | Response Time | Notification |
|---|---|---|---|
| Critical | Service is down, data loss imminent | Immediate (wake up on-call) | PagerDuty, Phone, SMS |
| Warning | Service degraded, issue developing | Within hours (during business hours) | Slack, Email |
| Info | Informational, no action required yet | When convenient | Email, Dashboard |
Monitoring Best Practices
Before monitoring, define what "healthy" means for your services:
- Availability: 99.9% uptime (43 minutes downtime/month)
- Latency: 95% of requests under 200ms
- Error Rate: Less than 0.1% errors
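An availability SLO translates directly into a downtime budget. The helper below (a hypothetical name, just for illustration) reproduces the 99.9% figure quoted above: 0.1% of a 30-day month's 43,200 minutes is about 43 minutes.

```python
def downtime_budget_minutes(availability, days=30):
    """Monthly downtime allowed by an availability SLO, in minutes."""
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return (1 - availability) * total_minutes

print(round(downtime_budget_minutes(0.999), 1))   # three nines
print(round(downtime_budget_minutes(0.9999), 1))  # four nines
```

Each extra "nine" shrinks the budget tenfold, which is why SLO targets should be chosen deliberately rather than defaulting to the highest number.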
Alert when users are impacted, not on every component failure:
- Good: "API response time > 1s for 5 minutes"
- Bad: "Redis CPU usage > 70%"
Keep alert hygiene in mind:
- Every alert should be actionable
- Use appropriate thresholds and durations
- Group similar alerts
- Implement maintenance windows
- Review and tune alerts regularly
Plan data retention in tiers:
- Short-term (15 days): High-resolution data in Prometheus
- Long-term (1+ year): Downsampled data in external storage (Thanos, Cortex)
- Balance storage costs vs. query needs
Finally, monitor the monitoring stack itself:
- Alert when Prometheus is down
- Monitor Prometheus scrape failures
- Track Prometheus resource usage
- Ensure Grafana is accessible
Practical Monitoring Workflow
Day 1: Setup Foundation
Install Prometheus and Grafana, verify metrics collection, import community dashboards
Week 1: Create Custom Dashboards
Build dashboards based on public templates, customize for your applications, organize by team/service
Week 2-4: Establish Baselines
Observe normal behavior, identify patterns, determine appropriate thresholds
Month 2: Implement Alerting
Create alert rules, configure notification channels, test alert flow, document runbooks
Ongoing: Refine and Optimize
Tune alert thresholds, reduce false positives, add new metrics as needed, review and update quarterly
Final Quiz
Test your knowledge of Kubernetes cluster monitoring!
Question 1: What are the Four Golden Signals in monitoring?
Question 2: How does Prometheus collect metrics?
Question 3: What metric type should be used for total HTTP requests?
Question 4: What is a major benefit of using public Grafana dashboards?
Question 5: How should you set appropriate units in Grafana panels?
Question 6: What's the best practice for alert severity?
Question 7: What is PromQL used for?
Question 8: Why should you alert on symptoms rather than causes?
These monitoring principles will help you build effective observability for your Kubernetes clusters. Remember: monitoring is a vast topic worthy of its own course, but this provides a solid foundation to get started!