Kubernetes Cluster Monitoring

Observability, Metrics Collection & Visualization with Prometheus & Grafana

Lesson 1: Introduction to Monitoring

Monitoring a Kubernetes cluster is essential for maintaining reliability, performance, and security. This vast topic deserves its own dedicated course, but this crash course provides a solid foundation for beginners.

Why Monitor Kubernetes?

Without Monitoring, You're Blind: In production environments, you cannot rely on manual checks or hope that everything works. Monitoring provides visibility into cluster health, resource usage, and application performance before problems impact users.
Key Benefits of Monitoring:
  • Early Problem Detection: Catch issues before they become critical failures
  • Resource Optimization: Identify underutilized or overloaded resources
  • Capacity Planning: Make data-driven decisions about scaling
  • Performance Troubleshooting: Diagnose bottlenecks and slowdowns
  • Cost Management: Track resource consumption and optimize spending
  • SLA Compliance: Ensure your applications meet service level agreements

What to Monitor in Kubernetes

Infrastructure Metrics

  • Node-level: CPU, memory, disk, network
  • Cluster-level: Total capacity and usage
  • etcd: Database health and performance
  • Control plane: API server, scheduler, controller manager

Application Metrics

  • Pod metrics: Container CPU, memory, restarts
  • Deployment health: Replica status, rollout progress
  • Service performance: Request rates, latency, errors
  • Custom metrics: Application-specific KPIs

The Four Golden Signals

Google's SRE Golden Signals: These four metrics provide a comprehensive view of service health:

1. Latency

The time it takes to service a request. Track both successful and failed requests separately.

Example: HTTP response time: 95th percentile = 200ms, 99th percentile = 500ms

2. Traffic

The demand on your system, measured in requests per second or similar metrics.

Example: HTTP requests/sec = 1,500 req/s

3. Errors

The rate of requests that fail, either explicitly or implicitly.

Example: HTTP 5xx errors = 0.1% (1 in 1000 requests)

4. Saturation

How "full" your service is, measuring the utilization of constrained resources.

Example: CPU usage = 75%, Memory usage = 60%, Disk I/O = 40%
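To make the arithmetic concrete, here is a small Python sketch that derives all four signals from a hypothetical, hard-coded request log (every number in it is invented for illustration):

```python
# Computing the four golden signals from a hypothetical request log.
# Each record: (duration_ms, http_status).
requests = [
    (120, 200), (180, 200), (95, 200), (210, 200), (640, 500),
    (150, 200), (300, 200), (88, 200), (175, 200), (220, 200),
]
window_seconds = 10  # pretend these arrived over a 10-second window

# Latency: percentile over successful requests (tracked separately from failures)
ok_durations = sorted(d for d, s in requests if s < 500)
p95 = ok_durations[int(0.95 * (len(ok_durations) - 1))]

# Traffic: requests per second
traffic = len(requests) / window_seconds

# Errors: fraction of 5xx responses
error_rate = sum(1 for _, s in requests if s >= 500) / len(requests)

# Saturation: utilization of a constrained resource (hypothetical values)
cpu_used, cpu_capacity = 3.0, 4.0
saturation = cpu_used / cpu_capacity

print(p95, traffic, error_rate, saturation)  # 220 1.0 0.1 0.75
```

In production these numbers come from PromQL queries rather than in-process lists, but the underlying arithmetic is the same.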

Monitoring vs. Logging vs. Tracing

  • Monitoring: tracks numeric metrics over time and drives alerting. Tools: Prometheus, Grafana. Data: time-series numerical data.
  • Logging: records discrete events for debugging. Tools: ELK Stack, Loki, Fluentd. Data: text logs, structured logs (JSON).
  • Tracing: follows a request's flow across services. Tools: Jaeger, Zipkin, OpenTelemetry. Data: distributed traces, spans.

Note: Monitoring and logging within Kubernetes is a vast topic worthy of its own separate course. This crash course focuses on metrics monitoring with Prometheus and Grafana as a starting point.

Monitoring Architecture Overview

Typical Kubernetes Monitoring Stack

Data Collection Layer:

  • Metrics Server (basic resource metrics for kubectl top)
  • Prometheus (comprehensive metrics collection)
  • Node Exporter (node-level metrics)
  • kube-state-metrics (Kubernetes object metrics)

Storage Layer:

  • Prometheus TSDB (time-series database)

Visualization Layer:

  • Grafana (dashboards and graphs)

Alerting Layer:

  • Alertmanager (alert routing and notifications)

Lesson 2: Prometheus Metrics Collection

Prometheus is the de facto standard for monitoring Kubernetes clusters. At its core is a time-series database, paired with the machinery to collect, store, and query metrics.

What is Prometheus?

Prometheus: An open-source monitoring and alerting toolkit originally built at SoundCloud. It's now a Cloud Native Computing Foundation (CNCF) graduated project, making it the standard for Kubernetes monitoring.
Key Features:
  • Pull-based Model: Prometheus scrapes metrics from targets at regular intervals
  • Multi-dimensional Data: Metrics are identified by name and key-value pairs (labels)
  • PromQL: Powerful query language for extracting insights
  • Service Discovery: Automatically discovers targets in Kubernetes
  • No External Dependencies: Single binary, local storage
  • Alerting: Built-in alerting with Alertmanager

Installing Prometheus on Kubernetes

# Install using Helm (recommended)

# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the monitoring namespace
kubectl create namespace monitoring

# Install the Prometheus stack (includes Grafana, Alertmanager, exporters)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Check the installation
kubectl get pods -n monitoring
# NAME                                                 READY  STATUS
# prometheus-prometheus-kube-prometheus-prometheus-0   2/2    Running
# prometheus-grafana-xxxxx                             3/3    Running
# prometheus-kube-state-metrics-xxxxx                  1/1    Running
# prometheus-prometheus-node-exporter-xxxxx            1/1    Running

How Prometheus Collects Metrics

Step 1: Service Discovery

Prometheus discovers targets to scrape using Kubernetes API. It finds pods, services, and nodes based on configured service discovery rules.
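As a rough illustration, a scrape configuration using Kubernetes service discovery can look like the following. The job name and annotation convention are illustrative (though widely used); the kube-prometheus-stack generates and manages its own configuration via ServiceMonitor/PodMonitor resources, so you would rarely write this by hand:

```yaml
# Illustrative Prometheus scrape config using Kubernetes service discovery.
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod          # discover every pod via the Kubernetes API
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```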

Step 2: Scraping

At a configured interval (Prometheus's global default is 1m; the kube-prometheus-stack sets 30s), Prometheus makes an HTTP GET request to each target's /metrics endpoint.

Step 3: Parsing

Prometheus parses the metrics in text-based Prometheus format and applies configured relabeling rules.

Step 4: Storage

Metrics are stored in Prometheus's time-series database (TSDB) with compression and efficient indexing.

Understanding Prometheus Metrics

Metric Format: Prometheus metrics follow a simple text format with metric name, labels, and value.
# Example metrics from a /metrics endpoint

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 157890.45
node_cpu_seconds_total{cpu="0",mode="system"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 23456.78

# HELP node_memory_MemAvailable_bytes Memory available in bytes
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2147483648

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 45678
http_requests_total{method="POST",endpoint="/api/users",status="201"} 1234
http_requests_total{method="GET",endpoint="/api/users",status="500"} 12
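For intuition, a minimal parser for lines in this format can be sketched in a few lines of Python. Real client libraries handle escaping, timestamps, and every metric type; this sketch only shows the basic name/labels/value shape:

```python
import re

# Minimal sketch of parsing the Prometheus text exposition format.
sample = '''\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 45678
http_requests_total{method="GET",status="500"} 12
'''

LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+([0-9.eE+-]+)$')

def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        m = LINE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ""))
        out.append((name, labels, float(value)))
    return out

metrics = parse_metrics(sample)
print(metrics)
```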

Metric Types

  • Counter: a cumulative value that only increases (or resets to zero on restart). Examples: total HTTP requests, total errors.
  • Gauge: a value that can go up and down. Examples: current memory usage, number of pods.
  • Histogram: samples observations and counts them in configurable buckets. Example: request duration buckets.
  • Summary: similar to a histogram, but exposes precomputed quantiles. Example: request duration quantiles (p50, p95, p99).
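The counter/gauge distinction is easy to demonstrate with toy classes. In real code you would use a client library such as prometheus_client; these classes exist only to illustrate the rules above:

```python
# Toy implementations showing counter vs. gauge semantics.

class Counter:
    """Cumulative: may only increase (it resets to zero on process restart)."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Point-in-time: may go up and down freely."""
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        self.value = v

requests_total = Counter()
requests_total.inc()
requests_total.inc(4)

pods_running = Gauge()
pods_running.set(12)
pods_running.set(9)   # gauges can decrease; counters cannot

print(requests_total.value, pods_running.value)  # 5.0 9
```

This is why you query counters with rate() (the raw total is rarely interesting) but read gauges directly.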

PromQL Basics

PromQL (Prometheus Query Language): A functional query language for selecting and aggregating time-series data.
# Basic query: return all CPU time series
node_cpu_seconds_total

# Filter by label
node_cpu_seconds_total{mode="idle"}

# Rate of change (for counters)
rate(http_requests_total[5m])

# Aggregate by label
sum(rate(http_requests_total[5m])) by (status)

# Calculate error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Get 95th percentile latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100
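rate() is worth demystifying: for a counter, it is essentially the increase between samples divided by the elapsed time. The real function also handles counter resets and extrapolates to the window edges; this hypothetical two-sample sketch shows only the core idea:

```python
# What rate() does under the hood, roughly: per-second increase of a counter.
samples = [
    (1700000000, 45000.0),   # (unix_timestamp, counter_value)
    (1700000300, 45900.0),   # 300 seconds later
]

(t0, v0), (t1, v1) = samples
per_second_rate = (v1 - v0) / (t1 - t0)   # 900 increases over 300s
print(per_second_rate)  # 3.0 requests/sec
```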

Key Metrics to Monitor

Node Metrics (from node-exporter)

  • node_cpu_seconds_total - CPU usage by mode
  • node_memory_MemAvailable_bytes - Available memory
  • node_disk_io_time_seconds_total - Disk I/O time
  • node_network_receive_bytes_total - Network traffic
  • node_filesystem_avail_bytes - Available disk space

Kubernetes Metrics (from kube-state-metrics)

  • kube_pod_status_phase - Pod phase (Running, Pending, Failed)
  • kube_pod_container_status_restarts_total - Container restarts
  • kube_deployment_status_replicas - Deployment replica count
  • kube_node_status_condition - Node conditions (Ready, MemoryPressure)
  • kube_persistentvolumeclaim_status_phase - PVC status

Lesson 3: Grafana Visualization

Grafana is the leading open-source platform for monitoring and observability. It transforms raw metrics into beautiful, actionable dashboards.

What is Grafana?

Grafana: An open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts when connected to supported data sources like Prometheus.
Key Features:
  • Multiple Data Sources: Prometheus, Loki, Elasticsearch, MySQL, and more
  • Rich Visualizations: Graphs, heatmaps, histograms, tables, gauges
  • Dashboard Management: Create, share, and organize dashboards
  • Templating: Dynamic dashboards with variables
  • Alerting: Visual alerts and notifications
  • Community Dashboards: Import ready-made dashboards from grafana.com

Accessing Grafana

# If installed with kube-prometheus-stack, get the admin password
kubectl get secret -n monitoring prometheus-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

# Port-forward to access Grafana locally
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Open a browser to http://localhost:3000
# Username: admin
# Password: (from the command above)

# Or expose via LoadBalancer (cloud environments)
kubectl patch svc prometheus-grafana -n monitoring \
  -p '{"spec": {"type": "LoadBalancer"}}'

Creating Your First Dashboard

Step 1: Add Data Source

Navigate to Configuration → Data Sources → Add data source → Select Prometheus

URL: http://prometheus-kube-prometheus-prometheus.monitoring:9090

Click "Save & Test" to verify connection

Step 2: Create New Dashboard

Click the "+" icon → Create → Dashboard → Add new panel

Step 3: Define Metric Query

In the query editor, enter your PromQL query. Example:

rate(container_cpu_usage_seconds_total[5m])

Step 4: Customize Visualization

Choose visualization type (Time series, Gauge, Bar chart, etc.)

Set units (e.g., bytes, percent, seconds)

Configure thresholds and colors

Step 5: Save Panel

Click "Apply" to add panel to dashboard

Click "Save" icon to save the entire dashboard

Common Dashboard Panels

CPU Usage Panel

Query:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) * 100

Visualization: Time series graph

Unit: Percent (0-100)

Legend: {{namespace}}/{{pod}}

Memory Usage Panel

Query:

sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace) / 1024 / 1024

Visualization: Time series graph

Unit: MiB (Mebibytes)

Note: Dividing the byte value by 1024 twice converts it to mebibytes (MiB) for readability

Pod Status Panel

Query:

sum(kube_pod_status_phase{phase=~"Running|Pending|Failed"}) by (phase)

Visualization: Stat or Gauge

Thresholds: Green (Running), Yellow (Pending), Red (Failed)

Using Public Dashboards

Grafana Dashboard Library: The Grafana community has created thousands of ready-to-use dashboards that you can import and customize.
# Popular Kubernetes dashboards on grafana.com:

# Kubernetes Cluster Monitoring (via Prometheus)
#   Dashboard ID: 315
#   https://grafana.com/grafana/dashboards/315

# Kubernetes / Compute Resources / Cluster
#   Dashboard ID: 7249
#   https://grafana.com/grafana/dashboards/7249

# Node Exporter Full
#   Dashboard ID: 1860
#   https://grafana.com/grafana/dashboards/1860

# Kubernetes / Pods
#   Dashboard ID: 6417
#   https://grafana.com/grafana/dashboards/6417

Importing a Public Dashboard

Method 1: Import by ID

1. Click "+" → Import

2. Enter Dashboard ID (e.g., 315)

3. Click "Load"

4. Select Prometheus data source

5. Click "Import"

Method 2: Import JSON

1. Download JSON from grafana.com

2. Click "+" → Import

3. Upload JSON file or paste JSON

4. Select Prometheus data source

5. Click "Import"

Customizing Imported Dashboards: After importing, you can customize panels, add new metrics, adjust time ranges, and reorganize the layout to fit your specific needs. These customizations can be saved to create your own version.

Dashboard Best Practices

Practical Tips:
  • Organize by Rows: Group related panels into rows for better organization
  • Use Variables: Create dashboard variables for namespace, pod name, etc. to make dashboards reusable
  • Set Appropriate Time Ranges: Default to last 6 hours or 24 hours for operational dashboards
  • Add Descriptions: Document what each panel shows for team members
  • Use Consistent Units: Stick to standard units (MiB/GiB for memory, % for CPU)
  • Avoid Over-crowding: 8-12 panels per dashboard is optimal
  • Move Panels Easily: Drag and drop panels to reorganize your dashboard layout

Lesson 4: Best Practices & Alerting

Effective monitoring requires more than just collecting metrics. You need actionable alerts, proper retention policies, and standardized practices.

Essential Dashboards for Kubernetes

1. Cluster Overview Dashboard

Purpose: High-level cluster health at a glance

Key Metrics:

  • Total nodes and their status (Ready/NotReady)
  • Total pods by phase (Running/Pending/Failed)
  • Cluster-wide CPU and memory utilization
  • API server request rate and latency
  • etcd health and performance

2. Node Dashboard

Purpose: Individual node resource monitoring

Key Metrics:

  • CPU usage per node
  • Memory usage per node
  • Disk I/O and space
  • Network traffic (in/out)
  • System load and processes

3. Workload Dashboard

Purpose: Application and pod performance

Key Metrics:

  • Pod CPU and memory usage by namespace
  • Container restart counts
  • Deployment replica status
  • Pod network traffic
  • Pod state transitions

Introduction to Alerting

Alerts are Critical: Dashboards are great for investigation, but you can't watch them 24/7. Alerts proactively notify you when something goes wrong.
Alerting Flow:
  1. Prometheus evaluates alerting rules at regular intervals
  2. When conditions are met, alert enters "Pending" state
  3. After configured duration, alert moves to "Firing" state
  4. Prometheus sends alert to Alertmanager
  5. Alertmanager groups, deduplicates, and routes alerts
  6. Notifications sent via configured channels (email, Slack, PagerDuty)
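Steps 2 and 3 of the flow above can be sketched as a tiny state machine: an alert fires only once its condition has held continuously for the configured "for" duration. The helper function and timestamps are invented for illustration:

```python
# Toy model of the Pending -> Firing transition.

def alert_state(breach_times, now, for_duration):
    """breach_times: sorted timestamps at which the condition was true on
    every evaluation since it first became true (empty list = inactive)."""
    if not breach_times:
        return "inactive"
    held_for = now - breach_times[0]
    return "firing" if held_for >= for_duration else "pending"

# Condition first breached at t=100; the rule's "for" is 300 seconds.
print(alert_state([], now=100, for_duration=300))                  # inactive
print(alert_state([100, 130, 160], now=200, for_duration=300))     # pending
print(alert_state([100, 130, 160, 400], now=400, for_duration=300))  # firing
```

Once "firing", Prometheus hands the alert to Alertmanager, which owns grouping, deduplication, silencing, and routing.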

Common Alert Rules

# Alert Rules YAML Configuration
# File: prometheus-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: monitoring
data:
  alerts.yml: |
    groups:
      - name: kubernetes-nodes
        interval: 30s
        rules:
          # Alert when a node is down
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "Node has been unreachable for more than 5 minutes"
          # Alert on high CPU usage
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for more than 10 minutes"
          # Alert on high memory usage
          - alert: HighMemoryUsage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85%"
      - name: kubernetes-pods
        interval: 30s
        rules:
          # Alert when containers are crash looping
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
              description: "Container restart rate has been above zero for the last 15 minutes (current rate: {{ $value }}/s)"
          # Alert when pods stay pending too long
          - alert: PodPendingTooLong
            expr: kube_pod_status_phase{phase="Pending"} == 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} stuck in Pending"
              description: "Pod has been in Pending state for more than 15 minutes"
          # Alert on deployment replica mismatch
          - alert: DeploymentReplicasMismatch
            expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch"
              description: "Desired replicas != available replicas for more than 10 minutes"

Alert Severity Levels

  • Critical: service is down or data loss is imminent. Respond immediately (wake the on-call). Notify via PagerDuty, phone, or SMS.
  • Warning: service is degraded or an issue is developing. Respond within hours, during business hours. Notify via Slack or email.
  • Info: informational, no action required yet. Respond when convenient. Notify via email or a dashboard.

Monitoring Best Practices

1. Define Clear SLOs (Service Level Objectives)

Before monitoring, define what "healthy" means for your services:

  • Availability: 99.9% uptime (about 43 minutes of downtime per month)
  • Latency: 95% of requests under 200ms
  • Error Rate: Less than 0.1% errors
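The availability target maps directly to a downtime budget with simple arithmetic; a small helper (hypothetical, for illustration) makes the numbers easy to check:

```python
# Turning an availability SLO into a monthly downtime budget.

def downtime_budget_minutes(slo, days=30):
    """Minutes of allowed downtime per `days` at availability `slo`."""
    return (1 - slo) * days * 24 * 60

print(round(downtime_budget_minutes(0.999), 1))   # 43.2 minutes/month
print(round(downtime_budget_minutes(0.9999), 1))  # 4.3 minutes/month
```

Each extra "nine" shrinks the budget by a factor of ten, which is why SLO targets should reflect genuine user needs rather than aspiration.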
2. Alert on Symptoms, Not Causes

Alert when users are impacted, not on every component failure:

  • Good: "API response time > 1s for 5 minutes"
  • Bad: "Redis CPU usage > 70%"
3. Reduce Alert Fatigue
  • Every alert should be actionable
  • Use appropriate thresholds and durations
  • Group similar alerts
  • Implement maintenance windows
  • Review and tune alerts regularly
4. Plan for Data Retention
  • Short-term (15 days): High-resolution data in Prometheus
  • Long-term (1+ year): Downsampled data in external storage (Thanos, Cortex)
  • Balance storage costs vs. query needs
5. Monitor the Monitoring System
  • Alert when Prometheus is down
  • Monitor Prometheus scrape failures
  • Track Prometheus resource usage
  • Ensure Grafana is accessible

Practical Monitoring Workflow

Day 1: Setup Foundation

Install Prometheus and Grafana, verify metrics collection, import community dashboards

Week 1: Create Custom Dashboards

Build dashboards based on public templates, customize for your applications, organize by team/service

Week 2-4: Establish Baselines

Observe normal behavior, identify patterns, determine appropriate thresholds

Month 2: Implement Alerting

Create alert rules, configure notification channels, test alert flow, document runbooks

Ongoing: Refine and Optimize

Tune alert thresholds, reduce false positives, add new metrics as needed, review and update quarterly

Remember: Monitoring is an iterative process. Start with basic dashboards and alerts, then continuously improve based on real-world experience and incidents. The goal is actionable insights, not just collecting data.

Final Quiz

Test your knowledge of Kubernetes cluster monitoring!

Question 1: What are the Four Golden Signals in monitoring?

a) CPU, Memory, Disk, Network
b) Latency, Traffic, Errors, Saturation
c) Nodes, Pods, Services, Deployments
d) Metrics, Logs, Traces, Alerts

Question 2: How does Prometheus collect metrics?

a) Applications push metrics to Prometheus
b) Prometheus scrapes (pulls) metrics from targets at regular intervals
c) Metrics are stored directly in etcd
d) Grafana collects and sends metrics to Prometheus

Question 3: What metric type should be used for total HTTP requests?

a) Gauge (value can go up and down)
b) Counter (cumulative value that only increases)
c) Histogram (samples in buckets)
d) Summary (provides quantiles)

Question 4: What is a major benefit of using public Grafana dashboards?

a) They are the only way to visualize Prometheus data
b) You can import ready-made, well-tested dashboards and customize them
c) They automatically configure Prometheus
d) They replace the need for custom metrics

Question 5: How should you set appropriate units in Grafana panels?

a) Always use raw bytes regardless of value size
b) Define values in meaningful units (MiB for memory, % for CPU) for readability
c) Units don't matter, only the graph shape
d) Let Grafana choose automatically without customization

Question 6: What's the best practice for alert severity?

a) Mark everything as critical to ensure quick response
b) Use critical only when service is down or data loss imminent; warning for degraded service
c) Never use critical alerts, only warnings
d) Severity doesn't matter as long as notifications are sent

Question 7: What is PromQL used for?

a) Configuring Kubernetes resources
b) Querying and aggregating time-series data in Prometheus
c) Creating Docker containers
d) Deploying Grafana dashboards

Question 8: Why should you alert on symptoms rather than causes?

a) Symptoms are easier to measure
b) Alert when users are actually impacted, not on every component issue that may not affect service
c) Causes don't generate metrics
d) Symptoms require less configuration
Quiz Complete!
All correct answers are option 'b'. These monitoring principles will help you build effective observability for your Kubernetes clusters. Remember: monitoring is a vast topic worthy of its own course, but this provides a solid foundation to get started!