Kubernetes Cluster Monitoring

Observability, Metrics Collection & Visualization with Prometheus & Grafana

Lesson 1: Introduction to Monitoring

Monitoring a Kubernetes cluster is essential for maintaining reliability, performance, and security. This vast topic deserves its own dedicated course, but this crash course provides a solid foundation for beginners.

Why Monitor Kubernetes?

Without Monitoring, You're Blind: In production environments, you cannot rely on manual checks or hope that everything works. Monitoring provides visibility into cluster health, resource usage, and application performance before problems impact users.
Key Benefits of Monitoring:
  • Early Problem Detection: Catch issues before they become critical failures
  • Resource Optimization: Identify underutilized or overloaded resources
  • Capacity Planning: Make data-driven decisions about scaling
  • Performance Troubleshooting: Diagnose bottlenecks and slowdowns
  • Cost Management: Track resource consumption and optimize spending
  • SLA Compliance: Ensure your applications meet service level agreements

What to Monitor in Kubernetes

Infrastructure Metrics

  • Node-level: CPU, memory, disk, network
  • Cluster-level: Total capacity and usage
  • etcd: Database health and performance
  • Control plane: API server, scheduler, controller manager

Application Metrics

  • Pod metrics: Container CPU, memory, restarts
  • Deployment health: Replica status, rollout progress
  • Service performance: Request rates, latency, errors
  • Custom metrics: Application-specific KPIs

The Four Golden Signals

Google's SRE Golden Signals: These four metrics provide a comprehensive view of service health:

1. Latency

The time it takes to service a request. Track both successful and failed requests separately.

Example: HTTP response time: 95th percentile = 200ms, 99th percentile = 500ms

2. Traffic

The demand on your system, measured in requests per second or similar metrics.

Example: HTTP requests/sec = 1,500 req/s

3. Errors

The rate of requests that fail, either explicitly or implicitly.

Example: HTTP 5xx errors = 0.1% (1 in 1000 requests)

4. Saturation

How "full" your service is, measuring the utilization of constrained resources.

Example: CPU usage = 75%, Memory usage = 60%, Disk I/O = 40%
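To make the arithmetic concrete, here is a small Python sketch that derives all four signals from a hypothetical, hard-coded request log (every number in it is invented for illustration):

```python
# Computing the four golden signals from a hypothetical request log.
# Each record: (duration_ms, http_status).
requests = [
    (120, 200), (180, 200), (95, 200), (210, 200), (640, 500),
    (150, 200), (300, 200), (88, 200), (175, 200), (220, 200),
]
window_seconds = 10  # pretend these arrived over a 10-second window

# Latency: percentile over successful requests (tracked separately from failures)
ok_durations = sorted(d for d, s in requests if s < 500)
p95 = ok_durations[int(0.95 * (len(ok_durations) - 1))]

# Traffic: requests per second
traffic = len(requests) / window_seconds

# Errors: fraction of 5xx responses
error_rate = sum(1 for _, s in requests if s >= 500) / len(requests)

# Saturation: utilization of a constrained resource (hypothetical values)
cpu_used, cpu_capacity = 3.0, 4.0
saturation = cpu_used / cpu_capacity

print(p95, traffic, error_rate, saturation)  # 220 1.0 0.1 0.75
```

In production these numbers come from PromQL queries rather than in-process lists, but the underlying arithmetic is the same.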

Monitoring vs. Logging vs. Tracing

  • Monitoring: tracks numeric metrics over time and drives alerting. Tools: Prometheus, Grafana. Data: time-series numerical data.
  • Logging: records discrete events for debugging. Tools: ELK Stack, Loki, Fluentd. Data: text logs, structured logs (JSON).
  • Tracing: follows a request's flow across services. Tools: Jaeger, Zipkin, OpenTelemetry. Data: distributed traces, spans.

Note: Monitoring and logging within Kubernetes is a vast topic worthy of its own separate course. This crash course focuses on metrics monitoring with Prometheus and Grafana as a starting point.

Monitoring Architecture Overview

Typical Kubernetes Monitoring Stack

Data Collection Layer:

  • Metrics Server (basic resource metrics for kubectl top)
  • Prometheus (comprehensive metrics collection)
  • Node Exporter (node-level metrics)
  • kube-state-metrics (Kubernetes object metrics)

Storage Layer:

  • Prometheus TSDB (time-series database)

Visualization Layer:

  • Grafana (dashboards and graphs)

Alerting Layer:

  • Alertmanager (alert routing and notifications)

Lesson 2: Prometheus Metrics Collection

Prometheus is the de facto standard for monitoring Kubernetes clusters. At its core is a time-series database, paired with the machinery to collect, store, and query metrics.

What is Prometheus?

Prometheus: An open-source monitoring and alerting toolkit originally built at SoundCloud. It's now a Cloud Native Computing Foundation (CNCF) graduated project, making it the standard for Kubernetes monitoring.
Key Features:
  • Pull-based Model: Prometheus scrapes metrics from targets at regular intervals
  • Multi-dimensional Data: Metrics are identified by name and key-value pairs (labels)
  • PromQL: Powerful query language for extracting insights
  • Service Discovery: Automatically discovers targets in Kubernetes
  • No External Dependencies: Single binary, local storage
  • Alerting: Built-in alerting with Alertmanager

Installing Prometheus on Kubernetes

# Install using Helm (recommended)

# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create the monitoring namespace
kubectl create namespace monitoring

# Install the Prometheus stack (includes Grafana, Alertmanager, exporters)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Check the installation
kubectl get pods -n monitoring
# NAME                                                 READY  STATUS
# prometheus-prometheus-kube-prometheus-prometheus-0   2/2    Running
# prometheus-grafana-xxxxx                             3/3    Running
# prometheus-kube-state-metrics-xxxxx                  1/1    Running
# prometheus-prometheus-node-exporter-xxxxx            1/1    Running

How Prometheus Collects Metrics

Step 1: Service Discovery

Prometheus discovers targets to scrape using Kubernetes API. It finds pods, services, and nodes based on configured service discovery rules.
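As a rough illustration, a scrape configuration using Kubernetes service discovery can look like the following. The job name and annotation convention are illustrative (though widely used); the kube-prometheus-stack generates and manages its own configuration via ServiceMonitor/PodMonitor resources, so you would rarely write this by hand:

```yaml
# Illustrative Prometheus scrape config using Kubernetes service discovery.
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod          # discover every pod via the Kubernetes API
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```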

Step 2: Scraping

At a configured interval (Prometheus's global default is 1m; the kube-prometheus-stack sets 30s), Prometheus makes an HTTP GET request to each target's /metrics endpoint.

Step 3: Parsing

Prometheus parses the metrics in text-based Prometheus format and applies configured relabeling rules.

Step 4: Storage

Metrics are stored in Prometheus's time-series database (TSDB) with compression and efficient indexing.

Understanding Prometheus Metrics

Metric Format: Prometheus metrics follow a simple text format with metric name, labels, and value.
# Example metrics from a /metrics endpoint

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 157890.45
node_cpu_seconds_total{cpu="0",mode="system"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 23456.78

# HELP node_memory_MemAvailable_bytes Memory available in bytes
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2147483648

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 45678
http_requests_total{method="POST",endpoint="/api/users",status="201"} 1234
http_requests_total{method="GET",endpoint="/api/users",status="500"} 12
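For intuition, a minimal parser for lines in this format can be sketched in a few lines of Python. Real client libraries handle escaping, timestamps, and every metric type; this sketch only shows the basic name/labels/value shape:

```python
import re

# Minimal sketch of parsing the Prometheus text exposition format.
sample = '''\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 45678
http_requests_total{method="GET",status="500"} 12
'''

LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+([0-9.eE+-]+)$')

def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        m = LINE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ""))
        out.append((name, labels, float(value)))
    return out

metrics = parse_metrics(sample)
print(metrics)
```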

Metric Types

  • Counter: a cumulative value that only increases (or resets to zero on restart). Examples: total HTTP requests, total errors.
  • Gauge: a value that can go up and down. Examples: current memory usage, number of pods.
  • Histogram: samples observations and counts them in configurable buckets. Example: request duration buckets.
  • Summary: similar to a histogram, but exposes precomputed quantiles. Example: request duration quantiles (p50, p95, p99).
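The counter/gauge distinction is easy to demonstrate with toy classes. In real code you would use a client library such as prometheus_client; these classes exist only to illustrate the rules above:

```python
# Toy implementations showing counter vs. gauge semantics.

class Counter:
    """Cumulative: may only increase (it resets to zero on process restart)."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Point-in-time: may go up and down freely."""
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        self.value = v

requests_total = Counter()
requests_total.inc()
requests_total.inc(4)

pods_running = Gauge()
pods_running.set(12)
pods_running.set(9)   # gauges can decrease; counters cannot

print(requests_total.value, pods_running.value)  # 5.0 9
```

This is why you query counters with rate() (the raw total is rarely interesting) but read gauges directly.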

PromQL Basics

PromQL (Prometheus Query Language): A functional query language for selecting and aggregating time-series data.
# Basic query: return all CPU time series
node_cpu_seconds_total

# Filter by label
node_cpu_seconds_total{mode="idle"}

# Rate of change (for counters)
rate(http_requests_total[5m])

# Aggregate by label
sum(rate(http_requests_total[5m])) by (status)

# Calculate error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Get 95th percentile latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100
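rate() is worth demystifying: for a counter, it is essentially the increase between samples divided by the elapsed time. The real function also handles counter resets and extrapolates to the window edges; this hypothetical two-sample sketch shows only the core idea:

```python
# What rate() does under the hood, roughly: per-second increase of a counter.
samples = [
    (1700000000, 45000.0),   # (unix_timestamp, counter_value)
    (1700000300, 45900.0),   # 300 seconds later
]

(t0, v0), (t1, v1) = samples
per_second_rate = (v1 - v0) / (t1 - t0)   # 900 increases over 300s
print(per_second_rate)  # 3.0 requests/sec
```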

Key Metrics to Monitor

Node Metrics (from node-exporter)

  • node_cpu_seconds_total - CPU usage by mode
  • node_memory_MemAvailable_bytes - Available memory
  • node_disk_io_time_seconds_total - Disk I/O time
  • node_network_receive_bytes_total - Network traffic
  • node_filesystem_avail_bytes - Available disk space

Kubernetes Metrics (from kube-state-metrics)

  • kube_pod_status_phase - Pod phase (Running, Pending, Failed)
  • kube_pod_container_status_restarts_total - Container restarts
  • kube_deployment_status_replicas - Deployment replica count
  • kube_node_status_condition - Node conditions (Ready, MemoryPressure)
  • kube_persistentvolumeclaim_status_phase - PVC status

Lesson 3: Grafana Visualization

Grafana is the leading open-source platform for monitoring and observability. It transforms raw metrics into beautiful, actionable dashboards.

What is Grafana?

Grafana: An open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts when connected to supported data sources like Prometheus.
Key Features:
  • Multiple Data Sources: Prometheus, Loki, Elasticsearch, MySQL, and more
  • Rich Visualizations: Graphs, heatmaps, histograms, tables, gauges
  • Dashboard Management: Create, share, and organize dashboards
  • Templating: Dynamic dashboards with variables
  • Alerting: Visual alerts and notifications
  • Community Dashboards: Import ready-made dashboards from grafana.com

Accessing Grafana

# If installed with kube-prometheus-stack, get the admin password
kubectl get secret -n monitoring prometheus-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

# Port-forward to access Grafana locally
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Open a browser to http://localhost:3000
# Username: admin
# Password: (from the command above)

# Or expose via LoadBalancer (cloud environments)
kubectl patch svc prometheus-grafana -n monitoring \
  -p '{"spec": {"type": "LoadBalancer"}}'

Creating Your First Dashboard

Step 1: Add Data Source

Navigate to Configuration → Data Sources → Add data source → Select Prometheus

URL: http://prometheus-kube-prometheus-prometheus.monitoring:9090

Click "Save & Test" to verify connection

Step 2: Create New Dashboard

Click the "+" icon → Create → Dashboard → Add new panel

Step 3: Define Metric Query

In the query editor, enter your PromQL query. Example:

rate(container_cpu_usage_seconds_total[5m])

Step 4: Customize Visualization

Choose visualization type (Time series, Gauge, Bar chart, etc.)

Set units (e.g., bytes, percent, seconds)

Configure thresholds and colors

Step 5: Save Panel

Click "Apply" to add panel to dashboard

Click "Save" icon to save the entire dashboard

Common Dashboard Panels

CPU Usage Panel

Query:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) * 100

Visualization: Time series graph

Unit: Percent (0-100)

Legend: {{namespace}}/{{pod}}

Memory Usage Panel

Query:

sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace) / 1024 / 1024

Visualization: Time series graph

Unit: MiB (Mebibytes)

Note: Dividing the byte value by 1024 twice converts it to mebibytes (MiB) for readability

Pod Status Panel

Query:

sum(kube_pod_status_phase{phase=~"Running|Pending|Failed"}) by (phase)

Visualization: Stat or Gauge

Thresholds: Green (Running), Yellow (Pending), Red (Failed)

Using Public Dashboards

Grafana Dashboard Library: The Grafana community has created thousands of ready-to-use dashboards that you can import and customize.
# Popular Kubernetes dashboards on grafana.com:

# Kubernetes Cluster Monitoring (via Prometheus)
#   Dashboard ID: 315
#   https://grafana.com/grafana/dashboards/315

# Kubernetes / Compute Resources / Cluster
#   Dashboard ID: 7249
#   https://grafana.com/grafana/dashboards/7249

# Node Exporter Full
#   Dashboard ID: 1860
#   https://grafana.com/grafana/dashboards/1860

# Kubernetes / Pods
#   Dashboard ID: 6417
#   https://grafana.com/grafana/dashboards/6417

Importing a Public Dashboard

Method 1: Import by ID

1. Click "+" → Import

2. Enter Dashboard ID (e.g., 315)

3. Click "Load"

4. Select Prometheus data source

5. Click "Import"

Method 2: Import JSON

1. Download JSON from grafana.com

2. Click "+" → Import

3. Upload JSON file or paste JSON

4. Select Prometheus data source

5. Click "Import"

Customizing Imported Dashboards: After importing, you can customize panels, add new metrics, adjust time ranges, and reorganize the layout to fit your specific needs. These customizations can be saved to create your own version.

Dashboard Best Practices

Practical Tips:
  • Organize by Rows: Group related panels into rows for better organization
  • Use Variables: Create dashboard variables for namespace, pod name, etc. to make dashboards reusable
  • Set Appropriate Time Ranges: Default to last 6 hours or 24 hours for operational dashboards
  • Add Descriptions: Document what each panel shows for team members
  • Use Consistent Units: Stick to standard units (MiB/GiB for memory, % for CPU)
  • Avoid Over-crowding: 8-12 panels per dashboard is optimal
  • Move Panels Easily: Drag and drop panels to reorganize your dashboard layout

Lesson 4: Best Practices & Alerting

Effective monitoring requires more than just collecting metrics. You need actionable alerts, proper retention policies, and standardized practices.

Essential Dashboards for Kubernetes

1. Cluster Overview Dashboard

Purpose: High-level cluster health at a glance

Key Metrics:

  • Total nodes and their status (Ready/NotReady)
  • Total pods by phase (Running/Pending/Failed)
  • Cluster-wide CPU and memory utilization
  • API server request rate and latency
  • etcd health and performance

2. Node Dashboard

Purpose: Individual node resource monitoring

Key Metrics:

  • CPU usage per node
  • Memory usage per node
  • Disk I/O and space
  • Network traffic (in/out)
  • System load and processes

3. Workload Dashboard

Purpose: Application and pod performance

Key Metrics:

  • Pod CPU and memory usage by namespace
  • Container restart counts
  • Deployment replica status
  • Pod network traffic
  • Pod state transitions

Introduction to Alerting

Alerts are Critical: Dashboards are great for investigation, but you can't watch them 24/7. Alerts proactively notify you when something goes wrong.
Alerting Flow:
  1. Prometheus evaluates alerting rules at regular intervals
  2. When conditions are met, alert enters "Pending" state
  3. After configured duration, alert moves to "Firing" state
  4. Prometheus sends alert to Alertmanager
  5. Alertmanager groups, deduplicates, and routes alerts
  6. Notifications sent via configured channels (email, Slack, PagerDuty)
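Steps 2 and 3 of the flow above can be sketched as a tiny state machine: an alert fires only once its condition has held continuously for the configured "for" duration. The helper function and timestamps are invented for illustration:

```python
# Toy model of the Pending -> Firing transition.

def alert_state(breach_times, now, for_duration):
    """breach_times: sorted timestamps at which the condition was true on
    every evaluation since it first became true (empty list = inactive)."""
    if not breach_times:
        return "inactive"
    held_for = now - breach_times[0]
    return "firing" if held_for >= for_duration else "pending"

# Condition first breached at t=100; the rule's "for" is 300 seconds.
print(alert_state([], now=100, for_duration=300))                  # inactive
print(alert_state([100, 130, 160], now=200, for_duration=300))     # pending
print(alert_state([100, 130, 160, 400], now=400, for_duration=300))  # firing
```

Once "firing", Prometheus hands the alert to Alertmanager, which owns grouping, deduplication, silencing, and routing.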

Common Alert Rules

# Alert Rules YAML Configuration
# File: prometheus-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: monitoring
data:
  alerts.yml: |
    groups:
      - name: kubernetes-nodes
        interval: 30s
        rules:
          # Alert when a node is down
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "Node has been unreachable for more than 5 minutes"
          # Alert on high CPU usage
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for more than 10 minutes"
          # Alert on high memory usage
          - alert: HighMemoryUsage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85%"
      - name: kubernetes-pods
        interval: 30s
        rules:
          # Alert when containers are crash looping
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
              description: "Container restart rate has been above zero for the last 15 minutes (current rate: {{ $value }}/s)"
          # Alert when pods stay pending too long
          - alert: PodPendingTooLong
            expr: kube_pod_status_phase{phase="Pending"} == 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} stuck in Pending"
              description: "Pod has been in Pending state for more than 15 minutes"
          # Alert on deployment replica mismatch
          - alert: DeploymentReplicasMismatch
            expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch"
              description: "Desired replicas != available replicas for more than 10 minutes"

Alert Severity Levels

  • Critical: service is down or data loss is imminent. Respond immediately (wake the on-call). Notify via PagerDuty, phone, or SMS.
  • Warning: service is degraded or an issue is developing. Respond within hours, during business hours. Notify via Slack or email.
  • Info: informational, no action required yet. Respond when convenient. Notify via email or a dashboard.

Monitoring Best Practices

1. Define Clear SLOs (Service Level Objectives)

Before monitoring, define what "healthy" means for your services:

  • Availability: 99.9% uptime (about 43 minutes of downtime per month)
  • Latency: 95% of requests under 200ms
  • Error Rate: Less than 0.1% errors
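The availability target maps directly to a downtime budget with simple arithmetic; a small helper (hypothetical, for illustration) makes the numbers easy to check:

```python
# Turning an availability SLO into a monthly downtime budget.

def downtime_budget_minutes(slo, days=30):
    """Minutes of allowed downtime per `days` at availability `slo`."""
    return (1 - slo) * days * 24 * 60

print(round(downtime_budget_minutes(0.999), 1))   # 43.2 minutes/month
print(round(downtime_budget_minutes(0.9999), 1))  # 4.3 minutes/month
```

Each extra "nine" shrinks the budget by a factor of ten, which is why SLO targets should reflect genuine user needs rather than aspiration.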
2. Alert on Symptoms, Not Causes

Alert when users are impacted, not on every component failure:

  • Good: "API response time > 1s for 5 minutes"
  • Bad: "Redis CPU usage > 70%"
3. Reduce Alert Fatigue
  • Every alert should be actionable
  • Use appropriate thresholds and durations
  • Group similar alerts
  • Implement maintenance windows
  • Review and tune alerts regularly
4. Plan for Data Retention
  • Short-term (15 days): High-resolution data in Prometheus
  • Long-term (1+ year): Downsampled data in external storage (Thanos, Cortex)
  • Balance storage costs vs. query needs
5. Monitor the Monitoring System
  • Alert when Prometheus is down
  • Monitor Prometheus scrape failures
  • Track Prometheus resource usage
  • Ensure Grafana is accessible

Practical Monitoring Workflow

Day 1: Setup Foundation

Install Prometheus and Grafana, verify metrics collection, import community dashboards

Week 1: Create Custom Dashboards

Build dashboards based on public templates, customize for your applications, organize by team/service

Week 2-4: Establish Baselines

Observe normal behavior, identify patterns, determine appropriate thresholds

Month 2: Implement Alerting

Create alert rules, configure notification channels, test alert flow, document runbooks

Ongoing: Refine and Optimize

Tune alert thresholds, reduce false positives, add new metrics as needed, review and update quarterly

Remember: Monitoring is an iterative process. Start with basic dashboards and alerts, then continuously improve based on real-world experience and incidents. The goal is actionable insights, not just collecting data.

Final Quiz

Test your knowledge of Kubernetes cluster monitoring!

Question 1: What are the Four Golden Signals in monitoring?

a) CPU, Memory, Disk, Network
b) Latency, Traffic, Errors, Saturation
c) Nodes, Pods, Services, Deployments
d) Metrics, Logs, Traces, Alerts

Question 2: How does Prometheus collect metrics?

a) Applications push metrics to Prometheus
b) Prometheus scrapes (pulls) metrics from targets at regular intervals
c) Metrics are stored directly in etcd
d) Grafana collects and sends metrics to Prometheus

Question 3: What metric type should be used for total HTTP requests?

a) Gauge (value can go up and down)
b) Counter (cumulative value that only increases)
c) Histogram (samples in buckets)
d) Summary (provides quantiles)

Question 4: What is a major benefit of using public Grafana dashboards?

a) They are the only way to visualize Prometheus data
b) You can import ready-made, well-tested dashboards and customize them
c) They automatically configure Prometheus
d) They replace the need for custom metrics

Question 5: How should you set appropriate units in Grafana panels?

a) Always use raw bytes regardless of value size
b) Define values in meaningful units (MiB for memory, % for CPU) for readability
c) Units don't matter, only the graph shape
d) Let Grafana choose automatically without customization

Question 6: What's the best practice for alert severity?

a) Mark everything as critical to ensure quick response
b) Use critical only when service is down or data loss imminent; warning for degraded service
c) Never use critical alerts, only warnings
d) Severity doesn't matter as long as notifications are sent

Question 7: What is PromQL used for?

a) Configuring Kubernetes resources
b) Querying and aggregating time-series data in Prometheus
c) Creating Docker containers
d) Deploying Grafana dashboards

Question 8: Why should you alert on symptoms rather than causes?

a) Symptoms are easier to measure
b) Alert when users are actually impacted, not on every component issue that may not affect service
c) Causes don't generate metrics
d) Symptoms require less configuration
Quiz Complete!
All correct answers are option 'b'. These monitoring principles will help you build effective observability for your Kubernetes clusters. Remember: monitoring is a vast topic worthy of its own course, but this provides a solid foundation to get started!