Lesson 1: Introduction to Observability
Observability goes beyond simple monitoring. It's the ability to understand the internal state of your system by examining its outputs. In modern distributed systems like Kubernetes, observability is essential for debugging, performance optimization, and maintaining reliability.
Monitoring vs. Observability
| Aspect | Monitoring | Observability |
|---|---|---|
| Definition | Watching known metrics and alerts | Understanding system behavior from outputs |
| Questions Answered | "Is the system up?" "Are metrics normal?" | "Why did this happen?" "What caused this?" |
| Approach | Predefined dashboards and alerts | Exploratory investigation and correlation |
| Use Case | Detecting known issues | Debugging unknown issues |
| Scope | What you know to check | What you didn't anticipate needing |
The Three Pillars of Observability
Logs
What: Discrete events with timestamps
Purpose:
- Detailed event history
- Error messages and stack traces
- Debugging specific issues
- Audit trails
Example:
2025-01-15 10:30:45 ERROR: Database connection timeout
Metrics
What: Numerical measurements over time
Purpose:
- System health trends
- Performance tracking
- Alerting on thresholds
- Capacity planning
Example:
cpu_usage{pod="app-1"} = 75%
Traces
What: Request journey across services
Purpose:
- End-to-end request flow
- Performance bottlenecks
- Service dependencies
- Distributed debugging
Example:
API → Auth → DB (120ms)
Why Observability Matters in Kubernetes
- Pods constantly starting, stopping, and moving between nodes
- Multiple replicas of the same service running simultaneously
- Services communicating across network boundaries
- Failures that cascade across multiple components
Real-World Scenario: Debugging a Slow Request
Problem: Users report that API requests are occasionally taking 5+ seconds
Using Only Monitoring:
- Dashboard shows average response time = 200ms (looks normal!)
- CPU and memory metrics look fine
- No alerts firing
- Result: Cannot identify the issue ❌
Using Observability (All Three Pillars):
- Metrics: Create a p99 latency graph → See spikes every 2 minutes
- Logs: Filter logs during spike times → Find "database connection pool exhausted"
- Traces: Follow slow request traces → Identify specific DB query taking 5 seconds
- Result: Found the root cause: unoptimized query + small connection pool ✓
The Golden Rule of Observability
The primary principle is to capture all data that you might need to understand the life and health of your system. You cannot retroactively add observability data after an incident occurs.
Ask yourself: "If this system fails at 3 AM, what information would I need to debug it?"
Connecting the Pillars
- Start with Metrics: Notice a spike in error rate
- Drill into Logs: Find specific error messages during that timeframe
- Follow Traces: See the complete request path that caused the error
- Correlate Back: Understand which metric spike corresponds to which trace/log
# Key to connecting the pillars: Correlation IDs
# Every request gets a unique ID that appears in:
# - Metrics (as a label)
# - Logs (as a field)
# - Traces (as trace_id)
# Example flow:
# 1. Trace starts with trace_id = "abc123"
# 2. Each span in trace has span_id = "span-001", "span-002", etc.
# 3. Logs include both:
{
"trace_id": "abc123",
"span_id": "span-001",
"message": "Database query started"
}
# Now you can:
# - See trace "abc123" took 5 seconds (distributed tracing)
# - Find all logs for trace_id "abc123" (logging system)
# - See metrics tagged with trace_id "abc123" (metrics system)
# - Follow the complete story across all three pillars!
Observability Tools Ecosystem
| Pillar | Popular Tools | Kubernetes Native |
|---|---|---|
| Logs | Loki, Elasticsearch, Fluentd, Fluent Bit | kubectl logs (basic) |
| Metrics | Prometheus, Grafana, Datadog, New Relic | Metrics Server, kubectl top |
| Traces | Jaeger, Zipkin, Tempo, OpenTelemetry | None (requires external tools) |
| Unified | Grafana Stack (Loki+Prometheus+Tempo+Grafana) | - |
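The correlation-ID flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a real tracing library; the field names (`trace_id`, `span_id`) mirror the examples in this lesson, and `new_trace_id`/`log_event` are hypothetical helpers.

```python
import json
import uuid


def new_trace_id() -> str:
    """Generate a unique ID that will appear in logs, metrics, and traces."""
    return uuid.uuid4().hex[:12]


def log_event(trace_id: str, span_id: str, message: str, **fields) -> str:
    """Emit a structured JSON log line carrying the correlation IDs."""
    record = {"trace_id": trace_id, "span_id": span_id, "message": message}
    record.update(fields)
    return json.dumps(record)


trace_id = new_trace_id()
line = log_event(trace_id, "span-001", "Database query started", service="api")
parsed = json.loads(line)
print(parsed["trace_id"] == trace_id)  # True: this log line is linked to the trace
```

Because every emitted line carries the same `trace_id`, a logging backend can later answer "show me every log for trace abc123" across all services.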
Lesson 2: Advanced Logging Techniques
Effective logging is more than just printing messages. Structured logging with the right fields enables powerful analysis, correlation with traces, and actionable insights.
Structured Logging with JSON
- Machine-readable and easily parsed
- Searchable by individual fields
- Aggregatable and analyzable at scale
- Compatible with modern logging systems (Loki, Elasticsearch)
# Bad: Unstructured text log
"2025-01-15 10:30:45 User john requested /api/users from 192.168.1.100 - took 250ms - status 200"
# Parsing this requires complex regex patterns
# Cannot easily filter by user, endpoint, or response time
# Difficult to aggregate metrics from logs
# Good: Structured JSON log
{
"timestamp": "2025-01-15T10:30:45Z",
"level": "INFO",
"message": "HTTP request processed",
"user": "john",
"endpoint": "/api/users",
"method": "GET",
"ip": "192.168.1.100",
"response_time_ms": 250,
"status_code": 200,
"handler": "UserController.getUsers"
}
# Now you can easily:
# - Find all requests by user: user="john"
# - Find slow requests: response_time_ms > 1000
# - Calculate average response time per endpoint
# - Build graphs of status codes over time
Essential Log Fields
1. Severity/Level
Purpose: Filter logs by importance and build alerting rules
Standard Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
# Use case: Alert on error spikes
{
"level": "ERROR",
"message": "Database connection failed",
"error": "Connection timeout after 30s"
}
# Prometheus query to count errors:
sum(rate(log_entries{level="ERROR"}[5m]))
# Alert when error rate > 10/sec:
alert: HighErrorRate
expr: sum(rate(log_entries{level="ERROR"}[5m])) > 10
2. Response Time
Purpose: Track performance and identify slow requests
Field: response_time_ms or duration_ms
# For HTTP services, always log request duration
{
"level": "INFO",
"endpoint": "/api/orders",
"method": "POST",
"response_time_ms": 1250,
"status_code": 201
}
# Analysis:
# - Calculate p50, p95, p99 percentiles
# - Identify slow endpoints
# - Correlate slow requests with errors
# - Build SLO dashboards (e.g., 95% of requests < 500ms)
3. Handler/Endpoint
Purpose: Identify which component or function processed the request
Importance: Pinpoints the exact code path for debugging
{
"level": "ERROR",
"handler": "OrderController.createOrder",
"endpoint": "/api/orders",
"method": "POST",
"message": "Validation failed",
"validation_errors": ["Missing required field: customer_id"]
}
# With handler information, you can:
# - Quickly identify which code is failing
# - Track error rates per controller/function
# - Prioritize fixes based on most-failing handlers
4. User Information
Purpose: Track user activity and debug user-specific issues
Fields: user_id, username, tenant_id (for multi-tenant apps)
{
"level": "WARNING",
"user_id": "user-12345",
"username": "john@example.com",
"tenant_id": "company-abc",
"message": "Rate limit exceeded",
"requests_per_minute": 150,
"limit": 100
}
# Use cases:
# - Find all actions by a specific user
# - Identify abusive users (rate limiting, suspicious activity)
# - Debug user-reported issues: "Show me all logs for user X"
# - Track tenant-specific behavior in multi-tenant systems
Critical: Span ID for Distributed Tracing
Always include span_id (and trace_id) in your logs. This connects individual log events to a complete distributed trace, allowing you to follow a single request across multiple services.
# Service 1 (API Gateway) receives request
{
"timestamp": "2025-01-15T10:30:45.000Z",
"level": "INFO",
"service": "api-gateway",
"trace_id": "abc123def456",
"span_id": "span-001",
"message": "Received request",
"endpoint": "/api/orders",
"method": "POST"
}
# Service 2 (Order Service) processes request
{
"timestamp": "2025-01-15T10:30:45.050Z",
"level": "INFO",
"service": "order-service",
"trace_id": "abc123def456",
"span_id": "span-002",
"parent_span_id": "span-001",
"message": "Creating order",
"order_id": "order-789"
}
# Service 3 (Payment Service) is called
{
"timestamp": "2025-01-15T10:30:45.100Z",
"level": "INFO",
"service": "payment-service",
"trace_id": "abc123def456",
"span_id": "span-003",
"parent_span_id": "span-002",
"message": "Processing payment",
"amount": 99.99
}
# Service 3 encounters error
{
"timestamp": "2025-01-15T10:30:45.150Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "abc123def456",
"span_id": "span-003",
"message": "Payment gateway timeout",
"error": "Connection timeout after 5s"
}
# Now you can:
# 1. Query logs for trace_id "abc123def456"
# 2. See the complete request journey across all services
# 3. Identify exactly where it failed (payment-service)
# 4. View the distributed trace visualization in Jaeger/Tempo
# 5. Correlate metrics for all services involved in this trace
Complete Log Structure Template
# Comprehensive JSON log structure
{
// Timestamp (ISO 8601 format with timezone)
"timestamp": "2025-01-15T10:30:45.123Z",
// Severity level
"level": "INFO",
// Human-readable message
"message": "User login successful",
// Service/Application identification
"service": "auth-service",
"version": "v2.3.1",
"environment": "production",
// Distributed tracing
"trace_id": "abc123def456",
"span_id": "span-001",
"parent_span_id": null,
// Kubernetes metadata (automatically added by logging agent)
"kubernetes": {
"namespace": "production",
"pod_name": "auth-service-7f9c4b8d-xk2pq",
"container_name": "auth",
"node_name": "worker-node-2",
"labels": {
"app": "auth-service",
"version": "v2.3.1"
}
},
// Request context
"request": {
"method": "POST",
"endpoint": "/api/auth/login",
"ip": "192.168.1.100",
"user_agent": "Mozilla/5.0..."
},
// Response context
"response": {
"status_code": 200,
"response_time_ms": 125
},
// User context
"user": {
"user_id": "user-12345",
"username": "john@example.com",
"tenant_id": "company-abc"
},
// Business context
"handler": "AuthController.login",
"session_id": "sess-xyz789",
// Additional custom fields
"custom": {
"login_method": "password",
"two_factor_enabled": true
}
}
Implementing Structured Logging
# Example: Python with structlog
import structlog
logger = structlog.get_logger()
# Add global context
logger = logger.bind(
service="auth-service",
version="v2.3.1",
environment="production"
)
# Log with structured fields
logger.info(
"user_login_successful",
user_id="user-12345",
username="john@example.com",
endpoint="/api/auth/login",
response_time_ms=125,
trace_id=request.trace_id,
span_id=request.span_id
)
# Example: Go with logrus
import (
"github.com/sirupsen/logrus"
)
log := logrus.WithFields(logrus.Fields{
"service": "auth-service",
"trace_id": traceID,
"span_id": spanID,
"user_id": userID,
"endpoint": "/api/auth/login",
"response_ms": 125,
})
log.Info("User login successful")
# Example: Node.js with winston
const winston = require('winston');
const logger = winston.createLogger({
format: winston.format.json(),
defaultMeta: {
service: 'auth-service',
version: 'v2.3.1'
}
});
logger.info('User login successful', {
trace_id: traceId,
span_id: spanId,
user_id: userId,
endpoint: '/api/auth/login',
response_time_ms: 125
});
Lesson 3: Metrics & Custom HPA
Metrics provide quantitative measurements of your system's health and performance. Advanced use cases include using custom metrics, including those derived from logs, to drive autoscaling decisions.
Types of Metrics in Kubernetes
Resource Metrics
Source: Metrics Server
Metrics:
- CPU usage
- Memory usage
Used By:
- kubectl top
- Basic HPA
- Scheduler
Custom Metrics
Source: Custom Metrics API
Metrics:
- Request rate
- Queue length
- Response time
Used By:
- Advanced HPA
- Custom dashboards
External Metrics
Source: External Metrics API
Metrics:
- Cloud provider metrics
- SaaS metrics
- Business metrics
Used By:
- HPA with external data
Standard HPA (Horizontal Pod Autoscaler)
# Basic HPA based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Scales when average CPU across all pods > 70%
Advanced: Custom Metrics HPA
Step 1: Expose Custom Metrics
Your application must expose metrics in Prometheus format (typically via /metrics endpoint)
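As a sketch of what "expose metrics in Prometheus format" means, the snippet below serves the text exposition format over HTTP using only the standard library. In a real application you would normally use a Prometheus client library instead; the metric names match the examples later in this step.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread
from urllib.request import urlopen

# In-memory metric values; a real app would track these with a client library.
metrics = {"http_requests_total": 150000, "http_requests_per_second": 250}


def render_metrics() -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, value in metrics.items():
        kind = "counter" if name.endswith("_total") else "gauge"
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
text = urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
server.shutdown()
print(text)
```

Prometheus would then scrape this endpoint on the interval configured in its ServiceMonitor/PodMonitor, as described in Step 2.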
Step 2: Prometheus Scrapes Metrics
Configure Prometheus to scrape your application's /metrics endpoint and store the data
Step 3: Install Prometheus Adapter
The Prometheus Adapter exposes Prometheus metrics via the Kubernetes Custom Metrics API
Step 4: Configure HPA
Create HPA resource that references the custom metric exposed by the adapter
Example: Scaling Based on Request Rate
# Step 1: Application exposes Prometheus metrics
# Endpoint: http://my-app:8080/metrics
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users"} 150000
http_requests_total{method="POST",endpoint="/api/users"} 45000
# HELP http_requests_per_second Current request rate
# TYPE http_requests_per_second gauge
http_requests_per_second 250
# Step 2: Prometheus scrapes the metrics
# ServiceMonitor or PodMonitor configured to scrape /metrics
# Step 3: Install Prometheus Adapter
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--namespace monitoring \
--set prometheus.url=http://prometheus.monitoring.svc \
--set prometheus.port=9090
# Step 4: Configure adapter to expose custom metric
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-adapter-config
data:
config.yaml: |
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "^(.*)_total$"
as: "${1}_per_second"
metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
# Step 5: Create HPA using custom metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 2
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100"
# Scales when average requests/sec per pod > 100
Advanced: Log-Derived Metrics for HPA
How Log-Derived Metrics Work
- Structured Logs: Application logs response time in JSON format
- Log Aggregation: Promtail/Fluent Bit sends logs to Loki
- LogQL to Metrics: Loki can convert logs to metrics using LogQL queries
- Prometheus Scrapes: Prometheus scrapes metrics from Loki
- Adapter Exposes: Prometheus Adapter exposes to Kubernetes
- HPA Uses: HPA scales based on these log-derived metrics
# Example: Scale based on P95 response time from logs
# 1. Application logs include response_time_ms
{
"level": "INFO",
"endpoint": "/api/orders",
"response_time_ms": 350,
"status_code": 200
}
# 2. Loki recording rule to extract P95 response time as a metric
# (LogQL unwrapped range aggregation over the response_time_ms field):
quantile_over_time(0.95,
  {namespace="production"}
    | json
    | unwrap response_time_ms [5m]) by (pod)
# 3. Prometheus scrapes this metric from Loki
# Configure Prometheus to scrape Loki's metrics endpoint
# 4. Create HPA based on P95 response time
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: response-time-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 3
maxReplicas: 30
metrics:
- type: Pods
pods:
metric:
name: http_response_time_p95_milliseconds
target:
type: AverageValue
averageValue: "500" # Scale when P95 > 500ms
# When response times increase due to load:
# 1. Logs capture slow requests
# 2. Metric derived from logs shows increased P95
# 3. HPA sees metric > 500ms
# 4. HPA adds more replicas
# 5. Load distributes, response times improve
Benefits:
- Scale based on actual user experience (response time)
- Leverage existing log data without separate instrumentation
- React to application-level issues, not just resource usage
- Can extract any metric from structured logs
Requirements:
- Structured logging (JSON) with relevant fields
- Loki or similar log aggregation system
- Prometheus integration
- Prometheus Adapter for Kubernetes metrics API
- Proper configuration of recording rules
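To make the pipeline concrete, here is a rough illustration of the computation the log-derived metric performs: parse `response_time_ms` out of JSON log lines and take a P95. This is a simplified nearest-rank percentile; Loki's `quantile_over_time` uses its own estimation, and the sample values are invented for the example.

```python
import json

# Hypothetical structured log lines like those shown in this lesson
log_lines = [
    json.dumps({"level": "INFO", "endpoint": "/api/orders", "response_time_ms": ms})
    for ms in [120, 150, 180, 200, 230, 260, 300, 350, 420, 5000]
]


def p95(values):
    """Nearest-rank 95th percentile (simplified; real systems interpolate)."""
    ordered = sorted(values)
    rank = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[rank]


times = [json.loads(line)["response_time_ms"] for line in log_lines]
print(p95(times))  # → 5000: the single slow request dominates the tail
```

This is exactly why a P95-based HPA reacts to user-visible slowness that an average (here roughly 720ms) would understate.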
Lesson 4: Kubernetes Jobs Best Practices
Kubernetes Jobs and CronJobs have specific configuration requirements that are often misconfigured, leading to unexpected behavior and difficult-to-debug issues.
Understanding restartPolicy
| Policy | Behavior | Use Case |
|---|---|---|
| Always | Always restart container after exit | Long-running services (Deployments, StatefulSets) |
| OnFailure | Restart only if exit code ≠ 0 | Jobs (NOT recommended - see below) |
| Never | Never restart container | Jobs, CronJobs (RECOMMENDED) |
The Problem with OnFailure
Many teams set restartPolicy: OnFailure for Jobs, thinking it makes sense to retry failed tasks. However, this creates several problems.
Why OnFailure is Problematic for Jobs
1. Kubelet Handles Restarts (Not Job Controller)
When you use restartPolicy: OnFailure, the kubelet on the node manages the container restarts, completely bypassing the Job controller.
2. Job Controller Settings Are Ignored
Critical Job-level configurations don't work properly:
- activeDeadlineSeconds: Maximum execution time is not enforced
- backoffLimit: Retry count limit is bypassed
- Job completion tracking becomes unreliable
- Cleanup behavior is unpredictable
3. Node-Local Behavior
The kubelet restarts the container on the same node, which means:
- If node fails, job is stuck
- Node-specific issues (disk full, network) cause repeated failures
- Cannot leverage cluster-wide scheduling
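The difference in who owns the retry loop can be shown with a toy simulation. This is not real Kubernetes code, just a model of the retry accounting described in this lesson: with restartPolicy: Never the Job controller counts pod failures and stops at backoffLimit, whereas with OnFailure the kubelet restarts in place and the Job-level limit never applies.

```python
def run_job_with_never(backoff_limit: int, attempts_needed: int) -> dict:
    """Toy model: the Job controller creates a fresh pod per retry and
    marks the Job Failed once `backoff_limit` pod failures have occurred."""
    failures = 0
    while True:
        attempt = failures + 1
        if attempt >= attempts_needed:  # this attempt would succeed
            return {"status": "Succeeded", "pods_created": attempt}
        failures += 1
        if failures >= backoff_limit:  # limit enforced by the controller
            return {"status": "Failed", "pods_created": failures}


# A task that only succeeds on its 10th attempt hits the limit:
print(run_job_with_never(backoff_limit=3, attempts_needed=10))
# A task that succeeds on attempt 2 finishes normally:
print(run_job_with_never(backoff_limit=3, attempts_needed=2))
```

With OnFailure, the equivalent loop would live in the kubelet with no `backoff_limit` check at all, which is the "could retry 100 times" failure mode shown in the example below.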
Best Practice: Use restartPolicy: Never
Always use restartPolicy: Never for Job and CronJob pod templates. This ensures the Job controller manages retries and all job-level settings work correctly.
# ❌ WRONG: Using OnFailure
apiVersion: batch/v1
kind: Job
metadata:
name: data-import-job
spec:
backoffLimit: 3 # This will be IGNORED!
activeDeadlineSeconds: 600 # This will be IGNORED!
template:
spec:
restartPolicy: OnFailure # ← DON'T USE THIS
containers:
- name: importer
image: data-importer:v1
command: ["python", "import.py"]
# What happens:
# 1. Container fails (exit code 1)
# 2. Kubelet restarts container on same node
# 3. Job controller has no control
# 4. backoffLimit is ignored (could retry 100 times!)
# 5. activeDeadlineSeconds is ignored (could run forever!)
# 6. If node fails, job is stuck
# ✅ CORRECT: Using Never
apiVersion: batch/v1
kind: Job
metadata:
name: data-import-job
spec:
backoffLimit: 3 # Maximum 3 retries
activeDeadlineSeconds: 600 # Job must complete within 10 minutes
template:
spec:
restartPolicy: Never # ← CORRECT
containers:
- name: importer
image: data-importer:v1
command: ["python", "import.py"]
# What happens:
# 1. Container fails (exit code 1)
# 2. Pod is marked as Failed
# 3. Job controller sees the failure
# 4. Job controller creates a NEW pod (retry #1)
# 5. If it fails again, creates another pod (retry #2)
# 6. After 3 failures, Job is marked as Failed
# 7. activeDeadlineSeconds is enforced (job killed after 600s)
# 8. New pods can be scheduled on different nodes (better reliability)
Complete Job Configuration Example
# Production-ready Job configuration
apiVersion: batch/v1
kind: Job
metadata:
name: database-backup
namespace: production
spec:
# Maximum time for job to complete (2 hours)
activeDeadlineSeconds: 7200
# Maximum retries before marking as failed
backoffLimit: 3
# Number of pod completions required
completions: 1
# Number of pods to run in parallel
parallelism: 1
# Cleanup policy (automatically delete after completion)
ttlSecondsAfterFinished: 86400 # Delete after 24 hours
template:
metadata:
labels:
app: database-backup
component: backup
spec:
restartPolicy: Never # ← CRITICAL
# Service account for authentication
serviceAccountName: backup-sa
containers:
- name: backup
image: postgres:14
command:
- /bin/bash
- -c
- |
pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \
gzip > /backup/db-backup-$(date +%Y%m%d-%H%M%S).sql.gz
env:
- name: DB_HOST
value: postgres.production.svc
- name: DB_USER
valueFrom:
secretKeyRef:
name: postgres-credentials
key: username
- name: DB_NAME
value: production_db
volumeMounts:
- name: backup-storage
mountPath: /backup
# Resource limits
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
volumes:
- name: backup-storage
persistentVolumeClaim:
claimName: backup-pvc
CronJob Best Practices
# Production-ready CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
name: nightly-cleanup
namespace: production
spec:
# Run every day at 2:00 AM
schedule: "0 2 * * *"
# Maximum time to start job if it misses scheduled time
startingDeadlineSeconds: 600
# How to handle overlapping jobs
concurrencyPolicy: Forbid # Don't start if previous still running
# Keep history of successful and failed jobs
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5
jobTemplate:
spec:
# Job-level settings
activeDeadlineSeconds: 3600 # 1 hour max
backoffLimit: 2
ttlSecondsAfterFinished: 86400
template:
spec:
restartPolicy: Never # ← CRITICAL
containers:
- name: cleanup
image: alpine:3.18
command:
- /bin/sh
- -c
- |
echo "Starting cleanup..."
find /data -type f -mtime +30 -delete
echo "Cleanup completed"
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: app-data
Observability for Jobs
Monitoring Jobs
Apply the same observability principles to Jobs:
# Job with structured logging and metrics
apiVersion: batch/v1
kind: Job
metadata:
name: data-processing
spec:
backoffLimit: 3
template:
spec:
restartPolicy: Never
containers:
- name: processor
image: my-processor:v1
env:
# Enable structured logging
- name: LOG_FORMAT
value: "json"
# Add trace context
- name: TRACE_ENABLED
value: "true"
# Expose metrics
- name: METRICS_PORT
value: "9090"
command:
- python
- process.py
# Expose metrics endpoint
ports:
- name: metrics
containerPort: 9090
# Application code includes:
# 1. Structured logging with job context
logger.info(
"job_started",
job_name="data-processing",
job_id=os.environ.get("JOB_ID"),
trace_id=generate_trace_id()
)
# 2. Prometheus metrics
job_duration_seconds.observe(duration)
job_records_processed.inc(count)
job_errors_total.inc()
# 3. Final status log
logger.info(
"job_completed",
duration_seconds=duration,
records_processed=count,
status="success"
)
- Always use restartPolicy: Never for Jobs and CronJobs
- Let the Job controller manage retries, not the kubelet
- Set appropriate backoffLimit and activeDeadlineSeconds
- Use ttlSecondsAfterFinished for automatic cleanup
- Apply observability principles (logs, metrics, traces) to Jobs
- Monitor job success/failure rates with Prometheus
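The last bullet can be approximated by tallying Job statuses. The sketch below is a hedged illustration: the dict fields mimic the `succeeded`/`failed` pod counts in a Kubernetes Job's `status` block, and in production these numbers would come from kube-state-metrics series rather than hand-built dicts.

```python
def job_failure_rate(job_statuses):
    """Fraction of finished Jobs that ended in failure.

    Each status dict mirrors a Job's `status` block (`succeeded` /
    `failed` pod counts). A Job that succeeded after retries still
    counts as a success.
    """
    succeeded = sum(1 for s in job_statuses if s.get("succeeded", 0) > 0)
    failed = sum(
        1
        for s in job_statuses
        if s.get("failed", 0) > 0 and s.get("succeeded", 0) == 0
    )
    finished = succeeded + failed
    return failed / finished if finished else 0.0


statuses = [
    {"succeeded": 1, "failed": 0},
    {"succeeded": 1, "failed": 2},  # succeeded after retries
    {"succeeded": 0, "failed": 3},  # exhausted backoffLimit
]
print(job_failure_rate(statuses))  # 1 of 3 finished jobs failed
```

An alert on this ratio (e.g. failure rate above 10% over an hour) is a common starting point for Job monitoring.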
Final Quiz
Test your knowledge of Kubernetes observability!
Question 1: What are the three pillars of observability?
Question 2: Why must logs include span_id and trace_id?
Question 3: What is the primary principle for logging?
Question 4: What enables using log-derived metrics for HPA?
Question 5: Why should Jobs use restartPolicy: Never?
Question 6: What happens when using restartPolicy: OnFailure for Jobs?
Question 7: Which essential fields should be in structured JSON logs?
Question 8: What's the difference between monitoring and observability?
These observability principles are essential for operating production Kubernetes clusters. Remember: proper observability with logs, metrics, and traces (all connected via correlation IDs) is critical for understanding and debugging complex distributed systems!