Lesson 1: Introduction to Observability
Observability goes beyond simple monitoring. It's the ability to understand the internal state of your system by examining its outputs. In modern distributed systems like Kubernetes, observability is essential for debugging, performance optimization, and maintaining reliability.
Monitoring vs. Observability
| Aspect | Monitoring | Observability |
|---|---|---|
| Definition | Watching known metrics and alerts | Understanding system behavior from outputs |
| Questions Answered | "Is the system up?" "Are metrics normal?" | "Why did this happen?" "What caused this?" |
| Approach | Predefined dashboards and alerts | Exploratory investigation and correlation |
| Use Case | Detecting known issues | Debugging unknown issues |
| Scope | What you know to check | What you didn't anticipate needing |
The Three Pillars of Observability
Logs
What: Discrete events with timestamps
Purpose:
- Detailed event history
- Error messages and stack traces
- Debugging specific issues
- Audit trails
Example:
2025-01-15 10:30:45 ERROR: Database connection timeout
Metrics
What: Numerical measurements over time
Purpose:
- System health trends
- Performance tracking
- Alerting on thresholds
- Capacity planning
Example:
cpu_usage{pod="app-1"} = 75%
Traces
What: Request journey across services
Purpose:
- End-to-end request flow
- Performance bottlenecks
- Service dependencies
- Distributed debugging
Example:
API → Auth → DB (120ms)
Why Observability Matters in Kubernetes
- Pods constantly starting, stopping, and moving between nodes
- Multiple replicas of the same service running simultaneously
- Services communicating across network boundaries
- Failures that cascade across multiple components
Real-World Scenario: Debugging a Slow Request
Problem: Users report that API requests are occasionally taking 5+ seconds
Using Only Monitoring:
- Dashboard shows average response time = 200ms (looks normal!)
- CPU and memory metrics look fine
- No alerts firing
- Result: Cannot identify the issue ❌
Using Observability (All Three Pillars):
- Metrics: Create a p99 latency graph → See spikes every 2 minutes
- Logs: Filter logs during spike times → Find "database connection pool exhausted"
- Traces: Follow slow request traces → Identify specific DB query taking 5 seconds
- Result: Found the root cause: unoptimized query + small connection pool ✓
The Golden Rule of Observability
The primary principle is to capture all data that you might need to understand the life and health of your system. You cannot retroactively add observability data after an incident occurs.
Ask yourself: "If this system fails at 3 AM, what information would I need to debug it?"
Connecting the Pillars
- Start with Metrics: Notice a spike in error rate
- Drill into Logs: Find specific error messages during that timeframe
- Follow Traces: See the complete request path that caused the error
- Correlate Back: Understand which metric spike corresponds to which trace/log
# Key to connecting the pillars: Correlation IDs
# Every request gets a unique ID that appears in:
# - Metrics (as a label)
# - Logs (as a field)
# - Traces (as trace_id)
# Example flow:
# 1. Trace starts with trace_id = "abc123"
# 2. Each span in trace has span_id = "span-001", "span-002", etc.
# 3. Logs include both:
{
"trace_id": "abc123",
"span_id": "span-001",
"message": "Database query started"
}
# Now you can:
# - See trace "abc123" took 5 seconds (distributed tracing)
# - Find all logs for trace_id "abc123" (logging system)
# - See metrics tagged with trace_id "abc123" (metrics system)
# - Follow the complete story across all three pillars!
Observability Tools Ecosystem
| Pillar | Popular Tools | Kubernetes Native |
|---|---|---|
| Logs | Loki, Elasticsearch, Fluentd, Fluent Bit | kubectl logs (basic) |
| Metrics | Prometheus, Grafana, Datadog, New Relic | Metrics Server, kubectl top |
| Traces | Jaeger, Zipkin, Tempo, OpenTelemetry | None (requires external tools) |
| Unified | Grafana Stack (Loki+Prometheus+Tempo+Grafana) | - |
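The correlation-ID flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a real tracing library; the field names (`trace_id`, `span_id`) mirror the examples in this lesson, and `new_trace_id`/`log_event` are hypothetical helpers.

```python
import json
import uuid


def new_trace_id() -> str:
    """Generate a unique ID that will appear in logs, metrics, and traces."""
    return uuid.uuid4().hex[:12]


def log_event(trace_id: str, span_id: str, message: str, **fields) -> str:
    """Emit a structured JSON log line carrying the correlation IDs."""
    record = {"trace_id": trace_id, "span_id": span_id, "message": message}
    record.update(fields)
    return json.dumps(record)


trace_id = new_trace_id()
line = log_event(trace_id, "span-001", "Database query started", service="api")
parsed = json.loads(line)
print(parsed["trace_id"] == trace_id)  # True: this log line is linked to the trace
```

Because every emitted line carries the same `trace_id`, a logging backend can later answer "show me every log for trace abc123" across all services.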
Lesson 2: Advanced Logging Techniques
Effective logging is more than just printing messages. Structured logging with the right fields enables powerful analysis, correlation with traces, and actionable insights.
Structured Logging with JSON
- Machine-readable and easily parsed
- Searchable by individual fields
- Aggregatable and analyzable at scale
- Compatible with modern logging systems (Loki, Elasticsearch)
# Bad: Unstructured text log
"2025-01-15 10:30:45 User john requested /api/users from 192.168.1.100 - took 250ms - status 200"
# Parsing this requires complex regex patterns
# Cannot easily filter by user, endpoint, or response time
# Difficult to aggregate metrics from logs
# Good: Structured JSON log
{
"timestamp": "2025-01-15T10:30:45Z",
"level": "INFO",
"message": "HTTP request processed",
"user": "john",
"endpoint": "/api/users",
"method": "GET",
"ip": "192.168.1.100",
"response_time_ms": 250,
"status_code": 200,
"handler": "UserController.getUsers"
}
# Now you can easily:
# - Find all requests by user: user="john"
# - Find slow requests: response_time_ms > 1000
# - Calculate average response time per endpoint
# - Build graphs of status codes over time
Essential Log Fields
1. Severity/Level
Purpose: Filter logs by importance and build alerting rules
Standard Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
# Use case: Alert on error spikes
{
"level": "ERROR",
"message": "Database connection failed",
"error": "Connection timeout after 30s"
}
# Prometheus query to count errors:
sum(rate(log_entries{level="ERROR"}[5m]))
# Alert when error rate > 10/sec:
alert: HighErrorRate
expr: sum(rate(log_entries{level="ERROR"}[5m])) > 10
2. Response Time
Purpose: Track performance and identify slow requests
Field: response_time_ms or duration_ms
# For HTTP services, always log request duration
{
"level": "INFO",
"endpoint": "/api/orders",
"method": "POST",
"response_time_ms": 1250,
"status_code": 201
}
# Analysis:
# - Calculate p50, p95, p99 percentiles
# - Identify slow endpoints
# - Correlate slow requests with errors
# - Build SLO dashboards (e.g., 95% of requests < 500ms)
3. Handler/Endpoint
Purpose: Identify which component or function processed the request
Importance: Pinpoints the exact code path for debugging
{
"level": "ERROR",
"handler": "OrderController.createOrder",
"endpoint": "/api/orders",
"method": "POST",
"message": "Validation failed",
"validation_errors": ["Missing required field: customer_id"]
}
# With handler information, you can:
# - Quickly identify which code is failing
# - Track error rates per controller/function
# - Prioritize fixes based on most-failing handlers
4. User Information
Purpose: Track user activity and debug user-specific issues
Fields: user_id, username, tenant_id (for multi-tenant apps)
{
"level": "WARNING",
"user_id": "user-12345",
"username": "john@example.com",
"tenant_id": "company-abc",
"message": "Rate limit exceeded",
"requests_per_minute": 150,
"limit": 100
}
# Use cases:
# - Find all actions by a specific user
# - Identify abusive users (rate limiting, suspicious activity)
# - Debug user-reported issues: "Show me all logs for user X"
# - Track tenant-specific behavior in multi-tenant systems
Critical: Span ID for Distributed Tracing
Always include span_id (and trace_id) in your logs. This connects individual log events to a complete distributed trace, allowing you to follow a single request across multiple services.
# Service 1 (API Gateway) receives request
{
"timestamp": "2025-01-15T10:30:45.000Z",
"level": "INFO",
"service": "api-gateway",
"trace_id": "abc123def456",
"span_id": "span-001",
"message": "Received request",
"endpoint": "/api/orders",
"method": "POST"
}
# Service 2 (Order Service) processes request
{
"timestamp": "2025-01-15T10:30:45.050Z",
"level": "INFO",
"service": "order-service",
"trace_id": "abc123def456",
"span_id": "span-002",
"parent_span_id": "span-001",
"message": "Creating order",
"order_id": "order-789"
}
# Service 3 (Payment Service) is called
{
"timestamp": "2025-01-15T10:30:45.100Z",
"level": "INFO",
"service": "payment-service",
"trace_id": "abc123def456",
"span_id": "span-003",
"parent_span_id": "span-002",
"message": "Processing payment",
"amount": 99.99
}
# Service 3 encounters error
{
"timestamp": "2025-01-15T10:30:45.150Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "abc123def456",
"span_id": "span-003",
"message": "Payment gateway timeout",
"error": "Connection timeout after 5s"
}
# Now you can:
# 1. Query logs for trace_id "abc123def456"
# 2. See the complete request journey across all services
# 3. Identify exactly where it failed (payment-service)
# 4. View the distributed trace visualization in Jaeger/Tempo
# 5. Correlate metrics for all services involved in this trace
Complete Log Structure Template
# Comprehensive JSON log structure
{
// Timestamp (ISO 8601 format with timezone)
"timestamp": "2025-01-15T10:30:45.123Z",
// Severity level
"level": "INFO",
// Human-readable message
"message": "User login successful",
// Service/Application identification
"service": "auth-service",
"version": "v2.3.1",
"environment": "production",
// Distributed tracing
"trace_id": "abc123def456",
"span_id": "span-001",
"parent_span_id": null,
// Kubernetes metadata (automatically added by logging agent)
"kubernetes": {
"namespace": "production",
"pod_name": "auth-service-7f9c4b8d-xk2pq",
"container_name": "auth",
"node_name": "worker-node-2",
"labels": {
"app": "auth-service",
"version": "v2.3.1"
}
},
// Request context
"request": {
"method": "POST",
"endpoint": "/api/auth/login",
"ip": "192.168.1.100",
"user_agent": "Mozilla/5.0..."
},
// Response context
"response": {
"status_code": 200,
"response_time_ms": 125
},
// User context
"user": {
"user_id": "user-12345",
"username": "john@example.com",
"tenant_id": "company-abc"
},
// Business context
"handler": "AuthController.login",
"session_id": "sess-xyz789",
// Additional custom fields
"custom": {
"login_method": "password",
"two_factor_enabled": true
}
}
Implementing Structured Logging
# Example: Python with structlog
import structlog
logger = structlog.get_logger()
# Add global context
logger = logger.bind(
service="auth-service",
version="v2.3.1",
environment="production"
)
# Log with structured fields
logger.info(
"user_login_successful",
user_id="user-12345",
username="john@example.com",
endpoint="/api/auth/login",
response_time_ms=125,
trace_id=request.trace_id,
span_id=request.span_id
)
# Example: Go with logrus
import (
"github.com/sirupsen/logrus"
)
log := logrus.WithFields(logrus.Fields{
"service": "auth-service",
"trace_id": traceID,
"span_id": spanID,
"user_id": userID,
"endpoint": "/api/auth/login",
"response_ms": 125,
})
log.Info("User login successful")
# Example: Node.js with winston
const winston = require('winston');
const logger = winston.createLogger({
format: winston.format.json(),
defaultMeta: {
service: 'auth-service',
version: 'v2.3.1'
}
});
logger.info('User login successful', {
trace_id: traceId,
span_id: spanId,
user_id: userId,
endpoint: '/api/auth/login',
response_time_ms: 125
});
Lesson 3: Metrics & Custom HPA
Metrics provide quantitative measurements of your system's health and performance. Advanced use cases include using custom metrics, including those derived from logs, to drive autoscaling decisions.
Types of Metrics in Kubernetes
Resource Metrics
Source: Metrics Server
Metrics:
- CPU usage
- Memory usage
Used By:
- kubectl top
- Basic HPA
- Scheduler
Custom Metrics
Source: Custom Metrics API
Metrics:
- Request rate
- Queue length
- Response time
Used By:
- Advanced HPA
- Custom dashboards
External Metrics
Source: External Metrics API
Metrics:
- Cloud provider metrics
- SaaS metrics
- Business metrics
Used By:
- HPA with external data
Standard HPA (Horizontal Pod Autoscaler)
# Basic HPA based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Scales when average CPU across all pods > 70%
Advanced: Custom Metrics HPA
Step 1: Expose Custom Metrics
Your application must expose metrics in Prometheus format (typically via /metrics endpoint)
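As a sketch of what "expose metrics in Prometheus format" means, the snippet below serves the text exposition format over HTTP using only the standard library. In a real application you would normally use a Prometheus client library instead; the metric names match the examples later in this step.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread
from urllib.request import urlopen

# In-memory metric values; a real app would track these with a client library.
metrics = {"http_requests_total": 150000, "http_requests_per_second": 250}


def render_metrics() -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, value in metrics.items():
        kind = "counter" if name.endswith("_total") else "gauge"
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
text = urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
server.shutdown()
print(text)
```

Prometheus would then scrape this endpoint on the interval configured in its ServiceMonitor/PodMonitor, as described in Step 2.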
Step 2: Prometheus Scrapes Metrics
Configure Prometheus to scrape your application's /metrics endpoint and store the data
Step 3: Install Prometheus Adapter
The Prometheus Adapter exposes Prometheus metrics via the Kubernetes Custom Metrics API
Step 4: Configure HPA
Create HPA resource that references the custom metric exposed by the adapter
Example: Scaling Based on Request Rate
# Step 1: Application exposes Prometheus metrics
# Endpoint: http://my-app:8080/metrics
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users"} 150000
http_requests_total{method="POST",endpoint="/api/users"} 45000
# HELP http_requests_per_second Current request rate
# TYPE http_requests_per_second gauge
http_requests_per_second 250
# Step 2: Prometheus scrapes the metrics
# ServiceMonitor or PodMonitor configured to scrape /metrics
# Step 3: Install Prometheus Adapter
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--namespace monitoring \
--set prometheus.url=http://prometheus.monitoring.svc \
--set prometheus.port=9090
# Step 4: Configure adapter to expose custom metric
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-adapter-config
data:
config.yaml: |
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "^(.*)_total$"
as: "${1}_per_second"
metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
# Step 5: Create HPA using custom metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 2
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100"
# Scales when average requests/sec per pod > 100
Advanced: Log-Derived Metrics for HPA
How Log-Derived Metrics Work
- Structured Logs: Application logs response time in JSON format
- Log Aggregation: Promtail/Fluent Bit sends logs to Loki
- LogQL to Metrics: Loki can convert logs to metrics using LogQL queries
- Prometheus Scrapes: Prometheus scrapes metrics from Loki
- Adapter Exposes: Prometheus Adapter exposes to Kubernetes
- HPA Uses: HPA scales based on these log-derived metrics
# Example: Scale based on P95 response time from logs
# 1. Application logs include response_time_ms
{
"level": "INFO",
"endpoint": "/api/orders",
"response_time_ms": 350,
"status_code": 200
}
# 2. Loki recording rule to extract P95 response time as a metric
# (LogQL unwrapped range aggregation over the response_time_ms field):
quantile_over_time(0.95,
  {namespace="production"}
    | json
    | unwrap response_time_ms [5m]) by (pod)
# 3. Prometheus scrapes this metric from Loki
# Configure Prometheus to scrape Loki's metrics endpoint
# 4. Create HPA based on P95 response time
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: response-time-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 3
maxReplicas: 30
metrics:
- type: Pods
pods:
metric:
name: http_response_time_p95_milliseconds
target:
type: AverageValue
averageValue: "500" # Scale when P95 > 500ms
# When response times increase due to load:
# 1. Logs capture slow requests
# 2. Metric derived from logs shows increased P95
# 3. HPA sees metric > 500ms
# 4. HPA adds more replicas
# 5. Load distributes, response times improve
Benefits:
- Scale based on actual user experience (response time)
- Leverage existing log data without separate instrumentation
- React to application-level issues, not just resource usage
- Can extract any metric from structured logs
Requirements:
- Structured logging (JSON) with relevant fields
- Loki or similar log aggregation system
- Prometheus integration
- Prometheus Adapter for Kubernetes metrics API
- Proper configuration of recording rules
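To make the pipeline concrete, here is a rough illustration of the computation the log-derived metric performs: parse `response_time_ms` out of JSON log lines and take a P95. This is a simplified nearest-rank percentile; Loki's `quantile_over_time` uses its own estimation, and the sample values are invented for the example.

```python
import json

# Hypothetical structured log lines like those shown in this lesson
log_lines = [
    json.dumps({"level": "INFO", "endpoint": "/api/orders", "response_time_ms": ms})
    for ms in [120, 150, 180, 200, 230, 260, 300, 350, 420, 5000]
]


def p95(values):
    """Nearest-rank 95th percentile (simplified; real systems interpolate)."""
    ordered = sorted(values)
    rank = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[rank]


times = [json.loads(line)["response_time_ms"] for line in log_lines]
print(p95(times))  # → 5000: the single slow request dominates the tail
```

This is exactly why a P95-based HPA reacts to user-visible slowness that an average (here roughly 720ms) would understate.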
Lesson 4: Kubernetes Jobs Best Practices
Kubernetes Jobs and CronJobs have specific configuration requirements that are often misconfigured, leading to unexpected behavior and difficult-to-debug issues.
Understanding restartPolicy
| Policy | Behavior | Use Case |
|---|---|---|
| Always | Always restart container after exit | Long-running services (Deployments, StatefulSets) |
| OnFailure | Restart only if exit code ≠ 0 | Jobs (NOT recommended - see below) |
| Never | Never restart container | Jobs, CronJobs (RECOMMENDED) |
The Problem with OnFailure
Many teams set restartPolicy: OnFailure for Jobs, thinking it makes sense to retry failed tasks. However, this creates several problems.
Why OnFailure is Problematic for Jobs
1. Kubelet Handles Restarts (Not Job Controller)
When you use restartPolicy: OnFailure, the kubelet on the node manages the container restarts, completely bypassing the Job controller.
2. Job Controller Settings Are Ignored
Critical Job-level configurations don't work properly:
- activeDeadlineSeconds: Maximum execution time is not enforced
- backoffLimit: Retry count limit is bypassed
- Job completion tracking becomes unreliable
- Cleanup behavior is unpredictable
3. Node-Local Behavior
The kubelet restarts the container on the same node, which means:
- If node fails, job is stuck
- Node-specific issues (disk full, network) cause repeated failures
- Cannot leverage cluster-wide scheduling
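The difference in who owns the retry loop can be shown with a toy simulation. This is not real Kubernetes code, just a model of the retry accounting described in this lesson: with restartPolicy: Never the Job controller counts pod failures and stops at backoffLimit, whereas with OnFailure the kubelet restarts in place and the Job-level limit never applies.

```python
def run_job_with_never(backoff_limit: int, attempts_needed: int) -> dict:
    """Toy model: the Job controller creates a fresh pod per retry and
    marks the Job Failed once `backoff_limit` pod failures have occurred."""
    failures = 0
    while True:
        attempt = failures + 1
        if attempt >= attempts_needed:  # this attempt would succeed
            return {"status": "Succeeded", "pods_created": attempt}
        failures += 1
        if failures >= backoff_limit:  # limit enforced by the controller
            return {"status": "Failed", "pods_created": failures}


# A task that only succeeds on its 10th attempt hits the limit:
print(run_job_with_never(backoff_limit=3, attempts_needed=10))
# A task that succeeds on attempt 2 finishes normally:
print(run_job_with_never(backoff_limit=3, attempts_needed=2))
```

With OnFailure, the equivalent loop would live in the kubelet with no `backoff_limit` check at all, which is the "could retry 100 times" failure mode shown in the example below.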
Best Practice: Use restartPolicy: Never
Always use restartPolicy: Never for Job and CronJob pod templates. This ensures the Job controller manages retries and all job-level settings work correctly.
# ❌ WRONG: Using OnFailure
apiVersion: batch/v1
kind: Job
metadata:
name: data-import-job
spec:
backoffLimit: 3 # This will be IGNORED!
activeDeadlineSeconds: 600 # This will be IGNORED!
template:
spec:
restartPolicy: OnFailure # ← DON'T USE THIS
containers:
- name: importer
image: data-importer:v1
command: ["python", "import.py"]
# What happens:
# 1. Container fails (exit code 1)
# 2. Kubelet restarts container on same node
# 3. Job controller has no control
# 4. backoffLimit is ignored (could retry 100 times!)
# 5. activeDeadlineSeconds is ignored (could run forever!)
# 6. If node fails, job is stuck
# ✅ CORRECT: Using Never
apiVersion: batch/v1
kind: Job
metadata:
name: data-import-job
spec:
backoffLimit: 3 # Maximum 3 retries
activeDeadlineSeconds: 600 # Job must complete within 10 minutes
template:
spec:
restartPolicy: Never # ← CORRECT
containers:
- name: importer
image: data-importer:v1
command: ["python", "import.py"]
# What happens:
# 1. Container fails (exit code 1)
# 2. Pod is marked as Failed
# 3. Job controller sees the failure
# 4. Job controller creates a NEW pod (retry #1)
# 5. If it fails again, creates another pod (retry #2)
# 6. After 3 failures, Job is marked as Failed
# 7. activeDeadlineSeconds is enforced (job killed after 600s)
# 8. New pods can be scheduled on different nodes (better reliability)
Complete Job Configuration Example
# Production-ready Job configuration
apiVersion: batch/v1
kind: Job
metadata:
name: database-backup
namespace: production
spec:
# Maximum time for job to complete (2 hours)
activeDeadlineSeconds: 7200
# Maximum retries before marking as failed
backoffLimit: 3
# Number of pod completions required
completions: 1
# Number of pods to run in parallel
parallelism: 1
# Cleanup policy (automatically delete after completion)
ttlSecondsAfterFinished: 86400 # Delete after 24 hours
template:
metadata:
labels:
app: database-backup
component: backup
spec:
restartPolicy: Never # ← CRITICAL
# Service account for authentication
serviceAccountName: backup-sa
containers:
- name: backup
image: postgres:14
command:
- /bin/bash
- -c
- |
pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \
gzip > /backup/db-backup-$(date +%Y%m%d-%H%M%S).sql.gz
env:
- name: DB_HOST
value: postgres.production.svc
- name: DB_USER
valueFrom:
secretKeyRef:
name: postgres-credentials
key: username
- name: DB_NAME
value: production_db
volumeMounts:
- name: backup-storage
mountPath: /backup
# Resource limits
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
volumes:
- name: backup-storage
persistentVolumeClaim:
claimName: backup-pvc
CronJob Best Practices
# Production-ready CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
name: nightly-cleanup
namespace: production
spec:
# Run every day at 2:00 AM
schedule: "0 2 * * *"
# Maximum time to start job if it misses scheduled time
startingDeadlineSeconds: 600
# How to handle overlapping jobs
concurrencyPolicy: Forbid # Don't start if previous still running
# Keep history of successful and failed jobs
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5
jobTemplate:
spec:
# Job-level settings
activeDeadlineSeconds: 3600 # 1 hour max
backoffLimit: 2
ttlSecondsAfterFinished: 86400
template:
spec:
restartPolicy: Never # ← CRITICAL
containers:
- name: cleanup
image: alpine:3.18
command:
- /bin/sh
- -c
- |
echo "Starting cleanup..."
find /data -type f -mtime +30 -delete
echo "Cleanup completed"
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: app-data
Observability for Jobs
Monitoring Jobs
Apply the same observability principles to Jobs:
# Job with structured logging and metrics
apiVersion: batch/v1
kind: Job
metadata:
name: data-processing
spec:
backoffLimit: 3
template:
spec:
restartPolicy: Never
containers:
- name: processor
image: my-processor:v1
env:
# Enable structured logging
- name: LOG_FORMAT
value: "json"
# Add trace context
- name: TRACE_ENABLED
value: "true"
# Expose metrics
- name: METRICS_PORT
value: "9090"
command:
- python
- process.py
# Expose metrics endpoint
ports:
- name: metrics
containerPort: 9090
# Application code includes:
# 1. Structured logging with job context
logger.info(
"job_started",
job_name="data-processing",
job_id=os.environ.get("JOB_ID"),
trace_id=generate_trace_id()
)
# 2. Prometheus metrics
job_duration_seconds.observe(duration)
job_records_processed.inc(count)
job_errors_total.inc()
# 3. Final status log
logger.info(
"job_completed",
duration_seconds=duration,
records_processed=count,
status="success"
)
- Always use restartPolicy: Never for Jobs and CronJobs
- Let the Job controller manage retries, not the kubelet
- Set appropriate backoffLimit and activeDeadlineSeconds
- Use ttlSecondsAfterFinished for automatic cleanup
- Apply observability principles (logs, metrics, traces) to Jobs
- Monitor job success/failure rates with Prometheus
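The last bullet can be approximated by tallying Job statuses. The sketch below is a hedged illustration: the dict fields mimic the `succeeded`/`failed` pod counts in a Kubernetes Job's `status` block, and in production these numbers would come from kube-state-metrics series rather than hand-built dicts.

```python
def job_failure_rate(job_statuses):
    """Fraction of finished Jobs that ended in failure.

    Each status dict mirrors a Job's `status` block (`succeeded` /
    `failed` pod counts). A Job that succeeded after retries still
    counts as a success.
    """
    succeeded = sum(1 for s in job_statuses if s.get("succeeded", 0) > 0)
    failed = sum(
        1
        for s in job_statuses
        if s.get("failed", 0) > 0 and s.get("succeeded", 0) == 0
    )
    finished = succeeded + failed
    return failed / finished if finished else 0.0


statuses = [
    {"succeeded": 1, "failed": 0},
    {"succeeded": 1, "failed": 2},  # succeeded after retries
    {"succeeded": 0, "failed": 3},  # exhausted backoffLimit
]
print(job_failure_rate(statuses))  # 1 of 3 finished jobs failed
```

An alert on this ratio (e.g. failure rate above 10% over an hour) is a common starting point for Job monitoring.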
Final Quiz
Test your knowledge of Kubernetes observability!
Question 1: What are the three pillars of observability?
Question 2: Why must logs include span_id and trace_id?
Question 3: What is the primary principle for logging?
Question 4: What enables using log-derived metrics for HPA?
Question 5: Why should Jobs use restartPolicy: Never?
Question 6: What happens when using restartPolicy: OnFailure for Jobs?
Question 7: Which essential fields should be in structured JSON logs?
Question 8: What's the difference between monitoring and observability?
These observability principles are essential for operating production Kubernetes clusters. Remember: proper observability with logs, metrics, and traces (all connected via correlation IDs) is critical for understanding and debugging complex distributed systems!