Kubernetes Observability

Principles & Techniques for System Observation

Lesson 1: Introduction to Observability

Observability goes beyond simple monitoring. It's the ability to understand the internal state of your system by examining its outputs. In modern distributed systems like Kubernetes, observability is essential for debugging, performance optimization, and maintaining reliability.

Monitoring vs. Observability

Aspect             | Monitoring                                 | Observability
Definition         | Watching known metrics and alerts          | Understanding system behavior from outputs
Questions answered | "Is the system up?" "Are metrics normal?"  | "Why did this happen?" "What caused this?"
Approach           | Predefined dashboards and alerts           | Exploratory investigation and correlation
Use case           | Detecting known issues                     | Debugging unknown issues
Scope              | What you know to check                     | What you didn't anticipate needing
Observability Definition: A system is observable when you can understand its internal state and behavior by examining the data it produces (logs, metrics, traces) without needing to deploy new code or instrumentation.

The Three Pillars of Observability

Logs

What: Discrete events with timestamps

Purpose:

  • Detailed event history
  • Error messages and stack traces
  • Debugging specific issues
  • Audit trails

Example:

2025-01-15 10:30:45 ERROR: Database connection timeout

Metrics

What: Numerical measurements over time

Purpose:

  • System health trends
  • Performance tracking
  • Alerting on thresholds
  • Capacity planning

Example:

cpu_usage{pod="app-1"} = 75%

Traces

What: Request journey across services

Purpose:

  • End-to-end request flow
  • Performance bottlenecks
  • Service dependencies
  • Distributed debugging

Example:

API → Auth → DB (120ms)
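The span structure behind a trace like this can be sketched with a toy model. The names and fields below are illustrative only, not from any real tracing library: each span records its service and duration, and child spans reference their parent.

```python
from dataclasses import dataclass
from typing import Optional

# Toy model of a trace: child spans point at their parent via parent_id.
@dataclass
class Span:
    span_id: str
    service: str
    duration_ms: int
    parent_id: Optional[str] = None

trace = [
    Span("span-001", "api", 120),                        # root span
    Span("span-002", "auth", 30, parent_id="span-001"),
    Span("span-003", "db",   70, parent_id="span-002"),
]

# The root span (no parent) covers the whole request.
root = next(s for s in trace if s.parent_id is None)
print(" -> ".join(s.service for s in trace), f"({root.duration_ms}ms)")
```

Real tracing systems (Jaeger, Tempo) store exactly this parent/child structure, which is what lets them render the request path as a waterfall.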

Why Observability Matters in Kubernetes

Complex Distributed Systems: Kubernetes clusters are highly dynamic with:
  • Pods constantly starting, stopping, and moving between nodes
  • Multiple replicas of the same service running simultaneously
  • Services communicating across network boundaries
  • Failures that cascade across multiple components
Without proper observability, debugging is nearly impossible.

Real-World Scenario: Debugging a Slow Request

Problem: Users report that API requests are occasionally taking 5+ seconds

Using Only Monitoring:

  • Dashboard shows average response time = 200ms (looks normal!)
  • CPU and memory metrics look fine
  • No alerts firing
  • Result: Cannot identify the issue ❌

Using Observability (All Three Pillars):

  1. Metrics: Create a p99 latency graph → See spikes every 2 minutes
  2. Logs: Filter logs during spike times → Find "database connection pool exhausted"
  3. Traces: Follow slow request traces → Identify specific DB query taking 5 seconds
  4. Result: Found the root cause: unoptimized query + small connection pool ✓

The Golden Rule of Observability

Log Everything You'll Need to Query Later:

The primary principle is to capture all data that you might need to understand the life and health of your system. You cannot retroactively add observability data after an incident occurs.

Ask yourself: "If this system fails at 3 AM, what information would I need to debug it?"

Connecting the Pillars

Critical Concept: The three pillars are most powerful when connected together. This allows you to:
  • Start with Metrics: Notice a spike in error rate
  • Drill into Logs: Find specific error messages during that timeframe
  • Follow Traces: See the complete request path that caused the error
  • Correlate Back: Understand which metric spike corresponds to which trace/log
# Key to connecting the pillars: correlation IDs
# Every request gets a unique ID that appears in:
#   - Metrics (as a label)
#   - Logs (as a field)
#   - Traces (as trace_id)
#
# Example flow:
# 1. Trace starts with trace_id = "abc123"
# 2. Each span in the trace has span_id = "span-001", "span-002", etc.
# 3. Logs include both:
{
  "trace_id": "abc123",
  "span_id": "span-001",
  "message": "Database query started"
}
# Now you can:
#   - See that trace "abc123" took 5 seconds (tracing system)
#   - Find all logs for trace_id "abc123" (logging system)
#   - See metrics tagged with trace_id "abc123" (metrics system)
#   - Follow the complete story across all three pillars

Observability Tools Ecosystem

Pillar  | Popular Tools                                        | Kubernetes Native
Logs    | Loki, Elasticsearch, Fluentd, Fluent Bit             | kubectl logs (basic)
Metrics | Prometheus, Grafana, Datadog, New Relic              | Metrics Server, kubectl top
Traces  | Jaeger, Zipkin, Tempo, OpenTelemetry                 | None (requires external tools)
Unified | Grafana Stack (Loki + Prometheus + Tempo + Grafana)  | -

Lesson 2: Advanced Logging Techniques

Effective logging is more than just printing messages. Structured logging with the right fields enables powerful analysis, correlation with traces, and actionable insights.

Structured Logging with JSON

Why JSON? JSON-formatted logs are:
  • Machine-readable and easily parsed
  • Searchable by individual fields
  • Aggregatable and analyzable at scale
  • Compatible with modern logging systems (Loki, Elasticsearch)
# Bad: unstructured text log
"2025-01-15 10:30:45 User john requested /api/users from 192.168.1.100 - took 250ms - status 200"
# Parsing this requires complex regex patterns
# Cannot easily filter by user, endpoint, or response time
# Difficult to aggregate metrics from logs

# Good: structured JSON log
{
  "timestamp": "2025-01-15T10:30:45Z",
  "level": "INFO",
  "message": "HTTP request processed",
  "user": "john",
  "endpoint": "/api/users",
  "method": "GET",
  "ip": "192.168.1.100",
  "response_time_ms": 250,
  "status_code": 200,
  "handler": "UserController.getUsers"
}

# Now you can easily:
#   - Find all requests by user: user="john"
#   - Find slow requests: response_time_ms > 1000
#   - Calculate average response time per endpoint
#   - Build graphs of status codes over time
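Emitting logs in this shape needs no special library. The following is a minimal sketch using only the Python standard library; the `fields` attribute name is an arbitrary choice for this example, and a real service would more likely use structlog or python-json-logger:

```python
import json
import logging

# Minimal JSON formatter: serializes the standard record fields plus any
# extra structured fields attached to the record.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # "fields" is our own convention for carrying structured data.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("HTTP request processed",
            extra={"fields": {"user": "john", "endpoint": "/api/users",
                              "response_time_ms": 250, "status_code": 200}})
```

Each log line is now a single JSON object, ready for Loki or Elasticsearch to index field by field.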

Essential Log Fields

1. Severity/Level

Purpose: Filter logs by importance and build alerting rules

Standard Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL

# Use case: alert on error spikes
{
  "level": "ERROR",
  "message": "Database connection failed",
  "error": "Connection timeout after 30s"
}

# Prometheus query to count errors:
sum(rate(log_entries{level="ERROR"}[5m]))

# Alert when the error rate exceeds 10/sec:
alert: HighErrorRate
expr: sum(rate(log_entries{level="ERROR"}[5m])) > 10

2. Response Time

Purpose: Track performance and identify slow requests

Field: response_time_ms or duration_ms

# For HTTP services, always log request duration
{
  "level": "INFO",
  "endpoint": "/api/orders",
  "method": "POST",
  "response_time_ms": 1250,
  "status_code": 201
}

# Analysis:
#   - Calculate p50, p95, p99 percentiles
#   - Identify slow endpoints
#   - Correlate slow requests with errors
#   - Build SLO dashboards (e.g., 95% of requests < 500ms)

3. Handler/Endpoint

Purpose: Identify which component or function processed the request

Importance: Pinpoints the exact code path for debugging

{
  "level": "ERROR",
  "handler": "OrderController.createOrder",
  "endpoint": "/api/orders",
  "method": "POST",
  "message": "Validation failed",
  "validation_errors": ["Missing required field: customer_id"]
}

# With handler information, you can:
#   - Quickly identify which code is failing
#   - Track error rates per controller/function
#   - Prioritize fixes based on the most-failing handlers

4. User Information

Purpose: Track user activity and debug user-specific issues

Fields: user_id, username, tenant_id (for multi-tenant apps)

{
  "level": "WARNING",
  "user_id": "user-12345",
  "username": "john@example.com",
  "tenant_id": "company-abc",
  "message": "Rate limit exceeded",
  "requests_per_minute": 150,
  "limit": 100
}

# Use cases:
#   - Find all actions by a specific user
#   - Identify abusive users (rate limiting, suspicious activity)
#   - Debug user-reported issues: "Show me all logs for user X"
#   - Track tenant-specific behavior in multi-tenant systems

Critical: Span ID for Distributed Tracing

Most Important Field: Always include a span_id (and trace_id) in your logs. This connects individual log events to a complete distributed trace, allowing you to follow a single request across multiple services.
# Service 1 (API Gateway) receives the request
{
  "timestamp": "2025-01-15T10:30:45.000Z",
  "level": "INFO",
  "service": "api-gateway",
  "trace_id": "abc123def456",
  "span_id": "span-001",
  "message": "Received request",
  "endpoint": "/api/orders",
  "method": "POST"
}

# Service 2 (Order Service) processes the request
{
  "timestamp": "2025-01-15T10:30:45.050Z",
  "level": "INFO",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "span-002",
  "parent_span_id": "span-001",
  "message": "Creating order",
  "order_id": "order-789"
}

# Service 3 (Payment Service) is called
{
  "timestamp": "2025-01-15T10:30:45.100Z",
  "level": "INFO",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span-003",
  "parent_span_id": "span-002",
  "message": "Processing payment",
  "amount": 99.99
}

# Service 3 encounters an error
{
  "timestamp": "2025-01-15T10:30:45.150Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span-003",
  "message": "Payment gateway timeout",
  "error": "Connection timeout after 5s"
}

# Now you can:
# 1. Query logs for trace_id "abc123def456"
# 2. See the complete request journey across all services
# 3. Identify exactly where it failed (payment-service)
# 4. View the distributed trace visualization in Jaeger/Tempo
# 5. Correlate metrics for all services involved in this trace

Complete Log Structure Template

# Comprehensive JSON log structure
{
  // Timestamp (ISO 8601 format with timezone)
  "timestamp": "2025-01-15T10:30:45.123Z",

  // Severity level
  "level": "INFO",

  // Human-readable message
  "message": "User login successful",

  // Service/application identification
  "service": "auth-service",
  "version": "v2.3.1",
  "environment": "production",

  // Distributed tracing
  "trace_id": "abc123def456",
  "span_id": "span-001",
  "parent_span_id": null,

  // Kubernetes metadata (automatically added by the logging agent)
  "kubernetes": {
    "namespace": "production",
    "pod_name": "auth-service-7f9c4b8d-xk2pq",
    "container_name": "auth",
    "node_name": "worker-node-2",
    "labels": {
      "app": "auth-service",
      "version": "v2.3.1"
    }
  },

  // Request context
  "request": {
    "method": "POST",
    "endpoint": "/api/auth/login",
    "ip": "192.168.1.100",
    "user_agent": "Mozilla/5.0..."
  },

  // Response context
  "response": {
    "status_code": 200,
    "response_time_ms": 125
  },

  // User context
  "user": {
    "user_id": "user-12345",
    "username": "john@example.com",
    "tenant_id": "company-abc"
  },

  // Business context
  "handler": "AuthController.login",
  "session_id": "sess-xyz789",

  // Additional custom fields
  "custom": {
    "login_method": "password",
    "two_factor_enabled": true
  }
}

Implementing Structured Logging

# Example: Python with structlog
import structlog

logger = structlog.get_logger()

# Add global context
logger = logger.bind(
    service="auth-service",
    version="v2.3.1",
    environment="production"
)

# Log with structured fields
logger.info(
    "user_login_successful",
    user_id="user-12345",
    username="john@example.com",
    endpoint="/api/auth/login",
    response_time_ms=125,
    trace_id=request.trace_id,
    span_id=request.span_id
)

// Example: Go with logrus
import (
    "github.com/sirupsen/logrus"
)

log := logrus.WithFields(logrus.Fields{
    "service":     "auth-service",
    "trace_id":    traceID,
    "span_id":     spanID,
    "user_id":     userID,
    "endpoint":    "/api/auth/login",
    "response_ms": 125,
})
log.Info("User login successful")

// Example: Node.js with winston
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.json(),
  defaultMeta: { service: 'auth-service', version: 'v2.3.1' }
});

logger.info('User login successful', {
  trace_id: traceId,
  span_id: spanId,
  user_id: userId,
  endpoint: '/api/auth/login',
  response_time_ms: 125
});

Lesson 3: Metrics & Custom HPA

Metrics provide quantitative measurements of your system's health and performance. Advanced use cases include using custom metrics, including those derived from logs, to drive autoscaling decisions.

Types of Metrics in Kubernetes

Resource Metrics

Source: Metrics Server

Metrics:

  • CPU usage
  • Memory usage

Used By:

  • kubectl top
  • Basic HPA
  • Scheduler

Custom Metrics

Source: Custom Metrics API

Metrics:

  • Request rate
  • Queue length
  • Response time

Used By:

  • Advanced HPA
  • Custom dashboards

External Metrics

Source: External Metrics API

Metrics:

  • Cloud provider metrics
  • SaaS metrics
  • Business metrics

Used By:

  • HPA with external data

Standard HPA (Horizontal Pod Autoscaler)

# Basic HPA based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
# Scales when average CPU across all pods > 70%

Advanced: Custom Metrics HPA

Use Case: Scale based on application-specific metrics like request rate, queue length, or response time instead of just CPU/memory.

Step 1: Expose Custom Metrics

Your application must expose metrics in the Prometheus text exposition format, typically via a /metrics HTTP endpoint.
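The exposition format Prometheus scrapes is plain text. The sketch below renders it by hand using only the standard library so the format is visible; a real service would normally use the official prometheus_client package rather than doing this manually, and `render_metrics` is a hypothetical helper name:

```python
# Hand-rolled rendering of the Prometheus text exposition format:
# a HELP line, a TYPE line, then one sample line per label set.
def render_metrics(name, help_text, mtype, samples):
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

body = render_metrics(
    "http_requests_total", "Total HTTP requests", "counter",
    [({"method": "GET", "endpoint": "/api/users"}, 150000),
     ({"method": "POST", "endpoint": "/api/users"}, 45000)],
)
print(body)
```

Serving this string from /metrics (with content type `text/plain`) is all Prometheus needs to start scraping.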

Step 2: Prometheus Scrapes Metrics

Configure Prometheus to scrape your application's /metrics endpoint and store the data

Step 3: Install Prometheus Adapter

The Prometheus Adapter exposes Prometheus metrics via the Kubernetes Custom Metrics API

Step 4: Configure HPA

Create HPA resource that references the custom metric exposed by the adapter

Example: Scaling Based on Request Rate

# Step 1: Application exposes Prometheus metrics
# Endpoint: http://my-app:8080/metrics
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users"} 150000
http_requests_total{method="POST",endpoint="/api/users"} 45000
# HELP http_requests_per_second Current request rate
# TYPE http_requests_per_second gauge
http_requests_per_second 250

# Step 2: Prometheus scrapes the metrics
# (ServiceMonitor or PodMonitor configured to scrape /metrics)

# Step 3: Install the Prometheus Adapter
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus.monitoring.svc \
  --set prometheus.port=9090

# Step 4: Configure the adapter to expose the custom metric
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'

# Step 5: Create an HPA that uses the custom metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
# Scales when average requests/sec per pod > 100

Advanced: Log-Derived Metrics for HPA

Complex but Powerful: You can use metrics derived from logs (such as response time extracted from log entries) as a scaling factor for Kubernetes HPA.

How Log-Derived Metrics Work

  1. Structured Logs: Application logs response time in JSON format
  2. Log Aggregation: Promtail/Fluent Bit sends logs to Loki
  3. LogQL to Metrics: a Loki recording rule converts log fields into a metric using a LogQL query
  4. Prometheus Stores: Loki's ruler ships the resulting series to Prometheus (typically via remote write)
  5. Adapter Exposes: the Prometheus Adapter exposes the metric through the Kubernetes custom metrics API
  6. HPA Uses: HPA scales based on these log-derived metrics
# Example: Scale based on P95 response time from logs

# 1. Application logs include response_time_ms
{
  "level": "INFO",
  "endpoint": "/api/orders",
  "response_time_ms": 350,
  "status_code": 200
}

# 2. Loki recording rule that derives a P95 metric from the logs
#    (unwrap turns the JSON field into a numeric sample stream):
quantile_over_time(0.95,
  {namespace="production"} | json | unwrap response_time_ms [5m]
) by (pod)

# 3. The resulting series is made available to Prometheus
#    (e.g., via the Loki ruler's remote write)

# 4. Create an HPA based on the P95 response time
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: response-time-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_response_time_p95_milliseconds
      target:
        type: AverageValue
        averageValue: "500"  # Scale when P95 > 500ms

# When response times increase under load:
# 1. Logs capture the slow requests
# 2. The log-derived metric shows an increased P95
# 3. HPA sees the metric exceed 500ms
# 4. HPA adds more replicas
# 5. Load distributes, response times improve
Benefits of Log-Derived Metrics:
  • Scale based on actual user experience (response time)
  • Leverage existing log data without separate instrumentation
  • React to application-level issues, not just resource usage
  • Can extract any metric from structured logs
Requirements:
  • Structured logging (JSON) with relevant fields
  • Loki or similar log aggregation system
  • Prometheus integration
  • Prometheus Adapter for Kubernetes metrics API
  • Proper configuration of recording rules

Lesson 4: Kubernetes Jobs Best Practices

Kubernetes Jobs and CronJobs have specific configuration requirements that are often misconfigured, leading to unexpected behavior and difficult-to-debug issues.

Understanding restartPolicy

restartPolicy: Determines what happens when a container in a pod exits. This setting has different implications for regular pods vs. Job/CronJob pods.
Policy    | Behavior                                  | Use Case
Always    | Always restart the container after exit   | Long-running services (Deployments, StatefulSets)
OnFailure | Restart only if exit code ≠ 0             | Jobs (NOT recommended; see below)
Never     | Never restart the container               | Jobs, CronJobs (RECOMMENDED)

The Problem with OnFailure

Common Misconfiguration: Many developers use restartPolicy: OnFailure for Jobs, thinking it makes sense to retry failed tasks. However, this creates several problems.

Why OnFailure is Problematic for Jobs

1. Kubelet Handles Restarts (Not Job Controller)

When you use restartPolicy: OnFailure, the kubelet on the node manages the container restarts, completely bypassing the Job controller.

2. Job Controller Settings Are Ignored

Critical Job-level configurations don't work properly:

  • activeDeadlineSeconds: Maximum execution time is not enforced
  • backoffLimit: Retry count limit is bypassed
  • Job completion tracking becomes unreliable
  • Cleanup behavior is unpredictable

3. Node-Local Behavior

The kubelet restarts the container on the same node, which means:

  • If the node fails, the job is stuck
  • Node-specific issues (a full disk, network problems) cause repeated failures on the same node
  • Retries cannot leverage cluster-wide scheduling

Best Practice: Use restartPolicy: Never

Recommended Configuration: Always use restartPolicy: Never for Job and CronJob pod templates. This ensures the Job controller manages retries and all job-level settings work correctly.
# ❌ WRONG: Using OnFailure
apiVersion: batch/v1
kind: Job
metadata:
  name: data-import-job
spec:
  backoffLimit: 3            # This will be IGNORED!
  activeDeadlineSeconds: 600 # This will be IGNORED!
  template:
    spec:
      restartPolicy: OnFailure  # ← DON'T USE THIS
      containers:
      - name: importer
        image: data-importer:v1
        command: ["python", "import.py"]

# What happens:
# 1. Container fails (exit code 1)
# 2. Kubelet restarts the container on the same node
# 3. The Job controller has no control
# 4. backoffLimit is ignored (could retry 100 times!)
# 5. activeDeadlineSeconds is ignored (could run forever!)
# 6. If the node fails, the job is stuck
# ✅ CORRECT: Using Never
apiVersion: batch/v1
kind: Job
metadata:
  name: data-import-job
spec:
  backoffLimit: 3            # Maximum 3 retries
  activeDeadlineSeconds: 600 # Job must complete within 10 minutes
  template:
    spec:
      restartPolicy: Never  # ← CORRECT
      containers:
      - name: importer
        image: data-importer:v1
        command: ["python", "import.py"]

# What happens:
# 1. Container fails (exit code 1)
# 2. Pod is marked as Failed
# 3. The Job controller sees the failure
# 4. The Job controller creates a NEW pod (retry #1)
# 5. If it fails again, another pod is created (retry #2)
# 6. After 3 failures, the Job is marked as Failed
# 7. activeDeadlineSeconds is enforced (job killed after 600s)
# 8. New pods can be scheduled on different nodes (better reliability)
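The retry semantics described above can be captured in a toy simulation. This is purely illustrative, following the behavior as described in this lesson (the real controller also applies exponential backoff between retries), and `run_job` is an invented helper:

```python
# Toy simulation of the Job controller's retry loop with restartPolicy:
# Never -- every retry is a brand-new pod, and the job gives up once
# backoffLimit failures have accumulated.
def run_job(attempt_results, backoff_limit):
    failures = 0
    pods_created = 0
    for succeeded in attempt_results:
        pods_created += 1       # Job controller creates a fresh pod
        if succeeded:
            return "Complete", pods_created
        failures += 1           # pod marked Failed; controller counts it
        if failures >= backoff_limit:
            return "Failed", pods_created
    return "Failed", pods_created

# Fails twice, then the third pod succeeds (within backoffLimit: 3).
print(run_job([False, False, True], backoff_limit=3))  # ('Complete', 3)
# Never succeeds: the job is marked Failed after 3 failed pods.
print(run_job([False] * 10, backoff_limit=3))          # ('Failed', 3)
```

With OnFailure, none of this bookkeeping happens at the Job level, which is exactly why the limits in the spec stop working.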

Complete Job Configuration Example

# Production-ready Job configuration
apiVersion: batch/v1
kind: Job
metadata:
  name: database-backup
  namespace: production
spec:
  # Maximum time for the job to complete (2 hours)
  activeDeadlineSeconds: 7200
  # Maximum retries before marking as failed
  backoffLimit: 3
  # Number of pod completions required
  completions: 1
  # Number of pods to run in parallel
  parallelism: 1
  # Cleanup policy: automatically delete 24 hours after completion
  ttlSecondsAfterFinished: 86400
  template:
    metadata:
      labels:
        app: database-backup
        component: backup
    spec:
      restartPolicy: Never  # ← CRITICAL
      # Service account for authentication
      serviceAccountName: backup-sa
      containers:
      - name: backup
        image: postgres:14
        command:
        - /bin/bash
        - -c
        - |
          pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \
            gzip > /backup/db-backup-$(date +%Y%m%d-%H%M%S).sql.gz
        env:
        - name: DB_HOST
          value: postgres.production.svc
        - name: DB_USER
          valueFrom:
            secretKeyRef:
              name: postgres-credentials
              key: username
        - name: DB_NAME
          value: production_db
        volumeMounts:
        - name: backup-storage
          mountPath: /backup
        # Resource requests and limits
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
      volumes:
      - name: backup-storage
        persistentVolumeClaim:
          claimName: backup-pvc

CronJob Best Practices

# Production-ready CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
  namespace: production
spec:
  # Run every day at 2:00 AM
  schedule: "0 2 * * *"
  # Maximum delay for starting a job that misses its scheduled time
  startingDeadlineSeconds: 600
  # How to handle overlapping jobs
  concurrencyPolicy: Forbid  # Don't start if the previous run is still going
  # Keep history of successful and failed jobs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      # Job-level settings
      activeDeadlineSeconds: 3600  # 1 hour max
      backoffLimit: 2
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          restartPolicy: Never  # ← CRITICAL
          containers:
          - name: cleanup
            image: alpine:3.18
            command:
            - /bin/sh
            - -c
            - |
              echo "Starting cleanup..."
              find /data -type f -mtime +30 -delete
              echo "Cleanup completed"
            volumeMounts:
            - name: data
              mountPath: /data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: app-data

Observability for Jobs

Monitoring Jobs

Apply the same observability principles to Jobs:

# Job with structured logging and metrics
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: processor
        image: my-processor:v1
        env:
        # Enable structured logging
        - name: LOG_FORMAT
          value: "json"
        # Add trace context
        - name: TRACE_ENABLED
          value: "true"
        # Expose metrics
        - name: METRICS_PORT
          value: "9090"
        command:
        - python
        - process.py
        # Expose the metrics endpoint
        ports:
        - name: metrics
          containerPort: 9090

# The application code should include:

# 1. Structured logging with job context
logger.info(
    "job_started",
    job_name="data-processing",
    job_id=os.environ.get("JOB_ID"),
    trace_id=generate_trace_id()
)

# 2. Prometheus metrics
job_duration_seconds.observe(duration)
job_records_processed.inc(count)
job_errors_total.inc()

# 3. A final status log
logger.info(
    "job_completed",
    duration_seconds=duration,
    records_processed=count,
    status="success"
)
Key Takeaways:
  • Always use restartPolicy: Never for Jobs and CronJobs
  • Let the Job controller manage retries, not the kubelet
  • Set appropriate backoffLimit and activeDeadlineSeconds
  • Use ttlSecondsAfterFinished for automatic cleanup
  • Apply observability principles (logs, metrics, traces) to Jobs
  • Monitor job success/failure rates with Prometheus

Final Quiz

Test your knowledge of Kubernetes observability!

Question 1: What are the three pillars of observability?

a) CPU, Memory, Disk
b) Logs, Metrics, Traces
c) Pods, Services, Deployments
d) Prometheus, Grafana, Loki

Question 2: Why must logs include span_id and trace_id?

a) Kubernetes requires these fields
b) They connect individual log events to distributed traces, allowing you to follow requests across services
c) They make logs larger and easier to read
d) They are only needed for debugging

Question 3: What is the primary principle for logging?

a) Log only errors to save space
b) Log everything that you will later need to query to understand system life and health
c) Log every single line of code execution
d) Only log in production environments

Question 4: What enables using log-derived metrics for HPA?

a) kubectl built-in features
b) Integration with Prometheus and Prometheus Adapter to expose custom metrics to Kubernetes HPA controller
c) Docker automatically enables this
d) Loki handles this without additional configuration

Question 5: Why should Jobs use restartPolicy: Never?

a) It makes jobs run faster
b) Job controller manages restarts properly, applying settings like backoffLimit and activeDeadlineSeconds correctly
c) It prevents jobs from ever running
d) Never is the only valid option for Jobs

Question 6: What happens when using restartPolicy: OnFailure for Jobs?

a) Job controller handles restarts perfectly
b) Kubelet handles restarts, bypassing Job controller and ignoring backoffLimit and activeDeadlineSeconds
c) Jobs never restart on failure
d) Kubernetes will reject the configuration

Question 7: Which essential fields should be in structured JSON logs?

a) Only timestamp and message
b) Severity/level, response_time, handler/endpoint, user info, and span_id for distributed tracing
c) Only error messages
d) Just the application name

Question 8: What's the difference between monitoring and observability?

a) They are exactly the same thing
b) Monitoring watches known metrics; observability enables understanding system behavior and debugging unknown issues
c) Observability is just another word for logging
d) Monitoring is newer than observability
Quiz Complete!
All correct answers are option 'b'. These observability principles are essential for operating production Kubernetes clusters. Remember: proper observability with logs, metrics, and traces (all connected via correlation IDs) is critical for understanding and debugging complex distributed systems!