Kubernetes Logging

Persistence, Aggregation & Modern Solutions (Loki, EFKB Stack)

Lesson 1: Core Principles of Kubernetes Logging

Logging in Kubernetes is fundamentally different from traditional server environments due to the ephemeral nature of containers. Understanding these core principles is essential for building a robust logging system.

The Container Logging Model

Standard Output Requirement: Applications in containerized environments must write logs to standard output (stdout) and standard error (stderr). This is a fundamental requirement for container-based logging.
# Traditional logging (WRONG for containers)
# Application writes to /var/log/app.log
echo "Log message" >> /var/log/app.log

# Container logging (CORRECT)
# Application writes to stdout/stderr
echo "Log message"        # Goes to stdout
echo "Error message" >&2  # Goes to stderr

# The Docker daemon captures these streams
# and handles them with configured log drivers
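The same principle applies inside application code: log structured lines to stdout/stderr and let the runtime handle the rest. A minimal Python sketch (field names and the severity-to-stderr split are illustrative choices, not a standard):

```python
import json
import sys
from datetime import datetime, timezone

def log(level, message, **fields):
    """Emit one structured log line: errors to stderr, everything else to stdout."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    stream = sys.stderr if level in ("ERROR", "CRITICAL") else sys.stdout
    print(json.dumps(entry), file=stream)

log("INFO", "service started", port=8080)
log("ERROR", "database connection failed", retries=3)
```

Because each line is a self-contained JSON object, the node-level collector can parse it without any custom regex.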

Docker Log Drivers

Log Driver Role: Docker's daemon aggregates the stdout and stderr streams from containers using configured log drivers. Common drivers include json-file (default) and journald.
Log Driver   Storage               Best For
json-file    JSON files on disk    Default, simple deployments
journald     systemd journal       systemd-based systems
syslog       Syslog daemon         Integration with existing syslog infrastructure
none         Disabled              When using external log collectors
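As a sketch, the default json-file driver's rotation behavior can be configured cluster-wide in Docker's daemon.json (the size and file-count values below are illustrative):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

Without rotation limits, a chatty container can fill the node's disk with its JSON log file.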

The Five Critical Principles

1. Persistence

Problem: Containers are ephemeral. When a container restarts, its logs are lost unless persisted externally.

Solution: Logs must be collected and saved outside the container runtime to survive restarts and terminations.

kubectl logs Limitation: The built-in kubectl logs --previous command only shows logs from the immediately previous container instance. It cannot trace a chain of multiple restarts, making it insufficient for production debugging.
# kubectl logs limitations

# View current container logs
kubectl logs my-pod

# View previous container logs (only one restart back)
kubectl logs my-pod --previous

# Problem: If pod has restarted 5 times, you can only see the last restart
# Restarts 1, 2, 3, 4 are LOST forever without external log persistence

# Solution: Use a centralized logging system that captures logs
# from all containers before they restart

2. Aggregation and Centralization

Problem: In a Kubernetes cluster, logs are scattered across multiple nodes, pods, and containers.

Solution: Collect logs from all sources into a central location:

  • Application logs from all pods
  • System logs from worker nodes (OS-level)
  • Control plane component logs (API server, scheduler, controller manager)
  • kubelet and container runtime logs

3. Metadata Enrichment

Problem: Raw container logs don't include context about which pod, namespace, or replica generated them.

Solution: Add crucial metadata to each log entry:

  • Pod Name: my-app-deployment-abc123
  • Namespace: production
  • Container Name: nginx
  • Node Name: worker-node-1
  • Labels: app=my-app, version=v2.1, env=prod
Without Metadata: If you have 10 replicas of the same application, their logs are completely indistinguishable. You cannot debug which specific replica had an issue, making troubleshooting impossible.
# Log WITHOUT metadata (useless in production)
2025-01-15 10:30:45 ERROR: Database connection failed

# Log WITH metadata (actionable)
{
  "timestamp": "2025-01-15T10:30:45Z",
  "level": "ERROR",
  "message": "Database connection failed",
  "kubernetes": {
    "namespace": "production",
    "pod_name": "my-app-deployment-7f9c4b8d-xk2pq",
    "container_name": "app",
    "node_name": "worker-node-2",
    "labels": {
      "app": "my-app",
      "version": "v2.1",
      "environment": "production"
    }
  }
}

# Now you can:
# - Identify the exact pod that failed
# - Correlate with metrics from that specific pod
# - Check if other pods on the same node are affected
# - Filter by version to see if it's a deployment issue

4. Parsing and Structuring

Problem: Many applications output unstructured text logs that are difficult to search and analyze.

Solution: Parse raw text into structured formats (typically JSON) to enable powerful queries and filtering.

# Unstructured log (hard to query)
"2025-01-15 10:30:45 [ERROR] user=john action=login ip=192.168.1.100 result=failed"

# Structured log (easily queryable)
{
  "timestamp": "2025-01-15T10:30:45Z",
  "level": "ERROR",
  "user": "john",
  "action": "login",
  "ip": "192.168.1.100",
  "result": "failed"
}

# Now you can easily query:
# - All failed logins: level=ERROR AND action=login AND result=failed
# - All actions by user: user=john
# - All traffic from IP: ip=192.168.1.100
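The transformation above is what a parser stage does mechanically. A minimal Python sketch of parsing that exact line format (the pattern is illustrative and specific to this example, not a general log grammar):

```python
import re

LINE = '2025-01-15 10:30:45 [ERROR] user=john action=login ip=192.168.1.100 result=failed'

# Matches "<date> <time> [<level>] key=value key=value ..."
PATTERN = re.compile(r'^(?P<date>\S+) (?P<time>\S+) \[(?P<level>\w+)\] (?P<rest>.*)$')

def parse(line):
    """Parse one log line into a structured dict; key=value pairs become fields."""
    m = PATTERN.match(line)
    if m is None:
        return {"message": line}  # fall back to raw text if the line doesn't match
    entry = {
        "timestamp": f"{m.group('date')}T{m.group('time')}Z",
        "level": m.group('level'),
    }
    for pair in m.group('rest').split():
        key, _, value = pair.partition('=')
        entry[key] = value
    return entry

print(parse(LINE))
```

In practice this work is delegated to the log collector's parser plugins rather than written by hand, but the logic is the same: regex match, then field extraction.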

5. Filtering and Optimization

Problem: Collecting every single log message can overwhelm your logging system and generate massive costs.

Solution: Implement strict filtering to collect only relevant logs (WARNING, ERROR, CRITICAL levels) and drop verbose DEBUG/INFO messages in production.

Cost Warning: Without proper filtering, logging infrastructure costs can exceed the cost of running your actual application. Filter logs before they reach storage to prevent resource exhaustion.
# Example filtering strategy

# Development environment: Collect everything
log_level: DEBUG

# Staging environment: Drop DEBUG
log_level: INFO

# Production environment: Only warnings and errors
log_level: WARNING

# Example filter configuration (Fluent Bit)
[FILTER]
    Name   grep
    Match  *
    Regex  level (WARNING|ERROR|CRITICAL)

# This drops all DEBUG and INFO logs before sending to storage
# Reduces log volume by 80-90% in typical applications
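The grep filter above is just a severity threshold. The same logic as a Python sketch (the numeric level values mirror common logging conventions and are illustrative):

```python
# Severity-based filter: keep only entries at or above the threshold
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}

def keep(entry, threshold="WARNING"):
    """Return True if the log entry's level meets the severity threshold."""
    return LEVELS.get(entry.get("level"), 0) >= LEVELS[threshold]

logs = [
    {"level": "DEBUG", "message": "cache hit"},
    {"level": "INFO", "message": "request served"},
    {"level": "ERROR", "message": "db connection failed"},
]

kept = [e for e in logs if keep(e)]
print(kept)  # only the ERROR entry survives
```

Applying this before logs leave the node is what keeps storage and network costs proportional to the logs you actually need.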

Lesson 2: Logging Architectures & Challenges

Understanding the different approaches to collecting logs in Kubernetes and the challenges each presents.

Three Logging Patterns

Node-Level Logging Agent

Approach: Run a logging agent on each node as a DaemonSet

Pros:

  • Most common and recommended pattern
  • One agent per node (low overhead)
  • Automatic for all pods on node
  • No app modifications needed

Cons:

  • Only works with stdout/stderr
  • Agent must be privileged

Sidecar Container Pattern

Approach: Add a logging container to each pod

Pros:

  • Can handle multiple log streams
  • App-specific log processing
  • Works with file-based logs

Cons:

  • High resource overhead (many agents)
  • More complex configuration
  • Logs written to disk first

Direct Application Logging

Approach: Application sends logs directly to logging backend

Pros:

  • No intermediate agent needed
  • Full control over log format
  • Can add custom application context

Cons:

  • Requires application code changes
  • Tight coupling to logging backend
  • No logs if backend is unavailable
  • kubectl logs doesn't work

Recommended: Node-Level DaemonSet

Best Practice: Use a node-level logging agent deployed as a DaemonSet. This provides the best balance of simplicity, efficiency, and automatic coverage for all pods.
# Example: Fluent Bit DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.0
        volumeMounts:
        # Read container logs
        - name: varlog
          mountPath: /var/log
        # Read pod metadata
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        # Configuration
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config

The Ephemeral Nature Challenge

Core Challenge: Kubernetes pods and containers are designed to be ephemeral - they start, stop, restart, and terminate frequently. This makes logging significantly more complex than traditional server environments.

What Happens During a Pod Lifecycle

Pod Starts

Container begins writing logs to stdout/stderr → the container runtime writes them to files exposed under /var/log/containers/

Application Runs

Logs accumulate on disk → Logging agent reads and forwards to central storage

Container Crashes (OOM, panic, etc.)

Container terminates → Log files remain temporarily on disk

Pod Restarts

New container starts → Old log files eventually deleted by kubelet → Without external logging, restart history is LOST

Kubelet Log Retention: By default, kubelet keeps logs from terminated containers for a limited time and with size limits:
  • containerLogMaxSize: Default 10Mi per container log file
  • containerLogMaxFiles: Default 5 rotated files
  • Maximum ~50Mi of logs per container before rotation/deletion
Once these limits are exceeded, older logs are permanently lost without external collection.

Log Volume Challenges

Scale Problem: In a production Kubernetes cluster, log volume can quickly become overwhelming:
# Example calculation for a medium cluster:

# Cluster size:
# - 50 nodes
# - 500 pods (average 10 pods per node)
# - Each pod generates 100 lines/second

# Log volume:
500 pods × 100 lines/sec = 50,000 lines/second
                         = 3,000,000 lines/minute
                         = 180,000,000 lines/hour
                         = 4,320,000,000 lines/day (4.3 billion!)

# At ~500 bytes per log line:
4.32 billion × 500 bytes = 2.16 TB/day of raw logs

# With metadata enrichment (+30% size):
2.16 TB × 1.3 = 2.8 TB/day

# Storage cost (at $0.10/GB/month):
2.8 TB/day × 30 days × $0.10/GB = $8,400/month JUST for storage

# This is why filtering is critical!
Resource Consumption Warning: Without proper filtering and optimization, your logging infrastructure can easily consume more resources (CPU, memory, storage, network) and cost more money than your actual application workloads!
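The back-of-the-envelope estimate above is easy to reproduce and adapt to your own cluster; all inputs below are the illustrative numbers from the example:

```python
# Reproduce the log-volume cost estimate (all inputs are illustrative)
pods = 500
lines_per_sec = 100
bytes_per_line = 500
metadata_overhead = 1.3       # +30% after metadata enrichment
price_per_gb_month = 0.10     # assumed storage price

lines_per_day = pods * lines_per_sec * 86_400
raw_tb_per_day = lines_per_day * bytes_per_line / 1e12
enriched_tb_per_day = raw_tb_per_day * metadata_overhead
monthly_cost = enriched_tb_per_day * 30 * 1000 * price_per_gb_month  # TB -> GB

print(f"{lines_per_day:,} lines/day")          # 4,320,000,000 lines/day
print(f"{raw_tb_per_day:.2f} TB/day raw")      # 2.16 TB/day raw
print(f"~${monthly_cost:,.0f}/month storage")
```

Plug in your own pod count and line rate; the conclusion rarely changes: unfiltered log volume grows linearly with replicas and dominates costs quickly.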

Multi-Source Log Collection

Complete Kubernetes Logging Sources

Application Logs:

  • Container stdout/stderr (primary source)
  • Application-specific log files (if using sidecar)

System Logs:

  • kubelet logs (systemd journal or /var/log/)
  • Container runtime logs (Docker/containerd)
  • Operating system logs (syslog, kernel)

Control Plane Logs:

  • kube-apiserver (API requests, authentication, authorization)
  • kube-scheduler (scheduling decisions)
  • kube-controller-manager (reconciliation loops)
  • etcd (cluster state database)

Audit Logs:

  • Kubernetes audit events (who did what, when)
  • Security and compliance tracking
Best Practice: Start by collecting application logs (stdout/stderr) from all pods. Add system and control plane logs once your basic logging infrastructure is stable. Audit logs can be added last for compliance requirements.

Lesson 3: Loki - Modern Logging for Kubernetes

Grafana Loki is a modern, lightweight logging system designed specifically for cloud-native environments. It offers a simpler, more cost-effective alternative to traditional logging stacks.

What is Loki?

Loki: An open-source log aggregation system from Grafana Labs, often described as "Prometheus for logs." It's designed to be efficient, cost-effective, and integrate seamlessly with existing Prometheus and Grafana deployments.
Key Philosophy: Unlike traditional logging systems that index the full text of every log line, Loki only indexes metadata (labels). This makes it significantly more efficient and less resource-intensive.

Why "Prometheus for Logs"?

Prometheus (Metrics)

  • Indexes metrics by labels
  • Stores time-series data in TSDB
  • Uses PromQL for queries
  • Pull-based scraping model
  • Lightweight and efficient
  • Short-term retention (weeks)

Loki (Logs)

  • Indexes logs by labels (not content!)
  • Stores log data in TSDB
  • Uses LogQL for queries
  • Push-based ingestion model
  • Lightweight and efficient
  • Short-medium retention (weeks)

Loki Architecture

Three-Component Design

1. Promtail (Log Collector Agent):

  • Runs as DaemonSet on each node
  • Discovers log files automatically
  • Adds Kubernetes metadata as labels
  • Pushes logs to Loki server

2. Loki (Storage & Query Engine):

  • Receives logs from Promtail
  • Indexes only labels (not log content)
  • Stores log data in chunks (compressed)
  • Serves queries via HTTP API

3. Grafana (Visualization):

  • Queries Loki using LogQL
  • Displays logs alongside metrics
  • Provides correlation between logs and metrics
  • Same dashboard for everything

Installing Loki Stack

# Install Loki stack using Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create namespace
kubectl create namespace logging

# Install Loki stack (includes Loki, Promtail, and Grafana)
helm install loki grafana/loki-stack \
  --namespace logging \
  --set grafana.enabled=true \
  --set prometheus.enabled=true \
  --set prometheus.alertmanager.persistentVolume.enabled=false \
  --set prometheus.server.persistentVolume.enabled=false \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi

# Get Grafana password
kubectl get secret -n logging loki-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

# Port forward to Grafana
kubectl port-forward -n logging svc/loki-grafana 3000:80

# Open http://localhost:3000
# Username: admin

LogQL Query Language

LogQL: Loki's query language, inspired by PromQL. It allows you to filter, parse, and aggregate log data using label selectors and pipeline operators.
# Basic LogQL queries

# Select all logs from a specific namespace
{namespace="production"}

# Filter by multiple labels
{namespace="production", app="my-app"}

# Filter by pod name pattern (regex)
{namespace="production", pod=~"my-app-.*"}

# Filter log content (search for text)
{namespace="production"} |= "error"

# Exclude log content
{namespace="production"} != "debug"

# Case-insensitive search
{namespace="production"} |~ "(?i)error"

# Parse JSON logs
{namespace="production"} | json

# Extract specific field from JSON
{namespace="production"} | json | line_format "{{.message}}"

# Count log lines per second
rate({namespace="production"}[5m])

# Count errors per minute
sum(rate({namespace="production"} |= "ERROR" [1m]))

# Top 10 pods by log volume
topk(10, sum by (pod) (rate({namespace="production"}[5m])))

Labels: The Key to Efficiency

Loki's Secret Sauce: By indexing only labels (not log content), Loki achieves massive efficiency gains:
  • 10-50x less storage than Elasticsearch
  • Significantly lower memory and CPU usage
  • Faster query performance for label-based searches
  • Lower operational costs
# Loki automatically adds Kubernetes labels from Promtail:
{
  "job": "kubernetes-pods",
  "namespace": "production",
  "pod": "my-app-7f9c4b8d-xk2pq",
  "container": "app",
  "node_name": "worker-node-2",
  "app": "my-app",
  "version": "v2.1"
}

# These labels are indexed for fast filtering
# The actual log content is stored compressed, not indexed

Loki vs. Elasticsearch

Feature            Loki                                             Elasticsearch
Indexing           Labels only (metadata)                           Full-text indexing of all log content
Storage            10-50x less (TSDB, compressed chunks)            High (inverted indices for all text)
Resource Usage     Low (single binary, minimal memory)              High (Java, large heap, multiple nodes)
Query Speed        Fast for label queries, slower for text search   Fast for full-text search
Setup Complexity   Simple (like Prometheus)                         Complex (cluster, sharding, replicas)
Integration        Native Grafana (with metrics)                    Kibana (separate tool)
Best For           Kubernetes, cloud-native, cost-sensitive         Complex text search, compliance, large teams

Limitations of Loki

TSDB Retention Limitation: Loki stores log data in a Time Series Database (TSDB), which is not ideal for very long-term retention (2+ weeks). TSDB performance degrades with age of data.
Workarounds for Long-Term Storage:
  • Use Loki for recent logs (1-2 weeks)
  • Export older logs to object storage (S3, GCS)
  • Use Elasticsearch or similar for compliance/audit logs requiring long retention
  • Configure Loki with object storage backend (S3, GCS, Azure Blob) for better retention

Grafana Integration Benefits

Unified Observability: The biggest advantage of Loki is seamless integration with Grafana, allowing you to view logs alongside metrics in the same dashboard.
# Example: Correlating logs and metrics in Grafana
# Scenario: High error rate alert fired

# Panel 1: Error rate metric (Prometheus)
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Panel 2: Error logs (Loki) - same time range
{namespace="production", app="my-app"} |= "ERROR"

# You can:
# 1. See the error rate spike in the metric graph
# 2. Immediately view the actual error messages below
# 3. Click on a log line to see full context
# 4. All in the same dashboard, same time range
# 5. No context switching between tools!

Lesson 4: EFKB Stack (Fluent Bit, Elasticsearch, Kibana)

The EFKB stack represents the evolution of traditional logging systems, offering powerful full-text search capabilities and extensive customization for complex logging requirements.

Evolution: ELK → EFK → EFKB

ELK Stack (Original)

Components: Elasticsearch, Logstash, Kibana

Problem: Logstash is Java-based and extremely resource-intensive (high CPU and memory usage), making it expensive to run at scale.

EFK Stack (First Evolution)

Components: Elasticsearch, Fluentd, Kibana

Improvement: Replaced Logstash with Fluentd (Ruby-based), which is significantly lighter and more efficient.

Remaining Issue: Fluentd still had non-trivial resource overhead for very large deployments.

EFKB Stack (Modern)

Components: Elasticsearch, Fluent Bit, Kibana (+ Optional Backend)

Breakthrough: Fluent Bit (written in C) is extremely lightweight and high-performance, making it ideal for Kubernetes environments where it runs on every node.

Why Fluent Bit?

Performance Champion: Fluent Bit is significantly more efficient than its predecessors:
  • Memory: ~450KB footprint (vs. 40-60MB for Fluentd, 200MB+ for Logstash)
  • CPU: Minimal CPU usage due to C implementation
  • Speed: Can process millions of logs per second per node
  • Written in C: Direct system calls, no interpreter overhead
Log Collector   Language       Memory Footprint   Best Use Case
Logstash        Java (JRuby)   200MB - 1GB+       Complex transformations, legacy systems
Fluentd         Ruby (CRuby)   40-60MB            Moderate scale, plugin ecosystem
Fluent Bit      C              ~450KB             High scale, Kubernetes, edge devices

Fluent Bit Architecture

Modular Pipeline Design

Fluent Bit processes logs through a series of modular stages:

1. Input (Collection)

Collects logs from various sources:

  • tail: Read from log files (like tail -f)
  • systemd: Read from systemd journal
  • tcp/udp: Listen on network ports
  • kubernetes: Automatically discover and read pod logs
  • docker: Read from Docker container logs

2. Parser (Structuring)

Converts raw text into structured data:

  • json: Parse JSON-formatted logs
  • regex: Custom parsing with regular expressions
  • apache/nginx: Parse web server logs
  • docker: Parse Docker JSON logs
  • syslog: Parse syslog format
# Example: Parsing Nginx logs with Fluent Bit

# Raw log line:
192.168.1.100 - - [15/Jan/2025:10:30:45 +0000] "GET /api/users HTTP/1.1" 200 1234

# Fluent Bit parser configuration (named capture groups become fields):
[PARSER]
    Name        nginx
    Format      regex
    Regex       ^(?<remote>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+) (?<path>[^\"]*) \S*" (?<code>[^ ]*) (?<size>[^ ]*)$
    Time_Key    time
    Time_Format %d/%b/%Y:%H:%M:%S %z

3. Filter (Processing & Enrichment)

Modifies, enriches, or drops logs:

  • kubernetes: Add pod name, namespace, labels
  • grep: Filter logs based on content (keep/drop)
  • modify: Add, remove, or rename fields
  • nest: Restructure nested JSON
  • throttle: Rate-limit log output
# Example filter configuration

# Add Kubernetes metadata
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Merge_Log           On
    K8S-Logging.Parser  On

# Drop debug logs
[FILTER]
    Name     grep
    Match    *
    Exclude  level DEBUG

# Add custom field
[FILTER]
    Name   modify
    Match  *
    Add    environment production
    Add    cluster us-east-1

4. Buffer (Reliability)

Handles temporary storage for reliability:

  • Memory buffer: Fast, but lost on crash
  • Filesystem buffer: Persistent, survives restarts
  • Backpressure handling: Slows input when backend is unavailable
Buffer Configuration Trade-off:
  • Memory buffer: Faster but logs lost if Fluent Bit crashes
  • Filesystem buffer: Survives crashes but slower and uses disk I/O
For production, use filesystem buffer to prevent log loss.
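A filesystem buffer is enabled through Fluent Bit's storage options. A sketch of the relevant settings (the storage path and memory limit are illustrative values):

```ini
[SERVICE]
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 5M

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    storage.type  filesystem
```

With storage.type set to filesystem on an input, chunks that cannot be delivered are persisted to disk and replayed after a crash or restart instead of being lost.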

5. Routing/Output (Delivery)

Sends processed logs to destination(s):

  • elasticsearch: Send to Elasticsearch cluster
  • kafka: Send to Apache Kafka
  • http: POST to HTTP endpoint
  • s3: Store in AWS S3
  • stdout: Print to console (debugging)
  • loki: Send to Grafana Loki
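Because each [OUTPUT] section has its own Match rule, the same log stream can be routed to several backends at once. A sketch (hostnames are illustrative, assuming in-cluster Services named elasticsearch and loki):

```ini
# Route all logs to two backends simultaneously
[OUTPUT]
    Name   es
    Match  *
    Host   elasticsearch.logging.svc
    Port   9200

[OUTPUT]
    Name   loki
    Match  *
    Host   loki.logging.svc
    Port   3100
    Labels job=fluent-bit
```

This is the mechanism behind hybrid setups: recent logs queried in Loki, the same stream archived in Elasticsearch for long-term retention.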

Complete Fluent Bit Configuration Example

# Fluent Bit ConfigMap for Kubernetes
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Daemon       Off
        Log_Level    info

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     5MB

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Merge_Log           On
        K8S-Logging.Parser  On

    [FILTER]
        Name   grep
        Match  *
        Regex  level (WARNING|ERROR|CRITICAL)

    [OUTPUT]
        Name             es
        Match            *
        Host             elasticsearch.logging.svc
        Port             9200
        Index            kubernetes-logs
        Type             _doc
        Logstash_Format  On
        Retry_Limit      5

Elasticsearch & Kibana

Elasticsearch: A distributed search and analytics engine that stores and indexes logs. Provides powerful full-text search and aggregation capabilities.
Kibana: The web interface for Elasticsearch. Allows you to search logs, create visualizations, and build dashboards.
Kibana Features:
  • Discover: Search and filter logs with advanced queries
  • Visualize: Create charts, graphs, and metrics
  • Dashboard: Combine visualizations into comprehensive dashboards
  • Dev Tools: Run Elasticsearch queries directly
  • Alerting: Set up alerts based on log patterns

When to Use EFKB vs. Loki

Use Case                             Recommended   Reason
Small-medium Kubernetes clusters     Loki          Lower resource usage, simpler setup, Grafana integration
Full-text search requirements        EFKB          Elasticsearch excels at complex text queries
Long-term retention (months/years)   EFKB          Elasticsearch handles long retention better than TSDB
Compliance & audit logs              EFKB          Better for retention, immutability, and compliance features
Cost-sensitive environments          Loki          10-50x less storage and compute costs
Already using Prometheus/Grafana     Loki          Unified observability in single tool
Multiple data sources beyond logs    EFKB          Elasticsearch handles diverse data types
Hybrid Approach: Many organizations use both:
  • Loki: For recent application logs (1-2 weeks) and debugging
  • EFKB: For long-term retention, compliance, and audit logs
Fluent Bit can send logs to multiple destinations simultaneously!

Final Quiz

Test your knowledge of Kubernetes logging!

Question 1: Where must containerized applications write their logs?

a) To /var/log/ directory inside the container
b) To standard output (stdout) and standard error (stderr)
c) Directly to Elasticsearch
d) To a mounted volume shared with the host

Question 2: Why is external log persistence critical in Kubernetes?

a) kubectl logs automatically backs up all logs
b) Containers are ephemeral and logs are lost when containers restart without external collection
c) Kubernetes doesn't support internal logging
d) Docker doesn't capture stdout/stderr

Question 3: Why is metadata enrichment essential for Kubernetes logs?

a) It makes logs larger and easier to see
b) Without pod name, namespace, and labels, logs from multiple replicas are indistinguishable and debugging is impossible
c) Kubernetes requires metadata for compliance
d) Metadata is only needed for billing purposes

Question 4: Why is Loki described as "Prometheus for logs"?

a) It's written by the same people
b) It indexes logs by labels (like Prometheus metrics), uses TSDB storage, and has similar architecture with LogQL mirroring PromQL
c) It replaces Prometheus
d) It only works with Prometheus

Question 5: What is Fluent Bit's main advantage over Logstash and Fluentd?

a) It has more features
b) Written in C, it's significantly more lightweight and high-performance (~450KB vs 40-200MB+ footprint)
c) It only works with Kubernetes
d) It's easier to configure

Question 6: What does the Filter stage in Fluent Bit do?

a) Sends logs to Elasticsearch
b) Selectively drops unwanted logs, enriches with metadata, and modifies log fields before sending to storage
c) Parses JSON logs
d) Stores logs temporarily

Question 7: What is a key limitation of Loki's TSDB storage?

a) It cannot store JSON logs
b) TSDB is not ideal for very long-term retention (2+ weeks) due to performance degradation with age
c) It requires Elasticsearch backend
d) It only works with Grafana

Question 8: Why is strict log filtering critical in production?

a) To make logs more colorful
b) Without filtering (collecting only WARNING/ERROR), logging infrastructure costs and resource consumption can exceed the actual application
c) Filtering is only for compliance
d) Kubernetes requires filtering
Quiz Complete!
All correct answers are option 'b'. These logging principles are essential for operating production Kubernetes clusters. Remember: proper logging with persistence, aggregation, metadata, parsing, and filtering is critical for debugging and maintaining cloud-native applications!