Lesson 1: Core Principles of Kubernetes Logging
Logging in Kubernetes is fundamentally different from traditional server environments due to the ephemeral nature of containers. Understanding these core principles is essential for building a robust logging system.
The Container Logging Model
# Traditional logging (WRONG for containers)
# Application writes to /var/log/app.log
echo "Log message" >> /var/log/app.log
# Container logging (CORRECT)
# Application writes to stdout/stderr
echo "Log message" # Goes to stdout
echo "Error message" >&2 # Goes to stderr
# Docker daemon captures these streams
# and handles them with configured log drivers
Docker Log Drivers
| Log Driver | Storage | Best For |
|---|---|---|
| json-file | JSON files on disk | Default, simple deployments |
| journald | systemd journal | systemd-based systems |
| syslog | Syslog daemon | Integration with existing syslog infrastructure |
| none | Disabled | When using external log collectors |
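For the default json-file driver, it is worth configuring rotation so raw log files don't fill the node's disk. A sketch of /etc/docker/daemon.json using Docker's documented log options (the size values are illustrative):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

With these options the daemon rotates each container's JSON log file at 10 MB and keeps at most 3 files per container.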
The Five Critical Principles
1. Persistence
Problem: Containers are ephemeral. When a container restarts, its logs are lost unless persisted externally.
Solution: Logs must be collected and saved outside the container runtime to survive restarts and terminations.
The kubectl logs --previous command shows logs only from the immediately previous container instance. It cannot trace a chain of multiple restarts, making it insufficient on its own for production debugging.
# kubectl logs limitations
# View current container logs
kubectl logs my-pod
# View previous container logs (only one restart back)
kubectl logs my-pod --previous
# Problem: If pod has restarted 5 times, you can only see the last restart
# Restarts 1, 2, 3, 4 are LOST forever without external log persistence
# Solution: Use a centralized logging system that captures logs
# from all containers before they restart
2. Aggregation and Centralization
Problem: In a Kubernetes cluster, logs are scattered across multiple nodes, pods, and containers.
Solution: Collect logs from all sources into a central location:
- Application logs from all pods
- System logs from worker nodes (OS-level)
- Control plane component logs (API server, scheduler, controller manager)
- kubelet and container runtime logs
3. Metadata Enrichment
Problem: Raw container logs don't include context about which pod, namespace, or replica generated them.
Solution: Add crucial metadata to each log entry:
- Pod Name: my-app-deployment-abc123
- Namespace: production
- Container Name: nginx
- Node Name: worker-node-1
- Labels: app=my-app, version=v2.1, env=prod
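Log collectors add these fields automatically, but if the application itself needs them (for example, to stamp its own structured logs), the Downward API can inject pod metadata as environment variables. A pod spec fragment using standard fieldRef paths:

```yaml
# Pod spec fragment: expose pod metadata to the app as env vars
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```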
# Log WITHOUT metadata (useless in production)
2025-01-15 10:30:45 ERROR: Database connection failed
# Log WITH metadata (actionable)
{
"timestamp": "2025-01-15T10:30:45Z",
"level": "ERROR",
"message": "Database connection failed",
"kubernetes": {
"namespace": "production",
"pod_name": "my-app-deployment-7f9c4b8d-xk2pq",
"container_name": "app",
"node_name": "worker-node-2",
"labels": {
"app": "my-app",
"version": "v2.1",
"environment": "production"
}
}
}
# Now you can:
# - Identify the exact pod that failed
# - Correlate with metrics from that specific pod
# - Check if other pods on the same node are affected
# - Filter by version to see if it's a deployment issue
4. Parsing and Structuring
Problem: Many applications output unstructured text logs that are difficult to search and analyze.
Solution: Parse raw text into structured formats (typically JSON) to enable powerful queries and filtering.
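A short Python sketch of this parsing step. The regex is illustrative, written for key=value style lines like the example that follows:

```python
import json
import re

# Illustrative pattern for lines like:
# "2025-01-15 10:30:45 [ERROR] user=john action=login ip=192.168.1.100 result=failed"
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] (?P<rest>.*)"
)

def parse_line(line: str) -> dict:
    """Turn an unstructured log line into a structured dict."""
    m = PATTERN.match(line)
    if m is None:
        return {"message": line}  # pass unparsable lines through untouched
    record = {
        "timestamp": f"{m.group('date')}T{m.group('time')}Z",
        "level": m.group("level"),
    }
    # Each key=value pair becomes an individually queryable field
    for pair in m.group("rest").split():
        key, _, value = pair.partition("=")
        record[key] = value
    return record

line = "2025-01-15 10:30:45 [ERROR] user=john action=login ip=192.168.1.100 result=failed"
print(json.dumps(parse_line(line), indent=2))
```

In practice a collector-side parser (Fluent Bit, Promtail) does this work, but the transformation is the same.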
# Unstructured log (hard to query)
"2025-01-15 10:30:45 [ERROR] user=john action=login ip=192.168.1.100 result=failed"
# Structured log (easily queryable)
{
"timestamp": "2025-01-15T10:30:45Z",
"level": "ERROR",
"user": "john",
"action": "login",
"ip": "192.168.1.100",
"result": "failed"
}
# Now you can easily query:
# - All failed logins: level=ERROR AND action=login AND result=failed
# - All actions by user: user=john
# - All traffic from IP: ip=192.168.1.100
5. Filtering and Optimization
Problem: Collecting every single log message can overwhelm your logging system and generate massive costs.
Solution: Implement strict filtering to collect only relevant logs (WARNING, ERROR, CRITICAL levels) and drop verbose DEBUG/INFO messages in production.
# Example filtering strategy
# Development environment: Collect everything
log_level: DEBUG
# Staging environment: Drop DEBUG
log_level: INFO
# Production environment: Only warnings and errors
log_level: WARNING
# Example filter configuration (Fluent Bit)
[FILTER]
Name grep
Match *
Regex level (WARNING|ERROR|CRITICAL)
# This drops all DEBUG and INFO logs before sending to storage
# Reduces log volume by 80-90% in typical applications
Lesson 2: Logging Architectures & Challenges
Understanding the different approaches to collecting logs in Kubernetes and the challenges each presents.
Three Logging Patterns
Node-Level Logging Agent
Approach: Run a logging agent on each node as a DaemonSet
Pros:
- Most common and recommended pattern
- One agent per node (low overhead)
- Automatic for all pods on node
- No app modifications needed
Cons:
- Only works with stdout/stderr
- Agent must be privileged
Sidecar Container Pattern
Approach: Add a logging container to each pod
Pros:
- Can handle multiple log streams
- App-specific log processing
- Works with file-based logs
Cons:
- High resource overhead (many agents)
- More complex configuration
- Logs written to disk first
Direct Application Logging
Approach: Application sends logs directly to logging backend
Pros:
- No intermediate agent needed
- Full control over log format
- Can add custom application context
Cons:
- Requires application code changes
- Tight coupling to logging backend
- No logs if backend is unavailable
- kubectl logs doesn't work
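To make the coupling concrete, here is a minimal Python sketch of the direct pattern. The endpoint URL and field names are hypothetical; the key point is that if the backend is down, the ship() call fails and the log entry is gone:

```python
import json
import urllib.request

LOG_ENDPOINT = "http://logging-backend.example:8080/ingest"  # hypothetical backend

def build_entry(level: str, message: str, **context) -> dict:
    """Build a structured log entry with app-supplied context."""
    return {"level": level, "message": message, **context}

def ship(entry: dict) -> None:
    """POST one entry directly to the backend; no local fallback."""
    req = urllib.request.Request(
        LOG_ENDPOINT,
        data=json.dumps(entry).encode(),
        headers={"Content-Type": "application/json"},
    )
    # If the backend is unreachable, this raises and the entry is lost
    urllib.request.urlopen(req, timeout=2)

entry = build_entry("ERROR", "Database connection failed", app="my-app", version="v2.1")
print(json.dumps(entry))
```

Contrast this with writing to stdout, where the node-level agent buffers and retries on the application's behalf.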
Recommended: Node-Level DaemonSet
# Example: Fluent Bit DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: logging
spec:
selector:
matchLabels:
app: fluent-bit
template:
metadata:
labels:
app: fluent-bit
spec:
serviceAccountName: fluent-bit
containers:
- name: fluent-bit
image: fluent/fluent-bit:2.0
volumeMounts:
# Read container logs
- name: varlog
mountPath: /var/log
# Read pod metadata
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
# Configuration
- name: fluent-bit-config
mountPath: /fluent-bit/etc/
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: fluent-bit-config
configMap:
name: fluent-bit-config
The Ephemeral Nature Challenge
What Happens During a Pod Lifecycle
Pod Starts
Container begins writing logs to stdout/stderr → the container runtime stores them as files on the node (exposed via symlinks in /var/log/containers/)
Application Runs
Logs accumulate on disk → Logging agent reads and forwards to central storage
Container Crashes (OOM, panic, etc.)
Container terminates → Log files remain temporarily on disk
Pod Restarts
New container starts → Old log files eventually deleted by kubelet → Without external logging, restart history is LOST
- containerLogMaxSize: Default 10Mi per container log file
- containerLogMaxFiles: Default 5 rotated files
- Maximum ~50Mi of logs per container before rotation/deletion
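These rotation limits are set in the kubelet configuration file. The field names below are the real KubeletConfiguration fields; the values shown are the defaults:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Rotate a container's log file once it reaches this size...
containerLogMaxSize: 10Mi
# ...and keep at most this many rotated files per container
containerLogMaxFiles: 5
```

Raising these limits buys a little more local history, but it is no substitute for external log persistence.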
Log Volume Challenges
# Example calculation for a medium cluster:
# Cluster size:
- 50 nodes
- 500 pods (average 10 pods per node)
- Each pod generates 100 lines/second
# Log volume:
500 pods × 100 lines/sec = 50,000 lines/second
= 3,000,000 lines/minute
= 180,000,000 lines/hour
= 4,320,000,000 lines/day (4.3 billion!)
# At ~500 bytes per log line:
4.3 billion × 500 bytes = 2.15 TB/day of raw logs
# With metadata enrichment (+30% size):
2.15 TB × 1.3 = 2.8 TB/day
# Storage cost (at $0.10/GB/month):
2.8 TB/day × 30 days = 84 TB = 84,000 GB × $0.10 = $8,400/month JUST for storage
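The arithmetic above can be checked with a short Python sketch (using the same round numbers and illustrative price):

```python
pods = 500
lines_per_sec_per_pod = 100
bytes_per_line = 500          # ~500 bytes of raw log text per line
metadata_overhead = 1.3       # +30% after metadata enrichment
price_per_gb_month = 0.10     # $/GB/month, illustrative storage price

lines_per_day = pods * lines_per_sec_per_pod * 60 * 60 * 24
raw_tb_per_day = lines_per_day * bytes_per_line / 1e12
enriched_tb_per_day = raw_tb_per_day * metadata_overhead
# Cost of one month's worth of ingested logs held in storage:
monthly_cost = enriched_tb_per_day * 30 * 1000 * price_per_gb_month

print(f"{lines_per_day:,} lines/day")        # 4,320,000,000
print(f"{enriched_tb_per_day:.2f} TB/day")   # ~2.81
print(f"${monthly_cost:,.0f}/month")         # ~$8,424
```

Dropping DEBUG/INFO before shipping cuts every downstream number in this chain, which is why filtering happens at the collector, not at the storage layer.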
# This is why filtering is critical!
Multi-Source Log Collection
Complete Kubernetes Logging Sources
Application Logs:
- Container stdout/stderr (primary source)
- Application-specific log files (if using sidecar)
System Logs:
- kubelet logs (systemd journal or /var/log/)
- Container runtime logs (Docker/containerd)
- Operating system logs (syslog, kernel)
Control Plane Logs:
- kube-apiserver (API requests, authentication, authorization)
- kube-scheduler (scheduling decisions)
- kube-controller-manager (reconciliation loops)
- etcd (cluster state database)
Audit Logs:
- Kubernetes audit events (who did what, when)
- Security and compliance tracking
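Audit logging is enabled by pointing the API server at an audit policy file. A minimal sketch using the real audit.k8s.io/v1 API:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record who did what and when (request metadata, no request bodies)
  - level: Metadata
```

Production policies are usually more selective, raising the level for sensitive resources and dropping high-volume read-only requests.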
Lesson 3: Loki - Modern Logging for Kubernetes
Grafana Loki is a modern, lightweight logging system designed specifically for cloud-native environments. It offers a simpler, more cost-effective alternative to traditional logging stacks.
What is Loki?
Why "Prometheus for Logs"?
Prometheus (Metrics)
- Indexes metrics by labels
- Stores time-series data in TSDB
- Uses PromQL for queries
- Pull-based scraping model
- Lightweight and efficient
- Short-term retention (weeks)
Loki (Logs)
- Indexes logs by labels (not content!)
- Stores log data as compressed chunks with a TSDB-style label index
- Uses LogQL for queries
- Push-based ingestion model
- Lightweight and efficient
- Short-medium retention (weeks)
Loki Architecture
Three-Component Design
1. Promtail (Log Collector Agent):
- Runs as DaemonSet on each node
- Discovers log files automatically
- Adds Kubernetes metadata as labels
- Pushes logs to Loki server
2. Loki (Storage & Query Engine):
- Receives logs from Promtail
- Indexes only labels (not log content)
- Stores log data in chunks (compressed)
- Serves queries via HTTP API
3. Grafana (Visualization):
- Queries Loki using LogQL
- Displays logs alongside metrics
- Provides correlation between logs and metrics
- Same dashboard for everything
Installing Loki Stack
# Install Loki stack using Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Create namespace
kubectl create namespace logging
# Install Loki stack (includes Loki, Promtail, and Grafana)
helm install loki grafana/loki-stack \
--namespace logging \
--set grafana.enabled=true \
--set prometheus.enabled=true \
--set prometheus.alertmanager.persistentVolume.enabled=false \
--set prometheus.server.persistentVolume.enabled=false \
--set loki.persistence.enabled=true \
--set loki.persistence.size=10Gi
# Get Grafana password
kubectl get secret -n logging loki-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode ; echo
# Port forward to Grafana
kubectl port-forward -n logging svc/loki-grafana 3000:80
# Open http://localhost:3000
# Username: admin
LogQL Query Language
# Basic LogQL queries
# Select all logs from a specific namespace
{namespace="production"}
# Filter by multiple labels
{namespace="production", app="my-app"}
# Filter by pod name pattern (regex)
{namespace="production", pod=~"my-app-.*"}
# Filter log content (search for text)
{namespace="production"} |= "error"
# Exclude log content
{namespace="production"} != "debug"
# Case-insensitive search
{namespace="production"} |~ "(?i)error"
# Parse JSON logs
{namespace="production"} | json
# Extract specific field from JSON
{namespace="production"} | json | line_format "{{.message}}"
# Count log lines per second
rate({namespace="production"}[5m])
# Count errors per minute
sum(rate({namespace="production"} |= "ERROR" [1m]))
# Top 10 pods by log volume
topk(10,
sum by (pod) (rate({namespace="production"}[5m]))
)
Labels: The Key to Efficiency
Because Loki indexes only labels and never the log text itself, it achieves:
- 10-50x less storage than Elasticsearch
- Significantly lower memory and CPU usage
- Faster query performance for label-based searches
- Lower operational costs
# Loki automatically adds Kubernetes labels from Promtail:
{
"job": "kubernetes-pods",
"namespace": "production",
"pod": "my-app-7f9c4b8d-xk2pq",
"container": "app",
"node_name": "worker-node-2",
"app": "my-app",
"version": "v2.1"
}
# These labels are indexed for fast filtering
# The actual log content is stored compressed, not indexed
Loki vs. Elasticsearch
| Feature | Loki | Elasticsearch |
|---|---|---|
| Indexing | Labels only (metadata) | Full-text indexing of all log content |
| Storage | 10-50x less (TSDB, compressed chunks) | High (inverted indices for all text) |
| Resource Usage | Low (single binary, minimal memory) | High (Java, large heap, multiple nodes) |
| Query Speed | Fast for label queries, slower for text search | Fast for full-text search |
| Setup Complexity | Simple (like Prometheus) | Complex (cluster, sharding, replicas) |
| Integration | Native Grafana (with metrics) | Kibana (separate tool) |
| Best For | Kubernetes, cloud-native, cost-sensitive | Complex text search, compliance, large teams |
Limitations of Loki
Loki trades full-text indexing for efficiency: content searches over long time ranges are slower than in Elasticsearch, and its storage model is not designed for multi-year retention. Practical guidance:
- Use Loki for recent logs (1-2 weeks)
- Export older logs to object storage (S3, GCS)
- Use Elasticsearch or similar for compliance/audit logs requiring long retention
- Configure Loki with object storage backend (S3, GCS, Azure Blob) for better retention
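As a rough illustration of the last point, Loki is pointed at object storage in its configuration file. The exact schema varies between Loki versions, so treat this fragment as a sketch rather than a working config:

```yaml
# loki.yaml fragment (illustrative; exact keys depend on Loki version)
storage_config:
  aws:
    s3: s3://us-east-1/my-loki-chunks   # bucket holding compressed chunks
compactor:
  retention_enabled: true               # prune data past the retention period
limits_config:
  retention_period: 336h                # keep ~2 weeks queryable in Loki
```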
Grafana Integration Benefits
# Example: Correlating logs and metrics in Grafana
# Scenario: High error rate alert fired
# Panel 1: Error rate metric (Prometheus)
sum(rate(http_requests_total{status=~"5.."}[5m]))
# Panel 2: Error logs (Loki) - same time range
{namespace="production", app="my-app"} |= "ERROR"
# You can:
# 1. See the error rate spike in the metric graph
# 2. Immediately view the actual error messages below
# 3. Click on a log line to see full context
# 4. All in the same dashboard, same time range
# 5. No context switching between tools!
Lesson 4: EFKB Stack (Fluent Bit, Elasticsearch, Kibana)
The EFKB stack represents the evolution of traditional logging systems, offering powerful full-text search capabilities and extensive customization for complex logging requirements.
Evolution: ELK → EFK → EFKB
ELK Stack (Original)
Components: Elasticsearch, Logstash, Kibana
Problem: Logstash is Java-based and extremely resource-intensive (high CPU and memory usage), making it expensive to run at scale.
EFK Stack (First Evolution)
Components: Elasticsearch, Fluentd, Kibana
Improvement: Replaced Logstash with Fluentd (Ruby-based), which is significantly lighter and more efficient.
Remaining Issue: Fluentd still had non-trivial resource overhead for very large deployments.
EFKB Stack (Modern)
Components: Elasticsearch, Fluent Bit, Kibana (+ Optional Backend)
Breakthrough: Fluent Bit (written in C) is extremely lightweight and high-performance, making it ideal for Kubernetes environments where it runs on every node.
Why Fluent Bit?
- Memory: ~450KB footprint (vs. 40-60MB for Fluentd, 200MB+ for Logstash)
- CPU: Minimal CPU usage due to C implementation
- Speed: Can process millions of logs per second per node
- Written in C: Direct system calls, no interpreter overhead
| Log Collector | Language | Memory Footprint | Best Use Case |
|---|---|---|---|
| Logstash | Java (JRuby) | 200MB - 1GB+ | Complex transformations, legacy systems |
| Fluentd | Ruby (CRuby) | 40-60MB | Moderate scale, plugin ecosystem |
| Fluent Bit | C | ~450KB | High scale, Kubernetes, edge devices |
Fluent Bit Architecture
Modular Pipeline Design
Fluent Bit processes logs through a series of modular stages:
1. Input (Collection)
Collects logs from various sources:
- tail: Read from log files (like tail -f)
- systemd: Read from systemd journal
- tcp/udp: Listen on network ports
- kubernetes: Automatically discover and read pod logs
- docker: Read from Docker container logs
2. Parser (Structuring)
Converts raw text into structured data:
- json: Parse JSON-formatted logs
- regex: Custom parsing with regular expressions
- apache/nginx: Parse web server logs
- docker: Parse Docker JSON logs
- syslog: Parse syslog format
# Example: Parsing Nginx logs with Fluent Bit
# Raw log line:
192.168.1.100 - - [15/Jan/2025:10:30:45 +0000] "GET /api/users HTTP/1.1" 200 1234
# Fluent Bit parser configuration:
[PARSER]
Name nginx
Format regex
Regex ^(?<remote>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+) (?<path>[^ ]*) [^"]*" (?<code>[^ ]*) (?<size>[^ ]*)
3. Filter (Processing & Enrichment)
Modifies, enriches, or drops logs:
- kubernetes: Add pod name, namespace, labels
- grep: Filter logs based on content (keep/drop)
- modify: Add, remove, or rename fields
- nest: Restructure nested JSON
- throttle: Rate-limit log output
# Example filter configuration
# Add Kubernetes metadata
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Merge_Log On
K8S-Logging.Parser On
# Drop debug logs
[FILTER]
Name grep
Match *
Exclude level DEBUG
# Add custom field
[FILTER]
Name modify
Match *
Add environment production
Add cluster us-east-1
4. Buffer (Reliability)
Handles temporary storage for reliability:
- Memory buffer: Fast, but lost on crash
- Filesystem buffer: Persistent, survives restarts
- Backpressure handling: Slows input when backend is unavailable
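Filesystem buffering is enabled in the service section and per input, using Fluent Bit's storage.* options (paths and limits here are illustrative):

```ini
[SERVICE]
    # Directory where on-disk buffer chunks are kept
    storage.path              /var/log/flb-storage/
    storage.sync              normal

[INPUT]
    Name                      tail
    Path                      /var/log/containers/*.log
    # Buffer this input's chunks on disk, not only in memory
    storage.type              filesystem

[OUTPUT]
    Name                      es
    Match                     *
    Host                      elasticsearch.logging.svc
    Port                      9200
    # Cap disk usage for chunks queued toward this output
    storage.total_limit_size  1G
```

With this setup, logs buffered during a backend outage survive a Fluent Bit restart and are delivered once the backend recovers.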
5. Routing/Output (Delivery)
Sends processed logs to destination(s):
- elasticsearch: Send to Elasticsearch cluster
- kafka: Send to Apache Kafka
- http: POST to HTTP endpoint
- s3: Store in AWS S3
- stdout: Print to console (debugging)
- loki: Send to Grafana Loki
Complete Fluent Bit Configuration Example
# Fluent Bit ConfigMap for Kubernetes
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[SERVICE]
Flush 5
Daemon Off
Log_Level info
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser docker
Tag kube.*
Refresh_Interval 5
Mem_Buf_Limit 5MB
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Merge_Log On
K8S-Logging.Parser On
[FILTER]
Name grep
Match *
Regex level (WARNING|ERROR|CRITICAL)
[OUTPUT]
Name es
Match *
Host elasticsearch.logging.svc
Port 9200
Index kubernetes-logs
Type _doc
Logstash_Format On
Retry_Limit 5
Elasticsearch & Kibana
- Discover: Search and filter logs with advanced queries
- Visualize: Create charts, graphs, and metrics
- Dashboard: Combine visualizations into comprehensive dashboards
- Dev Tools: Run Elasticsearch queries directly
- Alerting: Set up alerts based on log patterns
When to Use EFKB vs. Loki
| Use Case | Recommended Solution | Reason |
|---|---|---|
| Small-medium Kubernetes clusters | Loki | Lower resource usage, simpler setup, Grafana integration |
| Full-text search requirements | EFKB | Elasticsearch excels at complex text queries |
| Long-term retention (months/years) | EFKB | Elasticsearch handles long retention better than TSDB |
| Compliance & audit logs | EFKB | Better for retention, immutability, and compliance features |
| Cost-sensitive environments | Loki | 10-50x less storage and compute costs |
| Already using Prometheus/Grafana | Loki | Unified observability in single tool |
| Multiple data sources beyond logs | EFKB | Elasticsearch handles diverse data types |
Many teams run both, splitting by retention needs:
- Loki: For recent application logs (1-2 weeks) and debugging
- EFKB: For long-term retention, compliance, and audit logs
Final Quiz
Test your knowledge of Kubernetes logging!
Question 1: Where must containerized applications write their logs?
Question 2: Why is external log persistence critical in Kubernetes?
Question 3: Why is metadata enrichment essential for Kubernetes logs?
Question 4: Why is Loki described as "Prometheus for logs"?
Question 5: What is Fluent Bit's main advantage over Logstash and Fluentd?
Question 6: What does the Filter stage in Fluent Bit do?
Question 7: What is a key limitation of Loki's TSDB storage?
Question 8: Why is strict log filtering critical in production?
These logging principles are essential for operating production Kubernetes clusters. Remember: proper logging with persistence, aggregation, metadata, parsing, and filtering is critical for debugging and maintaining cloud-native applications!