Lesson 1: Categorizing Kubernetes Failures
Understanding how Kubernetes can fail is the first step toward building a resilient cluster. Failures can be categorized into three distinct levels.
The Three Levels of Failure
Physical Level
Infrastructure issues:
- Poor data center quality
- Hardware failures
- Incorrect hardware selection
- Poor architecture planning
- Complex multi-datacenter setups
Organizational Level
Human and management factors:
- Lack of documented processes
- Unclear responsibilities
- Poor topology/documentation
- Inadequate access controls
- Constant "firefighting" mode
Applied (Technical) Level
Technical errors:
- Kubernetes complexity
- Reliance on "hope" or shortcuts
- Configuration errors
- Human factor mistakes
- Missing best practices
Physical Level Issues
- Poor Data Center Quality: Unreliable power, cooling, or network connectivity
- Hardware Failure: Disk failures, memory errors, network interface problems
- Incorrect Hardware Selection: Under-provisioned servers that can't handle the workload
- Poor Load Planning: Not accounting for peak traffic or growth
- Complex Multi-Datacenter: Over-engineered setups that are hard to maintain
Organizational Level Issues
- No Documented Processes: Knowledge exists only in people's heads
- Unclear Responsibilities: No one knows who's responsible for what
- Poor Documentation: Cluster topology and architecture not documented
- Inadequate Access Controls: Junior engineer accidentally deletes critical namespace
- Firefighting Mode: Team always reacting to emergencies instead of preventing them
A classic example: an engineer with overly broad access runs kubectl delete namespace production, taking down the entire production environment because there were no safeguards in place.
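One concrete safeguard for this scenario is RBAC: give day-to-day users a role that simply lacks destructive verbs on namespaces. The sketch below is illustrative (the role name, resource list, and verb choices are assumptions, not from the lesson); it generates such a ClusterRole, which you would then apply and bind to the relevant users.

```shell
# Hypothetical guardrail: a ClusterRole that can manage workloads but
# is read-only on namespaces, so "kubectl delete namespace" is forbidden.
cat <<'EOF' > /tmp/developer-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer-no-ns-delete   # assumed name
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments", "services"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get", "list", "watch"]   # read-only on namespaces
EOF
# Show the verbs that apply to namespaces before applying with kubectl
grep -A1 '"namespaces"' /tmp/developer-role.yaml
```

Apply with kubectl apply -f and a matching ClusterRoleBinding; admin accounts with full rights should be separate and rarely used.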
Applied (Technical) Level Issues
- Kubernetes Complexity: The platform has a steep learning curve with many moving parts
- Hope-Driven Development: Deploying changes with fingers crossed, hoping nothing breaks
- Shortcuts: Skipping proper testing or validation to save time
- Configuration Mistakes: Wrong resource limits, missing health checks, incorrect networking
- Human Error: Typos in YAML, copy-paste mistakes, miscommunication
The Structured Approach
The video presents a systematic approach to disaster recovery:
- Identify common failure modes - Understand what can go wrong
- Structure problems into categories - Organize by level (Physical, Organizational, Technical)
- Create a checklist - Build actionable items to address each category
- Review practical solutions - Implement tools and best practices
Lesson 2: Fault Tolerance & High Availability Setup
Building a bulletproof cluster starts with proper fault tolerance configuration of core Kubernetes components.
etcd and Control Plane High Availability
Recommended Control Plane Setup
| Component | Role | HA Mechanism |
|---|---|---|
| etcd | Distributed key-value store | Maintains quorum (2 of 3 alive) |
| API Server | Kubernetes API endpoint | Load balanced |
| Controller Manager | Manages controllers | Leader election |
| Scheduler | Pod placement decisions | Leader election |
Load Balancer for API Server
Put an external load balancer (or a simple Nginx proxy) in front of the API servers so that kubelet, kube-proxy, and other clients fail over transparently when a master node goes down.
# Example: Simple Nginx load balancer for API servers
# (TCP mode: these directives belong inside nginx's stream {} context)
upstream kubernetes {
    server master1.example.com:6443 max_fails=3 fail_timeout=30s;
    server master2.example.com:6443 max_fails=3 fail_timeout=30s;
    server master3.example.com:6443 max_fails=3 fail_timeout=30s;
}
server {
    listen 6443;
    proxy_pass kubernetes;
    proxy_timeout 10s;
    proxy_connect_timeout 1s;
}
# Kubeconfig points to load balancer
apiVersion: v1
clusters:
- cluster:
    server: https://k8s-lb.example.com:6443
  name: kubernetes

Ingress Controller Redundancy
1. Multiple Ingress Controllers
Deploy more than one Ingress Controller on dedicated worker nodes. Use node selectors or node affinity to ensure they run on different physical hosts.
2. External Load Balancer or VRRP
Use an external load balancer or Virtual Router Redundancy Protocol (VRRP) like Keepalived to ensure traffic reaches an available Ingress Controller host.
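For the VRRP option, here is a hedged keepalived sketch; the interface name, router ID, and virtual IP are assumptions to replace with your own values. The MASTER node holds the floating IP, and a standby host running the same config with state BACKUP and a lower priority takes it over if the master fails.

```shell
# Hypothetical keepalived config for a floating ingress IP (all values assumed)
cat <<'EOF' > /tmp/keepalived-ingress.conf
vrrp_instance INGRESS_VIP {
    state MASTER          # use BACKUP with a lower priority on the standby host
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24  # the address your DNS records point at
    }
}
EOF
grep virtual_ipaddress /tmp/keepalived-ingress.conf
```

In practice this file goes to /etc/keepalived/keepalived.conf on each ingress host, and your DNS A records point at the virtual IP rather than at any single node.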
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      nodeSelector:
        node-role: ingress # Deploy only on dedicated ingress nodes
      containers:
      - name: nginx-ingress-controller
        image: registry.k8s.io/ingress-nginx/controller:v1.8.1
        args:
        - /nginx-ingress-controller
        - --election-id=ingress-controller-leader
        ports:
        - name: http
          containerPort: 80
          hostPort: 80 # Expose directly on the host's port 80
        - name: https
          containerPort: 443
          hostPort: 443 # Expose directly on the host's port 443

Certificate Management
Use kubeadm for installation; it manages the cluster's certificates and provides commands to check expiration and renew them.
# Check certificate expiration
kubeadm certs check-expiration
# Output:
# CERTIFICATE EXPIRES RESIDUAL TIME
# admin.conf Jan 15, 2025 12:00 UTC 364d
# apiserver Jan 15, 2025 12:00 UTC 364d
# apiserver-etcd-client Jan 15, 2025 12:00 UTC 364d
# Renew all certificates
kubeadm certs renew all
# Renew specific certificate
kubeadm certs renew apiserver

High Availability Checklist
| Component | Minimum HA Setup | Failure Tolerance |
|---|---|---|
| etcd | 3 instances | 1 node failure |
| API Server | 3 instances + LB | 2 nodes can fail |
| Controller Manager | 3 instances (leader election) | 2 nodes can fail |
| Scheduler | 3 instances (leader election) | 2 nodes can fail |
| Ingress Controller | 2+ instances + VRRP/LB | 1+ nodes can fail |
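The etcd failure tolerance in the table above comes from the quorum rule: a cluster of N members needs floor(N/2) + 1 alive to keep accepting writes. The arithmetic below (plain bash, no cluster required) also shows why etcd counts should be odd: 4 members tolerate no more failures than 3.

```shell
# Quorum = floor(N/2) + 1; tolerated failures = N - quorum
for n in 1 2 3 4 5 7; do
  q=$(( n / 2 + 1 ))
  echo "members=$n quorum=$q tolerates=$(( n - q )) failures"
done
```

The output shows tolerates=1 for both 3 and 4 members, and tolerates=2 for 5: an even member only adds cost and a wider failure surface.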
Lesson 3: Updates, Maintenance & Backups
Proper procedures for updates and comprehensive backups are essential for disaster recovery.
Host OS Updates (Worker Nodes)
Before updating an operating system, ensure the node's running applications have more than one replica to maintain availability during maintenance.
# Step 1: Verify replicas are running on other nodes
kubectl get pods -o wide | grep my-app
# Step 2: Cordon the node (prevent new pods from scheduling)
kubectl cordon worker-node-1
# Step 3: Drain the node (safely evacuate all pods)
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
# Output shows pods being evicted and rescheduled to other nodes
# Step 4: Perform maintenance (SSH to the node)
ssh worker-node-1
sudo apt update && sudo apt upgrade -y
sudo reboot
# Step 5: After reboot, uncordon the node
kubectl uncordon worker-node-1
# Pods will naturally rebalance during future scheduling

Warning: running kubectl drain without checking whether applications have sufficient replicas on other nodes can cause downtime. Always verify replica distribution first! A PodDisruptionBudget (see Lesson 4) prevents kubectl drain from causing downtime by blocking the drain if it would violate minimum availability requirements.
Cluster Version Updates
- Always test the update process first on a dedicated development/staging cluster that mirrors production
- Always read the changelog (release notes) between versions
- Key names and behavior can change between versions
- Never skip versions (upgrade sequentially: 1.25 → 1.26 → 1.27)
- Back up everything before starting
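The sequential-upgrade rule means a jump from 1.25 to 1.28 is really three separate upgrades. A tiny sketch expanding a minor-version range into the required steps:

```shell
# Expand a minor-version range into the sequential upgrades kubeadm requires
from=25
to=28
cur=$from
while [ "$cur" -lt "$to" ]; do
  echo "upgrade v1.$cur -> v1.$(( cur + 1 ))"
  cur=$(( cur + 1 ))
done
```

Each printed step is a full upgrade cycle: read the changelog, back up, upgrade control plane, then workers.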
# Example: Upgrading kubeadm cluster from 1.27 to 1.28
# 1. Read release notes
# https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.28.md
# 2. Upgrade control plane (first master node)
sudo apt-mark unhold kubeadm && \
sudo apt-get update && sudo apt-get install -y kubeadm=1.28.0-00 && \
sudo apt-mark hold kubeadm
# Plan the upgrade
sudo kubeadm upgrade plan
# Apply the upgrade
sudo kubeadm upgrade apply v1.28.0
# 3. Upgrade kubelet and kubectl on master
sudo apt-mark unhold kubelet kubectl && \
sudo apt-get update && sudo apt-get install -y kubelet=1.28.0-00 kubectl=1.28.0-00 && \
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# 4. Repeat for other master nodes
# sudo kubeadm upgrade node
# 5. Upgrade worker nodes (one at a time)
kubectl drain worker-node-1 --ignore-daemonsets
# SSH to worker, upgrade kubeadm, kubelet, kubectl
# sudo kubeadm upgrade node
kubectl uncordon worker-node-1

Backups - Critical for Disaster Recovery
Backups are critical. While Kubernetes manifests are often in Git, backing up all state is vital.
What to Back Up
1. Manifests & Helm Charts
Method: Git repository (GitOps)
Store all YAML manifests, Helm charts, and Kustomize configurations in version control. This provides history and rollback capability.
git clone git@github.com:company/k8s-manifests.git
cd k8s-manifests
git log --oneline # View deployment history

2. Secrets & Certificates
Method: Encrypted backups or secret management tools
Secrets contain sensitive data (passwords, API keys, TLS certs). Never store in plain text Git!
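The reason plain-text Git is dangerous: Secret values are merely base64-encoded, not encrypted, so anyone who can read the manifest can recover them with no key:

```shell
# base64 is an encoding, not encryption: it reverses without any secret key
encoded=$(printf '%s' 's3cr3t-password' | base64)
echo "stored in manifest: $encoded"
printf '%s' "$encoded" | base64 -d   # recovers s3cr3t-password
echo
```

This is why backups of Secrets must themselves be encrypted, or replaced by one of the tools listed below.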
# Export secrets (for backup only, encrypt!)
kubectl get secrets --all-namespaces -o yaml > secrets-backup.yaml
# Better: Use secret management
# - Sealed Secrets
# - External Secrets Operator
# - HashiCorp Vault

3. Container Images
Method: Backup the container registry
If your registry goes down, you can't deploy or scale. Back up registry storage or use a mirrored/replicated registry.
# Registry backup strategies:
# 1. S3/Object storage replication
# 2. Harbor registry replication
# 3. Mirror to multiple registries
# 4. Export critical images
docker save myapp:v1.0 | gzip > myapp-v1.0.tar.gz

4. Persistent Volumes / Stateful Data
Method: Volume snapshots and database backups
Application data in databases, file storage, etc. This is often the most critical backup.
# CSI Volume Snapshots (covered in storage lesson)
kubectl create -f snapshot.yaml
# Database-specific backups
kubectl exec postgres-0 -- pg_dump mydb > backup.sql
# Velero for automated backups (see below)

Velero - Complete Backup Solution
# Install Velero CLI
brew install velero # or download from GitHub
# Install Velero in cluster (AWS S3 example)
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket velero-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file ./credentials-velero
# Create a backup of entire cluster
velero backup create full-backup-2024-01-15
# Create a backup of specific namespace
velero backup create prod-backup --include-namespaces=production
# Schedule automatic daily backups
velero schedule create daily-backup --schedule="0 2 * * *"
# Restore from backup
velero restore create --from-backup full-backup-2024-01-15
# List backups
velero backup get

Velero's key features:
- Backs up entire cluster or specific namespaces
- Supports persistent volume snapshots (CSI)
- Scheduled automated backups
- Disaster recovery and cluster migration
- Works with AWS, GCP, Azure, and on-premises (S3-compatible)
Lesson 4: Minimizing Human Error - Essential Kubernetes Features
Built-in Kubernetes features can help increase security and minimize the impact of human error.
1. Pod Disruption Budget (PDB)
A PDB limits the number of concurrent voluntary disruptions (like kubectl drain), ensuring critical applications maintain a minimum required replica count during maintenance.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2 # At least 2 pods must remain available
  selector:
    matchLabels:
      app: web-app
---
# Alternative: maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-pdb
spec:
  maxUnavailable: 1 # At most 1 pod can be unavailable
  selector:
    matchLabels:
      app: database

If draining a node would violate the budget, the kubectl drain action is blocked or paused until more replicas are available on other nodes. Note that a PDB only protects against voluntary disruptions; it does not stop a direct kubectl delete pod.
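The drain-blocking behavior reduces to simple arithmetic: with minAvailable, the number of matching pods that may be voluntarily evicted at once is replicas minus minAvailable. A sketch using the web-app numbers from this lesson (3 replicas assumed):

```shell
replicas=3
min_available=2
allowed_disruptions=$(( replicas - min_available ))
echo "allowed voluntary disruptions: $allowed_disruptions"
# Draining a node that holds more than this many matching pods will pause
# until evicted pods are rescheduled and Ready elsewhere.
```

If allowed disruptions is 0 (for example, replicas equal to minAvailable), every drain touching those pods blocks, which is a common operational surprise.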
2. LimitRange and ResourceQuota
These two features work together at the namespace level to enforce resource management, critical for cluster stability.
LimitRange - Default Resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limit-range
  namespace: development
spec:
  limits:
  - default: # Default limits if not specified
      cpu: 500m
      memory: 512Mi
    defaultRequest: # Default requests if not specified
      cpu: 100m
      memory: 128Mi
    max: # Maximum allowed
      cpu: 2
      memory: 2Gi
    min: # Minimum allowed
      cpu: 50m
      memory: 64Mi
    type: Container

ResourceQuota - Namespace Limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: development
spec:
  hard:
    # Compute resources
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    # Object counts
    pods: "50"
    services: "10"
    configmaps: "20"
    persistentvolumeclaims: "10"
    secrets: "20"
    # Storage
    requests.storage: 100Gi

3. PriorityClass - Critical Workload Protection
The Scheduler uses PriorityClass to decide which pods to place first, and under resource pressure, which lower-priority pods to preempt so that critical workloads keep running.
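Only the relative order of the values matters: under contention the scheduler favors, and if necessary preempts for, pods whose class has the higher value. A quick illustration ranking four classes by value:

```shell
# Higher value wins: sort class/value pairs the way the scheduler ranks them
printf '%s\n' \
  'low-priority 1000' \
  'critical-priority 1000000' \
  'medium-priority 10000' \
  'high-priority 100000' |
  sort -k2,2nr
```

Whatever the input order, critical-priority comes out first and low-priority last, which is exactly the eviction order reversed.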
# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-priority
value: 1000000 # Higher number = higher priority
globalDefault: false
description: "Critical production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
description: "Important production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 10000
description: "Staging and QA workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: true
description: "Development and batch jobs"
---
# Use in Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: critical-priority
  containers:
  - name: app
    image: myapp:v1

4. NetworkPolicy - Internal Firewall
NetworkPolicy acts as an internal firewall, controlling network traffic flow between different pods and namespaces.
# Default deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {} # Apply to all pods in namespace
  policyTypes:
  - Ingress
---
# Allow specific traffic: frontend → backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
---
# Allow database access only from backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: backend
    ports:
    - protocol: TCP
      port: 5432

Lesson 5: Debugging & Recovery When Cluster Breaks
When an outage occurs, follow a systematic debugging approach to identify and resolve issues quickly.
The Debugging Workflow
Step 1: Check Object Status
Start with high-level Kubernetes object inspection
# Get overview of all resources
kubectl get all -n production
# Check pod status
kubectl get pods -n production
# Describe specific pod for events
kubectl describe pod my-app-xyz -n production
# View recent events in namespace
kubectl get events -n production --sort-by='.lastTimestamp'
# Check pod logs
kubectl logs my-app-xyz -n production
# Previous container logs (if pod restarted)
kubectl logs my-app-xyz -n production --previous

Step 2: Check Lower Levels (Container & Host)
Dig deeper into container runtime and node-level issues
# Check node status
kubectl get nodes
kubectl describe node worker-node-1
# SSH to node and check kubelet logs
ssh worker-node-1
sudo journalctl -u kubelet -f
# Check container runtime
sudo crictl ps
sudo crictl logs <container-id>   # ID from crictl ps
# Check system resources
top
df -h
free -m
# Check network connectivity
ping 8.8.8.8
curl https://kubernetes.default.svc.cluster.local

Step 3: Use Debugging Tools
Leverage kubectl debug and ephemeral containers
# Create debugging container attached to pod
kubectl debug my-app-xyz -it --image=nicolaka/netshoot
# Inside debug container, you have networking tools:
# - nslookup (DNS debugging)
# - curl (HTTP testing)
# - netstat (network connections)
# - tcpdump (packet capture)
# - ping, traceroute, etc.
# Debug by creating a copy of the pod
kubectl debug my-app-xyz --copy-to=my-app-debug --container=app
# Debug node issues
kubectl debug node/worker-node-1 -it --image=ubuntu

Common Issues and Solutions
| Symptom | Possible Cause | Debugging Steps |
|---|---|---|
| Pods stuck in Pending | Insufficient resources or scheduling constraints | kubectl describe pod: check events for scheduling failures |
| Pods in CrashLoopBackOff | Application error or misconfiguration | kubectl logs: check application logs for errors |
| Pods in ImagePullBackOff | Can't pull container image | Check image name, registry credentials, network connectivity |
| Service not accessible | Networking issue or misconfigured service | kubectl get endpoints: verify endpoints exist |
| High latency or timeouts | Resource exhaustion or network issues | Check node resources, DNS resolution, network policies |
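Most rows in this table start from the same move: scan kubectl get pods output for anything not Running. Below is a minimal triage filter, run against canned output so the sketch works without a cluster; in practice you would pipe real kubectl get pods output into the same awk.

```shell
# Canned `kubectl get pods` output so the sketch runs without a cluster
cat <<'EOF' > /tmp/pods.txt
NAME        READY   STATUS             RESTARTS   AGE
web-abc     1/1     Running            0          5m
db-0        0/1     CrashLoopBackOff   5          3m
cache-xyz   0/1     Pending            0          1m
EOF
# Print every pod whose STATUS column is not Running
awk 'NR > 1 && $3 != "Running" { print $1, $3 }' /tmp/pods.txt
```

The flagged names then feed straight into kubectl describe pod and kubectl logs from Step 1.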
Complete Debugging Example
# Scenario: Web app is down, users getting 502 errors
# 1. Check pod status
kubectl get pods -l app=web-app
# NAME READY STATUS RESTARTS AGE
# web-app-abc 0/1 CrashLoopBackOff 5 3m
# 2. Check logs
kubectl logs web-app-abc
# Error: Cannot connect to database at db-service:5432
# 3. Check database pods
kubectl get pods -l app=database
# NAME READY STATUS RESTARTS AGE
# database-0 1/1 Running 0 10m
# 4. Check service and endpoints
kubectl get svc db-service
kubectl get endpoints db-service
# NAME ENDPOINTS
# db-service 10.244.1.15:5432
# 5. Test connectivity from web app pod
kubectl debug web-app-abc -it --image=nicolaka/netshoot
# Inside debug container:
nslookup db-service
# Success - DNS works
telnet 10.244.1.15 5432
# Connection refused!
# 6. Check NetworkPolicy
kubectl get networkpolicy
# Found restrictive policy blocking web-app → database
# 7. Fix NetworkPolicy or add exception
kubectl apply -f allow-web-to-db-policy.yaml
# 8. Verify fix
kubectl logs web-app-abc
# Successfully connected to database

Disaster Recovery Checklist Summary
- ✅ High Availability: 3+ etcd, multiple API servers with LB, redundant Ingress
- ✅ Certificate Management: Use kubeadm, automate renewal, monitor expiration
- ✅ Maintenance Procedures: Proper drain/uncordon workflow, test updates on staging
- ✅ Comprehensive Backups: Git for manifests, Velero for cluster state, database backups
- ✅ Resource Management: LimitRange defaults, ResourceQuota limits
- ✅ Disruption Protection: PodDisruptionBudgets for critical apps
- ✅ Priority Management: PriorityClasses for workload importance
- ✅ Network Security: NetworkPolicies for traffic control
- ✅ Monitoring & Alerting: Prometheus, Grafana, log aggregation
- ✅ Documentation: Architecture diagrams, runbooks, incident procedures
- ✅ Testing: Regular DR drills, chaos engineering (optional)
Final Quiz
Test your knowledge of Kubernetes Disaster Recovery!
Question 1: Why should etcd instances always be an odd number?
Question 2: What is the primary purpose of Pod Disruption Budget (PDB)?
Question 3: What does LimitRange accomplish in a namespace?
Question 4: What happens when a PriorityClass is used during resource scarcity?
Question 5: What should you always do before performing cluster version updates?
Question 6: What is Velero used for?
Question 7: What does kubectl drain do?
Question 8: What is the purpose of NetworkPolicy?
Review the lessons above to check your answers.