Kubernetes Disaster Recovery

Building Bulletproof Clusters & Preventing Failures

Lesson 1: Categorizing Kubernetes Failures

Understanding how Kubernetes can fail is the first step toward building a resilient cluster. Failures can be categorized into three distinct levels.

The Three Levels of Failure

Physical Level

Infrastructure issues:

  • Poor data center quality
  • Hardware failures
  • Incorrect hardware selection
  • Poor architecture planning
  • Complex multi-datacenter setups

Organizational Level

Human and management factors:

  • Lack of documented processes
  • Unclear responsibilities
  • Poor topology/documentation
  • Inadequate access controls
  • Constant "firefighting" mode

Applied (Technical) Level

Technical errors:

  • Kubernetes complexity
  • Reliance on "hope" or shortcuts
  • Configuration errors
  • Human factor mistakes
  • Missing best practices

Physical Level Issues

Infrastructure Failures:
  • Poor Data Center Quality: Unreliable power, cooling, or network connectivity
  • Hardware Failure: Disk failures, memory errors, network interface problems
  • Incorrect Hardware Selection: Under-provisioned servers that can't handle the workload
  • Poor Load Planning: Not accounting for peak traffic or growth
  • Complex Multi-Datacenter: Over-engineered setups that are hard to maintain

Organizational Level Issues

Human and Management Problems:
  • No Documented Processes: Knowledge exists only in people's heads
  • Unclear Responsibilities: No one knows who's responsible for what
  • Poor Documentation: Cluster topology and architecture not documented
  • Inadequate Access Controls: Junior engineer accidentally deletes critical namespace
  • Firefighting Mode: Team always reacting to emergencies instead of preventing them
Common Scenario: A junior developer with excessive permissions accidentally runs kubectl delete namespace production, taking down the entire production environment because there were no safeguards in place.

Applied (Technical) Level Issues

Technical Configuration Errors:
  • Kubernetes Complexity: The platform has a steep learning curve with many moving parts
  • Hope-Driven Development: Deploying changes with fingers crossed, hoping nothing breaks
  • Shortcuts: Skipping proper testing or validation to save time
  • Configuration Mistakes: Wrong resource limits, missing health checks, incorrect networking
  • Human Error: Typos in YAML, copy-paste mistakes, miscommunication

The Structured Approach

This course presents a systematic approach to disaster recovery:

  1. Identify common failure modes - Understand what can go wrong
  2. Structure problems into categories - Organize by level (Physical, Organizational, Technical)
  3. Create a checklist - Build actionable items to address each category
  4. Review practical solutions - Implement tools and best practices
Focus of This Course: While physical and organizational issues are important, this course focuses primarily on the Applied (Technical) Level - the concrete steps you can take to make your Kubernetes cluster more resilient through configuration, tooling, and best practices.

Lesson 2: Fault Tolerance & High Availability Setup

Building a bulletproof cluster starts with proper fault tolerance configuration of core Kubernetes components.

etcd and Control Plane High Availability

Critical Requirement: To achieve a fault-tolerant setup, ensure you have an odd number of etcd instances (3 or more) to maintain a quorum. The API Server, Controller Manager, and Scheduler should match this count.

Recommended Control Plane Setup

  • 3x etcd: distributed key-value store; maintains quorum with 2 of 3 alive
  • 3x API Server: Kubernetes API endpoint; load balanced
  • 3x Controller Manager: runs the built-in controllers; leader election
  • 3x Scheduler: pod placement decisions; leader election
Why Odd Numbers? With 3 etcd instances, you can tolerate 1 failure and still maintain quorum (2/3). With 5 instances, you can tolerate 2 failures (3/5). Even numbers don't improve fault tolerance (4 instances can still only tolerate 1 failure, same as 3).
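The quorum arithmetic can be checked directly: a cluster of n members needs floor(n/2) + 1 votes, so it tolerates n minus quorum failures. A quick shell sketch (illustrative only):

```shell
# Quorum for an n-member etcd cluster is (n / 2) + 1 (integer division);
# the cluster tolerates n - quorum member failures.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated_failures=$tolerated"
done
# members=3 quorum=2 tolerated_failures=1
# members=4 quorum=3 tolerated_failures=1  (4 is no better than 3)
# members=5 quorum=3 tolerated_failures=2
```

Note how an even member count buys nothing: 4 members still tolerate only 1 failure, while raising the quorum needed to stay alive.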

Load Balancer for API Server

Place an external load balancer (or a simple Nginx proxy) in front of the API servers so that kubelet and kube-proxy keep a working API endpoint when a control plane node fails.

# Example: Simple Nginx load balancer for API servers
# (TCP proxying, so these directives go in the nginx stream {} context)
stream {
    upstream kubernetes {
        server master1.example.com:6443 max_fails=3 fail_timeout=30s;
        server master2.example.com:6443 max_fails=3 fail_timeout=30s;
        server master3.example.com:6443 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 6443;
        proxy_pass kubernetes;
        proxy_timeout 10s;
        proxy_connect_timeout 1s;
    }
}

# Kubeconfig points to the load balancer
apiVersion: v1
clusters:
- cluster:
    server: https://k8s-lb.example.com:6443
  name: kubernetes
Without Load Balancer: If worker nodes point directly to a single API server and that master fails, kubelet and kube-proxy lose connectivity and can't manage pods.

Ingress Controller Redundancy

1. Multiple Ingress Controllers

Deploy more than one Ingress Controller on dedicated worker nodes. Use node selectors or node affinity to ensure they run on different physical hosts.

2. External Load Balancer or VRRP

Use an external load balancer or Virtual Router Redundancy Protocol (VRRP) like Keepalived to ensure traffic reaches an available Ingress Controller host.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      nodeSelector:
        node-role: ingress   # Deploy only on dedicated ingress nodes
      containers:
      - name: nginx-ingress-controller
        image: k8s.gcr.io/ingress-nginx/controller:v1.8.1
        args:
        - /nginx-ingress-controller
        - --election-id=ingress-controller-leader
        ports:
        - name: http
          containerPort: 80
          hostPort: 80   # Bind to host network
        - name: https
          containerPort: 443
          hostPort: 443

Certificate Management

Historical Issue: In older Kubernetes versions, expiring control plane certificates were a common failure. A cluster would turn into a "pumpkin" after a year when its certificates expired, causing a total outage.
Modern Best Practice: Install and manage the cluster with kubeadm, which provides commands for checking and renewing control plane certificates and also renews them automatically during cluster upgrades.
# Check certificate expiration
kubeadm certs check-expiration

# Output:
# CERTIFICATE             EXPIRES                  RESIDUAL TIME
# admin.conf              Jan 15, 2025 12:00 UTC   364d
# apiserver               Jan 15, 2025 12:00 UTC   364d
# apiserver-etcd-client   Jan 15, 2025 12:00 UTC   364d

# Renew all certificates
kubeadm certs renew all

# Renew specific certificate
kubeadm certs renew apiserver
Pro Tip: Set up automated monitoring and alerts for certificate expiration at least 60 days before they expire. Use cert-manager for application TLS certificates.
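One way to sketch such an alert is with openssl and GNU date. The snippet below generates a throwaway self-signed certificate for demonstration; in a real check you would point CERT at a control plane certificate such as /etc/kubernetes/pki/apiserver.crt (the kubeadm default path):

```shell
# Sketch: warn when a certificate is within 60 days of expiry.
# Demo uses a throwaway self-signed cert; in practice point CERT at
# e.g. /etc/kubernetes/pki/apiserver.crt (kubeadm default location).
CERT=/tmp/demo.crt
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out "$CERT" -days 365 -subj "/CN=demo" 2>/dev/null

end_date=$(openssl x509 -enddate -noout -in "$CERT" | cut -d= -f2)
end_epoch=$(date -d "$end_date" +%s)   # GNU date syntax
days_left=$(( (end_epoch - $(date +%s)) / 86400 ))
echo "days until expiry: $days_left"

if [ "$days_left" -lt 60 ]; then
  echo "WARNING: certificate expires in under 60 days"
fi
```

Run from cron or a monitoring agent, this catches the "pumpkin" scenario long before it happens.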

High Availability Checklist

Component            Minimum HA Setup                Failure Tolerance
etcd                 3 instances                     1 node failure
API Server           3 instances + LB                2 nodes can fail
Controller Manager   3 instances (leader election)   2 nodes can fail
Scheduler            3 instances (leader election)   2 nodes can fail
Ingress Controller   2+ instances + VRRP/LB          1+ nodes can fail

Lesson 3: Updates, Maintenance & Backups

Proper procedures for updates and comprehensive backups are essential for disaster recovery.

Host OS Updates (Worker Nodes)

Before updating an operating system, ensure the node's running applications have more than one replica to maintain availability during maintenance.

# Step 1: Verify replicas are running on other nodes
kubectl get pods -o wide | grep my-app

# Step 2: Cordon the node (prevent new pods from scheduling)
kubectl cordon worker-node-1

# Step 3: Drain the node (safely evacuate all pods)
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
# Output shows pods being evicted and rescheduled to other nodes

# Step 4: Perform maintenance (SSH to the node)
ssh worker-node-1
sudo apt update && sudo apt upgrade -y
sudo reboot

# Step 5: After reboot, uncordon the node
kubectl uncordon worker-node-1
# Pods will naturally rebalance during future scheduling
Common Mistake: Running kubectl drain without checking if applications have sufficient replicas on other nodes can cause downtime. Always verify replica distribution first!
Pod Disruption Budgets: We'll cover PDBs in the next lesson, but they prevent kubectl drain from causing downtime by blocking the drain if it would violate minimum availability requirements.

Cluster Version Updates

Critical Rules for Cluster Updates:
  • Always test the update process first on a dedicated development/staging cluster that mirrors production
  • Always read the changelog (release notes) between versions
  • Key names and behavior can change between versions
  • Never skip minor versions (upgrade sequentially: 1.25 → 1.26 → 1.27)
  • Back up everything before starting
# Example: Upgrading kubeadm cluster from 1.27 to 1.28

# 1. Read release notes
# https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.28.md

# 2. Upgrade control plane (first master node)
sudo apt-mark unhold kubeadm && \
  sudo apt-get update && sudo apt-get install -y kubeadm=1.28.0-00 && \
  sudo apt-mark hold kubeadm

# Plan the upgrade
sudo kubeadm upgrade plan

# Apply the upgrade
sudo kubeadm upgrade apply v1.28.0

# 3. Upgrade kubelet and kubectl on master
sudo apt-mark unhold kubelet kubectl && \
  sudo apt-get update && sudo apt-get install -y kubelet=1.28.0-00 kubectl=1.28.0-00 && \
  sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# 4. Repeat for other master nodes
# sudo kubeadm upgrade node

# 5. Upgrade worker nodes (one at a time)
kubectl drain worker-node-1 --ignore-daemonsets
# SSH to worker, upgrade kubeadm, kubelet, kubectl
# sudo kubeadm upgrade node
kubectl uncordon worker-node-1
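The "never skip minor versions" rule can be enforced by a small pre-flight check. This is a hypothetical helper, not part of kubeadm; it only compares minor versions of `major.minor` strings:

```shell
# Hypothetical pre-flight check: allow only the next sequential minor version.
can_upgrade() {
  local cur_minor=${1#*.} target_minor=${2#*.}
  [ $(( target_minor - cur_minor )) -eq 1 ]
}

can_upgrade 1.27 1.28 && echo "ok: 1.27 -> 1.28"
can_upgrade 1.25 1.27 || echo "blocked: 1.25 -> 1.27 skips 1.26"
```

Wiring a check like this into your upgrade runbook or CI pipeline turns a documented rule into an enforced one.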

Backups - Critical for Disaster Recovery

Backups are critical. Kubernetes manifests are often already in Git, but Git alone does not capture all cluster state, so back up each of the following.

What to Back Up

1. Manifests & Helm Charts

Method: Git repository (GitOps)

Store all YAML manifests, Helm charts, and Kustomize configurations in version control. This provides history and rollback capability.

git clone git@github.com:company/k8s-manifests.git
cd k8s-manifests
git log --oneline   # View deployment history

2. Secrets & Certificates

Method: Encrypted backups or secret management tools

Secrets contain sensitive data (passwords, API keys, TLS certs). Never store in plain text Git!

# Export secrets (for backup only, encrypt the result!)
kubectl get secrets --all-namespaces -o yaml > secrets-backup.yaml

# Better: Use secret management
# - Sealed Secrets
# - External Secrets Operator
# - HashiCorp Vault

3. Container Images

Method: Backup the container registry

If your registry goes down, you can't deploy or scale. Back up registry storage or use a mirrored/replicated registry.

# Registry backup strategies:
# 1. S3/Object storage replication
# 2. Harbor registry replication
# 3. Mirror to multiple registries
# 4. Export critical images
docker save myapp:v1.0 | gzip > myapp-v1.0.tar.gz

4. Persistent Volumes / Stateful Data

Method: Volume snapshots and database backups

Application data in databases, file storage, etc. This is often the most critical backup.

# CSI Volume Snapshots (covered in storage lesson)
kubectl create -f snapshot.yaml

# Database-specific backups
kubectl exec postgres-0 -- pg_dump mydb > backup.sql

# Velero for automated backups (see below)

Velero - Complete Backup Solution

Recommended Tool: For a robust solution, explore Velero, a tool designed specifically for backing up and restoring Kubernetes cluster resources and persistent volumes.
# Install Velero CLI
brew install velero   # or download from GitHub

# Install Velero in cluster (AWS S3 example)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

# Create a backup of entire cluster
velero backup create full-backup-2024-01-15

# Create a backup of specific namespace
velero backup create prod-backup --include-namespaces=production

# Schedule automatic daily backups
velero schedule create daily-backup --schedule="0 2 * * *"

# Restore from backup
velero restore create --from-backup full-backup-2024-01-15

# List backups
velero backup get
Velero Features:
  • Backs up entire cluster or specific namespaces
  • Supports persistent volume snapshots (CSI)
  • Scheduled automated backups
  • Disaster recovery and cluster migration
  • Works with AWS, GCP, Azure, and on-premises (S3-compatible)

Lesson 4: Minimizing Human Error - Essential Kubernetes Features

Built-in Kubernetes features can help increase security and minimize the impact of human error.

1. Pod Disruption Budget (PDB)

A PDB limits the number of concurrent voluntary disruptions (like kubectl drain), ensuring critical applications maintain a minimum required replica count during maintenance.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2   # At least 2 pods must remain available
  selector:
    matchLabels:
      app: web-app
---
# Alternative: maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-pdb
spec:
  maxUnavailable: 1   # At most 1 pod can be unavailable
  selector:
    matchLabels:
      app: database
How PDB Works: If draining a node would cause the application's available replicas to drop below the PDB's limit, the kubectl drain action is blocked or paused until more replicas are available on other nodes.
Important Limitation: PDBs only apply to voluntary disruptions (administrative actions). They do NOT protect against involuntary disruptions like node failures, network partitions, or application crashes. They also don't prevent manual kubectl delete pod.
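The accounting behind a minAvailable PDB is simple enough to sketch: allowed disruptions equal healthy pods minus minAvailable, floored at zero. Illustrative numbers only, not real API output:

```shell
# Illustration of PDB accounting during a drain (not a real kubectl call):
# allowedDisruptions = healthy - minAvailable (never below 0).
healthy=3; min_available=2
allowed=$(( healthy - min_available ))
if [ "$allowed" -lt 0 ]; then allowed=0; fi
echo "allowed disruptions: $allowed"   # 1 pod may be evicted right now

# After one eviction, the drain must wait until a replacement is Ready:
healthy=2
allowed=$(( healthy - min_available ))
echo "allowed disruptions: $allowed"   # 0: further evictions are blocked
```

This is why a drain can appear to "hang": it is the PDB correctly refusing evictions until replacement pods become Ready elsewhere.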

2. LimitRange and ResourceQuota

These two features work together at the namespace level to enforce resource management, which is critical for cluster stability.

LimitRange - Default Resources

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limit-range
  namespace: development
spec:
  limits:
  - default:          # Default limits if not specified
      cpu: 500m
      memory: 512Mi
    defaultRequest:   # Default requests if not specified
      cpu: 100m
      memory: 128Mi
    max:              # Maximum allowed
      cpu: 2
      memory: 2Gi
    min:              # Minimum allowed
      cpu: 50m
      memory: 64Mi
    type: Container
Disaster Prevention: If a developer deploys a Pod without resource specifications, the LimitRange automatically injects defined defaults, preventing the Pod from potentially consuming all available node resources (a "runaway process").

ResourceQuota - Namespace Limits

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: development
spec:
  hard:
    # Compute resources
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    # Object counts
    pods: "50"
    services: "10"
    configmaps: "20"
    persistentvolumeclaims: "10"
    secrets: "20"
    # Storage
    requests.storage: 100Gi
Disaster Prevention: If an operator accidentally scales a deployment from 3 to 300 replicas, the Quota blocks creation of new Pods once the namespace limit (e.g., 50 pods) is hit, protecting the rest of the cluster from resource exhaustion.
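The admission arithmetic in that scenario looks like this (illustrative numbers, not real quota output):

```shell
# Illustration: pod-count quota admission during an accidental scale-up
# from 3 to 300 replicas with a hard limit of 50 pods in the namespace.
hard_limit=50    # pods: "50" in the ResourceQuota
current=3        # pods already running in the namespace
requested=297    # new pods needed to reach 300 replicas

headroom=$(( hard_limit - current ))
admitted=$(( requested < headroom ? requested : headroom ))
echo "admitted: $admitted, rejected: $(( requested - admitted ))"
# admitted: 47, rejected: 250 -- the excess pods are simply never created
```

The deployment reports unfulfilled replicas, but the rest of the cluster keeps running.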

3. PriorityClass - Critical Workload Protection

PriorityClass is used by the Scheduler to make decisions during resource contention, especially when a cluster is under stress.

# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-priority
value: 1000000        # Higher number = higher priority
globalDefault: false
description: "Critical production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
description: "Important production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 10000
description: "Staging and QA workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: true
description: "Development and batch jobs"
---
# Use in Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: critical-priority
  containers:
  - name: app
    image: myapp:v1
Disaster Prevention: In a resource-constrained situation (like a partial node failure), the Scheduler prioritizes the highest-priority Pods. If necessary, the Scheduler can preempt (kill) lower-priority Pods to free up resources so high-priority Pods (production apps) can be scheduled and remain operational.
Real-World Example: During a node failure, the cluster has limited capacity. The Scheduler will kill development/staging Pods (low priority) to ensure production database and API Pods (critical priority) stay running.
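The preemption ordering can be illustrated with plain sorting: candidates are considered from the lowest priority value upward. Toy data below, not real scheduler output:

```shell
# Toy illustration: preemption considers the lowest-priority pods first.
# Each line is "<priority-value> <pod-name>"; sort numerically, take the head.
printf '%s\n' \
  '1000000 prod-db-0' \
  '100000 prod-api-7c9' \
  '1000 dev-batch-x2x' \
  '10000 staging-web-9k1' \
  | sort -n | head -n 1 | awk '{print "first preemption candidate:", $2}'
# first preemption candidate: dev-batch-x2x
```

The real scheduler weighs more factors (PDBs, affinity, node fit), but priority value sets the basic pecking order.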

4. NetworkPolicy - Internal Firewall

NetworkPolicy acts as an internal firewall, controlling network traffic flow between different pods and namespaces.

# Default deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}   # Apply to all pods in namespace
  policyTypes:
  - Ingress
---
# Allow specific traffic: frontend → backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
---
# Allow database access only from backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: backend
    ports:
    - protocol: TCP
      port: 5432
Security Benefit: Even if an attacker compromises a frontend Pod, they can't directly access the database because NetworkPolicies block that traffic. They can only communicate through allowed paths.

Lesson 5: Debugging & Recovery When Cluster Breaks

When an outage occurs, follow a systematic debugging approach to identify and resolve issues quickly.

The Debugging Workflow

Step 1: Check Object Status

Start with high-level Kubernetes object inspection

# Get overview of all resources
kubectl get all -n production

# Check pod status
kubectl get pods -n production

# Describe specific pod for events
kubectl describe pod my-app-xyz -n production

# View recent events in namespace
kubectl get events -n production --sort-by='.lastTimestamp'

# Check pod logs
kubectl logs my-app-xyz -n production

# Previous container logs (if pod restarted)
kubectl logs my-app-xyz -n production --previous

Step 2: Check Lower Levels (Container & Host)

Dig deeper into container runtime and node-level issues

# Check node status
kubectl get nodes
kubectl describe node worker-node-1

# SSH to node and check kubelet logs
ssh worker-node-1
sudo journalctl -u kubelet -f

# Check container runtime
sudo crictl ps
sudo crictl logs <container-id>

# Check system resources
top
df -h
free -m

# Check network connectivity
ping 8.8.8.8
curl https://kubernetes.default.svc.cluster.local

Step 3: Use Debugging Tools

Leverage kubectl debug and ephemeral containers

# Create debugging container attached to pod
kubectl debug my-app-xyz -it --image=nicolaka/netshoot

# Inside debug container, you have networking tools:
# - nslookup (DNS debugging)
# - curl (HTTP testing)
# - netstat (network connections)
# - tcpdump (packet capture)
# - ping, traceroute, etc.

# Debug by creating a copy of the pod
kubectl debug my-app-xyz --copy-to=my-app-debug --container=app

# Debug node issues
kubectl debug node/worker-node-1 -it --image=ubuntu

Common Issues and Solutions

Symptom                    Possible Cause                                     Debugging Steps
Pods stuck in Pending      Insufficient resources or scheduling constraints   kubectl describe pod; check events for scheduling failures
Pods in CrashLoopBackOff   Application error or misconfiguration              kubectl logs; check application logs for errors
Pods in ImagePullBackOff   Can't pull container image                         Check image name, registry credentials, network connectivity
Service not accessible     Networking issue or misconfigured service          kubectl get endpoints; verify endpoints exist
High latency or timeouts   Resource exhaustion or network issues              Check node resources, DNS resolution, network policies

Complete Debugging Example

# Scenario: Web app is down, users getting 502 errors

# 1. Check pod status
kubectl get pods -l app=web-app
# NAME          READY   STATUS             RESTARTS   AGE
# web-app-abc   0/1     CrashLoopBackOff   5          3m

# 2. Check logs
kubectl logs web-app-abc
# Error: Cannot connect to database at db-service:5432

# 3. Check database pods
kubectl get pods -l app=database
# NAME         READY   STATUS    RESTARTS   AGE
# database-0   1/1     Running   0          10m

# 4. Check service and endpoints
kubectl get svc db-service
kubectl get endpoints db-service
# NAME         ENDPOINTS
# db-service   10.244.1.15:5432

# 5. Test connectivity from web app pod
kubectl debug web-app-abc -it --image=nicolaka/netshoot
# Inside debug container:
nslookup db-service       # Success - DNS works
telnet 10.244.1.15 5432   # Connection refused!

# 6. Check NetworkPolicy
kubectl get networkpolicy
# Found restrictive policy blocking web-app → database

# 7. Fix NetworkPolicy or add exception
kubectl apply -f allow-web-to-db-policy.yaml

# 8. Verify fix
kubectl logs web-app-abc
# Successfully connected to database

Disaster Recovery Checklist Summary

Complete DR Checklist:
  • High Availability: 3+ etcd, multiple API servers with LB, redundant Ingress
  • Certificate Management: Use kubeadm, automate renewal, monitor expiration
  • Maintenance Procedures: Proper drain/uncordon workflow, test updates on staging
  • Comprehensive Backups: Git for manifests, Velero for cluster state, database backups
  • Resource Management: LimitRange defaults, ResourceQuota limits
  • Disruption Protection: PodDisruptionBudgets for critical apps
  • Priority Management: PriorityClasses for workload importance
  • Network Security: NetworkPolicies for traffic control
  • Monitoring & Alerting: Prometheus, Grafana, log aggregation
  • Documentation: Architecture diagrams, runbooks, incident procedures
  • Testing: Regular DR drills, chaos engineering (optional)
Final Advice: Don't wait for a disaster to happen. Implement these practices proactively, test your recovery procedures regularly, and continuously improve your cluster's resilience based on lessons learned from incidents.

Final Quiz

Test your knowledge of Kubernetes Disaster Recovery!

Question 1: Why should etcd instances always be an odd number?

a) For better performance
b) To maintain quorum with fault tolerance (e.g., 3 instances tolerate 1 failure)
c) To use less memory
d) It's a Kubernetes requirement for any component

Question 2: What is the primary purpose of Pod Disruption Budget (PDB)?

a) To prevent all pod deletions
b) To limit voluntary disruptions ensuring minimum replicas during maintenance
c) To allocate budget for pod resources
d) To protect against node failures

Question 3: What does LimitRange accomplish in a namespace?

a) Limits the number of namespaces
b) Sets default resource requests/limits for pods preventing resource exhaustion
c) Restricts network traffic
d) Limits the number of users

Question 4: What happens when a PriorityClass is used during resource scarcity?

a) All pods are treated equally
b) Scheduler can preempt (kill) lower-priority pods to schedule higher-priority ones
c) Pods are randomly selected for eviction
d) The cluster shuts down

Question 5: What should you always do before performing cluster version updates?

a) Update production first to find issues quickly
b) Test on staging cluster and read changelog between versions
c) Skip patch versions to save time
d) Update all nodes simultaneously

Question 6: What is Velero used for?

a) Container image building
b) Backing up and restoring Kubernetes cluster resources and persistent volumes
c) Network policy enforcement
d) Pod autoscaling

Question 7: What does kubectl drain do?

a) Deletes the node permanently
b) Safely evacuates all pods from a node for maintenance
c) Drains node memory
d) Removes all logs from the node

Question 8: What is the purpose of NetworkPolicy?

a) To configure external load balancers
b) To act as internal firewall controlling traffic between pods and namespaces
c) To manage DNS records
d) To allocate IP addresses
Quiz Complete!
All correct answers are option 'b'. Review the lessons above to understand why these are the best answers.