Kubernetes Disaster Recovery

Building Bulletproof Clusters & Preventing Failures

Lesson 1: Categorizing Kubernetes Failures

Understanding how Kubernetes can fail is the first step toward building a resilient cluster. Failures can be categorized into three distinct levels.

The Three Levels of Failure

Physical Level

Infrastructure issues:

  • Poor data center quality
  • Hardware failures
  • Incorrect hardware selection
  • Poor architecture planning
  • Complex multi-datacenter setups

Organizational Level

Human and management factors:

  • Lack of documented processes
  • Unclear responsibilities
  • Poor topology/documentation
  • Inadequate access controls
  • Constant "firefighting" mode

Applied (Technical) Level

Technical errors:

  • Kubernetes complexity
  • Reliance on "hope" or shortcuts
  • Configuration errors
  • Human factor mistakes
  • Missing best practices

Physical Level Issues

Infrastructure Failures:
  • Poor Data Center Quality: Unreliable power, cooling, or network connectivity
  • Hardware Failure: Disk failures, memory errors, network interface problems
  • Incorrect Hardware Selection: Under-provisioned servers that can't handle the workload
  • Poor Load Planning: Not accounting for peak traffic or growth
  • Complex Multi-Datacenter: Over-engineered setups that are hard to maintain

Organizational Level Issues

Human and Management Problems:
  • No Documented Processes: Knowledge exists only in people's heads
  • Unclear Responsibilities: No one knows who's responsible for what
  • Poor Documentation: Cluster topology and architecture not documented
  • Inadequate Access Controls: Junior engineer accidentally deletes critical namespace
  • Firefighting Mode: Team always reacting to emergencies instead of preventing them
Common Scenario: A junior developer with excessive permissions accidentally runs kubectl delete namespace production, taking down the entire production environment because there were no safeguards in place.

Applied (Technical) Level Issues

Technical Configuration Errors:
  • Kubernetes Complexity: The platform has a steep learning curve with many moving parts
  • Hope-Driven Development: Deploying changes with fingers crossed, hoping nothing breaks
  • Shortcuts: Skipping proper testing or validation to save time
  • Configuration Mistakes: Wrong resource limits, missing health checks, incorrect networking
  • Human Error: Typos in YAML, copy-paste mistakes, miscommunication

The Structured Approach

This course presents a systematic approach to disaster recovery:

  1. Identify common failure modes - Understand what can go wrong
  2. Structure problems into categories - Organize by level (Physical, Organizational, Technical)
  3. Create a checklist - Build actionable items to address each category
  4. Review practical solutions - Implement tools and best practices
Focus of This Course: While physical and organizational issues are important, this course focuses primarily on the Applied (Technical) Level - the concrete steps you can take to make your Kubernetes cluster more resilient through configuration, tooling, and best practices.

Lesson 2: Fault Tolerance & High Availability Setup

Building a bulletproof cluster starts with proper fault tolerance configuration of core Kubernetes components.

etcd and Control Plane High Availability

Critical Requirement: To achieve a fault-tolerant setup, ensure you have an odd number of etcd instances (3 or more) to maintain a quorum. The API Server, Controller Manager, and Scheduler should match this count.

Recommended Control Plane Setup

  • 3x etcd: distributed key-value store; maintains quorum with 2 of 3 alive
  • 3x API Server: Kubernetes API endpoint; load balanced
  • 3x Controller Manager: runs the built-in controllers; leader election
  • 3x Scheduler: pod placement decisions; leader election
Why Odd Numbers? With 3 etcd instances, you can tolerate 1 failure and still maintain quorum (2/3). With 5 instances, you can tolerate 2 failures (3/5). Even numbers don't improve fault tolerance (4 instances can still only tolerate 1 failure, same as 3).
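The quorum arithmetic can be checked directly: a cluster of n members needs floor(n/2) + 1 votes, so it tolerates n minus quorum failures. A quick shell sketch (illustrative only):

```shell
# Quorum for an n-member etcd cluster is (n / 2) + 1 (integer division);
# the cluster tolerates n - quorum member failures.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated_failures=$tolerated"
done
# members=3 quorum=2 tolerated_failures=1
# members=4 quorum=3 tolerated_failures=1  (4 is no better than 3)
# members=5 quorum=3 tolerated_failures=2
```

Note how an even member count buys nothing: 4 members still tolerate only 1 failure, while raising the quorum needed to stay alive.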

Load Balancer for API Server

Place an external load balancer (or a simple Nginx proxy) in front of the API servers so that kubelet and kube-proxy keep a working API endpoint when a control plane node fails.

# Example: Simple Nginx load balancer for API servers
# (TCP proxying, so these directives go in the nginx stream {} context)
stream {
    upstream kubernetes {
        server master1.example.com:6443 max_fails=3 fail_timeout=30s;
        server master2.example.com:6443 max_fails=3 fail_timeout=30s;
        server master3.example.com:6443 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 6443;
        proxy_pass kubernetes;
        proxy_timeout 10s;
        proxy_connect_timeout 1s;
    }
}

# Kubeconfig points to the load balancer
apiVersion: v1
clusters:
- cluster:
    server: https://k8s-lb.example.com:6443
  name: kubernetes
Without Load Balancer: If worker nodes point directly to a single API server and that master fails, kubelet and kube-proxy lose connectivity and can't manage pods.

Ingress Controller Redundancy

1. Multiple Ingress Controllers

Deploy more than one Ingress Controller on dedicated worker nodes. Use node selectors or node affinity to ensure they run on different physical hosts.

2. External Load Balancer or VRRP

Use an external load balancer or Virtual Router Redundancy Protocol (VRRP) like Keepalived to ensure traffic reaches an available Ingress Controller host.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      nodeSelector:
        node-role: ingress   # Deploy only on dedicated ingress nodes
      containers:
      - name: nginx-ingress-controller
        image: k8s.gcr.io/ingress-nginx/controller:v1.8.1
        args:
        - /nginx-ingress-controller
        - --election-id=ingress-controller-leader
        ports:
        - name: http
          containerPort: 80
          hostPort: 80   # Bind to host network
        - name: https
          containerPort: 443
          hostPort: 443

Certificate Management

Historical Issue: In older Kubernetes versions, expiring control plane certificates were a common failure. A cluster would turn into a "pumpkin" after a year when its certificates expired, causing a total outage.
Modern Best Practice: Install and manage the cluster with kubeadm, which provides commands for checking and renewing control plane certificates and also renews them automatically during cluster upgrades.
# Check certificate expiration
kubeadm certs check-expiration

# Output:
# CERTIFICATE             EXPIRES                  RESIDUAL TIME
# admin.conf              Jan 15, 2025 12:00 UTC   364d
# apiserver               Jan 15, 2025 12:00 UTC   364d
# apiserver-etcd-client   Jan 15, 2025 12:00 UTC   364d

# Renew all certificates
kubeadm certs renew all

# Renew specific certificate
kubeadm certs renew apiserver
Pro Tip: Set up automated monitoring and alerts for certificate expiration at least 60 days before they expire. Use cert-manager for application TLS certificates.
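One way to sketch such an alert is with openssl and GNU date. The snippet below generates a throwaway self-signed certificate for demonstration; in a real check you would point CERT at a control plane certificate such as /etc/kubernetes/pki/apiserver.crt (the kubeadm default path):

```shell
# Sketch: warn when a certificate is within 60 days of expiry.
# Demo uses a throwaway self-signed cert; in practice point CERT at
# e.g. /etc/kubernetes/pki/apiserver.crt (kubeadm default location).
CERT=/tmp/demo.crt
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out "$CERT" -days 365 -subj "/CN=demo" 2>/dev/null

end_date=$(openssl x509 -enddate -noout -in "$CERT" | cut -d= -f2)
end_epoch=$(date -d "$end_date" +%s)   # GNU date syntax
days_left=$(( (end_epoch - $(date +%s)) / 86400 ))
echo "days until expiry: $days_left"

if [ "$days_left" -lt 60 ]; then
  echo "WARNING: certificate expires in under 60 days"
fi
```

Run from cron or a monitoring agent, this catches the "pumpkin" scenario long before it happens.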

High Availability Checklist

Component            Minimum HA Setup                Failure Tolerance
etcd                 3 instances                     1 node failure
API Server           3 instances + LB                2 nodes can fail
Controller Manager   3 instances (leader election)   2 nodes can fail
Scheduler            3 instances (leader election)   2 nodes can fail
Ingress Controller   2+ instances + VRRP/LB          1+ nodes can fail

Lesson 3: Updates, Maintenance & Backups

Proper procedures for updates and comprehensive backups are essential for disaster recovery.

Host OS Updates (Worker Nodes)

Before updating an operating system, ensure the node's running applications have more than one replica to maintain availability during maintenance.

# Step 1: Verify replicas are running on other nodes
kubectl get pods -o wide | grep my-app

# Step 2: Cordon the node (prevent new pods from scheduling)
kubectl cordon worker-node-1

# Step 3: Drain the node (safely evacuate all pods)
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
# Output shows pods being evicted and rescheduled to other nodes

# Step 4: Perform maintenance (SSH to the node)
ssh worker-node-1
sudo apt update && sudo apt upgrade -y
sudo reboot

# Step 5: After reboot, uncordon the node
kubectl uncordon worker-node-1
# Pods will naturally rebalance during future scheduling
Common Mistake: Running kubectl drain without checking if applications have sufficient replicas on other nodes can cause downtime. Always verify replica distribution first!
Pod Disruption Budgets: We'll cover PDBs in the next lesson, but they prevent kubectl drain from causing downtime by blocking the drain if it would violate minimum availability requirements.

Cluster Version Updates

Critical Rules for Cluster Updates:
  • Always test the update process first on a dedicated development/staging cluster that mirrors production
  • Always read the changelog (release notes) between versions
  • Key names and behavior can change between versions
  • Never skip minor versions (upgrade sequentially: 1.25 → 1.26 → 1.27)
  • Back up everything before starting
# Example: Upgrading kubeadm cluster from 1.27 to 1.28

# 1. Read release notes
# https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.28.md

# 2. Upgrade control plane (first master node)
sudo apt-mark unhold kubeadm && \
  sudo apt-get update && sudo apt-get install -y kubeadm=1.28.0-00 && \
  sudo apt-mark hold kubeadm

# Plan the upgrade
sudo kubeadm upgrade plan

# Apply the upgrade
sudo kubeadm upgrade apply v1.28.0

# 3. Upgrade kubelet and kubectl on master
sudo apt-mark unhold kubelet kubectl && \
  sudo apt-get update && sudo apt-get install -y kubelet=1.28.0-00 kubectl=1.28.0-00 && \
  sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# 4. Repeat for other master nodes
# sudo kubeadm upgrade node

# 5. Upgrade worker nodes (one at a time)
kubectl drain worker-node-1 --ignore-daemonsets
# SSH to worker, upgrade kubeadm, kubelet, kubectl
# sudo kubeadm upgrade node
kubectl uncordon worker-node-1
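The "never skip minor versions" rule can be enforced by a small pre-flight check. This is a hypothetical helper, not part of kubeadm; it only compares minor versions of `major.minor` strings:

```shell
# Hypothetical pre-flight check: allow only the next sequential minor version.
can_upgrade() {
  local cur_minor=${1#*.} target_minor=${2#*.}
  [ $(( target_minor - cur_minor )) -eq 1 ]
}

can_upgrade 1.27 1.28 && echo "ok: 1.27 -> 1.28"
can_upgrade 1.25 1.27 || echo "blocked: 1.25 -> 1.27 skips 1.26"
```

Wiring a check like this into your upgrade runbook or CI pipeline turns a documented rule into an enforced one.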

Backups - Critical for Disaster Recovery

Backups are critical. Kubernetes manifests are often already in Git, but Git alone does not capture all cluster state, so back up each of the following.

What to Back Up

1. Manifests & Helm Charts

Method: Git repository (GitOps)

Store all YAML manifests, Helm charts, and Kustomize configurations in version control. This provides history and rollback capability.

git clone git@github.com:company/k8s-manifests.git
cd k8s-manifests
git log --oneline   # View deployment history

2. Secrets & Certificates

Method: Encrypted backups or secret management tools

Secrets contain sensitive data (passwords, API keys, TLS certs). Never store in plain text Git!

# Export secrets (for backup only, encrypt the result!)
kubectl get secrets --all-namespaces -o yaml > secrets-backup.yaml

# Better: Use secret management
# - Sealed Secrets
# - External Secrets Operator
# - HashiCorp Vault

3. Container Images

Method: Backup the container registry

If your registry goes down, you can't deploy or scale. Back up registry storage or use a mirrored/replicated registry.

# Registry backup strategies:
# 1. S3/Object storage replication
# 2. Harbor registry replication
# 3. Mirror to multiple registries
# 4. Export critical images
docker save myapp:v1.0 | gzip > myapp-v1.0.tar.gz

4. Persistent Volumes / Stateful Data

Method: Volume snapshots and database backups

Application data in databases, file storage, etc. This is often the most critical backup.

# CSI Volume Snapshots (covered in storage lesson)
kubectl create -f snapshot.yaml

# Database-specific backups
kubectl exec postgres-0 -- pg_dump mydb > backup.sql

# Velero for automated backups (see below)

Velero - Complete Backup Solution

Recommended Tool: For a robust solution, explore Velero, a tool designed specifically for backing up and restoring Kubernetes cluster resources and persistent volumes.
# Install Velero CLI
brew install velero   # or download from GitHub

# Install Velero in cluster (AWS S3 example)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

# Create a backup of entire cluster
velero backup create full-backup-2024-01-15

# Create a backup of specific namespace
velero backup create prod-backup --include-namespaces=production

# Schedule automatic daily backups
velero schedule create daily-backup --schedule="0 2 * * *"

# Restore from backup
velero restore create --from-backup full-backup-2024-01-15

# List backups
velero backup get
Velero Features:
  • Backs up entire cluster or specific namespaces
  • Supports persistent volume snapshots (CSI)
  • Scheduled automated backups
  • Disaster recovery and cluster migration
  • Works with AWS, GCP, Azure, and on-premises (S3-compatible)

Lesson 4: Minimizing Human Error - Essential Kubernetes Features

Built-in Kubernetes features can help increase security and minimize the impact of human error.

1. Pod Disruption Budget (PDB)

A PDB limits the number of concurrent voluntary disruptions (like kubectl drain), ensuring critical applications maintain a minimum required replica count during maintenance.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2   # At least 2 pods must remain available
  selector:
    matchLabels:
      app: web-app
---
# Alternative: maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-pdb
spec:
  maxUnavailable: 1   # At most 1 pod can be unavailable
  selector:
    matchLabels:
      app: database
How PDB Works: If draining a node would cause the application's available replicas to drop below the PDB's limit, the kubectl drain action is blocked or paused until more replicas are available on other nodes.
Important Limitation: PDBs only apply to voluntary disruptions (administrative actions). They do NOT protect against involuntary disruptions like node failures, network partitions, or application crashes. They also don't prevent manual kubectl delete pod.
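The accounting behind a minAvailable PDB is simple enough to sketch: allowed disruptions equal healthy pods minus minAvailable, floored at zero. Illustrative numbers only, not real API output:

```shell
# Illustration of PDB accounting during a drain (not a real kubectl call):
# allowedDisruptions = healthy - minAvailable (never below 0).
healthy=3; min_available=2
allowed=$(( healthy - min_available ))
if [ "$allowed" -lt 0 ]; then allowed=0; fi
echo "allowed disruptions: $allowed"   # 1 pod may be evicted right now

# After one eviction, the drain must wait until a replacement is Ready:
healthy=2
allowed=$(( healthy - min_available ))
echo "allowed disruptions: $allowed"   # 0: further evictions are blocked
```

This is why a drain can appear to "hang": it is the PDB correctly refusing evictions until replacement pods become Ready elsewhere.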

2. LimitRange and ResourceQuota

These two features work together at the namespace level to enforce resource management, which is critical for cluster stability.

LimitRange - Default Resources

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limit-range
  namespace: development
spec:
  limits:
  - default:          # Default limits if not specified
      cpu: 500m
      memory: 512Mi
    defaultRequest:   # Default requests if not specified
      cpu: 100m
      memory: 128Mi
    max:              # Maximum allowed
      cpu: 2
      memory: 2Gi
    min:              # Minimum allowed
      cpu: 50m
      memory: 64Mi
    type: Container
Disaster Prevention: If a developer deploys a Pod without resource specifications, the LimitRange automatically injects defined defaults, preventing the Pod from potentially consuming all available node resources (a "runaway process").

ResourceQuota - Namespace Limits

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: development
spec:
  hard:
    # Compute resources
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    # Object counts
    pods: "50"
    services: "10"
    configmaps: "20"
    persistentvolumeclaims: "10"
    secrets: "20"
    # Storage
    requests.storage: 100Gi
Disaster Prevention: If an operator accidentally scales a deployment from 3 to 300 replicas, the Quota blocks creation of new Pods once the namespace limit (e.g., 50 pods) is hit, protecting the rest of the cluster from resource exhaustion.
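The admission arithmetic in that scenario looks like this (illustrative numbers, not real quota output):

```shell
# Illustration: pod-count quota admission during an accidental scale-up
# from 3 to 300 replicas with a hard limit of 50 pods in the namespace.
hard_limit=50    # pods: "50" in the ResourceQuota
current=3        # pods already running in the namespace
requested=297    # new pods needed to reach 300 replicas

headroom=$(( hard_limit - current ))
admitted=$(( requested < headroom ? requested : headroom ))
echo "admitted: $admitted, rejected: $(( requested - admitted ))"
# admitted: 47, rejected: 250 -- the excess pods are simply never created
```

The deployment reports unfulfilled replicas, but the rest of the cluster keeps running.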

3. PriorityClass - Critical Workload Protection

PriorityClass is used by the Scheduler to make decisions during resource contention, especially when a cluster is under stress.

# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-priority
value: 1000000        # Higher number = higher priority
globalDefault: false
description: "Critical production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
description: "Important production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 10000
description: "Staging and QA workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: true
description: "Development and batch jobs"
---
# Use in Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: critical-priority
  containers:
  - name: app
    image: myapp:v1
Disaster Prevention: In a resource-constrained situation (like a partial node failure), the Scheduler prioritizes the highest-priority Pods. If necessary, the Scheduler can preempt (kill) lower-priority Pods to free up resources so high-priority Pods (production apps) can be scheduled and remain operational.
Real-World Example: During a node failure, the cluster has limited capacity. The Scheduler will kill development/staging Pods (low priority) to ensure production database and API Pods (critical priority) stay running.
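The preemption ordering can be illustrated with plain sorting: candidates are considered from the lowest priority value upward. Toy data below, not real scheduler output:

```shell
# Toy illustration: preemption considers the lowest-priority pods first.
# Each line is "<priority-value> <pod-name>"; sort numerically, take the head.
printf '%s\n' \
  '1000000 prod-db-0' \
  '100000 prod-api-7c9' \
  '1000 dev-batch-x2x' \
  '10000 staging-web-9k1' \
  | sort -n | head -n 1 | awk '{print "first preemption candidate:", $2}'
# first preemption candidate: dev-batch-x2x
```

The real scheduler weighs more factors (PDBs, affinity, node fit), but priority value sets the basic pecking order.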

4. NetworkPolicy - Internal Firewall

NetworkPolicy acts as an internal firewall, controlling network traffic flow between different pods and namespaces.

# Default deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}   # Apply to all pods in namespace
  policyTypes:
  - Ingress
---
# Allow specific traffic: frontend → backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
---
# Allow database access only from backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: backend
    ports:
    - protocol: TCP
      port: 5432
Security Benefit: Even if an attacker compromises a frontend Pod, they can't directly access the database because NetworkPolicies block that traffic. They can only communicate through allowed paths.

Lesson 5: Debugging & Recovery When Cluster Breaks

When an outage occurs, follow a systematic debugging approach to identify and resolve issues quickly.

The Debugging Workflow

Step 1: Check Object Status

Start with high-level Kubernetes object inspection

# Get overview of all resources
kubectl get all -n production

# Check pod status
kubectl get pods -n production

# Describe specific pod for events
kubectl describe pod my-app-xyz -n production

# View recent events in namespace
kubectl get events -n production --sort-by='.lastTimestamp'

# Check pod logs
kubectl logs my-app-xyz -n production

# Previous container logs (if pod restarted)
kubectl logs my-app-xyz -n production --previous

Step 2: Check Lower Levels (Container & Host)

Dig deeper into container runtime and node-level issues

# Check node status
kubectl get nodes
kubectl describe node worker-node-1

# SSH to node and check kubelet logs
ssh worker-node-1
sudo journalctl -u kubelet -f

# Check container runtime
sudo crictl ps
sudo crictl logs <container-id>

# Check system resources
top
df -h
free -m

# Check network connectivity
ping 8.8.8.8
curl https://kubernetes.default.svc.cluster.local

Step 3: Use Debugging Tools

Leverage kubectl debug and ephemeral containers

# Create debugging container attached to pod
kubectl debug my-app-xyz -it --image=nicolaka/netshoot

# Inside debug container, you have networking tools:
# - nslookup (DNS debugging)
# - curl (HTTP testing)
# - netstat (network connections)
# - tcpdump (packet capture)
# - ping, traceroute, etc.

# Debug by creating a copy of the pod
kubectl debug my-app-xyz --copy-to=my-app-debug --container=app

# Debug node issues
kubectl debug node/worker-node-1 -it --image=ubuntu

Common Issues and Solutions

Symptom                    Possible Cause                                     Debugging Steps
Pods stuck in Pending      Insufficient resources or scheduling constraints   kubectl describe pod; check events for scheduling failures
Pods in CrashLoopBackOff   Application error or misconfiguration              kubectl logs; check application logs for errors
Pods in ImagePullBackOff   Can't pull container image                         Check image name, registry credentials, network connectivity
Service not accessible     Networking issue or misconfigured service          kubectl get endpoints; verify endpoints exist
High latency or timeouts   Resource exhaustion or network issues              Check node resources, DNS resolution, network policies

Complete Debugging Example

# Scenario: Web app is down, users getting 502 errors

# 1. Check pod status
kubectl get pods -l app=web-app
# NAME          READY   STATUS             RESTARTS   AGE
# web-app-abc   0/1     CrashLoopBackOff   5          3m

# 2. Check logs
kubectl logs web-app-abc
# Error: Cannot connect to database at db-service:5432

# 3. Check database pods
kubectl get pods -l app=database
# NAME         READY   STATUS    RESTARTS   AGE
# database-0   1/1     Running   0          10m

# 4. Check service and endpoints
kubectl get svc db-service
kubectl get endpoints db-service
# NAME         ENDPOINTS
# db-service   10.244.1.15:5432

# 5. Test connectivity from web app pod
kubectl debug web-app-abc -it --image=nicolaka/netshoot
# Inside debug container:
nslookup db-service       # Success - DNS works
telnet 10.244.1.15 5432   # Connection refused!

# 6. Check NetworkPolicy
kubectl get networkpolicy
# Found restrictive policy blocking web-app → database

# 7. Fix NetworkPolicy or add exception
kubectl apply -f allow-web-to-db-policy.yaml

# 8. Verify fix
kubectl logs web-app-abc
# Successfully connected to database

Disaster Recovery Checklist Summary

Complete DR Checklist:
  • High Availability: 3+ etcd, multiple API servers with LB, redundant Ingress
  • Certificate Management: Use kubeadm, automate renewal, monitor expiration
  • Maintenance Procedures: Proper drain/uncordon workflow, test updates on staging
  • Comprehensive Backups: Git for manifests, Velero for cluster state, database backups
  • Resource Management: LimitRange defaults, ResourceQuota limits
  • Disruption Protection: PodDisruptionBudgets for critical apps
  • Priority Management: PriorityClasses for workload importance
  • Network Security: NetworkPolicies for traffic control
  • Monitoring & Alerting: Prometheus, Grafana, log aggregation
  • Documentation: Architecture diagrams, runbooks, incident procedures
  • Testing: Regular DR drills, chaos engineering (optional)
Final Advice: Don't wait for a disaster to happen. Implement these practices proactively, test your recovery procedures regularly, and continuously improve your cluster's resilience based on lessons learned from incidents.

Final Quiz

Test your knowledge of Kubernetes Disaster Recovery!

Question 1: Why should etcd instances always be an odd number?

a) For better performance
b) To maintain quorum with fault tolerance (e.g., 3 instances tolerate 1 failure)
c) To use less memory
d) It's a Kubernetes requirement for any component

Question 2: What is the primary purpose of Pod Disruption Budget (PDB)?

a) To prevent all pod deletions
b) To limit voluntary disruptions ensuring minimum replicas during maintenance
c) To allocate budget for pod resources
d) To protect against node failures

Question 3: What does LimitRange accomplish in a namespace?

a) Limits the number of namespaces
b) Sets default resource requests/limits for pods preventing resource exhaustion
c) Restricts network traffic
d) Limits the number of users

Question 4: What happens when a PriorityClass is used during resource scarcity?

a) All pods are treated equally
b) Scheduler can preempt (kill) lower-priority pods to schedule higher-priority ones
c) Pods are randomly selected for eviction
d) The cluster shuts down

Question 5: What should you always do before performing cluster version updates?

a) Update production first to find issues quickly
b) Test on staging cluster and read changelog between versions
c) Skip patch versions to save time
d) Update all nodes simultaneously

Question 6: What is Velero used for?

a) Container image building
b) Backing up and restoring Kubernetes cluster resources and persistent volumes
c) Network policy enforcement
d) Pod autoscaling

Question 7: What does kubectl drain do?

a) Deletes the node permanently
b) Safely evacuates all pods from a node for maintenance
c) Drains node memory
d) Removes all logs from the node

Question 8: What is the purpose of NetworkPolicy?

a) To configure external load balancers
b) To act as internal firewall controlling traffic between pods and namespaces
c) To manage DNS records
d) To allocate IP addresses
Quiz Complete!
All correct answers are option 'b'. Review the lessons above to understand why these are the best answers.