Kubernetes Networking & Fault Tolerance

Services, kube-proxy, and High Availability

Lesson 1 of 4

Kubernetes Service Abstraction

The Networking Challenge

In Kubernetes, Pods are ephemeral—they can be created, destroyed, and replaced at any time. Each Pod gets its own IP address, but these IPs change when Pods are recreated. This creates a fundamental problem:

The Pod IP Problem

  • Pod IPs are not stable—they change on restart
  • Multiple Pod replicas each have different IPs
  • Applications can't reliably connect to Pods directly
  • Load balancing across Pods requires external logic

What is a Service?

A Service is a Kubernetes abstraction that provides a stable, single entry point for an application within the cluster.

Service Purpose

The Service object acts as a stable network endpoint that:

  • Organizes a group of Pods: Selects Pods using labels
  • Provides a stable IP: ClusterIP remains constant
  • Load balances traffic: Distributes requests across Pod replicas
  • Enables service discovery: DNS name for the Service

Service Architecture

Service to Pods Mapping

Service: "webapp"
ClusterIP: 10.96.100.50
Port: 80
Selector: app=webapp

Load Balances Traffic To:
Pod 1: IP 10.244.1.5 | Label: app=webapp
Pod 2: IP 10.244.2.8 | Label: app=webapp
Pod 3: IP 10.244.3.12 | Label: app=webapp

How Services Work

1. Label Selectors

Services use label selectors to identify which Pods belong to them:

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  selector:
    app: webapp        # Selects all Pods with label app=webapp
  ports:
    - protocol: TCP
      port: 80         # Service port
      targetPort: 8080 # Pod container port
  type: ClusterIP      # Default type

2. Stable Entry Point

When the Service is created, it receives a stable ClusterIP:

# Create the Service
kubectl apply -f service.yaml

# Service gets a stable IP that never changes
kubectl get service webapp
NAME     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
webapp   ClusterIP   10.96.100.50   <none>        80/TCP    5m

# Other Pods can now connect to:
# - IP:  10.96.100.50:80
# - DNS: webapp.default.svc.cluster.local

3. Automatic Load Balancing

Traffic to the Service IP is automatically distributed across all matching Pods:

# Request to Service:
curl http://10.96.100.50

# Kubernetes automatically routes to one of:
# - Pod 1: 10.244.1.5:8080
# - Pod 2: 10.244.2.8:8080
# - Pod 3: 10.244.3.12:8080

# Next request might go to a different Pod (load balanced)

Service Types

1. ClusterIP (Default)

Exposes the Service on an internal IP within the cluster:

  • Only accessible from within the cluster
  • Most common type for internal services
  • Provides stable internal endpoint

2. NodePort

Exposes the Service on each Node's IP at a static port:

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: NodePort
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080  # Accessible on all nodes at this port
  • Accessible from outside cluster via <NodeIP>:30080
  • Port range: 30000-32767 (default)
  • Good for development/testing

3. LoadBalancer

Creates an external load balancer (cloud provider):

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: LoadBalancer
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
  • Works with cloud providers (AWS, GCP, Azure)
  • Gets external IP address
  • Best for production external access

Service Discovery

Kubernetes provides automatic DNS for Services:

# Service DNS format:
# <service-name>.<namespace>.svc.cluster.local

# Examples:
# webapp.default.svc.cluster.local
# database.production.svc.cluster.local

# Within the same namespace, just use the service name:
curl http://webapp

# From a different namespace:
curl http://webapp.default
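The DNS name is just the concatenation of the Service name, its namespace, and the cluster domain suffix. A quick shell sketch, assuming the default cluster domain cluster.local:

```shell
# Build a Service FQDN from its parts (default cluster domain assumed)
service=webapp
namespace=default
cluster_domain=cluster.local

fqdn="${service}.${namespace}.svc.${cluster_domain}"
echo "$fqdn"
# → webapp.default.svc.cluster.local
```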

Endpoints

Kubernetes automatically creates an Endpoints object that tracks Pod IPs:

# View Service endpoints
kubectl get endpoints webapp
NAME     ENDPOINTS                                           AGE
webapp   10.244.1.5:8080,10.244.2.8:8080,10.244.3.12:8080    5m

# When a Pod dies and is replaced:
# - Old Pod IP is removed from endpoints
# - New Pod IP is automatically added
# - Service continues to work seamlessly

Key Takeaways: Services

  • Services provide stable entry points for ephemeral Pods
  • Use label selectors to identify backend Pods
  • Automatically load balance traffic across Pod replicas
  • ClusterIP provides internal stability; LoadBalancer provides external access
  • DNS enables service discovery by name
Lesson 2 of 4

Traffic Management with kube-proxy

The Role of kube-proxy

The kube-proxy component is the key to implementing Service networking in Kubernetes. It runs on every node and is responsible for generating the rules that direct traffic to the correct Pods.

kube-proxy Responsibilities

  • Watch Services: Monitor the API Server for Service changes
  • Watch Endpoints: Track Pod IPs behind each Service
  • Generate rules: Create iptables or IPVS rules for routing
  • Route traffic: Direct requests to appropriate Pods
  • Load balance: Distribute traffic across replicas

How kube-proxy Works

1. Service Created
User creates Service with ClusterIP 10.96.100.50:80
2. kube-proxy Detects
kube-proxy on each node watches API Server, sees new Service
3. Generate Rules
kube-proxy creates iptables/IPVS rules on the node
4. Traffic Routing
Rules intercept traffic to 10.96.100.50:80 and route to Pod IPs
5. Load Balancing
Rules distribute traffic across all backend Pods

Implementation: iptables Mode

By default, kube-proxy uses iptables to implement Services. This is the most common mode.

How iptables Rules Work

# When a Service is created with ClusterIP 10.96.100.50:80,
# kube-proxy creates iptables rules:

1. PREROUTING Chain:
   "If traffic is destined for 10.96.100.50:80, jump to the custom chain"

2. Custom Chain (KUBE-SVC-WEBAPP):
   "Randomly select one of the backend Pod rules"

3. Backend Pod Rules:
   - 33% of traffic → DNAT to 10.244.1.5:8080
   - 33% of traffic → DNAT to 10.244.2.8:8080
   - 33% of traffic → DNAT to 10.244.3.12:8080

Result: Traffic is load balanced across the 3 Pods

Complete Traffic Flow

Application makes request
curl http://10.96.100.50:80
iptables intercepts packet
Recognizes traffic to ClusterIP 10.96.100.50:80
Jumps to Service chain
Routes to KUBE-SVC-WEBAPP chain
Load balancing decision
Randomly selects one Pod (e.g., Pod 2)
DNAT (Destination NAT)
Rewrites packet destination to 10.244.2.8:8080
Packet delivered to Pod
Pod processes request and sends response

Viewing iptables Rules

You can inspect the actual iptables rules created by kube-proxy:

# View all iptables NAT rules (on a node)
sudo iptables -t nat -L -n -v

# View Service-specific rules
sudo iptables -t nat -L KUBE-SERVICES -n

# Example output (abbreviated):
Chain KUBE-SERVICES
target        prot  opt  source     destination
KUBE-SVC-ABC  tcp   --   0.0.0.0/0  10.96.100.50  /* webapp */

Chain KUBE-SVC-ABC
target      prot  opt  source     destination
KUBE-SEP-1  all   --   0.0.0.0/0  0.0.0.0/0  /* webapp -> pod1 */ statistic mode random probability 0.33
KUBE-SEP-2  all   --   0.0.0.0/0  0.0.0.0/0  /* webapp -> pod2 */ statistic mode random probability 0.50
KUBE-SEP-3  all   --   0.0.0.0/0  0.0.0.0/0  /* webapp -> pod3 */
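The probabilities 0.33 and 0.50 in the example output are not a mistake: the rules are evaluated in order, so each probability applies only to traffic that earlier rules passed over, and the last rule matches unconditionally. A quick check that the cascade still gives each Pod an equal one-third share:

```shell
# Cascading rule probabilities:
#   P(pod1) = 1/3
#   P(pod2) = (1 - 1/3) * 1/2
#   P(pod3) = (1 - 1/3) * (1 - 1/2) * 1   (final rule, no probability)
awk 'BEGIN {
  p1 = 1.0 / 3
  p2 = (1 - p1) * 0.5
  p3 = (1 - p1) * 0.5
  printf "pod1=%.4f pod2=%.4f pod3=%.4f\n", p1, p2, p3
}'
# → pod1=0.3333 pod2=0.3333 pod3=0.3333
```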

Implementation: IPVS Mode

For larger clusters, IPVS (IP Virtual Server) provides better performance:

IPVS vs. iptables

Aspect           iptables                           IPVS
Performance      Good for small clusters            Better for large clusters (1000+ Services)
Load Balancing   Random selection only              Multiple algorithms (round-robin, least-connection, etc.)
Rule Updates     O(n) - slower with many Services   O(1) - constant time
Complexity       More complex rule chains           Simpler, kernel-level load balancing
Default          Yes                                No (opt-in)

Enable IPVS Mode

# Configure kube-proxy to use IPVS
kubectl edit configmap kube-proxy -n kube-system

# Change the mode field:
mode: "ipvs"

# Restart kube-proxy pods so they pick up the new config
kubectl delete pod -n kube-system -l k8s-app=kube-proxy

ClusterIP Assignment

When a Service is created, Kubernetes assigns it a ClusterIP from a predefined range:

# ClusterIPs are assigned from the Service CIDR range,
# configured during cluster setup (e.g., 10.96.0.0/12)

# View the Service CIDR:
kubectl cluster-info dump | grep -i service-cluster-ip-range

# Example:
# --service-cluster-ip-range=10.96.0.0/12

# This means:
# - Services get IPs from 10.96.0.0 to 10.111.255.255
# - A ClusterIP is stable for the Service's lifetime
# - The IP is virtual (no actual interface has this IP)
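The arithmetic behind the 10.96.0.0 to 10.111.255.255 range can be checked directly in the shell:

```shell
# A /12 prefix leaves 32 - 12 = 20 host bits
host_bits=$(( 32 - 12 ))
echo "Addresses in 10.96.0.0/12: $(( 1 << host_bits ))"

# 20 host bits = 4 bits in the second octet plus all of octets 3 and 4,
# so the second octet spans 96 .. 96 + 2^4 - 1 = 111:
echo "Last Service IP: 10.$(( 96 + (1 << 4) - 1 )).255.255"
# → Addresses in 10.96.0.0/12: 1048576
# → Last Service IP: 10.111.255.255
```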

Traffic Distribution Details

Session Affinity

By default, each request can go to any Pod. For sticky sessions:

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
  sessionAffinity: ClientIP   # Same client goes to same Pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600    # 1 hour sticky session

External Traffic Policy

Control how external traffic is routed:

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local      # Only route to Pods on the same node
  # vs
  # externalTrafficPolicy: Cluster  # Route to any Pod (default)

Important Considerations

  • iptables rules: Updated every time Pods or Services change
  • kube-proxy must run: Without it, Services won't work
  • Virtual IP: ClusterIP doesn't exist on any interface
  • Node-local: Each node has its own iptables rules

Key Takeaways: kube-proxy

  • kube-proxy generates routing rules on each node
  • Uses iptables or IPVS to implement Services
  • Intercepts traffic to ClusterIP and routes to Pod IPs
  • Creates chains that distribute traffic across Pods
  • IPVS is more efficient for large clusters
Lesson 3 of 4

Fault Tolerance: Node Failure

Building Fault-Tolerant Clusters

Kubernetes is designed to maintain availability even when nodes fail. Understanding how the cluster handles failures is crucial for running production workloads.

Fault Tolerance Goals

  • Detect failures: Quickly identify when nodes stop responding
  • Minimize disruption: Give nodes time to recover
  • Recover gracefully: Reschedule Pods to healthy nodes
  • Maintain availability: Keep applications running

Node Failure Detection

The node-controller (part of the Controller Manager) continuously monitors node health:

# Node controller responsibilities:
1. Monitor node status (via kubelet heartbeats)
2. Mark nodes as Ready/NotReady
3. Evict Pods from failed nodes
4. Update node conditions

# Kubelet sends a heartbeat every 10 seconds (default)
# If the node-controller stops receiving heartbeats:
# → Node is marked "Unknown" or "NotReady"

Node Failure Timeline

When a worker node stops responding, Kubernetes follows a controlled process:

T = 0s
Node Stops Responding
Worker node crashes or network disconnects
Kubelet stops sending heartbeats
T = 40s
Initial Grace Period Expires
node-controller waits 40 seconds (default) for kubelet to return
This prevents false positives from temporary network issues
T = 40s
Node Marked NotReady
node-controller marks node status as NotReady
Node no longer receives new Pod assignments
T = 40s to 5m 40s
Pod Eviction Timeout
node-controller waits for PodEvictionTimeout (5 minutes default)
Gives node time to recover and reconnect
T = 5m 40s
Pod Eviction Begins
node-controller evicts (removes) Pods from the failed node
Deployment/ReplicaSet controllers create replacement Pods
New Pods scheduled to healthy nodes
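The T = 5m 40s figure at the end of the timeline is simply the grace period plus the eviction timeout. With the default values:

```shell
# Worst-case delay from node failure to Pod eviction, using defaults
node_monitor_grace_period=40   # seconds (--node-monitor-grace-period)
pod_eviction_timeout=300       # seconds (--pod-eviction-timeout, 5m)

total=$(( node_monitor_grace_period + pod_eviction_timeout ))
echo "Failure-to-eviction delay: ${total}s ($(( total / 60 ))m $(( total % 60 ))s)"
# → Failure-to-eviction delay: 340s (5m 40s)
```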

Configuration Parameters

These timings can be customized based on your availability requirements:

# kube-controller-manager flags:

--node-monitor-period=5s
    # How often to check node status (default: 5s)

--node-monitor-grace-period=40s
    # Grace period before marking a node NotReady (default: 40s)
    # This is the "Initial Grace Period"

--pod-eviction-timeout=5m
    # Wait time before evicting Pods from a NotReady node (default: 5m)
    # This is the "PodEvictionTimeout"

Customizing for Different Scenarios

Scenario            Grace Period   Eviction Timeout   Reasoning
Default             40 seconds     5 minutes          Balanced approach
Fast Recovery       20 seconds     2 minutes          Minimize downtime, accept false positives
Stable Network      60 seconds     10 minutes         Allow more time for recovery
Unreliable Network  60 seconds     15 minutes         Prevent unnecessary evictions

Pod Eviction Process

1. Node Becomes NotReady
node-controller marks node status as NotReady
2. Wait for Eviction Timeout
Wait 5 minutes (default) for node to recover
3. Eviction Triggered
node-controller decides to evict Pods
4. Pods Marked for Deletion
API Server marks Pods as Terminating
5. Controllers Detect Missing Replicas
ReplicaSet/Deployment controllers see fewer than desired replicas
6. Replacement Pods Created
Controllers create new Pods to replace evicted ones
7. Scheduler Assigns to Healthy Nodes
New Pods scheduled to available, healthy nodes
8. Pods Start Running
Application restored on healthy nodes

Monitoring Node Status

# View node status
kubectl get nodes
NAME     STATUS     ROLES    AGE   VERSION
node-1   Ready      worker   10d   v1.25.0
node-2   NotReady   worker   10d   v1.25.0   ← Failed node
node-3   Ready      worker   10d   v1.25.0

# Detailed node information
kubectl describe node node-2
Conditions:
  Type            Status   Reason
  ----            ------   ------
  Ready           False    NodeStatusUnknown
  MemoryPressure  Unknown
  DiskPressure    Unknown

# View Pods on the failed node
kubectl get pods -o wide | grep node-2
# These Pods will be evicted after the timeout

What Happens to Different Workloads

1. Deployment/ReplicaSet Pods

Automatically Recovered

  • Evicted from failed node
  • ReplicaSet controller creates replacements
  • New Pods scheduled to healthy nodes
  • Desired replica count maintained

2. StatefulSet Pods

Special Handling

  • Not automatically evicted (to prevent data loss)
  • Require manual intervention or node deletion
  • Use PodDisruptionBudgets carefully

3. DaemonSet Pods

Node-Specific

  • Evicted when node fails
  • Recreated when node recovers
  • One Pod per node by design

Best Practices for Fault Tolerance

Production Recommendations

  • Multiple replicas: Always run 2+ replicas of critical apps
  • Pod anti-affinity: Spread Pods across different nodes
  • Health checks: Configure readiness/liveness probes
  • PodDisruptionBudgets: Ensure minimum availability during disruptions
  • Monitor node health: Alert on NotReady nodes
  • Tune timeouts: Adjust based on network reliability
  • Regular testing: Test failure scenarios (chaos engineering)
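As a sketch of the first four recommendations combined, a Deployment that spreads replicas across nodes plus a matching PodDisruptionBudget might look like the following. The names, image, and probe path are illustrative, not from the original lesson:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3                      # multiple replicas of the critical app
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:           # spread Pods across different nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: webapp
              topologyKey: kubernetes.io/hostname
      containers:
        - name: webapp
          image: webapp:1.0        # illustrative image name
          readinessProbe:          # health check before receiving traffic
            httpGet:
              path: /healthz       # illustrative probe path
              port: 8080
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp-pdb
spec:
  minAvailable: 2                  # keep at least 2 Pods during disruptions
  selector:
    matchLabels:
      app: webapp
```

With required anti-affinity, the scheduler refuses to place two webapp Pods on the same node, so a single node failure can take down at most one replica.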

Important Warnings

  • Network partitions: Node might still be running but unreachable
  • Split-brain risk: Same Pod might run on both old and new nodes briefly
  • Storage concerns: PersistentVolumes might still be attached to old node
  • Graceful shutdown: Evicted Pods don't receive SIGTERM (forced termination)
Lesson 4 of 4

Fault Tolerance: etcd Recovery

etcd in High Availability

etcd is the single source of truth for your cluster. Understanding how it handles failures is critical for cluster reliability.

etcd Cluster Requirements

  • Odd number of nodes: 3, 5, or 7 members (for quorum)
  • Quorum-based: Uses Raft consensus algorithm
  • Fault tolerance: Can survive (n-1)/2 failures
  • 3-node cluster: Survives 1 failure
  • 5-node cluster: Survives 2 failures
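The arithmetic behind these numbers can be tabulated directly: quorum is a majority, floor(n/2) + 1, and the tolerated failure count is whatever is left over:

```shell
# Quorum (majority) and tolerated failures for common etcd cluster sizes
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated_failures=$tolerated"
done
# → members=1 quorum=1 tolerated_failures=0
# → members=3 quorum=2 tolerated_failures=1
# → members=5 quorum=3 tolerated_failures=2
# → members=7 quorum=4 tolerated_failures=3
```

Note that 4 members tolerate no more failures than 3 (quorum 3, one failure), which is why even member counts are avoided.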

etcd Quorum and Consensus

How Raft Consensus Works

# 3-node etcd cluster:
Cluster: etcd-1, etcd-2, etcd-3
Quorum needed: 2 out of 3 nodes

Scenario 1 - All healthy:
  ✓ etcd-1: Available
  ✓ etcd-2: Available
  ✓ etcd-3: Available
  Status: Cluster operational (3/3, quorum achieved)

Scenario 2 - One node fails:
  ✓ etcd-1: Available
  ✓ etcd-2: Available
  ✗ etcd-3: FAILED
  Status: Cluster operational (2/3, quorum achieved)

Scenario 3 - Two nodes fail:
  ✓ etcd-1: Available
  ✗ etcd-2: FAILED
  ✗ etcd-3: FAILED
  Status: Cluster DOWN (1/3, no quorum)

Automatic etcd Recovery

When an etcd node fails and then returns, Kubernetes handles recovery automatically:

Self-Healing etcd

If an etcd node fails for a period and then returns, the system automatically handles resynchronization of data:

  1. Failed etcd node comes back online
  2. Raft leader detects returning member
  3. Leader streams missing data to returning member
  4. Returning member catches up to current state
  5. Cluster returns to full health automatically

etcd Recovery Timeline

T = 0: etcd-2 Fails
One etcd member becomes unavailable
Cluster still has 2/3 nodes (quorum maintained)
T = 0 to Recovery: Cluster Continues
Remaining nodes (etcd-1, etcd-3) continue operating
All writes succeed (quorum present)
Missing member falls behind in log
T = Recovery: etcd-2 Returns
Failed member comes back online
Rejoins cluster automatically
Automatic Resynchronization
Leader streams missing log entries to etcd-2
etcd-2 replays operations to catch up
No manual intervention required
Full Recovery
etcd-2 fully synchronized
Cluster returns to 3/3 healthy members

etcd Failure Scenarios

Scenario 1: Single Member Failure (Recoverable)

# 3-member cluster, 1 fails

Before failure:
  etcd-1: Leader
  etcd-2: Follower   ← Fails
  etcd-3: Follower

During failure:
  ✓ Cluster continues (2/3 quorum)
  ✓ API Server works normally
  ✓ All operations succeed
  - Reduced fault tolerance (cannot survive another failure)

After recovery:
  ✓ etcd-2 returns
  ✓ Automatically resynchronizes
  ✓ Full fault tolerance restored

Scenario 2: Leader Failure (Auto-Recovery)

# Leader election happens automatically

Before:
  etcd-1: Leader   ← Fails
  etcd-2: Follower
  etcd-3: Follower

During failure:
  1. Followers detect leader loss
  2. New election is triggered
  3. etcd-2 or etcd-3 becomes leader
  4. Cluster continues operating
  Total downtime: ~1 second

After:
  etcd-2: New Leader
  etcd-3: Follower
  (etcd-1 will catch up when it returns)

Scenario 3: Quorum Loss (Cluster Down)

Critical Failure

# 3-member cluster, 2 fail simultaneously
  etcd-1: Available
  etcd-2: FAILED
  etcd-3: FAILED

Result:
  ✗ No quorum (1 of 3 members up, 2 required)
  ✗ Cluster cannot accept writes
  ✗ API Server is effectively read-only
  ✗ No new operations possible

Recovery:
  Restore at least one failed member,
  or recover from backup

DO NOT Manually Modify etcd

Critical Warning

Manually modifying etcd data directly is strongly discouraged!

  • The cluster is designed to self-heal
  • Direct etcd modifications can cause inconsistencies
  • May break cluster state permanently
  • Bypasses Kubernetes validation and admission control
  • Can lead to unpredictable behavior
# ❌ NEVER do this (unless you know exactly what you're doing):
etcdctl put /registry/services/specs/default/my-service "bad data"

# ✓ ALWAYS use the Kubernetes API:
kubectl apply -f service.yaml

# Why?
# - Kubernetes validates changes
# - Updates are atomic and consistent
# - Controllers react appropriately
# - Audit logs are maintained

etcd Backup and Restore

The only safe way to manually interact with etcd is through backups:

Backup etcd

# Take a snapshot of etcd data
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status snapshot.db

# Best practice: Automate daily backups
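The "automate daily backups" advice could be as simple as a cron entry on the etcd host; the schedule and backup path below are illustrative:

```
# Illustrative crontab entry: snapshot etcd daily at 02:00,
# embedding the date in the filename (% must be escaped in crontab)
0 2 * * * ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +\%F).db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
```

Pair this with retention cleanup and an alert if the snapshot command exits non-zero.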

Restore from Backup

# Only needed for disaster recovery
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd-restore

# Then update etcd to use the restored data directory
# and restart etcd with the new data-dir

etcd Health Monitoring

# Check etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Output:
# https://127.0.0.1:2379 is healthy: successfully committed proposal

# Check member list
ETCDCTL_API=3 etcdctl member list

# Check cluster component status (deprecated in newer Kubernetes versions)
kubectl get componentstatuses

etcd Best Practices

Production etcd Recommendations

  • Use 3 or 5 members: 3 for most cases, 5 for critical clusters
  • Dedicated nodes: Run etcd on dedicated machines (not on workers)
  • Fast disks: Use SSDs, etcd is I/O intensive
  • Low latency network: Members should be close (same datacenter)
  • Regular backups: Automate daily snapshots
  • Monitor health: Alert on member failures
  • Test recovery: Practice restore procedures
  • Never modify directly: Always use Kubernetes API
  • Separate from workloads: Taint Control Plane nodes

Summary: Kubernetes Fault Tolerance

Complete Fault Tolerance Picture

Services (Networking)
  • Stable entry points for applications
  • Load balance across Pod replicas
  • Implemented by kube-proxy with iptables/IPVS
Node Failure Handling
  • 40s grace period before NotReady
  • 5m eviction timeout before Pod removal
  • Automatic Pod rescheduling to healthy nodes
etcd Recovery
  • Automatic resynchronization when members return
  • Self-healing design (no manual intervention)
  • Quorum-based fault tolerance
Final Assessment

Test Your Knowledge

Networking & Fault Tolerance Quiz

Question 1: What is the primary purpose of a Kubernetes Service?

Question 2: What is kube-proxy responsible for?

Question 3: How does kube-proxy typically implement Service routing?

Question 4: What is the default initial grace period before a node is marked NotReady?

Question 5: What is the default PodEvictionTimeout before Pods are evicted from a failed node?

Question 6: What happens when an etcd node fails and then returns?

Question 7: Why is manual modification of etcd data strongly advised against?

Question 8: In a 3-node etcd cluster, how many nodes can fail while maintaining quorum?