Kubernetes Networking & Fault Tolerance

Services, kube-proxy, and High Availability

Lesson 1 of 4

Kubernetes Service Abstraction

The Networking Challenge

In Kubernetes, Pods are ephemeral—they can be created, destroyed, and replaced at any time. Each Pod gets its own IP address, but these IPs change when Pods are recreated. This creates a fundamental problem:

The Pod IP Problem

  • Pod IPs are not stable—they change on restart
  • Multiple Pod replicas each have different IPs
  • Applications can't reliably connect to Pods directly
  • Load balancing across Pods requires external logic

What is a Service?

A Service is a Kubernetes abstraction that provides a stable, single entry point for an application within the cluster.

Service Purpose

The Service object acts as a stable network endpoint that:

  • Organizes a group of Pods: Selects Pods using labels
  • Provides a stable IP: ClusterIP remains constant
  • Load balances traffic: Distributes requests across Pod replicas
  • Enables service discovery: DNS name for the Service

Service Architecture

Service to Pods Mapping

Service: "webapp"
ClusterIP: 10.96.100.50
Port: 80
Selector: app=webapp

Load Balances Traffic To:
Pod 1: IP 10.244.1.5 | Label: app=webapp
Pod 2: IP 10.244.2.8 | Label: app=webapp
Pod 3: IP 10.244.3.12 | Label: app=webapp

How Services Work

1. Label Selectors

Services use label selectors to identify which Pods belong to them:

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  selector:
    app: webapp        # Selects all Pods with label app=webapp
  ports:
    - protocol: TCP
      port: 80         # Service port
      targetPort: 8080 # Pod container port
  type: ClusterIP      # Default type

2. Stable Entry Point

When the Service is created, it receives a stable ClusterIP:

# Create the Service
kubectl apply -f service.yaml

# Service gets a stable IP that never changes
kubectl get service webapp
NAME     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
webapp   ClusterIP   10.96.100.50   <none>        80/TCP    5m

# Other Pods can now connect to:
# - IP:  10.96.100.50:80
# - DNS: webapp.default.svc.cluster.local

3. Automatic Load Balancing

Traffic to the Service IP is automatically distributed across all matching Pods:

# Request to Service:
curl http://10.96.100.50

# Kubernetes automatically routes to one of:
# - Pod 1: 10.244.1.5:8080
# - Pod 2: 10.244.2.8:8080
# - Pod 3: 10.244.3.12:8080

# Next request might go to a different Pod (load balanced)

Service Types

1. ClusterIP (Default)

Exposes the Service on an internal IP within the cluster:

  • Only accessible from within the cluster
  • Most common type for internal services
  • Provides stable internal endpoint

2. NodePort

Exposes the Service on each Node's IP at a static port:

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: NodePort
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080  # Accessible on all nodes at this port
  • Accessible from outside cluster via <NodeIP>:30080
  • Port range: 30000-32767 (default)
  • Good for development/testing

3. LoadBalancer

Creates an external load balancer (cloud provider):

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: LoadBalancer
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
  • Works with cloud providers (AWS, GCP, Azure)
  • Gets external IP address
  • Best for production external access

Service Discovery

Kubernetes provides automatic DNS for Services:

# Service DNS format:
# <service-name>.<namespace>.svc.cluster.local

# Examples:
# webapp.default.svc.cluster.local
# database.production.svc.cluster.local

# Within the same namespace, just use the service name:
curl http://webapp

# From a different namespace:
curl http://webapp.default
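The DNS name is just the concatenation of the Service name, its namespace, and the cluster domain suffix. A quick shell sketch, assuming the default cluster domain cluster.local:

```shell
# Build a Service FQDN from its parts (default cluster domain assumed)
service=webapp
namespace=default
cluster_domain=cluster.local

fqdn="${service}.${namespace}.svc.${cluster_domain}"
echo "$fqdn"
# → webapp.default.svc.cluster.local
```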

Endpoints

Kubernetes automatically creates an Endpoints object that tracks Pod IPs:

# View Service endpoints
kubectl get endpoints webapp
NAME     ENDPOINTS                                           AGE
webapp   10.244.1.5:8080,10.244.2.8:8080,10.244.3.12:8080    5m

# When a Pod dies and is replaced:
# - Old Pod IP is removed from endpoints
# - New Pod IP is automatically added
# - Service continues to work seamlessly

Key Takeaways: Services

  • Services provide stable entry points for ephemeral Pods
  • Use label selectors to identify backend Pods
  • Automatically load balance traffic across Pod replicas
  • ClusterIP provides internal stability; LoadBalancer provides external access
  • DNS enables service discovery by name
Lesson 2 of 4

Traffic Management with kube-proxy

The Role of kube-proxy

The kube-proxy component is the key to implementing Service networking in Kubernetes. It runs on every node and is responsible for generating the rules that direct traffic to the correct Pods.

kube-proxy Responsibilities

  • Watch Services: Monitor the API Server for Service changes
  • Watch Endpoints: Track Pod IPs behind each Service
  • Generate rules: Create iptables or IPVS rules for routing
  • Route traffic: Direct requests to appropriate Pods
  • Load balance: Distribute traffic across replicas

How kube-proxy Works

1. Service Created
User creates Service with ClusterIP 10.96.100.50:80
2. kube-proxy Detects
kube-proxy on each node watches API Server, sees new Service
3. Generate Rules
kube-proxy creates iptables/IPVS rules on the node
4. Traffic Routing
Rules intercept traffic to 10.96.100.50:80 and route to Pod IPs
5. Load Balancing
Rules distribute traffic across all backend Pods

Implementation: iptables Mode

By default, kube-proxy uses iptables to implement Services. This is the most common mode.

How iptables Rules Work

# When a Service is created with ClusterIP 10.96.100.50:80,
# kube-proxy creates iptables rules:

1. PREROUTING Chain:
   "If traffic is destined for 10.96.100.50:80, jump to the custom chain"

2. Custom Chain (KUBE-SVC-WEBAPP):
   "Randomly select one of the backend Pod rules"

3. Backend Pod Rules:
   - 33% of traffic → DNAT to 10.244.1.5:8080
   - 33% of traffic → DNAT to 10.244.2.8:8080
   - 33% of traffic → DNAT to 10.244.3.12:8080

Result: Traffic is load balanced across the 3 Pods

Complete Traffic Flow

Application makes request
curl http://10.96.100.50:80
iptables intercepts packet
Recognizes traffic to ClusterIP 10.96.100.50:80
Jumps to Service chain
Routes to KUBE-SVC-WEBAPP chain
Load balancing decision
Randomly selects one Pod (e.g., Pod 2)
DNAT (Destination NAT)
Rewrites packet destination to 10.244.2.8:8080
Packet delivered to Pod
Pod processes request and sends response

Viewing iptables Rules

You can inspect the actual iptables rules created by kube-proxy:

# View all iptables NAT rules (on a node)
sudo iptables -t nat -L -n -v

# View Service-specific rules
sudo iptables -t nat -L KUBE-SERVICES -n

# Example output (abbreviated):
Chain KUBE-SERVICES
target        prot  opt  source     destination
KUBE-SVC-ABC  tcp   --   0.0.0.0/0  10.96.100.50  /* webapp */

Chain KUBE-SVC-ABC
target      prot  opt  source     destination
KUBE-SEP-1  all   --   0.0.0.0/0  0.0.0.0/0  /* webapp -> pod1 */ statistic mode random probability 0.33
KUBE-SEP-2  all   --   0.0.0.0/0  0.0.0.0/0  /* webapp -> pod2 */ statistic mode random probability 0.50
KUBE-SEP-3  all   --   0.0.0.0/0  0.0.0.0/0  /* webapp -> pod3 */
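The probabilities 0.33 and 0.50 in the example output are not a mistake: the rules are evaluated in order, so each probability applies only to traffic that earlier rules passed over, and the last rule matches unconditionally. A quick check that the cascade still gives each Pod an equal one-third share:

```shell
# Cascading rule probabilities:
#   P(pod1) = 1/3
#   P(pod2) = (1 - 1/3) * 1/2
#   P(pod3) = (1 - 1/3) * (1 - 1/2) * 1   (final rule, no probability)
awk 'BEGIN {
  p1 = 1.0 / 3
  p2 = (1 - p1) * 0.5
  p3 = (1 - p1) * 0.5
  printf "pod1=%.4f pod2=%.4f pod3=%.4f\n", p1, p2, p3
}'
# → pod1=0.3333 pod2=0.3333 pod3=0.3333
```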

Implementation: IPVS Mode

For larger clusters, IPVS (IP Virtual Server) provides better performance:

IPVS vs. iptables

Aspect           iptables                           IPVS
Performance      Good for small clusters            Better for large clusters (1000+ Services)
Load Balancing   Random selection only              Multiple algorithms (round-robin, least-connection, etc.)
Rule Updates     O(n) - slower with many Services   O(1) - constant time
Complexity       More complex rule chains           Simpler, kernel-level load balancing
Default          Yes                                No (opt-in)

Enable IPVS Mode

# Configure kube-proxy to use IPVS
kubectl edit configmap kube-proxy -n kube-system

# Change the mode field:
mode: "ipvs"

# Restart kube-proxy pods so they pick up the new config
kubectl delete pod -n kube-system -l k8s-app=kube-proxy

ClusterIP Assignment

When a Service is created, Kubernetes assigns it a ClusterIP from a predefined range:

# ClusterIPs are assigned from the Service CIDR range,
# configured during cluster setup (e.g., 10.96.0.0/12)

# View the Service CIDR:
kubectl cluster-info dump | grep -i service-cluster-ip-range

# Example:
# --service-cluster-ip-range=10.96.0.0/12

# This means:
# - Services get IPs from 10.96.0.0 to 10.111.255.255
# - A ClusterIP is stable for the Service's lifetime
# - The IP is virtual (no actual interface has this IP)
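The arithmetic behind the 10.96.0.0 to 10.111.255.255 range can be checked directly in the shell:

```shell
# A /12 prefix leaves 32 - 12 = 20 host bits
host_bits=$(( 32 - 12 ))
echo "Addresses in 10.96.0.0/12: $(( 1 << host_bits ))"

# 20 host bits = 4 bits in the second octet plus all of octets 3 and 4,
# so the second octet spans 96 .. 96 + 2^4 - 1 = 111:
echo "Last Service IP: 10.$(( 96 + (1 << 4) - 1 )).255.255"
# → Addresses in 10.96.0.0/12: 1048576
# → Last Service IP: 10.111.255.255
```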

Traffic Distribution Details

Session Affinity

By default, each request can go to any Pod. For sticky sessions:

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  selector:
    app: webapp
  ports:
    - port: 80
      targetPort: 8080
  sessionAffinity: ClientIP   # Same client goes to same Pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600    # 1 hour sticky session

External Traffic Policy

Control how external traffic is routed:

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local      # Only route to Pods on the same node
  # vs
  # externalTrafficPolicy: Cluster  # Route to any Pod (default)

Important Considerations

  • iptables rules: Updated every time Pods or Services change
  • kube-proxy must run: Without it, Services won't work
  • Virtual IP: ClusterIP doesn't exist on any interface
  • Node-local: Each node has its own iptables rules

Key Takeaways: kube-proxy

  • kube-proxy generates routing rules on each node
  • Uses iptables or IPVS to implement Services
  • Intercepts traffic to ClusterIP and routes to Pod IPs
  • Creates chains that distribute traffic across Pods
  • IPVS is more efficient for large clusters
Lesson 3 of 4

Fault Tolerance: Node Failure

Building Fault-Tolerant Clusters

Kubernetes is designed to maintain availability even when nodes fail. Understanding how the cluster handles failures is crucial for running production workloads.

Fault Tolerance Goals

  • Detect failures: Quickly identify when nodes stop responding
  • Minimize disruption: Give nodes time to recover
  • Recover gracefully: Reschedule Pods to healthy nodes
  • Maintain availability: Keep applications running

Node Failure Detection

The node-controller (part of the Controller Manager) continuously monitors node health:

# Node controller responsibilities:
1. Monitor node status (via kubelet heartbeats)
2. Mark nodes as Ready/NotReady
3. Evict Pods from failed nodes
4. Update node conditions

# Kubelet sends a heartbeat every 10 seconds (default)
# If the node-controller stops receiving heartbeats:
# → Node is marked "Unknown" or "NotReady"

Node Failure Timeline

When a worker node stops responding, Kubernetes follows a controlled process:

T = 0s
Node Stops Responding
Worker node crashes or network disconnects
Kubelet stops sending heartbeats
T = 40s
Initial Grace Period Expires
node-controller waits 40 seconds (default) for kubelet to return
This prevents false positives from temporary network issues
T = 40s
Node Marked NotReady
node-controller marks node status as NotReady
Node no longer receives new Pod assignments
T = 40s to 5m 40s
Pod Eviction Timeout
node-controller waits for PodEvictionTimeout (5 minutes default)
Gives node time to recover and reconnect
T = 5m 40s
Pod Eviction Begins
node-controller evicts (removes) Pods from the failed node
Deployment/ReplicaSet controllers create replacement Pods
New Pods scheduled to healthy nodes
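The T = 5m 40s figure at the end of the timeline is simply the grace period plus the eviction timeout. With the default values:

```shell
# Worst-case delay from node failure to Pod eviction, using defaults
node_monitor_grace_period=40   # seconds (--node-monitor-grace-period)
pod_eviction_timeout=300       # seconds (--pod-eviction-timeout, 5m)

total=$(( node_monitor_grace_period + pod_eviction_timeout ))
echo "Failure-to-eviction delay: ${total}s ($(( total / 60 ))m $(( total % 60 ))s)"
# → Failure-to-eviction delay: 340s (5m 40s)
```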

Configuration Parameters

These timings can be customized based on your availability requirements:

# kube-controller-manager flags:

--node-monitor-period=5s
    # How often to check node status (default: 5s)

--node-monitor-grace-period=40s
    # Grace period before marking a node NotReady (default: 40s)
    # This is the "Initial Grace Period"

--pod-eviction-timeout=5m
    # Wait time before evicting Pods from a NotReady node (default: 5m)
    # This is the "PodEvictionTimeout"

Customizing for Different Scenarios

Scenario            Grace Period   Eviction Timeout   Reasoning
Default             40 seconds     5 minutes          Balanced approach
Fast Recovery       20 seconds     2 minutes          Minimize downtime, accept false positives
Stable Network      60 seconds     10 minutes         Allow more time for recovery
Unreliable Network  60 seconds     15 minutes         Prevent unnecessary evictions

Pod Eviction Process

1. Node Becomes NotReady
node-controller marks node status as NotReady
2. Wait for Eviction Timeout
Wait 5 minutes (default) for node to recover
3. Eviction Triggered
node-controller decides to evict Pods
4. Pods Marked for Deletion
API Server marks Pods as Terminating
5. Controllers Detect Missing Replicas
ReplicaSet/Deployment controllers see fewer than desired replicas
6. Replacement Pods Created
Controllers create new Pods to replace evicted ones
7. Scheduler Assigns to Healthy Nodes
New Pods scheduled to available, healthy nodes
8. Pods Start Running
Application restored on healthy nodes

Monitoring Node Status

# View node status
kubectl get nodes
NAME     STATUS     ROLES    AGE   VERSION
node-1   Ready      worker   10d   v1.25.0
node-2   NotReady   worker   10d   v1.25.0   ← Failed node
node-3   Ready      worker   10d   v1.25.0

# Detailed node information
kubectl describe node node-2
Conditions:
  Type            Status   Reason
  ----            ------   ------
  Ready           False    NodeStatusUnknown
  MemoryPressure  Unknown
  DiskPressure    Unknown

# View Pods on the failed node
kubectl get pods -o wide | grep node-2
# These Pods will be evicted after the timeout

What Happens to Different Workloads

1. Deployment/ReplicaSet Pods

Automatically Recovered

  • Evicted from failed node
  • ReplicaSet controller creates replacements
  • New Pods scheduled to healthy nodes
  • Desired replica count maintained

2. StatefulSet Pods

Special Handling

  • Not automatically evicted (to prevent data loss)
  • Require manual intervention or node deletion
  • Use PodDisruptionBudgets carefully

3. DaemonSet Pods

Node-Specific

  • Evicted when node fails
  • Recreated when node recovers
  • One Pod per node by design

Best Practices for Fault Tolerance

Production Recommendations

  • Multiple replicas: Always run 2+ replicas of critical apps
  • Pod anti-affinity: Spread Pods across different nodes
  • Health checks: Configure readiness/liveness probes
  • PodDisruptionBudgets: Ensure minimum availability during disruptions
  • Monitor node health: Alert on NotReady nodes
  • Tune timeouts: Adjust based on network reliability
  • Regular testing: Test failure scenarios (chaos engineering)
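As a sketch of the first four recommendations combined, a Deployment that spreads replicas across nodes plus a matching PodDisruptionBudget might look like the following. The names, image, and probe path are illustrative, not from the original lesson:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3                      # multiple replicas of the critical app
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:           # spread Pods across different nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: webapp
              topologyKey: kubernetes.io/hostname
      containers:
        - name: webapp
          image: webapp:1.0        # illustrative image name
          readinessProbe:          # health check before receiving traffic
            httpGet:
              path: /healthz       # illustrative probe path
              port: 8080
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp-pdb
spec:
  minAvailable: 2                  # keep at least 2 Pods during disruptions
  selector:
    matchLabels:
      app: webapp
```

With required anti-affinity, the scheduler refuses to place two webapp Pods on the same node, so a single node failure can take down at most one replica.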

Important Warnings

  • Network partitions: Node might still be running but unreachable
  • Split-brain risk: Same Pod might run on both old and new nodes briefly
  • Storage concerns: PersistentVolumes might still be attached to old node
  • Graceful shutdown: Evicted Pods don't receive SIGTERM (forced termination)
Lesson 4 of 4

Fault Tolerance: etcd Recovery

etcd in High Availability

etcd is the single source of truth for your cluster. Understanding how it handles failures is critical for cluster reliability.

etcd Cluster Requirements

  • Odd number of nodes: 3, 5, or 7 members (for quorum)
  • Quorum-based: Uses Raft consensus algorithm
  • Fault tolerance: Can survive (n-1)/2 failures
  • 3-node cluster: Survives 1 failure
  • 5-node cluster: Survives 2 failures
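The arithmetic behind these numbers can be tabulated directly: quorum is a majority, floor(n/2) + 1, and the tolerated failure count is whatever is left over:

```shell
# Quorum (majority) and tolerated failures for common etcd cluster sizes
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated_failures=$tolerated"
done
# → members=1 quorum=1 tolerated_failures=0
# → members=3 quorum=2 tolerated_failures=1
# → members=5 quorum=3 tolerated_failures=2
# → members=7 quorum=4 tolerated_failures=3
```

Note that 4 members tolerate no more failures than 3 (quorum 3, one failure), which is why even member counts are avoided.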

etcd Quorum and Consensus

How Raft Consensus Works

# 3-node etcd cluster:
Cluster: etcd-1, etcd-2, etcd-3
Quorum needed: 2 out of 3 nodes

Scenario 1 - All healthy:
  ✓ etcd-1: Available
  ✓ etcd-2: Available
  ✓ etcd-3: Available
  Status: Cluster operational (3/3, quorum achieved)

Scenario 2 - One node fails:
  ✓ etcd-1: Available
  ✓ etcd-2: Available
  ✗ etcd-3: FAILED
  Status: Cluster operational (2/3, quorum achieved)

Scenario 3 - Two nodes fail:
  ✓ etcd-1: Available
  ✗ etcd-2: FAILED
  ✗ etcd-3: FAILED
  Status: Cluster DOWN (1/3, no quorum)

Automatic etcd Recovery

When an etcd node fails and then returns, Kubernetes handles recovery automatically:

Self-Healing etcd

If an etcd node fails for a period and then returns, the system automatically handles resynchronization of data:

  1. Failed etcd node comes back online
  2. Raft leader detects returning member
  3. Leader streams missing data to returning member
  4. Returning member catches up to current state
  5. Cluster returns to full health automatically

etcd Recovery Timeline

T = 0: etcd-2 Fails
One etcd member becomes unavailable
Cluster still has 2/3 nodes (quorum maintained)
T = 0 to Recovery: Cluster Continues
Remaining nodes (etcd-1, etcd-3) continue operating
All writes succeed (quorum present)
Missing member falls behind in log
T = Recovery: etcd-2 Returns
Failed member comes back online
Rejoins cluster automatically
Automatic Resynchronization
Leader streams missing log entries to etcd-2
etcd-2 replays operations to catch up
No manual intervention required
Full Recovery
etcd-2 fully synchronized
Cluster returns to 3/3 healthy members

etcd Failure Scenarios

Scenario 1: Single Member Failure (Recoverable)

# 3-member cluster, 1 fails

Before failure:
  etcd-1: Leader
  etcd-2: Follower   ← Fails
  etcd-3: Follower

During failure:
  ✓ Cluster continues (2/3 quorum)
  ✓ API Server works normally
  ✓ All operations succeed
  - Reduced fault tolerance (cannot survive another failure)

After recovery:
  ✓ etcd-2 returns
  ✓ Automatically resynchronizes
  ✓ Full fault tolerance restored

Scenario 2: Leader Failure (Auto-Recovery)

# Leader election happens automatically

Before:
  etcd-1: Leader   ← Fails
  etcd-2: Follower
  etcd-3: Follower

During failure:
  1. Followers detect leader loss
  2. New election is triggered
  3. etcd-2 or etcd-3 becomes leader
  4. Cluster continues operating
  Total downtime: ~1 second

After:
  etcd-2: New Leader
  etcd-3: Follower
  (etcd-1 will catch up when it returns)

Scenario 3: Quorum Loss (Cluster Down)

Critical Failure

# 3-member cluster, 2 fail simultaneously
  etcd-1: Available
  etcd-2: FAILED
  etcd-3: FAILED

Result:
  ✗ No quorum (1 of 3 members up, 2 required)
  ✗ Cluster cannot accept writes
  ✗ API Server is effectively read-only
  ✗ No new operations possible

Recovery:
  Restore at least one failed member,
  or recover from backup

DO NOT Manually Modify etcd

Critical Warning

Manually modifying etcd data directly is strongly discouraged!

  • The cluster is designed to self-heal
  • Direct etcd modifications can cause inconsistencies
  • May break cluster state permanently
  • Bypasses Kubernetes validation and admission control
  • Can lead to unpredictable behavior
# ❌ NEVER do this (unless you know exactly what you're doing):
etcdctl put /registry/services/specs/default/my-service "bad data"

# ✓ ALWAYS use the Kubernetes API:
kubectl apply -f service.yaml

# Why?
# - Kubernetes validates changes
# - Updates are atomic and consistent
# - Controllers react appropriately
# - Audit logs are maintained

etcd Backup and Restore

The only safe way to manually interact with etcd is through backups:

Backup etcd

# Take a snapshot of etcd data
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status snapshot.db

# Best practice: Automate daily backups
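The "automate daily backups" advice could be as simple as a cron entry on the etcd host; the schedule and backup path below are illustrative:

```
# Illustrative crontab entry: snapshot etcd daily at 02:00,
# embedding the date in the filename (% must be escaped in crontab)
0 2 * * * ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +\%F).db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
```

Pair this with retention cleanup and an alert if the snapshot command exits non-zero.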

Restore from Backup

# Only needed for disaster recovery
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd-restore

# Then update etcd to use the restored data directory
# and restart etcd with the new data-dir

etcd Health Monitoring

# Check etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Output:
# https://127.0.0.1:2379 is healthy: successfully committed proposal

# Check member list
ETCDCTL_API=3 etcdctl member list

# Check cluster component status (deprecated in newer Kubernetes versions)
kubectl get componentstatuses

etcd Best Practices

Production etcd Recommendations

  • Use 3 or 5 members: 3 for most cases, 5 for critical clusters
  • Dedicated nodes: Run etcd on dedicated machines (not on workers)
  • Fast disks: Use SSDs, etcd is I/O intensive
  • Low latency network: Members should be close (same datacenter)
  • Regular backups: Automate daily snapshots
  • Monitor health: Alert on member failures
  • Test recovery: Practice restore procedures
  • Never modify directly: Always use Kubernetes API
  • Separate from workloads: Taint Control Plane nodes

Summary: Kubernetes Fault Tolerance

Complete Fault Tolerance Picture

Services (Networking)
  • Stable entry points for applications
  • Load balance across Pod replicas
  • Implemented by kube-proxy with iptables/IPVS
Node Failure Handling
  • 40s grace period before NotReady
  • 5m eviction timeout before Pod removal
  • Automatic Pod rescheduling to healthy nodes
etcd Recovery
  • Automatic resynchronization when members return
  • Self-healing design (no manual intervention)
  • Quorum-based fault tolerance
Final Assessment

Test Your Knowledge

Networking & Fault Tolerance Quiz

Question 1: What is the primary purpose of a Kubernetes Service?

Question 2: What is kube-proxy responsible for?

Question 3: How does kube-proxy typically implement Service routing?

Question 4: What is the default initial grace period before a node is marked NotReady?

Question 5: What is the default PodEvictionTimeout before Pods are evicted from a failed node?

Question 6: What happens when an etcd node fails and then returns?

Question 7: Why is manual modification of etcd data strongly advised against?

Question 8: In a 3-node etcd cluster, how many nodes can fail while maintaining quorum?