Lesson 1: Restoring Administrative Access
This lesson covers critical troubleshooting techniques for regaining control of a misconfigured Kubernetes cluster and fixing fundamental access issues.
Problem: Connection Refused Error
The connection to the server localhost:8080 was refused
This indicates missing or misconfigured kubeconfig, preventing administrative access to the cluster.
Fix 1: Restore Kubeconfig File
Solution: Copy Admin Configuration
The admin.conf file from the Kubernetes directory needs to be copied to the user's default kubeconfig location.
# Copy admin configuration to default location
sudo cp /etc/kubernetes/admin.conf ~/.kube/config
# Set proper ownership
sudo chown $(id -u):$(id -g) ~/.kube/config
# Verify access is restored
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# master-node Ready control-plane 30d v1.28.0
# worker-node-1 Ready worker 30d v1.28.0
Fix 2: Emergency "Break Glass" Method
Note: the --insecure-port and --insecure-bind-address flags were removed from kube-apiserver in v1.20, so this method only works on older clusters. On v1.20+, restore access from a backup of admin.conf or regenerate it with kubeadm init phase kubeconfig admin.
# Edit API server manifest (static pod)
sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml
# Add insecure port to command arguments
spec:
containers:
- command:
- kube-apiserver
- --insecure-port=8080
- --insecure-bind-address=0.0.0.0
# ... other arguments
# API server will automatically restart (static pod)
# Wait about 30 seconds for restart
# Now you can access without authentication
kubectl --server=http://localhost:8080 get nodes
# Create new ServiceAccount with admin privileges
kubectl --server=http://localhost:8080 create serviceaccount admin-sa -n kube-system
kubectl --server=http://localhost:8080 create clusterrolebinding admin-sa-binding \
--clusterrole=cluster-admin \
--serviceaccount=kube-system:admin-sa
# Get the token
kubectl --server=http://localhost:8080 create token admin-sa -n kube-system
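With the token in hand, you can wire it into a proper kubeconfig user instead of continuing to use the insecure port — a minimal sketch with placeholder values (the server address and TLS settings below are assumptions for illustration; prefer supplying certificate-authority-data over skipping verification):

```yaml
apiVersion: v1
kind: Config
clusters:
- name: kubernetes
  cluster:
    server: https://<master-ip>:6443      # assumed API server address
    insecure-skip-tls-verify: true        # placeholder; use certificate-authority-data in practice
users:
- name: admin-sa
  user:
    token: <token from the create token command above>
contexts:
- name: admin-sa@kubernetes
  context:
    cluster: kubernetes
    user: admin-sa
current-context: admin-sa@kubernetes
```

Save this as ~/.kube/config and kubectl will authenticate as the new ServiceAccount over the secure port.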
# IMPORTANT: Remove insecure port after fixing access
# Edit kube-apiserver.yaml and remove the insecure-port lines
Common Kubeconfig Issues
| Issue | Cause | Solution |
|---|---|---|
| Connection to localhost:8080 refused | Missing kubeconfig file | Copy /etc/kubernetes/admin.conf to ~/.kube/config |
| x509: certificate signed by unknown authority | CA certificate mismatch | Verify certificate-authority-data in kubeconfig |
| Unable to connect to server: dial tcp: lookup | Incorrect server address | Check server URL in kubeconfig |
| Forbidden: User cannot list resource | Insufficient RBAC permissions | Create ClusterRoleBinding for user/service account |
Lesson 2: Kubelet & Component Troubleshooting
Kubelet is the primary node agent that manages pods and containers. When kubelet fails, the entire node becomes non-functional.
Problem 1: Inactive Kubelet Service
Symptom: Node Shows NotReady
When checking node status, you see nodes in NotReady state. This is often caused by the kubelet service not running.
# Check node status
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# worker-node-1 NotReady worker 30d v1.28.0
# SSH to the problematic node
ssh worker-node-1
# Check kubelet service status
systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
# Loaded: loaded
# Active: inactive (dead)
# Check kubelet logs
journalctl -u kubelet -n 50
Solution: Start and Enable Kubelet
Manually start the kubelet service and ensure it's enabled to start on boot.
# Enable kubelet to start on boot
sudo systemctl enable kubelet
# Start kubelet service
sudo systemctl start kubelet
# Verify status
systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
# Active: active (running)
# Return to master and verify node is Ready
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# worker-node-1 Ready worker 30d v1.28.0
Problem 2: Kubelet Crash Loop - Cgroup Driver Mismatch
Symptom: Kubelet Constantly Restarting
The kubelet service is enabled but keeps crashing. Logs show: failed to run Kubelet: misconfigured cgroup driver
# Check kubelet logs for errors
journalctl -u kubelet -f
# Common error message:
# failed to run Kubelet: misconfigured cgroup driver:
# "cgroupfs" does not match Docker/containerd driver: "systemd"
Solution: Configure Matching Cgroup Driver
Update the kubelet configuration to match the container runtime's cgroup driver.
# Check Docker/containerd cgroup driver
docker info | grep -i cgroup
# Cgroup Driver: systemd
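If the node runs containerd without Docker, the driver is set in containerd's own config instead — a fragment of /etc/containerd/config.toml, assuming the standard CRI plugin and runc runtime section:

```toml
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true   # must match the kubelet's cgroupDriver setting
```

After editing, restart containerd (sudo systemctl restart containerd) before restarting kubelet.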
# Option 1: Edit kubelet extra args file
sudo vim /etc/sysconfig/kubelet
# (or /etc/default/kubelet on Debian/Ubuntu)
# Add or modify the cgroup driver setting
KUBELET_EXTRA_ARGS="--cgroup-driver=systemd"
# Option 2: Edit kubelet config.yaml
sudo vim /var/lib/kubelet/config.yaml
# Add or modify:
cgroupDriver: systemd
# Restart kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Verify kubelet is running
systemctl status kubelet
# ● kubelet.service
# Active: active (running)
Problem 3: Kube-Scheduler Failure
Symptom: Pods Stuck in Pending State
New pods remain in Pending state indefinitely because the scheduler pod is failing to start.
# Check scheduler pod status
kubectl get pods -n kube-system | grep scheduler
# kube-scheduler-master 0/1 CrashLoopBackOff
# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-master
# Common error:
# Error: unknown flag: --insecure-port
Solution: Fix Scheduler Manifest
Remove invalid or deprecated flags from the scheduler's static pod manifest.
# Edit scheduler manifest
sudo vim /etc/kubernetes/manifests/kube-scheduler.yaml
# Find and REMOVE invalid flags like:
spec:
containers:
- command:
- kube-scheduler
# - --insecure-port=8080 # REMOVE THIS LINE
- --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
- --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
# ... other valid arguments
# Save and exit - static pod will auto-restart
# Verify scheduler is running
kubectl get pods -n kube-system | grep scheduler
# kube-scheduler-master 1/1 Running
# Test by creating a pod
kubectl run test-nginx --image=nginx
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# test-nginx 1/1 Running 0 5s
Lesson 3: Node Resource & Application Issues
Resource allocation problems and application misconfigurations are common causes of deployment failures in Kubernetes clusters.
Problem: Zero Allocatable CPU
0/1 nodes are available: Insufficient cpu
Despite the node showing it has CPU capacity, no CPU is available for pod allocation.
# Check node allocatable resources
kubectl describe node worker-node-1
# Output shows the problem:
# Capacity:
# cpu: 1
# memory: 2048Mi
# Allocatable:
# cpu: 0 # ← PROBLEM!
# memory: 1800Mi
Root Cause: Aggressive System Reservations
The kubelet sets aside node resources for system daemons and Kubernetes components via the --system-reserved and --kube-reserved flags. If these reservations exceed available resources, no capacity remains for pods.
# SSH to the node and check kubelet configuration
ssh worker-node-1
cat /etc/sysconfig/kubelet
# Example problematic configuration:
KUBELET_EXTRA_ARGS="--system-reserved=cpu=1,memory=500Mi"
# This reserves the ENTIRE CPU (1 core) for system processes
# Leaving zero CPU for pods!
Solution: Adjust Resource Reservations
Reduce or remove CPU reservations to free up allocatable capacity for pods.
# Edit kubelet configuration
sudo vim /etc/sysconfig/kubelet
# Option 1: Remove CPU reservation (keep memory only)
KUBELET_EXTRA_ARGS="--system-reserved=memory=500Mi"
# Option 2: Use smaller reservation (for nodes with 2+ cores)
KUBELET_EXTRA_ARGS="--system-reserved=cpu=100m,memory=500Mi"
# Restart kubelet
sudo systemctl restart kubelet
# Back on master, verify allocatable resources
kubectl describe node worker-node-1
# Allocatable:
# cpu: 1 # ← Fixed!
# memory: 1548Mi
# Pending pods should now schedule
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# kube-flannel 1/1 Running 0 15s
Resource Reservation Best Practices
| Node Size | Recommended CPU Reserve | Recommended Memory Reserve |
|---|---|---|
| 1 CPU, 2GB RAM (test) | 0 or 100m | 256-512Mi |
| 2 CPU, 4GB RAM (small) | 100-200m | 512Mi |
| 4 CPU, 8GB RAM (medium) | 200-300m | 1Gi |
| 8+ CPU, 16GB+ RAM (large) | 500m-1000m | 2Gi |
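The relationship behind the table can be sanity-checked with simple arithmetic — allocatable capacity is roughly what remains after reservations (eviction thresholds, which also subtract from memory, are omitted here):

```shell
# Allocatable ≈ Capacity - system-reserved - kube-reserved
capacity_mcpu=2000          # a 2-core node, in millicores
system_reserved_mcpu=200    # --system-reserved=cpu=200m
kube_reserved_mcpu=100      # --kube-reserved=cpu=100m

allocatable_mcpu=$((capacity_mcpu - system_reserved_mcpu - kube_reserved_mcpu))
echo "allocatable: ${allocatable_mcpu}m"   # allocatable: 1700m
```

On the 1-core node from the example above, the same arithmetic with --system-reserved=cpu=1 yields 1000 - 1000 = 0m, which is exactly the "zero allocatable CPU" symptom.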
Application Issues
Issue 1: ImagePullBackOff
Invalid Image Tag
Application pods show ImagePullBackOff status because the specified image tag doesn't exist.
# Check pod status
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# my-deployment-abc123 0/1 ImagePullBackOff 0 2m
# Describe pod to see error
kubectl describe pod my-deployment-abc123
# Events:
# Failed to pull image "nginx:1.12.5": rpc error:
# manifest for nginx:1.12.5 not found
# Fix: Update deployment with valid image tag
kubectl edit deployment my-deployment
# Change:
# spec:
# template:
# spec:
# containers:
# - image: nginx:1.12.5 # Invalid
# To:
# - image: nginx:1.12.1 # Valid
# Pods will automatically recreate with correct image
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# my-deployment-xyz789 1/1 Running 0 10s
Issue 2: Incorrect Ingress Hostname
404 Not Found
Application is running but returns 404 errors due to misconfigured Ingress hostname.
# Test application access
curl http://myapp.example.com
# 404 Not Found
# Check Ingress configuration
kubectl get ingress my-ingress -o yaml
# Look for host mismatch:
spec:
rules:
- host: wrong-hostname.example.com # ← Wrong!
http:
paths:
- path: /
backend:
service:
name: my-service
port:
number: 80
# Fix the Ingress
kubectl edit ingress my-ingress
# Update host to correct value:
spec:
rules:
- host: myapp.example.com # ← Correct!
# Test again
curl http://myapp.example.com
# 200 OK - Working!
Lesson 4: ConfigMap Bugs & High Availability
This lesson covers subtle configuration errors and demonstrates the critical importance of high availability through practical examples.
Problem: Browser Downloading Instead of Displaying
Content-Type Bug
Application is accessible, but the browser prompts to download the page instead of displaying it. This is caused by incorrect Content-Type header.
The broken Nginx configuration causes the server to send the wrong Content-Type header (application/octet-stream instead of text/html), making browsers treat the page as a downloadable file.
# Check ConfigMap for Nginx configuration
kubectl get configmap nginx-config -o yaml
# Look for syntax errors:
apiVersion: v1
kind: ConfigMap
metadata:
name: nginx-config
data:
nginx.conf: |
server {
listen 80;
location / {
root /usr/share/nginx/html;
index index.html
# ↑ MISSING SEMICOLON!
}
}
Solution: Fix ConfigMap Syntax
Add the missing semicolon and recreate the pod to apply changes.
# Edit ConfigMap
kubectl edit configmap nginx-config
# Fix the syntax error:
data:
nginx.conf: |
server {
listen 80;
location / {
root /usr/share/nginx/html;
index index.html; # ← Added semicolon
}
}
# IMPORTANT: ConfigMap changes don't auto-reload in running pods
# You must delete the pod to pick up the new ConfigMap
# Find the pod name
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# my-deployment-abc123 1/1 Running 0 10m
# Delete the pod (Deployment will recreate it)
kubectl delete pod my-deployment-abc123
# New pod will mount updated ConfigMap
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# my-deployment-xyz789 1/1 Running 0 5s
# Test the application
curl -I http://myapp.example.com
# HTTP/1.1 200 OK
# Content-Type: text/html # ← Fixed!
To apply ConfigMap changes to running pods, you must either:
- Delete and recreate the pods, OR
- Perform a rolling restart of the Deployment, OR
- Use a ConfigMap reloader tool
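A common pattern for forcing pods to roll automatically whenever the config changes (popularized by Helm charts) is to stamp a hash of the ConfigMap into the pod template — any change to the annotation triggers a rolling update. A sketch with a placeholder hash value:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  template:
    metadata:
      annotations:
        # Recompute on every config change, e.g.:
        #   sha256sum nginx-config.yaml
        checksum/config: "<sha256 of the ConfigMap manifest>"
    spec:
      containers:
      - name: nginx
        image: nginx:1.12.1
```

Because the pod template changed, the Deployment controller performs a rolling update and the new pods mount the updated ConfigMap.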
The Cost of No High Availability
Scenario: Single Replica Application
# Application running with 1 replica
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 1 # ← Only one replica!
template:
spec:
containers:
- name: nginx
image: nginx:1.12.1
# Check where pod is running
kubectl get pods -o wide
# NAME READY STATUS NODE
# my-app-abc 1/1 Running worker-node-1
# Application is accessible
curl http://myapp.example.com
# 200 OK - Working
Experiment: Node Failure
# Simulate node failure by shutting down worker-node-1
ssh worker-node-1
sudo shutdown -h now
# Immediately test application
curl http://myapp.example.com
# 503 Service Unavailable # ← Application DOWN!
# Watch the cluster detect the failure
kubectl get nodes
# NAME STATUS ROLES AGE
# master Ready master 30d
# worker-node-1 Ready worker 30d # Still shows Ready (takes time)
# After ~1 minute
kubectl get nodes
# NAME STATUS ROLES AGE
# worker-node-1 NotReady worker 30d # Now NotReady
# Check pod status
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# my-app-abc 1/1 Running 0 15m # Still shows Running!
# After ~5-7 minutes, pod marked as Terminating
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# my-app-abc 1/1 Terminating 0 20m
# my-app-xyz 0/1 Pending 0 5s # New pod pending
# After ~7 minutes total, new pod scheduled on healthy node
kubectl get pods -o wide
# NAME READY STATUS NODE
# my-app-xyz 1/1 Running master
# Application accessible again
curl http://myapp.example.com
# 200 OK - Working
Timeline of Failure & Recovery
Time 0:00 - Node Failure
Worker node shuts down. Application immediately becomes unavailable (503 errors).
Time 0:00 - 1:00 - Detection Delay
Kubernetes hasn't detected the node failure yet. Node still shows "Ready" status.
Time 1:00 - Node Marked NotReady
Kubelet heartbeat timeout occurs. Control plane marks node as NotReady.
Time 5:00 - Pod Eviction Starts
After pod-eviction-timeout (default 5 min), pods on failed node marked for eviction.
Time 7:00 - Recovery Complete
New pod scheduled on healthy node and becomes Running. Application accessible again.
The High Availability Solution
# Proper HA configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3 # ← Multiple replicas for HA
template:
spec:
containers:
- name: nginx
image: nginx:1.12.1
# Use anti-affinity to spread across nodes
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- my-app
topologyKey: kubernetes.io/hostname
# With 3 replicas across different nodes:
kubectl get pods -o wide
# NAME READY STATUS NODE
# my-app-abc 1/1 Running worker-node-1
# my-app-def 1/1 Running worker-node-2
# my-app-ghi 1/1 Running worker-node-3
# If worker-node-1 fails:
# - 2 replicas still running on worker-node-2 and worker-node-3
# - Application remains available (no downtime)
# - New replica scheduled to maintain 3 replicas total
High Availability Checklist
- Run at least 2-3 replicas of every production workload
- Use pod anti-affinity to spread replicas across nodes
- Test node-failure recovery in a non-production environment
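For voluntary disruptions (node drains, cluster upgrades), a PodDisruptionBudget complements the replica count and anti-affinity shown above by preventing too many replicas from being evicted at once — a sketch assuming the app=my-app label from this lesson's Deployment:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2        # never drain below 2 running replicas
  selector:
    matchLabels:
      app: my-app
```

Note that a PDB only protects against voluntary evictions; for unplanned node failures, the replica count and spread across nodes remain the primary defense.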
Final Quiz
Test your knowledge of Kubernetes troubleshooting!
Question 1: What does "connection to localhost:8080 refused" indicate?
Question 2: What causes kubelet crash loop with "misconfigured cgroup driver"?
Question 3: What causes "0 CPU allocatable" on a node with 1 CPU core?
Question 4: How do you apply ConfigMap changes to running pods?
Question 5: How long does it typically take Kubernetes to recover from node failure with single replica?
Question 6: What's the recommended minimum number of replicas for HA?
Question 7: What causes "unknown flag: --insecure-port" error in kube-scheduler?
Question 8: How can you emergency access the cluster when kubeconfig is lost?
All correct answers are option 'b'. These troubleshooting patterns are essential for maintaining healthy Kubernetes clusters in production.