Kubernetes Troubleshooting

Real-World Cluster Issues & Practical Solutions

Lesson 1: Restoring Administrative Access

This lesson covers critical troubleshooting techniques for regaining control of a misconfigured Kubernetes cluster and fixing fundamental access issues.

Problem: Connection Refused Error

Symptom: When trying to run kubectl commands, you get: The connection to the server localhost:8080 was refused

This indicates a missing or misconfigured kubeconfig file: when kubectl finds no configuration, it falls back to the default server address localhost:8080, so administrative access to the cluster fails.

Fix 1: Restore Kubeconfig File

Solution: Copy Admin Configuration

The admin.conf file from /etc/kubernetes needs to be copied to the user's default kubeconfig location, ~/.kube/config.

# Create the kubeconfig directory if it does not already exist
mkdir -p ~/.kube

# Copy admin configuration to default location
sudo cp /etc/kubernetes/admin.conf ~/.kube/config

# Set proper ownership
sudo chown $(id -u):$(id -g) ~/.kube/config

# Verify access is restored
kubectl get nodes
# NAME            STATUS   ROLES           AGE   VERSION
# master-node     Ready    control-plane   30d   v1.28.0
# worker-node-1   Ready    worker          30d   v1.28.0

Fix 2: Emergency "Break Glass" Method

Emergency Access Only: If the standard method doesn't work, you can temporarily enable the API server's insecure port to create new admin credentials. This should only be used as a last resort. Note that the --insecure-port and --insecure-bind-address flags were deprecated for years and removed from kube-apiserver entirely (gone as of v1.24), so this method only works on older clusters; on current versions, regenerate the admin kubeconfig instead (for example with kubeadm init phase kubeconfig admin on a control-plane node).
# Edit API server manifest (static pod)
sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml

# Add insecure port to command arguments
spec:
  containers:
  - command:
    - kube-apiserver
    - --insecure-port=8080
    - --insecure-bind-address=0.0.0.0
    # ... other arguments

# API server will automatically restart (static pod)
# Wait about 30 seconds for restart

# Now you can access without authentication
kubectl --server=http://localhost:8080 get nodes

# Create new ServiceAccount with admin privileges
kubectl --server=http://localhost:8080 create serviceaccount admin-sa -n kube-system
kubectl --server=http://localhost:8080 create clusterrolebinding admin-sa-binding \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:admin-sa

# Get the token
# (note: `kubectl create token` exists only in v1.24+; on the older clusters
# that still have an insecure port, read the ServiceAccount's
# auto-generated token Secret instead)
kubectl --server=http://localhost:8080 create token admin-sa -n kube-system

# IMPORTANT: Remove insecure port after fixing access
# Edit kube-apiserver.yaml and remove the insecure-port lines
Security Warning: Always remove the insecure port configuration immediately after resolving the access issue. Leaving it enabled is a critical security vulnerability.

Common Kubeconfig Issues

Issue                                           | Cause                         | Solution
Connection to localhost:8080 refused            | Missing kubeconfig file       | Copy /etc/kubernetes/admin.conf to ~/.kube/config
x509: certificate signed by unknown authority   | CA certificate mismatch       | Verify certificate-authority-data in kubeconfig
Unable to connect to server: dial tcp: lookup   | Incorrect server address      | Check server URL in kubeconfig
Forbidden: User cannot list resource            | Insufficient RBAC permissions | Create ClusterRoleBinding for user/service account
Best Practice: Always keep a backup of your kubeconfig files and certificates in a secure location separate from the cluster. Document the location of these backups for emergency recovery.
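The backup itself can be scripted. A minimal sketch, assuming a POSIX shell; the destination directory, helper name, and file list are illustrative, not fixed conventions:

```shell
# Bundle the given credential files into a timestamped tarball.
# Destination and source paths are examples -- adjust for your cluster.
backup_kubeconfig() {
    dest_dir="$1"; shift
    mkdir -p "$dest_dir"
    archive="$dest_dir/kube-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
    # Archive every path passed as an argument
    tar -czf "$archive" "$@" && echo "$archive"
}

# Typical call (run as root so /etc/kubernetes/pki is readable);
# copy the resulting archive OFF the node so it survives node loss:
# backup_kubeconfig /var/backups/k8s "$HOME/.kube/config" /etc/kubernetes/pki
```

Whatever tooling you use, the key point is that the archive must end up somewhere the failed node cannot take with it.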

Lesson 2: Kubelet & Component Troubleshooting

Kubelet is the primary node agent that manages pods and containers. When kubelet fails, the entire node becomes non-functional.

Problem 1: Inactive Kubelet Service

Symptom: Node Shows NotReady

When checking node status, you see nodes in NotReady state. This is often caused by the kubelet service not running.

# Check node status
kubectl get nodes
# NAME            STATUS     ROLES    AGE   VERSION
# worker-node-1   NotReady   worker   30d   v1.28.0

# SSH to the problematic node
ssh worker-node-1

# Check kubelet service status
systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
#    Loaded: loaded
#    Active: inactive (dead)

# Check kubelet logs
journalctl -u kubelet -n 50

Solution: Start and Enable Kubelet

Manually start the kubelet service and ensure it's enabled to start on boot.

# Enable kubelet to start on boot
sudo systemctl enable kubelet

# Start kubelet service
sudo systemctl start kubelet

# Verify status
systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
#    Active: active (running)

# Return to master and verify node is Ready
kubectl get nodes
# NAME            STATUS   ROLES    AGE   VERSION
# worker-node-1   Ready    worker   30d   v1.28.0

Problem 2: Kubelet Crash Loop - Cgroup Driver Mismatch

Symptom: Kubelet Constantly Restarting

The kubelet service is enabled but keeps crashing. Logs show: failed to run Kubelet: misconfigured cgroup driver

# Check kubelet logs for errors
journalctl -u kubelet -f

# Common error message:
# failed to run Kubelet: misconfigured cgroup driver:
# "cgroupfs" does not match Docker/containerd driver: "systemd"
Cgroup Driver: The cgroup driver is the interface between Kubernetes and the Linux kernel's control groups. Both kubelet and the container runtime (Docker/containerd) must use the same cgroup driver (either cgroupfs or systemd).
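To compare the two sides quickly, you can pull the driver out of the kubelet's config file. A small sketch, assuming the kubeadm default path /var/lib/kubelet/config.yaml; the helper name is made up for illustration:

```shell
# Extract the cgroup driver from a kubelet config.yaml
# (the file contains a top-level line like "cgroupDriver: systemd").
kubelet_cgroup_driver() {
    awk '/^cgroupDriver:/ {print $2}' "$1"
}

# On a live node, compare against the container runtime:
# kubelet_cgroup_driver /var/lib/kubelet/config.yaml
# docker info 2>/dev/null | grep -i 'cgroup driver'
```

If the two values differ, fix the kubelet side as shown in the solution that follows.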

Solution: Configure Matching Cgroup Driver

Update the kubelet configuration to match the container runtime's cgroup driver.

# Check Docker/containerd cgroup driver
docker info | grep -i cgroup
# Cgroup Driver: systemd

# Option 1: Edit kubelet extra args file
sudo vim /etc/sysconfig/kubelet   # (or /etc/default/kubelet on Debian/Ubuntu)

# Add or modify the cgroup driver setting
# (note: the --cgroup-driver flag is deprecated; the config.yaml
# setting in Option 2 is preferred on current kubelets)
KUBELET_EXTRA_ARGS="--cgroup-driver=systemd"

# Option 2: Edit kubelet config.yaml
sudo vim /var/lib/kubelet/config.yaml

# Add or modify:
cgroupDriver: systemd

# Restart kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Verify kubelet is running
systemctl status kubelet
# ● kubelet.service
#    Active: active (running)

Problem 3: Kube-Scheduler Failure

Symptom: Pods Stuck in Pending State

New pods remain in Pending state indefinitely because the scheduler pod is failing to start.

# Check scheduler pod status
kubectl get pods -n kube-system | grep scheduler
# kube-scheduler-master   0/1   CrashLoopBackOff

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-master
# Common error:
# Error: unknown flag: --insecure-port

Solution: Fix Scheduler Manifest

Remove invalid or deprecated flags from the scheduler's static pod manifest.

# Edit scheduler manifest
sudo vim /etc/kubernetes/manifests/kube-scheduler.yaml

# Find and REMOVE invalid flags like:
spec:
  containers:
  - command:
    - kube-scheduler
    # - --insecure-port=8080   # REMOVE THIS LINE
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    # ... other valid arguments

# Save and exit - static pod will auto-restart

# Verify scheduler is running
kubectl get pods -n kube-system | grep scheduler
# kube-scheduler-master   1/1   Running

# Test by creating a pod
kubectl run test-nginx --image=nginx
kubectl get pods
# NAME         READY   STATUS    RESTARTS   AGE
# test-nginx   1/1     Running   0          5s
Common Mistake: Copying configuration from one component to another (e.g., copying API server flags to scheduler) often introduces invalid flags. Each component has its own set of valid command-line arguments.

Lesson 3: Node Resource & Application Issues

Resource allocation problems and application misconfigurations are common causes of deployment failures in Kubernetes clusters.

Problem: Zero Allocatable CPU

Critical Issue: Pods fail to schedule with error: 0/1 nodes are available: Insufficient cpu

Despite the node showing it has CPU capacity, no CPU is available for pod allocation.
# Check node allocatable resources
kubectl describe node worker-node-1

# Output shows the problem:
# Capacity:
#   cpu:     1
#   memory:  2048Mi
# Allocatable:
#   cpu:     0        ← PROBLEM!
#   memory:  1800Mi

Root Cause: Aggressive System Reservations

System Reserved Resources: Kubelet can reserve CPU and memory for system processes using the --system-reserved and --kube-reserved flags. If these reservations exceed available resources, no capacity remains for pods.
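The arithmetic the kubelet performs is straightforward: allocatable = capacity − kube-reserved − system-reserved (minus eviction thresholds, for memory). A sketch in shell, using the millicore values from the scenario above:

```shell
# Allocatable CPU on a 1-core node with --system-reserved=cpu=1
# (all values in millicores; 1 core = 1000m).
capacity=1000
system_reserved=1000   # --system-reserved=cpu=1 claims the whole core
kube_reserved=0        # --kube-reserved not set

allocatable=$((capacity - system_reserved - kube_reserved))
echo "allocatable: ${allocatable}m"   # allocatable: 0m -> no pod can schedule
```

With zero allocatable CPU, any pod that requests CPU is unschedulable on this node, regardless of actual utilization.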
# SSH to the node and check kubelet configuration
ssh worker-node-1
cat /etc/sysconfig/kubelet

# Example problematic configuration:
KUBELET_EXTRA_ARGS="--system-reserved=cpu=1,memory=500Mi"

# This reserves the ENTIRE CPU (1 core) for system processes,
# leaving zero CPU for pods!

Solution: Adjust Resource Reservations

Reduce or remove CPU reservations to free up allocatable capacity for pods.

# Edit kubelet configuration
sudo vim /etc/sysconfig/kubelet

# Option 1: Remove CPU reservation (keep memory only)
KUBELET_EXTRA_ARGS="--system-reserved=memory=500Mi"

# Option 2: Use smaller reservation (for nodes with 2+ cores)
KUBELET_EXTRA_ARGS="--system-reserved=cpu=100m,memory=500Mi"

# Restart kubelet
sudo systemctl restart kubelet

# Back on master, verify allocatable resources
kubectl describe node worker-node-1
# Allocatable:
#   cpu:     1        ← Fixed!
#   memory:  1548Mi

# Pending pods should now schedule
kubectl get pods
# NAME           READY   STATUS    RESTARTS   AGE
# kube-flannel   1/1     Running   0          15s

Resource Reservation Best Practices

Node Size                   | Recommended CPU Reserve | Recommended Memory Reserve
1 CPU, 2GB RAM (test)       | 0 or 100m               | 256-512Mi
2 CPU, 4GB RAM (small)      | 100-200m                | 512Mi
4 CPU, 8GB RAM (medium)     | 200-300m                | 1Gi
8+ CPU, 16GB+ RAM (large)   | 500m-1000m              | 2Gi

Application Issues

Issue 1: ImagePullBackOff

Invalid Image Tag

Application pods show ImagePullBackOff status because the specified image tag doesn't exist.

# Check pod status
kubectl get pods
# NAME                   READY   STATUS             RESTARTS   AGE
# my-deployment-abc123   0/1     ImagePullBackOff   0          2m

# Describe pod to see error
kubectl describe pod my-deployment-abc123
# Events:
#   Failed to pull image "nginx:1.12.5": rpc error:
#   manifest for nginx:1.12.5 not found

# Fix: Update deployment with valid image tag
kubectl edit deployment my-deployment
# Change:
#   spec:
#     template:
#       spec:
#         containers:
#         - image: nginx:1.12.5   # Invalid
# To:
#         - image: nginx:1.12.1   # Valid

# Pods will automatically recreate with correct image
kubectl get pods
# NAME                   READY   STATUS    RESTARTS   AGE
# my-deployment-xyz789   1/1     Running   0          10s

Issue 2: Incorrect Ingress Hostname

404 Not Found

Application is running but returns 404 errors due to misconfigured Ingress hostname.

# Test application access
curl http://myapp.example.com
# 404 Not Found

# Check Ingress configuration
kubectl get ingress my-ingress -o yaml

# Look for host mismatch:
spec:
  rules:
  - host: wrong-hostname.example.com   # ← Wrong!
    http:
      paths:
      - path: /
        backend:
          service:
            name: my-service
            port:
              number: 80

# Fix the Ingress
kubectl edit ingress my-ingress

# Update host to correct value:
spec:
  rules:
  - host: myapp.example.com   # ← Correct!

# Test again
curl http://myapp.example.com
# 200 OK - Working!

Lesson 4: ConfigMap Bugs & High Availability

Subtle configuration errors and understanding the critical importance of high availability through practical examples.

Problem: Browser Downloading Instead of Displaying

Content-Type Bug

Application is accessible, but the browser prompts to download the page instead of displaying it. This is caused by an incorrect Content-Type header.

Root Cause: A missing semicolon in the Nginx configuration inside a ConfigMap causes Nginx to send the wrong Content-Type header (application/octet-stream instead of text/html), making browsers treat the page as a downloadable file.
# Check ConfigMap for Nginx configuration
kubectl get configmap nginx-config -o yaml

# Look for syntax errors:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    server {
        listen 80;
        location / {
            root /usr/share/nginx/html;
            index index.html
            #     ↑ MISSING SEMICOLON!
        }
    }

Solution: Fix ConfigMap Syntax

Add the missing semicolon and recreate the pod to apply changes.

# Edit ConfigMap
kubectl edit configmap nginx-config

# Fix the syntax error:
data:
  nginx.conf: |
    server {
        listen 80;
        location / {
            root /usr/share/nginx/html;
            index index.html;   # ← Added semicolon
        }
    }

# IMPORTANT: ConfigMap changes don't auto-reload in running pods
# You must delete the pod to pick up the new ConfigMap

# Find the pod name
kubectl get pods
# NAME                   READY   STATUS    RESTARTS   AGE
# my-deployment-abc123   1/1     Running   0          10m

# Delete the pod (Deployment will recreate it)
kubectl delete pod my-deployment-abc123

# New pod will mount updated ConfigMap
kubectl get pods
# NAME                   READY   STATUS    RESTARTS   AGE
# my-deployment-xyz789   1/1     Running   0          5s

# Test the application
curl -I http://myapp.example.com
# HTTP/1.1 200 OK
# Content-Type: text/html   # ← Fixed!
Important: ConfigMap changes are not automatically reflected in running containers. You must:
  • Delete and recreate pods, OR
  • Perform a rolling restart of the Deployment, OR
  • Use a ConfigMap reloader tool
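A fourth option is to make the ConfigMap's name change with its content, so every edit triggers an ordinary rolling update. kustomize's configMapGenerator does this by appending a content hash to the generated name. A minimal sketch; the file layout is hypothetical:

```yaml
# kustomization.yaml (hypothetical layout)
resources:
  - deployment.yaml
configMapGenerator:
  - name: nginx-config
    files:
      - nginx.conf   # the nginx config shown above
# kustomize emits the ConfigMap as nginx-config-<hash> and rewrites the
# Deployment's reference to match; editing nginx.conf changes the hash,
# which changes the pod template and triggers a rolling restart.
```

This trades manual pod deletion for a declarative workflow, at the cost of old ConfigMap versions accumulating until pruned.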

The Cost of No High Availability

Real-World Experiment: What happens when an application runs with only a single replica and the node fails?

Scenario: Single Replica Application

# Application running with 1 replica
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1   # ← Only one replica!
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.12.1

# Check where pod is running
kubectl get pods -o wide
# NAME         READY   STATUS    NODE
# my-app-abc   1/1     Running   worker-node-1

# Application is accessible
curl http://myapp.example.com
# 200 OK - Working

Experiment: Node Failure

# Simulate node failure by shutting down worker-node-1
ssh worker-node-1
sudo shutdown -h now

# Immediately test application
curl http://myapp.example.com
# 503 Service Unavailable   ← Application DOWN!

# Watch the cluster detect the failure
kubectl get nodes
# NAME            STATUS   ROLES    AGE
# master          Ready    master   30d
# worker-node-1   Ready    worker   30d   ← Still shows Ready (takes time)

# After ~1 minute
kubectl get nodes
# NAME            STATUS     ROLES    AGE
# worker-node-1   NotReady   worker   30d   ← Now NotReady

# Check pod status
kubectl get pods
# NAME         READY   STATUS    RESTARTS   AGE
# my-app-abc   1/1     Running   0          15m   ← Still shows Running!

# After ~5-7 minutes, pod marked as Terminating
kubectl get pods
# NAME         READY   STATUS        RESTARTS   AGE
# my-app-abc   1/1     Terminating   0          20m
# my-app-xyz   0/1     Pending       0          5s    ← New pod pending

# After ~7 minutes total, new pod scheduled on healthy node
kubectl get pods -o wide
# NAME         READY   STATUS    NODE
# my-app-xyz   1/1     Running   master

# Application accessible again
curl http://myapp.example.com
# 200 OK - Working

Timeline of Failure & Recovery

Time 0:00 - Node Failure

Worker node shuts down. Application immediately becomes unavailable (503 errors).

Time 0:00 - 1:00 - Detection Delay

Kubernetes hasn't detected the node failure yet. Node still shows "Ready" status.

Time 1:00 - Node Marked NotReady

Kubelet heartbeat timeout occurs. Control plane marks node as NotReady.

Time 5:00 - Pod Eviction Starts

After the pod eviction timeout (default 5 minutes), pods on the failed node are marked for eviction.

Time 7:00 - Recovery Complete

New pod scheduled on healthy node and becomes Running. Application accessible again.

Result: Approximately 7 minutes of complete application downtime due to running only a single replica!
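On current clusters, the 5-minute window in this timeline comes from taint-based eviction: every pod receives default tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds: 300. You can shorten the window per workload. An abridged sketch (selector and labels omitted for brevity), assuming faster failover is acceptable for this app:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  template:
    spec:
      tolerations:
        # Evict this pod 30s after the node becomes NotReady/unreachable,
        # instead of the default 300s.
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 30
      containers:
        - name: nginx
          image: nginx:1.12.1
```

This shortens recovery but increases pod churn during brief network blips; running multiple replicas, covered next, remains the real fix.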

The High Availability Solution

Always Run Multiple Replicas: Deploy at least 2-3 replicas across different nodes to ensure zero-downtime during node failures.
# Proper HA configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3   # ← Multiple replicas for HA
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.12.1
      # Use anti-affinity to spread across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - my-app
              topologyKey: kubernetes.io/hostname

# With 3 replicas across different nodes:
kubectl get pods -o wide
# NAME         READY   STATUS    NODE
# my-app-abc   1/1     Running   worker-node-1
# my-app-def   1/1     Running   worker-node-2
# my-app-ghi   1/1     Running   worker-node-3

# If worker-node-1 fails:
# - 2 replicas still running on worker-node-2 and worker-node-3
# - Application remains available (no downtime)
# - New replica scheduled to maintain 3 replicas total

High Availability Checklist

✓ Multiple Replicas: Deploy at least 2-3 replicas for every critical application
✓ Pod Anti-Affinity: Spread replicas across different nodes using anti-affinity rules
✓ PodDisruptionBudget: Define minimum available pods during voluntary disruptions
✓ Health Checks: Configure liveness and readiness probes for automatic recovery
✓ Resource Requests: Set appropriate resource requests to ensure schedulability
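Two checklist items expressed as manifests: a PodDisruptionBudget and liveness/readiness probes. A hedged sketch; names, paths, and thresholds are illustrative:

```yaml
# Keep at least 2 of the 3 replicas up during voluntary disruptions
# (node drains, upgrades).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
---
# Probe fragment for the Deployment's container spec
containers:
  - name: nginx
    image: nginx:1.12.1
    livenessProbe:        # restart a hung container
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 10
    readinessProbe:       # stop routing traffic until ready
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
```

Note that a PodDisruptionBudget protects only against voluntary disruptions (kubectl drain, cluster upgrades), not the involuntary node failure simulated above; probes and replicas cover that case.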

Final Quiz

Test your knowledge of Kubernetes troubleshooting!

Question 1: What does "connection to localhost:8080 refused" indicate?

a) The API server is down
b) Missing or misconfigured kubeconfig file
c) The cluster is not installed
d) Network firewall blocking access

Question 2: What causes kubelet crash loop with "misconfigured cgroup driver"?

a) Insufficient memory on the node
b) Mismatch between kubelet and container runtime cgroup drivers
c) Invalid Kubernetes version
d) Missing kubelet binary

Question 3: What causes "0 CPU allocatable" on a node with 1 CPU core?

a) CPU is being used by running pods
b) System reservation (--system-reserved=cpu=1) consumes entire CPU
c) CPU is disabled in BIOS
d) Node needs to be rebooted

Question 4: How do you apply ConfigMap changes to running pods?

a) Changes automatically apply within 60 seconds
b) Delete and recreate pods, or perform rolling restart
c) Restart the kubelet service
d) Run kubectl apply on the ConfigMap again

Question 5: How long does it typically take Kubernetes to recover from node failure with single replica?

a) Instant failover (0 seconds)
b) Approximately 7 minutes due to detection and eviction timeouts
c) 30 seconds
d) 24 hours

Question 6: What's the recommended minimum number of replicas for HA?

a) 1 replica is sufficient
b) 2-3 replicas minimum for critical applications
c) 10 replicas always
d) Depends on number of users only

Question 7: What causes "unknown flag: --insecure-port" error in kube-scheduler?

a) Kubernetes version too old
b) Invalid or deprecated flag copied from another component's manifest
c) Missing scheduler binary
d) Firewall blocking scheduler

Question 8: How can you gain emergency access to the cluster when the kubeconfig is lost?

a) Reinstall the entire cluster
b) Temporarily enable API server insecure port to create new admin credentials
c) Use SSH to manually start containers
d) There is no way to recover
Quiz Complete!
All correct answers are option 'b'. These troubleshooting patterns are essential for maintaining healthy Kubernetes clusters in production.