Unlike Deployments, which run long-running services, Jobs are designed for workloads that need to complete a finite piece of work and then terminate.
Job vs Deployment
| Aspect | Deployment | Job |
| --- | --- | --- |
| Purpose | Long-running services | Run-to-completion tasks |
| Lifecycle | Runs indefinitely | Runs until the task completes |
| Restart policy | Always restart on failure | Restart until success or retry limit |
| Success criteria | None (keeps running) | Exit code 0 (successful completion) |
| Examples | Web servers, APIs, databases | Batch jobs, data processing, migrations |
How Jobs Work
Job Completion Criteria
A Job keeps running (and, on failure, re-creating) the Pod it creates until the Pod completes its task and exits with a successful result code (exit code 0).
1. Job Created
kubectl apply -f job.yaml
↓
2. Pod Started
Job controller creates Pod to run task
↓
3. Task Executes
Container runs to completion
↓
4. Check Exit Code
Did container exit with code 0?
↓
If Exit Code = 0: SUCCESS
Job marked as Complete, Pod kept for logs
If Exit Code ≠ 0: FAILURE
Retry Pod (up to backoffLimit)
Basic Job Manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  # Number of successful completions required
  completions: 1
  # Run pods in parallel (default: 1)
  parallelism: 1
  # Number of retries before marking Job failed
  backoffLimit: 3
  # Pod template
  template:
    spec:
      restartPolicy: Never  # or OnFailure
      containers:
      - name: migration
        image: my-migration-tool:1.0
        command:
        - /bin/sh
        - -c
        - |
          set -e  # abort on the first failing command so the Job reports failure
          echo "Starting data migration..."
          # Do the actual work
          migrate-data --source=old-db --dest=new-db
          echo "Migration complete!"
          # Exit 0 for success
          exit 0
Job Parameters Explained
1. completions
spec:
  completions: 3  # Job needs 3 successful Pod completions

# Use cases:
# - Process 3 batches of data
# - Run 3 different migration tasks
# - Split work into 3 chunks
2. parallelism
spec:
  completions: 10
  parallelism: 3  # Run 3 Pods at a time

# Execution:
# - Pods 1, 2, 3 run in parallel
# - When Pod 1 completes, Pod 4 starts
# - Continue until all 10 completions are achieved
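Putting the two fields together, a complete manifest might look like the following sketch (the name, image, and work command are illustrative, not from the original):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-processor        # illustrative name
spec:
  completions: 10              # 10 successful Pod completions required
  parallelism: 3               # at most 3 Pods running at once
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-batch-tool:1.0    # hypothetical image
        command: ["process-batch"]  # hypothetical command that processes one chunk
```

Each completed Pod counts toward `completions`; the controller keeps up to `parallelism` Pods running until the target is reached.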
3. backoffLimit
spec:
  backoffLimit: 4  # Retry up to 4 times on failure

# If the Pod fails:
# - Retry 1: wait 10s
# - Retry 2: wait 20s
# - Retry 3: wait 40s
# - Retry 4: wait 80s (delay doubles each time, capped at 6 minutes)
# After 4 failures: Job marked as Failed
Job Operations
Create and Monitor Job
# Create Job
kubectl apply -f job.yaml
# Watch Job status
kubectl get jobs -w
NAME             COMPLETIONS   DURATION   AGE
data-migration   0/1           5s         5s
data-migration   1/1           45s        45s   ← Complete!
# View Job details
kubectl describe job data-migration
# View Job Pods
kubectl get pods -l job-name=data-migration
NAME                    READY   STATUS      RESTARTS   AGE
data-migration-abc123   0/1     Completed   0          2m
# View logs from completed Job
kubectl logs data-migration-abc123
Delete Job
# Delete Job and its Pods
kubectl delete job data-migration
# Keep completed Pods for manual cleanup
kubectl delete job data-migration --cascade=orphan
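Instead of deleting by hand, the Job API also supports time-based cleanup via the `ttlSecondsAfterFinished` field; a minimal sketch (the TTL value is illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  ttlSecondsAfterFinished: 600  # delete the Job and its Pods 10 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migration
        image: my-migration-tool:1.0
```

Note that once the TTL expires the Pod logs are gone too, so collect logs before the deadline if you need them.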
CronJob Parameters Explained
1. concurrencyPolicy
spec:
  concurrencyPolicy: Allow  # Default: Allow concurrent runs

# Options:
# - Allow: Allow concurrent Jobs
# - Forbid: Skip new run if previous still running
# - Replace: Cancel current Job and start new one
2. startingDeadlineSeconds
spec:
  startingDeadlineSeconds: 300  # 5 minutes

# If the CronJob misses its scheduled time (cluster down, etc.):
# - Try to start within 300 seconds of the scheduled time
# - If the deadline has passed, count the run as missed
# - Prevents a backlog of old Jobs
3. suspend
spec:
  suspend: true  # Temporarily disable CronJob

# Use cases:
# - Maintenance windows
# - Debugging
# - Pause without deleting
CronJob Considerations
Idempotency: Jobs should be idempotent (safe to run multiple times)
Missed runs: CronJobs may miss schedules if the cluster is down
Timezone: Defaults to the controller manager's timezone (use the timeZone field to override)
Concurrency: Set an appropriate concurrencyPolicy
History limits: Clean up old Jobs to avoid clutter
CronJob Best Practices
Make Jobs idempotent (can run multiple times safely)
Set concurrencyPolicy: Forbid for non-overlapping tasks
Use startingDeadlineSeconds to prevent backlogs
Keep history limits reasonable (3-7 successful, 1-3 failed)
Set resource limits on Job Pods
Monitor CronJob execution and failures
Test schedule syntax before deploying
Document what each CronJob does
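Tying these practices together, a CronJob manifest might look like this sketch (the name, schedule, image, and resource values are illustrative assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report             # illustrative name
spec:
  schedule: "0 2 * * *"            # 02:00 daily; test syntax before deploying
  timeZone: "Etc/UTC"              # explicit timezone instead of the controller default
  concurrencyPolicy: Forbid        # non-overlapping runs
  startingDeadlineSeconds: 300     # prevent a backlog of missed runs
  successfulJobsHistoryLimit: 3    # reasonable history limits
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: my-report-tool:1.0  # hypothetical image
            resources:               # set resource limits on Job Pods
              requests:
                cpu: 100m
                memory: 128Mi
              limits:
                cpu: 500m
                memory: 256Mi
```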
Lesson 3 of 4
RBAC: Role-Based Access Control
Understanding RBAC
RBAC (Role-Based Access Control) is Kubernetes' built-in authorization system: it lets cluster operators define precisely what users and service accounts can and cannot do.
RBAC Purpose
RBAC manages the distribution of user rights and access to various components within the Kubernetes cluster:
Security: Prevent unauthorized access
Least privilege: Give minimum necessary permissions
Granular control: Fine-grained access policies
Audit: Track who can do what
RBAC Core Concepts
1. Subjects (Who)
Entities that can perform actions:
Users: Human users (developers, operators)
Groups: Collections of users
ServiceAccounts: Accounts for Pods/applications
2. Resources (What)
Kubernetes API resources:
Pods, Deployments, Services, ConfigMaps, Secrets
Namespaces, Nodes, PersistentVolumes
Custom Resources
3. Verbs (Actions)
Operations that can be performed:
get - Read individual resource
list - Read multiple resources
watch - Watch for changes
create - Create new resources
update - Modify existing resources
patch - Partially modify resources
delete - Delete resources
deletecollection - Delete multiple resources
RBAC Components
Role / ClusterRole
Defines WHAT actions are allowed on WHICH resources
+
RoleBinding / ClusterRoleBinding
Binds a Role to WHO (users, groups, service accounts)
=
Access Control
Subject can perform allowed actions
Role vs ClusterRole
| Aspect | Role | ClusterRole |
| --- | --- | --- |
| Scope | Namespace-specific | Cluster-wide |
| Resources | Namespaced resources only | All resources (including cluster-scoped) |
| Use case | Team/project access | Admin access, cluster-scoped resources |
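As a sketch of the cluster-wide case (the names here are illustrative), a ClusterRole can grant read access to a cluster-scoped resource such as Nodes, which no namespaced Role could cover:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader          # illustrative name
rules:
- apiGroups: [""]
  resources: ["nodes"]       # cluster-scoped resource; requires a ClusterRole
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ops-node-reader
subjects:
- kind: Group
  name: ops-team             # hypothetical group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: node-reader
  apiGroup: rbac.authorization.k8s.io
```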
Creating Roles
Example 1: Read-Only Access
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: development
rules:
- apiGroups: [""]  # "" indicates the core API group
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
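A Role grants nothing on its own; it must be bound to a subject. A minimal RoleBinding for the Role above might look like this (the user name is hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: development
subjects:
- kind: User
  name: jane@example.com     # hypothetical user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```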
Example 2: Multi-Environment Access
An organization has development and production environments:
Developers: Full access to development, read-only to production
Testers: Access only to development
DevOps: Full access to all environments
# Role: Full access in development
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dev-full-access
  namespace: development
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]
---
# Role: Read-only in production
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prod-read-only
  namespace: production
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
---
# Binding: Developers to development (full access)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-dev-access
  namespace: development
subjects:
- kind: Group
  name: developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dev-full-access
  apiGroup: rbac.authorization.k8s.io
---
# Binding: Developers to production (read-only)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-prod-access
  namespace: production
subjects:
- kind: Group
  name: developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: prod-read-only
  apiGroup: rbac.authorization.k8s.io
---
# Binding: Testers to development only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testers-dev-access
  namespace: development
subjects:
- kind: Group
  name: testers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dev-full-access
  apiGroup: rbac.authorization.k8s.io
ServiceAccounts for Pods
Pods use ServiceAccounts to authenticate with the API server:
# Create ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: default
---
# Create Role for the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: default
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
---
# Bind Role to ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-configmap-access
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
---
# Use the ServiceAccount in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  serviceAccountName: my-app  # Use this ServiceAccount
  containers:
  - name: app
    image: my-app:1.0
Testing RBAC
# Check if user can perform action
kubectl auth can-i create deployments --namespace=development
# Check for another user
kubectl auth can-i create deployments \
--namespace=development \
--as=jane@example.com
# Check all permissions for a user
kubectl auth can-i --list --as=jane@example.com
Common RBAC Patterns
1. View Role (Read-Only)
kubectl create role viewer \
--verb=get,list,watch \
--resource=pods,services,deployments \
--namespace=development
2. Edit Role (Read-Write)
kubectl create role editor \
--verb=get,list,watch,create,update,patch,delete \
--resource=pods,services,deployments,configmaps \
--namespace=development
3. Admin Role (Full Access)
kubectl create role admin \
  --verb='*' \
  --resource='*' \
  --namespace=development
# Quote the wildcards so the shell does not expand them
RBAC Best Practices
Least privilege: Grant minimum necessary permissions
Use Roles for namespaces: Not ClusterRoles when possible
Regular audits: Review permissions periodically
Groups over individuals: Manage group memberships
ServiceAccounts for Pods: Don't use default SA
Test before applying: Use kubectl auth can-i
Document permissions: Keep track of who has what access
Avoid wildcards: Be specific about resources and verbs
Lesson 4 of 4
DNS & Service Access Best Practices
Service Access Methods
There are multiple ways for applications to access Services in Kubernetes. Understanding the trade-offs is important for performance.
Method 1: DNS Round Robin
DNS Round Robin Drawbacks
While DNS seems like a natural choice, it has operational drawbacks for service access within the cluster.
How DNS Round Robin Works
# Headless Service (clusterIP: None) backed by 3 Pods
kubectl get svc webapp
NAME     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)
webapp   ClusterIP   None         <none>        80/TCP
# For a headless Service, a DNS query returns the Pod IPs directly
nslookup webapp.default.svc.cluster.local
Name:    webapp.default.svc.cluster.local
Address: 10.244.1.5   # Pod 1
Address: 10.244.2.8   # Pod 2
Address: 10.244.3.12  # Pod 3
# Client randomly selects one IP from the list
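DNS returns individual Pod IPs only for a headless Service, i.e. one declared without a cluster IP; a minimal sketch of such a Service (the selector label is illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  clusterIP: None   # headless: DNS returns the Pod IPs instead of one virtual IP
  selector:
    app: webapp     # hypothetical Pod label
  ports:
  - port: 80
```

A regular ClusterIP Service, by contrast, resolves to a single stable virtual IP, which is the basis of Method 2 below.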
Problem 1: System Latency
Up to 20 Seconds of Latency
When a Pod goes down, the maximum time required for all components to register the change and stop sending traffic to the failed Pod can be up to 20 seconds.
T = 0s
Pod Crashes
Pod 2 (10.244.2.8) becomes unavailable
T = 0-5s
Kubelet Detects Failure
Readiness probe fails, kubelet notifies API Server
T = 5-10s
API Server Updates Endpoints
Failed Pod removed from Service endpoints
T = 10-15s
kube-proxy Updates iptables
Network rules updated on all nodes
T = 15-20s
DNS Records Update
DNS server (CoreDNS) updates A records
Impact: During this 20-second window, clients may still try to connect to the failed Pod, resulting in connection errors.
Problem 2: DNS TTL Overhead
Constant DNS Query Load
When using DNS Round Robin, the Time-To-Live (TTL) for DNS records is often set to a short duration (e.g., 5 seconds) to quickly reflect changes.
This forces client applications to repeatedly send queries to the DNS server every 5 seconds, placing significant and unnecessary load on DNS infrastructure.
# Short TTL forces frequent DNS queries
# TTL = 5 seconds
Client Application Loop:
1. Query DNS for webapp.default.svc.cluster.local
2. Receive list of IPs (TTL: 5 seconds)
3. Use IPs for requests
4. Wait 5 seconds
5. Query DNS again (refresh)
6. Repeat indefinitely
Load Impact:
- 1 client: 12 queries/minute
- 100 clients: 1,200 queries/minute
- 1,000 clients: 12,000 queries/minute
# This places heavy load on:
# - Local CoreDNS pods
# - Potential upstream DNS servers
# - Network bandwidth
Method 2: ClusterIP with kube-proxy (Recommended)
More Viable Scheme
Relying on the cluster's internal network address translation (NAT), typically managed by kube-proxy, is a more viable scheme for service resolution compared to the high overhead and latency risks of DNS Round Robin.
How ClusterIP/kube-proxy Works
1. Client Uses Service Name
Application connects to webapp.default.svc.cluster.local
↓
2. DNS Returns ClusterIP (Once)
DNS query returns stable ClusterIP: 10.96.100.50
TTL can be long (30s+) since IP doesn't change
↓
3. Client Sends to ClusterIP
All requests go to 10.96.100.50:80
↓
4. kube-proxy Intercepts (iptables)
iptables rules on local node intercept traffic
↓
5. Load Balanced to Healthy Pod
Traffic routed to one of the healthy Pods
Failed Pods automatically excluded
Benefits of ClusterIP/kube-proxy
Faster failover: kube-proxy updates iptables rules immediately when Endpoints change
Lower DNS load: ClusterIP is stable, so DNS queries can have long TTL