Unlike Deployments, which run long-running services, Jobs are designed for workloads that need to complete a finite piece of work and then terminate.
Job vs Deployment
| Aspect | Deployment | Job |
| --- | --- | --- |
| Purpose | Long-running services | Run-to-completion tasks |
| Lifecycle | Runs indefinitely | Runs until the task completes |
| Restart policy | Always restart on failure | Restart until success or retry limit |
| Success criteria | None (keeps running) | Exit code 0 (successful completion) |
| Examples | Web servers, APIs, databases | Batch jobs, data processing, migrations |
How Jobs Work
Job Completion Criteria
A Job keeps running (and, on failure, re-creating) the Pod it creates until the Pod completes its task and exits with a successful result code (exit code 0).
1. Job Created
kubectl apply -f job.yaml
↓
2. Pod Started
Job controller creates Pod to run task
↓
3. Task Executes
Container runs to completion
↓
4. Check Exit Code
Did container exit with code 0?
↓
If Exit Code = 0: SUCCESS
Job marked as Complete, Pod kept for logs
If Exit Code ≠ 0: FAILURE
Retry Pod (up to backoffLimit)
Basic Job Manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  # Number of successful completions required
  completions: 1
  # Run pods in parallel (default: 1)
  parallelism: 1
  # Number of retries before marking Job failed
  backoffLimit: 3
  # Pod template
  template:
    spec:
      restartPolicy: Never  # or OnFailure
      containers:
      - name: migration
        image: my-migration-tool:1.0
        command:
        - /bin/sh
        - -c
        - |
          set -e  # abort on the first failing command so the Job reports failure
          echo "Starting data migration..."
          # Do the actual work
          migrate-data --source=old-db --dest=new-db
          echo "Migration complete!"
          # Exit 0 for success
          exit 0
Job Parameters Explained
1. completions
spec:
  completions: 3  # Job needs 3 successful Pod completions

# Use cases:
# - Process 3 batches of data
# - Run 3 different migration tasks
# - Split work into 3 chunks
2. parallelism
spec:
  completions: 10
  parallelism: 3  # Run 3 Pods at a time

# Execution:
# - Pods 1, 2, 3 run in parallel
# - When Pod 1 completes, Pod 4 starts
# - Continue until all 10 completions are achieved
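Putting the two fields together, a complete manifest might look like the following sketch (the name, image, and work command are illustrative, not from the original):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-processor        # illustrative name
spec:
  completions: 10              # 10 successful Pod completions required
  parallelism: 3               # at most 3 Pods running at once
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-batch-tool:1.0    # hypothetical image
        command: ["process-batch"]  # hypothetical command that processes one chunk
```

Each completed Pod counts toward `completions`; the controller keeps up to `parallelism` Pods running until the target is reached.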
3. backoffLimit
spec:
  backoffLimit: 4  # Retry up to 4 times on failure

# If the Pod fails:
# - Retry 1: wait 10s
# - Retry 2: wait 20s
# - Retry 3: wait 40s
# - Retry 4: wait 80s (delay doubles each time, capped at 6 minutes)
# After 4 failures: Job marked as Failed
Job Operations
Create and Monitor Job
# Create Job
kubectl apply -f job.yaml
# Watch Job status
kubectl get jobs -w
NAME             COMPLETIONS   DURATION   AGE
data-migration   0/1           5s         5s
data-migration   1/1           45s        45s   ← Complete!
# View Job details
kubectl describe job data-migration
# View Job Pods
kubectl get pods -l job-name=data-migration
NAME                    READY   STATUS      RESTARTS   AGE
data-migration-abc123   0/1     Completed   0          2m
# View logs from completed Job
kubectl logs data-migration-abc123
Delete Job
# Delete Job and its Pods
kubectl delete job data-migration
# Keep completed Pods for manual cleanup
kubectl delete job data-migration --cascade=orphan
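Instead of deleting by hand, the Job API also supports time-based cleanup via the `ttlSecondsAfterFinished` field; a minimal sketch (the TTL value is illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  ttlSecondsAfterFinished: 600  # delete the Job and its Pods 10 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migration
        image: my-migration-tool:1.0
```

Note that once the TTL expires the Pod logs are gone too, so collect logs before the deadline if you need them.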
CronJob Parameters Explained
1. concurrencyPolicy
spec:
  concurrencyPolicy: Allow  # Default: Allow concurrent runs

# Options:
# - Allow: Allow concurrent Jobs
# - Forbid: Skip new run if previous still running
# - Replace: Cancel current Job and start new one
2. startingDeadlineSeconds
spec:
  startingDeadlineSeconds: 300  # 5 minutes

# If the CronJob misses its scheduled time (cluster down, etc.):
# - Try to start within 300 seconds of the scheduled time
# - If the deadline has passed, count the run as missed
# - Prevents a backlog of old Jobs
3. suspend
spec:
  suspend: true  # Temporarily disable CronJob

# Use cases:
# - Maintenance windows
# - Debugging
# - Pause without deleting
CronJob Considerations
Idempotency: Jobs should be idempotent (safe to run multiple times)
Missed runs: CronJobs may miss schedules if the cluster is down
Timezone: Defaults to the controller manager's timezone (use the timeZone field to override)
Concurrency: Set an appropriate concurrencyPolicy
History limits: Clean up old Jobs to avoid clutter
CronJob Best Practices
Make Jobs idempotent (can run multiple times safely)
Set concurrencyPolicy: Forbid for non-overlapping tasks
Use startingDeadlineSeconds to prevent backlogs
Keep history limits reasonable (3-7 successful, 1-3 failed)
Set resource limits on Job Pods
Monitor CronJob execution and failures
Test schedule syntax before deploying
Document what each CronJob does
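Tying these practices together, a CronJob manifest might look like this sketch (the name, schedule, image, and resource values are illustrative assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report             # illustrative name
spec:
  schedule: "0 2 * * *"            # 02:00 daily; test syntax before deploying
  timeZone: "Etc/UTC"              # explicit timezone instead of the controller default
  concurrencyPolicy: Forbid        # non-overlapping runs
  startingDeadlineSeconds: 300     # prevent a backlog of missed runs
  successfulJobsHistoryLimit: 3    # reasonable history limits
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: my-report-tool:1.0  # hypothetical image
            resources:               # set resource limits on Job Pods
              requests:
                cpu: 100m
                memory: 128Mi
              limits:
                cpu: 500m
                memory: 256Mi
```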
Lesson 3 of 4
RBAC: Role-Based Access Control
Understanding RBAC
RBAC (Role-Based Access Control) is Kubernetes' built-in authorization system: it lets cluster operators define precisely what users and service accounts can and cannot do.
RBAC Purpose
RBAC manages the distribution of user rights and access to various components within the Kubernetes cluster:
Security: Prevent unauthorized access
Least privilege: Give minimum necessary permissions
Granular control: Fine-grained access policies
Audit: Track who can do what
RBAC Core Concepts
1. Subjects (Who)
Entities that can perform actions:
Users: Human users (developers, operators)
Groups: Collections of users
ServiceAccounts: Accounts for Pods/applications
2. Resources (What)
Kubernetes API resources:
Pods, Deployments, Services, ConfigMaps, Secrets
Namespaces, Nodes, PersistentVolumes
Custom Resources
3. Verbs (Actions)
Operations that can be performed:
get - Read individual resource
list - Read multiple resources
watch - Watch for changes
create - Create new resources
update - Modify existing resources
patch - Partially modify resources
delete - Delete resources
deletecollection - Delete multiple resources
RBAC Components
Role / ClusterRole
Defines WHAT actions are allowed on WHICH resources
+
RoleBinding / ClusterRoleBinding
Binds a Role to WHO (users, groups, service accounts)
=
Access Control
Subject can perform allowed actions
Role vs ClusterRole
| Aspect | Role | ClusterRole |
| --- | --- | --- |
| Scope | Namespace-specific | Cluster-wide |
| Resources | Namespaced resources only | All resources (including cluster-scoped) |
| Use case | Team/project access | Admin access, cluster-scoped resources |
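As a sketch of the cluster-wide case (the names here are illustrative), a ClusterRole can grant read access to a cluster-scoped resource such as Nodes, which no namespaced Role could cover:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader          # illustrative name
rules:
- apiGroups: [""]
  resources: ["nodes"]       # cluster-scoped resource; requires a ClusterRole
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ops-node-reader
subjects:
- kind: Group
  name: ops-team             # hypothetical group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: node-reader
  apiGroup: rbac.authorization.k8s.io
```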
Creating Roles
Example 1: Read-Only Access
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: development
rules:
- apiGroups: [""]  # "" indicates the core API group
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
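A Role grants nothing on its own; it must be bound to a subject. A minimal RoleBinding for the Role above might look like this (the user name is hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: development
subjects:
- kind: User
  name: jane@example.com     # hypothetical user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```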
Example 2: Multi-Environment Access
An organization has development and production environments:
Developers: Full access to development, read-only to production
Testers: Access only to development
DevOps: Full access to all environments
# Role: Full access in development
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dev-full-access
  namespace: development
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]
---
# Role: Read-only in production
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prod-read-only
  namespace: production
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
---
# Binding: Developers to development (full access)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-dev-access
  namespace: development
subjects:
- kind: Group
  name: developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dev-full-access
  apiGroup: rbac.authorization.k8s.io
---
# Binding: Developers to production (read-only)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-prod-access
  namespace: production
subjects:
- kind: Group
  name: developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: prod-read-only
  apiGroup: rbac.authorization.k8s.io
---
# Binding: Testers to development only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testers-dev-access
  namespace: development
subjects:
- kind: Group
  name: testers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dev-full-access
  apiGroup: rbac.authorization.k8s.io
ServiceAccounts for Pods
Pods use ServiceAccounts to authenticate with the API server:
# Create ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: default
---
# Create Role for the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: default
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
---
# Bind Role to ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-configmap-access
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
---
# Use the ServiceAccount in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  serviceAccountName: my-app  # Use this ServiceAccount
  containers:
  - name: app
    image: my-app:1.0
Testing RBAC
# Check if user can perform action
kubectl auth can-i create deployments --namespace=development
# Check for another user
kubectl auth can-i create deployments \
--namespace=development \
--as=jane@example.com
# Check all permissions for a user
kubectl auth can-i --list --as=jane@example.com
Common RBAC Patterns
1. View Role (Read-Only)
kubectl create role viewer \
--verb=get,list,watch \
--resource=pods,services,deployments \
--namespace=development
2. Edit Role (Read-Write)
kubectl create role editor \
--verb=get,list,watch,create,update,patch,delete \
--resource=pods,services,deployments,configmaps \
--namespace=development
3. Admin Role (Full Access)
kubectl create role admin \
  --verb='*' \
  --resource='*' \
  --namespace=development
# Quote the wildcards so the shell does not expand them
RBAC Best Practices
Least privilege: Grant minimum necessary permissions
Use Roles for namespaces: Not ClusterRoles when possible
Regular audits: Review permissions periodically
Groups over individuals: Manage group memberships
ServiceAccounts for Pods: Don't use default SA
Test before applying: Use kubectl auth can-i
Document permissions: Keep track of who has what access
Avoid wildcards: Be specific about resources and verbs
Lesson 4 of 4
DNS & Service Access Best Practices
Service Access Methods
There are multiple ways for applications to access Services in Kubernetes. Understanding the trade-offs is important for performance.
Method 1: DNS Round Robin
DNS Round Robin Drawbacks
While DNS seems like a natural choice, it has operational drawbacks for service access within the cluster.
How DNS Round Robin Works
# Headless Service (clusterIP: None) backed by 3 Pods
kubectl get svc webapp
NAME     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)
webapp   ClusterIP   None         <none>        80/TCP
# For a headless Service, a DNS query returns the Pod IPs directly
nslookup webapp.default.svc.cluster.local
Name:    webapp.default.svc.cluster.local
Address: 10.244.1.5   # Pod 1
Address: 10.244.2.8   # Pod 2
Address: 10.244.3.12  # Pod 3
# Client randomly selects one IP from the list
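DNS returns individual Pod IPs only for a headless Service, i.e. one declared without a cluster IP; a minimal sketch of such a Service (the selector label is illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  clusterIP: None   # headless: DNS returns the Pod IPs instead of one virtual IP
  selector:
    app: webapp     # hypothetical Pod label
  ports:
  - port: 80
```

A regular ClusterIP Service, by contrast, resolves to a single stable virtual IP, which is the basis of Method 2 below.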
Problem 1: System Latency
Up to 20 Seconds of Latency
When a Pod goes down, the maximum time required for all components to register the change and stop sending traffic to the failed Pod can be up to 20 seconds.
T = 0s
Pod Crashes
Pod 2 (10.244.2.8) becomes unavailable
T = 0-5s
Kubelet Detects Failure
Readiness probe fails, kubelet notifies API Server
T = 5-10s
API Server Updates Endpoints
Failed Pod removed from Service endpoints
T = 10-15s
kube-proxy Updates iptables
Network rules updated on all nodes
T = 15-20s
DNS Records Update
DNS server (CoreDNS) updates A records
Impact: During this 20-second window, clients may still try to connect to the failed Pod, resulting in connection errors.
Problem 2: DNS TTL Overhead
Constant DNS Query Load
When using DNS Round Robin, the Time-To-Live (TTL) for DNS records is often set to a short duration (e.g., 5 seconds) to quickly reflect changes.
This forces client applications to repeatedly send queries to the DNS server every 5 seconds, placing significant and unnecessary load on DNS infrastructure.
# Short TTL forces frequent DNS queries
# TTL = 5 seconds
Client Application Loop:
1. Query DNS for webapp.default.svc.cluster.local
2. Receive list of IPs (TTL: 5 seconds)
3. Use IPs for requests
4. Wait 5 seconds
5. Query DNS again (refresh)
6. Repeat indefinitely
Load Impact:
- 1 client: 12 queries/minute
- 100 clients: 1,200 queries/minute
- 1,000 clients: 12,000 queries/minute
# This places heavy load on:
# - Local CoreDNS pods
# - Potential upstream DNS servers
# - Network bandwidth
Method 2: ClusterIP with kube-proxy (Recommended)
More Viable Scheme
Relying on the cluster's internal network address translation (NAT), typically managed by kube-proxy, is a more viable scheme for service resolution compared to the high overhead and latency risks of DNS Round Robin.
How ClusterIP/kube-proxy Works
1. Client Uses Service Name
Application connects to webapp.default.svc.cluster.local
↓
2. DNS Returns ClusterIP (Once)
DNS query returns stable ClusterIP: 10.96.100.50
TTL can be long (30s+) since IP doesn't change
↓
3. Client Sends to ClusterIP
All requests go to 10.96.100.50:80
↓
4. kube-proxy Intercepts (iptables)
iptables rules on local node intercept traffic
↓
5. Load Balanced to Healthy Pod
Traffic routed to one of the healthy Pods
Failed Pods automatically excluded
Benefits of ClusterIP/kube-proxy
Faster failover: kube-proxy updates iptables rules immediately when Endpoints change
Lower DNS load: ClusterIP is stable, so DNS queries can have long TTL