Distributed Databases & Failover Crash Course

Master Database Scalability, Reliability, and High Availability

Lesson 1 of 6

Introduction to Distributed Databases

What is a Distributed Database?

A distributed database is a database system that stores data across multiple physical locations, either on the same network or across different geographic locations. Unlike traditional centralized databases, data is distributed across multiple nodes, servers, or data centers.

Why Distributed Databases?

Modern applications require distributed databases to address several critical challenges:

  • Scalability: Handle massive amounts of data and traffic by distributing load
  • High Availability: Ensure service continuity even when components fail
  • Performance: Reduce latency by placing data closer to users
  • Fault Tolerance: Survive hardware failures and disasters
  • Geographic Distribution: Serve users globally with low latency

Types of Distributed Databases

1. Homogeneous Distributed Databases

All sites use the same database management system (DBMS) and appear as a single logical database to users.

  • Same DBMS across all nodes
  • Easier to manage and maintain
  • Example: Multiple PostgreSQL instances synchronized together

2. Heterogeneous Distributed Databases

Different sites may use different DBMSs with varying schemas and data models.

  • Different DBMSs at different locations
  • More complex integration
  • Example: MySQL at one site, Oracle at another, integrated through middleware

Distributed Database Architecture

Architecture  | Description                                           | Use Case
--------------|-------------------------------------------------------|----------------------------------
Client-Server | Clients connect to one or more database servers       | Web applications, mobile apps
Peer-to-Peer  | Each node is both client and server                   | Blockchain, decentralized systems
Multi-tier    | Multiple layers (presentation, application, database) | Enterprise applications

Key Challenges in Distributed Databases

1. Data Distribution

Deciding how to distribute data across nodes while maintaining performance and consistency.

2. Transaction Management

Ensuring ACID properties across multiple nodes is complex and requires sophisticated protocols.

3. Network Partitioning

Handling scenarios where network failures split the database into isolated segments.

4. Data Consistency

Maintaining consistent data across all replicas when updates occur.

CAP Theorem Preview

The CAP theorem states that distributed databases can only guarantee two of three properties: Consistency, Availability, and Partition Tolerance. We'll explore this in detail in Lesson 4.

Popular Distributed Database Systems

Database        | Type                  | Key Features
----------------|-----------------------|---------------------------------------
Cassandra       | NoSQL (Column-family) | High availability, linear scalability
MongoDB         | NoSQL (Document)      | Flexible schema, horizontal scaling
CockroachDB     | SQL (Relational)      | ACID compliant, geo-distribution
Amazon DynamoDB | NoSQL (Key-value)     | Managed service, auto-scaling
Google Spanner  | SQL (Relational)      | Global consistency, SQL support

Basic Concepts

Node

An individual database server or instance in the distributed system. Each node stores a portion of the data.

Replica

A copy of data stored on multiple nodes to ensure availability and fault tolerance.

Shard

A horizontal partition of data. Each shard contains a subset of the total data.

Cluster

A collection of interconnected database nodes working together as a single system.

Real-World Example

Netflix uses Apache Cassandra with hundreds of nodes distributed across multiple AWS regions. This allows them to handle billions of requests per day, survive data center failures, and provide low-latency access to users worldwide.

Test Your Knowledge - Lesson 1

Answer the following questions to proceed. You need at least 2 of 3 correct to pass.

Question 1: What is a key benefit of distributed databases?

Question 2: What is a shard in distributed databases?

Question 3: Which database is known for being ACID compliant with geo-distribution?

Lesson 2 of 6

Replication Strategies

What is Database Replication?

Database replication is the process of copying and maintaining database objects (like tables and databases) in multiple database servers. This creates redundant copies of data to improve availability, fault tolerance, and read performance.

Benefits of Replication

  • High Availability: If one server fails, others can continue serving requests
  • Load Distribution: Read queries can be distributed across replicas
  • Disaster Recovery: Data is preserved even if one location is destroyed
  • Geographic Distribution: Place data closer to users for lower latency

Replication Architectures

1. Master-Slave Replication (Primary-Secondary)

One primary node handles all writes, and changes are replicated to one or more secondary nodes that handle reads.

# Master Node (Primary)
- Accepts all write operations
- Propagates changes to slaves
- Can also serve read requests

# Slave Nodes (Secondaries)
- Receive data from master
- Serve read-only requests
- Cannot accept write operations

Example: MySQL Master-Slave Setup
------------------------------------
Master: 192.168.1.10 (writes)
Slave1: 192.168.1.11 (reads)
Slave2: 192.168.1.12 (reads)

Advantages:

  • Simple to implement and understand
  • Scales read operations effectively
  • Reduced load on the master for read-heavy workloads

Disadvantages:

  • Single point of failure for writes (master)
  • Replication lag may cause stale reads
  • Limited write scalability

2. Master-Master Replication (Multi-Master)

Multiple nodes can accept both read and write operations, with changes synchronized between all masters.

# All nodes are equal
Master1: 192.168.1.10 (reads & writes)
Master2: 192.168.1.11 (reads & writes)
Master3: 192.168.1.12 (reads & writes)

# Bidirectional synchronization
Master1 <---> Master2 <---> Master3

Advantages:

  • No single point of failure
  • Scales both reads and writes
  • Better availability

Disadvantages:

  • Complex conflict resolution needed
  • Risk of write conflicts
  • More difficult to maintain consistency

3. Peer-to-Peer Replication

All nodes are equal peers, each can read and write. Used in NoSQL databases like Cassandra.

# Cassandra-style architecture
Node1 <---> Node2
  ^           ^
  |           |
  v           v
Node4 <---> Node3

- No master node
- Data distributed via consistent hashing
- Replication factor determines copies

Synchronous vs Asynchronous Replication

Synchronous Replication

Write operations must be confirmed by replicas before completing the transaction.

# Synchronous Write Flow
1. Client sends write to primary
2. Primary writes to local storage
3. Primary sends to all replicas
4. Replicas acknowledge receipt
5. Primary confirms to client ← Only then!

Latency: Higher
Consistency: Strong
Data Loss Risk: Very Low

Use When: Data consistency is critical (financial transactions, inventory management)

Asynchronous Replication

Write operations are confirmed immediately after being written to the primary, replicas updated later.

# Asynchronous Write Flow
1. Client sends write to primary
2. Primary writes to local storage
3. Primary confirms to client ← Immediately!
4. Primary sends to replicas (background)
5. Replicas eventually receive updates

Latency: Lower
Consistency: Eventual
Data Loss Risk: Higher

Use When: Performance is critical and you can tolerate eventual consistency (social media feeds, caching)

Replication Lag

Replication lag is the delay between when data is written to the primary and when it appears on replicas.

Handling Replication Lag

Strategies to mitigate replication lag issues:

  • Read-your-writes consistency: Route user's reads to primary after they write
  • Monotonic reads: Always read from the same replica for a session
  • Monitoring: Track lag metrics and alert when exceeding thresholds
  • Connection pooling: Use smart routing to direct traffic appropriately
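The first two strategies above can be combined in a small routing layer. A minimal sketch, assuming a hypothetical `ReadRouter` class and node names; the sticky-window length is an assumption you would tune to your worst-case replication lag:

```python
import time

STICKY_WINDOW = 5.0  # seconds; tune to your worst-case replication lag

class ReadRouter:
    """After a user writes, route that user's reads to the primary for a
    short window so they always see their own writes (read-your-writes)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.last_write = {}  # user_id -> timestamp of that user's last write

    def record_write(self, user_id):
        self.last_write[user_id] = time.time()

    def pick_read_node(self, user_id):
        # Route to primary while the user's write may still be replicating.
        if time.time() - self.last_write.get(user_id, 0) < STICKY_WINDOW:
            return self.primary
        # Otherwise pin the user to one replica (also gives monotonic reads).
        return self.replicas[hash(user_id) % len(self.replicas)]
```

Pinning a user to a single replica by hashing the user id is what makes the reads monotonic: the same session never flips between replicas with different lag.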

Replication Topologies

Topology | Structure                          | Best For
---------|------------------------------------|----------------------------
Star     | Central master, multiple slaves    | Simple setups, read scaling
Chain    | Master → Slave1 → Slave2 → Slave3  | Geographic distribution
Ring     | Circular replication between nodes | Multi-master scenarios
Mesh     | All nodes replicate to all others  | Maximum redundancy

Real-World Example: Instagram

Instagram uses PostgreSQL with master-slave replication. They use the master for writes and route read queries to multiple read replicas. During peak times, they can have 12+ read replicas to handle billions of read requests per day.

PostgreSQL Replication Example

# On Primary Server (postgresql.conf)
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB

# Create replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';

# On Replica Server
# (PostgreSQL 12+ replaced recovery.conf: create an empty standby.signal
#  file and put primary_conninfo in postgresql.conf)
standby_mode = 'on'    # pre-12 recovery.conf only
primary_conninfo = 'host=192.168.1.10 port=5432 user=replicator password=secret'
trigger_file = '/tmp/postgresql.trigger.5432'    # creating this file promotes the standby

# Start replica server - it will sync from primary

Test Your Knowledge - Lesson 2

Answer the following questions to proceed. You need at least 2 of 3 correct to pass.

Question 1: In Master-Slave replication, where are write operations handled?

Question 2: What is the main advantage of asynchronous replication over synchronous?

Question 3: What is replication lag?

Lesson 3 of 6

Partitioning and Sharding

What is Data Partitioning?

Data partitioning is the process of dividing a large database into smaller, more manageable pieces called partitions. Each partition contains a subset of the data and can be stored on different servers or storage devices.

Why Partition Data?

  • Scalability: Distribute data across multiple servers to handle more data
  • Performance: Smaller datasets mean faster queries
  • Availability: Partition failures don't affect entire database
  • Manageability: Easier to backup, restore, and maintain smaller partitions

Partitioning vs Sharding

While often used interchangeably, there's a subtle difference:

  • Partitioning: Splitting data within a single database instance (vertical or horizontal)
  • Sharding: Horizontal partitioning across multiple database instances/servers

Types of Partitioning

1. Horizontal Partitioning (Sharding)

Dividing rows of a table across multiple partitions. Each partition has the same schema but different rows.

# Original Users Table (1M rows)
users (user_id, name, email, country, created_at)

# After Horizontal Partitioning:
Shard 1: users_1 (user_id 1-250,000)
Shard 2: users_2 (user_id 250,001-500,000)
Shard 3: users_3 (user_id 500,001-750,000)
Shard 4: users_4 (user_id 750,001-1,000,000)

Each shard: Same columns, different rows

2. Vertical Partitioning

Dividing columns of a table across multiple partitions. Each partition has different columns but the same rows.

# Original Users Table
users (user_id, name, email, address, phone, preferences, settings)

# After Vertical Partitioning:
users_basic (user_id, name, email)
users_contact (user_id, address, phone)
users_preferences (user_id, preferences, settings)

Each partition: Different columns, same user_ids

Use Case: Separate frequently accessed columns from rarely accessed ones.

Sharding Strategies

1. Range-Based Sharding

Partition data based on ranges of a key (e.g., date ranges, alphabetical ranges).

# Sharding by User ID Range
Shard 1: user_id 1 - 1,000,000
Shard 2: user_id 1,000,001 - 2,000,000
Shard 3: user_id 2,000,001 - 3,000,000

# Sharding by Date Range
Shard 1: orders from 2023
Shard 2: orders from 2024
Shard 3: orders from 2025

Advantages: Simple to implement, easy to add new ranges

Disadvantages: Can lead to unbalanced shards (hotspots)

2. Hash-Based Sharding

Use a hash function on a shard key to determine which shard stores the data.

# Hash function determines shard
shard_number = hash(user_id) % number_of_shards

Example:
user_id: 12345
hash(12345) = 98765
98765 % 4 = 1 → Store in Shard 1

user_id: 67890
hash(67890) = 45678
45678 % 4 = 2 → Store in Shard 2

Advantages: Even data distribution, no hotspots

Disadvantages: Difficult to add/remove shards (requires re-hashing)
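The re-hashing pain can be shown in a few lines of Python. This is an illustrative sketch: `shard_for` is our own helper, and md5 stands in for whatever stable hash a real system would use (Python's built-in `hash()` changes between runs, so it is unsuitable for sharding):

```python
import hashlib

def shard_for(key, num_shards):
    # Stable hash of the key, reduced to a shard index by modulo.
    digest = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return digest % num_shards

keys = range(10_000)
before = {k: shard_for(k, 3) for k in keys}  # 3 shards
after = {k: shard_for(k, 4) for k in keys}   # add a 4th shard
moved = sum(1 for k in keys if before[k] != after[k])
# With plain modulo hashing, roughly 3/4 of keys land on a different
# shard when going from 3 to 4 shards.
print(f"{moved / 10_000:.0%} of keys changed shards")
```

This is exactly the problem consistent hashing (below) was designed to avoid.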

3. Directory-Based Sharding

Use a lookup table (directory) that maps keys to shards.

# Shard Directory
{
  "user_1000": "shard_1",
  "user_2000": "shard_2",
  "user_3000": "shard_1",
  "region_US": "shard_1",
  "region_EU": "shard_2",
  "region_ASIA": "shard_3"
}

Advantages: Flexible, can use complex logic

Disadvantages: Directory is a single point of failure, needs maintenance
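In code, the directory is little more than a replicated map. A toy sketch (the `SHARD_DIRECTORY` contents, function name, and fallback shard are hypothetical):

```python
# Hypothetical shard directory; in production this map would be
# replicated/cached, since it sits on the critical path of every query.
SHARD_DIRECTORY = {
    "region_US": "shard_1",
    "region_EU": "shard_2",
    "region_ASIA": "shard_3",
}

def shard_for_region(region_key, default="shard_1"):
    # Unknown keys fall back to a default shard rather than failing.
    return SHARD_DIRECTORY.get(region_key, default)
```

The flexibility comes from the fact that the map can encode arbitrary placement decisions, which no hash function can.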

4. Geographic Sharding

Partition data based on geographic location for lower latency.

# Geographic Sharding
Shard_US_EAST: Users in US East Coast
Shard_US_WEST: Users in US West Coast
Shard_EU:      Users in Europe
Shard_ASIA:    Users in Asia

Benefit: Data close to users = lower latency

Consistent Hashing

An advanced hashing technique that minimizes data movement when adding or removing shards.

# Traditional hashing problem:
- 3 shards → hash(key) % 3
- Add 4th shard → hash(key) % 4
- Result: Most keys move to different shards!

# Consistent hashing solution:
- Uses a hash ring (0 to 2^32-1)
- Shards placed on ring at hashed positions
- Keys placed on ring, assigned to next shard clockwise
- Adding/removing a shard only affects immediate neighbors
- Used by: Cassandra, DynamoDB, Memcached
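A minimal Python sketch of the hash ring, assuming one position per node (class and helper names are ours; production systems such as Cassandra add many "virtual nodes" per server for smoother balance):

```python
import bisect
import hashlib

def _position(value):
    # Stable position on the ring for a node name or a key.
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((_position(n), n) for n in nodes)

    def node_for(self, key):
        positions = [p for p, _ in self.ring]
        # First node clockwise from the key's position, wrapping to 0.
        i = bisect.bisect_left(positions, _position(key)) % len(self.ring)
        return self.ring[i][1]

    def add_node(self, node):
        bisect.insort(self.ring, (_position(node), node))

ring = HashRing(["node-A", "node-B", "node-C"])
before = {k: ring.node_for(k) for k in range(1000)}
ring.add_node("node-D")
after = {k: ring.node_for(k) for k in range(1000)}
moved = [k for k in before if before[k] != after[k]]
# Only keys in the new node's arc move, and they all move *to* node-D.
```

Contrast with the modulo approach: here the only keys that relocate are the ones the new node takes over from its clockwise neighbor.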

Sharding Challenges

  • Cross-shard queries: Queries spanning multiple shards are expensive
  • Shard rebalancing: Moving data between shards is complex
  • Hotspots: Uneven access patterns can overload specific shards
  • Transactions: Distributed transactions across shards are complex
  • Application complexity: Application must be shard-aware

Choosing a Shard Key

The shard key determines how data is distributed. A good shard key:

  • Has high cardinality: many unique values for even distribution
  • Avoids hotspots: no single value is accessed far more than others
  • Is query-friendly: most queries can target a single shard
  • Is immutable: it shouldn't change after record creation
# Good Shard Keys:
✓ user_id (high cardinality, even distribution)
✓ order_id (high cardinality, immutable)
✓ tenant_id (for multi-tenant apps)

# Bad Shard Keys:
✗ status (low cardinality: 'active', 'inactive')
✗ country (uneven: USA might have 80% of users)
✗ timestamp (creates hotspot on latest shard)

Real-World Example: MongoDB Sharding

MongoDB automatically shards collections across multiple servers. For example, Foursquare uses MongoDB sharding to manage billions of check-ins. They shard by user_id to ensure each user's data is on one shard, making queries efficient.

MongoDB Sharding Example

# Enable sharding on database
use admin
db.runCommand({ enableSharding: "mydb" })

# Shard a collection by user_id
db.runCommand({
  shardCollection: "mydb.users",
  key: { user_id: "hashed" }
})

# MongoDB automatically:
- Distributes data across shards
- Routes queries to appropriate shards
- Rebalances data when adding shards

Test Your Knowledge - Lesson 3

Answer the following questions to proceed. You need at least 2 of 3 correct to pass.

Question 1: What is horizontal partitioning?

Question 2: What is the main advantage of hash-based sharding?

Question 3: Which is a good characteristic for a shard key?

Lesson 4 of 6

Consistency Models & CAP Theorem

The CAP Theorem

The CAP theorem, proposed by Eric Brewer, states that a distributed database system can only guarantee two of the following three properties simultaneously:

The Three Properties

  • Consistency (C): All nodes see the same data at the same time
  • Availability (A): Every request receives a response (success or failure)
  • Partition Tolerance (P): System continues operating despite network partitions

In reality, network partitions are inevitable in distributed systems, so you must choose between Consistency and Availability.

CAP Trade-offs

CP Systems (Consistency + Partition Tolerance)

Sacrifice availability to maintain consistency during network partitions.

# CP System Behavior During Partition:
- Nodes cannot communicate with each other
- System rejects writes to maintain consistency
- Reads may be rejected or delayed
- System becomes unavailable until partition resolves

Examples:
- MongoDB (with strong consistency settings)
- HBase
- Redis (with WAIT command)
- ZooKeeper

Use When: Data correctness is critical (banking, inventory)

AP Systems (Availability + Partition Tolerance)

Sacrifice consistency to maintain availability during network partitions.

# AP System Behavior During Partition:
- Nodes accept writes independently
- System remains available
- Different nodes may have different data
- Eventually converges after partition resolves

Examples:
- Cassandra
- DynamoDB
- Couchbase
- Riak

Use When: Availability is critical (social media, analytics)

CA Systems (Consistency + Availability)

In practice, CA systems don't exist in truly distributed environments because network partitions are inevitable.

CA Systems Reality Check

Traditional relational databases (MySQL, PostgreSQL) running on a single node can be considered CA, but they're not truly distributed. Once you distribute them, you must handle partitions, moving into CP or AP territory.

Consistency Models

1. Strong Consistency (Linearizability)

Once a write completes, all subsequent reads will see that write or a later one.

# Strong Consistency Example:
Time T1 - Client A writes X=1
Time T2 - Write completes
Time T3 - Client B reads X → Must see X=1 or a later value

Guarantees: All clients see the same value at the same time
Latency: Higher (must wait for replication)
Use Case: Financial transactions, inventory systems

2. Eventual Consistency

If no new updates are made, all replicas will eventually converge to the same value.

# Eventual Consistency Example:
Time T1 - Client A writes X=1 to Node 1
Time T2 - Client B reads from Node 2 → May see old value X=0
Time T3 - Replication completes
Time T4 - Client B reads from Node 2 → Now sees X=1

Guarantees: Eventually consistent, but reads may be stale
Latency: Lower (don't wait for replication)
Use Case: Social media feeds, caching, analytics

3. Causal Consistency

Writes that are causally related are seen by all nodes in the same order.

# Causal Consistency Example:
Alice posts: "I love distributed databases!"
Bob replies: "Me too!"

All users must see:
1. Alice's post first
2. Bob's reply second

But unrelated posts can appear in any order relative to these.

4. Read-Your-Writes Consistency

A client will always see their own writes in subsequent reads.

# Read-Your-Writes Example:
User updates profile: name = "John Doe"
User immediately views profile → Must see "John Doe"

Implementation: Route the user's reads to the primary or use session affinity

5. Monotonic Reads

Once a client reads a value, it will never see an older value in subsequent reads.

# Monotonic Reads Example:
Time T1 - User reads X=5 from Replica 1
Time T2 - User reads X from Replica 2 → Must see X≥5

Prevents: Time going backwards from the user's perspective

ACID vs BASE

ACID (Traditional Databases)

Property    | Description
------------|--------------------------------------------------------------
Atomicity   | All or nothing: a transaction fully completes or fully fails
Consistency | Database moves from one valid state to another
Isolation   | Concurrent transactions don't interfere with each other
Durability  | Completed transactions survive system failures

BASE (Distributed NoSQL)

Property             | Description
---------------------|--------------------------------------------------------
Basically Available  | System guarantees availability (CAP theorem)
Soft state           | State may change over time without input (convergence)
Eventual consistency | System becomes consistent over time

Quorum-Based Consistency

Use voting to balance consistency and availability. Common in distributed databases like Cassandra and DynamoDB.

# Quorum Formula:
N = Total number of replicas
W = Write quorum (nodes that must acknowledge a write)
R = Read quorum (nodes queried for a read)

Strong Consistency: W + R > N
Examples:
- N=3, W=2, R=2 (2+2>3) → Strong consistency
- N=3, W=3, R=1 (3+1>3) → Strong consistency, slow writes
- N=3, W=1, R=3 (1+3>3) → Strong consistency, slow reads

Eventual Consistency: W + R ≤ N
Examples:
- N=3, W=1, R=1 (1+1≤3) → Fast but eventually consistent
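The quorum rule reduces to a one-line overlap check: if W + R > N, every read quorum must intersect every write quorum, so a read always touches at least one replica that saw the latest write. A small sketch (the function name is ours):

```python
def is_strongly_consistent(n, w, r):
    """True when read and write quorums must overlap (W + R > N)."""
    return w + r > n

# The examples from the formula above:
assert is_strongly_consistent(3, 2, 2)      # balanced quorum
assert is_strongly_consistent(3, 3, 1)      # slow writes, fast reads
assert not is_strongly_consistent(3, 1, 1)  # fast but eventually consistent
```

A common default is the majority quorum W = R = N // 2 + 1, which satisfies the inequality for any N.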

Cassandra Consistency Levels

Cassandra allows you to tune consistency per query:

  • ONE: 1 node responds (fastest, least consistent)
  • QUORUM: Majority of nodes respond
  • ALL: All nodes respond (slowest, most consistent)
  • LOCAL_QUORUM: Quorum in local datacenter

Real-World Example: Amazon

Amazon's DynamoDB uses eventual consistency by default for high availability. However, critical operations like checkout and payment use strong consistency settings to prevent inventory overselling and payment errors.

Conflict Resolution

In eventually consistent systems, conflicts can occur when the same data is updated on multiple nodes simultaneously.

Conflict Resolution Strategies:

  • Last Write Wins (LWW): Use timestamps, newest write wins (simple but can lose data)
  • Version Vectors: Track causality, identify true conflicts
  • Application-level resolution: Let application decide (e.g., shopping cart merge)
  • CRDTs: Conflict-free Replicated Data Types that merge automatically
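Last-Write-Wins can be sketched in a few lines; note how the merge silently drops the losing write, which is exactly the data-loss risk mentioned above (`lww_merge` and the timestamps are illustrative):

```python
def lww_merge(a, b):
    """a, b are (timestamp, value) pairs for the same key on two replicas.
    The pair with the newer timestamp wins; the other write is discarded."""
    return a if a[0] >= b[0] else b

# Two replicas accepted different writes during a partition:
replica1 = (1700000005, "alice@new.example")  # later write
replica2 = (1700000001, "alice@old.example")  # earlier write
winner = lww_merge(replica1, replica2)
print(winner[1])  # the later write survives; the earlier one is lost
```

Version vectors and CRDTs exist precisely because this discard step is sometimes unacceptable.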

Test Your Knowledge - Lesson 4

Answer the following questions to proceed. You need at least 2 of 3 correct to pass.

Question 1: According to CAP theorem, which two properties can be guaranteed together in the presence of network partitions?

Question 2: What does eventual consistency guarantee?

Question 3: For strong consistency in a quorum system with N=3 replicas, which setting works?

Lesson 5 of 6

Database Failover Mechanisms

What is Database Failover?

Database failover is the automatic or manual process of switching to a standby database server when the primary server fails. It's a critical component of high-availability database systems.

Goals of Failover

  • Minimize Downtime: Restore service as quickly as possible
  • Prevent Data Loss: Ensure no committed transactions are lost
  • Maintain Consistency: Keep data consistent across nodes
  • Transparent to Users: Users shouldn't notice the switch

Types of Failover

1. Automatic Failover

System automatically detects failure and promotes a standby without human intervention.

# Automatic Failover Flow:
1. Health check detects primary failure
2. Failover coordinator verifies failure
3. Standby is automatically promoted to primary
4. DNS/load balancer updated to point to new primary
5. Clients reconnect to new primary
6. Old primary joins as standby when recovered

Time to Failover: Seconds to minutes
Risk: Potential split-brain scenarios
Tools: Patroni, MySQL Group Replication, MongoDB Replica Sets

2. Manual Failover

Database administrator manually triggers failover after assessing the situation.

# Manual Failover Flow:
1. DBA is alerted to primary failure
2. DBA investigates the cause
3. DBA manually promotes standby
4. DBA updates application configuration
5. DBA monitors the new primary

Time to Failover: Minutes to hours
Risk: Longer downtime
Benefit: More control, reduced risk of split-brain

3. Planned Failover (Switchover)

Controlled failover during maintenance windows with zero data loss.

# Planned Failover Flow:
1. Ensure standby is fully synchronized
2. Stop writes to primary
3. Wait for replication to complete
4. Verify standby is caught up
5. Promote standby to primary
6. Redirect traffic to new primary
7. Demote old primary to standby

Downtime: Minimal (seconds)
Data Loss: Zero
Use Case: Upgrades, maintenance, migrations

Failure Detection

Health Check Methods:

  • Heartbeat: Regular "alive" signals between nodes
  • TCP Connection: Monitor database port connectivity
  • Query-based: Execute test queries periodically
  • Replication lag: Monitor replication delay
#!/bin/bash
# PostgreSQL health check script

pg_isready -h primary-db.example.com -p 5432 -U postgres
# Exit codes:
# - 0: accepting connections
# - 1: rejecting connections
# - 2: no response
# - 3: no attempt (bad parameters)

# Advanced check with a real query
if psql -h primary-db.example.com -U postgres -c "SELECT 1" > /dev/null 2>&1; then
  echo "Primary is healthy"
else
  echo "Primary is down - trigger failover"
fi

Split-Brain Problem

A dangerous scenario where two nodes both think they're the primary, potentially causing data divergence and corruption.

Split-Brain Scenario

# What Happens:
1. Network partition separates primary from standby
2. Standby can't reach primary, thinks it's down
3. Standby promotes itself to primary
4. Now TWO primaries exist!
5. Both accept writes → data divergence
6. When the network heals, conflict resolution is needed

Prevention:
- Use quorum-based systems
- Implement fencing/STONITH
- Use distributed consensus (Raft, Paxos)
- Witness servers for tie-breaking
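One of the prevention measures above, witness-based tie-breaking, can be sketched as follows. The witness probes are stubbed as booleans here; a real system would check reachability over the network and also fence the old primary before promoting:

```python
def should_promote(witness_votes):
    """witness_votes: list of booleans, True = 'primary looks down'.
    Promote only if a strict majority of witnesses agree, so a standby
    that is merely partitioned off cannot promote itself."""
    majority = len(witness_votes) // 2 + 1
    return sum(witness_votes) >= majority

# Two of three witnesses see the primary as down -> safe to promote.
print(should_promote([True, True, False]))   # True
# Only one witness agrees (likely our own network issue) -> do not promote.
print(should_promote([True, False, False]))  # False
```

The majority requirement is the same idea that makes quorum-based consensus (next section) immune to split-brain.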

Fencing and STONITH

Techniques to prevent split-brain by forcibly isolating or shutting down the old primary.

# Fencing Methods:
1. Network Fencing: Block old primary's network access
2. Power Fencing: Power off old primary
3. STONITH: "Shoot The Other Node In The Head"
4. Disk Fencing: Revoke storage access

Example with STONITH:
- Standby detects primary failure
- Before promoting, standby sends power-off to old primary
- Ensures old primary cannot accept writes
- Safe to promote standby

Consensus Algorithms

Raft Consensus

A consensus algorithm used in many distributed databases for leader election and maintaining consistency.

# Raft Leader Election:
1. Nodes start as followers
2. On leader heartbeat timeout, a follower becomes a candidate
3. Candidate requests votes from other nodes
4. Node with majority votes becomes leader
5. Leader handles all writes
6. If the leader fails, the process repeats

Used by: etcd, Consul, CockroachDB

Benefits:
- Guaranteed single leader per term
- Handles network partitions correctly
- No split-brain possible
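The election steps above can be sketched in Python. The key invariant is that each node votes at most once per term, so two candidates can never both reach a majority in the same term. This is a simplified sketch with names of our own choosing; real Raft also compares log freshness before granting a vote:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.voted_in_term = {}  # term -> candidate this node voted for

    def request_vote(self, term, candidate):
        # Grant the vote only if we have not voted in this term yet.
        if term not in self.voted_in_term:
            self.voted_in_term[term] = candidate
            return True
        return self.voted_in_term[term] == candidate

def run_election(candidate, term, cluster):
    votes = sum(node.request_vote(term, candidate) for node in cluster)
    return votes >= len(cluster) // 2 + 1  # strict majority wins

cluster = [Node(n) for n in ("n1", "n2", "n3")]
print(run_election("n1", term=1, cluster=cluster))  # True: n1 wins term 1
print(run_election("n2", term=1, cluster=cluster))  # False: votes already cast
```

Because the votes for term 1 are spent, a competing candidate must start a new term, which is how Raft avoids two simultaneous leaders.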

Paxos Consensus

An older, more complex consensus algorithm that Raft was designed to improve upon.

Used by: Google Spanner; Apache ZooKeeper uses ZAB, a closely related protocol. More complex than Raft, but proven correct and battle-tested. Standard Paxos assumes crash failures; tolerating Byzantine (malicious) nodes requires extended variants.

Failover Patterns by Database

PostgreSQL with Patroni

# Patroni Configuration (patroni.yml)
scope: postgres-cluster
namespace: /db/
name: postgres-node-1

etcd:
  host: etcd-cluster:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  data_dir: /var/lib/postgresql/data
  parameters:
    max_connections: 100
    shared_buffers: 256MB

# Patroni handles:
- Automatic failover
- Leader election via etcd
- Health checks
- Replication configuration

MySQL Group Replication

# MySQL Group Replication
- Multi-master replication
- Automatic failover
- Conflict detection and resolution
- Majority-based decision making

# Configuration
[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON
plugin_load_add = 'group_replication.so'
group_replication_group_name = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
group_replication_start_on_boot = OFF
group_replication_local_address = "192.168.1.10:33061"
group_replication_group_seeds = "192.168.1.10:33061,192.168.1.11:33061"

MongoDB Replica Sets

# MongoDB Automatic Failover
rs.initiate({
  _id: "myReplicaSet",
  members: [
    { _id: 0, host: "mongodb0.example.net:27017" },
    { _id: 1, host: "mongodb1.example.net:27017" },
    { _id: 2, host: "mongodb2.example.net:27017" }
  ]
})

# Automatic behaviors:
- Elections use majority vote
- Failover typically completes in 10-30 seconds
- Primary election requires a majority of voting members
- Arbiter nodes can vote without holding data

Monitoring Failover Systems

Key Metrics to Monitor:

Metric              | Description                            | Alert Threshold
--------------------|----------------------------------------|----------------
Replication Lag     | Time delay between primary and standby | > 10 seconds
Health Check Status | Is the primary responding?             | Any failure
Connection Count    | Active database connections            | > 80% of max
Failover Count      | Number of failovers                    | > 1 per week
Recovery Time       | Time to complete failover              | > 5 minutes
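A simple threshold check against numeric metrics like those in the table might look like this (metric names, limits, and sample values are hypothetical; real setups would express these as Prometheus alert rules or similar):

```python
# Hypothetical numeric thresholds, loosely mirroring the table above.
ALERT_THRESHOLDS = {
    "replication_lag_seconds": 10,
    "connection_pct_of_max": 80,
    "recovery_time_seconds": 300,
}

def alerts_for(metrics):
    """Return the names of all metrics exceeding their threshold."""
    return [name for name, limit in ALERT_THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

sample = {"replication_lag_seconds": 14, "connection_pct_of_max": 55}
print(alerts_for(sample))  # only replication lag exceeds its threshold
```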

Real-World Example: GitHub

In 2018, GitHub experienced a 24-hour outage due to a split-brain scenario in their MySQL cluster. Network partition caused both data centers to think they were primary. They had to carefully reconcile divergent data. Since then, they've improved their consensus mechanisms and monitoring.

Testing Failover

Regular failover testing is crucial to ensure your system works when needed.

# Failover Testing Checklist:
1. Chaos Engineering: Randomly kill primary in a test environment
2. Measure failover time
3. Verify zero data loss
4. Check application reconnection
5. Monitor for split-brain
6. Test failback to original primary
7. Document any issues

# Tools for Testing:
- Chaos Monkey (Netflix)
- Gremlin
- Litmus (Kubernetes)
- Custom scripts to kill processes/VMs

Test Your Knowledge - Lesson 5

Answer the following questions to proceed. You need at least 2 of 3 correct to pass.

Question 1: What is the split-brain problem?

Question 2: What is STONITH in database failover?

Question 3: What is the main advantage of automatic failover over manual failover?

Lesson 6 of 6

Real-World Implementation & Best Practices

Designing a Distributed Database System

Implementing a production-ready distributed database system requires careful planning and consideration of multiple factors.

Key Design Decisions

  • Database Selection: SQL vs NoSQL, specific technology
  • Replication Strategy: Master-slave, multi-master, or peer-to-peer
  • Partitioning Strategy: How to shard data across nodes
  • Consistency Model: Strong vs eventual consistency
  • Failover Approach: Automatic, manual, or hybrid
  • Geographic Distribution: Single region vs multi-region

Multi-Region Architecture

Distributing databases across geographic regions for global availability and low latency.

# Multi-Region Setup Example

Region: US-EAST (Primary)
- Primary database cluster
- 3 nodes with replication
- Serves North American traffic

Region: EU-WEST (Secondary)
- Read replica cluster
- 3 nodes synced from US-EAST
- Serves European traffic

Region: ASIA-PACIFIC (Secondary)
- Read replica cluster
- 3 nodes synced from US-EAST
- Serves Asian traffic

Replication: Asynchronous between regions
Failover: Automatic within region, manual between regions
Latency: <50ms local, 100-200ms cross-region

Multi-Region Challenges:

  • Cross-region latency: Speed of light limits
  • Data sovereignty: Legal requirements for data location
  • Consistency: CAP theorem trade-offs amplified
  • Cost: Data transfer between regions is expensive

Best Practices for High Availability

1. Redundancy at Every Layer

# High Availability Stack
Load Balancer (HA pair)
        ↓
Application Servers (3+ instances)
        ↓
Connection Pooler (PgBouncer/ProxySQL)
        ↓
Database Cluster (3+ nodes)
        ↓
Storage (Replicated/RAID)

No single point of failure!

2. Automate Everything

  • Automated failover
  • Automated backups
  • Automated monitoring and alerting
  • Automated scaling
  • Infrastructure as Code (Terraform, CloudFormation)

3. Monitor Proactively

# Essential Metrics
- Query performance (slow query log)
- Connection pool utilization
- Replication lag
- Disk I/O and space
- CPU and memory usage
- Network throughput
- Error rates
- Backup success/failure

Tools: Prometheus, Grafana, Datadog, New Relic

4. Test Failure Scenarios

# Regular Disaster Recovery Drills
1. Simulate primary database failure
2. Test automatic failover
3. Measure recovery time objective (RTO)
4. Verify recovery point objective (RPO)
5. Test restore from backups
6. Simulate network partitions
7. Test cross-region failover
8. Document lessons learned

Frequency: Monthly for critical systems

Backup Strategies

3-2-1 Backup Rule

  • 3 copies of data (original + 2 backups)
  • 2 different storage types (disk + cloud)
  • 1 offsite backup (different geographic location)
# PostgreSQL Backup Strategy

# Full Backup (Daily)
pg_basebackup -h primary -D /backup/full/$(date +%Y%m%d) -Ft -z -P

# WAL Archiving (Continuous)
archive_command = 'rsync -a %p backup-server:/wal_archive/%f'

# Point-in-Time Recovery (PITR)
# Can restore to any point in time between backups

# Retention Policy
- Daily backups: Keep 7 days
- Weekly backups: Keep 4 weeks
- Monthly backups: Keep 12 months
- Yearly backups: Keep 7 years

Security Best Practices

Network Security

  • Firewall rules: Only allow necessary ports
  • VPC/Private networks: Isolate database traffic
  • SSL/TLS encryption: Encrypt data in transit
  • VPN/Bastion hosts: Secure administrative access

Access Control

-- Principle of Least Privilege

-- Read-only user for reporting
CREATE USER reporting_user WITH PASSWORD 'strong_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting_user;

-- Application user with limited permissions
CREATE USER app_user WITH PASSWORD 'strong_password';
GRANT SELECT, INSERT, UPDATE ON specific_tables TO app_user;

-- Admin user (use sparingly)
CREATE USER admin_user WITH PASSWORD 'very_strong_password' SUPERUSER;

-- Audit logging
ALTER DATABASE mydb SET log_statement = 'all';
-- log_connections = on   (set in postgresql.conf; it cannot be applied per-database)

Encryption

  • At rest: Encrypt database files (LUKS, dm-crypt, AWS EBS encryption)
  • In transit: SSL/TLS for all connections
  • Backups: Encrypt backup files before storing
  • Sensitive data: Application-level encryption for PII

Performance Optimization

Connection Pooling

# PgBouncer Configuration (pgbouncer.ini)
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25

# Benefits:
- Reduces connection overhead
- Prevents connection exhaustion
- Better resource utilization
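The mechanism behind a pooler like PgBouncer is simple: a fixed number of server connections are opened once and handed out to many clients in turn. A toy sketch with fake connections (the class and names are illustrative, not PgBouncer's internals):

```python
import queue

class ConnectionPool:
    """Toy connection pool: a fixed number of (here, fake) connections
    are created up front and reused, instead of opening a new server
    connection for every client request."""
    def __init__(self, size, connect):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self, timeout=None):
        return self._pool.get(timeout=timeout)  # blocks if the pool is exhausted

    def release(self, conn):
        self._pool.put(conn)

opened = []
pool = ConnectionPool(size=2, connect=lambda: opened.append("conn") or len(opened))
c = pool.acquire()
pool.release(c)
c = pool.acquire()   # reuses an existing connection, no new one is opened
print("connections opened:", len(opened))  # connections opened: 2
```

This is why `default_pool_size = 25` can serve `max_client_conn = 1000`: clients queue for a pooled connection instead of each holding a PostgreSQL backend process.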

Query Optimization

-- Indexing Strategy

-- Good indexes
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);

-- Analyze query plans
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';

-- Monitor slow queries
-- PostgreSQL
ALTER DATABASE mydb SET log_min_duration_statement = 1000; -- log queries > 1s

-- MySQL
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

Caching

# Multi-layer caching
Application Cache (Redis/Memcached)
        ↓ (cache miss)
Database Query Cache
        ↓ (cache miss)
Actual Database Query

# Cache common data
- User sessions
- Product catalogs
- Configuration data
- Frequently accessed data

# Cache invalidation strategies
- TTL (Time To Live)
- Write-through cache
- Cache-aside pattern
- Event-driven invalidation

Real-World Case Studies

Case Study 1: Uber

Challenge: Global ride-hailing service needs low latency worldwide.

Solution:

  • Schemaless (MySQL-based) for flexible data model
  • Geographic sharding by city
  • Multi-region deployment
  • Eventually consistent for non-critical data
  • Strong consistency for billing/payments

Result: Handles millions of trips per day with <50ms latency for critical operations.

Case Study 2: Discord

Challenge: Real-time chat with billions of messages, rapid growth.

Solution:

  • Started with MongoDB, migrated to Cassandra
  • Partitioned by channel_id
  • Eventually consistent (acceptable for chat)
  • Custom message ID scheme for ordering
  • Aggressive caching with Redis
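The "custom message ID scheme for ordering" works because the timestamp occupies the ID's high bits, so sorting IDs numerically sorts messages chronologically, even when they were written by different nodes. A simplified Snowflake-style sketch (the exact bit widths here are the classic Twitter layout, not necessarily Discord's):

```python
def make_message_id(timestamp_ms, worker_id, sequence):
    """Simplified Snowflake-style ID: 41 bits of timestamp, 10 bits of
    worker, 12 bits of per-millisecond sequence. The timestamp in the
    high bits makes IDs sort chronologically."""
    assert worker_id < (1 << 10) and sequence < (1 << 12)
    return (timestamp_ms << 22) | (worker_id << 12) | sequence

earlier = make_message_id(1_700_000_000_000, worker_id=3, sequence=0)
later   = make_message_id(1_700_000_000_001, worker_id=1, sequence=0)
print(earlier < later)  # True
```

This lets a partitioned store like Cassandra return messages in order with a simple clustering key on the ID, with no cross-node coordination at write time.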

Result: Scales to 140 million monthly active users, storing trillions of messages.

Case Study 3: Airbnb

Challenge: Complex booking system requiring strong consistency.

Solution:

  • MySQL with master-slave replication
  • Sharded by listing_id
  • Synchronous replication for bookings
  • Asynchronous for search/browse
  • Read replicas for analytics

Result: Prevents double-bookings while maintaining high availability.

Choosing the Right Database

Use Case | Recommended Database | Why
Financial transactions | PostgreSQL, CockroachDB | ACID compliance, strong consistency
Social media feed | Cassandra, DynamoDB | High write throughput, eventual consistency OK
E-commerce catalog | MongoDB, Elasticsearch | Flexible schema, full-text search
Real-time analytics | ClickHouse, TimescaleDB | Columnar storage, time-series optimization
Session storage | Redis, Memcached | In-memory, fast reads/writes
Graph relationships | Neo4j, Amazon Neptune | Optimized for graph queries

Migration Strategies

Blue-Green Deployment

# Migrating to a distributed database
1. Set up new distributed database (Green)
2. Keep old database running (Blue)
3. Replicate data from Blue to Green
4. Run both in parallel (dual writes)
5. Verify data consistency
6. Switch reads to Green
7. Switch writes to Green
8. Monitor for issues
9. Decommission Blue after stabilization

Allows instant rollback if issues arise!
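The dual-write step (step 4) is usually a small wrapper: write to the old (Blue) database first, then best-effort mirror to the new (Green) one. Blue remains the source of truth, so a Green failure must not break the request. A minimal sketch, with lists standing in for the two databases:

```python
import logging

def dual_write(blue_write, green_write, record):
    """Dual-write during migration: Blue is authoritative; Green is a
    shadow write whose failure is logged for reconciliation, never
    surfaced to the caller."""
    blue_write(record)          # authoritative write to the old system
    try:
        green_write(record)     # shadow write to the new system
    except Exception:
        logging.exception("green write failed; will reconcile later")

blue, green = [], []
dual_write(blue.append, green.append, {"id": 1})
print(blue == green == [{"id": 1}])  # True
```

The consistency-verification step (step 5) then compares Blue and Green row counts and checksums to catch any shadow writes that were dropped and logged.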

Common Pitfalls to Avoid

  • Over-sharding: Too many small shards increases complexity
  • Under-monitoring: You can't fix what you can't see
  • Ignoring backups: Test restores regularly!
  • No capacity planning: Know your growth trajectory
  • Premature optimization: Start simple, scale when needed
  • Ignoring security: Security should be built-in from day one

Summary: Key Takeaways

  1. Distributed databases trade simplicity for scalability and availability
  2. Choose consistency model based on business requirements (CAP theorem)
  3. Implement proper replication for fault tolerance
  4. Use sharding to scale beyond single-server limits
  5. Automated failover minimizes downtime but requires careful implementation
  6. Monitor everything and test failure scenarios regularly
  7. There's no one-size-fits-all solution - understand your requirements

Final Test - Lesson 6

Answer the following questions to complete the course. You need to answer at least 2 of the 3 questions correctly to pass.

Question 1: What is the 3-2-1 backup rule?

Question 2: Which database would you choose for financial transactions?

Question 3: What is the main benefit of connection pooling?