🧬 Guide: Replication — A Systems Design Deep Dive
Replication is the process of copying and maintaining data across multiple machines to improve availability, fault tolerance, and read scalability.
It’s a foundational building block in distributed systems, used by databases, file systems, caches, and messaging systems.
📦 1. What is Replication?
Replication means maintaining redundant copies of data on different nodes (replicas) in a system.
This helps ensure:
- High availability (data still accessible if a node fails)
- Improved read performance (load balancing reads)
- Disaster recovery and backups
🧭 2. Types of Replication
🟢 1. Master–Slave (Primary–Replica)
One node (master) handles writes, and one or more slaves replicate the data asynchronously.
Client → Write → Master
Client → Read → Replica
✅ Simple and scalable reads
❌ Risk of stale reads due to replication lag
🟡 2. Multi-Master Replication
Multiple nodes can accept writes and sync with each other.
✅ High availability, write flexibility
❌ Requires conflict resolution (e.g., last-write-wins, CRDTs)
🔁 3. Peer-to-Peer / Gossip
Every node shares state changes with peers (e.g. Cassandra, Dynamo).
✅ No single point of failure
❌ Eventually consistent, needs careful design
⏱️ 3. Sync vs Async Replication
Type | Description | Trade-offs |
---|---|---|
Synchronous | Waits for replica to acknowledge | Strong consistency, higher write latency |
Asynchronous | Returns immediately after master write | Low latency, risk of data loss on failure |
Semi-Sync | Master waits for one replica | Balance between safety and speed |
🧪 4. Consistency Trade-offs
Write and Read Patterns:
- Write to master, read from replica → eventual consistency
- Read-your-writes → read from same node (or quorum)
CAP Theorem:
You can’t have all 3:
- Consistency
- Availability
- Partition Tolerance
Replication helps with availability, but often sacrifices consistency.
🛠️ 5. Failover and Recovery
🔧 Detecting Failures
- Heartbeats (ping health)
- Timeouts and retries
- Leader election (e.g., via Raft or ZooKeeper)
🔄 Automatic Failover
- Promote replica to master
- Update routing/config
- Restore original replica when available
// Example: Redis Sentinel failover
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel failover mymaster
📊 6. Use Cases and Real-World Systems
Use Case | Replication Strategy | Example Systems |
---|---|---|
Read-heavy workloads | Master–replica, async | MySQL, PostgreSQL, Redis |
Global low latency | Geo-replication, eventual | DynamoDB, Cassandra |
High consistency needed | Quorum or synchronous writes | CockroachDB, Spanner, etcd |
Streaming updates | Change Data Capture (CDC) | Debezium, Kafka Connect |
⚠️ 7. Common Challenges
Problem | Notes |
---|---|
Replication lag | Async replicas behind master writes |
Split-brain | Multiple primaries writing simultaneously |
Conflict resolution | Needed for multi-master scenarios |
Data loss on crash | Async replicas may miss recent writes |
Write amplification | More writes = more replication overhead |
📚 8. Replication in Databases and Systems
System | Strategy | Notes |
---|---|---|
PostgreSQL | Streaming replication | Logical & physical, WAL-based |
MySQL | Binlog-based async | Semi-sync available |
MongoDB | Replica sets | Built-in failover |
Cassandra | Quorum + gossip | Tunable consistency |
Kafka | Log replication | ISR (in-sync replicas) model |
Redis | Master-replica, Sentinel | Redis Cluster for sharded replication |
🧠 9. Designing for Replication
- Use idempotent operations to handle retries
- Ensure logical clocks or vector clocks to detect conflicts
- Design around eventual consistency when needed
- Combine replication with partitioning for scalability
✅ Summary
Aspect | Key Idea |
---|---|
Purpose | Redundancy for availability + scalability |
Modes | Master-replica, multi-master, P2P |
Trade-offs | Consistency vs latency vs availability |
Real-world use | Used in databases, queues, file systems |
📚 Further Reading
- Designing Data-Intensive Applications – Martin Kleppmann
- Redis Replication Docs
- CAP Theorem Explained
- Raft Consensus Algorithm
- MySQL Replication Explained
<< back to Guides