🧬 Guide: Replication — A Systems Design Deep Dive

Replication is the process of copying and maintaining data across multiple machines to improve availability, fault tolerance, and read scalability.

It’s a foundational building block in distributed systems, used by databases, file systems, caches, and messaging systems.


📦 1. What is Replication?

Replication means maintaining redundant copies of data on different nodes (replicas) in a system.

This helps ensure availability, fault tolerance, and read scalability.


🧭 2. Types of Replication

🟢 1. Master–Slave (Primary–Replica)

One node (the master) handles all writes, and one or more slaves (replicas) copy its data, usually asynchronously.

Client → Write → Master
Client → Read  → Replica

✅ Simple and scalable reads
❌ Risk of stale reads due to replication lag
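
The stale-read risk can be illustrated with a minimal in-memory sketch. The `Primary` and `Replica` classes and the `catch_up` method are hypothetical names chosen for illustration, not any real database API, and a queue stands in for the replication stream.

```python
from collections import deque


class Primary:
    """Accepts writes and queues them for asynchronous shipping."""

    def __init__(self):
        self.data = {}
        self.log = deque()  # changes the replica has not applied yet

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))  # returns before the replica sees it


class Replica:
    """Serves reads from its own, possibly stale, copy."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def catch_up(self):
        # Apply every pending change in the order the primary logged it.
        while self.primary.log:
            key, value = self.primary.log.popleft()
            self.data[key] = value


primary = Primary()
replica = Replica(primary)

primary.write("user:1", "alice")
stale = replica.read("user:1")  # replica has not caught up yet
replica.catch_up()
fresh = replica.read("user:1")
print(stale, fresh)
```

The window between `write` and `catch_up` is the replication lag: any read routed to the replica during that window misses the latest write.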


🟡 2. Multi-Master Replication

Multiple nodes can accept writes and sync with each other.

✅ High availability, write flexibility
❌ Requires conflict resolution (e.g., last-write-wins, CRDTs)
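
As a rough sketch of the two conflict-resolution approaches named above: last-write-wins keeps the copy with the newest timestamp, while a grow-only counter (one of the simplest CRDTs) merges by taking the per-node maximum. The `Versioned` type, `lww_merge`, and `gcounter_merge` are illustrative names, and the timestamps are assumed to be comparable across nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # assumed comparable across nodes (logical or wall clock)
    node_id: str    # tie-breaker so every node picks the same winner


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the copy with the larger (timestamp, node_id)."""
    return a if (a.timestamp, a.node_id) >= (b.timestamp, b.node_id) else b


def gcounter_merge(a: dict, b: dict) -> dict:
    """Grow-only counter CRDT: take the per-node maximum of both copies."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


on_node_a = Versioned("shipped", timestamp=105, node_id="a")
on_node_b = Versioned("cancelled", timestamp=107, node_id="b")

# Merging in either order gives the same winner, so both nodes converge.
winner_ab = lww_merge(on_node_a, on_node_b)
winner_ba = lww_merge(on_node_b, on_node_a)
print(winner_ab.value, winner_ba.value)

# Each node only increments its own slot, so no increment is lost on merge.
merged = gcounter_merge({"a": 3, "b": 1}, {"a": 2, "b": 4})
print(sum(merged.values()))
```

Note the difference in what survives: last-write-wins silently discards the losing write ("shipped" here), whereas the counter merge preserves every node's contribution.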


🔁 3. Peer-to-Peer / Gossip

Every node shares state changes with peers (e.g. Cassandra, Dynamo).

✅ No single point of failure
❌ Eventually consistent, needs careful design
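
To make the peer-to-peer idea concrete, here is a toy push-gossip simulation. It is a sketch under simplifying assumptions (one random peer per node per round, integer version numbers, no failures) and does not reflect how Cassandra or Dynamo implement gossip; `gossip_round` and the node names are hypothetical.

```python
import random


def gossip_round(states, rng):
    """Each node pushes its known versions to one randomly chosen peer."""
    nodes = list(states)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        for key, (version, value) in states[node].items():
            # The peer keeps whichever copy has the higher version number.
            if key not in states[peer] or states[peer][key][0] < version:
                states[peer][key] = (version, value)


rng = random.Random(42)  # fixed seed so runs are repeatable
states = {f"node{i}": {} for i in range(8)}
states["node0"]["config"] = (1, "v1")  # only one node knows the update

rounds = 0
# The cap of 100 rounds is a safety net for the simulation only.
while any("config" not in s for s in states.values()) and rounds < 100:
    gossip_round(states, rng)
    rounds += 1

print(f"update reached all nodes after {rounds} round(s)")
```

Between the first write and the last round, different nodes return different answers for the same key, which is what "eventually consistent" means in practice.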


⏱️ 3. Sync vs Async Replication

| Type | Description | Trade-offs |
| --- | --- | --- |
| Synchronous | Waits for replica to acknowledge | Strong consistency, higher write latency |
| Asynchronous | Returns immediately after master write | Low latency, risk of data loss on failure |
| Semi-Sync | Master waits for one replica | Balance between safety and speed |
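
The latency side of this trade-off can be sketched with a small model. It assumes replicas are contacted in parallel, that synchronous mode waits for every replica, and that semi-sync waits for the fastest one; `commit_latency_ms` and the round-trip numbers are hypothetical.

```python
def commit_latency_ms(mode, local_ms, replica_ms):
    """Latency the client sees before the write is acknowledged.

    Replicas are contacted in parallel, so waiting for all of them costs
    the slowest round trip and waiting for one costs the fastest.
    """
    if mode == "async":
        return local_ms                     # no replica is waited for
    if mode == "semi-sync":
        return local_ms + min(replica_ms)   # first acknowledgement only
    if mode == "sync":
        return local_ms + max(replica_ms)   # every replica must acknowledge
    raise ValueError(f"unknown mode: {mode}")


replica_ms = [4, 9, 120]  # hypothetical round-trip times to three replicas
for mode in ("async", "semi-sync", "sync"):
    print(mode, commit_latency_ms(mode, local_ms=1, replica_ms=replica_ms))
```

In this model a single slow replica (120 ms) dominates synchronous commits, while semi-sync pays only for the fastest replica yet still leaves a second copy of the write if the master fails.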

🧪 4. Consistency Trade-offs

Write and Read Patterns:

CAP Theorem:

You can’t have all 3: consistency, availability, and partition tolerance.

Replication helps with availability, but asynchronous replicas often sacrifice strong consistency because they can serve stale data.
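
One common way to tune this trade-off is with quorums, which this guide lists later for Cassandra and for high-consistency systems. The sketch below assumes N replicas, a write acknowledged by W of them, and a read that contacts R of them. It brute-forces every combination to check whether a read is guaranteed to touch at least one replica holding the latest write; `quorums_overlap` is a hypothetical helper.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """True if every write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
    print(f"N={n} W={w} R={r} overlap={quorums_overlap(n, w, r)} R+W>N={r + w > n}")
```

The brute-force result agrees with the usual rule of thumb R + W > N: larger quorums buy fresher reads at the cost of latency and of availability when replicas are unreachable.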


🛠️ 5. Failover and Recovery

🔧 Detecting Failures

🔄 Automatic Failover

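# Example: Redis Sentinel failover
# In sentinel.conf: watch the master named "mymaster"; 2 Sentinels must agree it is down (quorum)
sentinel monitor mymaster 127.0.0.1 6379 2
# From a shell: ask a Sentinel (default port 26379) to force a failover
redis-cli -p 26379 SENTINEL FAILOVER mymaster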
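
The two steps in this section, detecting a failure and then failing over, can be sketched as follows. This is a simplified illustration and not Sentinel's actual algorithm: it assumes each monitor records when it last heard from the primary, that a quorum of monitors must agree before anything happens, and that the replica with the highest applied log offset is promoted. All names and numbers are hypothetical.

```python
def is_down(last_heartbeat, now, timeout):
    """A monitor suspects the primary if it has been silent for too long."""
    return now - last_heartbeat > timeout


def choose_new_primary(monitor_views, replica_offsets, now, timeout, quorum):
    """Return the replica to promote, or None if too few monitors agree.

    monitor_views: monitor name -> time it last heard from the primary
    replica_offsets: replica name -> how much of the log it has applied
    """
    votes = sum(is_down(seen, now, timeout) for seen in monitor_views.values())
    if votes < quorum:
        return None  # not enough agreement; avoid a spurious failover
    # Promote the most up-to-date replica to minimise lost writes.
    return max(replica_offsets, key=replica_offsets.get)


offsets = {"replica-a": 4810, "replica-b": 4925}

# Two of three monitors have not heard from the primary for over 5 seconds.
silent_views = {"m1": 100.0, "m2": 101.0, "m3": 109.5}
promoted = choose_new_primary(silent_views, offsets, now=110.0, timeout=5.0, quorum=2)

# Only one monitor has lost contact, so no failover is triggered.
healthy_views = {"m1": 108.0, "m2": 101.0, "m3": 109.5}
not_promoted = choose_new_primary(healthy_views, offsets, now=110.0, timeout=5.0, quorum=2)

print(promoted, not_promoted)
```

Requiring a quorum (2 in the example, matching the last argument of the `sentinel monitor` line) keeps a single monitor with a bad network link from triggering a failover on its own.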

📊 6. Use Cases and Real-World Systems

| Use Case | Replication Strategy | Example Systems |
| --- | --- | --- |
| Read-heavy workloads | Master–replica, async | MySQL, PostgreSQL, Redis |
| Global low latency | Geo-replication, eventual | DynamoDB, Cassandra |
| High consistency needed | Quorum or synchronous writes | CockroachDB, Spanner, etcd |
| Streaming updates | Change Data Capture (CDC) | Debezium, Kafka Connect |

⚠️ 7. Common Challenges

| Problem | Notes |
| --- | --- |
| Replication lag | Async replicas fall behind master writes |
| Split-brain | Multiple primaries accept writes simultaneously |
| Conflict resolution | Needed for multi-master scenarios |
| Data loss on crash | Async replicas may miss recent writes |
| Write amplification | Every write is repeated on each replica, so more writes mean more replication overhead |

📚 8. Replication in Databases and Systems

| System | Strategy | Notes |
| --- | --- | --- |
| PostgreSQL | Streaming replication | Logical & physical, WAL-based |
| MySQL | Binlog-based async | Semi-sync available |
| MongoDB | Replica sets | Built-in failover |
| Cassandra | Quorum + gossip | Tunable consistency |
| Kafka | Log replication | ISR (in-sync replicas) model |
| Redis | Master-replica, Sentinel | Redis Cluster for sharded replication |

🧠 9. Designing for Replication


✅ Summary

| Aspect | Key Idea |
| --- | --- |
| Purpose | Redundancy for availability + scalability |
| Modes | Master-replica, multi-master, P2P |
| Trade-offs | Consistency vs latency vs availability |
| Real-world use | Used in databases, queues, file systems |

