<< back to Guides

🧬 Guide: Replication — A Systems Design Deep Dive

Replication is the process of copying and maintaining data across multiple machines to improve availability, fault tolerance, and read scalability.

It’s a foundational building block in distributed systems, used by databases, file systems, caches, and messaging systems.

📦 1. What is Replication?

Replication means maintaining redundant copies of data on different nodes (replicas) in a system.

This helps ensure:

High availability (data still accessible if a node fails)
Improved read performance (load balancing reads)
Disaster recovery and backups

🧭 2. Types of Replication

🟢 1. Master–Slave (Primary–Replica)

One node (master) handles writes, and one or more slaves replicate the data asynchronously.

Client → Write → Master
Client → Read  → Replica

✅ Simple and scalable reads
❌ Risk of stale reads due to replication lag

🟡 2. Multi-Master Replication

Multiple nodes can accept writes and sync with each other.

✅ High availability, write flexibility
❌ Requires conflict resolution (e.g., last-write-wins, CRDTs)

🔁 3. Peer-to-Peer / Gossip

Every node shares state changes with peers (e.g. Cassandra, Dynamo).

✅ No single point of failure
❌ Eventually consistent, needs careful design

⏱️ 3. Sync vs Async Replication

Type	Description	Trade-offs
Synchronous	Waits for replica to acknowledge	Strong consistency, higher write latency
Asynchronous	Returns immediately after master write	Low latency, risk of data loss on failure
Semi-Sync	Master waits for one replica	Balance between safety and speed

🧪 4. Consistency Trade-offs

Write and Read Patterns:

Write to master, read from replica → eventual consistency
Read-your-writes → read from same node (or quorum)

CAP Theorem:

You can’t have all 3:

Consistency
Availability
Partition Tolerance

Replication helps with availability, but often sacrifices consistency.

🛠️ 5. Failover and Recovery

🔧 Detecting Failures

Heartbeats (ping health)
Timeouts and retries
Leader election (e.g., via Raft or ZooKeeper)

🔄 Automatic Failover

Promote replica to master
Update routing/config
Restore original replica when available

// Example: Redis Sentinel failover
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel failover mymaster

📊 6. Use Cases and Real-World Systems

Use Case	Replication Strategy	Example Systems
Read-heavy workloads	Master–replica, async	MySQL, PostgreSQL, Redis
Global low latency	Geo-replication, eventual	DynamoDB, Cassandra
High consistency needed	Quorum or synchronous writes	CockroachDB, Spanner, etcd
Streaming updates	Change Data Capture (CDC)	Debezium, Kafka Connect

⚠️ 7. Common Challenges

Problem	Notes
Replication lag	Async replicas behind master writes
Split-brain	Multiple primaries writing simultaneously
Conflict resolution	Needed for multi-master scenarios
Data loss on crash	Async replicas may miss recent writes
Write amplification	More writes = more replication overhead

📚 8. Replication in Databases and Systems

System	Strategy	Notes
PostgreSQL	Streaming replication	Logical & physical, WAL-based
MySQL	Binlog-based async	Semi-sync available
MongoDB	Replica sets	Built-in failover
Cassandra	Quorum + gossip	Tunable consistency
Kafka	Log replication	ISR (in-sync replicas) model
Redis	Master-replica, Sentinel	Redis Cluster for sharded replication

🧠 9. Designing for Replication

Use idempotent operations to handle retries
Ensure logical clocks or vector clocks to detect conflicts
Design around eventual consistency when needed
Combine replication with partitioning for scalability

✅ Summary

Aspect	Key Idea
Purpose	Redundancy for availability + scalability
Modes	Master-replica, multi-master, P2P
Trade-offs	Consistency vs latency vs availability
Real-world use	Used in databases, queues, file systems

📚 Further Reading

<< back to Guides