🧠 Deep Dive: Apache Kafka
Apache Kafka is a distributed event streaming platform designed for high-throughput, low-latency, and durable real-time data pipelines.
📌 What is Kafka?
Apache Kafka was originally developed at LinkedIn and is now an Apache Software Foundation project. It's widely used in large-scale systems at companies such as Netflix, Uber, LinkedIn, and Airbnb.
Kafka is fundamentally used for:
- Event streaming
- Log aggregation
- Change Data Capture (CDC)
- Metrics pipelines
- Real-time analytics
📦 Kafka Messages
A message is the smallest unit of data in Kafka. Think of it like a row in a database.
Each message contains:
- Key: Optional; used to determine the partition. Messages with the same key land in the same partition.
- Value: The actual payload.
- Headers: Optional metadata key-value pairs.
- Offset: The message's position within its partition, assigned by the broker on write.
Example message (conceptual JSON representation):

```json
{
  "key": "user-42",
  "value": {
    "event": "login",
    "timestamp": "2024-06-01T12:00:00Z"
  },
  "headers": {
    "source": "auth-service"
  }
}
```
📁 Topics and 🧩 Partitions
- A Topic is a logical channel to which messages are written.
- A Partition is a subdivision of a topic that allows Kafka to scale horizontally.
Each partition is an append-only, ordered log: messages are written to the end in sequence and assigned increasing offsets.
```text
Topic: user-events
Partitions: [0, 1, 2]
```
This allows parallelism — multiple consumers can read from different partitions at the same time.
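To make the key-to-partition mapping concrete, here is a simplified sketch. Real Kafka clients hash the serialized key with murmur2; the MD5-based function below is an illustration of the idea, not Kafka's actual algorithm.

```python
import hashlib

def pick_partition(key: str, num_partitions: int) -> int:
    # Illustration only: real clients use murmur2, but the principle is
    # the same deterministic key-to-partition mapping.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every message keyed "user-42" lands in the same partition,
# so events for one user stay in order.
print(pick_partition("user-42", 3))
```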
⚙️ Kafka Producer
Kafka producers are responsible for:
- Publishing data to a Kafka topic.
- Choosing the right partition (via hash of key or round-robin).
- Batching and compressing messages.
- Optionally ensuring delivery acknowledgment from the broker.
```python
producer.send("user-events", key="user-42", value="login-event")
```
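Fleshing that one-liner out, here is a minimal runnable sketch using the third-party kafka-python client; the broker address `localhost:9092` is an assumption for illustration.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,  # keys as UTF-8 bytes
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The key ("user-42") determines the partition: records with the same
# key always land in the same partition, preserving per-key ordering.
future = producer.send(
    "user-events",
    key="user-42",
    value={"event": "login", "timestamp": "2024-06-01T12:00:00Z"},
)
metadata = future.get(timeout=10)  # block until the broker acknowledges
print(metadata.topic, metadata.partition, metadata.offset)
producer.flush()
```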
👥 Kafka Consumer
Kafka consumers read messages from topics and process them.
- Consumers belong to Consumer Groups.
- Each partition is read by one consumer per group, enabling parallel processing.
- Offsets are tracked per group.
```text
Consumer Group: auth-processors
Members: [consumer-1, consumer-2]
```
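A matching consumer sketch with kafka-python; the group id mirrors the example above, and the broker address is again an assumption.

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="auth-processors",    # consumers sharing this id split the partitions
    auto_offset_reset="earliest",  # start from the oldest retained message
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    # Each record carries the fields described earlier:
    # key, value, headers, and its offset within the partition.
    print(record.partition, record.offset, record.key, record.value)
```

Starting a second process with the same `group_id` triggers a rebalance, splitting the topic's partitions between the two members.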
🖥️ Kafka Cluster
A Kafka cluster consists of multiple brokers (nodes). Key components:
- Brokers: Store and serve partitions.
- ZooKeeper: Stores cluster metadata and coordinates brokers in older deployments (see the KRaft note below).
- Controller: Coordinates partition leadership and failover.
Redundancy and replication across brokers ensure high availability.
```text
Partition-0: Leader on Broker-1, Replicas on Broker-2 and Broker-3
```
Kafka 3.x supports KRaft (Kafka Raft) as a replacement for ZooKeeper; KRaft became production-ready in Kafka 3.3, and Kafka 4.0 removed ZooKeeper entirely.
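As a sketch of how partitioning and replication are declared, kafka-python's admin client can create a topic; the numbers below assume a cluster with at least three brokers.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions for parallelism; each partition is copied to three
# brokers for redundancy, so the cluster needs at least three brokers.
admin.create_topics([
    NewTopic(name="user-events", num_partitions=3, replication_factor=3)
])
```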
✅ Advantages of Kafka
- High throughput for real-time pipelines.
- Horizontal scalability (via partitions).
- Durability (disk-based retention).
- Decoupling between producers and consumers.
- Replayability: Consumers can re-read old messages.
📦 Kafka Use Cases
- Real-time analytics (e.g. fraud detection, user behavior tracking)
- Log aggregation (centralized logging)
- Data lake ingestion (streaming into S3, BigQuery, etc.)
- Microservices communication (decoupling services)
- Change Data Capture (via Debezium)
📚 Summary Table
| Component | Role |
| --- | --- |
| Producer | Writes messages to Kafka topics |
| Consumer | Reads messages from topics |
| Topic | Logical grouping of messages |
| Partition | Physical segment of a topic, enabling parallelism |
| Broker | Node that stores and serves messages |
| Consumer Group | Group of consumers that share the workload |
| Offset | Position of a message within a partition |
🛠 Common Kafka Tools
- Kafka Connect – Integrates with external systems (DBs, cloud, etc.)
- Kafka Streams – Lightweight library for stream processing
- ksqlDB – SQL interface for Kafka data
- Schema Registry – Enforces schema consistency for message data (Avro, JSON Schema, Protobuf)
🔐 Kafka Considerations
- Security: Use SASL + TLS + ACLs.
- Monitoring: Prometheus + Grafana + Kafka Manager.
- Retention: Configure by size (`retention.bytes`) or time (`retention.ms`); a configuration sketch follows this list.
- Ordering: Maintained within a partition, not across partitions.
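A sketch of setting retention on an existing topic with kafka-python's admin client; the topic name and limits are illustrative.

```python
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Keep messages for 7 days, or until a partition reaches ~1 GiB,
# whichever limit is exceeded first.
admin.alter_configs([
    ConfigResource(
        ConfigResourceType.TOPIC,
        "user-events",
        configs={
            "retention.ms": str(7 * 24 * 60 * 60 * 1000),
            "retention.bytes": str(1024 ** 3),
        },
    )
])
```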
Kafka is at the heart of modern event-driven architecture. It excels in decoupling, streaming, and scaling, making it ideal for building robust data pipelines and distributed systems.