🧠 Deep Dive: Apache Kafka
Apache Kafka is a distributed event streaming platform designed for high-throughput, low-latency, and durable real-time data pipelines.
📌 What is Kafka?
Apache Kafka was originally developed at LinkedIn and is now an Apache Software Foundation project. It's widely used in large-scale systems at companies such as Netflix, Uber, LinkedIn, and Airbnb.
Kafka is fundamentally used for:
- Event streaming
- Log aggregation
- Change Data Capture (CDC)
- Metrics pipelines
- Real-time analytics
📦 Kafka Messages
A message is the smallest unit of data in Kafka. Think of it like a row in a database.
Each message contains:
- Key: Optional; used to determine the partition. Messages with the same key land in the same partition.
- Value: The actual payload.
- Headers: Optional metadata key-value pairs.
- Offset: The message's position within its partition, assigned by the broker on write.
Example message (conceptual JSON representation):

```json
{
  "key": "user-42",
  "value": {
    "event": "login",
    "timestamp": "2024-06-01T12:00:00Z"
  },
  "headers": {
    "source": "auth-service"
  }
}
```
📁 Topics and 🧩 Partitions
- A Topic is a logical channel to which messages are written.
- A Partition is a subdivision of a topic that allows Kafka to scale horizontally.
Each partition is an append-only, ordered log: messages are written to the end in sequence and assigned increasing offsets.
```text
Topic: user-events
Partitions: [0, 1, 2]
```
This allows parallelism — multiple consumers can read from different partitions at the same time.
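To make the key-to-partition mapping concrete, here is a simplified sketch. Real Kafka clients hash the serialized key with murmur2; the MD5-based function below is an illustration of the idea, not Kafka's actual algorithm.

```python
import hashlib

def pick_partition(key: str, num_partitions: int) -> int:
    # Illustration only: real clients use murmur2, but the principle is
    # the same deterministic key-to-partition mapping.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every message keyed "user-42" lands in the same partition,
# so events for one user stay in order.
print(pick_partition("user-42", 3))
```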
⚙️ Kafka Producer
Kafka producers are responsible for:
- Publishing data to a Kafka topic.
- Choosing the right partition (via hash of key or round-robin).
- Batching and compressing messages.
- Optionally ensuring delivery acknowledgment from the broker.
```python
producer.send("user-events", key="user-42", value="login-event")
```
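Fleshing that one-liner out, here is a minimal runnable sketch using the third-party kafka-python client; the broker address `localhost:9092` is an assumption for illustration.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,  # keys as UTF-8 bytes
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The key ("user-42") determines the partition: records with the same
# key always land in the same partition, preserving per-key ordering.
future = producer.send(
    "user-events",
    key="user-42",
    value={"event": "login", "timestamp": "2024-06-01T12:00:00Z"},
)
metadata = future.get(timeout=10)  # block until the broker acknowledges
print(metadata.topic, metadata.partition, metadata.offset)
producer.flush()
```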
👥 Kafka Consumer
Kafka consumers read messages from topics and process them.
- Consumers belong to Consumer Groups.
- Each partition is read by one consumer per group, enabling parallel processing.
- Offsets are tracked per group.
```text
Consumer Group: auth-processors
Members: [consumer-1, consumer-2]
```
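A matching consumer sketch with kafka-python; the group id mirrors the example above, and the broker address is again an assumption.

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="auth-processors",    # consumers sharing this id split the partitions
    auto_offset_reset="earliest",  # start from the oldest retained message
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    # Each record carries the fields described earlier:
    # key, value, headers, and its offset within the partition.
    print(record.partition, record.offset, record.key, record.value)
```

Starting a second process with the same `group_id` triggers a rebalance, splitting the topic's partitions between the two members.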
🖥️ Kafka Cluster
A Kafka cluster consists of multiple brokers (nodes). Key components:
- Brokers: Store and serve partitions.
- ZooKeeper: Stores cluster metadata and coordinates brokers in older deployments (see the KRaft note below).
- Controller: Coordinates partition leadership and failover.
Redundancy and replication across brokers ensure high availability.
```text
Partition-0: Leader on Broker-1, Replicas on Broker-2 and Broker-3
```
Kafka 3.x supports KRaft (Kafka Raft) as a replacement for ZooKeeper; KRaft became production-ready in Kafka 3.3, and Kafka 4.0 removed ZooKeeper entirely.
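As a sketch of how partitioning and replication are declared, kafka-python's admin client can create a topic; the numbers below assume a cluster with at least three brokers.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions for parallelism; each partition is copied to three
# brokers for redundancy, so the cluster needs at least three brokers.
admin.create_topics([
    NewTopic(name="user-events", num_partitions=3, replication_factor=3)
])
```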
✅ Advantages of Kafka
- High throughput for real-time pipelines.
- Horizontal scalability (via partitions).
- Durability (disk-based retention).
- Decoupling between producers and consumers.
- Replayability: Consumers can re-read old messages.
📦 Kafka Use Cases
- Real-time analytics (e.g. fraud detection, user behavior tracking)
- Log aggregation (centralized logging)
- Data lake ingestion (streaming into S3, BigQuery, etc.)
- Microservices communication (decoupling services)
- Change Data Capture (via Debezium)
📚 Summary Table
| Component | Role |
| --- | --- |
| Producer | Writes messages to Kafka topics |
| Consumer | Reads messages from topics |
| Topic | Logical grouping of messages |
| Partition | Physical segment of a topic, enabling parallelism |
| Broker | Node that stores and serves messages |
| Consumer Group | Group of consumers that share the workload |
| Offset | Position of a message within a partition |
🛠 Common Kafka Tools
- Kafka Connect – Integrates with external systems (DBs, cloud, etc.)
- Kafka Streams – Lightweight library for stream processing
- ksqlDB – SQL interface for Kafka data
- Schema Registry – Enforces schema consistency for message data (Avro, JSON Schema, Protobuf)
🔐 Kafka Considerations
- Security: Use SASL + TLS + ACLs.
- Monitoring: Prometheus + Grafana + Kafka Manager.
- Retention: Configure by size (`retention.bytes`) or time (`retention.ms`); a configuration sketch follows this list.
- Ordering: Maintained within a partition, not across partitions.
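A sketch of setting retention on an existing topic with kafka-python's admin client; the topic name and limits are illustrative.

```python
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Keep messages for 7 days, or until a partition reaches ~1 GiB,
# whichever limit is exceeded first.
admin.alter_configs([
    ConfigResource(
        ConfigResourceType.TOPIC,
        "user-events",
        configs={
            "retention.ms": str(7 * 24 * 60 * 60 * 1000),
            "retention.bytes": str(1024 ** 3),
        },
    )
])
```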
Kafka is at the heart of modern event-driven architecture. It excels in decoupling, streaming, and scaling, making it ideal for building robust data pipelines and distributed systems.