🏞️ Guide: How Data Lake Architecture Works

A Data Lake is a centralized repository that stores raw data in its native format (structured, semi-structured, unstructured) for future analysis, ML, or operational processing.

This guide covers:

1. What Is a Data Lake?
2. Data Lake Architecture: Key Layers
3. Storage Formats & Metadata
4. ETL vs ELT in Data Lakes
5. Querying Data Lakes
6. Benefits of Data Lakes
7. Challenges & Pitfalls
8. Lakehouse: The Evolution

1. 📚 What Is a Data Lake?

A Data Lake stores data as-is: no schema is imposed at write time, and structure is applied only when the data is read (schema-on-read). This makes it a natural landing zone for raw logs, events, and files of any shape.

Compare to a Data Warehouse:

| Feature | Data Lake | Data Warehouse |
|---|---|---|
| Data Types | Structured + semi/unstructured | Mostly structured |
| Schema | Schema-on-read | Schema-on-write |
| Storage | Cheap object storage | Columnar formats (Parquet, etc.) |
| Query Performance | Slower | Fast OLAP |
| Use Cases | ML, raw logs, staging | BI, dashboards, reporting |
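The schema-on-read row is the key distinction: the lake accepts records exactly as produced, and a schema (field selection, type casts, defaults) is applied only when the data is queried. A minimal Python sketch of the idea (field names here are illustrative, not from any real dataset):

```python
import json

# Raw records land in the lake exactly as produced -- nothing is enforced at write time.
raw_lines = [
    '{"id": 1, "amount": "42.5", "region": "EU"}',
    '{"id": 2, "amount": "17.0"}',  # missing "region" is fine at ingest time
]

def read_with_schema(lines):
    """Apply a schema only at read time: pick fields, cast types, fill defaults."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "id": int(rec["id"]),
            "amount": float(rec["amount"]),
            "region": rec.get("region", "UNKNOWN"),
        }

rows = list(read_with_schema(raw_lines))
```

A warehouse would instead reject the second record at load time (schema-on-write); the lake defers that decision to each reader.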

2. 🧱 Data Lake Architecture: Key Layers

Data lakes are typically organized into four logical layers:

1️⃣ Ingestion Layer

Responsible for collecting and importing raw data from multiple sources:

# Example: ingesting a Kafka stream to S3 (consume a bounded batch, then upload)
kafka-console-consumer --topic logs --bootstrap-server localhost:9092 \
  --from-beginning --max-messages 1000 > logs.json
aws s3 cp logs.json s3://data-lake/raw/logs/

2️⃣ Storage Layer

Stores all raw and processed data in cheap, scalable storage (e.g. object store).

| Zone | Description |
|---|---|
| Raw | Original, untransformed data |
| Staging | Cleaned but not validated |
| Curated | Validated, enriched, trusted |
| Sandbox | Ad-hoc, temporary explorations |
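In practice the zones above are just prefixes in the object store, and a small helper keeps writers from inventing their own layout. A hypothetical sketch (the bucket name and path convention are illustrative):

```python
# Hypothetical zone-path helper: one bucket, one prefix per zone.
ZONES = {"raw", "staging", "curated", "sandbox"}

def zone_path(zone: str, dataset: str, bucket: str = "data-lake") -> str:
    """Build an object-store prefix like s3://data-lake/curated/sales/."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://{bucket}/{zone}/{dataset}/"
```

Centralizing path construction like this is one cheap defense against the "data swamp" problem discussed later.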

3️⃣ Processing Layer

Transforms data for downstream use (ETL/ELT):

# Example: Spark job to clean CSV data (inferSchema so "amount" is numeric)
df = spark.read.csv("s3://data-lake/raw/sales.csv", header=True, inferSchema=True)
df_clean = df.dropna().filter("amount > 0")
df_clean.write.parquet("s3://data-lake/curated/sales/")

4️⃣ Consumption Layer

Exposes data to users/tools for querying, ML, analytics:

-- Query S3 using Athena
SELECT COUNT(*) FROM curated.sales WHERE region = 'EU';

3. 🗃️ Storage Formats & Metadata

Data lakes use open formats:

| Format | Type | Notes |
|---|---|---|
| CSV/JSON | Text | Simple, not efficient |
| Parquet | Columnar | Compressed, splittable |
| Avro | Row-based | Good for streaming |
| Delta Lake | Transactional | Adds ACID to data lakes |
| Iceberg | Transactional | Schema evolution + versioning |

Also: catalog systems such as the Apache Hive Metastore or the AWS Glue Data Catalog manage table schemas, storage locations, and partitions.
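Conceptually, a catalog is a mapping from table names to their location, schema, and partition metadata. A toy in-memory version (this is not the Hive or Glue API, just an illustration of what those systems track):

```python
# Toy metadata catalog -- illustrates what Hive Metastore / Glue track per table.
catalog = {}

def register_table(name, location, schema, partition_keys):
    catalog[name] = {
        "location": location,          # where the files live
        "schema": schema,              # column -> type
        "partition_keys": partition_keys,
        "partitions": [],              # discovered partition values
    }

def add_partition(name, values):
    catalog[name]["partitions"].append(values)

register_table(
    "curated.sales",
    "s3://data-lake/curated/sales/",
    {"region": "string", "amount": "double"},
    ["year", "month"],
)
add_partition("curated.sales", {"year": "2024", "month": "06"})
```

Query engines like Athena consult exactly this kind of metadata to resolve a table name to files and to prune partitions.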


4. 🚢 ETL vs ELT in Data Lakes

| ETL | ELT |
|---|---|
| Transform data before loading | Load raw data, transform later |
| Used in warehouses | Preferred in data lakes |
| Slower ingestion | Faster ingestion, lazy transforms |

Data Lakes favor ELT to keep ingest simple and defer transformation to query time.
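The difference can be shown in a few lines of Python (the record shape and `clean` function are illustrative): ETL filters and transforms on the ingest path, while ELT stores everything and applies the same transform at read time.

```python
def clean(r):
    """Illustrative transform: normalize the amount to a float."""
    return {**r, "amount": float(r["amount"])}

# ETL: transform first, then load -- ingestion blocks on (and is coupled to) the transform.
def etl_ingest(records, store):
    store.extend(clean(r) for r in records if r.get("amount", 0) > 0)

# ELT: load raw as-is -- ingestion stays cheap and lossless.
def elt_ingest(records, store):
    store.extend(records)

# ...and transform lazily when the data is actually read.
def elt_read(store):
    return [clean(r) for r in store if r.get("amount", 0) > 0]
```

Note that ELT also preserves the records ETL would have dropped, so a later change to the cleaning rules can be replayed over the full raw history.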


5. 🔍 Querying Data Lakes

-- Query a partitioned dataset in Athena
SELECT region, SUM(sales)
FROM "sales_data"
WHERE year = 2024 AND month = '06'
GROUP BY region;
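The `WHERE year = 2024 AND month = '06'` clause is cheap because Hive-style layouts encode partition values in the path (`.../year=2024/month=06/...`), so the engine can skip whole files without opening them. A sketch of that pruning step (paths are illustrative):

```python
# Hive-style partitioned files as a query engine would list them.
files = [
    "s3://data-lake/curated/sales/year=2024/month=05/part-0.parquet",
    "s3://data-lake/curated/sales/year=2024/month=06/part-0.parquet",
    "s3://data-lake/curated/sales/year=2023/month=06/part-0.parquet",
]

def partition_values(path):
    """Parse key=value segments out of an object path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def prune(paths, **filters):
    """Keep only files whose partition values match every filter."""
    return [p for p in paths
            if all(partition_values(p).get(k) == v for k, v in filters.items())]

hits = prune(files, year="2024", month="06")
```

Only one of the three files survives the filter, which is why partitioning on common filter columns is the single biggest read-performance lever in a lake.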

6. ✅ Benefits of Data Lakes

- Cheap, scalable object storage for data of any shape
- No upfront schema: ingest first, apply structure at read time (schema-on-read)
- One repository serving ML, analytics, and BI workloads
- Fast, simple ingestion via the ELT pattern
- Open formats (Parquet, Avro, Delta Lake, Iceberg) avoid vendor lock-in

7. ⚠️ Challenges & Pitfalls

| Challenge | Mitigation Strategy |
|---|---|
| Data swamp (unorganized) | Enforce naming, folder structure, catalog |
| Performance on large reads | Use columnar formats, partitioning |
| Lack of ACID | Use Delta Lake / Apache Iceberg |
| Access control complexity | Centralize IAM policies, use lakehouse |
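The "data swamp" mitigation is easy to automate: validate every write path against the agreed layout before data lands. A hypothetical check (the convention and regex are illustrative, not a standard):

```python
import re

# Hypothetical convention: s3://<bucket>/<zone>/<dataset>/year=YYYY/month=MM/
PATH_RULE = re.compile(
    r"^s3://[\w-]+/(raw|staging|curated|sandbox)/[\w-]+/year=\d{4}/month=\d{2}/$"
)

def validate_path(path: str) -> bool:
    """Reject writes that would land outside the agreed folder structure."""
    return PATH_RULE.match(path) is not None
```

Running a check like this in the ingestion layer (or as a CI gate on pipeline code) is far cheaper than untangling an unstructured bucket later.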

8. 🔄 Lakehouse: The Evolution

Lakehouse = Data Lake + Data Warehouse

A lakehouse keeps the cheap object storage of a data lake but layers warehouse features on top: ACID transactions, schema enforcement, and fast SQL, typically via transactional table formats such as Delta Lake or Apache Iceberg.

