🧩 Data Lake / Lakehouse
✅ Overview
An approach that combines the low-cost accumulation of raw data in a "Data Lake" with the manageability of a DWH, realized as a "Lakehouse".
✅ Problems Addressed
- Semi-structured / unstructured data that a DWH cannot handle flexibly.
- Raw data that would otherwise be discarded before ETL.
- The need to keep up with growing analytics / ML / exploratory use cases.
✅ Basic Philosophy & Rules
Data Lake
- Accumulate raw data in object storage such as S3 / GCS.
- Apply the schema at read time (Schema on Read); see the sketch after this list.
Lakehouse
- Add a table-management layer with Delta Lake / Iceberg / Hudi, etc.
- Provide DWH-like features such as ACID transactions, versioning, and indexing.
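As a minimal illustration of Schema on Read, the PySpark sketch below reads raw JSON straight from object storage and only infers a schema at read time. The bucket, path, and column names are hypothetical placeholders, not part of the original text.

```python
# Minimal schema-on-read sketch with PySpark.
# Bucket, path, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw JSON events land in the lake as-is; no schema was enforced at write time.
raw_events = spark.read.json("s3a://example-data-lake/raw/events/2024/")

# The schema is inferred only now, when the data is read (Schema on Read).
raw_events.printSchema()

# Apply structure on demand for this particular analysis.
clicks = (
    raw_events
    .filter(raw_events.event_type == "click")
    .select("user_id", "page", "timestamp")
)
clicks.show(5)
```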
✅ Suitable Applications
- Data infrastructure for ML model training.
- Large-scale processing of data such as logs and clickstreams.
- Working with semi-structured data (JSON) and columnar formats (Parquet).
❌ Unsuitable Cases
- Operations centered on fixed-format standard reporting.
- Domains with strict data-governance requirements (a Lakehouse layer becomes necessary).
✅ History (Genealogy / Parent Styles)
- Data Lake → evolved into the Lakehouse in response to challenges in real-world operation.
- Hadoop → cloud object storage → ACID table formats.
✅ Related Styles
- DWH: complements the Lake with strong schemas and integrated analysis.
- Lambda / Kappa: integrate the Lake with a streaming layer.
✅ Representative Frameworks
Data Lake
- AWS S3 / Google Cloud Storage / Azure Data Lake Storage: central storage for the Data Lake, accumulating raw data at scale.
- Apache Spark: the core engine for data processing on the Lake.
- Trino / Presto: SQL query engines for the Data Lake.
Lakehouse
- Delta Lake (Databricks): provides ACID tables, time travel, and versioning (see the sketch after this list).
- Apache Iceberg / Apache Hudi: OSS table formats that realize the Lakehouse.
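As a rough illustration of the ACID and time-travel features these table formats add, the sketch below writes a Delta table twice and then reads an earlier version back. It assumes the delta-spark package is installed and configured; the table path and column names are hypothetical.

```python
# Minimal Delta Lake sketch: ACID writes plus time travel.
# Assumes delta-spark is installed; the table path is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "s3a://example-lakehouse/tables/orders"

# Each write is an ACID transaction committed to the Delta transaction log.
orders_v0 = spark.createDataFrame([(1, "pending"), (2, "shipped")], ["id", "status"])
orders_v0.write.format("delta").mode("overwrite").save(table_path)

orders_v1 = spark.createDataFrame([(3, "pending")], ["id", "status"])
orders_v1.write.format("delta").mode("append").save(table_path)

# Time travel: read the table as it was at an earlier version.
first_version = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
first_version.show()
```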
✅ Design Patterns Supporting This Style
- Iterator: used to process large-scale data in streams / chunks.
- Strategy: switches the execution method per data-processing engine (Spark / Presto); see the sketch after this list.
- Template Method: unifies the common steps of ETL / ELT pipelines.
- Facade: a metadata layer that hides the complex APIs of the Data Lake / Lakehouse behind a uniform interface.
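A minimal sketch of how Strategy and Facade might combine on a Lake: the engine classes below are hypothetical stand-ins for real Spark / Trino clients, and the facade exposes a single query entry point regardless of which engine executes it.

```python
# Illustrative Strategy + Facade sketch for switching query engines.
# The engine classes are hypothetical; real code would call Spark or a Trino client.
from abc import ABC, abstractmethod


class QueryEngine(ABC):
    """Strategy interface: one implementation per execution engine."""

    @abstractmethod
    def run(self, sql: str) -> list[dict]:
        ...


class SparkEngine(QueryEngine):
    def run(self, sql: str) -> list[dict]:
        # In real code: spark.sql(sql).collect()
        return [{"engine": "spark", "sql": sql}]


class TrinoEngine(QueryEngine):
    def run(self, sql: str) -> list[dict]:
        # In real code: issue the query through a Trino client connection.
        return [{"engine": "trino", "sql": sql}]


class LakeQueryFacade:
    """Facade: a single entry point that hides which engine runs the query."""

    def __init__(self, engine: QueryEngine):
        self._engine = engine

    def top_pages(self, day: str) -> list[dict]:
        sql = f"SELECT page, count(*) FROM clicks WHERE dt = '{day}' GROUP BY page"
        return self._engine.run(sql)


# Switch strategies without changing the calling code.
print(LakeQueryFacade(SparkEngine()).top_pages("2024-01-01"))
print(LakeQueryFacade(TrinoEngine()).top_pages("2024-01-01"))
```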
✅ Summary
The Data Lake / Lakehouse is the central approach of modern data infrastructure for handling massive, diverse, and fast-moving data.