Lakehouse Table Formats: Delta Lake, Apache Iceberg, Apache Hudi
At the core of the data lakehouse is the combination of the flexibility of existing data lakes with data warehouse features such as ACID transactions, schema evolution, and data quality guarantees. What makes this possible is the table format.
Currently, the three most widely used table formats are:
- Delta Lake - An open source project developed by Databricks
- Apache Iceberg - Created at Netflix and donated to the Apache Software Foundation
- Apache Hudi - Developed at Uber and donated to the Apache Software Foundation
📋 Table of Contents
- Introduction
- Delta Lake
- Apache Iceberg
- Apache Hudi
- Comparison of Three Platforms
- Use Cases and Selection Guide
- Conclusion
1. Introduction
💡 What is a Table Format?
A table format is a standardized way to manage the metadata of data stored in a data lake, providing features such as ACID transactions, schema evolution, partitioning, and indexing. This allows a data lake to be used like a data warehouse.
2. Delta Lake
2.1 Overview
Delta Lake is a table format open-sourced by Databricks in 2019, providing ACID transactions to data lakes through tight integration with Apache Spark.
2.2 Key Features
- ACID Transactions: Ensures atomicity, consistency, isolation, and durability
- Schema Enforcement: Ensures data quality and integrity
- Schema Evolution: Supports safe schema changes
- Time Travel: Access to past data versions
- Upsert/Merge: Efficient data updates and merges (see the sketch after this list)
- Open Format: Stores data as standard Parquet files, readable by any Parquet-compatible tool
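These features map to a small API surface. Below is a minimal PySpark sketch, assuming the delta-spark package is installed and on the classpath; the table path (/tmp/delta/events) and column names are illustrative, not from the original article.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Spark session with the Delta extensions enabled (requires delta-spark).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Atomic write: the commit either fully succeeds or is not visible at all.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Upsert/Merge: update matching rows by key, insert the rest.
updates = spark.createDataFrame([(2, "purchase"), (3, "click")], ["id", "action"])
table = DeltaTable.forPath(spark, "/tmp/delta/events")
(table.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time Travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()
```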
2.3 Architecture
Delta Lake has the following layered structure:
Application Layer
- Spark SQL
- Spark Streaming
- BI Tools
Delta Lake Layer
- ACID transactions
- Schema management
- Metadata processing
Storage Layer
- Parquet files
- Transaction logs
- Checkpoints
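A quick way to see this layering in practice: under the table path, Delta keeps a _delta_log directory with one JSON file per commit plus periodic Parquet checkpoints. A minimal sketch, reusing the Spark session and the illustrative /tmp/delta/events table from the previous example:

```python
import os

# Each commit is a numbered JSON file; checkpoints are Parquet summaries
# that let readers avoid replaying the full log.
for name in sorted(os.listdir("/tmp/delta/events/_delta_log")):
    print(name)  # e.g. 00000000000000000000.json, 00000000000000000001.json

# The same commit history is queryable through Spark SQL.
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`").show(truncate=False)
```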
3. Apache Iceberg
3.1 Overview
Apache Iceberg is a table format created at Netflix and later donated to the Apache Software Foundation. It supports efficient schema evolution and partitioning for large-scale datasets.
3.2 Key Features
- Schema Evolution: Safe and efficient schema changes
- Partition Evolution: Supports changing the partition layout without rewriting existing data (see the sketch after this list)
- Time Travel: Access to past snapshots
- ACID Transactions: Atomic write guarantees
- Metadata Layers: Efficient metadata management
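Schema and partition evolution are exposed through SQL DDL. A hedged sketch, assuming a Spark session configured with the Iceberg runtime, its SQL extensions, and a catalog named demo; the table and column names are illustrative.

```python
# Create an Iceberg table partitioned by day.
spark.sql("""
    CREATE TABLE demo.db.logs (ts TIMESTAMP, level STRING, msg STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution: adding a column is a metadata-only change;
# no existing data files are rewritten.
spark.sql("ALTER TABLE demo.db.logs ADD COLUMN host STRING")

# Partition evolution: switch new writes to hourly partitioning.
# Existing files keep the old daily layout.
spark.sql("ALTER TABLE demo.db.logs DROP PARTITION FIELD days(ts)")
spark.sql("ALTER TABLE demo.db.logs ADD PARTITION FIELD hours(ts)")
```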
3.3 Architecture
Iceberg has the following metadata layers:
Catalog Layer: Table metadata management
Metadata Layer: Snapshots, manifests, schema information
Data Layer: Actual data files (Parquet, ORC, Avro)
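These layers are directly queryable. A minimal sketch, reusing the illustrative demo.db.logs table from the previous example; the snapshot id is a placeholder:

```python
# The metadata layer is exposed as system tables alongside the data.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM demo.db.logs.snapshots"
).show(truncate=False)

# Time travel: read the table as of a specific snapshot id.
old = (
    spark.read.format("iceberg")
    .option("snapshot-id", 1234567890)  # placeholder; pick one from .snapshots
    .load("demo.db.logs")
)
```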
4. Apache Hudi
4.1 Overview
Apache Hudi is a table format developed at Uber and later donated to the Apache Software Foundation, specializing in real-time and incremental data processing.
4.2 Key Features
- Real-time Processing: Support for streaming data ingestion and near-real-time queries
- Incremental Processing: Processes only the data that changed since a given commit (see the sketches below)
- ACID Transactions: Atomic write guarantees
- Time Travel: Access to past data versions
- CDC Support: Change Data Capture support
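The upsert path is the center of Hudi's write model. A hedged sketch using the Spark DataFrame writer, assuming the hudi-spark bundle is on the classpath; the table name, record key, and precombine field are illustrative.

```python
# Core write configuration: the record key identifies rows, and the
# precombine field decides which version wins when a key arrives twice.
hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.operation": "upsert",
}

updates = spark.createDataFrame(
    [(1, "2024-01-01 10:00:00", 12.5)], ["ride_id", "event_ts", "fare"]
)

# Rows with an existing ride_id are updated; new ids are inserted.
(updates.write.format("hudi")
 .options(**hudi_options)
 .mode("append")
 .save("/tmp/hudi/rides"))
```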
4.3 Architecture
Hudi supports two table types:
Copy-on-Write (CoW): Rewrites data files at write time, keeping reads fast
Merge-on-Read (MoR): Writes changes to log files and merges them with base files at read time
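The table type is chosen at write time via the hoodie.datasource.write.table.type option (it defaults to Copy-on-Write). On the read side, incremental pulls work the same either way; a minimal sketch, reusing the illustrative /tmp/hudi/rides table from the previous example:

```python
# Incremental query: return only records committed after the given
# instant time (the timestamp below is a placeholder).
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/tmp/hudi/rides")
)
incremental.show()
```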
5. Comparison of Three Platforms
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Developer | Databricks | Netflix/Apache | Uber/Apache |
| Spark Integration | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Real-time Processing | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Schema Evolution | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Partition Evolution | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Community | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
6. Use Cases and Selection Guide
When to Choose Delta Lake
- When working in Spark-based environments
- When ACID transactions are important
- When using Databricks environments
When to Choose Apache Iceberg
- When dealing with large-scale datasets
- When schema evolution occurs frequently
- When partition layout changes are needed
When to Choose Apache Hudi
- When real-time data processing is needed
- When incremental processing is important
- When implementing CDC (Change Data Capture)
7. Conclusion
The three table formats each have their own strengths and specialized areas. Choose the format that best fits your project's requirements and technology stack.
- Delta Lake: Stable ACID transactions in Spark-centered environments
- Apache Iceberg: When large-scale data and complex schema evolution are needed
- Apache Hudi: When real-time processing and incremental processing are important
This article was written for data engineers who want to understand lakehouse table formats. For deeper learning, we recommend working through each format's official documentation and getting hands-on practice.