Lakehouse Table Formats: Delta Lake, Apache Iceberg, Apache Hudi

The core of a data lakehouse is combining the flexibility of a data lake with data warehouse features such as ACID transactions, schema evolution, and data quality guarantees. What makes this possible is the table format.

Currently, the three most widely used table formats are:

  • Delta Lake - an open-source project developed by Databricks
  • Apache Iceberg - started at Netflix and donated to the Apache Software Foundation
  • Apache Hudi - developed at Uber and donated to the Apache Software Foundation

📋 Table of Contents

  1. Introduction
  2. Delta Lake
  3. Apache Iceberg
  4. Apache Hudi
  5. Comparison of the Three Formats
  6. Use Cases and Selection Guide
  7. Conclusion

1. Introduction

💡 What is a Table Format?

A table format is a standardized way to manage the metadata of data stored in a data lake and to provide features such as ACID transactions, schema evolution, partitioning, and indexing. This is what allows a data lake to be used like a data warehouse.
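
To make this concrete, here is a minimal sketch of what you get *without* a table format (PySpark; all paths are illustrative): a plain Parquet "table" is just a directory of files, with no transaction log, no schema history, and no snapshots to travel back to. The formats below add exactly that metadata layer.

```python
from pyspark.sql import SparkSession
import os

spark = SparkSession.builder.appName("plain-parquet").getOrCreate()

# A plain Parquet "table" is just a directory of data files.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.mode("overwrite").parquet("/tmp/plain/users")

# Only part-*.parquet files and a _SUCCESS marker: no log, no versions.
print(sorted(os.listdir("/tmp/plain/users")))
```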

2. Delta Lake

2.1 Overview

Delta Lake is a table format open-sourced by Databricks in 2019, providing ACID transactions to data lakes through tight integration with Apache Spark.

2.2 Key Features

  • ACID Transactions: Ensures atomicity, consistency, isolation, and durability
  • Schema Enforcement: Ensures data quality and integrity
  • Schema Evolution: Supports safe schema changes
  • Time Travel: Access to past data versions
  • Upsert/Merge: Efficient data updates
  • Open Format: Data is stored as standard Parquet files, readable by any Parquet-compatible tool
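
A minimal sketch of these features, assuming PySpark with the delta-spark package installed (paths and data are illustrative): an atomic write, an upsert via MERGE, and a time-travel read.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Atomic (ACID) write of the initial table version.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# Upsert/Merge: update matching ids, insert new ones.
target = DeltaTable.forPath(spark, "/tmp/delta/users")
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at version 0.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/users").show()
```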

2.3 Architecture

Delta Lake has the following layered structure:

Application Layer

  • Spark SQL
  • Spark Streaming
  • BI Tools

Delta Lake Layer

  • ACID transactions
  • Schema management
  • Metadata processing

Storage Layer

  • Parquet files
  • Transaction logs
  • Checkpoints
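
The storage layer is directly inspectable: next to the Parquet data files, Delta keeps a _delta_log/ directory of numbered JSON commit files plus periodic Parquet checkpoints, and the same history is queryable through SQL. A small sketch, continuing from the Delta example above (paths illustrative):

```python
import os

# Each commit is a numbered JSON file; checkpoints summarize the log.
log_dir = "/tmp/delta/users/_delta_log"
print(sorted(os.listdir(log_dir)))  # e.g. 00000000000000000000.json, ...

# DESCRIBE HISTORY exposes the same transaction log through Spark SQL.
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/users`") \
     .select("version", "timestamp", "operation").show(truncate=False)
```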

3. Apache Iceberg

3.1 Overview

Apache Iceberg is a table format started at Netflix and donated to the Apache Software Foundation; it supports efficient schema evolution and partitioning for large-scale datasets.

3.2 Key Features

  • Schema Evolution: Safe and efficient schema changes
  • Partition Evolution: Support for partition layout changes
  • Time Travel: Access to past snapshots
  • ACID Transactions: Atomic write guarantees
  • Metadata Layers: Efficient metadata management
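
A minimal sketch of schema and partition evolution, assuming a Spark session configured with Iceberg's SQL extensions and a local Hadoop catalog named demo (all names and paths are illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg/warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, body STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution is a metadata-only change; no data files are rewritten.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (region STRING)")

# Partition evolution: new writes use the new layout, old files stay valid.
spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD days(ts)")
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD hours(ts)")
```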

3.3 Architecture

Iceberg has the following metadata layers:

Catalog Layer: Table metadata management
Metadata Layer: Snapshots, manifests, schema information
Data Layer: Actual data files (Parquet, ORC, Avro)
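
These layers can be queried directly: Iceberg exposes snapshots, manifests, and data files as metadata tables alongside the table itself. Continuing from the demo.db.events sketch above:

```python
# Metadata layer: every commit produces a snapshot.
spark.sql("SELECT snapshot_id, operation FROM demo.db.events.snapshots").show()

# Manifests track groups of data files per snapshot.
spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show()

# Data layer: the individual data files and their statistics.
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show()
```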

4. Apache Hudi

4.1 Overview

Apache Hudi is a table format developed at Uber and donated to the Apache Software Foundation, specializing in real-time and incremental data processing.

4.2 Key Features

  • Real-time Processing: Support for streaming data processing
  • Incremental Processing: Processing only changed data
  • ACID Transactions: Atomic write guarantees
  • Time Travel: Access to past data versions
  • CDC Support: Change Data Capture support
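
A minimal sketch of an upsert followed by an incremental read, assuming PySpark with the Hudi Spark bundle on the classpath (the table name, key fields, and paths are illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Record key, partition path, and precombine field drive Hudi's upsert logic.
hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

df = spark.createDataFrame([(1, "alice", 100, "us"), (2, "bob", 100, "eu")],
                           ["id", "name", "ts", "region"])
df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/tmp/hudi/users")

# Upsert: matching record keys are updated, new keys are inserted.
updates = spark.createDataFrame([(2, "bobby", 200, "eu"), (3, "carol", 200, "us")],
                                ["id", "name", "ts", "region"])
updates.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/users")

# Incremental processing: read only the commits after a given instant.
incr = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "0")
        .load("/tmp/hudi/users"))
incr.show()
```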

4.3 Architecture

Hudi supports two table types:

Copy-on-Write (CoW): Rewrites the affected data files at write time, keeping reads simple and fast
Merge-on-Read (MoR): Writes changes to row-level log files and merges them at read time, keeping writes fast
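
The table type is chosen when the table is written. A short sketch, reusing the options and data from the previous example (the MoR table name and paths are illustrative):

```python
# Merge-on-Read: updates land in log files and are compacted later.
mor_options = dict(hudi_options)
mor_options["hoodie.table.name"] = "users_mor"
mor_options["hoodie.datasource.write.table.type"] = "MERGE_ON_READ"

df.write.format("hudi").options(**mor_options).mode("overwrite").save("/tmp/hudi/users_mor")

# MoR serves two views: read-optimized (compacted base files only) and
# snapshot (merges log files at read time for the freshest data).
ro = (spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "read_optimized")
      .load("/tmp/hudi/users_mor"))
rt = (spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "snapshot")
      .load("/tmp/hudi/users_mor"))
```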

5. Comparison of the Three Formats

| Feature              | Delta Lake | Apache Iceberg | Apache Hudi |
| -------------------- | ---------- | -------------- | ----------- |
| Developer            | Databricks | Netflix/Apache | Uber/Apache |
| Spark Integration    | ⭐⭐⭐⭐⭐      | ⭐⭐⭐⭐           | ⭐⭐⭐         |
| Real-time Processing | ⭐⭐⭐        | ⭐⭐⭐            | ⭐⭐⭐⭐⭐       |
| Schema Evolution     | ⭐⭐⭐⭐       | ⭐⭐⭐⭐⭐          | ⭐⭐⭐         |
| Partition Evolution  | ⭐⭐⭐        | ⭐⭐⭐⭐⭐          | ⭐⭐⭐         |
| Community            | ⭐⭐⭐⭐       | ⭐⭐⭐⭐⭐          | ⭐⭐⭐⭐        |

6. Use Cases and Selection Guide

When to Choose Delta Lake

  • When working in Spark-based environments
  • When ACID transactions are important
  • When using Databricks environments

When to Choose Apache Iceberg

  • When dealing with large-scale datasets
  • When schema evolution occurs frequently
  • When partition layout changes are needed

When to Choose Apache Hudi

  • When real-time data processing is needed
  • When incremental processing is important
  • When implementing CDC (Change Data Capture)

7. Conclusion

The three table formats each have their own advantages and specialized areas. It’s important to choose the appropriate format considering project requirements and technology stack.

  • Delta Lake: Stable ACID transactions in Spark-centered environments
  • Apache Iceberg: When large-scale data and complex schema evolution are needed
  • Apache Hudi: When real-time processing and incremental processing are important


This article was written for data engineers who want to understand lakehouse table formats. For deeper learning, we recommend each format's official documentation and hands-on practice.