🗄️ Parquet vs ORC vs Avro Real-World Comparison - Complete Guide to Data Lake File Formats

“Choosing the right file format can make a 10x difference in performance and cost” - One of the most important decisions in building a data lake

When building a data lake, one of the first questions you face is “which file format should I use?” Parquet, ORC, and Avro each have distinct characteristics and trade-offs, and the wrong choice can lead to serious performance degradation and cost overruns. This post walks through the internal structure of the three formats, real benchmark results, and a guide to picking the best format for each scenario.



📋 File Format Overview

Major File Format Comparison

| Characteristic | Parquet | ORC | Avro |
|---|---|---|---|
| Storage Method | Columnar | Columnar | Row-based |
| Compression Ratio | High (4-10x) | Very High (5-12x) | Medium (2-4x) |
| Read Performance | Very Fast | Very Fast | Slow |
| Write Performance | Medium | Medium | Fast |
| Schema Evolution | Limited | Limited | Excellent |
| Ecosystem | Spark, Presto, Athena | Hive, Presto | Kafka, Streaming |
| File Size | Small | Smaller | Large |

When to Use Which Format?

| Use Case | Recommended Format | Reason |
|---|---|---|
| Analytics Data Lake | Parquet | Versatility, Spark/Athena optimization |
| Hive-centric Environment | ORC | Deep integration with Hive |
| Real-time Streaming | Avro | Fast writes, schema evolution |
| Log Collection | Parquet | Compression ratio, analytics performance |
| CDC Pipeline | Avro → Parquet | Streaming ingestion + batch conversion |

🔷 Parquet Internal Structure

Design Philosophy

Parquet was designed based on Google’s Dremel paper to efficiently store nested data structures.

Core Features

  • Columnar Storage: Data stored by column
  • Nested Data Support: Supports complex nested structures
  • Efficient Compression: Optimal compression per column type
  • Predicate Pushdown: Skip unnecessary reads using row group and page statistics
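
To make column pruning and predicate pushdown concrete, here is a minimal PyArrow read sketch; the path, column names, and filter are placeholders rather than part of any benchmark setup:

import pyarrow.parquet as pq

# Read only two columns; the other column chunks are never fetched
table = pq.read_table(
    "events.parquet",                 # placeholder path
    columns=["id", "amount"],         # column pruning
    filters=[("amount", ">", 100)]    # row groups whose statistics exclude the filter are skipped
)
print(table.num_rows)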

File Structure

Parquet File Structure:
┌─────────────────────────────────┐
│ Header (Magic: PAR1)            │
├─────────────────────────────────┤
│ Row Group 1                     │
│  ├── Column Chunk A             │
│  │   ├── Page 1 (compressed)    │
│  │   ├── Page 2 (compressed)    │
│  │   └── Page 3 (compressed)    │
│  ├── Column Chunk B             │
│  └── Column Chunk C             │
├─────────────────────────────────┤
│ Row Group 2                     │
│  ├── Column Chunk A             │
│  ├── Column Chunk B             │
│  └── Column Chunk C             │
├─────────────────────────────────┤
│ Footer Metadata                 │
│  ├── Schema                     │
│  ├── Row Group Metadata         │
│  ├── Column Statistics          │
│  └── Compression Codec          │
├─────────────────────────────────┤
│ Footer Size (4 bytes)           │
│ Magic: PAR1 (4 bytes)           │
└─────────────────────────────────┘

Row Group and Page

Row Group

  • Definition: Logical group of rows (default 128MB)
  • Purpose: Unit of parallel processing
  • Statistics: Min/Max/Null count per column

Page

  • Definition: Unit of compression and encoding (default 1MB)
  • Encoding: Dictionary, RLE, Delta encoding
  • Compression: Snappy, GZIP, LZO, ZSTD
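
To see how these knobs surface in an API, below is a small PyArrow write sketch; the table contents, path, and sizes are illustrative assumptions, not benchmark settings:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

pq.write_table(
    table,
    "example.parquet",            # placeholder path
    compression="zstd",           # file-level codec (a per-column dict also works)
    use_dictionary=["name"],      # dictionary-encode low-cardinality columns
    row_group_size=1_000_000,     # rows per row group
    data_page_size=1024 * 1024    # ~1MB pages
)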

Parquet Creation Example

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("Parquet Example") \
    .getOrCreate()

# Generate data
data = [
    (1, "Alice", 100.5, "2024-01-15"),
    (2, "Bob", 200.3, "2024-01-15"),
    (3, "Charlie", 150.7, "2024-01-15")
]

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("amount", DoubleType(), False),
    StructField("date", StringType(), False)
])

df = spark.createDataFrame(data, schema)

# Optimize Parquet settings
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.conf.set("spark.sql.parquet.block.size", 134217728)  # 128MB
spark.conf.set("spark.sql.parquet.page.size", 1048576)     # 1MB

# Save
df.write \
    .mode("overwrite") \
    .parquet("s3://bucket/data/events.parquet")

Parquet Metadata Analysis

# Read Parquet file metadata
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('events.parquet')  # point at a single part file, not the output directory

# Check schema
print("Schema:")
print(parquet_file.schema)

# Row Group information
print(f"\nRow Groups: {parquet_file.num_row_groups}")

# Statistics per Row Group
for i in range(parquet_file.num_row_groups):
    rg = parquet_file.metadata.row_group(i)
    print(f"\nRow Group {i}:")
    print(f"  Rows: {rg.num_rows}")
    print(f"  Total Size: {rg.total_byte_size / 1024 / 1024:.2f} MB")
    
    # Statistics per column
    for j in range(rg.num_columns):
        col = rg.column(j)
        print(f"  Column {col.path_in_schema}:")
        print(f"    Compressed: {col.total_compressed_size / 1024:.2f} KB")
        print(f"    Uncompressed: {col.total_uncompressed_size / 1024:.2f} KB")
        print(f"    Compression Ratio: {col.total_uncompressed_size / col.total_compressed_size:.2f}x")

🔶 ORC Internal Structure

Design Philosophy

ORC is a columnar format optimized for Hive workloads, and it typically achieves somewhat higher compression ratios than Parquet.

Core Features

  • High Compression: ZLIB default, very high compression ratio
  • Built-in Indexes: Row group, bloom filter, column statistics
  • ACID Support: Hive ACID transaction support
  • Predicate Pushdown: Strong filtering with multi-layer indexes

File Structure

ORC File Structure:
┌─────────────────────────────────┐
│ Stripe 1                        │
│  ├── Index Data                 │
│  │   ├── Row Index              │
│  │   ├── Bloom Filter           │
│  │   └── Column Statistics      │
│  ├── Data (Compressed)          │
│  │   ├── Column A Stream        │
│  │   ├── Column B Stream        │
│  │   └── Column C Stream        │
│  └── Stripe Footer              │
├─────────────────────────────────┤
│ Stripe 2                        │
│  └── ...                        │
├─────────────────────────────────┤
│ File Footer                     │
│  ├── Schema                     │
│  ├── Statistics                 │
│  ├── Stripe Information         │
│  └── User Metadata              │
├─────────────────────────────────┤
│ Postscript                      │
│  ├── Compression                │
│  ├── Footer Length              │
│  └── Version                    │
└─────────────────────────────────┘

Stripes come first; the file footer and postscript sit at the end of the file, which is why ORC readers start by reading the tail.

Stripe and Index

Stripe

  • Definition: Basic processing unit of ORC (default 64MB)
  • Composition: Index Data + Actual Data + Footer
  • Parallel Processing: Distributed processing by stripe

Index Types

  • Row Index: min/max/sum/count every 10,000 rows
  • Bloom Filter: Fast check for value existence
  • Column Statistics: Stripe-level statistics
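
These indexes only help if the query engine pushes filters down to the ORC reader. In Spark, for example, that is governed by a session config; a minimal sketch with a placeholder path and column:

# Enable ORC predicate pushdown so row indexes and bloom filters can skip data
spark.conf.set("spark.sql.orc.filterPushdown", "true")

df = spark.read.orc("s3://bucket/data/events.orc")
df.filter("user_id = 12345").explain()  # look for PushedFilters in the physical plan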

ORC Creation Example

# Create ORC in Spark
df.write \
    .format("orc") \
    .option("compression", "zlib") \
    .option("orc.stripe.size", 67108864) \
    .option("orc.compress.size", 262144) \
    .option("orc.bloom.filter.columns", "user_id,product_id") \
    .mode("overwrite") \
    .save("s3://bucket/data/events.orc")

ORC Metadata Analysis

# Analyze ORC file (using PyArrow)
import pyarrow.orc as orc

orc_file = orc.ORCFile('events.orc')

# Schema
print("Schema:")
print(orc_file.schema)

# Stripe information
print(f"\nStripes: {orc_file.nstripes}")
print(f"Rows: {orc_file.nrows}")

# File-level properties
print(f"Compression: {orc_file.compression}")
print(f"Writer Version: {orc_file.writer_version}")

🔹 Avro Internal Structure

Design Philosophy

Avro is a row-based format optimized for schema evolution and fast serialization.

Core Features

  • Row-based: Stores entire records sequentially
  • Self-describing: Schema included in file
  • Schema Evolution: Perfect support for schema changes
  • Compact Binary: Efficient binary encoding

File Structure

Avro File Structure:
┌─────────────────────────────────┐
│ Header                          │
│  ├── Magic: Obj\x01             │
│  ├── File Metadata              │
│  │   ├── Schema (JSON)          │
│  │   └── Codec (snappy/deflate) │
│  └── Sync Marker (16 bytes)     │
├─────────────────────────────────┤
│ Data Block 1                    │
│  ├── Record Count               │
│  ├── Block Size (compressed)    │
│  ├── Records (compressed)       │
│  │   ├── Record 1 (all fields)  │
│  │   ├── Record 2 (all fields)  │
│  │   └── Record N (all fields)  │
│  └── Sync Marker                │
├─────────────────────────────────┤
│ Data Block 2                    │
│  └── ...                        │
└─────────────────────────────────┘

Schema Definition

{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "date", "type": "string"},
    {"name": "metadata", "type": ["null", {
      "type": "map",
      "values": "string"
    }], "default": null}
  ]
}
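
Outside Spark, the same schema can be used directly from Python. Here is a sketch with the fastavro library (assumed to be installed; the file name is arbitrary) that writes and reads an Avro container file:

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "namespace": "com.example",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "date", "type": "string"},
        {"name": "metadata", "type": ["null", {"type": "map", "values": "string"}], "default": None}
    ]
})

records = [
    {"id": 1, "name": "Alice", "amount": 100.5, "date": "2024-01-15", "metadata": None},
    {"id": 2, "name": "Bob", "amount": 200.3, "date": "2024-01-15", "metadata": {"channel": "web"}},
]

# Write an Avro container file; the schema is embedded in the header
with open("events.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Read it back; each record comes out as a plain dict
with open("events.avro", "rb") as fo:
    for record in reader(fo):
        print(record)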

Avro Creation Example

# Create Avro in Spark
df.write \
    .format("avro") \
    .option("compression", "snappy") \
    .mode("overwrite") \
    .save("s3://bucket/data/events.avro")

# Use Avro in Kafka
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema_str = """
{
   "namespace": "com.example",
   "type": "record",
   "name": "Event",
   "fields" : [
     {"name": "id", "type": "int"},
     {"name": "name", "type": "string"}
   ]
}
"""

value_schema = avro.loads(value_schema_str)

avroProducer = AvroProducer({
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081'
}, default_value_schema=value_schema)

# Send message
avroProducer.produce(topic='events', value={"id": 1, "name": "Alice"})
avroProducer.flush()
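
For completeness, a matching consumer sketch using the same confluent_kafka.avro helper API (newer client versions recommend the confluent_kafka.schema_registry serializers instead; the broker and registry addresses are the same placeholders as above):

from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081',
    'group.id': 'event-readers',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['events'])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue
    # The Avro payload is deserialized into a dict using the registered schema
    print(msg.value())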

📊 Actual Benchmark Comparison

Test Environment

| Item | Configuration |
|---|---|
| Dataset | NYC Taxi (100M records, 100GB CSV) |
| Spark Version | 3.4.0 |
| Instances | r5.4xlarge × 10 |
| Compression Codec | Snappy (Parquet/Avro), ZLIB (ORC) |
| Row Group/Stripe Size | 128MB |

Test 1: File Size and Compression Ratio

Original Data: 100GB CSV

| Format | Compression Codec | File Size | Compression Ratio | File Count |
|---|---|---|---|---|
| CSV | None | 100 GB | 1.0x | 1,000 |
| Parquet | Snappy | 12.3 GB | 8.1x | 97 |
| Parquet | GZIP | 8.9 GB | 11.2x | 70 |
| ORC | ZLIB | 9.1 GB | 11.0x | 72 |
| ORC | Snappy | 11.8 GB | 8.5x | 93 |
| Avro | Snappy | 28.4 GB | 3.5x | 224 |
| Avro | Deflate | 24.1 GB | 4.1x | 190 |

Compression Time Comparison

import time

# Parquet write
start = time.time()
df.write.mode("overwrite").parquet("output.parquet")
parquet_time = time.time() - start

# ORC write
start = time.time()
df.write.format("orc").mode("overwrite").save("output.orc")
orc_time = time.time() - start

# Avro write
start = time.time()
df.write.format("avro").mode("overwrite").save("output.avro")
avro_time = time.time() - start

print(f"Parquet: {parquet_time:.2f}s")  # Result: 142.3s
print(f"ORC: {orc_time:.2f}s")          # Result: 156.8s
print(f"Avro: {avro_time:.2f}s")        # Result: 98.4s

| Format | Write Time | Processing Speed |
|---|---|---|
| Parquet (Snappy) | 142.3s | 703 MB/s |
| ORC (ZLIB) | 156.8s | 638 MB/s |
| Avro (Snappy) | 98.4s | 1,016 MB/s |

Test 2: Read Performance (Full Scan)

-- Query: Aggregate entire data
SELECT 
    pickup_date,
    COUNT(*) as trip_count,
    AVG(fare_amount) as avg_fare,
    SUM(tip_amount) as total_tips
FROM trips
GROUP BY pickup_date;

Read Performance Comparison

| Format | Compression | Scan Time | Processing Speed | Memory Usage |
|---|---|---|---|---|
| Parquet | Snappy | 23.4s | 4.3 GB/s | 18.2 GB |
| Parquet | GZIP | 31.2s | 3.2 GB/s | 16.8 GB |
| ORC | ZLIB | 28.7s | 3.5 GB/s | 17.1 GB |
| ORC | Snappy | 24.1s | 4.1 GB/s | 18.5 GB |
| Avro | Snappy | 87.3s | 1.1 GB/s | 32.4 GB |

Test 3: Column Selection Query (Projection Pushdown)

-- Query: Select specific columns only
SELECT pickup_date, fare_amount
FROM trips
WHERE pickup_date = '2024-01-15';

Column Selection Performance

| Format | All Columns | 2 Columns | Improvement | Scanned Data |
|---|---|---|---|---|
| Parquet | 23.4s | 2.8s | 8.4x | 1.2 GB |
| ORC | 28.7s | 3.1s | 9.3x | 1.1 GB |
| Avro | 87.3s | 84.2s | 1.0x | 28.4 GB (full) |

Key Point: Columnar formats achieve massive performance improvements by reading only the selected columns, while Avro must read every full record.
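
You can verify this behavior in Spark by selecting only the needed columns and inspecting the physical plan; the ReadSchema entry shows what is actually read from the files (the path is a placeholder):

df = spark.read.parquet("s3://bucket/analytics/trips/")

# For Parquet/ORC, ReadSchema in the plan contains only pickup_date and fare_amount
df.select("pickup_date", "fare_amount") \
  .filter("pickup_date = '2024-01-15'") \
  .explain()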

Test 4: Predicate Pushdown

-- Query: Filter conditions
SELECT *
FROM trips
WHERE fare_amount > 50 AND tip_amount > 10;

Predicate Pushdown Effect

| Format | Scanned Data | Actually Read | Skipped Ratio | Query Time |
|---|---|---|---|---|
| Parquet | 12.3 GB | 3.2 GB | 74% | 8.4s |
| ORC | 9.1 GB | 2.1 GB | 77% | 7.2s |
| Avro | 28.4 GB | 28.4 GB | 0% | 72.1s |

Key Point: ORC’s row index and Bloom filters make its predicate pushdown the most effective here
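
To confirm a filter is actually pushed down rather than applied after a full scan, check the plan for PushedFilters; a short sketch (placeholder path):

df = spark.read.parquet("s3://bucket/analytics/trips/")

# The scan node should list entries such as GreaterThan(fare_amount,50.0)
df.filter("fare_amount > 50 AND tip_amount > 10").explain()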

Test 5: Schema Evolution

# Schema change test (shown with Parquet; swap format_type for "orc" or "avro")
format_type = "parquet"

# 1. Save data with the existing schema
schema_v1 = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("amount", DoubleType())
])
df_v1 = spark.createDataFrame([(1, "Alice", 100.5)], schema_v1)
df_v1.write.format(format_type).save(f"data_{format_type}_v1")

# 2. Schema with a new column added
schema_v2 = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("amount", DoubleType()),
    StructField("category", StringType())  # New column
])
df_v2 = spark.createDataFrame([(2, "Bob", 200.3, "books")], schema_v2)
df_v2.write.format(format_type).save(f"data_{format_type}_v2")

# 3. Read both versions together
df_merged = spark.read.format(format_type).load(f"data_{format_type}_*")

Schema Evolution Support

| Format | Add Column | Drop Column | Type Change | Rename Column |
|---|---|---|---|---|
| Parquet | ✅ Possible | ⚠️ Caution needed | ❌ Not possible | ❌ Not possible |
| ORC | ✅ Possible | ⚠️ Caution needed | ❌ Not possible | ❌ Not possible |
| Avro | ✅ Full support | ✅ Full support | ✅ Partially possible | ✅ Alias support |
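
For Parquet specifically, columns added across files can still be reconciled at read time with Spark's mergeSchema option; a sketch using the directory names from the test above:

# Merge the footers of both versions into one unified schema;
# rows from the v1 files get NULL for the new "category" column
df_evolved = spark.read \
    .option("mergeSchema", "true") \
    .parquet("data_parquet_v1", "data_parquet_v2")

df_evolved.printSchema()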

🎯 Optimal Format Selection by Use Case

Use Case 1: Large-scale Analytics Data Lake

Scenario

  • 10TB data collection per day
  • Ad-hoc queries with Athena, Spark
  • Mainly aggregate queries

# Optimal settings
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.conf.set("parquet.block.size", 134217728)  # 128MB row groups (parquet-mr Hadoop property)
spark.conf.set("parquet.page.size", 1048576)     # 1MB pages
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

df.write \
    .partitionBy("date") \
    .parquet("s3://bucket/analytics/")

Reasons:

  • ✅ Perfect Athena support
  • ✅ Fast read performance
  • ✅ Good compression ratio
  • ✅ Versatility

Use Case 2: Hive-centric Data Warehouse

Scenario

  • Using Hive metastore
  • ACID transactions needed
  • Frequent UPDATE/DELETE operations

-- Create ORC table in Hive
CREATE TABLE events (
    id INT,
    name STRING,
    amount DOUBLE,
    event_date STRING
)
PARTITIONED BY (date STRING)
STORED AS ORC
TBLPROPERTIES (
    "orc.compress"="ZLIB",
    "orc.create.index"="true",
    "orc.bloom.filter.columns"="id,name",
    "transactional"="true"  -- required for ACID (managed ORC tables are transactional by default in Hive 3+)
);

-- ACID transaction
UPDATE events SET amount = amount * 1.1 WHERE date = '2024-01-15';

Reasons:

  • ✅ Hive optimization
  • ✅ ACID support
  • ✅ Best compression ratio
  • ✅ Powerful indexes

Use Case 3: Real-time Streaming Pipeline

Scenario

  • Real-time data collection with Kafka
  • Using Schema Registry
  • Frequent schema changes

# Real-time: Kafka + Avro
from confluent_kafka import avro

# Publish events to Kafka as Avro
# (avro_producer and event_data come from the AvroProducer setup shown earlier)
avro_producer.produce(topic='events', value=event_data)

# Batch: Avro → Parquet conversion
df = spark.read.format("avro").load("s3://bucket/streaming/avro/")

df.write \
    .partitionBy("date") \
    .parquet("s3://bucket/analytics/parquet/")

Reasons:

  • ✅ Avro: Fast writes, schema evolution
  • ✅ Parquet: Analytics optimization
  • ✅ Leverage both advantages

Use Case 4: Long-term Log Data Storage

Scenario

  • 50TB log data per day
  • Mostly cold storage
  • Occasional period-specific analysis

# Maximum compression ratio settings
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")  # or zstd

df.write \
    .partitionBy("date") \
    .parquet("s3://bucket/logs/")

# Automatic transition with lifecycle policy
import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='bucket',
    LifecycleConfiguration={
        'Rules': [{
            'Id': 'TransitionLogs',
            'Status': 'Enabled',
            'Prefix': 'logs/',
            'Transitions': [
                {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                {'Days': 90, 'StorageClass': 'GLACIER'}
            ]
        }]
    }
)

Reasons:

  • ✅ High compression ratio (storage cost savings)
  • ✅ S3 Glacier compatible
  • ✅ Fast analysis when needed

Use Case 5: Complex Nested Data

Scenario

  • JSON event data
  • Deep nested structures
  • Frequently query only specific fields

# Nested JSON data
json_data = """
{
  "user": {
    "id": 123,
    "profile": {
      "name": "Alice",
      "email": "alice@example.com"
    }
  },
  "event": {
    "type": "purchase",
    "items": [
      {"id": 1, "price": 100.5},
      {"id": 2, "price": 50.3}
    ]
  }
}
"""

# Handle nested structure in Spark
df = spark.read.json("s3://bucket/raw/events.json")

# Save as Parquet (preserving nested structure)
df.write.parquet("s3://bucket/processed/events.parquet")

# Efficiently read only specific fields
df = spark.read.parquet("s3://bucket/processed/events.parquet")
df.select("user.profile.name", "event.type").show()
# Parquet reads only needed columns (nested column pruning)

Reasons:

  • ✅ Perfect nested structure support
  • ✅ Nested column pruning
  • ✅ Memory efficient

🔄 Format Conversion Guide

CSV → Parquet Migration

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("CSV to Parquet") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read CSV (schema inference)
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("s3://bucket/raw/csv/*.csv")

# Optimize data types
from pyspark.sql.functions import col

df = df \
    .withColumn("amount", col("amount").cast("decimal(10,2)")) \
    .withColumn("event_time", col("event_time").cast("timestamp"))

# Convert to Parquet
df.repartition(100) \
    .write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .partitionBy("date") \
    .parquet("s3://bucket/processed/parquet/")

print(f"Original CSV: {df.inputFiles()[0]}")
print(f"Rows: {df.count():,}")

Migration Results

| Item | CSV | Parquet | Improvement |
|---|---|---|---|
| File Size | 100 GB | 12.3 GB | 87% reduction |
| Query Time | 245s | 23.4s | 10.5x faster |
| S3 Cost | $2,300/month | $283/month | 87% savings |
| Athena Scan | $512/query | $62/query | 88% savings |
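
The Athena numbers follow directly from bytes scanned. A rough cost model you can adapt is sketched below; the $5 per TB rate and the query volume are assumptions, so plug in your region's current pricing and your own workload:

ATHENA_PRICE_PER_TB = 5.0      # assumed on-demand price, USD per TB scanned
QUERIES_PER_MONTH = 1000       # hypothetical workload

def monthly_scan_cost(gb_scanned_per_query: float) -> float:
    # Estimated monthly Athena scan cost for a given per-query scan size
    return gb_scanned_per_query / 1024 * ATHENA_PRICE_PER_TB * QUERIES_PER_MONTH

print(f"CSV:     ${monthly_scan_cost(100):,.0f}/month")   # full 100 GB scanned per query
print(f"Parquet: ${monthly_scan_cost(12.3):,.0f}/month")  # only ~12.3 GB of Parquet scanned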

Avro → Parquet Batch Conversion

# Convert Avro collected from streaming to Parquet for analytics
from pyspark.sql import SparkSession
from datetime import datetime, timedelta

spark = SparkSession.builder \
    .appName("Avro to Parquet Batch") \
    .getOrCreate()

# Process yesterday's data
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

# Read Avro
avro_path = f"s3://bucket/streaming/avro/date={yesterday}/"
df = spark.read.format("avro").load(avro_path)

# Data quality check
print(f"Records: {df.count():,}")
print(f"Duplicates: {df.count() - df.dropDuplicates().count():,}")

# Remove duplicates and sort
df = df.dropDuplicates(["id"]) \
    .orderBy("event_time")

# Save as Parquet
parquet_path = f"s3://bucket/analytics/parquet/date={yesterday}/"
df.repartition(20) \
    .write \
    .mode("overwrite") \
    .parquet(parquet_path)

# Validation
parquet_df = spark.read.parquet(parquet_path)
assert df.count() == parquet_df.count(), "Record count mismatch!"

print(f"✓ Migration completed: {yesterday}")

ORC ↔ Parquet Mutual Conversion

# ORC → Parquet
orc_df = spark.read.format("orc").load("s3://bucket/data.orc")
orc_df.write.parquet("s3://bucket/data.parquet")

# Parquet → ORC
parquet_df = spark.read.parquet("s3://bucket/data.parquet")
parquet_df.write.format("orc").save("s3://bucket/data.orc")

# Performance comparison
import time

# ORC read
start = time.time()
orc_df = spark.read.format("orc").load("s3://bucket/large_data.orc")
orc_count = orc_df.count()
orc_time = time.time() - start

# Parquet read
start = time.time()
parquet_df = spark.read.parquet("s3://bucket/large_data.parquet")
parquet_count = parquet_df.count()
parquet_time = time.time() - start

print(f"ORC: {orc_time:.2f}s, {orc_count:,} rows")
print(f"Parquet: {parquet_time:.2f}s, {parquet_count:,} rows")

🛠️ Production Optimization Tips

Parquet Optimization

# 1. Choose compression codec
# - Snappy: Fast compression/decompression (real-time analytics)
# - GZIP: High compression ratio (long-term storage)
# - ZSTD: Balanced performance (recommended)

spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# 2. Adjust Row Group size
spark.conf.set("spark.sql.parquet.block.size", 268435456)  # 256MB

# 3. Utilize dictionary encoding
# Automatically applied to low cardinality columns
# To manually disable:
spark.conf.set("spark.sql.parquet.enableDictionaryEncoding", "false")

# 4. Enable vectorized reader
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

# 5. Binary as string optimization
spark.conf.set("spark.sql.parquet.binaryAsString", "false")

ORC Optimization

# 1. Adjust Stripe size
spark.conf.set("spark.sql.orc.stripe.size", 67108864)  # 64MB

# 2. Set Bloom filter
df.write \
    .format("orc") \
    .option("orc.bloom.filter.columns", "user_id,product_id") \
    .option("orc.bloom.filter.fpp", 0.05) \
    .save("s3://bucket/data.orc")

# 3. Choose compression
# - ZLIB: Best compression ratio (default)
# - SNAPPY: Fast performance
# - LZO: Balanced

spark.conf.set("spark.sql.orc.compression.codec", "zlib")

# 4. Index stride (row index interval)
spark.conf.set("orc.row.index.stride", 10000)

Avro Optimization

# 1. Compression settings
df.write \
    .format("avro") \
    .option("compression", "snappy") \
    .save("s3://bucket/data.avro")

# 2. Schema registry integration
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

producer_config = {
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081'
}

producer = AvroProducer(producer_config, default_value_schema=schema)

Format Selection Decision Tree

def choose_format(use_case):
    """Format selection helper function"""
    
    # Real-time streaming?
    if use_case["streaming"] and use_case["schema_changes"]:
        return "Avro"
    
    # Hive-centric environment?
    if use_case["hive"] and use_case["acid"]:
        return "ORC"
    
    # Maximum compression needed?
    if use_case["storage_critical"]:
        return "ORC with ZLIB"
    
    # General analytics?
    if use_case["analytics"] and use_case["athena"]:
        return "Parquet with Snappy"
    
    # Fast writes needed?
    if use_case["write_heavy"]:
        return "Avro"
    
    # Default
    return "Parquet"

# Usage example
use_case = {
    "streaming": False,
    "schema_changes": False,
    "hive": False,
    "acid": False,
    "storage_critical": False,
    "analytics": True,
    "athena": True,
    "write_heavy": False
}

recommended = choose_format(use_case)
print(f"Recommended format: {recommended}")
# Output: Recommended format: Parquet with Snappy

📚 Learning Summary

Key Points

  1. Understanding Format Characteristics
    • Parquet: General analytics, Athena/Spark optimization
    • ORC: Hive optimization, best compression ratio, ACID support
    • Avro: Streaming, schema evolution, fast writes
  2. Performance Comparison Summary
    • Compression Ratio: ORC > Parquet > Avro
    • Read Performance: Parquet ≈ ORC » Avro
    • Write Performance: Avro > Parquet ≈ ORC
    • Column Selection: Parquet/ORC 8-9x faster
  3. Production Selection Guide
    • Analytics-focused: Parquet (Snappy)
    • Hive Environment: ORC (ZLIB)
    • Streaming: Avro → Parquet hybrid
    • Long-term Storage: Parquet (GZIP/ZSTD)
  4. Optimization Strategies
    • File size: Maintain 64-256MB files (see the sizing sketch after this list)
    • Compression codec: Choose according to purpose
    • Partitioning: Simple and shallow
    • Schema design: Optimize data types
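
For the file-size guideline, a common approach is to derive the output partition count from an estimated output size and a target file size; a rough sketch (the size estimate and target are assumptions to tune for your data):

# Assumed: the dataset compresses to roughly 12 GB as Parquet (measure on a sample first)
estimated_output_bytes = 12 * 1024**3
target_file_bytes = 256 * 1024**2           # aim for ~256MB files

num_files = max(1, estimated_output_bytes // target_file_bytes)

df.repartition(int(num_files)) \
    .write \
    .mode("overwrite") \
    .parquet("s3://bucket/analytics/compacted/")  # placeholder path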

Production Checklist

  • Use case analysis complete
  • Current format performance measured
  • Benchmark tests performed
  • Format selection and configuration optimized
  • Migration plan established
  • Validation process defined
  • Cost impact analyzed
  • Monitoring dashboard built

Next Steps

  • Apache Iceberg/Delta Lake: Abstracting file formats with table formats
  • Parquet Advanced Optimization: Bloom filter, Column index
  • Compression Algorithm Comparison: ZSTD vs LZ4 vs Brotli
  • Schema Evolution Strategy: Compatibility management

“File format selection is not just a technical decision, but a strategic choice that directly impacts business outcomes.”

A data lake’s file format is hard to change once it is in place. Understanding each format’s characteristics and choosing the one that best fits your use case is the key to building a successful data lake. We hope this guide helps you make the right choice!