🗄️ S3 vs HDFS Partitioning Strategy - Optimizing Data Lake for the Cloud Era
“Past best practices can become today’s anti-patterns” - What you must know about partitioning strategy changes when migrating from HDFS to S3
When migrating data lakes from on-premises HDFS to cloud S3, many teams keep their existing yyyy/mm/dd partitioning structure. However, this can cause serious performance degradation due to S3's architectural characteristics. This post walks through the fundamental differences between HDFS and S3, S3-optimized partitioning strategies, and actual query performance comparisons, with practical guidance you can apply directly in production.
📚 Table of Contents
- 🏛️ HDFS Era Partitioning Strategy
- ☁️ Fundamental Differences of S3
- ⚠️ S3 Partitioning Anti-patterns
- 🚀 S3 Optimized Partitioning Strategy
- 📊 Actual Query Performance Comparison
- 🔧 Production Migration Guide
- 📚 Learning Summary
🏛️ HDFS Era Partitioning Strategy
Understanding HDFS Architecture
HDFS was designed as a hierarchical file system.
| Component | Role | Characteristics |
|---|---|---|
| NameNode | Metadata management | Stores directory structure in memory |
| DataNode | Actual data storage | Local disk-based block storage |
| Block | Data unit | 128MB default size, maintains replicas |
Why yyyy/mm/dd Was Suitable
# Typical partition structure in HDFS
/data/events/
└── year=2024/
└── month=01/
└── day=15/
├── part-00000.parquet
├── part-00001.parquet
└── part-00002.parquet
1. NameNode Metadata Management
- Directory Structure: Tree structure for metadata management
- Memory Usage: ~150 bytes of NameNode heap per directory/file/block entry (a rough estimate is sketched at the end of this list)
- Hierarchical Traversal: Fast, roughly O(log n) lookups in the in-memory namespace tree
2. Hive Partition Pruning
- Dynamic Partitioning: Automatic partitioning by date
- Metastore: Caches partition metadata
- Query Optimization: Excludes unnecessary partitions with WHERE clause
-- Efficient query in Hive
SELECT * FROM events
WHERE year = 2024 AND month = 1 AND day = 15;
-- NameNode can immediately navigate to that directory
3. Local Disk Characteristics
- Sequential Reading: Fast directory structure traversal
- Block Locality: DataNode processes local data first
- Network Cost: Minimized
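To make the metadata cost concrete, here is a minimal back-of-the-envelope sketch, assuming the commonly cited ~150 bytes of NameNode heap per namespace object (file, directory, or block) mentioned above. It illustrates why a deep date hierarchy is harmless on HDFS as long as the NameNode has memory to spare:
# Rough NameNode heap estimate for a yyyy/mm/dd layout
# Assumption: ~150 bytes of heap per namespace object (file, directory, block)
BYTES_PER_OBJECT = 150

def estimate_namenode_heap_mb(days: int, files_per_day: int, blocks_per_file: int = 1) -> float:
    """Return estimated NameNode heap usage in MB for a date-partitioned dataset."""
    directories = days * 3                      # year=/month=/day= levels (upper bound)
    files = days * files_per_day
    blocks = files * blocks_per_file
    total_objects = directories + files + blocks
    return total_objects * BYTES_PER_OBJECT / 1024 / 1024

# 3 years of daily partitions with 24 files per day -> only a few MB of heap
print(f"{estimate_namenode_heap_mb(days=3 * 365, files_per_day=24):.1f} MB")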
HDFS Partitioning Best Practices
| Strategy | Description | Advantages |
|---|---|---|
| Time-based | yyyy/mm/dd or yyyy/mm/dd/hh | Efficient time-series data processing |
| Hierarchical Structure | Nested directories by category | Structured metadata |
| Partition Count | Thousands to tens of thousands possible | OK as long as NameNode memory is sufficient |
☁️ Fundamental Differences of S3
S3 is Object Storage
S3 is Key-Value object storage, not a file system.
| Characteristic | HDFS | S3 |
|---|---|---|
| Storage Method | Hierarchical file system | Flat namespace (Key-Value) |
| Directory | Actual directories exist | Directories are conceptual (part of Key) |
| Metadata | NameNode memory | Distributed metadata store |
| Access Method | File path | Object Key |
| List Operation | Fast (local disk) | Slow (network API calls) |
S3 Internal Structure
# "Directories" don't actually exist in S3
# Everything is a Key-Value pair
s3://bucket/data/events/year=2024/month=01/day=15/part-00000.parquet
# This is actually just one long Key
Cost of S3 List Operations
import boto3
s3 = boto3.client('s3')
# Finding data for a specific date in yyyy/mm/dd structure
# 1. List year=2024 -> API call 1
# 2. List month=01 -> API call 1
# 3. List day=15 -> API call 1
# Total 3 API calls + network latency
response = s3.list_objects_v2(
Bucket='my-bucket',
Prefix='data/events/year=2024/month=01/day=15/'
)
1. Request Rate Limits
- Throughput per Prefix: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second
- Deep Directory Structure: Requests concentrate on the same prefix, creating bottlenecks
- Performance Degradation: Response times climb sharply when many List operations pile up
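As a rough planning aid, the per-prefix limits above can be turned into a quick estimate of how many distinct prefixes (shards) a workload needs. This is a minimal sketch, not an official AWS formula; the peak request rates are illustrative inputs:
import math

# S3 per-prefix request limits (requests/second)
GET_LIMIT_PER_PREFIX = 5500
PUT_LIMIT_PER_PREFIX = 3500

def prefixes_needed(peak_get_rps: int, peak_put_rps: int) -> int:
    """Estimate how many distinct prefixes are needed to stay under the per-prefix limits."""
    return max(
        math.ceil(peak_get_rps / GET_LIMIT_PER_PREFIX),
        math.ceil(peak_put_rps / PUT_LIMIT_PER_PREFIX),
    )

# e.g. a job issuing 40,000 GET/s and 2,000 PUT/s needs at least 8 prefixes (shards)
print(prefixes_needed(peak_get_rps=40000, peak_put_rps=2000))
In practice S3 also scales prefixes automatically over time, so treat this as a sizing starting point rather than a hard requirement.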
2. List Operation Overhead
| Operation | HDFS | S3 |
|---|---|---|
| Single Directory List | ~1ms (local) | ~100-300ms (network) |
| Depth 3 Traversal | ~3ms | ~300-900ms |
| 1,000 Objects List | ~10ms | ~1-2 seconds |
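The latencies above vary by region, network, and object count; you can measure them against your own bucket with a small boto3 timing check (the bucket and prefixes below are the same placeholders used elsewhere in this post):
import time
import boto3

s3 = boto3.client('s3')

def time_list_ms(bucket: str, prefix: str) -> float:
    """Time a single ListObjectsV2 call (one page, up to 1,000 keys) in milliseconds."""
    start = time.perf_counter()
    s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1000)
    return (time.perf_counter() - start) * 1000

# Compare a deep HDFS-style prefix against a flat single-level prefix
for prefix in ("data/events/year=2024/month=01/day=15/", "data/events/date=2024-01-15/"):
    print(f"{prefix}: {time_list_ms('my-bucket', prefix):.0f} ms")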
3. Consistency Model
- Strong Consistency: Since December 2020, S3 provides strong read-after-write consistency for all PUT and DELETE operations, including overwrites and deletes
- List Operation: LIST results also reflect the latest writes immediately
- Practical Implication: Consistency is no longer a reason to avoid S3, but List latency and request costs still are
⚠️ S3 Partitioning Anti-patterns
Anti-pattern #1: Excessively Deep Hierarchical Structure
# Anti-pattern: Using HDFS style as-is
s3://bucket/data/events/
└── year=2024/
└── month=01/
└── day=15/
└── hour=10/
├── part-00000.parquet
└── part-00001.parquet
Problems
- List Operation Explosion: API call needed for each level
- Network Latency: Cumulative round-trip time for 4-5 calls
- Query Delay: Spark/Athena spends excessive time on partition exploration
Actual Impact
# Measuring partition discovery time in Spark
import time
start = time.time()
df = spark.read.parquet("s3://bucket/data/events/year=2024/month=01/day=15/")
print(f"Partition discovery: {time.time() - start:.2f}s")
# Result: Partition discovery: 5.43s (deep structure)
# vs
# Result: Partition discovery: 0.87s (simple structure)
Anti-pattern #2: Small Files Problem
# Anti-pattern: Creating small files per hour
s3://bucket/data/events/date=2024-01-15/
├── hour=00/
│ ├── part-00000.parquet (2MB)
│ ├── part-00001.parquet (1.5MB)
│ └── part-00002.parquet (3MB)
├── hour=01/
│ ├── part-00000.parquet (2.3MB)
│ └── part-00001.parquet (1.8MB)
...
Problems
- GET Request Explosion: Separate HTTP request per small file
- I/O Overhead: Repeated file open/close operations
- Metadata Cost: Number of files × metadata size
- Query Performance: Spark executors processing numerous files
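A quick way to see the request-count overhead: each file requires at least one GET (Parquet readers typically issue several range requests per file, so treat this as a lower bound), and GET requests are billed per request. The rate below is an assumption based on typical us-east-1 Standard pricing at the time of writing:
# Compare GET request counts and rough request cost for the same 128GB of data
GET_COST_PER_1000 = 0.0004  # USD per 1,000 GET requests (assumption; varies by region)
TOTAL_GB = 128

for file_size_mb in (128, 10, 1):
    num_files = int(TOTAL_GB * 1024 / file_size_mb)
    request_cost = num_files / 1000 * GET_COST_PER_1000
    print(f"{file_size_mb:>3}MB files: {num_files:>7,} GET requests (lower bound) ≈ ${request_cost:.3f} per full read")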
Small Files Impact
| File Size | File Count | Total Data | Spark Read Time | S3 Cost |
|---|---|---|---|---|
| 128MB | 1,000 | 128GB | 45s | Baseline |
| 10MB | 13,000 | 128GB | 4min 20s | 1.8x |
| 1MB | 130,000 | 128GB | 12min 35s | 3.2x |
Anti-pattern #3: Prefix Hotspot
# Anti-pattern: Concentration on same prefix
s3://bucket/data/events/2024-01-15/
├── event-000001.parquet
├── event-000002.parquet
├── event-000003.parquet
...
└── event-999999.parquet
Problems
- Request Rate Limit: Request concentration on same prefix
- Performance Degradation: Reaching 3,500/5,500 RPS limit
- Parallel Processing Limitation: Degraded distributed read performance
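One common mitigation, shown with Spark in Strategy #3 below, is to spread keys across hash-derived shard prefixes so that requests are not concentrated on a single prefix. A minimal illustration of the key-naming idea (the shard count of 16 is purely illustrative):
import hashlib

NUM_SHARDS = 16  # illustrative; size this to your peak request rate

def sharded_key(date: str, file_name: str) -> str:
    """Derive a stable shard prefix from the file name so reads/writes spread across prefixes."""
    shard = int(hashlib.md5(file_name.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"data/events/date={date}/shard={shard}/{file_name}"

print(sharded_key("2024-01-15", "event-000001.parquet"))
# -> data/events/date=2024-01-15/shard=<0-15>/event-000001.parquet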
🚀 S3 Optimized Partitioning Strategy
Strategy #1: Simple and Shallow Structure
# Optimized: Single-level yyyy-mm-dd or yyyymmdd
s3://bucket/data/events/date=2024-01-15/
├── part-00000-uuid.snappy.parquet (128MB)
├── part-00001-uuid.snappy.parquet (128MB)
└── part-00002-uuid.snappy.parquet (128MB)
Advantages
- Minimize List Operations: 1-2 API calls
- Fast Partition Discovery: Reduced network round trips
- Predictable Performance: Consistent response time
Spark Configuration
# Prepare the date column in yyyy-MM-dd format first
from pyspark.sql.functions import date_format
df = df.withColumn("date", date_format("timestamp", "yyyy-MM-dd"))

# Then write with a single-level date partition
df.write \
    .partitionBy("date") \
    .parquet("s3://bucket/data/events/")
Strategy #2: Maintain Appropriate File Size
| File Size | Recommendation | Reason |
|---|---|---|
| < 10MB | ❌ Too small | Small files problem |
| 10-64MB | ⚠️ Small | Prefer larger if possible |
| 64-256MB | ✅ Optimal | Recommended range |
| 256-512MB | ✅ Good | Suitable for large-scale processing |
| > 512MB | ⚠️ Large | Increased processing time per file |
File Size Optimization
# Control file size in Spark
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728) # 128MB
# Adjust file count with repartition
df.repartition(100) \
.write \
.partitionBy("date") \
.parquet("s3://bucket/data/events/")
Compaction Job
# Merge small files into large files
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("S3 Compaction") \
.getOrCreate()
# Read existing small files
df = spark.read.parquet("s3://bucket/data/events/date=2024-01-15/")
# Calculate appropriate partition count
num_partitions = max(1, int(df.count() / 1000000)) # 1M records per partition
# Rewrite with appropriate partition count
df.repartition(num_partitions) \
.write \
.mode("overwrite") \
.parquet("s3://bucket/data/events-compacted/date=2024-01-15/")
Strategy #3: Prefix Distribution
# Optimized: Distribute prefix for improved parallel processing
s3://bucket/data/events/
├── date=2024-01-15/shard=0/
│ ├── part-00000.parquet
│ └── part-00001.parquet
├── date=2024-01-15/shard=1/
│ ├── part-00000.parquet
│ └── part-00001.parquet
...
Shard-based Partitioning
# Create hash-based shard
from pyspark.sql.functions import hash, abs, col
df = df.withColumn("shard", abs(hash(col("user_id"))) % 10)
df.write \
.partitionBy("date", "shard") \
.parquet("s3://bucket/data/events/")
Strategy #4: Date Format Selection

| Format | Example | Pros/Cons |
|---|---|---|
| yyyy/mm/dd | 2024/01/15 | ❌ 3 levels, many List operations |
| yyyy-mm-dd | 2024-01-15 | ✅ 1 level, good readability |
| yyyymmdd | 20240115 | ✅ 1 level, concise |
| yyyy-mm | 2024-01 | ⚠️ For monthly aggregation |
Date Partition Creation
from pyspark.sql.functions import date_format
# yyyy-mm-dd format (recommended)
df = df.withColumn("date", date_format("event_time", "yyyy-MM-dd"))
# yyyymmdd format (more concise)
df = df.withColumn("date", date_format("event_time", "yyyyMMdd"))
# Partitioning
df.write \
.partitionBy("date") \
.parquet("s3://bucket/data/events/")
📊 Actual Query Performance Comparison
Test Environment

| Item | Configuration |
|---|---|
| Data Size | 1TB (1 billion records) |
| Period | 365 days |
| Spark Version | 3.4.0 |
| Instance | r5.4xlarge × 10 |
| File Format | Parquet (Snappy compression) |
Scenario 1: Single Date Query
-- Query: Retrieve data for a specific date
SELECT COUNT(*), AVG(amount)
FROM events
WHERE date = '2024-01-15';
| Partition Structure | File Count | Partition Discovery | Data Read | Total Time | Improvement |
|---|---|---|---|---|---|
| yyyy/mm/dd | 720 (2MB) | 5.4s | 48.3s | 53.7s | - |
| yyyy-mm-dd | 24 (128MB) | 0.9s | 12.1s | 13.0s | 4.1x |
| yyyymmdd | 24 (128MB) | 0.8s | 12.2s | 13.0s | 4.1x |
Detailed Metrics
# yyyy/mm/dd structure (anti-pattern)
{
"partition_discovery_ms": 5430,
"s3_list_calls": 4,
"s3_get_calls": 720,
"network_latency_ms": 18200,
"data_read_mb": 2880,
"executor_time_s": 48.3
}
# yyyy-mm-dd structure (optimized)
{
"partition_discovery_ms": 870,
"s3_list_calls": 2,
"s3_get_calls": 24,
"network_latency_ms": 2400,
"data_read_mb": 3072,
"executor_time_s": 12.1
}
Scenario 2: 7-Day Range Query
-- Query: Analyze last 7 days of data
SELECT date, COUNT(*), SUM(amount)
FROM events
WHERE date BETWEEN '2024-01-15' AND '2024-01-21'
GROUP BY date;
| Partition Structure | File Count | Partition Discovery | Data Read | Total Time | Improvement |
|---|---|---|---|---|---|
| yyyy/mm/dd | 5,040 | 38.2s | 4min 23s | 5min 1s | - |
| yyyy-mm-dd | 168 | 6.1s | 1min 24s | 1min 30s | 3.3x |
| yyyy-mm-dd + shard | 168 | 5.9s | 58.2s | 1min 4s | 4.7x |
Scenario 3: Full Table Scan
-- Query: Aggregate entire data (no partition pruning)
SELECT user_id, COUNT(*)
FROM events
GROUP BY user_id;
| Partition Structure | File Count | Partition Discovery | Data Read | Total Time | Improvement |
|---|---|---|---|---|---|
| yyyy/mm/dd | 262,800 | 6min 32s | 24min 18s | 30min 50s | - |
| yyyy-mm-dd | 8,760 | 1min 48s | 18min 52s | 20min 40s | 1.5x |
| yyyy-mm-dd (compacted) | 4,380 | 52.3s | 15min 23s | 16min 15s | 1.9x |
Athena Query Cost Comparison
-- Execute same query in Athena
SELECT *
FROM events
WHERE date = '2024-01-15';
| Partition Structure | Scanned Data | Execution Time | Cost (per query) |
|---|---|---|---|
| yyyy/mm/dd | 3.2 GB | 8.3s | $0.016 |
| yyyy-mm-dd | 3.0 GB | 2.1s | $0.015 |
| No Partition | 1,024 GB | 1min 34s | $5.12 |
Improvement Effect: single-level partitioning cuts execution time by roughly 75% versus yyyy/mm/dd (8.3s → 2.1s), and partition pruning reduces per-query cost by more than 99% versus scanning the unpartitioned table ($5.12 → $0.015).
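The cost figures follow directly from Athena's pay-per-data-scanned pricing. A quick sanity check of the table, assuming the commonly quoted $5 per TB on-demand rate (which may differ by region) and treating 1 TB as 1,000 GB to match the table:
# Athena on-demand pricing sanity check (assumption: $5 per TB scanned, 1 TB = 1,000 GB)
PRICE_PER_TB = 5.0

for label, scanned_gb in [("yyyy/mm/dd", 3.2), ("yyyy-mm-dd", 3.0), ("no partition", 1024)]:
    cost = scanned_gb / 1000 * PRICE_PER_TB
    print(f"{label:>13}: {scanned_gb:>7.1f} GB scanned -> ${cost:.3f} per query")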
🔧 Production Migration Guide
Step 1: Analyze Current State
# Analyze existing partition structure
import boto3
from collections import defaultdict
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
bucket = 'my-bucket'
prefix = 'data/events/'
# Aggregate file count and size per partition
stats = defaultdict(lambda: {"count": 0, "size": 0})
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        # Extract partition (e.g., year=2024/month=01/day=15)
        parts = obj['Key'].split('/')
        partition = '/'.join(parts[:-1])
        stats[partition]["count"] += 1
        stats[partition]["size"] += obj['Size']

# Output statistics
for partition, data in sorted(stats.items()):
    avg_size_mb = data["size"] / data["count"] / 1024 / 1024
    print(f"{partition}: {data['count']} files, avg {avg_size_mb:.1f}MB")
Analysis Result Example
year=2024/month=01/day=15: 720 files, avg 2.3MB ❌ Small files problem
year=2024/month=01/day=16: 680 files, avg 2.5MB ❌
year=2024/month=01/day=17: 740 files, avg 2.1MB ❌
Recommendation: Consolidation to 128MB files needed
Step 2: Migration Planning
| Task | Duration | Downtime | Priority |
|---|---|---|---|
| Partition Structure Redesign | 1-2 weeks | None | High |
| Compaction Job | Depends on data volume | None | High |
| Parallel Migration | 3-4 weeks | None | Medium |
| Query/Application Modification | 2-3 weeks | Planned deployment | High |
| Validation and Monitoring | 1-2 weeks | None | High |
Step 3: Migration Script
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, abs, hash, col
spark = SparkSession.builder \
.appName("S3 Partition Migration") \
.config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
.getOrCreate()
# Read existing data
source_path = "s3://bucket/data/events-old/year=*/month=*/day=*/"
df = spark.read.parquet(source_path)
# Create new date column
df = df.withColumn("date", date_format(col("event_time"), "yyyy-MM-dd"))
# Add shard (optional)
df = df.withColumn("shard", abs(hash(col("user_id"))) % 10)
# Optimize file size
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728) # 128MB
# Calculate appropriate partition count
total_size_gb = df.count() * 500 / 1024 / 1024 / 1024 # ~500 bytes per record
num_partitions = int(total_size_gb * 8) # Based on 128MB files
# Save with new structure
df.repartition(num_partitions, "date") \
.write \
.mode("overwrite") \
.partitionBy("date") \
.parquet("s3://bucket/data/events-new/")
Step 4: Incremental Migration
# Incremental migration by date
from datetime import datetime, timedelta
from pyspark.sql.functions import lit

start_date = datetime(2024, 1, 1)
end_date = datetime(2024, 12, 31)
current_date = start_date

while current_date <= end_date:
    year = current_date.year
    month = current_date.month
    day = current_date.day
    date_str = current_date.strftime("%Y-%m-%d")
    print(f"Processing {date_str}...")

    # Read data for the specific date
    old_path = f"s3://bucket/data/events-old/year={year}/month={month:02d}/day={day:02d}/"
    try:
        df = spark.read.parquet(old_path)
        df = df.withColumn("date", lit(date_str))

        # Optimize file size
        num_files = max(1, int(df.count() / 1000000))  # ~1M records per file
        df.repartition(num_files) \
            .write \
            .mode("overwrite") \
            .parquet(f"s3://bucket/data/events-new/date={date_str}/")
        print(f"✓ {date_str} completed")
    except Exception as e:
        print(f"✗ {date_str} failed: {e}")

    current_date += timedelta(days=1)
Step 5: Validation
# Migration validation script
def validate_migration(old_path, new_path, date):
    # Compare record counts
    old_df = spark.read.parquet(f"{old_path}/year={date.year}/month={date.month:02d}/day={date.day:02d}/")
    new_df = spark.read.parquet(f"{new_path}/date={date.strftime('%Y-%m-%d')}/")
    old_count = old_df.count()
    new_count = new_df.count()

    # Compare order-independent checksums over the full data
    # (independent random samples of old and new data would almost never hash to the same value)
    old_checksum = old_df.selectExpr("sum(hash(*)) as chk").collect()[0][0]
    new_checksum = new_df.selectExpr("sum(hash(*)) as chk").collect()[0][0]

    # Result
    if old_count == new_count and old_checksum == new_checksum:
        print(f"✓ {date}: Valid ({old_count:,} records)")
        return True
    else:
        print(f"✗ {date}: Invalid (old: {old_count:,}, new: {new_count:,})")
        return False
# Validate entire period
from datetime import datetime, timedelta
start_date = datetime(2024, 1, 1)
end_date = datetime(2024, 1, 31)
current_date = start_date
while current_date <= end_date:
    validate_migration(
        "s3://bucket/data/events-old",
        "s3://bucket/data/events-new",
        current_date
    )
    current_date += timedelta(days=1)
Step 6: Performance Monitoring
# Collect CloudWatch metrics
import boto3
from datetime import datetime, timedelta
cloudwatch = boto3.client('cloudwatch')
# Monitor Athena query execution time
response = cloudwatch.get_metric_statistics(
Namespace='AWS/Athena',
MetricName='EngineExecutionTime',
Dimensions=[
{'Name': 'WorkGroup', 'Value': 'primary'}
],
StartTime=datetime.now() - timedelta(days=7),
EndTime=datetime.now(),
Period=3600,
Statistics=['Average', 'Maximum']
)
# Analyze results (EngineExecutionTime is reported in milliseconds)
for datapoint in response['Datapoints']:
    print(f"{datapoint['Timestamp']}: "
          f"Avg={datapoint['Average'] / 1000:.2f}s, "
          f"Max={datapoint['Maximum'] / 1000:.2f}s")
Step 7: Cost Optimization
S3 Storage Class Transition
# Transition old partitions to Intelligent-Tiering
import boto3
s3 = boto3.client('s3')
def transition_old_partitions(bucket, prefix, days_old=90):
    # Lifecycle rule: objects under the prefix transition to Intelligent-Tiering after `days_old` days
    # Note: put_bucket_lifecycle_configuration replaces any existing lifecycle configuration on the bucket
    lifecycle_config = {
        'Rules': [
            {
                'Id': 'TransitionOldData',
                'Status': 'Enabled',
                'Filter': {'Prefix': prefix},
                'Transitions': [
                    {
                        'Days': days_old,
                        'StorageClass': 'INTELLIGENT_TIERING'
                    }
                ]
            }
        ]
    }
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=lifecycle_config
    )
    print(f"✓ Lifecycle policy applied: {days_old}+ days → INTELLIGENT_TIERING")
transition_old_partitions('my-bucket', 'data/events/', 90)
Cost Savings Effect
| Item | Before Migration | After Migration | Savings |
|---|---|---|---|
| S3 Storage | $23,040/month (Standard) | $20,736/month (Intelligent-Tiering) | 10% |
| S3 API Cost | $1,200/month (LIST/GET) | $360/month | 70% |
| Athena Scan | $512/month | $128/month | 75% |
| Spark Compute | $4,800/month | $3,200/month | 33% |
| Total Cost | $29,552/month | $24,424/month | 17% |
📚 Learning Summary
Key Points
- Architecture Understanding is Key
  - HDFS: Hierarchical file system, NameNode metadata
  - S3: Flat namespace object storage, List operation cost
- S3 Optimization Strategies
  - Shallow Structure: yyyy-mm-dd single level
  - Large Files: 64-256MB recommended
  - Prefix Distribution: Avoid request rate limits
- Performance Improvement Effects
  - Single Date Query: 4.1x faster
  - 7-Day Range Query: 4.7x faster (with shard)
  - Cost Savings: 17% reduction
- Migration Best Practices
  - Incremental migration
  - Thorough validation
  - Performance monitoring
Production Checklist
- [ ] Partition by a single-level date key (e.g., date=yyyy-mm-dd) instead of year/month/day
- [ ] Keep Parquet files in the 64-256MB range and schedule compaction for small files
- [ ] Distribute hot prefixes (e.g., hash-based shard) before hitting S3 request rate limits
- [ ] Migrate incrementally and validate record counts/checksums per date
- [ ] Monitor Athena/Spark query times and S3 request metrics after cutover
- [ ] Apply lifecycle policies to transition old partitions to Intelligent-Tiering
Additional Learning Resources
- AWS Official Docs: S3 Performance Best Practices
- Spark Optimization: Adaptive Query Execution (AQE)
- Parquet Optimization: Row Group size, Compression
- Iceberg/Delta Lake: Partitioning abstraction with table formats
“Proper partitioning strategy not only improves performance but also reduces costs and enhances operational efficiency.”
The transition from HDFS to S3 is not just a simple storage migration. When you understand the fundamental architectural differences and apply optimization strategies accordingly, you can realize the true value of a cloud-native data lake.