LSM Tree and Bloom Filter - Core Data Structures of Modern Databases
🌳 LSM Tree and Bloom Filter - Core Data Structures of Modern Databases
“Fast writes and efficient reads simultaneously - Modern databases powered by LSM Tree and Bloom Filter” - Core technologies of RocksDB, Cassandra, and HBase
Traditional B-Tree-based databases have limitations in random write performance. LSM Tree (Log-Structured Merge Tree) leverages sequential writes to improve write performance by orders of magnitude, while Bloom Filter dramatically reduces unnecessary disk I/O to optimize read performance. This post provides a complete guide to both technologies, from principles to actual database implementations.
📚 Table of Contents
- What is LSM Tree?
- LSM Tree Operation Principles
- Bloom Filter Fundamentals
- Bloom Filter Operation Principles
- Combining LSM Tree and Bloom Filter
- Real Database Implementations
- Performance Optimization Strategies
- Learning Summary
🎯 What is LSM Tree?
Limitations of Traditional B-Tree
Traditional B-Tree-based databases (MySQL, PostgreSQL, etc.) have the following problems:
Problem 1: Random Write Performance Degradation
# Random writes in B-Tree
# Each write must modify multiple disk locations
INSERT INTO users (id, name) VALUES (1, 'Alice'); # Modify disk location A
INSERT INTO users (id, name) VALUES (2, 'Bob'); # Modify disk location B
INSERT INTO users (id, name) VALUES (3, 'Charlie'); # Modify disk location C
# → Frequent disk head movement causes performance degradation
Problem 2: Write Amplification
B-Tree write process:
1. Read data page (disk I/O)
2. Modify page
3. Write page (disk I/O)
4. Update index page (additional disk I/O)
5. Write WAL (Write-Ahead Log) (additional disk I/O)
→ One write causes multiple disk I/Os
LSM Tree’s Solution
LSM Tree solves performance issues by leveraging sequential writes.
Core Idea
Traditional B-Tree:
Random writes → Disk head movement → Slow performance
LSM Tree:
1. Write sequentially to memory buffer (very fast)
2. When buffer is full, write sequentially to disk (fast)
3. Perform merge (Compaction) in background
Advantages of LSM Tree
| Feature | B-Tree | LSM Tree |
|---|---|---|
| Write Performance | Random writes (slow) | Sequential writes (fast) |
| Write Amplification | High (multiple page modifications) | Low (sequential writes) |
| Read Performance | Fast (direct index access) | Relatively slow (multi-level search) |
| Space Efficiency | High | Medium (duplicate data exists) |
| Compression | Limited | Efficient (compression during merge) |
🔄 LSM Tree Operation Principles
Basic Structure
LSM Tree consists of multiple levels:
LSM Tree Structure:
┌─────────────────────┐
│ MemTable (L0) │ ← Memory (most recent data)
│ (In-Memory) │
└─────────────────────┘
↓ (flush)
┌─────────────────────┐
│ SSTable (L1) │ ← Disk (small files)
│ (Sorted String) │
└─────────────────────┘
↓ (merge)
┌─────────────────────┐
│ SSTable (L2) │ ← Disk (larger files)
└─────────────────────┘
↓
┌─────────────────────┐
│ SSTable (L3) │ ← Disk (largest files)
└─────────────────────┘
1. Write Path
Step 1: Write to MemTable
# Simulating LSM Tree write process with Python
class MemTable:
def __init__(self, max_size=100):
self.data = {} # Sorted dictionary in memory
self.max_size = max_size
def put(self, key, value):
"""Write data to memory (very fast)"""
self.data[key] = value
# Flush when MemTable is full
if len(self.data) >= self.max_size:
return self.flush()
return None
def flush(self):
"""Flush MemTable to disk (SSTable)"""
# Convert sorted data to SSTable
sorted_data = sorted(self.data.items())
sstable = SSTable(sorted_data)
self.data.clear() # Reset MemTable
return sstable
# Usage example
memtable = MemTable(max_size=5)
# Write sequentially (very fast)
memtable.put("user:1", "Alice")
memtable.put("user:2", "Bob")
memtable.put("user:3", "Charlie")
memtable.put("user:4", "David")
memtable.put("user:5", "Eve") # → MemTable full → flush occurs
Step 2: Write to WAL (Write-Ahead Log)
class WAL:
"""Write-Ahead Log - log for disaster recovery"""
def __init__(self, log_file):
self.log_file = log_file
def append(self, key, value):
"""Write sequentially to WAL (very fast)"""
log_entry = f"{key}:{value}\n"
with open(self.log_file, 'a') as f:
f.write(log_entry)
def replay(self):
"""Replay WAL during disaster recovery"""
operations = []
with open(self.log_file, 'r') as f:
for line in f:
key, value = line.strip().split(':')
operations.append((key, value))
return operations
# Write process
wal = WAL("wal.log")
wal.append("user:1", "Alice") # Write to WAL first
memtable.put("user:1", "Alice") # Then write to MemTable
Step 3: Create SSTable
class SSTable:
"""Sorted String Table - sorted data stored on disk"""
def __init__(self, sorted_data):
self.data = dict(sorted_data)
self.min_key = min(self.data.keys())
self.max_key = max(self.data.keys())
self.size = len(self.data)
def get(self, key):
"""Query key from SSTable"""
return self.data.get(key)
def range_query(self, start_key, end_key):
"""Range query"""
result = {}
for key, value in self.data.items():
if start_key <= key <= end_key:
result[key] = value
return result
# MemTable flush → Create SSTable
sstable = memtable.flush()
print(f"SSTable created: {sstable.min_key} to {sstable.max_key}")
2. Read Path
LSM Tree reads must search multiple levels:
class LSMTree:
def __init__(self):
self.memtable = MemTable()
self.sstables = [] # List of SSTables by level
self.wal = WAL("wal.log")
def get(self, key):
"""Query key - search from most recent data"""
# 1. Search MemTable first (most recent)
if key in self.memtable.data:
return self.memtable.data[key]
# 2. Search SSTables in reverse order (newest first)
for sstable in reversed(self.sstables):
if sstable.min_key <= key <= sstable.max_key:
value = sstable.get(key)
if value is not None:
return value
# 3. Not found
return None
# Read example
lsm = LSMTree()
lsm.put("user:1", "Alice")
lsm.put("user:2", "Bob")
# Found immediately in MemTable (fast)
print(lsm.get("user:1")) # "Alice"
# After MemTable flush
lsm.memtable.flush()
# Found in SSTable (relatively slow)
print(lsm.get("user:1")) # "Alice"
3. Compaction Process
Compaction merges multiple SSTables to improve read performance:
class Compactor:
"""Responsible for SSTable merging"""
@staticmethod
def compact(sstables):
"""Merge multiple SSTables into one"""
merged_data = {}
# Merge data from all SSTables (newest value takes priority)
for sstable in reversed(sstables): # From newest
for key, value in sstable.data.items():
if key not in merged_data: # Keep only newest value
merged_data[key] = value
# Create new sorted SSTable
sorted_data = sorted(merged_data.items())
return SSTable(sorted_data)
@staticmethod
def level_compact(level_sstables, next_level_sstables):
"""Merge between levels (Leveled Compaction)"""
# Merge L1 SSTables with L2
all_sstables = level_sstables + next_level_sstables
return Compactor.compact(all_sstables)
# Compaction example
sstable1 = SSTable([("a", "1"), ("b", "2")])
sstable2 = SSTable([("b", "3"), ("c", "4")]) # b is updated
# Merge (newest value takes priority)
merged = Compactor.compact([sstable1, sstable2])
print(merged.data) # {'a': '1', 'b': '3', 'c': '4'}
Compaction Strategies
Size-Tiered Compaction
Merge SSTables of similar sizes by level
L1: [10MB, 10MB, 10MB] → merge → L2: [30MB]
L2: [30MB, 30MB, 30MB] → merge → L3: [90MB]
Leveled Compaction
Each level has maximum size limit
L1: Max 10 files, each 10MB
L2: Max 10 files, each 100MB
L3: Max 10 files, each 1GB
→ File size increases as level increases
🔍 Bloom Filter Fundamentals
What is Bloom Filter?
Bloom Filter is a probabilistic data structure that can quickly check element existence, but False Positives are possible while False Negatives are impossible.
Why is Bloom Filter Needed?
When searching for a key in LSM Tree:
# Key search without Bloom Filter
def get_without_bloom_filter(key):
# Must search all SSTables
for sstable in sstables:
if key in sstable: # Disk I/O occurs!
return sstable.get(key)
return None
# Problem: Non-existent keys also require searching all SSTables
# → Unnecessary disk I/O occurs
With Bloom Filter:
# Key search with Bloom Filter
def get_with_bloom_filter(key):
for sstable in sstables:
if sstable.bloom_filter.might_contain(key): # Quick check in memory
if key in sstable: # Disk I/O only when actually exists
return sstable.get(key)
return None
# Advantage: Non-existent keys are quickly rejected without disk I/O
Characteristics of Bloom Filter
| Feature | Description |
|---|---|
| Space Efficiency | Very small memory usage (bit array) |
| Speed | O(k) - k is number of hash functions |
| False Positive | Possible (reports existence when not present) |
| False Negative | Impossible (never reports absence when present) |
| Deletion | Not possible by default (possible with Counting Bloom Filter) |
🧮 Bloom Filter Operation Principles
Basic Structure
import mmh3 # MurmurHash3
import math
class BloomFilter:
"""Bloom Filter implementation"""
def __init__(self, capacity, error_rate=0.01):
"""
capacity: Expected number of elements
error_rate: Acceptable False Positive rate
"""
self.capacity = capacity
self.error_rate = error_rate
# Calculate bit array size
# m = -n * ln(p) / (ln(2)^2)
self.bit_array_size = int(-capacity * math.log(error_rate) / (math.log(2) ** 2))
# Calculate number of hash functions
# k = (m/n) * ln(2)
self.hash_count = int((self.bit_array_size / capacity) * math.log(2))
# Initialize bit array
self.bit_array = [0] * self.bit_array_size
def _hash(self, item, seed):
"""Hash function (using MurmurHash3)"""
return mmh3.hash(item, seed) % self.bit_array_size
def add(self, item):
"""Add element"""
for i in range(self.hash_count):
index = self._hash(item, i)
self.bit_array[index] = 1
def might_contain(self, item):
"""Check element existence possibility"""
for i in range(self.hash_count):
index = self._hash(item, i)
if self.bit_array[index] == 0:
return False # Definitely does not exist
return True # May exist (False Positive possible)
# Usage example
bloom = BloomFilter(capacity=1000, error_rate=0.01)
# Add elements
bloom.add("user:1")
bloom.add("user:2")
bloom.add("user:3")
# Check existence
print(bloom.might_contain("user:1")) # True
print(bloom.might_contain("user:999")) # False (definitely not present)
print(bloom.might_contain("user:1001")) # True or False (False Positive possible)
Hash Functions and Bit Array
Bloom Filter operation process:
1. Add element:
"user:1" → Hash1 → Index 5 → Bit[5] = 1
→ Hash2 → Index 12 → Bit[12] = 1
→ Hash3 → Index 23 → Bit[23] = 1
2. Check element:
"user:1" → Hash1 → Index 5 → Bit[5] == 1? ✓
→ Hash2 → Index 12 → Bit[12] == 1? ✓
→ Hash3 → Index 23 → Bit[23] == 1? ✓
→ "May exist" (True)
"user:999" → Hash1 → Index 7 → Bit[7] == 1? ✗
→ "Definitely does not exist" (False)
False Positive Rate Calculation
def calculate_false_positive_rate(n, m, k):
"""
n: Actual number of elements
m: Bit array size
k: Number of hash functions
"""
# False Positive probability = (1 - e^(-kn/m))^k
import math
return (1 - math.exp(-k * n / m)) ** k
# Example
n = 1000 # 1000 elements
m = 9585 # Bit array size (when error_rate=0.01)
k = 7 # Number of hash functions
fp_rate = calculate_false_positive_rate(n, m, k)
print(f"False Positive Rate: {fp_rate:.4f}") # Approximately 0.01 (1%)
Optimal Parameter Selection
def optimal_bloom_filter_params(capacity, error_rate):
"""Calculate optimal Bloom Filter parameters"""
import math
# Bit array size
m = int(-capacity * math.log(error_rate) / (math.log(2) ** 2))
# Number of hash functions
k = int((m / capacity) * math.log(2))
return {
'bit_array_size': m,
'hash_count': k,
'memory_bytes': m / 8, # Convert bits to bytes
'bits_per_element': m / capacity
}
# Example: 1 million elements, 1% error rate
params = optimal_bloom_filter_params(1_000_000, 0.01)
print(f"Bit array size: {params['bit_array_size']:,} bits")
print(f"Number of hash functions: {params['hash_count']}")
print(f"Memory usage: {params['memory_bytes'] / 1024 / 1024:.2f} MB")
print(f"Bits per element: {params['bits_per_element']:.2f}")
# Output:
# Bit array size: 9,585,058 bits
# Number of hash functions: 7
# Memory usage: 1.14 MB
# Bits per element: 9.59
🔗 Combining LSM Tree and Bloom Filter
Integrated Architecture
class SSTableWithBloomFilter:
"""SSTable with Bloom Filter"""
def __init__(self, sorted_data):
self.data = dict(sorted_data)
self.min_key = min(self.data.keys())
self.max_key = max(self.data.keys())
# Create Bloom Filter
self.bloom_filter = BloomFilter(
capacity=len(self.data),
error_rate=0.01
)
# Add all keys to Bloom Filter
for key in self.data.keys():
self.bloom_filter.add(key)
def get(self, key):
"""Query after filtering with Bloom Filter"""
# 1. Quick filtering with Bloom Filter
if not self.bloom_filter.might_contain(key):
return None # Definitely not present → No disk I/O
# 2. Actual data query (disk I/O occurs)
return self.data.get(key)
class OptimizedLSMTree:
"""LSM Tree integrated with Bloom Filter"""
def __init__(self):
self.memtable = MemTable()
self.sstables = []
self.wal = WAL("wal.log")
def get(self, key):
"""Optimized read - leveraging Bloom Filter"""
# 1. Search MemTable (memory, fastest)
if key in self.memtable.data:
return self.memtable.data[key]
# 2. Search SSTables (filtered with Bloom Filter)
for sstable in reversed(self.sstables):
# Check with Bloom Filter first (memory, fast)
if sstable.bloom_filter.might_contain(key):
# May actually exist → Query from disk
value = sstable.get(key)
if value is not None:
return value
return None
# Performance comparison
def benchmark_read_performance():
"""Read performance benchmark"""
import time
lsm_without_bf = LSMTree() # Without Bloom Filter
lsm_with_bf = OptimizedLSMTree() # With Bloom Filter
# Prepare test data
for i in range(10000):
lsm_without_bf.put(f"key:{i}", f"value:{i}")
lsm_with_bf.put(f"key:{i}", f"value:{i}")
# Search for non-existent keys (biggest difference)
start = time.time()
for i in range(10000, 20000):
lsm_without_bf.get(f"key:{i}") # Search all SSTables
time_without_bf = time.time() - start
start = time.time()
for i in range(10000, 20000):
lsm_with_bf.get(f"key:{i}") # Quick filtering with Bloom Filter
time_with_bf = time.time() - start
print(f"Without Bloom Filter: {time_without_bf:.4f} seconds")
print(f"With Bloom Filter: {time_with_bf:.4f} seconds")
print(f"Performance improvement: {time_without_bf / time_with_bf:.2f}x")
# benchmark_read_performance()
# Example output:
# Without Bloom Filter: 2.3456 seconds
# With Bloom Filter: 0.1234 seconds
# Performance improvement: 19.01x
Performance Improvement Effects
| Scenario | Without Bloom Filter | With Bloom Filter | Improvement |
|---|---|---|---|
| Existing Key | Search all levels | Search all levels | Similar |
| Non-existent Key | Search all levels | Immediately rejected in memory | 10-100x |
| Partially Existing Key | Search some levels | Search some levels | 2-5x |
🗄️ Real Database Implementations
RocksDB
RocksDB is an LSM Tree-based storage engine developed by Facebook.
RocksDB Architecture
RocksDB Structure:
┌─────────────────────┐
│ MemTable │ ← SkipList (memory)
│ (SkipList) │
└─────────────────────┘
↓
┌─────────────────────┐
│ Immutable MemTable│ ← Waiting for flush
└─────────────────────┘
↓
┌─────────────────────┐
│ L0 SSTable │ ← Level 0 (multiple files)
│ (Bloom Filter) │
└─────────────────────┘
↓
┌─────────────────────┐
│ L1 SSTable │ ← Level 1
│ (Bloom Filter) │
└─────────────────────┘
↓
┌─────────────────────┐
│ L2+ SSTable │ ← Level 2 and above
│ (Bloom Filter) │
└─────────────────────┘
RocksDB Usage Example
import rocksdb
# Open RocksDB
db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
# Write (very fast - sequential write)
db.put(b"user:1", b"Alice")
db.put(b"user:2", b"Bob")
db.put(b"user:3", b"Charlie")
# Read (optimized with Bloom Filter)
value = db.get(b"user:1")
print(value) # b"Alice"
# Batch write (even faster)
batch = rocksdb.WriteBatch()
batch.put(b"user:4", b"David")
batch.put(b"user:5", b"Eve")
db.write(batch)
# Range scan
it = db.iteritems()
it.seek(b"user:")
for key, value in it:
print(f"{key}: {value}")
Apache Cassandra
Cassandra is a distributed NoSQL database that uses LSM Tree.
Cassandra’s LSM Tree Implementation
Cassandra Structure:
┌─────────────────────┐
│ Memtable │ ← Memory of each node
│ (ConcurrentSkipList)│
└─────────────────────┘
↓
┌─────────────────────┐
│ Commit Log │ ← WAL (disaster recovery)
└─────────────────────┘
↓
┌─────────────────────┐
│ SSTable (L0) │ ← Multiple SSTables
│ (Bloom Filter) │
└─────────────────────┘
↓
┌─────────────────────┐
│ Compaction │ ← Size-Tiered Compaction
└─────────────────────┘
HBase
HBase is a Hadoop-based distributed database.
HBase’s LSM Tree Implementation
HBase Structure:
┌─────────────────────┐
│ MemStore │ ← RegionServer memory
│ (ConcurrentSkipList)│
└─────────────────────┘
↓
┌─────────────────────┐
│ WAL (HLog) │ ← Stored in HDFS
└─────────────────────┘
↓
┌─────────────────────┐
│ HFile (SSTable) │ ← Stored in HDFS
│ (Bloom Filter) │
└─────────────────────┘
⚡ Performance Optimization Strategies
1. Bloom Filter Optimization
Adaptive Bloom Filter Size
class AdaptiveBloomFilter:
"""Bloom Filter that automatically adjusts based on data size"""
def __init__(self, initial_capacity=1000):
self.capacity = initial_capacity
self.bloom_filter = BloomFilter(initial_capacity)
self.actual_count = 0
def add(self, item):
"""Auto-expand capacity when adding elements"""
self.bloom_filter.add(item)
self.actual_count += 1
# Expand when actual count exceeds 80% of expected capacity
if self.actual_count >= self.capacity * 0.8:
self._expand()
def _expand(self):
"""Expand Bloom Filter"""
old_filter = self.bloom_filter
self.capacity *= 2
self.bloom_filter = BloomFilter(self.capacity)
# Reconstruct existing data (actually done during SSTable regeneration)
# Simulation here
print(f"Bloom Filter expanded to {self.capacity}")
Block-Level Bloom Filter
class BlockBloomFilter:
"""Separate Bloom Filter for each block of SSTable"""
def __init__(self, block_size=4096):
self.block_size = block_size
self.block_filters = {} # Bloom Filter per block
def add_to_block(self, key, block_id):
"""Add to specific block's Bloom Filter"""
if block_id not in self.block_filters:
self.block_filters[block_id] = BloomFilter(
capacity=self.block_size
)
self.block_filters[block_id].add(key)
def might_contain_in_block(self, key, block_id):
"""Check if key exists in specific block"""
if block_id not in self.block_filters:
return False
return self.block_filters[block_id].might_contain(key)
2. Compaction Optimization
Compaction Strategy Selection
class CompactionStrategy:
"""Compaction strategy selection"""
@staticmethod
def size_tiered_compact(sstables):
"""Size-Tiered: Merge files of similar sizes"""
# Group SSTables by similar size range
size_groups = {}
for sstable in sstables:
size_range = sstable.size // 1000 # Group by 1KB units
if size_range not in size_groups:
size_groups[size_range] = []
size_groups[size_range].append(sstable)
# Merge if 4 or more in each group
merged = []
for size_range, group in size_groups.items():
if len(group) >= 4:
merged.append(Compactor.compact(group))
else:
merged.extend(group)
return merged
@staticmethod
def leveled_compact(levels):
"""Leveled: Size limit per level"""
for level, sstables in enumerate(levels):
max_files = 10 * (level + 1) # More files allowed at higher levels
if len(sstables) > max_files:
# Merge excess files with next level
return Compactor.level_compact(
sstables[:max_files],
levels[level + 1] if level + 1 < len(levels) else []
)
return levels
3. Memory Management
MemTable Size Limit
class MemoryAwareMemTable(MemTable):
"""MemTable that monitors memory usage"""
def __init__(self, max_size=100, max_memory_mb=100):
super().__init__(max_size)
self.max_memory_mb = max_memory_mb
self.current_memory_mb = 0
def put(self, key, value):
"""Check memory usage"""
key_size = len(str(key).encode())
value_size = len(str(value).encode())
entry_size_mb = (key_size + value_size) / (1024 * 1024)
if self.current_memory_mb + entry_size_mb > self.max_memory_mb:
# Memory exceeded → Flush immediately
return self.flush()
self.current_memory_mb += entry_size_mb
return super().put(key, value)
def flush(self):
"""Reset memory after flush"""
result = super().flush()
self.current_memory_mb = 0
return result
📚 Learning Summary
Key Points
- Advantages of LSM Tree
- Excellent write performance with sequential writes
- Reduced Write Amplification
- Improved compression efficiency
- Suitable for large-scale data processing
- Role of Bloom Filter
- Quickly filter non-existent keys
- Eliminate unnecessary disk I/O
- Significant performance improvement with small memory
- Real Applications
- RocksDB: Embedded database
- Cassandra: Distributed NoSQL
- HBase: Big data storage
Performance Comparison Summary
| Operation | B-Tree | LSM Tree (No Bloom Filter) | LSM Tree (With Bloom Filter) |
|---|---|---|---|
| Write | Slow (random) | Very fast (sequential) | Very fast (sequential) |
| Read (exists) | Fast | Medium | Medium |
| Read (not exists) | Fast | Slow | Very fast |
Practical Checklist
- Is it a write-heavy workload? → Consider LSM Tree
- Is read performance important? → Bloom Filter essential
- Are there memory constraints? → Adjust Bloom Filter size
- Choose Compaction strategy (Size-Tiered vs Leveled)
- Configure WAL and disaster recovery strategy
- Set up monitoring and tuning metrics
Next Steps
- Distributed LSM Tree: Distributed architecture of Cassandra, HBase
- Advanced Compaction: Tiered Compaction, Universal Compaction
- Performance Tuning: Memory allocation, Compaction thread count
- Monitoring: Write latency, read latency, Compaction latency
“LSM Tree and Bloom Filter are core technologies that determine the performance of modern databases.”
LSM Tree is essential for workloads where write performance is important, and Bloom Filter significantly improves read performance. Many modern databases like RocksDB, Cassandra, and HBase achieve excellent performance by combining these two technologies. I hope this guide helps you understand and optimize database internal structures!