Part 3: Transformer-Based Time Series Forecasting Models
Welcome to the third part of our time series forecasting series! In this installment, we’ll explore how the revolutionary Transformer architecture, which transformed natural language processing, has been adapted for time series forecasting.
📚 Learning Objectives
By the end of this part, you will learn:
- The background and necessity of applying Transformers to time series forecasting
- Key characteristics of major models: Informer, Autoformer, FEDformer, PatchTST
- Strengths and weaknesses of each model and their application domains
- Hands-on practice with transformer-based time series forecasting using real data
🔍 Background: Why Transformers for Time Series?
Limitations of Previous Methods
RNN/LSTM Problems:
- Time steps are processed one after another, so training is slow
- Long-term dependencies are hard to learn (gradients fade over many steps)
- Computation cannot be parallelized across time steps
CNN Limitations:
- Focus only on local patterns
- Difficulty capturing long-range dependencies
Transformer Advantages
- Parallel Processing: Process all time points simultaneously
- Long-range Dependencies: Learn relationships in long sequences through self-attention
- Scalability: Performance improvement with larger models and datasets
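To make the first two advantages concrete, here is a minimal sketch of single-head scaled dot-product self-attention. The function name `self_attention` and the random projection matrices are illustrative assumptions, not part of any library; real models use learned, multi-head projections.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x: (batch, seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every time step is scored against every other step in one matrix product.
    # This is the source of both the parallelism and the O(L^2) cost.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, L, L)
    weights = torch.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(2, 96, 16)   # 2 series, 96 time steps, 16 features per step
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))  # hypothetical weights
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The `(L, L)` weight matrix is what lets step 1 attend directly to step 96, and it is also the quadratic cost that the models below try to reduce.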
🚀 Major Transformer-Based Time Series Models
1. Informer (2021)
Core Ideas:
- ProbSparse Self-attention reduces complexity from O(L²) to O(L log L)
- Self-attention Distilling compresses information across layers
- Generative Decoder predicts long sequences at once
Key Features:
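# Informer's core structure (schematic: ProbAttention, Encoder, and Decoder
# are assumed to come from an Informer implementation and are not defined here)
import torch.nn as nn

class Informer(nn.Module):
    def __init__(self, enc_in, dec_in, c_out, seq_len, label_len, out_len):
        super().__init__()
        self.enc_in = enc_in        # number of encoder input variables
        self.dec_in = dec_in        # number of decoder input variables
        self.c_out = c_out          # number of output variables
        self.seq_len = seq_len      # input sequence length
        self.label_len = label_len  # length of the known start segment fed to the decoder
        self.out_len = out_len      # prediction length
        # Uses ProbSparse Attention
        self.attn = ProbAttention(attention_dropout=0.1)
        # Encoder and Decoder (arguments elided)
        self.encoder = Encoder(...)
        self.decoder = Decoder(...)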
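The query-selection idea behind ProbSparse attention can be sketched as follows. This is a simplified illustration under stated assumptions: it computes the full score matrix for clarity (so it does not achieve the O(L log L) cost itself, which in Informer comes from estimating the measurement on sampled keys), it covers only the unmasked encoder case, and the function name `probsparse_attention` is hypothetical.

```python
import math
import torch

def probsparse_attention(q, k, v, factor=5):
    """Simplified ProbSparse attention; q, k, v have shape (batch, L, d)."""
    B, L, d = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)               # (B, L, L)
    # Sparsity measurement: max score minus mean score for each query.
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)    # (B, L)
    # Keep only u = factor * ceil(ln L) "active" queries.
    u = min(L, max(1, factor * math.ceil(math.log(L))))
    top_idx = sparsity.topk(u, dim=-1).indices                    # (B, u)
    # "Lazy" queries fall back to the mean of the values.
    out = v.mean(dim=1, keepdim=True).expand(B, L, v.size(-1)).clone()
    # Active queries get ordinary softmax attention.
    top_scores = scores.gather(1, top_idx.unsqueeze(-1).expand(B, u, L))
    top_out = torch.softmax(top_scores, dim=-1) @ v               # (B, u, d)
    out.scatter_(1, top_idx.unsqueeze(-1).expand(B, u, v.size(-1)), top_out)
    return out, top_idx

torch.manual_seed(0)
q, k, v = (torch.randn(2, 96, 8) for _ in range(3))
out, top_idx = probsparse_attention(q, k, v)
print(out.shape, top_idx.shape)   # 25 of the 96 queries are treated as active
```

With L = 96 and `factor=5`, only 5 * ceil(ln 96) = 25 queries receive full attention; the remaining 71 share the averaged value vector.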
Advantages:
- Efficient on long sequences
- Excellent performance across diverse datasets
Disadvantages:
- Long training time due to complex structure
- Difficult hyperparameter tuning
2. Autoformer (2021)
Core Ideas:
- Auto-Correlation mechanism automatically learns periodicity in time series
- Decomposition Block separates trend and seasonality
- Series-wise Connection minimizes information loss
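The decomposition idea can be illustrated with a moving average: the smoothed series is treated as the trend and the remainder as the seasonal part. The sketch below is an assumption-laden illustration (the function name `series_decomp` and the kernel size of 25 are choices made here, and an odd kernel size is assumed), not the exact block from the paper.

```python
import math
import torch
import torch.nn.functional as F

def series_decomp(x, kernel_size=25):
    """Split x of shape (batch, seq_len, channels) into (seasonal, trend)."""
    pad = (kernel_size - 1) // 2
    # Repeat the edge values so the moving average keeps the original length.
    front = x[:, :1, :].repeat(1, pad, 1)
    end = x[:, -1:, :].repeat(1, pad, 1)
    padded = torch.cat([front, x, end], dim=1)
    # avg_pool1d pools over the last dimension, so move time there and back.
    trend = F.avg_pool1d(padded.permute(0, 2, 1), kernel_size, stride=1)
    trend = trend.permute(0, 2, 1)
    return x - trend, trend

t = torch.arange(96, dtype=torch.float32)
x = (0.05 * t + torch.sin(2 * math.pi * t / 24)).reshape(1, 96, 1)
seasonal, trend = series_decomp(x)
print(seasonal.shape, trend.shape)
```

By construction `seasonal + trend` reproduces the input, so the split loses no information; it only separates slow and fast components.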
Key Features:
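# Autoformer's Auto-Correlation (schematic: the helper methods are not defined here)
class AutoCorrelation(nn.Module):
    def forward(self, queries, keys, values):
        # Calculate correlation to find periodicity in time series
        autocorr = self.autocorrelation(queries, keys)
        # Simplified weighting; the full mechanism aggregates time-delayed
        # copies of the values for the lags with the highest correlation
        return self.value_projection(values) * autocorr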
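One way to see how correlation reveals periodicity is to compute the autocorrelation with an FFT (the Wiener-Khinchin relation). The sketch below is a standalone illustration on a synthetic sine wave; the function name `autocorrelation` is an assumption, and the result is the circular autocorrelation of a single series rather than the query-key correlation used inside the model.

```python
import math
import torch

def autocorrelation(x):
    """Circular autocorrelation along the last (time) dimension, scaled so lag 0 is 1."""
    spec = torch.fft.rfft(x, dim=-1)
    # Wiener-Khinchin: the inverse FFT of the power spectrum is the autocorrelation.
    corr = torch.fft.irfft(spec * torch.conj(spec), n=x.size(-1), dim=-1)
    return corr / corr[..., :1]

t = torch.arange(240, dtype=torch.float32)
x = torch.sin(2 * math.pi * t / 24)   # synthetic series with period 24
corr = autocorrelation(x)
print(corr[24].item(), corr[12].item())
```

For this series the correlation is close to 1 at lag 24 (one full period) and close to -1 at lag 12 (half a period), which is the kind of signal a period-finding mechanism can rank lags by.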
Advantages:
- Automatic learning of time series periodicity
- Improved interpretability through trend and seasonality decomposition
Disadvantages:
- Limited performance on data without periodicity
- Difficulty learning complex patterns
3. FEDformer (2022)
Core Ideas:
- Frequency Enhanced Decomposed Transformer
- Attention in frequency domain using FFT
- Mixture-of-experts decomposition (an ensemble of moving-average filters) for trend extraction
Key Features:
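# FEDformer's Fourier Attention (schematic: frequency_attention is not defined here;
# the time axis is assumed to be the last dimension)
class FourierAttention(nn.Module):
    def forward(self, x):
        # Transform to frequency domain using FFT
        x_freq = torch.fft.rfft(x, dim=-1)
        # Calculate attention in frequency domain
        attn_freq = self.frequency_attention(x_freq)
        # Inverse transform back to time domain; pass n so that
        # odd-length inputs are restored to their original length
        return torch.fft.irfft(attn_freq, n=x.size(-1), dim=-1)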
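A small, hedged illustration of working on a subset of frequency modes: transform the series, keep a few Fourier components, and transform back. This sketch keeps the lowest modes so the result is deterministic and applies no learned weights; it is not FEDformer's block, and the name `keep_low_modes` is an assumption.

```python
import math
import torch

def keep_low_modes(x, n_modes):
    """Keep the n_modes lowest-frequency components along the last (time) dimension."""
    x_freq = torch.fft.rfft(x, dim=-1)
    mask = torch.zeros_like(x_freq)
    mask[..., :n_modes] = 1            # modes 0 .. n_modes-1 survive
    return torch.fft.irfft(x_freq * mask, n=x.size(-1), dim=-1)

t = torch.arange(64, dtype=torch.float32)
slow = torch.sin(2 * math.pi * 2 * t / 64)          # lives in mode 2
fast = 0.5 * torch.sin(2 * math.pi * 20 * t / 64)   # lives in mode 20
filtered = keep_low_modes(slow + fast, n_modes=5)
print((filtered - slow).abs().max().item())
```

Because the fast component sits entirely in mode 20, keeping five modes removes it and leaves the slow component, so the printed difference is near zero. Operating on a fixed number of modes is also what keeps the cost of frequency-domain layers from growing with every extra time step.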
Advantages:
- Efficient processing in frequency domain
- Can learn various periodic patterns
Disadvantages:
- FFT computation cost
- May not be suitable for real-time prediction
4. PatchTST (2023)
Core Ideas:
- Process time series in patch units
- Channel Independence for multivariate time series
- Simple structure with excellent performance
Key Features:
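# PatchTST's patch creation
def create_patch(x, patch_len, stride):
    # Split time series into patches along the last (time) dimension.
    # Note: Tensor.unfold takes `dimension`, not `dim`.
    patches = x.unfold(dimension=-1, size=patch_len, step=stride)
    # (..., num_patches, patch_len) -> (..., patch_len, num_patches)
    return patches.transpose(-1, -2)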
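As a worked example of the patching arithmetic, a 96-step input with patch length 16 and stride 8 yields (96 - 16) // 8 + 1 = 11 patches, so the Transformer sees 11 tokens per series instead of 96. The helper name `num_patches` is an assumption used only for this sketch.

```python
import torch

def num_patches(seq_len, patch_len, stride):
    """Number of patches produced when no padding is applied."""
    return (seq_len - patch_len) // stride + 1

# (batch=2, features=5, seq_len=96); each series simply counts 0..95
x = torch.arange(96, dtype=torch.float32).repeat(2, 5, 1)
patches = x.unfold(dimension=-1, size=16, step=8)   # (2, 5, 11, 16)
print(num_patches(96, 16, 8), patches.shape)
print(patches[0, 0, 1, :3])   # second patch starts at time step 8
```

Since attention cost is quadratic in the number of tokens, going from 96 tokens to 11 shrinks the attention matrix from 96 x 96 to 11 x 11 per series.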
Advantages:
- Simple and efficient structure
- Excellent performance on multivariate time series
- Fast training and inference
Disadvantages:
- Sensitive to patch size
- Limited on very long sequences
🛠️ Hands-on Practice: Stock Price Prediction with PatchTST
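Now let’s implement a PatchTST model. To keep the example self-contained, we use synthetic multivariate data that stands in for stock prices; swap in real data (for example, from yfinance) for actual use.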
1. Data Preparation
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

# Generate stock data (use yfinance for real data)
def generate_stock_data(n_samples=1000, n_features=5):
    """Generate synthetic stock data"""
    np.random.seed(42)
    # Generate data with trend and seasonality
    t = np.linspace(0, 4*np.pi, n_samples)
    trend = 0.01 * t
    seasonal = 0.5 * np.sin(t) + 0.3 * np.sin(2*t)
    noise = 0.1 * np.random.randn(n_samples)
    # Generate multivariate time series
    data = np.zeros((n_samples, n_features))
    for i in range(n_features):
        data[:, i] = trend + seasonal + noise + i*0.1
    return pd.DataFrame(data, columns=[f'stock_{i+1}' for i in range(n_features)])

# Generate data
data = generate_stock_data()
print(f"Data shape: {data.shape}")
print(data.head())
2. PatchTST Model Implementation
class PatchTST(nn.Module):
    def __init__(self, seq_len, pred_len, patch_len, stride, n_features,
                 d_model=128, n_heads=8, n_layers=3):
        super().__init__()
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.patch_len = patch_len
        self.stride = stride
        self.n_features = n_features
        self.d_model = d_model
        # Calculate number of patches
        self.num_patches = (seq_len - patch_len) // stride + 1
        # Input projection
        self.input_projection = nn.Linear(patch_len, d_model)
        # Learnable positional encoding
        self.pos_encoding = nn.Parameter(torch.randn(1, self.num_patches, d_model))
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model*4,
            dropout=0.1,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Output projection
        self.output_projection = nn.Linear(d_model, pred_len)

    def create_patches(self, x):
        """Split time series into patches"""
        # x: (batch_size, n_features, seq_len)
        # Tensor.unfold takes `dimension` (not `dim`) and slides over the
        # time axis of every feature at once.
        # Result: (batch_size, n_features, num_patches, patch_len)
        return x.unfold(dimension=-1, size=self.patch_len, step=self.stride)

    def forward(self, x):
        # x: (batch_size, n_features, seq_len)
        n_features = x.shape[1]
        # Create patches
        patches = self.create_patches(x)  # (batch_size, n_features, num_patches, patch_len)
        # Process each feature independently (Channel Independence);
        # all features share the same weights
        outputs = []
        for i in range(n_features):
            feature_patches = patches[:, i, :, :]  # (batch_size, num_patches, patch_len)
            # Input projection
            projected = self.input_projection(feature_patches)  # (batch_size, num_patches, d_model)
            # Add positional encoding
            projected = projected + self.pos_encoding
            # Transformer encoder
            encoded = self.transformer(projected)  # (batch_size, num_patches, d_model)
            # Global average pooling over patches (a simplification of the forecasting head)
            pooled = encoded.mean(dim=1)  # (batch_size, d_model)
            # Output projection
            output = self.output_projection(pooled)  # (batch_size, pred_len)
            outputs.append(output)
        # Combine outputs from all features
        final_output = torch.stack(outputs, dim=1)  # (batch_size, n_features, pred_len)
        return final_output

# Model parameters
seq_len = 96      # Input sequence length
pred_len = 24     # Prediction length
patch_len = 16    # Patch length
stride = 8        # Stride
n_features = 5    # Number of features

model = PatchTST(
    seq_len=seq_len,
    pred_len=pred_len,
    patch_len=patch_len,
    stride=stride,
    n_features=n_features
)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
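Channel independence does not require a Python loop over features. Because every channel shares the same weights, the channel axis can be folded into the batch axis and processed in one pass. The sketch below shows that reshaping trick with a small stand-in backbone; `apply_per_channel` and the `backbone` layers are hypothetical and chosen only to keep the example short.

```python
import torch
import torch.nn as nn

def apply_per_channel(patches, backbone):
    """Run one shared backbone over every channel by folding channels into the batch.

    patches: (batch, n_features, num_patches, patch_len)
    """
    b, c, n, p = patches.shape
    flat = patches.reshape(b * c, n, p)      # each channel becomes its own sample
    out = backbone(flat)                     # (b * c, ...)
    return out.reshape(b, c, *out.shape[1:])

torch.manual_seed(0)
# Stand-in backbone: project patches, flatten all patch embeddings, map to 24 steps.
backbone = nn.Sequential(
    nn.Linear(16, 32),
    nn.Flatten(start_dim=1),
    nn.Linear(11 * 32, 24),
)
patches = torch.randn(4, 5, 11, 16)
forecast = apply_per_channel(patches, backbone)
print(forecast.shape)   # torch.Size([4, 5, 24])
```

The result matches a per-feature loop over the same backbone, but a single large batch usually makes better use of the GPU than several small ones.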
3. Data Preprocessing and Training
def prepare_data(data, seq_len, pred_len, train_ratio=0.7, val_ratio=0.2):
    """Prepare data for training"""
    values = np.asarray(data, dtype=np.float64)
    # Normalization: fit the scaler on the first train_ratio of the rows only,
    # so that validation and test statistics do not leak into training
    scaler = StandardScaler()
    scaler.fit(values[:int(len(values) * train_ratio)])
    scaled_data = scaler.transform(values)
    # Create sequences
    X, y = [], []
    for i in range(len(scaled_data) - seq_len - pred_len + 1):
        X.append(scaled_data[i:i+seq_len])
        y.append(scaled_data[i+seq_len:i+seq_len+pred_len])
    X = np.array(X)
    y = np.array(y)
    # Train/validation/test split (chronological, no shuffling)
    n_train = int(len(X) * train_ratio)
    n_val = int(len(X) * val_ratio)
    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:n_train+n_val], y[n_train:n_train+n_val]
    X_test, y_test = X[n_train+n_val:], y[n_train+n_val:]
    return (X_train, y_train), (X_val, y_val), (X_test, y_test), scaler

# Prepare data
(X_train, y_train), (X_val, y_val), (X_test, y_test), scaler = prepare_data(
    data, seq_len, pred_len
)
print(f"Train data: {X_train.shape}, {y_train.shape}")
print(f"Validation data: {X_val.shape}, {y_val.shape}")
print(f"Test data: {X_test.shape}, {y_test.shape}")

# Create DataLoader
def create_dataloader(X, y, batch_size=32, shuffle=True):
    X_tensor = torch.FloatTensor(X).transpose(1, 2)  # (batch, features, seq_len)
    y_tensor = torch.FloatTensor(y).transpose(1, 2)  # (batch, features, pred_len)
    dataset = TensorDataset(X_tensor, y_tensor)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

train_loader = create_dataloader(X_train, y_train, batch_size=32)
val_loader = create_dataloader(X_val, y_val, batch_size=32, shuffle=False)
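The windowing step can also be written without a Python loop. The sketch below uses NumPy's `sliding_window_view` (available in NumPy 1.20 and later) as an alternative; the function name `make_windows` and the toy `series` array are assumptions for illustration. With 1000 rows, `seq_len=96`, and `pred_len=24`, there are 1000 - 96 - 24 + 1 = 881 windows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, seq_len, pred_len):
    """Build (input, target) windows from series of shape (n_steps, n_features)."""
    # The window axis is appended last: (n_windows, n_features, seq_len + pred_len)
    windows = sliding_window_view(series, seq_len + pred_len, axis=0)
    return windows[:, :, :seq_len], windows[:, :, seq_len:]

series = np.arange(1000 * 5, dtype=np.float32).reshape(1000, 5)  # toy data
X, y = make_windows(series, seq_len=96, pred_len=24)
print(X.shape, y.shape)   # (881, 5, 96) (881, 5, 24)
```

The output is already laid out as (windows, features, time), so no transpose is needed before feeding a model that expects that layout. The arrays are read-only views into `series`; call `.copy()` before converting them to tensors or modifying them.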
4. Model Training
def train_model(model, train_loader, val_loader, epochs=50, lr=0.001):
    """Train the model"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
    train_losses = []
    val_losses = []
    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0
        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_X, batch_y in val_loader:
                batch_X, batch_y = batch_X.to(device), batch_y.to(device)
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)
                val_loss += loss.item()
        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        # Halve the learning rate when validation loss stops improving
        scheduler.step(val_loss)
        if epoch % 10 == 0:
            print(f'Epoch {epoch:3d}: Train Loss = {train_loss:.6f}, Val Loss = {val_loss:.6f}')
    return train_losses, val_losses

# Train model
print("Starting model training...")
train_losses, val_losses = train_model(model, train_loader, val_loader, epochs=100)
print("Training completed!")
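The loop above always runs for the full number of epochs. One common complement, sketched here as an optional extension rather than part of the original recipe, is to stop once the validation loss has not improved for a set number of epochs. The `EarlyStopping` class and its `patience` and `min_delta` parameters are hypothetical names introduced for this example.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation losses: improvement stops after the second epoch.
stopper = EarlyStopping(patience=3)
history = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98]
stopped_at = next((epoch for epoch, loss in enumerate(history) if stopper.step(loss)), None)
print(stopped_at, stopper.best)
```

Inside a training loop, this would amount to calling `stopper.step(val_loss)` after each validation pass and breaking out of the epoch loop when it returns `True`.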
5. Results Visualization
def plot_results(model, test_loader, scaler, n_samples=3):
    """Visualize results"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.eval()
    with torch.no_grad():
        for i, (batch_X, batch_y) in enumerate(test_loader):
            if i >= n_samples:
                break
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            predictions = model(batch_X)
            # Visualize only the first sample
            X_sample = batch_X[0].cpu().numpy().T     # (seq_len, n_features)
            y_true = batch_y[0].cpu().numpy().T       # (pred_len, n_features)
            y_pred = predictions[0].cpu().numpy().T   # (pred_len, n_features)
            # Inverse normalization
            X_sample = scaler.inverse_transform(X_sample)
            y_true = scaler.inverse_transform(y_true)
            y_pred = scaler.inverse_transform(y_pred)
            # Visualization
            fig, axes = plt.subplots(2, 3, figsize=(15, 10))
            axes = axes.flatten()
            for j in range(min(5, n_features)):
                ax = axes[j]
                # Historical data
                ax.plot(range(seq_len), X_sample[:, j], 'b-', label='Past', linewidth=2)
                # Actual future
                future_x = range(seq_len, seq_len + pred_len)
                ax.plot(future_x, y_true[:, j], 'g-', label='Actual', linewidth=2)
                # Prediction
                ax.plot(future_x, y_pred[:, j], 'r--', label='Prediction', linewidth=2)
                ax.set_title(f'Stock {j+1}')
                ax.legend()
                ax.grid(True, alpha=0.3)
            # Hide last subplot
            if n_features < 6:
                axes[-1].set_visible(False)
            plt.tight_layout()
            plt.show()

# Visualize results
test_loader = create_dataloader(X_test, y_test, batch_size=1, shuffle=False)
plot_results(model, test_loader, scaler)

# Plot training curves
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Train Loss', color='blue')
plt.plot(val_losses, label='Validation Loss', color='red')
plt.title('Training Curves')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(train_losses[-20:], label='Train Loss (Last 20)', color='blue')
plt.plot(val_losses[-20:], label='Validation Loss (Last 20)', color='red')
plt.title('Recent Training Curves')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
6. Performance Evaluation
def evaluate_model(model, test_loader, scaler):
    """Evaluate model performance"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.eval()
    all_predictions = []
    all_targets = []
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            predictions = model(batch_X)
            all_predictions.append(predictions.cpu().numpy())
            all_targets.append(batch_y.cpu().numpy())
    # Combine predictions and targets: (n_samples, n_features, pred_len)
    predictions = np.concatenate(all_predictions, axis=0)
    targets = np.concatenate(all_targets, axis=0)
    # Inverse normalization (the scaler expects features in the last axis)
    predictions = scaler.inverse_transform(predictions.transpose(0, 2, 1).reshape(-1, n_features))
    targets = scaler.inverse_transform(targets.transpose(0, 2, 1).reshape(-1, n_features))
    # Calculate MSE, MAE, RMSE
    mse = np.mean((predictions - targets) ** 2)
    mae = np.mean(np.abs(predictions - targets))
    rmse = np.sqrt(mse)
    print("Test Performance:")
    print(f"MSE: {mse:.6f}")
    print(f"MAE: {mae:.6f}")
    print(f"RMSE: {rmse:.6f}")
    return mse, mae, rmse

# Evaluate performance
mse, mae, rmse = evaluate_model(model, test_loader, scaler)
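A single aggregate number can hide where a model fails. As a hedged sketch, the helpers below break the error down per feature and per forecast step; the function names are assumptions, and the toy arrays are constructed so the expected values are easy to check by hand.

```python
import numpy as np

def per_feature_metrics(predictions, targets):
    """MSE and MAE per feature; inputs are (n_samples, n_features, pred_len)."""
    err = predictions - targets
    mse = (err ** 2).mean(axis=(0, 2))
    mae = np.abs(err).mean(axis=(0, 2))
    return mse, mae

def per_horizon_mae(predictions, targets):
    """MAE for each forecast step, averaged over samples and features."""
    return np.abs(predictions - targets).mean(axis=(0, 1))

# Toy check: feature 0 is off by 1 everywhere, feature 1 is off by 2.
targets = np.zeros((4, 2, 3))
predictions = np.zeros((4, 2, 3))
predictions[:, 0, :] = 1.0
predictions[:, 1, :] = 2.0
mse_f, mae_f = per_feature_metrics(predictions, targets)
print(mse_f, mae_f, per_horizon_mae(predictions, targets))
```

On real forecasts, the per-horizon curve typically rises with the forecast step, which helps decide how far ahead the predictions remain usable.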
📊 Model Comparison and Selection Guide
Performance Comparison
| Model | Advantages | Disadvantages | Application Areas |
|---|---|---|---|
| Informer | Efficient on long sequences, strong performance | Complex structure, long training time | Long-term prediction, large-scale data |
| Autoformer | Automatic periodicity learning, interpretable | Limited on non-periodic data | Seasonal data, business analysis |
| FEDformer | Frequency domain processing, ensemble | FFT computation cost | Signal processing, periodic data |
| PatchTST | Simple and efficient, fast training | Sensitive to patch size | Real-time prediction, multivariate time series |
Model Selection Guide
1. Based on Data Characteristics:
- Strongly periodic data: Autoformer, FEDformer
- Long sequence data: Informer, PatchTST
- Multivariate time series: PatchTST, Informer
- Real-time prediction: PatchTST
2. Based on Resource Constraints:
- Limited computational resources: PatchTST
- Sufficient resources: Informer, FEDformer
- Fast prototyping: PatchTST
🎯 Next Steps
In this part, we explored transformer-based time series forecasting models. The next parts cover:
- Part 4: Latest generative AI models (TimeGPT, Lag-Llama, Moirai, Chronos)
- Part 5: Practical application and MLOps (model deployment, monitoring, A/B testing)
💡 Key Takeaways
- Transformer Advantages: Parallel processing, long-range dependency learning, scalability
- Model-specific Characteristics: Each model has unique strengths and application areas
- Practical Considerations: Consider data characteristics, resource constraints, and performance requirements comprehensively
- Importance of Practice: Learn theory and code together to improve practical application skills
Transformer-based models present a new paradigm for time series forecasting. Join us in the next part to explore even more interesting cutting-edge models!