Part 3: Transformer-Based Time Series Forecasting Models

Welcome to the third part of our time series forecasting series! In this installment, we’ll explore how the revolutionary Transformer architecture, which transformed natural language processing, has been adapted for time series forecasting.

📚 Learning Objectives

By the end of this part, you will learn:

The background and necessity of applying Transformers to time series forecasting
Key characteristics of major models: Informer, Autoformer, FEDformer, PatchTST
Strengths and weaknesses of each model and their application domains
Hands-on practice with transformer-based time series forecasting using real data

🔍 Background: Why Transformers for Time Series?

Limitations of Previous Methods

RNN/LSTM Problems:

Sequential processing leading to long training times
Difficulty learning long-term dependencies
Inability to parallelize processing

CNN Limitations:

Focus only on local patterns
Difficulty capturing long-range dependencies

Transformer Advantages

Parallel Processing: Process all time points simultaneously
Long-range Dependencies: Learn relationships in long sequences through self-attention
Scalability: Performance improvement with larger models and datasets

🚀 Major Transformer-Based Time Series Models

1. Informer (2021)

Core Ideas:

ProbSparse Self-attention reduces complexity from O(L²) to O(L log L)
Self-attention Distilling compresses information across layers
Generative Decoder predicts long sequences at once

Key Features:

# Informer's core structure
class Informer(nn.Module):
    def __init__(self, enc_in, dec_in, c_out, seq_len, label_len, out_len):
        super(Informer, self).__init__()
        self.enc_in = enc_in
        self.dec_in = dec_in
        self.c_out = c_out
        self.seq_len = seq_len
        self.label_len = label_len
        self.out_len = out_len
        
        # Uses ProbSparse Attention
        self.attn = ProbAttention(attention_dropout=0.1)
        
        # Encoder and Decoder
        self.encoder = Encoder(...)
        self.decoder = Decoder(...)

Advantages:

Efficient on long sequences
Excellent performance across diverse datasets

Disadvantages:

Long training time due to complex structure
Difficult hyperparameter tuning

2. Autoformer (2021)

Core Ideas:

Auto-Correlation mechanism automatically learns periodicity in time series
Decomposition Block separates trend and seasonality
Series-wise Connection minimizes information loss

Key Features:

# Autoformer's Auto-Correlation
class AutoCorrelation(nn.Module):
    def forward(self, queries, keys, values):
        # Calculate correlation to find periodicity in time series
        autocorr = self.autocorrelation(queries, keys)
        return self.value_projection(values) * autocorr

Advantages:

Automatic learning of time series periodicity
Improved interpretability through trend and seasonality decomposition

Disadvantages:

Limited performance on data without periodicity
Difficulty learning complex patterns

3. FEDformer (2022)

Core Ideas:

Fourier Enhanced Decomposed Transformer
Attention in frequency domain using FFT
Model ensemble for performance improvement

Key Features:

# FEDformer's Fourier Attention
class FourierAttention(nn.Module):
    def forward(self, x):
        # Transform to frequency domain using FFT
        x_freq = torch.fft.rfft(x, dim=-1)
        # Calculate attention in frequency domain
        attn_freq = self.frequency_attention(x_freq)
        # Inverse transform back to time domain
        return torch.fft.irfft(attn_freq, dim=-1)

Advantages:

Efficient processing in frequency domain
Can learn various periodic patterns

Disadvantages:

FFT computation cost
May not be suitable for real-time prediction

4. PatchTST (2023)

Core Ideas:

Process time series in patch units
Channel Independence for multivariate time series
Simple structure with excellent performance

Key Features:

# PatchTST's patch creation
def create_patch(x, patch_len, stride):
    # Split time series into patches
    patches = x.unfold(dim=-1, size=patch_len, step=stride)
    return patches.transpose(-1, -2)

Advantages:

Simple and efficient structure
Excellent performance on multivariate time series
Fast training and inference

Disadvantages:

Sensitive to patch size
Limited on very long sequences

🛠️ Hands-on Practice: Stock Price Prediction with PatchTST

Now let’s implement a PatchTST model using real data.

1. Data Preparation

import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

# Generate stock data (use yfinance for real data)
def generate_stock_data(n_samples=1000, n_features=5):
    """Generate synthetic stock data"""
    np.random.seed(42)
    
    # Generate data with trend and seasonality
    t = np.linspace(0, 4*np.pi, n_samples)
    trend = 0.01 * t
    seasonal = 0.5 * np.sin(t) + 0.3 * np.sin(2*t)
    noise = 0.1 * np.random.randn(n_samples)
    
    # Generate multivariate time series
    data = np.zeros((n_samples, n_features))
    for i in range(n_features):
        data[:, i] = trend + seasonal + noise + i*0.1
    
    return pd.DataFrame(data, columns=[f'stock_{i+1}' for i in range(n_features)])

# Generate data
data = generate_stock_data()
print(f"Data shape: {data.shape}")
print(data.head())

2. PatchTST Model Implementation

class PatchTST(nn.Module):
    def __init__(self, seq_len, pred_len, patch_len, stride, n_features, d_model=128, n_heads=8, n_layers=3):
        super(PatchTST, self).__init__()
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.patch_len = patch_len
        self.stride = stride
        self.n_features = n_features
        self.d_model = d_model
        
        # Calculate number of patches
        self.num_patches = (seq_len - patch_len) // stride + 1
        
        # Input projection
        self.input_projection = nn.Linear(patch_len, d_model)
        
        # Positional encoding
        self.pos_encoding = nn.Parameter(torch.randn(1, self.num_patches, d_model))
        
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, 
            nhead=n_heads, 
            dim_feedforward=d_model*4,
            dropout=0.1,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        
        # Output projection
        self.output_projection = nn.Linear(d_model, pred_len)
        
    def create_patches(self, x):
        """Split time series into patches"""
        # x: (batch_size, n_features, seq_len)
        batch_size, n_features, seq_len = x.shape
        
        # Create patches for each feature
        patches = []
        for i in range(n_features):
            feature_patches = x[:, i, :].unfold(dim=-1, size=self.patch_len, step=self.stride)
            patches.append(feature_patches)
        
        # (batch_size, n_features, num_patches, patch_len)
        patches = torch.stack(patches, dim=1)
        return patches
    
    def forward(self, x):
        # x: (batch_size, n_features, seq_len)
        batch_size, n_features, seq_len = x.shape
        
        # Create patches
        patches = self.create_patches(x)  # (batch_size, n_features, num_patches, patch_len)
        
        # Process each feature independently (Channel Independence)
        outputs = []
        for i in range(n_features):
            feature_patches = patches[:, i, :, :]  # (batch_size, num_patches, patch_len)
            
            # Input projection
            projected = self.input_projection(feature_patches)  # (batch_size, num_patches, d_model)
            
            # Add positional encoding
            projected = projected + self.pos_encoding
            
            # Transformer encoder
            encoded = self.transformer(projected)  # (batch_size, num_patches, d_model)
            
            # Global average pooling
            pooled = encoded.mean(dim=1)  # (batch_size, d_model)
            
            # Output projection
            output = self.output_projection(pooled)  # (batch_size, pred_len)
            outputs.append(output)
        
        # Combine outputs from all features
        final_output = torch.stack(outputs, dim=1)  # (batch_size, n_features, pred_len)
        return final_output

# Model parameters
seq_len = 96      # Input sequence length
pred_len = 24     # Prediction length
patch_len = 16    # Patch length
stride = 8        # Stride
n_features = 5    # Number of features

model = PatchTST(
    seq_len=seq_len,
    pred_len=pred_len,
    patch_len=patch_len,
    stride=stride,
    n_features=n_features
)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

3. Data Preprocessing and Training

def prepare_data(data, seq_len, pred_len, train_ratio=0.7, val_ratio=0.2):
    """Prepare data for training"""
    # Normalization
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)
    
    # Create sequences
    X, y = [], []
    for i in range(len(scaled_data) - seq_len - pred_len + 1):
        X.append(scaled_data[i:i+seq_len])
        y.append(scaled_data[i+seq_len:i+seq_len+pred_len])
    
    X = np.array(X)
    y = np.array(y)
    
    # Train/validation/test split
    n_train = int(len(X) * train_ratio)
    n_val = int(len(X) * val_ratio)
    
    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:n_train+n_val], y[n_train:n_train+n_val]
    X_test, y_test = X[n_train+n_val:], y[n_train+n_val:]
    
    return (X_train, y_train), (X_val, y_val), (X_test, y_test), scaler

# Prepare data
(X_train, y_train), (X_val, y_val), (X_test, y_test), scaler = prepare_data(
    data, seq_len, pred_len
)

print(f"Train data: {X_train.shape}, {y_train.shape}")
print(f"Validation data: {X_val.shape}, {y_val.shape}")
print(f"Test data: {X_test.shape}, {y_test.shape}")

# Create DataLoader
def create_dataloader(X, y, batch_size=32, shuffle=True):
    X_tensor = torch.FloatTensor(X).transpose(1, 2)  # (batch, features, seq_len)
    y_tensor = torch.FloatTensor(y).transpose(1, 2)  # (batch, features, pred_len)
    dataset = TensorDataset(X_tensor, y_tensor)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

train_loader = create_dataloader(X_train, y_train, batch_size=32)
val_loader = create_dataloader(X_val, y_val, batch_size=32, shuffle=False)

4. Model Training

def train_model(model, train_loader, val_loader, epochs=50, lr=0.001):
    """Train the model"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
    
    train_losses = []
    val_losses = []
    
    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0
        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_X, batch_y in val_loader:
                batch_X, batch_y = batch_X.to(device), batch_y.to(device)
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)
                val_loss += loss.item()
        
        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        
        scheduler.step(val_loss)
        
        if epoch % 10 == 0:
            print(f'Epoch {epoch:3d}: Train Loss = {train_loss:.6f}, Val Loss = {val_loss:.6f}')
    
    return train_losses, val_losses

# Train model
print("Starting model training...")
train_losses, val_losses = train_model(model, train_loader, val_loader, epochs=100)
print("Training completed!")

5. Results Visualization

def plot_results(model, test_loader, scaler, n_samples=3):
    """Visualize results"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.eval()
    
    with torch.no_grad():
        for i, (batch_X, batch_y) in enumerate(test_loader):
            if i >= n_samples:
                break
                
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            predictions = model(batch_X)
            
            # Visualize only the first sample
            X_sample = batch_X[0].cpu().numpy().T  # (seq_len, n_features)
            y_true = batch_y[0].cpu().numpy().T    # (pred_len, n_features)
            y_pred = predictions[0].cpu().numpy().T # (pred_len, n_features)
            
            # Inverse normalization
            X_sample = scaler.inverse_transform(X_sample)
            y_true = scaler.inverse_transform(y_true)
            y_pred = scaler.inverse_transform(y_pred)
            
            # Visualization
            fig, axes = plt.subplots(2, 3, figsize=(15, 10))
            axes = axes.flatten()
            
            for j in range(min(5, n_features)):
                ax = axes[j]
                
                # Historical data
                ax.plot(range(seq_len), X_sample[:, j], 'b-', label='Past', linewidth=2)
                
                # Actual future
                future_x = range(seq_len, seq_len + pred_len)
                ax.plot(future_x, y_true[:, j], 'g-', label='Actual', linewidth=2)
                
                # Prediction
                ax.plot(future_x, y_pred[:, j], 'r--', label='Prediction', linewidth=2)
                
                ax.set_title(f'Stock {j+1}')
                ax.legend()
                ax.grid(True, alpha=0.3)
            
            # Hide last subplot
            if n_features < 6:
                axes[-1].set_visible(False)
            
            plt.tight_layout()
            plt.show()

# Visualize results
test_loader = create_dataloader(X_test, y_test, batch_size=1, shuffle=False)
plot_results(model, test_loader, scaler)

# Plot training curves
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Train Loss', color='blue')
plt.plot(val_losses, label='Validation Loss', color='red')
plt.title('Training Curves')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(train_losses[-20:], label='Train Loss (Last 20)', color='blue')
plt.plot(val_losses[-20:], label='Validation Loss (Last 20)', color='red')
plt.title('Recent Training Curves')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

6. Performance Evaluation

def evaluate_model(model, test_loader, scaler):
    """Evaluate model performance"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.eval()
    
    all_predictions = []
    all_targets = []
    
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            predictions = model(batch_X)
            
            all_predictions.append(predictions.cpu().numpy())
            all_targets.append(batch_y.cpu().numpy())
    
    # Combine predictions and targets
    predictions = np.concatenate(all_predictions, axis=0)
    targets = np.concatenate(all_targets, axis=0)
    
    # Inverse normalization
    predictions = scaler.inverse_transform(predictions.transpose(0, 2, 1).reshape(-1, n_features))
    targets = scaler.inverse_transform(targets.transpose(0, 2, 1).reshape(-1, n_features))
    
    # Calculate MSE, MAE
    mse = np.mean((predictions - targets) ** 2)
    mae = np.mean(np.abs(predictions - targets))
    rmse = np.sqrt(mse)
    
    print(f"Test Performance:")
    print(f"MSE: {mse:.6f}")
    print(f"MAE: {mae:.6f}")
    print(f"RMSE: {rmse:.6f}")
    
    return mse, mae, rmse

# Evaluate performance
mse, mae, rmse = evaluate_model(model, test_loader, scaler)

📊 Model Comparison and Selection Guide

Performance Comparison

Model	Advantages	Disadvantages	Application Areas
Informer	Efficient on long sequences, strong performance	Complex structure, long training time	Long-term prediction, large-scale data
Autoformer	Automatic periodicity learning, interpretable	Limited on non-periodic data	Seasonal data, business analysis
FEDformer	Frequency domain processing, ensemble	FFT computation cost	Signal processing, periodic data
PatchTST	Simple and efficient, fast training	Sensitive to patch size	Real-time prediction, multivariate time series

Model Selection Guide

1. Based on Data Characteristics:

Strongly periodic data: Autoformer, FEDformer
Long sequence data: Informer, PatchTST
Multivariate time series: PatchTST, Informer
Real-time prediction: PatchTST

2. Based on Resource Constraints:

Limited computational resources: PatchTST
Sufficient resources: Informer, FEDformer
Fast prototyping: PatchTST

🎯 Next Steps

In this part, we explored transformer-based time series forecasting models. In the next parts:

Part 4: Latest generative AI models (TimeGPT, Lag-Llama, Moirai, Chronos)
Part 5: Practical application and MLOps (model deployment, monitoring, A/B testing)

💡 Key Takeaways

Transformer Advantages: Parallel processing, long-range dependency learning, scalability
Model-specific Characteristics: Each model has unique strengths and application areas
Practical Considerations: Consider data characteristics, resource constraints, and performance requirements comprehensively
Importance of Practice: Learn theory and code together to improve practical application skills

Transformer-based models present a new paradigm for time series forecasting. Join us in the next part to explore even more interesting cutting-edge models!

← Previous: Part 2: Deep Learning-based Time Series Forecasting - N-BEATS and DeepAR

Next →: Part 4: Latest Generative AI Models - TimeGPT, Lag-Llama, Moirai, Chronos

Next Part Preview: In Part 4, we’ll explore how the latest generative AI models like TimeGPT and Lag-Llama are being utilized for time series forecasting. 🚀

Part 3: Transformer-Based Time Series Forecasting Models

📚 Learning Objectives

🔍 Background: Why Transformers for Time Series?

Limitations of Previous Methods

Transformer Advantages

🚀 Major Transformer-Based Time Series Models

1. Informer (2021)

2. Autoformer (2021)

3. FEDformer (2022)

4. PatchTST (2023)

🛠️ Hands-on Practice: Stock Price Prediction with PatchTST

1. Data Preparation

2. PatchTST Model Implementation

3. Data Preprocessing and Training

4. Model Training

5. Results Visualization

6. Performance Evaluation

📊 Model Comparison and Selection Guide

Performance Comparison

Model Selection Guide

🎯 Next Steps

💡 Key Takeaways

🔗 Series Navigation