04: Deep Learning for Time Series

“The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.” - Edsger W. Dijkstra

Welcome to the cutting edge of time series forecasting! This notebook explores how deep learning models - specifically LSTMs, GRUs, and Transformers - can capture complex patterns in sequential data.

🎯 Learning Objectives

By the end of this notebook, you’ll be able to:

  • Understand sequence modeling with deep learning

  • Implement LSTM and GRU networks for forecasting

  • Apply Transformer architectures to time series

  • Handle sequence preprocessing and windowing

  • Compare deep learning with traditional methods

  • Deploy and monitor deep learning forecasts

🧠 Deep Learning for Sequences

Traditional forecasting methods like ARIMA work well for stationary data with clear patterns, but deep learning excels at:

Advantages:

  • Non-linear patterns: Complex relationships in data

  • Long-term dependencies: Remembering patterns over long sequences

  • Multiple variables: Multivariate forecasting

  • Automatic feature learning: No manual feature engineering

  • Scalability: Handle large datasets efficiently

Challenges:

  • Data requirements: Need substantial training data

  • Computational cost: More expensive to train

  • Interpretability: Black-box nature

  • Overfitting: Risk with insufficient data

  • Hyperparameter tuning: Many parameters to optimize

Key Architectures:

  • RNN/LSTM/GRU: Sequential processing with memory

  • CNN: Pattern recognition in sequences

  • Transformer: Attention-based sequence modeling

  • Autoencoders: Unsupervised feature learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['font.size'] = 12

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

print("Deep Learning Libraries Loaded!")
print(f"PyTorch version: {torch.__version__}")
def generate_complex_time_series(n_days=1000, noise_level=0.1):
    """Generate complex time series with multiple patterns"""
    
    dates = pd.date_range('2020-01-01', periods=n_days, freq='D')
    t = np.arange(n_days)
    
    # Multiple seasonal patterns
    daily_pattern = 0.5 * np.sin(2 * np.pi * t / 1)  # Period-1 cycle: sin(2*pi*integer) == 0, so this contributes nothing at daily sampling
    weekly_pattern = 2 * np.sin(2 * np.pi * t / 7)   # Weekly
    monthly_pattern = 3 * np.sin(2 * np.pi * t / 30) # Monthly
    yearly_pattern = 5 * np.sin(2 * np.pi * t / 365) # Yearly
    
    # Non-linear trend with changes
    trend = 0.001 * t + 0.00001 * t**2  # Quadratic trend
    
    # Add trend changes
    trend_changes = np.zeros(n_days)
    trend_changes[200:400] += 10  # Positive shock
    trend_changes[600:700] -= 15  # Negative shock
    
    # Complex interactions
    interaction = weekly_pattern * monthly_pattern * 0.1
    
    # External factors (simulated)
    external = np.random.choice([-2, -1, 0, 1, 2], n_days, p=[0.1, 0.2, 0.4, 0.2, 0.1])
    external = np.convolve(external, np.ones(7)/7, mode='same')  # Smooth external factors
    
    # Combine all components
    y = (trend + trend_changes + daily_pattern + weekly_pattern + 
         monthly_pattern + yearly_pattern + interaction + external)
    
    # Add noise
    noise = np.random.normal(0, noise_level * np.std(y), n_days)
    y += noise
    
    # Create DataFrame
    df = pd.DataFrame({
        'ds': dates,
        'y': y,
        'trend': trend,
        'daily': daily_pattern,
        'weekly': weekly_pattern,
        'monthly': monthly_pattern,
        'yearly': yearly_pattern,
        'interaction': interaction,
        'external': external,
        'noise': noise
    })
    
    return df

# Generate complex time series
complex_data = generate_complex_time_series(n_days=800)

print(f"Generated {len(complex_data)} days of complex time series data")
print(f"Date range: {complex_data['ds'].min()} to {complex_data['ds'].max()}")
print(f"Value range: {complex_data['y'].min():.2f} to {complex_data['y'].max():.2f}")

# Plot the data
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))

# Main time series
ax1.plot(complex_data['ds'], complex_data['y'], 'b-', linewidth=1.5, alpha=0.8)
ax1.set_title('Complex Time Series with Multiple Patterns')
ax1.set_xlabel('Date')
ax1.set_ylabel('Value')
ax1.grid(True, alpha=0.3)

# Component breakdown
components = ['trend', 'weekly', 'monthly', 'yearly', 'interaction']
colors = ['red', 'green', 'orange', 'purple', 'brown']
for i, comp in enumerate(components):
    ax2.plot(complex_data['ds'], complex_data[comp], 
             color=colors[i], linewidth=1.5, label=comp.capitalize())

ax2.set_title('Time Series Components')
ax2.set_xlabel('Date')
ax2.set_ylabel('Component Value')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Show data structure
print("\nData Structure:")
print(complex_data.head())
print("\nData Info:")
complex_data.info()  # info() prints directly and returns None

🔄 Sequence Preprocessing

Deep learning models require data in sequences/windows. Key preprocessing steps:

Sequence Creation:

  • Sliding windows: Fixed-size windows of historical data

  • Lookback window: How many past steps to use for prediction

  • Forecast horizon: How many steps ahead to predict

  • Stride: How much to slide the window (usually 1)
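The windowing described above can be sketched in a few lines of NumPy on a toy series; the `TimeSeriesDataset` class later in this notebook builds the same structure with an explicit loop. This is just an illustrative sketch for a horizon of 1:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10, dtype=float)  # toy series: 0, 1, ..., 9
lookback, horizon = 3, 1

windows = sliding_window_view(series, lookback)[:-horizon]  # model inputs
targets = series[lookback:]                                 # 1-step-ahead targets

print(windows.shape, targets.shape)  # (7, 3) (7,)
print(windows[0], targets[0])        # [0. 1. 2.] 3.0
```

Each input window of `lookback` values is paired with the value immediately following it; a stride of 1 means consecutive windows overlap by `lookback - 1` points.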

Data Preparation:

  • Scaling: Normalize data to [0,1] or [-1,1] range

  • Train/Val/Test split: Respect temporal order

  • Batch processing: Group sequences into batches

  • Sequence padding: Handle variable-length sequences
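On the padding bullet: PyTorch's `pad_sequence` utility zero-pads a batch of variable-length sequences to a common length. All sequences in this notebook share one fixed lookback, so this is only needed when inputs are ragged:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1., 2., 3.]), torch.tensor([4., 5.])]
padded = pad_sequence(seqs, batch_first=True)  # zero-pads to the longest sequence
print(padded)
# tensor([[1., 2., 3.],
#         [4., 5., 0.]])
```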

PyTorch Datasets:

  • Custom Dataset class: Handle sequence loading

  • DataLoader: Efficient batch processing

  • Collate functions: Custom batch preparation

class TimeSeriesDataset(Dataset):
    """Custom dataset for time series forecasting"""
    
    def __init__(self, data, lookback=30, forecast_horizon=1, target_col='y'):
        """
        Args:
            data: DataFrame with time series data
            lookback: Number of past time steps to use
            forecast_horizon: Number of steps ahead to predict
            target_col: Column name of target variable
        """
        self.data = data[target_col].values
        self.lookback = lookback
        self.forecast_horizon = forecast_horizon
        
        # Create sequences
        self.sequences = []
        self.targets = []
        
        for i in range(len(self.data) - lookback - forecast_horizon + 1):
            seq = self.data[i:i + lookback]
            target = self.data[i + lookback:i + lookback + forecast_horizon]
            self.sequences.append(seq)
            self.targets.append(target)
        
        self.sequences = np.array(self.sequences)
        self.targets = np.array(self.targets)
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, idx):
        return (
            torch.FloatTensor(self.sequences[idx]),
            torch.FloatTensor(self.targets[idx])
        )

def prepare_time_series_data(data, lookback=30, forecast_horizon=1, train_ratio=0.7, val_ratio=0.15):
    """Prepare time series data for deep learning"""
    
    # Compute split sizes first so the scaler can be fit on the training portion only
    # (fitting on the full series would leak test-set statistics into training)
    n_total = len(data)
    n_train = int(n_total * train_ratio)
    n_val = int(n_total * val_ratio)
    
    # Scale the data: fit on train, transform everything
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler.fit(data[['y']].iloc[:n_train])
    
    scaled_df = data.copy()
    scaled_df['y_scaled'] = scaler.transform(data[['y']])
    
    # Split data respecting temporal order
    train_data = scaled_df[:n_train]
    val_data = scaled_df[n_train:n_train + n_val]
    test_data = scaled_df[n_train + n_val:]
    
    print(f"Data split: Train={len(train_data)}, Val={len(val_data)}, Test={len(test_data)}")
    
    # Create datasets
    train_dataset = TimeSeriesDataset(train_data, lookback, forecast_horizon, 'y_scaled')
    val_dataset = TimeSeriesDataset(val_data, lookback, forecast_horizon, 'y_scaled')
    test_dataset = TimeSeriesDataset(test_data, lookback, forecast_horizon, 'y_scaled')
    
    # Create data loaders
    batch_size = 32
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    return (train_loader, val_loader, test_loader, scaler, 
            train_data, val_data, test_data)

# Prepare data
lookback = 30  # Use 30 days of history
forecast_horizon = 1  # Predict 1 day ahead

(train_loader, val_loader, test_loader, scaler, 
 train_data, val_data, test_data) = prepare_time_series_data(
    complex_data, lookback=lookback, forecast_horizon=forecast_horizon
)

# Show sample sequence
sample_seq, sample_target = next(iter(train_loader))
print(f"Sample sequence shape: {sample_seq.shape}")
print(f"Sample target shape: {sample_target.shape}")
print(f"First sequence (first 5 values): {sample_seq[0][:5].numpy()}")
print(f"Corresponding target: {sample_target[0].numpy()}")

# Visualize sequence creation
fig, ax = plt.subplots(figsize=(15, 6))

# Plot original data
ax.plot(complex_data['ds'], complex_data['y'], 'b-', alpha=0.7, linewidth=1, label='Original Data')

# Highlight a sample sequence
sample_idx = 100
seq_start = sample_idx
seq_end = sample_idx + lookback
target_idx = seq_end

ax.plot(complex_data['ds'][seq_start:seq_end], 
        complex_data['y'][seq_start:seq_end], 
        'r-', linewidth=3, label='Input Sequence')
ax.plot(complex_data['ds'][target_idx], 
        complex_data['y'][target_idx], 
        'go', markersize=10, label='Target Value')

# Add vertical lines
ax.axvline(x=complex_data['ds'][seq_start], color='r', linestyle='--', alpha=0.7)
ax.axvline(x=complex_data['ds'][seq_end-1], color='r', linestyle='--', alpha=0.7)
ax.axvline(x=complex_data['ds'][target_idx], color='g', linestyle='--', alpha=0.7)

ax.set_title('Sequence Creation Example')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

🧠 LSTM Networks

Long Short-Term Memory (LSTM) networks are the workhorse of sequence modeling.

LSTM Architecture:

  • Forget Gate: Controls what information to discard

  • Input Gate: Controls what new information to store

  • Output Gate: Controls what information to output

  • Cell State: Long-term memory of the network

  • Hidden State: Short-term memory passed to next timestep

Key Equations:

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\]
\[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
\[C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t\]
\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\]
\[h_t = o_t \cdot \tanh(C_t)\]
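As a numerical sanity check (not part of the forecasting pipeline), the gate equations above can be computed by hand and compared against PyTorch's `nn.LSTMCell`. One PyTorch-specific detail: the four gates' weights are stacked in (input, forget, cell, output) order:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
cell = nn.LSTMCell(input_size=3, hidden_size=5)
x = torch.randn(1, 3)
h, c = torch.zeros(1, 5), torch.zeros(1, 5)

# PyTorch stacks gate weights in (input, forget, cell, output) order
gates = x @ cell.weight_ih.T + h @ cell.weight_hh.T + cell.bias_ih + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)

i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)              # candidate cell state, C~_t
c_new = f * c + i * g          # C_t = f_t * C_{t-1} + i_t * C~_t
h_new = o * torch.tanh(c_new)  # h_t = o_t * tanh(C_t)

h_ref, c_ref = cell(x, (h, c))
print(torch.allclose(h_new, h_ref, atol=1e-6),
      torch.allclose(c_new, c_ref, atol=1e-6))  # True True
```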

Advantages:

  • Long-term dependencies: Can remember patterns over long sequences

  • Gradient flow: Gating mitigates the vanishing-gradient problem

  • Flexible: Can be stacked and combined with other layers

class LSTMModel(nn.Module):
    """LSTM model for time series forecasting"""
    
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, output_size=1, dropout=0.2):
        super(LSTMModel, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_size, output_size)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        
        # Forward pass through LSTM
        out, _ = self.lstm(x, (h0, c0))
        
        # Take the last time step output
        out = out[:, -1, :]
        
        # Apply dropout and fully connected layer
        out = self.dropout(out)
        out = self.fc(out)
        
        return out

def train_lstm_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
    """Train LSTM model"""
    
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    train_losses = []
    val_losses = []
    
    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0
        for sequences, targets in train_loader:
            sequences, targets = sequences.to(device), targets.to(device)
            
            # Reshape for LSTM (batch_size, seq_len, input_size)
            sequences = sequences.unsqueeze(-1)
            
            # Forward pass
            outputs = model(sequences)
            loss = criterion(outputs, targets)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for sequences, targets in val_loader:
                sequences, targets = sequences.to(device), targets.to(device)
                sequences = sequences.unsqueeze(-1)
                
                outputs = model(sequences)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
    
    return train_losses, val_losses

# Create and train LSTM model
lstm_model = LSTMModel(input_size=1, hidden_size=64, num_layers=2, output_size=1)

print("Training LSTM model...")
print(f"Model parameters: {sum(p.numel() for p in lstm_model.parameters())}")

train_losses, val_losses = train_lstm_model(lstm_model, train_loader, val_loader, num_epochs=50)

# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('LSTM Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Evaluate on test set
def evaluate_model(model, test_loader, scaler):
    """Evaluate model performance"""
    model.eval()
    predictions = []
    actuals = []
    
    with torch.no_grad():
        for sequences, targets in test_loader:
            sequences, targets = sequences.to(device), targets.to(device)
            sequences = sequences.unsqueeze(-1)
            
            outputs = model(sequences)
            
            predictions.extend(outputs.cpu().numpy())
            actuals.extend(targets.cpu().numpy())
    
    # Inverse transform predictions
    predictions = np.array(predictions).reshape(-1, 1)
    actuals = np.array(actuals).reshape(-1, 1)
    
    predictions_inv = scaler.inverse_transform(predictions)
    actuals_inv = scaler.inverse_transform(actuals)
    
    mae = mean_absolute_error(actuals_inv, predictions_inv)
    rmse = np.sqrt(mean_squared_error(actuals_inv, predictions_inv))
    
    return predictions_inv, actuals_inv, mae, rmse

# Evaluate LSTM
lstm_pred, lstm_actual, lstm_mae, lstm_rmse = evaluate_model(lstm_model, test_loader, scaler)

print(f"\nLSTM Test Performance:")
print(f"MAE: {lstm_mae:.4f}")
print(f"RMSE: {lstm_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((lstm_actual - lstm_pred) / lstm_actual))*100:.2f}%")

🚀 GRU Networks

Gated Recurrent Units (GRUs) are a simplified alternative to LSTMs that often achieve similar performance.

GRU Architecture:

  • Reset Gate: Controls what information to forget

  • Update Gate: Controls what information to update

  • No separate cell state: Hidden state serves both purposes

Key Equations:

\[r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)\]
\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)\]
\[\tilde{h}_t = \tanh(W \cdot [r_t \cdot h_{t-1}, x_t] + b)\]
\[h_t = (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t\]
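The same hand-computed sanity check works for `nn.GRUCell`. Two PyTorch-specific details: the hidden-side bias `b_hn` sits inside the reset-gate product, and PyTorch's update gate is the complement of the one in the equations above (its z_t gates the *old* state, i.e. h_t = (1 - z_t)·h̃_t + z_t·h_{t-1}):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
cell = nn.GRUCell(input_size=3, hidden_size=5)
x = torch.randn(1, 3)
h = torch.zeros(1, 5)

# PyTorch stacks gate weights in (reset, update, new) order
W_ir, W_iz, W_in = cell.weight_ih.chunk(3)
W_hr, W_hz, W_hn = cell.weight_hh.chunk(3)
b_ir, b_iz, b_in = cell.bias_ih.chunk(3)
b_hr, b_hz, b_hn = cell.bias_hh.chunk(3)

r = torch.sigmoid(x @ W_ir.T + b_ir + h @ W_hr.T + b_hr)
z = torch.sigmoid(x @ W_iz.T + b_iz + h @ W_hz.T + b_hz)
n = torch.tanh(x @ W_in.T + b_in + r * (h @ W_hn.T + b_hn))  # candidate h~_t
h_new = (1 - z) * n + z * h  # PyTorch convention: z gates the old state

h_ref = cell(x, h)
print(torch.allclose(h_new, h_ref, atol=1e-6))  # True
```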

Advantages over LSTM:

  • Fewer parameters: Simpler architecture

  • Faster training: Less computation

  • Similar performance: Often comparable to LSTMs

  • Easier to implement: Less complex gates

class GRUModel(nn.Module):
    """GRU model for time series forecasting"""
    
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, output_size=1, dropout=0.2):
        super(GRUModel, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # GRU layer
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_size, output_size)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        
        # Forward pass through GRU
        out, _ = self.gru(x, h0)
        
        # Take the last time step output
        out = out[:, -1, :]
        
        # Apply dropout and fully connected layer
        out = self.dropout(out)
        out = self.fc(out)
        
        return out

def train_gru_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
    """Train GRU model"""
    
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    train_losses = []
    val_losses = []
    
    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0
        for sequences, targets in train_loader:
            sequences, targets = sequences.to(device), targets.to(device)
            sequences = sequences.unsqueeze(-1)
            
            outputs = model(sequences)
            loss = criterion(outputs, targets)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for sequences, targets in val_loader:
                sequences, targets = sequences.to(device), targets.to(device)
                sequences = sequences.unsqueeze(-1)
                
                outputs = model(sequences)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
    
    return train_losses, val_losses

# Create and train GRU model
gru_model = GRUModel(input_size=1, hidden_size=64, num_layers=2, output_size=1)

print("\nTraining GRU model...")
print(f"Model parameters: {sum(p.numel() for p in gru_model.parameters())}")

gru_train_losses, gru_val_losses = train_gru_model(gru_model, train_loader, val_loader, num_epochs=50)

# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(gru_train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(gru_val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('GRU Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Evaluate GRU
gru_pred, gru_actual, gru_mae, gru_rmse = evaluate_model(gru_model, test_loader, scaler)

print(f"\nGRU Test Performance:")
print(f"MAE: {gru_mae:.4f}")
print(f"RMSE: {gru_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((gru_actual - gru_pred) / gru_actual))*100:.2f}%")

# Compare LSTM vs GRU
print(f"\n=== Model Comparison ===")
print(f"LSTM - MAE: {lstm_mae:.4f}, RMSE: {lstm_rmse:.4f}")
print(f"GRU  - MAE: {gru_mae:.4f}, RMSE: {gru_rmse:.4f}")
print(f"MAE ratio (LSTM/GRU): {lstm_mae/gru_mae:.2f} (values > 1 mean GRU was more accurate on this run)")
print(f"Parameter ratio (GRU/LSTM): {sum(p.numel() for p in gru_model.parameters()) / sum(p.numel() for p in lstm_model.parameters()):.2f}")

πŸ” Transformer ArchitectureΒΆ

Transformer models, originally designed for NLP, are revolutionizing time series forecasting.

Key Components:

  • Self-Attention: Learn relationships between all time steps

  • Multi-Head Attention: Multiple attention mechanisms

  • Positional Encoding: Add temporal information

  • Feed-Forward Networks: Process attention outputs

  • Layer Normalization: Stabilize training

Attention Mechanism:

\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where Q, K, V are Query, Key, Value matrices.
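A minimal, from-scratch implementation of this formula (batch-first self-attention, so Q, K, and V all come from the same sequence; `nn.MultiheadAttention` adds learned projections and multiple heads on top of this core):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V, weights

torch.manual_seed(42)
x = torch.randn(2, 10, 16)                 # self-attention: Q = K = V = x
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.shape)                  # torch.Size([2, 10, 16]) torch.Size([2, 10, 10])
```

Each output position is a weighted average of all value vectors, with weights given by query-key similarity.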

Advantages:

  • Parallel processing: No sequential dependencies

  • Long-range dependencies: Attention can span entire sequence

  • Scalable: Handles long sequences well, though attention cost grows quadratically with length

  • Flexible: Can be adapted to various tasks

Time Series Specifics:

  • Temporal positional encoding: Sinusoidal or learned

  • Causal masking: Prevent future information leakage

  • Patch-based processing: Divide time series into patches
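The causal-masking bullet can be illustrated with the boolean attention mask `nn.TransformerEncoder` accepts: `True` entries are blocked, so each position only attends to itself and earlier steps (`nn.Transformer.generate_square_subsequent_mask` builds an equivalent mask in float form):

```python
import torch

seq_len = 5
# Boolean mask: True means "may NOT attend"; the strictly upper-triangular
# entries block every position from seeing future time steps
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask.int())
```

With an encoder like the one defined later in this notebook, the mask is passed as `self.transformer_encoder(x, mask=causal_mask)`.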

class PositionalEncoding(nn.Module):
    """Positional encoding for Transformer"""
    
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)  # (1, max_len, d_model) to match batch-first inputs
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        # x: (batch, seq_len, d_model); slice the encoding along the sequence dimension
        return x + self.pe[:, :x.size(1), :]

class TransformerModel(nn.Module):
    """Transformer model for time series forecasting"""
    
    def __init__(self, input_size=1, d_model=64, nhead=8, num_layers=2, 
                 dim_feedforward=128, output_size=1, dropout=0.1):
        super(TransformerModel, self).__init__()
        
        self.input_size = input_size
        self.d_model = d_model
        
        # Input projection
        self.input_projection = nn.Linear(input_size, d_model)
        
        # Positional encoding
        self.pos_encoder = PositionalEncoding(d_model)
        
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )
        
        # Output projection
        self.output_projection = nn.Linear(d_model, output_size)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Input projection
        x = self.input_projection(x) * np.sqrt(self.d_model)
        
        # Add positional encoding
        x = self.pos_encoder(x)
        
        # Transformer encoding
        x = self.transformer_encoder(x)
        
        # Take the last time step output
        x = x[:, -1, :]
        
        # Output projection
        x = self.dropout(x)
        x = self.output_projection(x)
        
        return x

def train_transformer_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
    """Train Transformer model"""
    
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    
    train_losses = []
    val_losses = []
    
    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0
        for sequences, targets in train_loader:
            sequences, targets = sequences.to(device), targets.to(device)
            sequences = sequences.unsqueeze(-1)
            
            outputs = model(sequences)
            loss = criterion(outputs, targets)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for sequences, targets in val_loader:
                sequences, targets = sequences.to(device), targets.to(device)
                sequences = sequences.unsqueeze(-1)
                
                outputs = model(sequences)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        
        scheduler.step()
        
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
    
    return train_losses, val_losses

# Create and train Transformer model
transformer_model = TransformerModel(
    input_size=1, 
    d_model=64, 
    nhead=8, 
    num_layers=2, 
    output_size=1
)

print("\nTraining Transformer model...")
print(f"Model parameters: {sum(p.numel() for p in transformer_model.parameters())}")

transformer_train_losses, transformer_val_losses = train_transformer_model(
    transformer_model, train_loader, val_loader, num_epochs=50
)

# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(transformer_train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(transformer_val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('Transformer Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Evaluate Transformer
transformer_pred, transformer_actual, transformer_mae, transformer_rmse = evaluate_model(
    transformer_model, test_loader, scaler
)

print(f"\nTransformer Test Performance:")
print(f"MAE: {transformer_mae:.4f}")
print(f"RMSE: {transformer_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((transformer_actual - transformer_pred) / transformer_actual))*100:.2f}%")

πŸ† Model ComparisonΒΆ

Comparing deep learning models against traditional methods and each other.

Performance Metrics:

  • MAE: Mean Absolute Error

  • RMSE: Root Mean Squared Error

  • MAPE: Mean Absolute Percentage Error

  • Training time: Computational efficiency

  • Parameters: Model complexity
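The first three metrics can be packaged in a small helper. The `eps` guard is an addition made here, not part of the notebook's pipeline: plain MAPE blows up when an actual value is at or near zero, which this synthetic series can produce:

```python
import numpy as np

def forecast_metrics(actual, pred, eps=1e-8):
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    mae = np.mean(np.abs(actual - pred))
    rmse = np.sqrt(np.mean((actual - pred) ** 2))
    # eps keeps MAPE finite when an actual value is (near) zero
    mape = np.mean(np.abs((actual - pred) / (actual + eps))) * 100
    return mae, rmse, mape

mae, rmse, mape = forecast_metrics([10, 20, 30], [12, 18, 33])
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, MAPE={mape:.2f}%")  # MAE=2.333, RMSE=2.380, MAPE=13.33%
```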

Key Insights:

  • LSTM vs GRU: GRU often performs similarly with fewer parameters

  • Transformer: Can excel with large datasets and long sequences

  • Data size matters: Deep learning needs sufficient training data

  • Hyperparameter tuning: Critical for optimal performance

When to Use Each:

  • LSTM: Complex patterns, sufficient data, long-term memory needed

  • GRU: Simpler problems, faster training, resource constraints

  • Transformer: Very long sequences, parallel processing, large datasets

# Compare all models
models_comparison = {
    'LSTM': {
        'MAE': lstm_mae,
        'RMSE': lstm_rmse,
        'Parameters': sum(p.numel() for p in lstm_model.parameters()),
        'Predictions': lstm_pred,
        'Actuals': lstm_actual
    },
    'GRU': {
        'MAE': gru_mae,
        'RMSE': gru_rmse,
        'Parameters': sum(p.numel() for p in gru_model.parameters()),
        'Predictions': gru_pred,
        'Actuals': gru_actual
    },
    'Transformer': {
        'MAE': transformer_mae,
        'RMSE': transformer_rmse,
        'Parameters': sum(p.numel() for p in transformer_model.parameters()),
        'Predictions': transformer_pred,
        'Actuals': transformer_actual
    }
}

# Print comparison table
print("=== Deep Learning Model Comparison ===")
print("Model         | MAE      | RMSE     | Parameters | MAPE")
print("-" * 55)
for name, metrics in models_comparison.items():
    mape = np.mean(np.abs((metrics['Actuals'] - metrics['Predictions']) / metrics['Actuals'])) * 100
    print(f"{name:12} | {metrics['MAE']:.4f} | {metrics['RMSE']:.4f} | {metrics['Parameters']:10} | {mape:.2f}%")

# Find best model
best_model = min(models_comparison.items(), key=lambda x: x[1]['MAE'])
print(f"\nπŸ† Best performing model: {best_model[0]} (MAE: {best_model[1]['MAE']:.4f})")

# Plot predictions comparison
plt.figure(figsize=(15, 10))

# Plot actual values
plt.subplot(2, 1, 1)
plt.plot(models_comparison['LSTM']['Actuals'][:100], 'k-', linewidth=2, label='Actual', alpha=0.8)
plt.plot(models_comparison['LSTM']['Predictions'][:100], 'b-', linewidth=2, label='LSTM')
plt.plot(models_comparison['GRU']['Predictions'][:100], 'r-', linewidth=2, label='GRU')
plt.plot(models_comparison['Transformer']['Predictions'][:100], 'g-', linewidth=2, label='Transformer')
plt.title('Model Predictions Comparison (First 100 Test Points)')
plt.xlabel('Time Step')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot prediction errors
plt.subplot(2, 1, 2)
lstm_errors = np.abs(models_comparison['LSTM']['Actuals'] - models_comparison['LSTM']['Predictions'])
gru_errors = np.abs(models_comparison['GRU']['Actuals'] - models_comparison['GRU']['Predictions'])
transformer_errors = np.abs(models_comparison['Transformer']['Actuals'] - models_comparison['Transformer']['Predictions'])

plt.plot(lstm_errors[:100], 'b-', linewidth=1.5, label='LSTM Error', alpha=0.7)
plt.plot(gru_errors[:100], 'r-', linewidth=1.5, label='GRU Error', alpha=0.7)
plt.plot(transformer_errors[:100], 'g-', linewidth=1.5, label='Transformer Error', alpha=0.7)
plt.title('Prediction Errors Comparison')
plt.xlabel('Time Step')
plt.ylabel('Absolute Error')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Error distribution analysis
plt.figure(figsize=(15, 6))

plt.subplot(1, 3, 1)
plt.hist(lstm_errors, bins=30, alpha=0.7, color='blue', density=True)
plt.title('LSTM Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
plt.hist(gru_errors, bins=30, alpha=0.7, color='red', density=True)
plt.title('GRU Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
plt.hist(transformer_errors, bins=30, alpha=0.7, color='green', density=True)
plt.title('Transformer Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical summary
print("\n=== Error Statistics ===")
for name, metrics in models_comparison.items():
    errors = np.abs(metrics['Actuals'] - metrics['Predictions'])
    print(f"{name}:")
    print(f"  Mean Error: {errors.mean():.4f}")
    print(f"  Std Error: {errors.std():.4f}")
    print(f"  Median Error: {np.median(errors):.4f}")
    print(f"  95th Percentile: {np.percentile(errors, 95):.4f}")
    print()

🎯 Key Takeaways

  1. Deep learning excels at capturing complex, non-linear patterns in time series

  2. Sequence preprocessing is crucial - proper windowing and scaling matters

  3. LSTM vs GRU: GRU often provides similar performance with fewer parameters

  4. Transformers can handle very long sequences and parallel processing

  5. Data requirements: Deep learning needs substantial training data

  6. Hyperparameter tuning: Critical for optimal performance

  7. Computational cost: More expensive than traditional methods

πŸ” When to Use Deep LearningΒΆ

✅ Good For:

  • Complex patterns: Non-linear relationships, interactions

  • Long sequences: Extended historical context needed

  • Large datasets: Sufficient training data available

  • Multiple variables: Multivariate forecasting

  • Real-time processing: Fast inference after training

❌ Less Ideal For:

  • Small datasets: Traditional methods work better

  • Simple patterns: ARIMA, Prophet may suffice

  • Interpretability: Black-box nature

  • Resource constraints: High computational requirements

  • Real-time training: Online learning challenges

💡 Pro Tips

  1. Start simple: Try traditional methods first, then deep learning

  2. Data quality: Clean, consistent time series performs best

  3. Sequence length: Experiment with different lookback windows

  4. Regularization: Use dropout, early stopping to prevent overfitting

  5. Ensemble methods: Combine multiple models for better performance

  6. Cross-validation: Use time series cross-validation for evaluation

  7. Scaling: Normalize data appropriately for neural networks

  8. Hardware: GPU acceleration speeds up training significantly
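Tip 6 in action: scikit-learn's `TimeSeriesSplit` produces expanding-window folds where every test fold lies strictly after its training fold, so no future data leaks into training:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(X)):
    print(f"fold {fold}: train {train_idx.min()}-{train_idx.max()}, "
          f"test {test_idx.min()}-{test_idx.max()}")
# fold 0: train 0-2, test 3-5
# fold 1: train 0-5, test 6-8
# fold 2: train 0-8, test 9-11
```

Average the per-fold metrics for a more robust estimate than a single train/test split.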

🚀 Next Steps

Now that you understand deep learning for time series, you’re ready for:

  • Advanced Forecasting: Bayesian deep learning, uncertainty quantification

  • Multivariate Forecasting: Multiple time series simultaneously

  • Production Deployment: Model serving, monitoring, retraining

  • Time Series Transformers: Specialized architectures like Autoformer, Informer

  • Reinforcement Learning: Sequential decision making

📚 Further Reading

  • “Deep Learning” by Goodfellow et al.: Neural network fundamentals

  • “Attention Is All You Need”: Original Transformer paper

  • PyTorch Documentation: Deep learning implementation

  • Time Series Libraries: darts, gluonts, flow-forecast

Ready to explore advanced forecasting techniques? Let’s continue! 🧠📈