04: Deep Learning for Time Series

“The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.” - Edsger W. Dijkstra

Welcome to the cutting edge of time series forecasting! This notebook explores how deep learning models - specifically LSTMs, GRUs, and Transformers - can capture complex patterns in sequential data.

🎯 Learning Objectives

By the end of this notebook, you’ll be able to:

  • Understand sequence modeling with deep learning

  • Implement LSTM and GRU networks for forecasting

  • Apply Transformer architectures to time series

  • Handle sequence preprocessing and windowing

  • Compare deep learning with traditional methods

  • Deploy and monitor deep learning forecasts

🧠 Deep Learning for Sequences

Traditional forecasting methods like ARIMA work well for stationary data with clear patterns, but deep learning excels at:

Advantages:

  • Non-linear patterns: Complex relationships in data

  • Long-term dependencies: Remembering patterns over long sequences

  • Multiple variables: Multivariate forecasting

  • Automatic feature learning: No manual feature engineering

  • Scalability: Handle large datasets efficiently

Challenges:

  • Data requirements: Need substantial training data

  • Computational cost: More expensive to train

  • Interpretability: Black-box nature

  • Overfitting: Risk with insufficient data

  • Hyperparameter tuning: Many parameters to optimize

Key Architectures:

  • RNN/LSTM/GRU: Sequential processing with memory

  • CNN: Pattern recognition in sequences

  • Transformer: Attention-based sequence modeling

  • Autoencoders: Unsupervised feature learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['font.size'] = 12

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

print("Deep Learning Libraries Loaded!")
print(f"PyTorch version: {torch.__version__}")
def generate_complex_time_series(n_days=1000, noise_level=0.1):
    """Generate complex time series with multiple patterns"""
    
    dates = pd.date_range('2020-01-01', periods=n_days, freq='D')
    t = np.arange(n_days)
    
    # Multiple seasonal patterns
    daily_pattern = 0.5 * np.sin(2 * np.pi * t / 1)  # Period-1 cycle: sin(2*pi*integer) == 0, so this contributes nothing at daily sampling
    weekly_pattern = 2 * np.sin(2 * np.pi * t / 7)   # Weekly
    monthly_pattern = 3 * np.sin(2 * np.pi * t / 30) # Monthly
    yearly_pattern = 5 * np.sin(2 * np.pi * t / 365) # Yearly
    
    # Non-linear trend with changes
    trend = 0.001 * t + 0.00001 * t**2  # Quadratic trend
    
    # Add trend changes
    trend_changes = np.zeros(n_days)
    trend_changes[200:400] += 10  # Positive shock
    trend_changes[600:700] -= 15  # Negative shock
    
    # Complex interactions
    interaction = weekly_pattern * monthly_pattern * 0.1
    
    # External factors (simulated)
    external = np.random.choice([-2, -1, 0, 1, 2], n_days, p=[0.1, 0.2, 0.4, 0.2, 0.1])
    external = np.convolve(external, np.ones(7)/7, mode='same')  # Smooth external factors
    
    # Combine all components
    y = (trend + trend_changes + daily_pattern + weekly_pattern + 
         monthly_pattern + yearly_pattern + interaction + external)
    
    # Add noise
    noise = np.random.normal(0, noise_level * np.std(y), n_days)
    y += noise
    
    # Create DataFrame
    df = pd.DataFrame({
        'ds': dates,
        'y': y,
        'trend': trend,
        'daily': daily_pattern,
        'weekly': weekly_pattern,
        'monthly': monthly_pattern,
        'yearly': yearly_pattern,
        'interaction': interaction,
        'external': external,
        'noise': noise
    })
    
    return df

# Generate complex time series
complex_data = generate_complex_time_series(n_days=800)

print(f"Generated {len(complex_data)} days of complex time series data")
print(f"Date range: {complex_data['ds'].min()} to {complex_data['ds'].max()}")
print(f"Value range: {complex_data['y'].min():.2f} to {complex_data['y'].max():.2f}")

# Plot the data
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))

# Main time series
ax1.plot(complex_data['ds'], complex_data['y'], 'b-', linewidth=1.5, alpha=0.8)
ax1.set_title('Complex Time Series with Multiple Patterns')
ax1.set_xlabel('Date')
ax1.set_ylabel('Value')
ax1.grid(True, alpha=0.3)

# Component breakdown
components = ['trend', 'weekly', 'monthly', 'yearly', 'interaction']
colors = ['red', 'green', 'orange', 'purple', 'brown']
for i, comp in enumerate(components):
    ax2.plot(complex_data['ds'], complex_data[comp], 
             color=colors[i], linewidth=1.5, label=comp.capitalize())

ax2.set_title('Time Series Components')
ax2.set_xlabel('Date')
ax2.set_ylabel('Component Value')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Show data structure
print("\nData Structure:")
print(complex_data.head())
print("\nData Info:")
complex_data.info()  # info() prints directly and returns None

🔄 Sequence Preprocessing

Deep learning models require data in sequences/windows. Key preprocessing steps:

Sequence Creation:

  • Sliding windows: Fixed-size windows of historical data

  • Lookback window: How many past steps to use for prediction

  • Forecast horizon: How many steps ahead to predict

  • Stride: How much to slide the window (usually 1)
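The windowing described above can be sketched in a few lines of NumPy on a toy series; the `TimeSeriesDataset` class later in this notebook builds the same structure with an explicit loop. This is just an illustrative sketch for a horizon of 1:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10, dtype=float)  # toy series: 0, 1, ..., 9
lookback, horizon = 3, 1

windows = sliding_window_view(series, lookback)[:-horizon]  # model inputs
targets = series[lookback:]                                 # 1-step-ahead targets

print(windows.shape, targets.shape)  # (7, 3) (7,)
print(windows[0], targets[0])        # [0. 1. 2.] 3.0
```

Each input window of `lookback` values is paired with the value immediately following it; a stride of 1 means consecutive windows overlap by `lookback - 1` points.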

Data Preparation:

  • Scaling: Normalize data to [0,1] or [-1,1] range

  • Train/Val/Test split: Respect temporal order

  • Batch processing: Group sequences into batches

  • Sequence padding: Handle variable-length sequences
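On the padding bullet: PyTorch's `pad_sequence` utility zero-pads a batch of variable-length sequences to a common length. All sequences in this notebook share one fixed lookback, so this is only needed when inputs are ragged:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1., 2., 3.]), torch.tensor([4., 5.])]
padded = pad_sequence(seqs, batch_first=True)  # zero-pads to the longest sequence
print(padded)
# tensor([[1., 2., 3.],
#         [4., 5., 0.]])
```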

PyTorch Datasets:

  • Custom Dataset class: Handle sequence loading

  • DataLoader: Efficient batch processing

  • Collate functions: Custom batch preparation

class TimeSeriesDataset(Dataset):
    """Custom dataset for time series forecasting"""
    
    def __init__(self, data, lookback=30, forecast_horizon=1, target_col='y'):
        """
        Args:
            data: DataFrame with time series data
            lookback: Number of past time steps to use
            forecast_horizon: Number of steps ahead to predict
            target_col: Column name of target variable
        """
        self.data = data[target_col].values
        self.lookback = lookback
        self.forecast_horizon = forecast_horizon
        
        # Create sequences
        self.sequences = []
        self.targets = []
        
        for i in range(len(self.data) - lookback - forecast_horizon + 1):
            seq = self.data[i:i + lookback]
            target = self.data[i + lookback:i + lookback + forecast_horizon]
            self.sequences.append(seq)
            self.targets.append(target)
        
        self.sequences = np.array(self.sequences)
        self.targets = np.array(self.targets)
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, idx):
        return (
            torch.FloatTensor(self.sequences[idx]),
            torch.FloatTensor(self.targets[idx])
        )

def prepare_time_series_data(data, lookback=30, forecast_horizon=1, train_ratio=0.7, val_ratio=0.15):
    """Prepare time series data for deep learning"""
    
    # Compute split sizes first so the scaler can be fit on the training portion only
    # (fitting on the full series would leak test-set statistics into training)
    n_total = len(data)
    n_train = int(n_total * train_ratio)
    n_val = int(n_total * val_ratio)
    
    # Scale the data: fit on train, transform everything
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler.fit(data[['y']].iloc[:n_train])
    
    scaled_df = data.copy()
    scaled_df['y_scaled'] = scaler.transform(data[['y']])
    
    # Split data respecting temporal order
    train_data = scaled_df[:n_train]
    val_data = scaled_df[n_train:n_train + n_val]
    test_data = scaled_df[n_train + n_val:]
    
    print(f"Data split: Train={len(train_data)}, Val={len(val_data)}, Test={len(test_data)}")
    
    # Create datasets
    train_dataset = TimeSeriesDataset(train_data, lookback, forecast_horizon, 'y_scaled')
    val_dataset = TimeSeriesDataset(val_data, lookback, forecast_horizon, 'y_scaled')
    test_dataset = TimeSeriesDataset(test_data, lookback, forecast_horizon, 'y_scaled')
    
    # Create data loaders
    batch_size = 32
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    return (train_loader, val_loader, test_loader, scaler, 
            train_data, val_data, test_data)

# Prepare data
lookback = 30  # Use 30 days of history
forecast_horizon = 1  # Predict 1 day ahead

(train_loader, val_loader, test_loader, scaler, 
 train_data, val_data, test_data) = prepare_time_series_data(
    complex_data, lookback=lookback, forecast_horizon=forecast_horizon
)

# Show sample sequence
sample_seq, sample_target = next(iter(train_loader))
print(f"Sample sequence shape: {sample_seq.shape}")
print(f"Sample target shape: {sample_target.shape}")
print(f"First sequence (first 5 values): {sample_seq[0][:5].numpy()}")
print(f"Corresponding target: {sample_target[0].numpy()}")

# Visualize sequence creation
fig, ax = plt.subplots(figsize=(15, 6))

# Plot original data
ax.plot(complex_data['ds'], complex_data['y'], 'b-', alpha=0.7, linewidth=1, label='Original Data')

# Highlight a sample sequence
sample_idx = 100
seq_start = sample_idx
seq_end = sample_idx + lookback
target_idx = seq_end

ax.plot(complex_data['ds'][seq_start:seq_end], 
        complex_data['y'][seq_start:seq_end], 
        'r-', linewidth=3, label='Input Sequence')
ax.plot(complex_data['ds'][target_idx], 
        complex_data['y'][target_idx], 
        'go', markersize=10, label='Target Value')

# Add vertical lines
ax.axvline(x=complex_data['ds'][seq_start], color='r', linestyle='--', alpha=0.7)
ax.axvline(x=complex_data['ds'][seq_end-1], color='r', linestyle='--', alpha=0.7)
ax.axvline(x=complex_data['ds'][target_idx], color='g', linestyle='--', alpha=0.7)

ax.set_title('Sequence Creation Example')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

🧠 LSTM Networks

Long Short-Term Memory (LSTM) networks are the workhorse of sequence modeling.

LSTM Architecture:

  • Forget Gate: Controls what information to discard

  • Input Gate: Controls what new information to store

  • Output Gate: Controls what information to output

  • Cell State: Long-term memory of the network

  • Hidden State: Short-term memory passed to next timestep

Key Equations:

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\]
\[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
\[C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t\]
\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\]
\[h_t = o_t \cdot \tanh(C_t)\]
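As a numerical sanity check (not part of the forecasting pipeline), the gate equations above can be computed by hand and compared against PyTorch's `nn.LSTMCell`. One PyTorch-specific detail: the four gates' weights are stacked in (input, forget, cell, output) order:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
cell = nn.LSTMCell(input_size=3, hidden_size=5)
x = torch.randn(1, 3)
h, c = torch.zeros(1, 5), torch.zeros(1, 5)

# PyTorch stacks gate weights in (input, forget, cell, output) order
gates = x @ cell.weight_ih.T + h @ cell.weight_hh.T + cell.bias_ih + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)

i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)              # candidate cell state, C~_t
c_new = f * c + i * g          # C_t = f_t * C_{t-1} + i_t * C~_t
h_new = o * torch.tanh(c_new)  # h_t = o_t * tanh(C_t)

h_ref, c_ref = cell(x, (h, c))
print(torch.allclose(h_new, h_ref, atol=1e-6),
      torch.allclose(c_new, c_ref, atol=1e-6))  # True True
```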

Advantages:

  • Long-term dependencies: Can remember patterns over long sequences

  • Gradient flow: Gating mitigates the vanishing-gradient problem

  • Flexible: Can be stacked and combined with other layers

class LSTMModel(nn.Module):
    """LSTM model for time series forecasting"""
    
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, output_size=1, dropout=0.2):
        super(LSTMModel, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_size, output_size)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        
        # Forward pass through LSTM
        out, _ = self.lstm(x, (h0, c0))
        
        # Take the last time step output
        out = out[:, -1, :]
        
        # Apply dropout and fully connected layer
        out = self.dropout(out)
        out = self.fc(out)
        
        return out

def train_lstm_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
    """Train LSTM model"""
    
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    train_losses = []
    val_losses = []
    
    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0
        for sequences, targets in train_loader:
            sequences, targets = sequences.to(device), targets.to(device)
            
            # Reshape for LSTM (batch_size, seq_len, input_size)
            sequences = sequences.unsqueeze(-1)
            
            # Forward pass
            outputs = model(sequences)
            loss = criterion(outputs, targets)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for sequences, targets in val_loader:
                sequences, targets = sequences.to(device), targets.to(device)
                sequences = sequences.unsqueeze(-1)
                
                outputs = model(sequences)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
    
    return train_losses, val_losses

# Create and train LSTM model
lstm_model = LSTMModel(input_size=1, hidden_size=64, num_layers=2, output_size=1)

print("Training LSTM model...")
print(f"Model parameters: {sum(p.numel() for p in lstm_model.parameters())}")

train_losses, val_losses = train_lstm_model(lstm_model, train_loader, val_loader, num_epochs=50)

# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('LSTM Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Evaluate on test set
def evaluate_model(model, test_loader, scaler):
    """Evaluate model performance"""
    model.eval()
    predictions = []
    actuals = []
    
    with torch.no_grad():
        for sequences, targets in test_loader:
            sequences, targets = sequences.to(device), targets.to(device)
            sequences = sequences.unsqueeze(-1)
            
            outputs = model(sequences)
            
            predictions.extend(outputs.cpu().numpy())
            actuals.extend(targets.cpu().numpy())
    
    # Inverse transform predictions
    predictions = np.array(predictions).reshape(-1, 1)
    actuals = np.array(actuals).reshape(-1, 1)
    
    predictions_inv = scaler.inverse_transform(predictions)
    actuals_inv = scaler.inverse_transform(actuals)
    
    mae = mean_absolute_error(actuals_inv, predictions_inv)
    rmse = np.sqrt(mean_squared_error(actuals_inv, predictions_inv))
    
    return predictions_inv, actuals_inv, mae, rmse

# Evaluate LSTM
lstm_pred, lstm_actual, lstm_mae, lstm_rmse = evaluate_model(lstm_model, test_loader, scaler)

print(f"\nLSTM Test Performance:")
print(f"MAE: {lstm_mae:.4f}")
print(f"RMSE: {lstm_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((lstm_actual - lstm_pred) / lstm_actual))*100:.2f}%")

🚀 GRU Networks

Gated Recurrent Units (GRUs) are a simplified alternative to LSTMs that often achieve similar performance.

GRU Architecture:

  • Reset Gate: Controls what information to forget

  • Update Gate: Controls what information to update

  • No separate cell state: Hidden state serves both purposes

Key Equations:

\[r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)\]
\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)\]
\[\tilde{h}_t = \tanh(W \cdot [r_t \cdot h_{t-1}, x_t] + b)\]
\[h_t = (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t\]
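The same hand-computed sanity check works for `nn.GRUCell`. Two PyTorch-specific details: the hidden-side bias `b_hn` sits inside the reset-gate product, and PyTorch's update gate is the complement of the one in the equations above (its z_t gates the *old* state, i.e. h_t = (1 - z_t)·h̃_t + z_t·h_{t-1}):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
cell = nn.GRUCell(input_size=3, hidden_size=5)
x = torch.randn(1, 3)
h = torch.zeros(1, 5)

# PyTorch stacks gate weights in (reset, update, new) order
W_ir, W_iz, W_in = cell.weight_ih.chunk(3)
W_hr, W_hz, W_hn = cell.weight_hh.chunk(3)
b_ir, b_iz, b_in = cell.bias_ih.chunk(3)
b_hr, b_hz, b_hn = cell.bias_hh.chunk(3)

r = torch.sigmoid(x @ W_ir.T + b_ir + h @ W_hr.T + b_hr)
z = torch.sigmoid(x @ W_iz.T + b_iz + h @ W_hz.T + b_hz)
n = torch.tanh(x @ W_in.T + b_in + r * (h @ W_hn.T + b_hn))  # candidate h~_t
h_new = (1 - z) * n + z * h  # PyTorch convention: z gates the old state

h_ref = cell(x, h)
print(torch.allclose(h_new, h_ref, atol=1e-6))  # True
```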

Advantages over LSTM:

  • Fewer parameters: Simpler architecture

  • Faster training: Less computation

  • Similar performance: Often comparable to LSTMs

  • Easier to implement: Less complex gates

class GRUModel(nn.Module):
    """GRU model for time series forecasting"""
    
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, output_size=1, dropout=0.2):
        super(GRUModel, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # GRU layer
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_size, output_size)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        
        # Forward pass through GRU
        out, _ = self.gru(x, h0)
        
        # Take the last time step output
        out = out[:, -1, :]
        
        # Apply dropout and fully connected layer
        out = self.dropout(out)
        out = self.fc(out)
        
        return out

def train_gru_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
    """Train GRU model"""
    
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    
    train_losses = []
    val_losses = []
    
    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0
        for sequences, targets in train_loader:
            sequences, targets = sequences.to(device), targets.to(device)
            sequences = sequences.unsqueeze(-1)
            
            outputs = model(sequences)
            loss = criterion(outputs, targets)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for sequences, targets in val_loader:
                sequences, targets = sequences.to(device), targets.to(device)
                sequences = sequences.unsqueeze(-1)
                
                outputs = model(sequences)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
    
    return train_losses, val_losses

# Create and train GRU model
gru_model = GRUModel(input_size=1, hidden_size=64, num_layers=2, output_size=1)

print("\nTraining GRU model...")
print(f"Model parameters: {sum(p.numel() for p in gru_model.parameters())}")

gru_train_losses, gru_val_losses = train_gru_model(gru_model, train_loader, val_loader, num_epochs=50)

# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(gru_train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(gru_val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('GRU Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Evaluate GRU
gru_pred, gru_actual, gru_mae, gru_rmse = evaluate_model(gru_model, test_loader, scaler)

print(f"\nGRU Test Performance:")
print(f"MAE: {gru_mae:.4f}")
print(f"RMSE: {gru_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((gru_actual - gru_pred) / gru_actual))*100:.2f}%")

# Compare LSTM vs GRU
print(f"\n=== Model Comparison ===")
print(f"LSTM - MAE: {lstm_mae:.4f}, RMSE: {lstm_rmse:.4f}")
print(f"GRU  - MAE: {gru_mae:.4f}, RMSE: {gru_rmse:.4f}")
print(f"MAE ratio (LSTM/GRU): {lstm_mae/gru_mae:.2f} (values > 1 mean GRU was more accurate on this run)")
print(f"Parameter ratio (GRU/LSTM): {sum(p.numel() for p in gru_model.parameters()) / sum(p.numel() for p in lstm_model.parameters()):.2f}")

πŸ” Transformer ArchitectureΒΆ

Transformer models, originally designed for NLP, are revolutionizing time series forecasting.

Key Components:

  • Self-Attention: Learn relationships between all time steps

  • Multi-Head Attention: Multiple attention mechanisms

  • Positional Encoding: Add temporal information

  • Feed-Forward Networks: Process attention outputs

  • Layer Normalization: Stabilize training

Attention Mechanism:

\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where Q, K, V are Query, Key, Value matrices.
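A minimal, from-scratch implementation of this formula (batch-first self-attention, so Q, K, and V all come from the same sequence; `nn.MultiheadAttention` adds learned projections and multiple heads on top of this core):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V, weights

torch.manual_seed(42)
x = torch.randn(2, 10, 16)                 # self-attention: Q = K = V = x
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.shape)                  # torch.Size([2, 10, 16]) torch.Size([2, 10, 10])
```

Each output position is a weighted average of all value vectors, with weights given by query-key similarity.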

Advantages:

  • Parallel processing: No sequential dependencies

  • Long-range dependencies: Attention can span entire sequence

  • Scalable: Handles long sequences well, though attention cost grows quadratically with length

  • Flexible: Can be adapted to various tasks

Time Series Specifics:

  • Temporal positional encoding: Sinusoidal or learned

  • Causal masking: Prevent future information leakage

  • Patch-based processing: Divide time series into patches
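The causal-masking bullet can be illustrated with the boolean attention mask `nn.TransformerEncoder` accepts: `True` entries are blocked, so each position only attends to itself and earlier steps (`nn.Transformer.generate_square_subsequent_mask` builds an equivalent mask in float form):

```python
import torch

seq_len = 5
# Boolean mask: True means "may NOT attend"; the strictly upper-triangular
# entries block every position from seeing future time steps
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask.int())
```

With an encoder like the one defined later in this notebook, the mask is passed as `self.transformer_encoder(x, mask=causal_mask)`.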

class PositionalEncoding(nn.Module):
    """Positional encoding for Transformer"""
    
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)  # (1, max_len, d_model) to match batch-first inputs
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        # x: (batch, seq_len, d_model); slice the encoding along the sequence dimension
        return x + self.pe[:, :x.size(1), :]

class TransformerModel(nn.Module):
    """Transformer model for time series forecasting"""
    
    def __init__(self, input_size=1, d_model=64, nhead=8, num_layers=2, 
                 dim_feedforward=128, output_size=1, dropout=0.1):
        super(TransformerModel, self).__init__()
        
        self.input_size = input_size
        self.d_model = d_model
        
        # Input projection
        self.input_projection = nn.Linear(input_size, d_model)
        
        # Positional encoding
        self.pos_encoder = PositionalEncoding(d_model)
        
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )
        
        # Output projection
        self.output_projection = nn.Linear(d_model, output_size)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Input projection
        x = self.input_projection(x) * np.sqrt(self.d_model)
        
        # Add positional encoding
        x = self.pos_encoder(x)
        
        # Transformer encoding
        x = self.transformer_encoder(x)
        
        # Take the last time step output
        x = x[:, -1, :]
        
        # Output projection
        x = self.dropout(x)
        x = self.output_projection(x)
        
        return x

def train_transformer_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
    """Train Transformer model"""
    
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    
    train_losses = []
    val_losses = []
    
    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0
        for sequences, targets in train_loader:
            sequences, targets = sequences.to(device), targets.to(device)
            sequences = sequences.unsqueeze(-1)
            
            outputs = model(sequences)
            loss = criterion(outputs, targets)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for sequences, targets in val_loader:
                sequences, targets = sequences.to(device), targets.to(device)
                sequences = sequences.unsqueeze(-1)
                
                outputs = model(sequences)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        
        scheduler.step()
        
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
    
    return train_losses, val_losses

# Create and train Transformer model
transformer_model = TransformerModel(
    input_size=1, 
    d_model=64, 
    nhead=8, 
    num_layers=2, 
    output_size=1
)

print("\nTraining Transformer model...")
print(f"Model parameters: {sum(p.numel() for p in transformer_model.parameters())}")

transformer_train_losses, transformer_val_losses = train_transformer_model(
    transformer_model, train_loader, val_loader, num_epochs=50
)

# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(transformer_train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(transformer_val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('Transformer Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Evaluate Transformer
transformer_pred, transformer_actual, transformer_mae, transformer_rmse = evaluate_model(
    transformer_model, test_loader, scaler
)

print(f"\nTransformer Test Performance:")
print(f"MAE: {transformer_mae:.4f}")
print(f"RMSE: {transformer_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((transformer_actual - transformer_pred) / transformer_actual))*100:.2f}%")

πŸ† Model ComparisonΒΆ

Comparing deep learning models against traditional methods and each other.

Performance Metrics:

  • MAE: Mean Absolute Error

  • RMSE: Root Mean Squared Error

  • MAPE: Mean Absolute Percentage Error

  • Training time: Computational efficiency

  • Parameters: Model complexity
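The first three metrics can be packaged in a small helper. The `eps` guard is an addition made here, not part of the notebook's pipeline: plain MAPE blows up when an actual value is at or near zero, which this synthetic series can produce:

```python
import numpy as np

def forecast_metrics(actual, pred, eps=1e-8):
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    mae = np.mean(np.abs(actual - pred))
    rmse = np.sqrt(np.mean((actual - pred) ** 2))
    # eps keeps MAPE finite when an actual value is (near) zero
    mape = np.mean(np.abs((actual - pred) / (actual + eps))) * 100
    return mae, rmse, mape

mae, rmse, mape = forecast_metrics([10, 20, 30], [12, 18, 33])
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, MAPE={mape:.2f}%")  # MAE=2.333, RMSE=2.380, MAPE=13.33%
```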

Key Insights:

  • LSTM vs GRU: GRU often performs similarly with fewer parameters

  • Transformer: Can excel with large datasets and long sequences

  • Data size matters: Deep learning needs sufficient training data

  • Hyperparameter tuning: Critical for optimal performance

When to Use Each:

  • LSTM: Complex patterns, sufficient data, long-term memory needed

  • GRU: Simpler problems, faster training, resource constraints

  • Transformer: Very long sequences, parallel processing, large datasets

# Compare all models
models_comparison = {
    'LSTM': {
        'MAE': lstm_mae,
        'RMSE': lstm_rmse,
        'Parameters': sum(p.numel() for p in lstm_model.parameters()),
        'Predictions': lstm_pred,
        'Actuals': lstm_actual
    },
    'GRU': {
        'MAE': gru_mae,
        'RMSE': gru_rmse,
        'Parameters': sum(p.numel() for p in gru_model.parameters()),
        'Predictions': gru_pred,
        'Actuals': gru_actual
    },
    'Transformer': {
        'MAE': transformer_mae,
        'RMSE': transformer_rmse,
        'Parameters': sum(p.numel() for p in transformer_model.parameters()),
        'Predictions': transformer_pred,
        'Actuals': transformer_actual
    }
}

# Print comparison table
print("=== Deep Learning Model Comparison ===")
print("Model         | MAE      | RMSE     | Parameters | MAPE")
print("-" * 55)
for name, metrics in models_comparison.items():
    mape = np.mean(np.abs((metrics['Actuals'] - metrics['Predictions']) / metrics['Actuals'])) * 100
    print(f"{name:12} | {metrics['MAE']:.4f} | {metrics['RMSE']:.4f} | {metrics['Parameters']:10} | {mape:.2f}%")

# Find best model
best_model = min(models_comparison.items(), key=lambda x: x[1]['MAE'])
print(f"\nπŸ† Best performing model: {best_model[0]} (MAE: {best_model[1]['MAE']:.4f})")

# Plot predictions comparison
plt.figure(figsize=(15, 10))

# Plot actual values
plt.subplot(2, 1, 1)
plt.plot(models_comparison['LSTM']['Actuals'][:100], 'k-', linewidth=2, label='Actual', alpha=0.8)
plt.plot(models_comparison['LSTM']['Predictions'][:100], 'b-', linewidth=2, label='LSTM')
plt.plot(models_comparison['GRU']['Predictions'][:100], 'r-', linewidth=2, label='GRU')
plt.plot(models_comparison['Transformer']['Predictions'][:100], 'g-', linewidth=2, label='Transformer')
plt.title('Model Predictions Comparison (First 100 Test Points)')
plt.xlabel('Time Step')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot prediction errors
plt.subplot(2, 1, 2)
lstm_errors = np.abs(models_comparison['LSTM']['Actuals'] - models_comparison['LSTM']['Predictions'])
gru_errors = np.abs(models_comparison['GRU']['Actuals'] - models_comparison['GRU']['Predictions'])
transformer_errors = np.abs(models_comparison['Transformer']['Actuals'] - models_comparison['Transformer']['Predictions'])

plt.plot(lstm_errors[:100], 'b-', linewidth=1.5, label='LSTM Error', alpha=0.7)
plt.plot(gru_errors[:100], 'r-', linewidth=1.5, label='GRU Error', alpha=0.7)
plt.plot(transformer_errors[:100], 'g-', linewidth=1.5, label='Transformer Error', alpha=0.7)
plt.title('Prediction Errors Comparison')
plt.xlabel('Time Step')
plt.ylabel('Absolute Error')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Error distribution analysis
plt.figure(figsize=(15, 6))

plt.subplot(1, 3, 1)
plt.hist(lstm_errors, bins=30, alpha=0.7, color='blue', density=True)
plt.title('LSTM Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
plt.hist(gru_errors, bins=30, alpha=0.7, color='red', density=True)
plt.title('GRU Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
plt.hist(transformer_errors, bins=30, alpha=0.7, color='green', density=True)
plt.title('Transformer Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical summary
print("\n=== Error Statistics ===")
for name, metrics in models_comparison.items():
    errors = np.abs(metrics['Actuals'] - metrics['Predictions'])
    print(f"{name}:")
    print(f"  Mean Error: {errors.mean():.4f}")
    print(f"  Std Error: {errors.std():.4f}")
    print(f"  Median Error: {np.median(errors):.4f}")
    print(f"  95th Percentile: {np.percentile(errors, 95):.4f}")
    print()

🎯 Key Takeaways

  1. Deep learning excels at capturing complex, non-linear patterns in time series

  2. Sequence preprocessing is crucial - proper windowing and scaling matters

  3. LSTM vs GRU: GRU often provides similar performance with fewer parameters

  4. Transformers can handle very long sequences and parallel processing

  5. Data requirements: Deep learning needs substantial training data

  6. Hyperparameter tuning: Critical for optimal performance

  7. Computational cost: More expensive than traditional methods

πŸ” When to Use Deep LearningΒΆ

✅ Good For:

  • Complex patterns: Non-linear relationships, interactions

  • Long sequences: Extended historical context needed

  • Large datasets: Sufficient training data available

  • Multiple variables: Multivariate forecasting

  • Real-time processing: Fast inference after training

❌ Less Ideal For:

  • Small datasets: Traditional methods work better

  • Simple patterns: ARIMA, Prophet may suffice

  • Interpretability: Black-box nature

  • Resource constraints: High computational requirements

  • Real-time training: Online learning challenges

💡 Pro Tips

  1. Start simple: Try traditional methods first, then deep learning

  2. Data quality: Clean, consistent time series performs best

  3. Sequence length: Experiment with different lookback windows

  4. Regularization: Use dropout, early stopping to prevent overfitting

  5. Ensemble methods: Combine multiple models for better performance

  6. Cross-validation: Use time series cross-validation for evaluation

  7. Scaling: Normalize data appropriately for neural networks

  8. Hardware: GPU acceleration speeds up training significantly
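Tip 6 in action: scikit-learn's `TimeSeriesSplit` produces expanding-window folds where every test fold lies strictly after its training fold, so no future data leaks into training:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(X)):
    print(f"fold {fold}: train {train_idx.min()}-{train_idx.max()}, "
          f"test {test_idx.min()}-{test_idx.max()}")
# fold 0: train 0-2, test 3-5
# fold 1: train 0-5, test 6-8
# fold 2: train 0-8, test 9-11
```

Average the per-fold metrics for a more robust estimate than a single train/test split.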

🚀 Next Steps

Now that you understand deep learning for time series, you’re ready for:

  • Advanced Forecasting: Bayesian deep learning, uncertainty quantification

  • Multivariate Forecasting: Multiple time series simultaneously

  • Production Deployment: Model serving, monitoring, retraining

  • Time Series Transformers: Specialized architectures like Autoformer, Informer

  • Reinforcement Learning: Sequential decision making

📚 Further Reading

  • “Deep Learning” by Goodfellow et al.: Neural network fundamentals

  • “Attention Is All You Need”: Original Transformer paper

  • PyTorch Documentation: Deep learning implementation

  • Time Series Libraries: darts, gluonts, flow-forecast

Ready to explore advanced forecasting techniques? Let’s continue! 🧠📈