04: Deep Learning for Time Series¶
"The question of whether a computer can think is no more interesting than the question of whether a submarine can swim." - Edsger W. Dijkstra
Welcome to the cutting edge of time series forecasting! This notebook explores how deep learning models - specifically LSTMs, GRUs, and Transformers - can capture complex patterns in sequential data.
Learning Objectives¶
By the end of this notebook, you'll be able to:
Understand sequence modeling with deep learning
Implement LSTM and GRU networks for forecasting
Apply Transformer architectures to time series
Handle sequence preprocessing and windowing
Compare deep learning with traditional methods
Deploy and monitor deep learning forecasts
Deep Learning for Sequences¶
Traditional forecasting methods like ARIMA work well for stationary data with clear patterns, but deep learning excels at:
Advantages:ΒΆ
Non-linear patterns: Complex relationships in data
Long-term dependencies: Remembering patterns over long sequences
Multiple variables: Multivariate forecasting
Automatic feature learning: No manual feature engineering
Scalability: Handle large datasets efficiently
Challenges:ΒΆ
Data requirements: Need substantial training data
Computational cost: More expensive to train
Interpretability: Black-box nature
Overfitting: Risk with insufficient data
Hyperparameter tuning: Many parameters to optimize
Key Architectures:ΒΆ
RNN/LSTM/GRU: Sequential processing with memory
CNN: Pattern recognition in sequences
Transformer: Attention-based sequence modeling
Autoencoders: Unsupervised feature learning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings('ignore')
# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['font.size'] = 12
# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
print("Deep Learning Libraries Loaded!")
print(f"PyTorch version: {torch.__version__}")
def generate_complex_time_series(n_days=1000, noise_level=0.1):
"""Generate complex time series with multiple patterns"""
dates = pd.date_range('2020-01-01', periods=n_days, freq='D')
t = np.arange(n_days)
# Multiple seasonal patterns
    daily_pattern = 0.5 * np.sin(2 * np.pi * t)  # Note: effectively zero at integer t (daily sampling); placeholder for sub-daily cycles
weekly_pattern = 2 * np.sin(2 * np.pi * t / 7) # Weekly
monthly_pattern = 3 * np.sin(2 * np.pi * t / 30) # Monthly
yearly_pattern = 5 * np.sin(2 * np.pi * t / 365) # Yearly
# Non-linear trend with changes
trend = 0.001 * t + 0.00001 * t**2 # Quadratic trend
# Add trend changes
trend_changes = np.zeros(n_days)
trend_changes[200:400] += 10 # Positive shock
trend_changes[600:700] -= 15 # Negative shock
# Complex interactions
interaction = weekly_pattern * monthly_pattern * 0.1
# External factors (simulated)
external = np.random.choice([-2, -1, 0, 1, 2], n_days, p=[0.1, 0.2, 0.4, 0.2, 0.1])
external = np.convolve(external, np.ones(7)/7, mode='same') # Smooth external factors
# Combine all components
y = (trend + trend_changes + daily_pattern + weekly_pattern +
monthly_pattern + yearly_pattern + interaction + external)
# Add noise
noise = np.random.normal(0, noise_level * np.std(y), n_days)
y += noise
# Create DataFrame
df = pd.DataFrame({
'ds': dates,
'y': y,
'trend': trend,
'daily': daily_pattern,
'weekly': weekly_pattern,
'monthly': monthly_pattern,
'yearly': yearly_pattern,
'interaction': interaction,
'external': external,
'noise': noise
})
return df
# Generate complex time series
complex_data = generate_complex_time_series(n_days=800)
print(f"Generated {len(complex_data)} days of complex time series data")
print(f"Date range: {complex_data['ds'].min()} to {complex_data['ds'].max()}")
print(f"Value range: {complex_data['y'].min():.2f} to {complex_data['y'].max():.2f}")
# Plot the data
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))
# Main time series
ax1.plot(complex_data['ds'], complex_data['y'], 'b-', linewidth=1.5, alpha=0.8)
ax1.set_title('Complex Time Series with Multiple Patterns')
ax1.set_xlabel('Date')
ax1.set_ylabel('Value')
ax1.grid(True, alpha=0.3)
# Component breakdown
components = ['trend', 'weekly', 'monthly', 'yearly', 'interaction']
colors = ['red', 'green', 'orange', 'purple', 'brown']
for i, comp in enumerate(components):
ax2.plot(complex_data['ds'], complex_data[comp],
color=colors[i], linewidth=1.5, label=comp.capitalize())
ax2.set_title('Time Series Components')
ax2.set_xlabel('Date')
ax2.set_ylabel('Component Value')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Show data structure
print("\nData Structure:")
print(complex_data.head())
print("\nData Info:")
print(complex_data.info())
Sequence Preprocessing¶
Deep learning models require data in sequences/windows. Key preprocessing steps:
Sequence Creation:ΒΆ
Sliding windows: Fixed-size windows of historical data
Lookback window: How many past steps to use for prediction
Forecast horizon: How many steps ahead to predict
Stride: How much to slide the window (usually 1)
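The sliding-window idea above can be sketched directly with NumPy, independently of the PyTorch `Dataset` class used later in this notebook (a toy series, not the generated data):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10, dtype=float)   # toy series: 0..9
lookback, horizon = 3, 1

# Each row is one input window; the target is the value `horizon` steps after it
windows = sliding_window_view(series, lookback)[:-horizon]
targets = series[lookback:]

print(windows[0], targets[0])         # [0. 1. 2.] 3.0
print(windows.shape, targets.shape)   # (7, 3) (7,)
```

With stride 1, a series of length `n` yields `n - lookback - horizon + 1` windows, which matches the loop bound in the `TimeSeriesDataset` class below.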
Data Preparation:ΒΆ
Scaling: Normalize data to [0,1] or [-1,1] range
Train/Val/Test split: Respect temporal order
Batch processing: Group sequences into batches
Sequence padding: Handle variable-length sequences
PyTorch Datasets:ΒΆ
Custom Dataset class: Handle sequence loading
DataLoader: Efficient batch processing
Collate functions: Custom batch preparation
class TimeSeriesDataset(Dataset):
"""Custom dataset for time series forecasting"""
def __init__(self, data, lookback=30, forecast_horizon=1, target_col='y'):
"""
Args:
data: DataFrame with time series data
lookback: Number of past time steps to use
forecast_horizon: Number of steps ahead to predict
target_col: Column name of target variable
"""
self.data = data[target_col].values
self.lookback = lookback
self.forecast_horizon = forecast_horizon
# Create sequences
self.sequences = []
self.targets = []
for i in range(len(self.data) - lookback - forecast_horizon + 1):
seq = self.data[i:i + lookback]
target = self.data[i + lookback:i + lookback + forecast_horizon]
self.sequences.append(seq)
self.targets.append(target)
self.sequences = np.array(self.sequences)
self.targets = np.array(self.targets)
def __len__(self):
return len(self.sequences)
def __getitem__(self, idx):
return (
torch.FloatTensor(self.sequences[idx]),
torch.FloatTensor(self.targets[idx])
)
def prepare_time_series_data(data, lookback=30, forecast_horizon=1, train_ratio=0.7, val_ratio=0.15):
"""Prepare time series data for deep learning"""
# Scale the data
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled_data = scaler.fit_transform(data[['y']])
# Create scaled DataFrame
scaled_df = data.copy()
scaled_df['y_scaled'] = scaled_data
# Split data respecting temporal order
n_total = len(scaled_df)
n_train = int(n_total * train_ratio)
n_val = int(n_total * val_ratio)
n_test = n_total - n_train - n_val
train_data = scaled_df[:n_train]
val_data = scaled_df[n_train:n_train + n_val]
test_data = scaled_df[n_train + n_val:]
print(f"Data split: Train={len(train_data)}, Val={len(val_data)}, Test={len(test_data)}")
# Create datasets
train_dataset = TimeSeriesDataset(train_data, lookback, forecast_horizon, 'y_scaled')
val_dataset = TimeSeriesDataset(val_data, lookback, forecast_horizon, 'y_scaled')
test_dataset = TimeSeriesDataset(test_data, lookback, forecast_horizon, 'y_scaled')
# Create data loaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
return (train_loader, val_loader, test_loader, scaler,
train_data, val_data, test_data)
# Prepare data
lookback = 30 # Use 30 days of history
forecast_horizon = 1 # Predict 1 day ahead
(train_loader, val_loader, test_loader, scaler,
train_data, val_data, test_data) = prepare_time_series_data(
complex_data, lookback=lookback, forecast_horizon=forecast_horizon
)
# Show sample sequence
sample_seq, sample_target = next(iter(train_loader))
print(f"Sample sequence shape: {sample_seq.shape}")
print(f"Sample target shape: {sample_target.shape}")
print(f"First sequence (first 5 values): {sample_seq[0][:5].numpy()}")
print(f"Corresponding target: {sample_target[0].numpy()}")
# Visualize sequence creation
fig, ax = plt.subplots(figsize=(15, 6))
# Plot original data
ax.plot(complex_data['ds'], complex_data['y'], 'b-', alpha=0.7, linewidth=1, label='Original Data')
# Highlight a sample sequence
sample_idx = 100
seq_start = sample_idx
seq_end = sample_idx + lookback
target_idx = seq_end
ax.plot(complex_data['ds'][seq_start:seq_end],
complex_data['y'][seq_start:seq_end],
'r-', linewidth=3, label='Input Sequence')
ax.plot(complex_data['ds'][target_idx],
complex_data['y'][target_idx],
'go', markersize=10, label='Target Value')
# Add vertical lines
ax.axvline(x=complex_data['ds'][seq_start], color='r', linestyle='--', alpha=0.7)
ax.axvline(x=complex_data['ds'][seq_end-1], color='r', linestyle='--', alpha=0.7)
ax.axvline(x=complex_data['ds'][target_idx], color='g', linestyle='--', alpha=0.7)
ax.set_title('Sequence Creation Example')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
LSTM Networks¶
Long Short-Term Memory (LSTM) networks are the workhorse of sequence modeling.
LSTM Architecture:ΒΆ
Forget Gate: Controls what information to discard
Input Gate: Controls what new information to store
Output Gate: Controls what information to output
Cell State: Long-term memory of the network
Hidden State: Short-term memory passed to next timestep
Key Equations:ΒΆ
Advantages:ΒΆ
Long-term dependencies: Can remember patterns over long sequences
Gradient flow: LSTM gates prevent vanishing gradients
Flexible: Can be stacked and combined with other layers
class LSTMModel(nn.Module):
"""LSTM model for time series forecasting"""
def __init__(self, input_size=1, hidden_size=64, num_layers=2, output_size=1, dropout=0.2):
super(LSTMModel, self).__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
# LSTM layer
self.lstm = nn.LSTM(
input_size=input_size,
hidden_size=hidden_size,
num_layers=num_layers,
dropout=dropout if num_layers > 1 else 0,
batch_first=True
)
# Fully connected layer
self.fc = nn.Linear(hidden_size, output_size)
# Dropout for regularization
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Initialize hidden and cell states
h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
# Forward pass through LSTM
out, _ = self.lstm(x, (h0, c0))
# Take the last time step output
out = out[:, -1, :]
# Apply dropout and fully connected layer
out = self.dropout(out)
out = self.fc(out)
return out
def train_lstm_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
"""Train LSTM model"""
model.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
train_losses = []
val_losses = []
for epoch in range(num_epochs):
# Training
model.train()
train_loss = 0
for sequences, targets in train_loader:
sequences, targets = sequences.to(device), targets.to(device)
# Reshape for LSTM (batch_size, seq_len, input_size)
sequences = sequences.unsqueeze(-1)
# Forward pass
outputs = model(sequences)
loss = criterion(outputs, targets)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_loss += loss.item()
train_loss /= len(train_loader)
train_losses.append(train_loss)
# Validation
model.eval()
val_loss = 0
with torch.no_grad():
for sequences, targets in val_loader:
sequences, targets = sequences.to(device), targets.to(device)
sequences = sequences.unsqueeze(-1)
outputs = model(sequences)
loss = criterion(outputs, targets)
val_loss += loss.item()
val_loss /= len(val_loader)
val_losses.append(val_loss)
if (epoch + 1) % 10 == 0:
print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
return train_losses, val_losses
# Create and train LSTM model
lstm_model = LSTMModel(input_size=1, hidden_size=64, num_layers=2, output_size=1)
print("Training LSTM model...")
print(f"Model parameters: {sum(p.numel() for p in lstm_model.parameters())}")
train_losses, val_losses = train_lstm_model(lstm_model, train_loader, val_loader, num_epochs=50)
# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('LSTM Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Evaluate on test set
def evaluate_model(model, test_loader, scaler):
"""Evaluate model performance"""
model.eval()
predictions = []
actuals = []
with torch.no_grad():
for sequences, targets in test_loader:
sequences, targets = sequences.to(device), targets.to(device)
sequences = sequences.unsqueeze(-1)
outputs = model(sequences)
predictions.extend(outputs.cpu().numpy())
actuals.extend(targets.cpu().numpy())
# Inverse transform predictions
predictions = np.array(predictions).reshape(-1, 1)
actuals = np.array(actuals).reshape(-1, 1)
predictions_inv = scaler.inverse_transform(predictions)
actuals_inv = scaler.inverse_transform(actuals)
mae = mean_absolute_error(actuals_inv, predictions_inv)
rmse = np.sqrt(mean_squared_error(actuals_inv, predictions_inv))
return predictions_inv, actuals_inv, mae, rmse
# Evaluate LSTM
lstm_pred, lstm_actual, lstm_mae, lstm_rmse = evaluate_model(lstm_model, test_loader, scaler)
print(f"\nLSTM Test Performance:")
print(f"MAE: {lstm_mae:.4f}")
print(f"RMSE: {lstm_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((lstm_actual - lstm_pred) / lstm_actual))*100:.2f}%")
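The model above predicts one step ahead; longer horizons can be produced recursively by feeding each prediction back into the input window. A minimal sketch with a stand-in model function (not the trained LSTM above, which would also need scaling and tensor reshaping):

```python
import numpy as np

def recursive_forecast(model_fn, history, steps):
    """Roll a one-step model forward by feeding predictions back in."""
    window = list(history)
    preds = []
    for _ in range(steps):
        yhat = model_fn(np.array(window))   # one-step-ahead prediction
        preds.append(yhat)
        window = window[1:] + [yhat]        # slide the window forward
    return preds

# Stand-in "model": predicts the mean of the window
preds = recursive_forecast(lambda w: float(w.mean()), history=[1.0, 2.0, 3.0], steps=2)
print(preds)
```

Note that errors compound with this approach, which is one reason direct multi-step models (larger `forecast_horizon`) are sometimes preferred.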
GRU Networks¶
Gated Recurrent Units (GRU) are a simplified version of LSTMs with similar performance.
GRU Architecture:ΒΆ
Reset Gate: Controls what information to forget
Update Gate: Controls what information to update
No separate cell state: Hidden state serves both purposes
Key Equations:ΒΆ
Advantages over LSTM:ΒΆ
Fewer parameters: Simpler architecture
Faster training: Less computation
Similar performance: Often comparable to LSTMs
Easier to implement: Less complex gates
class GRUModel(nn.Module):
"""GRU model for time series forecasting"""
def __init__(self, input_size=1, hidden_size=64, num_layers=2, output_size=1, dropout=0.2):
super(GRUModel, self).__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
# GRU layer
self.gru = nn.GRU(
input_size=input_size,
hidden_size=hidden_size,
num_layers=num_layers,
dropout=dropout if num_layers > 1 else 0,
batch_first=True
)
# Fully connected layer
self.fc = nn.Linear(hidden_size, output_size)
# Dropout
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Initialize hidden state
h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
# Forward pass through GRU
out, _ = self.gru(x, h0)
# Take the last time step output
out = out[:, -1, :]
# Apply dropout and fully connected layer
out = self.dropout(out)
out = self.fc(out)
return out
def train_gru_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
"""Train GRU model"""
model.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
train_losses = []
val_losses = []
for epoch in range(num_epochs):
# Training
model.train()
train_loss = 0
for sequences, targets in train_loader:
sequences, targets = sequences.to(device), targets.to(device)
sequences = sequences.unsqueeze(-1)
outputs = model(sequences)
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_loss += loss.item()
train_loss /= len(train_loader)
train_losses.append(train_loss)
# Validation
model.eval()
val_loss = 0
with torch.no_grad():
for sequences, targets in val_loader:
sequences, targets = sequences.to(device), targets.to(device)
sequences = sequences.unsqueeze(-1)
outputs = model(sequences)
loss = criterion(outputs, targets)
val_loss += loss.item()
val_loss /= len(val_loader)
val_losses.append(val_loss)
if (epoch + 1) % 10 == 0:
print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
return train_losses, val_losses
# Create and train GRU model
gru_model = GRUModel(input_size=1, hidden_size=64, num_layers=2, output_size=1)
print("\nTraining GRU model...")
print(f"Model parameters: {sum(p.numel() for p in gru_model.parameters())}")
gru_train_losses, gru_val_losses = train_gru_model(gru_model, train_loader, val_loader, num_epochs=50)
# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(gru_train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(gru_val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('GRU Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Evaluate GRU
gru_pred, gru_actual, gru_mae, gru_rmse = evaluate_model(gru_model, test_loader, scaler)
print(f"\nGRU Test Performance:")
print(f"MAE: {gru_mae:.4f}")
print(f"RMSE: {gru_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((gru_actual - gru_pred) / gru_actual))*100:.2f}%")
# Compare LSTM vs GRU
print(f"\n=== Model Comparison ===")
print(f"LSTM - MAE: {lstm_mae:.4f}, RMSE: {lstm_rmse:.4f}")
print(f"GRU - MAE: {gru_mae:.4f}, RMSE: {gru_rmse:.4f}")
print(f"MAE ratio (LSTM/GRU): {lstm_mae/gru_mae:.2f} (values above 1 mean GRU is more accurate)")
print(f"Parameter ratio (GRU/LSTM): {sum(p.numel() for p in gru_model.parameters()) / sum(p.numel() for p in lstm_model.parameters()):.2f}")
Transformer Architecture¶
Transformer models, originally designed for NLP, are revolutionizing time series forecasting.
Key Components:ΒΆ
Self-Attention: Learn relationships between all time steps
Multi-Head Attention: Multiple attention mechanisms
Positional Encoding: Add temporal information
Feed-Forward Networks: Process attention outputs
Layer Normalization: Stabilize training
Attention Mechanism:ΒΆ
Where Q, K, V are Query, Key, Value matrices.
Advantages:ΒΆ
Parallel processing: No sequential dependencies
Long-range dependencies: Attention can span entire sequence
Scalable: Handle very long sequences
Flexible: Can be adapted to various tasks
Time Series Specifics:ΒΆ
Temporal positional encoding: Sinusoidal or learned
Causal masking: Prevent future information leakage
Patch-based processing: Divide time series into patches
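The causal masking mentioned above is not applied in the `TransformerModel` below (for one-step prediction from a fixed window it is optional), but it can be added with PyTorch's built-in helper. A minimal sketch with small stand-in dimensions:

```python
import torch
import torch.nn as nn

seq_len = 8
# Float mask: 0 on/below the diagonal, -inf above it,
# so position i cannot attend to positions j > i
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

x = torch.randn(2, seq_len, 16)   # (batch, seq_len, d_model)
out = encoder(x, mask=mask)       # attention cannot see future steps
print(out.shape)                  # torch.Size([2, 8, 16])
```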
class PositionalEncoding(nn.Module):
"""Positional encoding for Transformer"""
def __init__(self, d_model, max_len=5000):
super(PositionalEncoding, self).__init__()
# Create positional encoding matrix
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # shape (1, max_len, d_model) to match batch-first inputs
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x has shape (batch, seq_len, d_model) because the encoder uses batch_first=True,
        # so we index the positional encoding by sequence length, not batch size
        return x + self.pe[:, :x.size(1), :]
class TransformerModel(nn.Module):
"""Transformer model for time series forecasting"""
def __init__(self, input_size=1, d_model=64, nhead=8, num_layers=2,
dim_feedforward=128, output_size=1, dropout=0.1):
super(TransformerModel, self).__init__()
self.input_size = input_size
self.d_model = d_model
# Input projection
self.input_projection = nn.Linear(input_size, d_model)
# Positional encoding
self.pos_encoder = PositionalEncoding(d_model)
# Transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model,
nhead=nhead,
dim_feedforward=dim_feedforward,
dropout=dropout,
batch_first=True
)
self.transformer_encoder = nn.TransformerEncoder(
encoder_layer, num_layers=num_layers
)
# Output projection
self.output_projection = nn.Linear(d_model, output_size)
# Dropout
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Input projection
x = self.input_projection(x) * np.sqrt(self.d_model)
# Add positional encoding
x = self.pos_encoder(x)
# Transformer encoding
x = self.transformer_encoder(x)
# Take the last time step output
x = x[:, -1, :]
# Output projection
x = self.dropout(x)
x = self.output_projection(x)
return x
def train_transformer_model(model, train_loader, val_loader, num_epochs=50, learning_rate=0.001):
"""Train Transformer model"""
model.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
train_losses = []
val_losses = []
for epoch in range(num_epochs):
# Training
model.train()
train_loss = 0
for sequences, targets in train_loader:
sequences, targets = sequences.to(device), targets.to(device)
sequences = sequences.unsqueeze(-1)
outputs = model(sequences)
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_loss += loss.item()
train_loss /= len(train_loader)
train_losses.append(train_loss)
# Validation
model.eval()
val_loss = 0
with torch.no_grad():
for sequences, targets in val_loader:
sequences, targets = sequences.to(device), targets.to(device)
sequences = sequences.unsqueeze(-1)
outputs = model(sequences)
loss = criterion(outputs, targets)
val_loss += loss.item()
val_loss /= len(val_loader)
val_losses.append(val_loss)
scheduler.step()
if (epoch + 1) % 10 == 0:
print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
return train_losses, val_losses
# Create and train Transformer model
transformer_model = TransformerModel(
input_size=1,
d_model=64,
nhead=8,
num_layers=2,
output_size=1
)
print("\nTraining Transformer model...")
print(f"Model parameters: {sum(p.numel() for p in transformer_model.parameters())}")
transformer_train_losses, transformer_val_losses = train_transformer_model(
transformer_model, train_loader, val_loader, num_epochs=50
)
# Plot training curves
plt.figure(figsize=(12, 6))
plt.plot(transformer_train_losses, 'b-', linewidth=2, label='Training Loss')
plt.plot(transformer_val_losses, 'r-', linewidth=2, label='Validation Loss')
plt.title('Transformer Training Progress')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Evaluate Transformer
transformer_pred, transformer_actual, transformer_mae, transformer_rmse = evaluate_model(
transformer_model, test_loader, scaler
)
print(f"\nTransformer Test Performance:")
print(f"MAE: {transformer_mae:.4f}")
print(f"RMSE: {transformer_rmse:.4f}")
print(f"Mean absolute percentage error: {np.mean(np.abs((transformer_actual - transformer_pred) / transformer_actual))*100:.2f}%")
Model Comparison¶
Comparing deep learning models against traditional methods and each other.
Performance Metrics:ΒΆ
MAE: Mean Absolute Error
RMSE: Root Mean Squared Error
MAPE: Mean Absolute Percentage Error
Training time: Computational efficiency
Parameters: Model complexity
Key Insights:ΒΆ
LSTM vs GRU: GRU often performs similarly with fewer parameters
Transformer: Can excel with large datasets and long sequences
Data size matters: Deep learning needs sufficient training data
Hyperparameter tuning: Critical for optimal performance
When to Use Each:ΒΆ
LSTM: Complex patterns, sufficient data, interpretability needed
GRU: Simpler problems, faster training, resource constraints
Transformer: Very long sequences, parallel processing, large datasets
# Compare all models
models_comparison = {
'LSTM': {
'MAE': lstm_mae,
'RMSE': lstm_rmse,
'Parameters': sum(p.numel() for p in lstm_model.parameters()),
'Predictions': lstm_pred,
'Actuals': lstm_actual
},
'GRU': {
'MAE': gru_mae,
'RMSE': gru_rmse,
'Parameters': sum(p.numel() for p in gru_model.parameters()),
'Predictions': gru_pred,
'Actuals': gru_actual
},
'Transformer': {
'MAE': transformer_mae,
'RMSE': transformer_rmse,
'Parameters': sum(p.numel() for p in transformer_model.parameters()),
'Predictions': transformer_pred,
'Actuals': transformer_actual
}
}
# Print comparison table
print("=== Deep Learning Model Comparison ===")
print("Model | MAE | RMSE | Parameters | MAPE")
print("-" * 55)
for name, metrics in models_comparison.items():
mape = np.mean(np.abs((metrics['Actuals'] - metrics['Predictions']) / metrics['Actuals'])) * 100
print(f"{name:12} | {metrics['MAE']:.4f} | {metrics['RMSE']:.4f} | {metrics['Parameters']:10} | {mape:.2f}%")
# Find best model
best_model = min(models_comparison.items(), key=lambda x: x[1]['MAE'])
print(f"\nπ Best performing model: {best_model[0]} (MAE: {best_model[1]['MAE']:.4f})")
# Plot predictions comparison
plt.figure(figsize=(15, 10))
# Plot actual values
plt.subplot(2, 1, 1)
plt.plot(models_comparison['LSTM']['Actuals'][:100], 'k-', linewidth=2, label='Actual', alpha=0.8)
plt.plot(models_comparison['LSTM']['Predictions'][:100], 'b-', linewidth=2, label='LSTM')
plt.plot(models_comparison['GRU']['Predictions'][:100], 'r-', linewidth=2, label='GRU')
plt.plot(models_comparison['Transformer']['Predictions'][:100], 'g-', linewidth=2, label='Transformer')
plt.title('Model Predictions Comparison (First 100 Test Points)')
plt.xlabel('Time Step')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)
# Plot prediction errors
plt.subplot(2, 1, 2)
lstm_errors = np.abs(models_comparison['LSTM']['Actuals'] - models_comparison['LSTM']['Predictions'])
gru_errors = np.abs(models_comparison['GRU']['Actuals'] - models_comparison['GRU']['Predictions'])
transformer_errors = np.abs(models_comparison['Transformer']['Actuals'] - models_comparison['Transformer']['Predictions'])
plt.plot(lstm_errors[:100], 'b-', linewidth=1.5, label='LSTM Error', alpha=0.7)
plt.plot(gru_errors[:100], 'r-', linewidth=1.5, label='GRU Error', alpha=0.7)
plt.plot(transformer_errors[:100], 'g-', linewidth=1.5, label='Transformer Error', alpha=0.7)
plt.title('Prediction Errors Comparison')
plt.xlabel('Time Step')
plt.ylabel('Absolute Error')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Error distribution analysis
plt.figure(figsize=(15, 6))
plt.subplot(1, 3, 1)
plt.hist(lstm_errors, bins=30, alpha=0.7, color='blue', density=True)
plt.title('LSTM Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 2)
plt.hist(gru_errors, bins=30, alpha=0.7, color='red', density=True)
plt.title('GRU Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 3)
plt.hist(transformer_errors, bins=30, alpha=0.7, color='green', density=True)
plt.title('Transformer Error Distribution')
plt.xlabel('Absolute Error')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Statistical summary
print("\n=== Error Statistics ===")
for name, metrics in models_comparison.items():
errors = np.abs(metrics['Actuals'] - metrics['Predictions'])
print(f"{name}:")
print(f" Mean Error: {errors.mean():.4f}")
print(f" Std Error: {errors.std():.4f}")
print(f" Median Error: {np.median(errors):.4f}")
print(f" 95th Percentile: {np.percentile(errors, 95):.4f}")
print()
Key Takeaways¶
Deep learning excels at capturing complex, non-linear patterns in time series
Sequence preprocessing is crucial - proper windowing and scaling matters
LSTM vs GRU: GRU often provides similar performance with fewer parameters
Transformers can handle very long sequences and parallel processing
Data requirements: Deep learning needs substantial training data
Hyperparameter tuning: Critical for optimal performance
Computational cost: More expensive than traditional methods
When to Use Deep Learning¶
Good For:¶
Complex patterns: Non-linear relationships, interactions
Long sequences: Extended historical context needed
Large datasets: Sufficient training data available
Multiple variables: Multivariate forecasting
Real-time processing: Fast inference after training
Less Ideal For:¶
Small datasets: Traditional methods work better
Simple patterns: ARIMA, Prophet may suffice
Interpretability: Black-box nature
Resource constraints: High computational requirements
Real-time training: Online learning challenges
Pro Tips¶
Start simple: Try traditional methods first, then deep learning
Data quality: Clean, consistent time series performs best
Sequence length: Experiment with different lookback windows
Regularization: Use dropout, early stopping to prevent overfitting
Ensemble methods: Combine multiple models for better performance
Cross-validation: Use time series cross-validation for evaluation
Scaling: Normalize data appropriately for neural networks
Hardware: GPU acceleration speeds up training significantly
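The early-stopping tip above can be sketched as a small helper; this `EarlyStopping` class is a hypothetical addition, not part of the training loops defined earlier in this notebook:

```python
class EarlyStopping:
    """Stop training when validation loss stops improving."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait after last improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float('inf')
        self.counter = 0
        self.should_stop = False

    def step(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
        return self.should_stop

# Usage inside a training loop (here with fake validation losses):
stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate([1.0, 0.8, 0.79, 0.81, 0.82, 0.80]):
    if stopper.step(val_loss):
        print(f"Stopping early at epoch {epoch}")
        break
```

In practice you would call `stopper.step(val_loss)` at the end of each epoch in the training functions above and also checkpoint the model weights at the best validation loss.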
Next Steps¶
Now that you understand deep learning for time series, you're ready for:
Advanced Forecasting: Bayesian deep learning, uncertainty quantification
Multivariate Forecasting: Multiple time series simultaneously
Production Deployment: Model serving, monitoring, retraining
Time Series Transformers: Specialized architectures like Autoformer, Informer
Reinforcement Learning: Sequential decision making
Further Reading¶
"Deep Learning" by Goodfellow et al.: Neural network fundamentals
"Attention Is All You Need": Original Transformer paper
PyTorch Documentation: Deep learning implementation
Time Series Libraries: darts, gluonts, flow-forecast
Ready to explore advanced forecasting techniques? Let's continue!