06: Practical Applications & Exercises
"The only way to learn a new programming language is by writing programs in it." - Dennis Ritchie
Welcome to the practical applications and exercises notebook! This is where theory meets practice. You'll work through real-world scenarios, implement complete forecasting solutions, and build your time series expertise through hands-on exercises.
🎯 Learning Objectives
By the end of this notebook, you'll be able to:
Apply time series methods to real-world business problems
Build end-to-end forecasting pipelines
Handle common forecasting challenges and edge cases
Evaluate and compare different forecasting approaches
Present forecasting results to stakeholders
Debug and troubleshoot forecasting models
🏢 Case Study 1: E-commerce Sales Forecasting
Scenario: You're a data scientist at an e-commerce company. The company wants to forecast daily sales for the next quarter to optimize inventory and marketing spend.
Business Requirements:
Forecast horizon: 90 days
Business metrics: Minimize stockouts, reduce excess inventory
Seasonal patterns: Weekly cycles, holiday spikes
External factors: Marketing campaigns, competitor actions
Accuracy target: MAPE < 15%
Data Available:
2 years of daily sales data
Marketing spend by channel
Holiday calendar
Competitor pricing data
Website traffic metrics
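The accuracy target above is stated as MAPE, so it helps to pin the metric down before building anything. A minimal sketch of the calculation (the helper name `mape` is ours; sklearn also ships `mean_absolute_percentage_error`):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, skipping zero-valued actuals."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0  # percentage error is undefined at zero
    return np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100

# Forecasts that are each off by 10% give a MAPE of exactly 10
print(mape([100, 200, 300], [110, 180, 330]))  # 10.0
```

Against the business requirement above, a model passes if `mape(actual, forecast) < 15`.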
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from prophet import Prophet
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Set random seeds
np.random.seed(42)
# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['font.size'] = 12
print("E-commerce Forecasting Case Study")
print("=====================================")
def generate_ecommerce_data():
"""Generate realistic e-commerce sales data"""
# 2 years of daily data
dates = pd.date_range('2021-01-01', periods=730, freq='D')
n_days = len(dates)
t = np.arange(n_days)
# Base sales trend (growing business)
trend = 1000 + 0.5 * t + 0.0001 * t**2
# Weekly seasonality (weekend spikes)
weekly = 200 * np.sin(2 * np.pi * t / 7)
# Monthly seasonality
monthly = 150 * np.sin(2 * np.pi * t / 30.44)
# Holiday effects
holidays = np.zeros(n_days)
# Christmas period (biggest sales period)
christmas_start = pd.Timestamp('2021-12-20')
christmas_end = pd.Timestamp('2022-01-02')
christmas_mask = (dates >= christmas_start) & (dates <= christmas_end)
holidays[christmas_mask] = 800
# Black Friday
bf_2021 = pd.Timestamp('2021-11-26')
bf_2022 = pd.Timestamp('2022-11-25')
holidays[dates == bf_2021] = 1200
holidays[dates == bf_2022] = 1200
# Valentine's Day
valentine_2021 = pd.Timestamp('2021-02-14')
valentine_2022 = pd.Timestamp('2022-02-14')
holidays[dates == valentine_2021] = 400
holidays[dates == valentine_2022] = 400
# Marketing campaigns (random but realistic)
marketing = np.zeros(n_days)
campaign_days = np.random.choice(n_days, size=int(n_days * 0.1), replace=False)
marketing[campaign_days] = np.random.uniform(100, 500, len(campaign_days))
# Smooth marketing effects
marketing = np.convolve(marketing, np.ones(3)/3, mode='same')
# Competitor effects (negative impact)
competitor = np.zeros(n_days)
comp_days = np.random.choice(n_days, size=int(n_days * 0.05), replace=False)
competitor[comp_days] = -np.random.uniform(50, 200, len(comp_days))
competitor = np.convolve(competitor, np.ones(5)/5, mode='same')
# Website traffic (correlated with sales)
traffic_base = trend * 0.1 + weekly * 0.05
traffic_noise = np.random.normal(0, 50, n_days)
traffic = traffic_base + traffic_noise
# Combine all factors
sales = (trend + weekly + monthly + holidays + marketing + competitor)
# Add realistic noise (higher on weekends)
day_of_week = pd.to_datetime(dates).dayofweek
noise_std = np.where(day_of_week >= 5, 150, 100) # Higher noise on weekends
noise = np.random.normal(0, noise_std)
sales += noise
# Ensure non-negative sales
sales = np.maximum(sales, 0)
# Create DataFrame
df = pd.DataFrame({
'ds': dates,
'y': sales,
'trend': trend,
'weekly': weekly,
'monthly': monthly,
'holidays': holidays,
'marketing': marketing,
'competitor': competitor,
'traffic': traffic,
'day_of_week': day_of_week
})
return df
# Generate e-commerce data
ecommerce_data = generate_ecommerce_data()
print(f"Generated {len(ecommerce_data)} days of e-commerce sales data")
print(f"Date range: {ecommerce_data['ds'].min()} to {ecommerce_data['ds'].max()}")
print(f"Total sales: ${ecommerce_data['y'].sum():,.0f}")
print(f"Average daily sales: ${ecommerce_data['y'].mean():.0f}")
print(f"Sales range: ${ecommerce_data['y'].min():.0f} - ${ecommerce_data['y'].max():.0f}")
# Plot the data
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))
# Main sales plot
ax1.plot(ecommerce_data['ds'], ecommerce_data['y'], 'b-', linewidth=1, alpha=0.8)
ax1.set_title('E-commerce Daily Sales (2 Years)')
ax1.set_xlabel('Date')
ax1.set_ylabel('Sales ($)')
ax1.grid(True, alpha=0.3)
# Component breakdown
components = ['trend', 'weekly', 'monthly', 'holidays', 'marketing']
colors = ['red', 'green', 'orange', 'purple', 'brown']
for i, comp in enumerate(components):
if comp in ecommerce_data.columns:
ax2.plot(ecommerce_data['ds'], ecommerce_data[comp],
color=colors[i], linewidth=1, label=comp.capitalize())
ax2.set_title('Sales Components')
ax2.set_xlabel('Date')
ax2.set_ylabel('Component Value ($)')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Show data statistics
print("\nSales Statistics:")
print(ecommerce_data['y'].describe())
# Show key dates
print("\nKey Business Events:")
print(f"Black Friday 2021: ${ecommerce_data.loc[ecommerce_data['ds'] == '2021-11-26', 'y'].values[0]:.0f}")
print(f"Christmas 2021 peak: ${ecommerce_data['holidays'].max():.0f}")
print(f"Marketing campaign impact: ${ecommerce_data['marketing'].max():.0f}")
def build_ecommerce_forecasting_pipeline(data):
"""Build complete forecasting pipeline for e-commerce"""
print("Building E-commerce Forecasting Pipeline")
print("=========================================")
# Split data (keep last 90 days for testing)
train_data = data[:-90]
test_data = data[-90:]
print(f"Training data: {len(train_data)} days")
print(f"Test data: {len(test_data)} days")
    # Model 1: Prophet with holidays and seasonality
    print("\n1. Training Prophet Model...")
    # Holiday effects must be passed to the Prophet constructor
    holidays_df = pd.concat([
        pd.DataFrame({
            'holiday': 'christmas',
            'ds': pd.to_datetime(['2021-12-25', '2022-12-25']),
            'lower_window': -7,
            'upper_window': 7
        }),
        pd.DataFrame({
            'holiday': 'black_friday',
            'ds': pd.to_datetime(['2021-11-26', '2022-11-25']),
            'lower_window': -1,
            'upper_window': 1
        })
    ])
    prophet_model = Prophet(
        holidays=holidays_df,
        seasonality_mode='multiplicative',
        yearly_seasonality=True,
        weekly_seasonality=True,
        daily_seasonality=False,
        changepoint_prior_scale=0.05
    )
    prophet_model.fit(train_data[['ds', 'y']])
# Generate Prophet forecast
future = prophet_model.make_future_dataframe(periods=90, freq='D')
prophet_forecast = prophet_model.predict(future)
# Model 2: SARIMA with exogenous variables
print("2. Training SARIMA Model...")
# Prepare exogenous variables
exog_vars = ['marketing', 'competitor', 'traffic']
exog_train = train_data[exog_vars]
exog_test = test_data[exog_vars]
try:
sarima_model = SARIMAX(
train_data['y'],
exog=exog_train,
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 7)
)
sarima_fitted = sarima_model.fit(disp=False)
# Generate SARIMA forecast
sarima_forecast = sarima_fitted.forecast(steps=90, exog=exog_test)
sarima_forecast_values = sarima_forecast
    except Exception as e:
        print(f"SARIMA training failed ({e}); using naive forecast")
sarima_forecast_values = train_data['y'].iloc[-90:].values
# Model 3: Machine Learning with features
print("3. Training Random Forest Model...")
# Create features for ML model
def create_ml_features(df):
features = df.copy()
# Lag features
for lag in [1, 7, 14, 30]:
features[f'sales_lag_{lag}'] = features['y'].shift(lag)
        # Rolling statistics (shifted one day so the current value never leaks in)
        features['sales_ma_7'] = features['y'].shift(1).rolling(7).mean()
        features['sales_ma_30'] = features['y'].shift(1).rolling(30).mean()
        features['sales_std_7'] = features['y'].shift(1).rolling(7).std()
# Time features
features['day_of_week'] = features['ds'].dt.dayofweek
features['month'] = features['ds'].dt.month
features['quarter'] = features['ds'].dt.quarter
features['is_weekend'] = features['day_of_week'].isin([5, 6]).astype(int)
# External features
features['marketing_scaled'] = features['marketing']
features['traffic_scaled'] = features['traffic']
return features.dropna()
    # Build features on the full series so test rows keep their lag history;
    # computing lags on the 90-day test slice alone would drop its first 30 rows
    all_features = create_ml_features(data)
    train_features = all_features[all_features['ds'].isin(train_data['ds'])]
    test_features = all_features[all_features['ds'].isin(test_data['ds'])]
# Prepare X and y
feature_cols = [col for col in train_features.columns
if col not in ['ds', 'y', 'trend', 'weekly', 'monthly', 'holidays', 'competitor']]
X_train = train_features[feature_cols]
y_train = train_features['y']
X_test = test_features[feature_cols]
y_test = test_features['y']
# Train Random Forest
rf_model = RandomForestRegressor(
n_estimators=100,
max_depth=10,
random_state=42
)
rf_model.fit(X_train, y_train)
rf_forecast = rf_model.predict(X_test)
# Ensemble forecast (simple average)
prophet_test = prophet_forecast[prophet_forecast['ds'].isin(test_data['ds'])]['yhat'].values
ensemble_forecast = (prophet_test + sarima_forecast_values + rf_forecast) / 3
# Calculate accuracies
models_results = {
'Prophet': prophet_test,
'SARIMA': sarima_forecast_values,
'Random Forest': rf_forecast,
'Ensemble': ensemble_forecast
}
accuracies = {}
for name, preds in models_results.items():
mae = mean_absolute_error(test_data['y'], preds)
rmse = np.sqrt(mean_squared_error(test_data['y'], preds))
mape = np.mean(np.abs((test_data['y'] - preds) / test_data['y'])) * 100
accuracies[name] = {'MAE': mae, 'RMSE': rmse, 'MAPE': mape}
return {
'models': {
'prophet': prophet_model,
'sarima': sarima_fitted if 'sarima_fitted' in locals() else None,
'rf': rf_model
},
'forecasts': {
'prophet': prophet_forecast,
'sarima': sarima_forecast_values,
'rf': rf_forecast,
'ensemble': ensemble_forecast
},
'accuracies': accuracies,
'test_data': test_data
}
# Build the forecasting pipeline
pipeline_results = build_ecommerce_forecasting_pipeline(ecommerce_data)
# Display results
print("\nModel Performance Comparison:")
print(f"{'Model':14} | {'MAE':>9} | {'RMSE':>9} | {'MAPE':>6}")
print("-" * 45)
for model_name, metrics in pipeline_results['accuracies'].items():
print(f"{model_name:14} | ${metrics['MAE']:8.0f} | ${metrics['RMSE']:8.0f} | {metrics['MAPE']:5.1f}%")
# Find best model
best_model = min(pipeline_results['accuracies'].items(), key=lambda x: x[1]['MAE'])
print(f"\n🏆 Best performing model: {best_model[0]} (MAE: ${best_model[1]['MAE']:.0f})")
# Plot comparison
plt.figure(figsize=(15, 8))
# Plot actual sales
plt.plot(pipeline_results['test_data']['ds'], pipeline_results['test_data']['y'],
'k-', linewidth=2, label='Actual Sales')
# Plot forecasts
colors = ['red', 'blue', 'green', 'orange']
for i, (model_name, forecast) in enumerate(pipeline_results['forecasts'].items()):
if model_name != 'prophet': # Prophet forecast has different structure
plt.plot(pipeline_results['test_data']['ds'], forecast,
color=colors[i], linewidth=2, label=f'{model_name}')
# Plot Prophet forecast
prophet_test_forecast = pipeline_results['forecasts']['prophet']
prophet_test = prophet_test_forecast[prophet_test_forecast['ds'].isin(pipeline_results['test_data']['ds'])]
plt.plot(prophet_test['ds'], prophet_test['yhat'], color='purple', linewidth=2, label='Prophet')
plt.title('E-commerce Sales Forecasting: Model Comparison')
plt.xlabel('Date')
plt.ylabel('Sales ($)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
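The ensemble above weights the three models equally. A common refinement is inverse-error weighting, where models with lower validation MAE get proportionally more weight. A sketch (the MAE values below are illustrative, not the ones computed by the pipeline):

```python
import numpy as np

def inverse_error_weights(errors):
    """Convert per-model error scores into weights that sum to 1."""
    inv = 1.0 / np.asarray(errors, dtype=float)
    return inv / inv.sum()

# Hypothetical validation MAEs for Prophet, SARIMA, Random Forest
weights = inverse_error_weights([120.0, 150.0, 100.0])
print(weights.round(3))  # the lowest-MAE model gets the largest weight

# Weighted ensemble: rows are models, columns are forecast days
per_model = np.array([[1000.0, 1100.0],
                      [ 980.0, 1050.0],
                      [1020.0, 1120.0]])
ensemble = weights @ per_model
```

In practice the weights would come from a held-out validation window, not from the test set being reported on.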
🏥 Case Study 2: Hospital Patient Volume Forecasting
Scenario: You're working with a hospital system to forecast daily patient volume. This helps with staffing decisions, resource allocation, and emergency preparedness.
Business Requirements:
Forecast horizon: 30 days
Critical metrics: Avoid understaffing, optimize resource use
Seasonal patterns: Weekly cycles, holiday effects
External factors: Weather, flu season, local events
Accuracy target: MAPE < 10% for reliable staffing
Data Available:
3 years of daily patient visit data
Weather data (temperature, precipitation)
Flu surveillance data
Holiday calendar
Local event data
def generate_hospital_data():
"""Generate realistic hospital patient volume data"""
# 3 years of daily data
dates = pd.date_range('2020-01-01', periods=1095, freq='D')
n_days = len(dates)
t = np.arange(n_days)
# Base patient volume (growing slightly)
base_volume = 150 + 0.02 * t
# Weekly seasonality (lower on weekends)
weekly_pattern = -30 * np.sin(2 * np.pi * t / 7) # Peak mid-week
# Monthly seasonality
monthly_pattern = 20 * np.sin(2 * np.pi * t / 30.44)
# Holiday effects (reduced volume)
holidays = np.zeros(n_days)
# Major holidays
holiday_dates = [
'2020-01-01', '2020-07-04', '2020-11-26', '2020-12-25', # 2020
'2021-01-01', '2021-07-04', '2021-11-25', '2021-12-25', # 2021
'2022-01-01', '2022-07-04', '2022-11-24', '2022-12-25' # 2022
]
for holiday in holiday_dates:
holiday_idx = dates.get_loc(pd.Timestamp(holiday))
start_idx = max(0, holiday_idx - 3)
end_idx = min(n_days, holiday_idx + 4)
holidays[start_idx:end_idx] = -50 # Reduced volume around holidays
    # Flu season (roughly December through February each year)
    flu_season = np.zeros(n_days)
    # Tail of the 2019-20 season at the start of the data
    flu_season[dates <= pd.Timestamp('2020-03-01')] = 40
    for year in [2020, 2021, 2022]:
        winter_start = pd.Timestamp(f'{year}-12-01')
        winter_end = pd.Timestamp(f'{year + 1}-03-01')
        winter_mask = (dates >= winter_start) & (dates <= winter_end)
        flu_season[winter_mask] = 40  # Increased volume during flu season
# Weather effects
# Simulate temperature (colder weather = more visits)
temperature = 20 + 15 * np.sin(2 * np.pi * (t + 80) / 365) + np.random.normal(0, 5, n_days)
weather_effect = np.where(temperature < 10, 25, 0) # Cold weather increases visits
# Precipitation effect
precipitation = np.maximum(0, np.random.exponential(2, n_days) - 1)
rain_effect = precipitation * 5 # Rain increases accident-related visits
# COVID-19 effect (2020-2021)
covid_effect = np.zeros(n_days)
covid_start = pd.Timestamp('2020-03-15')
covid_peak = pd.Timestamp('2020-04-15')
covid_end = pd.Timestamp('2021-06-01')
covid_period = (dates >= covid_start) & (dates <= covid_end)
covid_effect[covid_period] = -80 # Major reduction during COVID
# Combine all factors
patient_volume = (base_volume + weekly_pattern + monthly_pattern +
holidays + flu_season + weather_effect + rain_effect + covid_effect)
# Add realistic noise
noise = np.random.normal(0, 15, n_days)
patient_volume += noise
# Ensure non-negative
patient_volume = np.maximum(patient_volume, 0)
# Create DataFrame
df = pd.DataFrame({
'ds': dates,
'y': patient_volume,
'base_volume': base_volume,
'weekly': weekly_pattern,
'monthly': monthly_pattern,
'holidays': holidays,
'flu_season': flu_season,
'temperature': temperature,
'precipitation': precipitation,
'weather_effect': weather_effect,
'rain_effect': rain_effect,
'covid_effect': covid_effect,
'day_of_week': dates.dayofweek
})
return df
# Generate hospital data
hospital_data = generate_hospital_data()
print(f"Generated {len(hospital_data)} days of hospital patient volume data")
print(f"Date range: {hospital_data['ds'].min()} to {hospital_data['ds'].max()}")
print(f"Total patients: {hospital_data['y'].sum():,.0f}")
print(f"Average daily patients: {hospital_data['y'].mean():.0f}")
print(f"Patient range: {hospital_data['y'].min():.0f} - {hospital_data['y'].max():.0f}")
# Plot the data
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))
# Main patient volume plot
ax1.plot(hospital_data['ds'], hospital_data['y'], 'b-', linewidth=1, alpha=0.8)
ax1.set_title('Hospital Daily Patient Volume (3 Years)')
ax1.set_xlabel('Date')
ax1.set_ylabel('Patient Volume')
ax1.grid(True, alpha=0.3)
# Component breakdown
components = ['base_volume', 'flu_season', 'covid_effect', 'weather_effect']
colors = ['red', 'green', 'orange', 'purple']
for i, comp in enumerate(components):
ax2.plot(hospital_data['ds'], hospital_data[comp],
color=colors[i], linewidth=1, label=comp.replace('_', ' ').title())
ax2.set_title('Patient Volume Components')
ax2.set_xlabel('Date')
ax2.set_ylabel('Component Effect')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Show key statistics
print("\nPatient Volume Statistics:")
print(hospital_data['y'].describe())
print("\nKey Events:")
print(f"COVID-19 impact (max reduction): {hospital_data['covid_effect'].min():.0f} patients")
print(f"Flu season impact (max increase): {hospital_data['flu_season'].max():.0f} patients")
print(f"Cold weather impact: {hospital_data['weather_effect'].max():.0f} patients")
print(f"Holiday reduction: {hospital_data['holidays'].min():.0f} patients")
💻 Hands-On Exercises
Now it's your turn! Apply what you've learned to solve these forecasting challenges.
Exercise 1: Energy Consumption Forecasting
Build a forecasting model for hourly electricity consumption with weather variables.
Exercise 2: Stock Price Prediction
Create a model to forecast stock prices using technical indicators and market data.
Exercise 3: Website Traffic Forecasting
Predict website traffic patterns for capacity planning and resource allocation.
Exercise 4: Supply Chain Optimization
Forecast product demand for inventory management and supplier planning.
def exercise_energy_forecasting():
"""Exercise 1: Energy consumption forecasting"""
print("\n=== Exercise 1: Energy Consumption Forecasting ===")
print("Build a forecasting model for hourly electricity consumption")
# Generate synthetic energy data
dates = pd.date_range('2023-01-01', periods=168, freq='H') # 1 week hourly
n_hours = len(dates)
t = np.arange(n_hours)
# Base consumption with daily/weekly patterns
daily_pattern = 1000 + 500 * np.sin(2 * np.pi * (t % 24) / 24)
weekly_pattern = np.where((t // 24) % 7 >= 5, 800, 1200) # Weekend reduction
# Weather effects
temperature = 20 + 10 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 2, n_hours)
temp_effect = temperature * 20 # Higher temp = higher AC usage
# Time of day effects
hour_of_day = t % 24
peak_hours = ((hour_of_day >= 8) & (hour_of_day <= 18)).astype(int) * 300
# Combine and add noise
consumption = daily_pattern + weekly_pattern + temp_effect + peak_hours
consumption += np.random.normal(0, 50, n_hours)
consumption = np.maximum(consumption, 0)
energy_data = pd.DataFrame({
'ds': dates,
'y': consumption,
'temperature': temperature,
'hour_of_day': hour_of_day,
'is_weekend': ((t // 24) % 7 >= 5).astype(int)
})
    # Split data: hold out the last 24 hours for testing
    train_data = energy_data[:-24]
    test_data = energy_data[-24:]
print(f"Training on {len(train_data)} hours, testing on {len(test_data)} hours")
# Build forecasting model (Prophet with regressors)
model = Prophet(
seasonality_mode='additive',
daily_seasonality=True,
weekly_seasonality=True
)
# Add temperature as regressor
model.add_regressor('temperature')
# Prepare data for Prophet
prophet_train = train_data[['ds', 'y', 'temperature']].copy()
model.fit(prophet_train)
    # Create future dataframe with regressors; it includes the history rows too,
    # so the regressor column must cover the full series, not just the test window
    future = model.make_future_dataframe(periods=24, freq='H')
    future['temperature'] = energy_data['temperature'].values
forecast = model.predict(future)
# Calculate accuracy
test_forecast = forecast[forecast['ds'].isin(test_data['ds'])]['yhat']
mae = mean_absolute_error(test_data['y'], test_forecast)
mape = np.mean(np.abs((test_data['y'] - test_forecast) / test_data['y'])) * 100
print(f"Energy Forecasting Results:")
print(f"MAE: {mae:.2f} kWh")
print(f"MAPE: {mape:.2f}%")
# Plot results
plt.figure(figsize=(15, 6))
plt.plot(test_data['ds'], test_data['y'], 'k-', linewidth=2, label='Actual')
plt.plot(test_data['ds'], test_forecast, 'r-', linewidth=2, label='Forecast')
plt.title('Energy Consumption Forecasting (24-hour horizon)')
plt.xlabel('Time')
plt.ylabel('Consumption (kWh)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.show()
return model, forecast
# Run exercise 1
energy_model, energy_forecast = exercise_energy_forecasting()
def exercise_stock_forecasting():
"""Exercise 2: Stock price forecasting with technical indicators"""
print("\n=== Exercise 2: Stock Price Forecasting ===")
print("Predict stock prices using technical indicators")
# Generate synthetic stock data
dates = pd.date_range('2023-01-01', periods=252, freq='B') # Business days
n_days = len(dates)
t = np.arange(n_days)
# Base price with trend
base_price = 100 + 0.1 * t + 0.001 * t**2
# Market volatility (higher on some periods)
volatility = 1 + 0.5 * np.sin(2 * np.pi * t / 63) # Quarterly volatility
# Random walk with drift
random_walk = np.cumsum(np.random.normal(0.05, 1, n_days) * volatility)
# Combine components
price = base_price + random_walk
# Create technical indicators
price_series = pd.Series(price)
# Moving averages
sma_10 = price_series.rolling(10).mean()
sma_30 = price_series.rolling(30).mean()
# RSI (Relative Strength Index)
delta = price_series.diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
rs = gain / loss
rsi = 100 - (100 / (1 + rs))
# MACD
ema_12 = price_series.ewm(span=12).mean()
ema_26 = price_series.ewm(span=26).mean()
macd = ema_12 - ema_26
signal = macd.ewm(span=9).mean()
# Volume (simulated)
volume = np.random.uniform(100000, 500000, n_days)
volume = volume * (1 + 0.5 * np.abs(np.random.normal(0, 1, n_days))) # Higher on volatile days
stock_data = pd.DataFrame({
'ds': dates,
'y': price,
'sma_10': sma_10,
'sma_30': sma_30,
'rsi': rsi,
'macd': macd,
'signal': signal,
'volume': volume
}).dropna()
    # (Train/test split happens below, after the lag features are created)
# Build model with technical indicators
from sklearn.ensemble import GradientBoostingRegressor
# Features for prediction
feature_cols = ['sma_10', 'sma_30', 'rsi', 'macd', 'signal', 'volume']
# Create lag features
for col in feature_cols + ['y']:
for lag in [1, 3, 5]:
stock_data[f'{col}_lag_{lag}'] = stock_data[col].shift(lag)
stock_data = stock_data.dropna()
    train_data = stock_data[:-30]
    test_data = stock_data[-30:]
    print(f"Training on {len(train_data)} days, testing on {len(test_data)} days")
# Prepare features
lag_features = [col for col in stock_data.columns if 'lag' in col]
X_train = train_data[lag_features]
y_train = train_data['y']
X_test = test_data[lag_features]
y_test = test_data['y']
# Train model
gb_model = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
gb_model.fit(X_train, y_train)
# Generate predictions
predictions = gb_model.predict(X_test)
# Calculate accuracy
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
mape = np.mean(np.abs((y_test - predictions) / y_test)) * 100
print(f"Stock Forecasting Results:")
print(f"MAE: ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
# Plot results
plt.figure(figsize=(15, 8))
# Plot actual vs predicted
plt.subplot(2, 1, 1)
plt.plot(test_data['ds'], y_test, 'k-', linewidth=2, label='Actual Price')
plt.plot(test_data['ds'], predictions, 'r-', linewidth=2, label='Predicted Price')
plt.title('Stock Price Forecasting')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True, alpha=0.3)
# Plot prediction errors
plt.subplot(2, 1, 2)
errors = y_test - predictions
plt.plot(test_data['ds'], errors, 'b-', linewidth=1)
plt.axhline(y=0, color='r', linestyle='--', alpha=0.7)
plt.title('Prediction Errors')
plt.xlabel('Date')
plt.ylabel('Error ($)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Feature importance
feature_importance = pd.DataFrame({
'feature': X_train.columns,
'importance': gb_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))
return gb_model, predictions
# Run exercise 2
stock_model, stock_predictions = exercise_stock_forecasting()
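Exercises 3 and 4 are left for you. For Exercise 3, a good first step is a seasonal-naive baseline that repeats last week's hourly profile; any fancier model should have to beat it. A sketch on synthetic hourly traffic (the numbers are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
hours = pd.date_range('2023-01-01', periods=21 * 24, freq='H')  # 3 weeks of hourly data
t = np.arange(len(hours))
# Daily cycle plus a weekend lift plus noise
traffic = (500 + 200 * np.sin(2 * np.pi * (t % 24) / 24)
           + np.where((t // 24) % 7 >= 5, 100, 0)
           + rng.normal(0, 20, len(hours)))

train, test = traffic[:-168], traffic[-168:]  # hold out the final week
seasonal_naive = train[-168:]                 # repeat the previous week hour-for-hour
print(f"Seasonal-naive MAE: {np.mean(np.abs(test - seasonal_naive)):.1f} visits/hour")
```

For Exercise 4, the same idea applies at daily or weekly granularity, with the e-commerce pipeline from Case Study 1 as the template.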
🎯 Troubleshooting Guide
Common forecasting problems and solutions:
Problem 1: Poor Model Accuracy
Symptom: High MAPE, poor fit on test data
Possible causes: Overfitting, insufficient data, wrong model
Solutions: Cross-validation, simpler models, more data
Problem 2: Non-Stationary Data
Symptom: Trends, changing variance over time
Solutions: Differencing, detrending, seasonal decomposition
Problem 3: Missing Data
Symptom: Gaps in time series
Solutions: Interpolation, forward-fill, model-based imputation
Problem 4: Outliers
Symptom: Extreme values affecting model
Solutions: Robust statistics, outlier detection, winsorization
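Winsorization, listed above, clips extremes to chosen percentiles instead of dropping them (scipy has `scipy.stats.mstats.winsorize`; a plain NumPy version is easy to sketch):

```python
import numpy as np

def winsorize(x, lower_pct=1, upper_pct=99):
    """Clip values outside the given percentiles rather than removing them."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

data = np.array([10.0, 12.0, 11.0, 9.0, 500.0, 10.0, 11.0, -300.0])
clipped = winsorize(data)
print(clipped.min(), clipped.max())  # the -300 and 500 spikes are pulled inward
```

Unlike dropping outliers, this keeps the series length intact, which matters for lag-based models.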
Problem 5: Changing Patterns
Symptom: Model works on old data but fails on new
Solutions: Regular retraining, online learning, model updates
def troubleshooting_examples():
"""Demonstrate common troubleshooting scenarios"""
print("=== Forecasting Troubleshooting Examples ===\n")
# Example 1: Handling missing data
print("1. Handling Missing Data")
# Create data with missing values
dates = pd.date_range('2023-01-01', periods=100, freq='D')
values = 100 + np.sin(2 * np.pi * np.arange(100) / 10) + np.random.normal(0, 5, 100)
# Introduce missing values
missing_mask = np.random.random(100) < 0.1 # 10% missing
values[missing_mask] = np.nan
missing_data = pd.DataFrame({'ds': dates, 'y': values})
print(f"Original data has {missing_data['y'].isna().sum()} missing values")
# Different imputation methods
    data_forward_fill = missing_data.ffill()  # fillna(method='ffill') is deprecated
data_interpolate = missing_data.interpolate(method='linear')
data_mean = missing_data.fillna(missing_data['y'].mean())
print("Imputation methods:")
print(f" Forward fill: {data_forward_fill['y'].isna().sum()} missing")
print(f" Linear interpolation: {data_interpolate['y'].isna().sum()} missing")
print(f" Mean imputation: {data_mean['y'].isna().sum()} missing")
# Example 2: Detecting outliers
print("\n2. Outlier Detection")
    # Create a series with injected outliers
    series = np.random.normal(100, 10, 100)
    outlier_indices = [20, 50, 80]
    series[outlier_indices] = [50, 200, 30]  # Extreme values
    # Z-score method
    z_scores = (series - np.mean(series)) / np.std(series)
    outliers_z = np.abs(z_scores) > 3
    # IQR method
    Q1 = np.percentile(series, 25)
    Q3 = np.percentile(series, 75)
    IQR = Q3 - Q1
    outliers_iqr = (series < Q1 - 1.5 * IQR) | (series > Q3 + 1.5 * IQR)
print(f"Z-score method detected {np.sum(outliers_z)} outliers")
print(f"IQR method detected {np.sum(outliers_iqr)} outliers")
print(f"Actual outliers: {len(outlier_indices)}")
# Example 3: Model validation
print("\n3. Model Validation Techniques")
# Generate validation data
val_dates = pd.date_range('2023-01-01', periods=200, freq='D')
val_t = np.arange(200)
val_y = 100 + 0.1 * val_t + np.sin(2 * np.pi * val_t / 30) + np.random.normal(0, 5, 200)
# Rolling window validation
def rolling_validation(data, window_size=50, step=10):
"""Perform rolling window validation"""
errors = []
for start in range(0, len(data) - window_size - step, step):
train_end = start + window_size
test_end = train_end + step
if test_end > len(data):
break
# Train on window
train_data = data[start:train_end]
test_data = data[train_end:test_end]
# Simple model (moving average)
pred = np.mean(train_data)
actual = np.mean(test_data)
error = abs(pred - actual)
errors.append(error)
return errors
rolling_errors = rolling_validation(val_y, window_size=50, step=10)
print(f"Rolling validation - Mean error: {np.mean(rolling_errors):.4f}")
print(f"Rolling validation - Error std: {np.std(rolling_errors):.4f}")
# Example 4: Residual analysis
print("\n4. Residual Analysis")
# Generate predictions and residuals
predictions = val_y + np.random.normal(0, 3, 200) # Add some error
residuals = val_y - predictions
# Residual statistics
print(f"Residual mean: {np.mean(residuals):.4f}")
print(f"Residual std: {np.std(residuals):.4f}")
print(f"Residual skewness: {pd.Series(residuals).skew():.4f}")
print(f"Residual kurtosis: {pd.Series(residuals).kurtosis():.4f}")
# Autocorrelation of residuals
from statsmodels.tsa.stattools import acf
    residual_acf = acf(residuals, nlags=10)
    # Approximate 95% white-noise band: +/- 1.96 / sqrt(n)
    acf_threshold = 1.96 / np.sqrt(len(residuals))
    significant_autocorr = np.sum(np.abs(residual_acf[1:]) > acf_threshold)
    print(f"Residuals show significant autocorrelation at {significant_autocorr} of 10 lags")
print("\nTroubleshooting examples completed!")
# Run troubleshooting examples
troubleshooting_examples()
🎯 Key Takeaways
Real-world forecasting requires understanding business context
Data quality is more important than fancy algorithms
Model validation prevents overfitting and ensures reliability
Ensemble methods often outperform single models
Domain knowledge improves feature engineering and interpretation
Monitoring and maintenance ensure long-term performance
Communication of uncertainty is crucial for decision making
📊 When to Use Different Approaches
✅ Statistical Methods (ARIMA, Prophet):
Interpretable results needed
Limited data available
Stationary patterns dominate
Business stakeholders require explanations
✅ Machine Learning Methods:
Complex patterns with many variables
Large datasets available
Feature engineering possible
Accuracy is top priority
✅ Deep Learning Methods:
Very long sequences needed
Automatic feature learning desired
Scalable computation available
Multiple related series to forecast
💡 Best Practices
Start simple: Try basic methods before complex ones
Validate thoroughly: Use multiple validation techniques
Monitor performance: Track metrics over time
Update regularly: Retrain models with new data
Document everything: Keep track of decisions and results
Communicate uncertainty: Always provide confidence intervals
Test assumptions: Verify model assumptions hold
Consider business impact: Focus on actionable insights
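"Communicate uncertainty" concretely means shipping intervals alongside point forecasts. Prophet returns `yhat_lower`/`yhat_upper` for free; for other models, a simple distribution-free option is to add empirical quantiles of historical residuals (a sketch, assuming the residuals stay roughly stable over time):

```python
import numpy as np

rng = np.random.default_rng(5)
residuals = rng.normal(0, 50, 365)           # historical one-step forecast errors
point_forecast = np.array([1000.0, 1050.0, 980.0])

lo, hi = np.percentile(residuals, [5, 95])   # 90% empirical error band
lower, upper = point_forecast + lo, point_forecast + hi
for f, l, u in zip(point_forecast, lower, upper):
    print(f"forecast {f:.0f}, 90% interval [{l:.0f}, {u:.0f}]")
```

For multi-step horizons the residuals should be collected per horizon, since errors typically widen with lead time.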
🚀 Next Steps
You've completed the comprehensive time series forecasting curriculum! Here's what's next:
Advanced Topics: Bayesian forecasting, online learning
Multivariate Methods: Vector autoregression, neural networks
Production Systems: MLOps, model serving, monitoring
Specialized Areas: Financial forecasting, IoT time series
Research: Stay updated with latest methods and papers
📚 Additional Resources
Books: "Time Series Analysis and Its Applications", "Forecasting: Principles and Practice"
Courses: Coursera Time Series, Udacity Machine Learning
Libraries: statsmodels, prophet, darts, gluonts
Communities: Kaggle, Towards Data Science, r/MachineLearning
Congratulations! 🎉 You've mastered time series forecasting from fundamentals to advanced applications. The patterns and techniques you've learned apply to countless real-world problems. Keep practicing, stay curious, and continue building amazing forecasting solutions!