Recommender Systems

📹 Related Topic (covered in various lectures)

Predicting User Preferences at Scale

What are Recommender Systems?

Goal: Predict items a user will like based on past behavior

Applications:

  • E-commerce: Amazon product recommendations

  • Streaming: Netflix movie suggestions, Spotify playlists

  • Social Media: Facebook friend suggestions, YouTube videos

  • News: Personalized article feeds

Key Challenge: Most users have rated/interacted with only a tiny fraction of items

Two Main Approaches

1. Content-Based Filtering

Idea: Recommend items similar to what user liked before

How it works:

  • Extract features from items (genre, director, keywords)

  • Build user profile based on features of liked items

  • Recommend items matching user profile

Example - Movies:

  • User liked action movies with Tom Cruise

  • Recommend other Tom Cruise action movies

Advantages:

  • No need for other users’ data

  • Can recommend new/unpopular items

  • Can provide explanations

Disadvantages:

  • Limited to features you can extract

  • Can’t learn from other users

  • Tends to recommend similar items (filter bubble)
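The content-based pipeline above (extract features, build a profile, match) can be sketched in a few lines. This is a toy illustration: the movie names and feature values are invented, and real systems use far richer features (keywords, embeddings, metadata).

```python
import numpy as np

# Hypothetical item features: [action, comedy, has_tom_cruise]
movies = {
    "Top Gun":          np.array([1.0, 0.0, 1.0]),
    "Mission Impossible": np.array([1.0, 0.0, 1.0]),
    "Superbad":         np.array([0.0, 1.0, 0.0]),
}

# User profile = average feature vector of liked items
liked = ["Top Gun"]
profile = np.mean([movies[m] for m in liked], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Score unseen items by similarity to the user profile
scores = {m: cosine(profile, f) for m, f in movies.items() if m not in liked}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # "Mission Impossible" ranks above "Superbad"
```

Note how the top recommendation is another Tom Cruise action movie, exactly the behavior (and the filter-bubble risk) described above.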

2. Collaborative Filtering

Idea: Use the wisdom of the crowd - “Users who liked this also liked…”

User-User Collaborative Filtering:

  • Find users similar to you

  • Recommend items they liked

Item-Item Collaborative Filtering:

  • Find items similar to ones you liked

  • Based on what other users rated similarly
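Item-item collaborative filtering can be sketched directly on a small rating matrix: compute similarity between item columns, then predict a missing rating as a similarity-weighted average of the user's other ratings. The ratings here are made up for illustration.

```python
import numpy as np

# Rows = users, columns = items; 0 means "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 2],
    [1, 2, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Item-item similarity from the rating columns
n_items = R.shape[1]
sim = np.array([[cosine(R[:, i], R[:, j]) for j in range(n_items)]
                for i in range(n_items)])

# Predict user 3's rating for item 0 as a similarity-weighted
# average of that user's observed ratings
u = 3
rated = np.where(R[u] > 0)[0]
pred = sim[0, rated] @ R[u, rated] / sim[0, rated].sum()
print(f"Predicted rating for user {u}, item 0: {pred:.2f}")
```

A real implementation would mean-center ratings and keep only the top-k most similar neighbors, but the weighted-average core is the same.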

Matrix Factorization:

The Problem:

User-Item Rating Matrix (sparse):
        Movie1  Movie2  Movie3  Movie4
User1     5       ?       ?       3
User2     ?       4       5       ?
User3     3       ?       ?       ?
User4     ?       5       ?       4

The Solution: Factor into two low-rank matrices:

R ≈ U × V^T


where:

  • U: User feature matrix (users × k factors)

  • V: Item feature matrix (items × k factors)

  • k: Number of latent factors (much smaller than # users or items)

Latent Factors:

  • Not explicitly defined

  • Learned automatically

  • Might capture: genre, era, style, etc.
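The role of latent factors can be seen in a tiny hand-built example. The factor values below are invented for illustration; in practice they are learned, not assigned.

```python
import numpy as np

# Two hypothetical latent factors: ["action-ness", "comedy-ness"]
# User factor vectors (rows of U)
alice = np.array([0.9, 0.1])   # loves action
bob   = np.array([0.1, 0.9])   # loves comedy

# Item factor vectors (rows of V)
die_hard = np.array([1.0, 0.0])
superbad = np.array([0.0, 1.0])

# Predicted affinity = dot product u_i^T v_j, i.e. one entry of U V^T
print(alice @ die_hard, alice @ superbad)  # 0.9 0.1
print(bob @ die_hard, bob @ superbad)      # 0.1 0.9
```

Users and items that align along the same latent dimensions get high predicted ratings; the full matrix of predictions is simply U @ V.T.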

Cost Function for Matrix Factorization

Minimize:

J(U,V) = Σ_{(i,j):r_ij exists} (u_i^T v_j - r_ij)² + λ(||U||² + ||V||²)

Optimization:

  • Alternating Least Squares (ALS)

  • Stochastic Gradient Descent
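Since the SGD variant is implemented in full later in these notes, here is a minimal ALS sketch. For clarity it uses a fully observed matrix, so each alternating step is a single ridge regression; production ALS instead solves a small least-squares problem per row over the observed entries only.

```python
import numpy as np

np.random.seed(0)
n_users, n_items, k, lam = 20, 15, 3, 0.1

# Fully observed rank-k ratings, just to demonstrate the updates
R = np.random.randn(n_users, k) @ np.random.randn(k, n_items)

U = np.random.randn(n_users, k)
V = np.random.randn(n_items, k)

for _ in range(20):
    # Fix V, solve the ridge problem for U: U = R V (V^T V + λI)^{-1}
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
    # Fix U, solve symmetrically for V
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))

rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print(f"Reconstruction RMSE after ALS: {rmse:.4f}")
```

Each half-step has a closed-form solution and the per-row solves are independent, which is why ALS parallelizes well (e.g. in Spark MLlib).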

Advanced Techniques

1. Deep Learning for Recommendations

  • Neural Collaborative Filtering

  • Autoencoders for collaborative filtering

  • Two-tower models

2. Hybrid Systems

  • Combine content-based + collaborative filtering

  • The Netflix Prize-winning solution used an ensemble of 100+ models

3. Context-Aware Recommendations

  • Time of day

  • Device type

  • Current location

  • Social context

Evaluation Metrics

Explicit Feedback (ratings):

  • RMSE (Root Mean Squared Error)

  • MAE (Mean Absolute Error)

Implicit Feedback (clicks, views):

  • Precision@K

  • Recall@K

  • MAP (Mean Average Precision)

  • NDCG (Normalized Discounted Cumulative Gain)
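The ranking metrics above can be computed directly from a recommended list and the set of items the user actually engaged with. A minimal sketch (item IDs are arbitrary):

```python
import numpy as np

recommended = [3, 1, 7, 5, 9]   # top-5 ranked list from the model
relevant = {1, 5, 6}            # items the user actually engaged with
K = 5

hits = [1 if item in relevant else 0 for item in recommended[:K]]

precision_at_k = sum(hits) / K
recall_at_k = sum(hits) / len(relevant)

# DCG discounts each hit by log2 of its rank; IDCG is the best possible DCG
dcg = sum(h / np.log2(rank + 2) for rank, h in enumerate(hits))
idcg = sum(1 / np.log2(rank + 2) for rank in range(min(len(relevant), K)))
ndcg = dcg / idcg

print(f"Precision@{K}: {precision_at_k:.2f}")  # 0.40
print(f"Recall@{K}:    {recall_at_k:.2f}")     # 0.67
print(f"NDCG@{K}:      {ndcg:.2f}")
```

NDCG rewards placing relevant items near the top of the list, which is why it tracks user satisfaction better than position-blind metrics.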

Business Metrics:

  • Click-through rate

  • Conversion rate

  • User engagement time

Challenges

Cold Start Problem:

  • New users: No history to base recommendations on

  • New items: No ratings yet

  • Solutions: Hybrid approaches, ask for initial preferences

Scalability:

  • Millions of users × millions of items

  • Real-time recommendations

  • Solutions: Approximate methods, distributed computing

Filter Bubble:

  • Recommends only similar items

  • Reduces serendipity and discovery

  • Solutions: Exploration vs exploitation, diversity metrics

Real-World Considerations

A/B Testing:

  • Test recommendation algorithms live

  • Measure impact on business metrics

Diversity vs Accuracy:

  • Most accurate ≠ most useful

  • Need variety in recommendations

Explainability:

  • “Because you watched X”

  • Builds trust and helps users discover new content

Fairness:

  • Avoid amplifying biases

  • Equal exposure for all items

  • Fair treatment of all users

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
import pandas as pd

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

13.1 Create Movie Rating Dataset

What: Generating a synthetic user-movie rating matrix

We create a simulated recommendation problem by generating a sparse rating matrix from known latent factors. Each user \(u\) has a latent preference vector \(p_u \in \mathbb{R}^k\) and each movie \(m\) has a latent feature vector \(q_m \in \mathbb{R}^k\); the true rating is \(r_{um} = p_u^T q_m\). We then observe only 30% of entries (simulating the sparsity of real rating data) while retaining the full matrix as ground truth for evaluation.

Why: Understanding the data structure that recommender systems exploit

In real systems like Netflix or Amazon, the user-item interaction matrix is extremely sparse – a typical user rates fewer than 1% of available items. The fundamental assumption of collaborative filtering is that this matrix is approximately low-rank: user preferences and item characteristics can be described by a small number of latent factors \(k\) (e.g., genre preference, quality, complexity). Generating data from a known low-rank model lets us verify that our algorithm can recover the true structure and predict missing entries.

# Create synthetic movie ratings
n_users = 50
n_movies = 30
n_factors = 5

# True latent factors
np.random.seed(42)
true_user_factors = np.random.randn(n_users, n_factors)
true_movie_factors = np.random.randn(n_movies, n_factors)

# Generate ratings
true_ratings = true_user_factors @ true_movie_factors.T
true_ratings = 1 + 4 * (true_ratings - true_ratings.min()) / (true_ratings.max() - true_ratings.min())

# Create sparse observed ratings (30% observed)
mask = np.random.rand(n_users, n_movies) < 0.3
observed_ratings = true_ratings.copy()
observed_ratings[~mask] = 0

print(f"Total possible ratings: {n_users * n_movies}")
print(f"Observed ratings: {mask.sum()} ({100*mask.sum()/(n_users*n_movies):.1f}%)")
print(f"Missing ratings: {(~mask).sum()}")

# Visualize rating matrix
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# True ratings
im1 = axes[0].imshow(true_ratings, cmap='YlOrRd', aspect='auto')
axes[0].set_xlabel('Movies', fontsize=12)
axes[0].set_ylabel('Users', fontsize=12)
axes[0].set_title('True Ratings (Complete)', fontsize=13, fontweight='bold')
plt.colorbar(im1, ax=axes[0], label='Rating (1-5)')

# Observed ratings (with missing)
masked_ratings = observed_ratings.copy()
masked_ratings[~mask] = np.nan
im2 = axes[1].imshow(masked_ratings, cmap='YlOrRd', aspect='auto')
axes[1].set_xlabel('Movies', fontsize=12)
axes[1].set_ylabel('Users', fontsize=12)
axes[1].set_title('Observed Ratings (Sparse)', fontsize=13, fontweight='bold')
plt.colorbar(im2, ax=axes[1], label='Rating (1-5)')

plt.tight_layout()
plt.show()

13.2 Matrix Factorization from Scratch

What: Implementing SGD-based matrix factorization with biases

We implement the core recommendation algorithm: decompose the sparse rating matrix \(R \approx PQ^T + b_u + b_m + \mu\), where \(P\) contains user latent factors, \(Q\) contains movie latent factors, \(b_u\) and \(b_m\) are user and movie biases, and \(\mu\) is the global mean rating. Parameters are optimized by stochastic gradient descent on the regularized squared error: \(J = \sum_{(u,m) \in \text{observed}} (r_{um} - \hat{r}_{um})^2 + \lambda(\|P\|^2 + \|Q\|^2)\).

Why: The algorithm behind Netflix-scale recommendations

Matrix factorization was the breakthrough technique in the Netflix Prize competition (2006-2009) and remains the backbone of most production recommendation systems. The biases capture systematic effects (some users rate generously, some movies are universally popular), while the latent factor dot product \(p_u^T q_m\) captures the personalized preference matching. Understanding this decomposition – and why regularization is essential to prevent overfitting on the sparse observed entries – is foundational for working on any recommendation system.

class MatrixFactorization:
    def __init__(self, n_factors=10, learning_rate=0.01, regularization=0.1, n_epochs=100):
        self.n_factors = n_factors
        self.lr = learning_rate
        self.reg = regularization
        self.n_epochs = n_epochs
        
    def fit(self, R, mask):
        """
        R: rating matrix (n_users x n_movies)
        mask: boolean matrix indicating observed ratings
        """
        n_users, n_movies = R.shape
        
        # Initialize factors
        self.P = np.random.randn(n_users, self.n_factors) * 0.01
        self.Q = np.random.randn(n_movies, self.n_factors) * 0.01
        self.user_bias = np.zeros(n_users)
        self.movie_bias = np.zeros(n_movies)
        self.global_mean = R[mask].mean()
        
        self.train_errors = []
        
        # Gradient descent
        for epoch in range(self.n_epochs):
            # Shuffle order
            users, movies = np.where(mask)
            indices = np.arange(len(users))
            np.random.shuffle(indices)
            
            for idx in indices:
                u, i = users[idx], movies[idx]
                r_ui = R[u, i]
                
                # Prediction
                pred = self.global_mean + self.user_bias[u] + self.movie_bias[i] + \
                       self.P[u] @ self.Q[i]
                
                # Error
                error = r_ui - pred
                
                # Update biases
                self.user_bias[u] += self.lr * (error - self.reg * self.user_bias[u])
                self.movie_bias[i] += self.lr * (error - self.reg * self.movie_bias[i])
                
                # Update factors
                P_u = self.P[u].copy()
                self.P[u] += self.lr * (error * self.Q[i] - self.reg * self.P[u])
                self.Q[i] += self.lr * (error * P_u - self.reg * self.Q[i])
            
            # Compute RMSE
            predictions = self.predict()
            train_rmse = np.sqrt(np.mean((R[mask] - predictions[mask])**2))
            self.train_errors.append(train_rmse)
            
            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, RMSE: {train_rmse:.4f}")
    
    def predict(self):
        """Predict all ratings"""
        predictions = self.global_mean + \
                     self.user_bias[:, np.newaxis] + \
                     self.movie_bias[np.newaxis, :] + \
                     self.P @ self.Q.T
        return np.clip(predictions, 1, 5)

# Train model
mf = MatrixFactorization(n_factors=5, learning_rate=0.01, regularization=0.02, n_epochs=100)
mf.fit(observed_ratings, mask)

# Plot training curve
plt.figure(figsize=(10, 6))
plt.plot(mf.train_errors, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('RMSE', fontsize=12)
plt.title('Matrix Factorization Training', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

13.3 Evaluate Predictions

What: Measuring recommendation quality with RMSE on observed and missing ratings

We evaluate the trained matrix factorization model on two sets: the observed ratings (which should have low error after training) and the held-out missing ratings (which test whether the model has learned meaningful latent structure). The gap between these two errors reveals whether the model is generalizing or merely memorizing the training entries.

Why: Proper evaluation of recommender systems is surprisingly tricky

In a real system, you never observe the missing ratings, so you cannot directly compute prediction error on them. Instead, practitioners use temporal splits (train on past, test on future) or held-out random splits of the observed ratings. Our simulated setup, where ground truth exists for all entries, lets us directly verify that the latent factor model is recovering the true underlying preferences – a luxury that motivates building and testing on synthetic data before deploying on real data.

# Get predictions
predicted_ratings = mf.predict()

# Evaluate on observed ratings
observed_true = observed_ratings[mask]
observed_pred = predicted_ratings[mask]
rmse_observed = np.sqrt(mean_squared_error(observed_true, observed_pred))

# Evaluate on missing ratings (we have ground truth from simulation)
missing_true = true_ratings[~mask]
missing_pred = predicted_ratings[~mask]
rmse_missing = np.sqrt(mean_squared_error(missing_true, missing_pred))

print(f"RMSE on observed ratings: {rmse_observed:.4f}")
print(f"RMSE on missing ratings: {rmse_missing:.4f}")
print(f"\nNote: In practice, we don't have ground truth for missing ratings!")

# Visualize predictions
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

im1 = axes[0].imshow(true_ratings, cmap='YlOrRd', aspect='auto')
axes[0].set_title('True Ratings', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Movies')
axes[0].set_ylabel('Users')
plt.colorbar(im1, ax=axes[0])

im2 = axes[1].imshow(predicted_ratings, cmap='YlOrRd', aspect='auto')
axes[1].set_title('Predicted Ratings', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Movies')
axes[1].set_ylabel('Users')
plt.colorbar(im2, ax=axes[1])

im3 = axes[2].imshow(np.abs(true_ratings - predicted_ratings), cmap='Reds', aspect='auto')
axes[2].set_title('Absolute Error', fontsize=13, fontweight='bold')
axes[2].set_xlabel('Movies')
axes[2].set_ylabel('Users')
plt.colorbar(im3, ax=axes[2])

plt.tight_layout()
plt.show()

13.4 Top-N Recommendations

What: Generating personalized movie recommendations for a specific user

For a target user, we predict ratings for all movies the user has not yet seen, rank them by predicted rating, and present the top \(N\) items as recommendations. This is the core output of any recommender system – converting a matrix of predicted scores into an actionable ranked list.

Why: The shift from rating prediction to ranking

While RMSE measures the accuracy of individual rating predictions, users ultimately experience recommendations as a ranked list. A model that accurately predicts the top few movies matters more than one with slightly lower RMSE across all predictions. In production systems, ranking metrics like Precision@K, NDCG, and MAP better capture user satisfaction than RMSE. This example bridges the gap between the optimization objective (minimize squared error) and the user-facing output (a personalized ranked list).

# Generate recommendations for a user
user_id = 0
user_predictions = predicted_ratings[user_id]
user_observed = mask[user_id]

# Get top-5 unseen movies
unseen_movies = np.where(~user_observed)[0]
unseen_predictions = user_predictions[unseen_movies]
top_indices = np.argsort(unseen_predictions)[::-1][:5]
top_movies = unseen_movies[top_indices]

print(f"Top 5 Recommendations for User {user_id}:\n")
for rank, movie_id in enumerate(top_movies, 1):
    pred_rating = user_predictions[movie_id]
    true_rating = true_ratings[user_id, movie_id]
    print(f"{rank}. Movie {movie_id}: Predicted={pred_rating:.2f}, True={true_rating:.2f}")

# Visualize user's ratings
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(n_movies)
width = 0.35

# Observed ratings
observed_vals = np.where(user_observed, observed_ratings[user_id], 0)
ax.bar(x - width/2, observed_vals, width, label='Observed Ratings', alpha=0.7)

# Predictions for unseen
unseen_vals = np.where(~user_observed, user_predictions, 0)
ax.bar(x + width/2, unseen_vals, width, label='Predicted Ratings (unseen)', alpha=0.7)

# Highlight recommendations
ax.scatter(top_movies, user_predictions[top_movies], s=200, c='red', 
          marker='*', zorder=5, label='Top 5 Recommendations')

ax.set_xlabel('Movie ID', fontsize=12)
ax.set_ylabel('Rating', fontsize=12)
ax.set_title(f'User {user_id} - Ratings and Recommendations', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

13.5 Latent Factor Interpretation

What: Visualizing the learned user and movie embeddings

We display the learned latent factor matrices \(P\) (users) and \(Q\) (movies) as heatmaps. Each row of \(P\) represents a user’s position in the latent space, and each row of \(Q\) represents a movie’s position. The predicted rating for user \(u\) and movie \(m\) is essentially the dot product of their respective rows (plus biases), so users and movies that align along the same latent dimensions will have high predicted ratings.

Why: From black-box predictions to interpretable factors

Although the latent factors are learned purely from rating patterns (no content information), they often correspond to interpretable concepts like genre, era, or quality. Visualizing these factors helps us understand why the model makes certain recommendations and builds trust in the system. In production recommender systems, interpretable factors enable features like “Because you watched action movies…” explanations, which research shows significantly increase user engagement and satisfaction.

# Visualize learned factors
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# User factors
im1 = axes[0].imshow(mf.P, cmap='RdBu_r', aspect='auto')
axes[0].set_xlabel('Latent Factor', fontsize=12)
axes[0].set_ylabel('User ID', fontsize=12)
axes[0].set_title('User Latent Factors', fontsize=13, fontweight='bold')
plt.colorbar(im1, ax=axes[0])

# Movie factors
im2 = axes[1].imshow(mf.Q, cmap='RdBu_r', aspect='auto')
axes[1].set_xlabel('Latent Factor', fontsize=12)
axes[1].set_ylabel('Movie ID', fontsize=12)
axes[1].set_title('Movie Latent Factors', fontsize=13, fontweight='bold')
plt.colorbar(im2, ax=axes[1])

plt.tight_layout()
plt.show()

print("\nLatent Factor Interpretation:")
print("- Each factor represents a hidden preference dimension")
print("- Examples: genre, era, director style, complexity, etc.")
print("- User factors: How much they like each dimension")
print("- Movie factors: How much each movie embodies each dimension")
print("- Rating prediction: Dot product of user and movie factors")

Key Takeaways

1. Matrix Factorization

  • Core idea: Decompose rating matrix R ≈ P @ Q^T

  • P: User latent factors (n_users × k)

  • Q: Movie latent factors (n_movies × k)

  • k: Number of latent dimensions (hyperparameter)

2. Collaborative Filtering

  • Uses only user-item interactions (no content features)

  • Learns from similar users and similar items

  • Cold start problem: Can’t recommend for new users/items

3. Optimization

  • SGD: Update after each rating (used above)

    • Fast, online learning

    • Can be unstable

  • ALS: Alternating Least Squares

    • Fix Q, solve for P (closed form)

    • Fix P, solve for Q

    • More stable, parallelizable

4. Regularization

  • Essential to prevent overfitting

  • L2 penalty on factors: λ(||P||² + ||Q||²)

  • Tune λ using validation set

5. Extensions

  • Biases: Add user and item bias terms

  • Implicit feedback: “Watched” vs “Not watched”

  • Temporal dynamics: Preferences change over time

  • Deep learning: Neural collaborative filtering

6. Evaluation

  • Rating prediction: RMSE, MAE

  • Ranking: Precision@k, NDCG, MAP

  • Online: A/B testing, click-through rate

  • Always use held-out test set

7. Practical Considerations

  • Sparsity: Most users rate few items

  • Scalability: Millions of users × items

  • Cold start: New users/items have no history

  • Solution: Hybrid systems (content + collaborative)

References

  1. “Matrix Factorization Techniques for Recommender Systems” - Koren et al. (2009)

  2. “Collaborative Filtering for Implicit Feedback Datasets” - Hu et al. (2008)

  3. CS229 Lecture Notes on Recommender Systems

Next: Lecture 14: Reinforcement Learning