Collaborative Filtering: Building Recommender Systems from User Behavior

Netflix-style recommendations start with collaborative filtering — the idea that users who agreed in the past will agree in the future. This notebook covers user-based CF, item-based CF, matrix factorization with SVD, and ALS.

1. Setup — Synthetic User-Item Ratings Matrix

We create a sparse matrix of 100 users × 200 items with ratings from 1–5 (0 = not rated).

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

N_USERS = 100
N_ITEMS = 200
DENSITY  = 0.05   # 5% of ratings are observed

# Generate sparse ratings
mask = np.random.rand(N_USERS, N_ITEMS) < DENSITY
ratings_raw = np.random.randint(1, 6, size=(N_USERS, N_ITEMS)).astype(float)
ratings_raw[~mask] = 0.0

user_ids  = [f'user_{i:03d}' for i in range(N_USERS)]
item_ids  = [f'item_{j:03d}' for j in range(N_ITEMS)]
ratings_df = pd.DataFrame(ratings_raw, index=user_ids, columns=item_ids)

print(f'Matrix shape : {ratings_df.shape}')
print(f'Observed ratings: {mask.sum()} / {N_USERS * N_ITEMS} ({mask.mean()*100:.1f}%)')
ratings_df.iloc[:5, :8]

2. User-Based Collaborative Filtering

Idea: Find users similar to the target user, then predict a rating as a weighted average of their ratings.

Cosine similarity between two user vectors (unrated items are stored as 0, so they contribute nothing to the dot product):

\[\text{sim}(u, v) = \frac{\mathbf{r}_u \cdot \mathbf{r}_v}{\|\mathbf{r}_u\| \, \|\mathbf{r}_v\|}\]
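As a quick sanity check of the formula, a toy example with two hand-made 4-item rating vectors (values invented for illustration) computes the similarity directly:

```python
import numpy as np

# Two toy user rating vectors; 0 = unrated, so unrated positions
# contribute nothing to the dot product in the numerator.
r_u = np.array([5.0, 0.0, 3.0, 4.0])
r_v = np.array([4.0, 2.0, 0.0, 5.0])

cos_sim = r_u @ r_v / (np.linalg.norm(r_u) * np.linalg.norm(r_v))
print(f'sim(u, v) = {cos_sim:.4f}')  # 40 / (sqrt(50) * sqrt(45)) = 0.8433
```

Note that the norms in the denominator still include every item each user rated, so two users are only maximally similar when their full rating profiles are proportional.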

from sklearn.metrics.pairwise import cosine_similarity

R = ratings_df.values.copy()  # shape (N_USERS, N_ITEMS)

# User-user cosine similarity
user_sim = cosine_similarity(R)          # (100, 100)
np.fill_diagonal(user_sim, 0)            # exclude self-similarity

user_sim_df = pd.DataFrame(user_sim, index=user_ids, columns=user_ids)

def user_based_predict(ratings, user_sim_matrix, user_idx, item_idx, k=10):
    """Predict rating for user_idx on item_idx using top-k similar users."""
    sims = user_sim_matrix[user_idx]             # similarities to all other users
    # Only consider users who have rated the item
    rated_mask = ratings[:, item_idx] > 0
    if rated_mask.sum() == 0:
        return 0.0
    sims_rated = sims * rated_mask.astype(float)
    top_k_idx  = np.argsort(sims_rated)[::-1][:k]
    top_k_sims = sims_rated[top_k_idx]
    top_k_rats = ratings[top_k_idx, item_idx]
    denom = np.abs(top_k_sims).sum()
    if denom == 0:
        return 0.0
    return (top_k_sims * top_k_rats).sum() / denom

# Example prediction
pred = user_based_predict(R, user_sim, user_idx=0, item_idx=5, k=10)
print(f'User-based predicted rating for user_000 on item_005: {pred:.3f}')
print(f'Actual rating (0=unrated):                            {R[0, 5]:.1f}')

3. Item-Based Collaborative Filtering

Idea: Instead of finding similar users, find items similar to what the user has already rated. This is more scalable — item similarities are stable over time and can be precomputed.

Amazon's original recommendation patent was item-based CF.
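Because item similarities change slowly, a common serving pattern is to precompute each item's neighbor list offline and do O(k) lookups at request time. A minimal sketch on a toy 3×3 similarity matrix (the helper name is illustrative, not a library API):

```python
import numpy as np

def precompute_neighbors(item_sim, k=2):
    """Build a {item index: top-k most similar item indices} table once,
    offline, assuming the dense item-item matrix fits in memory."""
    sim = item_sim.copy()
    np.fill_diagonal(sim, -np.inf)  # an item is never its own neighbor
    return {i: np.argsort(sim[i])[::-1][:k] for i in range(sim.shape[0])}

S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
neighbors = precompute_neighbors(S, k=2)
print(neighbors[0])  # item 0's nearest neighbors: items 1 and 2
```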

# Item-item cosine similarity (transpose: items are rows)
item_sim = cosine_similarity(R.T)        # (200, 200)
np.fill_diagonal(item_sim, 0)

def item_based_predict(ratings, item_sim_matrix, user_idx, item_idx, k=10):
    """Predict rating for user_idx on item_idx using top-k similar items the user rated."""
    sims = item_sim_matrix[item_idx]             # similarity of item_idx to all items
    user_rated_mask = ratings[user_idx, :] > 0   # items the user has rated
    if user_rated_mask.sum() == 0:
        return 0.0
    sims_rated = sims * user_rated_mask.astype(float)
    top_k_idx  = np.argsort(sims_rated)[::-1][:k]
    top_k_sims = sims_rated[top_k_idx]
    top_k_rats = ratings[user_idx, top_k_idx]
    denom = np.abs(top_k_sims).sum()
    if denom == 0:
        return 0.0
    return (top_k_sims * top_k_rats).sum() / denom

pred_item = item_based_predict(R, item_sim, user_idx=0, item_idx=5, k=10)
print(f'Item-based predicted rating for user_000 on item_005: {pred_item:.3f}')

# Top 5 items similar to item_005
top5 = np.argsort(item_sim[5])[::-1][:5]
print(f'Items most similar to item_005: {[item_ids[i] for i in top5]}')

4. Matrix Factorization with SVD

Idea: Decompose the ratings matrix \(R \approx U \Sigma V^T\) where \(U\) = user factors, \(V\) = item factors. Missing ratings are predicted by the reconstruction.

We use Truncated SVD (only keep top-\(k\) singular values) to get compact latent factors.

from sklearn.decomposition import TruncatedSVD

# Fill missing ratings with each user's mean observed rating before SVD.
# Note: np.divide needs an explicit out= array here — with where= alone,
# the masked-out entries would be left uninitialized.
R_filled = R.copy()
counts = (R != 0).sum(axis=1)
row_means = np.divide(R.sum(axis=1), counts, out=np.zeros(N_USERS), where=counts != 0)
for i in range(N_USERS):
    R_filled[i, R[i] == 0] = row_means[i] if row_means[i] > 0 else 3.0

K = 20  # latent factors
svd = TruncatedSVD(n_components=K, random_state=42)
U_k = svd.fit_transform(R_filled)           # (100, 20) — already scaled by singular values (U @ Sigma)
V_k = svd.components_                       # (20, 200)
Sigma_k = np.diag(svd.singular_values_)     # (20, 20) — for reference; already folded into U_k

R_reconstructed = U_k @ V_k                 # (100, 200) — predicted ratings

# Clip to valid rating range
R_pred_svd = np.clip(R_reconstructed, 1, 5)

print(f'Explained variance ratio (top {K} components): {svd.explained_variance_ratio_.sum():.3f}')
print(f'SVD predicted rating for user_000 on item_005: {R_pred_svd[0, 5]:.3f}')

# Show latent factor shape
print(f'User factor matrix U_k: {U_k.shape}  — each user is a {K}-dim vector')
print(f'Item factor matrix V_k: {V_k.T.shape} — each item is a {K}-dim vector')
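The low-rank idea can be checked independently of sklearn: by the Eckart–Young theorem, truncating `np.linalg.svd` to the top-k singular values gives the best rank-k approximation, so reconstruction error shrinks as k grows. A self-contained sketch on synthetic data (a noisy rank-8 matrix, unrelated to the ratings matrix above):

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy matrix with true rank-8 structure plus a little noise
A = rng.normal(0, 1, (30, 8)) @ rng.normal(0, 1, (8, 40)) \
    + 0.01 * rng.normal(0, 1, (30, 40))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_error(k):
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k reconstruction
    return np.linalg.norm(A - A_k)      # Frobenius error

for k in (2, 4, 8):
    print(f'k={k}: error = {rank_k_error(k):.4f}')
# Error drops as k grows and is near zero once k reaches the true rank.
```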

5. ALS — Alternating Least Squares

Idea: Factorize \(R \approx P Q^T\) by alternately fixing \(Q\) and solving for \(P\) in closed form, then fixing \(P\) and solving for \(Q\). Each step is a simple ridge regression.

ALS handles missing data better than vanilla SVD because it only tries to reconstruct observed ratings.

\[p_u = \left( Q_{I_u}^T Q_{I_u} + \lambda I \right)^{-1} Q_{I_u}^T r_{u, I_u}\]

where \(I_u\) is the set of items rated by user \(u\), \(Q_{I_u}\) stacks those items' factor rows, and \(I\) is the \(K \times K\) identity matrix.

def als(R, K=10, n_epochs=20, lam=0.1, verbose=True):
    """
    Minimal ALS implementation.
    R      : (n_users, n_items) rating matrix (0 = unobserved)
    K      : number of latent factors
    n_epochs: number of ALS iterations
    lam    : regularization strength
    """
    n_users, n_items = R.shape
    observed = R > 0

    # Random initialization
    rng = np.random.default_rng(42)
    P = rng.normal(0, 0.1, (n_users, K))  # user factors
    Q = rng.normal(0, 0.1, (n_items, K))  # item factors

    reg = lam * np.eye(K)

    for epoch in range(n_epochs):
        # Fix Q, solve for P
        for u in range(n_users):
            I_u = np.where(observed[u])[0]
            if len(I_u) == 0:
                continue
            Q_u = Q[I_u]                          # (|I_u|, K)
            r_u = R[u, I_u]                        # (|I_u|,)
            P[u] = np.linalg.solve(Q_u.T @ Q_u + reg, Q_u.T @ r_u)

        # Fix P, solve for Q
        for i in range(n_items):
            U_i = np.where(observed[:, i])[0]
            if len(U_i) == 0:
                continue
            P_i = P[U_i]                          # (|U_i|, K)
            r_i = R[U_i, i]                        # (|U_i|,)
            Q[i] = np.linalg.solve(P_i.T @ P_i + reg, P_i.T @ r_i)

        # RMSE on observed ratings
        R_hat = P @ Q.T
        obs_idx = observed
        rmse = np.sqrt(((R[obs_idx] - R_hat[obs_idx]) ** 2).mean())
        if verbose and (epoch + 1) % 5 == 0:
            print(f'  Epoch {epoch+1:3d}  RMSE (observed): {rmse:.4f}')

    return P, Q

print('Running ALS (K=10, 20 epochs)...')
P, Q = als(R, K=10, n_epochs=20, lam=0.1)
R_pred_als = np.clip(P @ Q.T, 1, 5)
print(f'ALS predicted rating for user_000 on item_005: {R_pred_als[0, 5]:.3f}')
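Once P and Q are learned, serving top-N recommendations is just a dot product plus masking of already-rated items. A self-contained sketch with tiny hand-built factors (the function name and demo values are invented for illustration; in the notebook you would pass the trained P, Q, and R):

```python
import numpy as np

def recommend_top_n(P, Q, R_known, user_idx, n=5):
    """Score every item for one user via latent factors, mask items
    the user has already rated, and return the top-n item indices."""
    scores = (P[user_idx] @ Q.T).astype(float)
    scores[R_known[user_idx] > 0] = -np.inf   # never re-recommend rated items
    return np.argsort(scores)[::-1][:n]

# One user whose taste loads on factor 0; item 0 is already rated
P_demo = np.array([[1.0, 0.0]])
Q_demo = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])
R_demo = np.array([[5.0, 0.0, 0.0, 0.0]])

print(recommend_top_n(P_demo, Q_demo, R_demo, user_idx=0, n=2))  # [2 1]
```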

6. Evaluation

Train/test split for recommenders: randomly hide some observed ratings, treat them as the test set.

Metrics:

  • RMSE — how accurate are predicted ratings?

  • Precision@K — of the top-K recommendations, how many are relevant?

  • Recall@K — of all relevant items, how many are in the top-K?

  • NDCG@K — normalized discounted cumulative gain, rewards putting relevant items higher

# --- Train / Test Split ---
observed_indices = list(zip(*np.where(R > 0)))
rng = np.random.default_rng(0)
rng.shuffle(observed_indices)
n_test = int(0.2 * len(observed_indices))
test_idx  = observed_indices[:n_test]
train_idx = observed_indices[n_test:]

R_train = np.zeros_like(R)
for u, i in train_idx:
    R_train[u, i] = R[u, i]

# Retrain ALS on train split
P_tr, Q_tr = als(R_train, K=10, n_epochs=20, lam=0.1, verbose=False)
R_hat_tr = np.clip(P_tr @ Q_tr.T, 1, 5)

# --- RMSE ---
test_u = [u for u, i in test_idx]
test_i = [i for u, i in test_idx]
rmse = np.sqrt(np.mean((R[test_u, test_i] - R_hat_tr[test_u, test_i]) ** 2))
print(f'Test RMSE: {rmse:.4f}')
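A test RMSE only means something relative to a trivial baseline, e.g. always predicting the mean of the observed training ratings; ALS should beat this comfortably. A self-contained sketch on a hypothetical 2×3 split (values invented for illustration):

```python
import numpy as np

def global_mean_rmse(R_train, R_true, test_idx):
    """RMSE of the constant predictor that always outputs the mean
    observed training rating -- a floor any real model should beat."""
    mu = R_train[R_train > 0].mean()
    sq_errs = [(R_true[u, i] - mu) ** 2 for u, i in test_idx]
    return np.sqrt(np.mean(sq_errs))

R_tr   = np.array([[5.0, 0.0, 3.0], [0.0, 4.0, 0.0]])   # train (0 = held out / unrated)
R_full = np.array([[5.0, 2.0, 3.0], [1.0, 4.0, 5.0]])   # full ground truth
held_out = [(0, 1), (1, 0)]                              # the hidden test cells

print(f'{global_mean_rmse(R_tr, R_full, held_out):.4f}')  # sqrt(6.5) = 2.5495
```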

# --- Precision@K, Recall@K, NDCG@K ---
def precision_recall_ndcg_at_k(R_true, R_pred, k=10, threshold=4.0):
    precisions, recalls, ndcgs = [], [], []
    for u in range(R_true.shape[0]):
        relevant = set(np.where(R_true[u] >= threshold)[0])
        if not relevant:
            continue
        top_k = np.argsort(R_pred[u])[::-1][:k]
        hits  = [1 if i in relevant else 0 for i in top_k]
        precisions.append(sum(hits) / k)
        recalls.append(sum(hits) / len(relevant))
        # NDCG
        dcg  = sum(h / np.log2(rank + 2) for rank, h in enumerate(hits))
        idcg = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(relevant), k)))
        ndcgs.append(dcg / idcg if idcg > 0 else 0)
    return np.mean(precisions), np.mean(recalls), np.mean(ndcgs)

p, r, ndcg = precision_recall_ndcg_at_k(R, R_hat_tr, k=10, threshold=4.0)
print(f'Precision@10 : {p:.4f}')
print(f'Recall@10    : {r:.4f}')
print(f'NDCG@10      : {ndcg:.4f}')

7. The Cold Start Problem

Collaborative filtering breaks down in two scenarios:

| Scenario | Problem | Solutions |
|----------|---------|-----------|
| New user | No ratings yet — can't compute similarity | Ask for preferences during onboarding, use demographic/contextual features, default to popularity |
| New item | No ratings yet — never recommended | Content-based fallback, use item metadata for initial recommendations |
| Long tail | Obscure items rarely rated | Hybrid models, content features for rare items |

# Simulate cold start: new user with 0, 1, 3, 5 ratings
def cold_start_demo(R, item_sim, n_seed_ratings=3, k=10):
    """
    For a new user with only n_seed_ratings, show item-based CF recommendations.
    Content-based would be preferable at very low counts.
    """
    # Use user 0's existing ratings as ground truth for the simulated new user
    rng = np.random.default_rng(7)
    rated_items = np.where(R[0] > 0)[0]
    seed_items  = rng.choice(rated_items, size=min(n_seed_ratings, len(rated_items)), replace=False)

    new_user_ratings = np.zeros(N_ITEMS)
    new_user_ratings[seed_items] = R[0, seed_items]

    # Item-based score: sum of similarities to rated items, weighted by rating
    scores = item_sim @ new_user_ratings
    scores[seed_items] = -np.inf   # don't recommend already-rated items
    top_k  = np.argsort(scores)[::-1][:k]
    return top_k

for n_seed in [0, 1, 3, 5]:
    recs = cold_start_demo(R, item_sim, n_seed_ratings=n_seed)
    print(f'Seed ratings={n_seed:1d}  → top-3 recs: {[item_ids[i] for i in recs[:3]]}')

print('\nAt 0 seed ratings → fall back to global popularity ranking:')
popularity = (R > 0).sum(axis=0)         # number of users who rated each item
popular_items = np.argsort(popularity)[::-1][:5]
print(f'  Most popular items: {[item_ids[i] for i in popular_items]}')

8. Cheat Sheet + Exercises

Collaborative Filtering Cheat Sheet

| Method | Similarity | Prediction | Scalability | Cold Start |
|--------|------------|------------|-------------|------------|
| User-based CF | cosine / Pearson on user vectors | weighted avg of similar users' ratings | O(U²I) — slow | Bad |
| Item-based CF | cosine on item vectors | weighted avg of similar items the user rated | O(I²U) — precomputable | Moderate |
| SVD | — | low-rank matrix reconstruction | O(UKI) at inference | Bad |
| ALS | — | latent factor dot product | scalable, parallelizable | Bad |

# Core patterns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

# User sim
user_sim = cosine_similarity(R)          # R: (users, items)
# Item sim
item_sim = cosine_similarity(R.T)        # R.T: (items, users)
# SVD
svd  = TruncatedSVD(n_components=20)
U_k  = svd.fit_transform(R_filled)
R_hat = U_k @ svd.components_

Exercises

  1. Mean-centered CF: Subtract each user's mean rating before computing cosine similarity (this accounts for users who rate everything high or low). Compare RMSE with and without mean-centering.

  2. Pearson correlation: Replace cosine similarity with Pearson correlation for user-based CF. When does Pearson outperform cosine?

  3. ALS with bias terms: Extend the ALS implementation to include user bias \(b_u\) and item bias \(b_i\), so the prediction is \(\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i\). Retrain and compare RMSE.

  4. Implicit feedback ALS: Modify ALS to handle binary implicit feedback (clicked=1, not clicked=0) using the confidence weighting \(c_{ui} = 1 + \alpha \cdot r_{ui}\).

  5. Coverage and serendipity: Compute catalog coverage (% of items ever recommended in top-10 across all users) and serendipity (% of recommendations that are not obviously similar to a user's history). Compare user-based vs item-based CF.
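As a starting point for Exercise 1, a minimal sketch of mean-centering (only observed ratings are shifted; unrated entries stay at 0, which after centering reads as "no opinion" rather than "lowest possible rating"):

```python
import numpy as np

def mean_center(R):
    """Subtract each user's mean observed rating from their observed
    ratings; unobserved entries (0) are left untouched."""
    Rc = R.astype(float).copy()
    for u in range(R.shape[0]):
        rated = R[u] > 0
        if rated.any():
            Rc[u, rated] -= R[u, rated].mean()
    return Rc

R_toy = np.array([[5.0, 3.0, 0.0],
                  [2.0, 0.0, 4.0]])
print(mean_center(R_toy))
# Row 0 mean is 4 -> [ 1, -1, 0]; row 1 mean is 3 -> [-1, 0, 1]
```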