Content-Based Filtering: Recommending by Item Similarity¶
Content-based recommenders don’t need other users’ data — they recommend items similar to what the user has liked before, based on item features alone. This notebook covers TF-IDF similarity, feature engineering for items, and hybrid approaches.
1. Setup — Synthetic Movie Dataset¶
200 movies with: title, genre(s), description, director, cast, year, average rating.
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
N_MOVIES = 200
GENRES = ['Action', 'Comedy', 'Drama', 'Sci-Fi', 'Thriller',
          'Romance', 'Horror', 'Animation', 'Documentary', 'Fantasy']
DIRECTORS = [f'Director_{i}' for i in range(20)]
ACTORS = [f'Actor_{i}' for i in range(50)]
desc_templates = [
    "A gripping tale of {adj} heroes who must {verb} against overwhelming {noun}.",
    "In a {adj} world, two {noun}s find themselves on a {adj} journey of {verb}ing.",
    "An {adj} story about {noun} and the {adj} consequences of {verb}ing.",
    "When a {adj} {noun} threatens the city, an unlikely hero must {verb} to save the day.",
    "A {adj} drama exploring the {noun} of human connection and the art of {verb}ing.",
]
adjs = ['brave','dark','thrilling','mysterious','heartwarming','epic','quiet','haunting','vibrant','absurd']
verbs = ['fight','escape','discover','survive','transform','confront','explore','rebuild','question','celebrate']
nouns = ['villain','society','memory','identity','time','love','chaos','power','truth','nature']
def make_description(rng):
    tmpl = rng.choice(desc_templates)
    return tmpl.format(
        adj=rng.choice(adjs),
        verb=rng.choice(verbs),
        noun=rng.choice(nouns),
    )
rng = np.random.default_rng(42)
records = []
for i in range(N_MOVIES):
    n_genres = rng.integers(1, 4)
    n_cast = rng.integers(2, 6)
    records.append({
        'movie_id'   : f'movie_{i:03d}',
        'title'      : f'Movie Title {i}',
        'genres'     : '|'.join(rng.choice(GENRES, size=int(n_genres), replace=False).tolist()),
        'description': make_description(rng),
        'director'   : rng.choice(DIRECTORS),
        'cast'       : '|'.join(rng.choice(ACTORS, size=int(n_cast), replace=False).tolist()),
        'year'       : int(rng.integers(1980, 2024)),
        'avg_rating' : round(float(rng.uniform(2.5, 5.0)), 2),
        'n_votes'    : int(rng.integers(100, 50000)),
    })
movies = pd.DataFrame(records)
print(f'Dataset: {movies.shape}')
movies.head(3)
2. Feature Engineering for Items¶
Three types of features:
- **Text features** — TF-IDF on `description`
- **Categorical features** — one-hot encoded genres and director
- **Numeric features** — `year` (normalized), `avg_rating`, `log(n_votes)`
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
import scipy.sparse as sp
# --- Text: TF-IDF on description ---
tfidf = TfidfVectorizer(max_features=500, stop_words='english', ngram_range=(1, 2))
desc_matrix = tfidf.fit_transform(movies['description'])  # sparse (200, ≤500), depending on vocabulary size
# --- Categorical: one-hot genres ---
genre_dummies = movies['genres'].str.get_dummies(sep='|') # (200, n_genres)
# --- Categorical: director one-hot ---
director_dummies = pd.get_dummies(movies['director'], prefix='dir') # (200, 20)
# --- Numeric: year, avg_rating, log(n_votes) ---
numeric_feats = movies[['year', 'avg_rating', 'n_votes']].copy()
numeric_feats['log_votes'] = np.log1p(numeric_feats['n_votes'])
numeric_feats = numeric_feats[['year', 'avg_rating', 'log_votes']]
scaler = MinMaxScaler()
numeric_scaled = scaler.fit_transform(numeric_feats) # (200, 3)
print(f'TF-IDF features : {desc_matrix.shape}')
print(f'Genre features : {genre_dummies.shape}')
print(f'Director features: {director_dummies.shape}')
print(f'Numeric features : {numeric_scaled.shape}')
3. TF-IDF Cosine Similarity — Find Similar Movies by Description¶
TF-IDF (Term Frequency–Inverse Document Frequency) weights words that are frequent in a document but rare across the corpus — capturing the document’s distinctive vocabulary.
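As a quick standalone illustration (a toy corpus, not the movie data): a word that appears in every document gets a low TF-IDF weight, while a rare word dominates the vector of the one document that contains it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "hero" appears in every document, "villain" in only one.
docs = [
    "the hero saves the city",
    "the hero fights the villain",
    "the hero goes home",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
vocab = vec.vocabulary_

# In doc 1, the rare word "villain" outweighs the ubiquitous "hero".
w_hero    = X[1, vocab['hero']]
w_villain = X[1, vocab['villain']]
print(f'hero: {w_hero:.3f}  villain: {w_villain:.3f}')
assert w_villain > w_hero
```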
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
# Cosine similarity on TF-IDF (linear_kernel is faster for sparse TF-IDF)
desc_sim = linear_kernel(desc_matrix, desc_matrix) # (200, 200)
def get_similar_movies_by_description(movie_idx, sim_matrix, top_k=5):
    sims = sim_matrix[movie_idx].copy()
    sims[movie_idx] = -1  # exclude self
    top_idx = np.argsort(sims)[::-1][:top_k]
    result = movies.iloc[top_idx][['movie_id', 'title', 'genres', 'description']].copy()
    result['similarity'] = sims[top_idx]
    return result
print('Movie 0:', movies.iloc[0]['description'])
print('\nMost similar by description:')
print(get_similar_movies_by_description(0, desc_sim)[['movie_id', 'similarity', 'description']].to_string())
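For a 200-movie catalog a full `np.argsort` is fine, but on a large catalog sorting every score just to keep the top-k is wasteful. A minimal standalone sketch of a faster top-k using `np.argpartition`:

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, sorted descending.
    np.argpartition is O(n) vs O(n log n) for a full argsort."""
    idx = np.argpartition(scores, -k)[-k:]      # k largest, unordered
    return idx[np.argsort(scores[idx])[::-1]]   # sort just those k

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
print(top_k_indices(scores, 3))  # -> [1 3 4]
```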
4. Combined Feature Similarity — Weighted Combination¶
Combine text, categorical, and numeric features with tunable weights:
# Compute individual similarity matrices
genre_sim = cosine_similarity(genre_dummies.values) # (200, 200)
director_sim = cosine_similarity(director_dummies.values) # (200, 200)
numeric_sim = cosine_similarity(numeric_scaled) # (200, 200)
# Weights (must sum to 1)
W_DESC = 0.40
W_GENRE = 0.30
W_DIR = 0.15
W_NUM = 0.15
combined_sim = (W_DESC  * desc_sim +
                W_GENRE * genre_sim +
                W_DIR   * director_sim +
                W_NUM   * numeric_sim)
np.fill_diagonal(combined_sim, 0)
def get_similar_movies_combined(movie_idx, top_k=5):
    sims = combined_sim[movie_idx].copy()
    top_idx = np.argsort(sims)[::-1][:top_k]
    result = movies.iloc[top_idx][['movie_id', 'genres', 'director']].copy()
    result['combined_sim'] = sims[top_idx]
    return result
print(f'Movie 0: genres={movies.iloc[0]["genres"]} director={movies.iloc[0]["director"]}')
print('\nTop-5 similar (combined):')
print(get_similar_movies_combined(0).to_string())
5. User Profile Building¶
Idea: Represent a user as the weighted average of the feature vectors of items they’ve liked. Then find items closest to this profile vector.
# Build a combined dense feature matrix for all movies
# Stack: TF-IDF (dense) + genre + director + numeric
feature_matrix = np.hstack([
    desc_matrix.toarray() * W_DESC,
    genre_dummies.values * W_GENRE,
    director_dummies.values * W_DIR,
    numeric_scaled * W_NUM,
])  # (200, n_tfidf + 10 + 20 + 3)
print(f'Combined feature matrix: {feature_matrix.shape}')
# Simulate a user who liked movies 5, 12, 34, 78
liked_movies = [5, 12, 34, 78]
liked_ratings = np.array([5, 4, 5, 3])
# User profile = weighted average of liked movie features
weights = liked_ratings / liked_ratings.sum()
user_profile = (feature_matrix[liked_movies] * weights[:, None]).sum(axis=0) # (n_features,)
# Find most similar movies to the user profile
from sklearn.metrics.pairwise import cosine_similarity as cos_sim
scores = cos_sim(user_profile.reshape(1, -1), feature_matrix).flatten()
scores[liked_movies] = -1 # exclude already-liked
top_recs = np.argsort(scores)[::-1][:5]
print('\nUser liked:')
print(movies.iloc[liked_movies][['movie_id', 'genres', 'director']].to_string())
print('\nContent-based recommendations:')
rec_df = movies.iloc[top_recs][['movie_id', 'genres', 'director']].copy()
rec_df['score'] = scores[top_recs]
print(rec_df.to_string())
6. Hybrid: Content + Collaborative¶
Neither method is best alone. Hybrid recommenders combine them:
- **Weighted hybrid:** \(\text{score} = \alpha \cdot \text{CF\_score} + (1-\alpha) \cdot \text{CB\_score}\)
- **Switching hybrid:** use CB for new users/items, CF once there are enough ratings
- **Feature augmentation:** use CB predictions as features in the CF model
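The weighted hybrid is implemented below; the switching hybrid can be sketched in a few lines. This is a standalone illustration — the `MIN_RATINGS` threshold is a hypothetical value you would tune for your data:

```python
import numpy as np

MIN_RATINGS = 5  # hypothetical threshold — tune for your data

def switching_hybrid(cf_scores, cb_scores, n_user_ratings,
                     min_ratings=MIN_RATINGS):
    """Use CF scores once the user has enough ratings,
    otherwise fall back to content-based scores."""
    if n_user_ratings >= min_ratings:
        return cf_scores
    return cb_scores

cf = np.array([0.9, 0.1])
cb = np.array([0.2, 0.8])
print(switching_hybrid(cf, cb, n_user_ratings=10))  # warm user -> CF scores
print(switching_hybrid(cf, cb, n_user_ratings=2))   # cold user -> CB scores
```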
# Simulate CF scores (e.g., ALS predicted ratings normalized to [0,1])
rng2 = np.random.default_rng(99)
cf_scores_user0 = rng2.uniform(0, 1, N_MOVIES)
# Content-based scores for user 0 (from profile above)
cb_scores_user0 = scores.copy()
cb_scores_user0 = np.clip(cb_scores_user0, 0, None)
# Normalize to [0,1]
cb_norm = (cb_scores_user0 - cb_scores_user0.min()) / (cb_scores_user0.max() - cb_scores_user0.min() + 1e-9)
cf_norm = (cf_scores_user0 - cf_scores_user0.min()) / (cf_scores_user0.max() - cf_scores_user0.min() + 1e-9)
alpha = 0.6 # weight for CF
hybrid_scores = alpha * cf_norm + (1 - alpha) * cb_norm
hybrid_scores[liked_movies] = -1
hybrid_top5 = np.argsort(hybrid_scores)[::-1][:5]
cb_top5 = np.argsort(cb_norm)[::-1][:5]
cf_top5 = np.argsort(cf_norm)[::-1][:5]
print(f'CF-only top-5 : {[movies.iloc[i]["movie_id"] for i in cf_top5]}')
print(f'CB-only top-5 : {[movies.iloc[i]["movie_id"] for i in cb_top5]}')
print(f'Hybrid top-5 : {[movies.iloc[i]["movie_id"] for i in hybrid_top5]}')
overlap_cf_hybrid = len(set(cf_top5) & set(hybrid_top5))
overlap_cb_hybrid = len(set(cb_top5) & set(hybrid_top5))
print(f'\nOverlap CF∩Hybrid: {overlap_cf_hybrid}/5')
print(f'Overlap CB∩Hybrid: {overlap_cb_hybrid}/5')
7. Cold Start Advantage — Why Content-Based Beats CF for New Items¶
When a new movie is added with zero ratings, collaborative filtering cannot recommend it at all (it has no latent factors). Content-based filtering works immediately from metadata.
# Simulate a brand-new item (no ratings yet)
new_item = {
'movie_id' : 'movie_new',
'title' : 'The New Release',
'genres' : 'Sci-Fi|Thriller',
'description': 'A thrilling sci-fi story about brave heroes who must fight against dark villains.',
'director' : 'Director_3',
'cast' : 'Actor_1|Actor_7',
'year' : 2024,
'avg_rating' : 0.0,
'n_votes' : 0,
}
# Extract features for the new item using the SAME fitted transformers
new_desc_vec = tfidf.transform([new_item['description']]).toarray() * W_DESC
new_genre_vec = np.zeros((1, genre_dummies.shape[1]))
for g in new_item['genres'].split('|'):
    if g in genre_dummies.columns:
        new_genre_vec[0, list(genre_dummies.columns).index(g)] = 1
new_genre_vec *= W_GENRE
new_dir_vec = np.zeros((1, director_dummies.shape[1]))
dir_col = f"dir_{new_item['director']}"
if dir_col in director_dummies.columns:
    new_dir_vec[0, list(director_dummies.columns).index(dir_col)] = 1
new_dir_vec *= W_DIR
new_num = scaler.transform([[new_item['year'], 3.0, np.log1p(100)]]) * W_NUM  # impute neutral priors (rating ~3.0, ~100 votes) for the unrated item
new_feature_vec = np.hstack([new_desc_vec, new_genre_vec, new_dir_vec, new_num]) # (1, n_feats)
new_scores = cos_sim(new_feature_vec, feature_matrix).flatten()
new_top5 = np.argsort(new_scores)[::-1][:5]
print('New item genres:', new_item['genres'], ' director:', new_item['director'])
print('\nContent-based similar items (works with ZERO ratings):')
res = movies.iloc[new_top5][['movie_id', 'genres', 'director']].copy()
res['sim'] = new_scores[new_top5]
print(res.to_string())
print('\nCF would return nothing (no ratings → no latent factors). CB works immediately.')
8. Cheat Sheet + Exercises¶
Content-Based Filtering Cheat Sheet¶
| Step | Tool | Code |
|---|---|---|
| Text similarity | TF-IDF + cosine | `TfidfVectorizer().fit_transform(...)` + `linear_kernel` |
| Categorical features | One-hot | `str.get_dummies` / `pd.get_dummies` |
| Numeric features | Normalize | `MinMaxScaler().fit_transform(...)` |
| User profile | Weighted avg | `(feats[liked] * w[:, None]).sum(axis=0)` |
| Recommend | Nearest neighbor | `cosine_similarity(profile, feats)` + `argsort` |
# Full CB pipeline in 5 lines
tfidf_mat = TfidfVectorizer(max_features=500).fit_transform(df['description'])
genres    = df['genres'].str.get_dummies('|')
features  = np.hstack([tfidf_mat.toarray(), genres.values])
profile   = features[liked_ids].mean(axis=0)
scores    = cosine_similarity(profile.reshape(1, -1), features).flatten()
Exercises¶
1. **Weight tuning:** Systematically vary the weights `(W_DESC, W_GENRE, W_DIR, W_NUM)` with a grid search. Define a simple evaluation: for users with known liked movies, how often does the top-5 contain at least one movie of the same primary genre?
2. **Word embeddings:** Replace TF-IDF with averaged word embeddings from `gensim` (or use a pre-trained `sentence-transformers` model on the descriptions). Compare the quality of retrieved similar movies.
3. **Decay over time:** Add a temporal decay to the user profile so that older liked movies contribute less. Implement an exponential decay: weight ∝ exp(−λ·days_since_rating). Test with λ ∈ {0.001, 0.01, 0.1}.
4. **Diversity:** A top-5 of nearly identical movies is boring. Implement Maximal Marginal Relevance (MMR): iteratively select the next recommendation that maximizes λ·relevance − (1−λ)·max_similarity_to_already_selected. Tune λ to balance relevance and diversity.
5. **Hybrid weight learning:** Treat `alpha` in the hybrid model as a learnable parameter. For users with enough ratings, find the `alpha` that minimizes RMSE on a held-out set using `scipy.optimize.minimize_scalar`.