Content-Based Filtering: Recommending by Item Similarity

Content-based recommenders need no data from other users: they recommend items similar to those a user has already liked, based purely on item features. This notebook covers TF-IDF similarity, feature engineering for items, and hybrid approaches.

1. Setup — Synthetic Movie Dataset

200 movies with: title, genre(s), description, director, cast, year, average rating, and vote count.

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

N_MOVIES = 200

GENRES = ['Action', 'Comedy', 'Drama', 'Sci-Fi', 'Thriller',
          'Romance', 'Horror', 'Animation', 'Documentary', 'Fantasy']
DIRECTORS = [f'Director_{i}' for i in range(20)]
ACTORS    = [f'Actor_{i}'    for i in range(50)]

desc_templates = [
    "A gripping tale of {adj} heroes who must {verb} against overwhelming {noun}.",
    "In a {adj} world, two {noun}s find themselves on a {adj} journey of {verb}ing.",
    "An {adj} story about {noun} and the {adj} consequences of {verb}ing.",
    "When a {adj} {noun} threatens the city, an unlikely hero must {verb} to save the day.",
    "A {adj} drama exploring the {noun} of human connection and the art of {verb}ing.",
]
adjs  = ['brave','dark','thrilling','mysterious','heartwarming','epic','quiet','haunting','vibrant','absurd']
verbs = ['fight','escape','discover','survive','transform','confront','explore','rebuild','question','celebrate']
nouns = ['villain','society','memory','identity','time','love','chaos','power','truth','nature']

def make_description(rng):
    tmpl = rng.choice(desc_templates)
    return tmpl.format(
        adj=rng.choice(adjs),
        verb=rng.choice(verbs),
        noun=rng.choice(nouns)
    )

rng = np.random.default_rng(42)
records = []
for i in range(N_MOVIES):
    n_genres = rng.integers(1, 4)
    n_cast   = rng.integers(2, 6)
    records.append({
        'movie_id'   : f'movie_{i:03d}',
        'title'      : f'Movie Title {i}',
        'genres'     : '|'.join(rng.choice(GENRES, size=int(n_genres), replace=False).tolist()),
        'description': make_description(rng),
        'director'   : rng.choice(DIRECTORS),
        'cast'       : '|'.join(rng.choice(ACTORS, size=int(n_cast), replace=False).tolist()),
        'year'       : int(rng.integers(1980, 2024)),
        'avg_rating' : round(float(rng.uniform(2.5, 5.0)), 2),
        'n_votes'    : int(rng.integers(100, 50000)),
    })

movies = pd.DataFrame(records)
print(f'Dataset: {movies.shape}')
movies.head(3)

2. Feature Engineering for Items

Three types of features:

  • Text features — TF-IDF on description

  • Categorical features — one-hot encoded genres, director

  • Numeric features — year (normalized), avg_rating, log(n_votes)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
import scipy.sparse as sp

# --- Text: TF-IDF on description ---
tfidf = TfidfVectorizer(max_features=500, stop_words='english', ngram_range=(1, 2))
desc_matrix = tfidf.fit_transform(movies['description'])  # sparse, (200, up to 500 terms)

# --- Categorical: one-hot genres ---
genre_dummies = movies['genres'].str.get_dummies(sep='|')  # (200, 10): one column per unique genre

# --- Categorical: director one-hot ---
director_dummies = pd.get_dummies(movies['director'], prefix='dir')  # (200, 20)

# --- Numeric: year, avg_rating, log(n_votes) ---
numeric_feats = movies[['year', 'avg_rating', 'n_votes']].copy()
numeric_feats['log_votes'] = np.log1p(numeric_feats['n_votes'])
numeric_feats = numeric_feats[['year', 'avg_rating', 'log_votes']]
scaler = MinMaxScaler()
numeric_scaled = scaler.fit_transform(numeric_feats)  # (200, 3)

print(f'TF-IDF features  : {desc_matrix.shape}')
print(f'Genre features   : {genre_dummies.shape}')
print(f'Director features: {director_dummies.shape}')
print(f'Numeric features : {numeric_scaled.shape}')
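Section 5 below densifies everything for simplicity; for a larger catalog you would keep the stack sparse instead. A minimal sketch using the scipy.sparse import (sp) from the cell above:

# Sparse alternative: stack all feature blocks without calling .toarray()
item_features_sparse = sp.hstack([
    desc_matrix,                                    # already sparse CSR
    sp.csr_matrix(genre_dummies.values),
    sp.csr_matrix(director_dummies.values.astype(float)),
    sp.csr_matrix(numeric_scaled),
]).tocsr()
print(f'Sparse stacked features: {item_features_sparse.shape}')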

3. TF-IDF Cosine Similarity — Find Similar Movies by Description

TF-IDF (Term Frequency–Inverse Document Frequency) weights words that are frequent in a document but rare across the corpus — capturing the document’s distinctive vocabulary.

\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)}\]
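A quick numeric check of the formula (note that sklearn's TfidfVectorizer uses a smoothed variant, idf = ln((1 + N) / (1 + DF)) + 1, followed by L2 row normalization, so its weights will differ from this raw version):

import math
N_docs, df_t, tf_t = 200, 5, 3         # corpus size, doc frequency, term frequency
print(tf_t * math.log(N_docs / df_t))  # about 11.07: rare terms get large weights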
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Cosine similarity on TF-IDF. TfidfVectorizer L2-normalizes rows by default,
# so the plain dot product (linear_kernel) equals cosine similarity but is faster.
desc_sim = linear_kernel(desc_matrix, desc_matrix)  # (200, 200)

def get_similar_movies_by_description(movie_idx, sim_matrix, top_k=5):
    sims = sim_matrix[movie_idx].copy()
    sims[movie_idx] = -1  # exclude self
    top_idx = np.argsort(sims)[::-1][:top_k]
    result = movies.iloc[top_idx][['movie_id', 'title', 'genres', 'description']].copy()
    result['similarity'] = sims[top_idx]
    return result

print('Movie 0:', movies.iloc[0]['description'])
print('\nMost similar by description:')
print(get_similar_movies_by_description(0, desc_sim)[['movie_id', 'similarity', 'description']].to_string())
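The equivalence noted in the comment above is easy to verify:

# linear_kernel on L2-normalized rows is exactly cosine similarity
assert np.allclose(desc_sim, cosine_similarity(desc_matrix))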

4. Combined Feature Similarity — Weighted Combination

Combine text, categorical, and numeric features with tunable weights:

\[\text{sim}_{\text{combined}}(a, b) = w_1 \cdot \text{sim}_{\text{text}} + w_2 \cdot \text{sim}_{\text{genre}} + w_3 \cdot \text{sim}_{\text{director}} + w_4 \cdot \text{sim}_{\text{numeric}}\]
# Compute individual similarity matrices
genre_sim    = cosine_similarity(genre_dummies.values)     # (200, 200)
director_sim = cosine_similarity(director_dummies.values)  # (200, 200)
numeric_sim  = cosine_similarity(numeric_scaled)           # (200, 200)

# Weights (must sum to 1)
W_DESC  = 0.40
W_GENRE = 0.30
W_DIR   = 0.15
W_NUM   = 0.15

combined_sim = (W_DESC  * desc_sim +
                W_GENRE * genre_sim +
                W_DIR   * director_sim +
                W_NUM   * numeric_sim)

np.fill_diagonal(combined_sim, 0)  # exclude self-matches (all component sims are >= 0, so 0 never ranks top)

def get_similar_movies_combined(movie_idx, top_k=5):
    sims = combined_sim[movie_idx].copy()
    top_idx = np.argsort(sims)[::-1][:top_k]
    result = movies.iloc[top_idx][['movie_id', 'genres', 'director']].copy()
    result['combined_sim'] = sims[top_idx]
    return result

print(f'Movie 0: genres={movies.iloc[0]["genres"]}  director={movies.iloc[0]["director"]}')
print('\nTop-5 similar (combined):')
print(get_similar_movies_combined(0).to_string())

5. User Profile Building

Idea: Represent a user as the weighted average of the feature vectors of items they’ve liked. Then find items closest to this profile vector.

# Build a combined dense feature matrix for all movies
# Stack: TF-IDF (dense) + genre + director + numeric
feature_matrix = np.hstack([
    desc_matrix.toarray() * W_DESC,
    genre_dummies.values  * W_GENRE,
    director_dummies.values * W_DIR,
    numeric_scaled        * W_NUM,
])  # (200, n_tfidf + 10 + 20 + 3); the TF-IDF vocab may be under 500 on this small corpus

print(f'Combined feature matrix: {feature_matrix.shape}')

# Simulate a user who liked movies 5, 12, 34, 78
liked_movies = [5, 12, 34, 78]
liked_ratings = np.array([5, 4, 5, 3])

# User profile = weighted average of liked movie features
weights = liked_ratings / liked_ratings.sum()
user_profile = (feature_matrix[liked_movies] * weights[:, None]).sum(axis=0)  # (n_features,)

# Find most similar movies to the user profile
from sklearn.metrics.pairwise import cosine_similarity as cos_sim
scores = cos_sim(user_profile.reshape(1, -1), feature_matrix).flatten()
scores[liked_movies] = -1  # exclude already-liked
top_recs = np.argsort(scores)[::-1][:5]

print('\nUser liked:')
print(movies.iloc[liked_movies][['movie_id', 'genres', 'director']].to_string())
print('\nContent-based recommendations:')
rec_df = movies.iloc[top_recs][['movie_id', 'genres', 'director']].copy()
rec_df['score'] = scores[top_recs]
print(rec_df.to_string())
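With 200 items a dense score vector is fine; at catalog scale you would query a nearest-neighbor index instead of scoring every item. A sketch with sklearn's NearestNeighbors (exact brute-force cosine here; an approximate index such as Annoy or FAISS slots in the same way):

from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=10, metric='cosine').fit(feature_matrix)
_, idx = nn.kneighbors(user_profile.reshape(1, -1))
# cosine distance = 1 - cosine similarity, so nearest = most similar
nn_recs = [i for i in idx.flatten() if i not in set(liked_movies)][:5]
print(movies.iloc[nn_recs][['movie_id', 'genres']].to_string())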

6. Hybrid: Content + Collaborative

Neither method is best alone. Hybrid recommenders combine them:

  • Weighted hybrid: \(\text{score} = \alpha \cdot \text{CF\_score} + (1-\alpha) \cdot \text{CB\_score}\)

  • Switching hybrid: use CB for new users/items, CF when enough ratings

  • Feature augmentation: use CB predictions as features in CF model

# Simulate CF scores (e.g., ALS predicted ratings normalized to [0,1])
rng2 = np.random.default_rng(99)
cf_scores_user0 = rng2.uniform(0, 1, N_MOVIES)

# Content-based scores for user 0 (from profile above)
cb_scores_user0 = scores.copy()
cb_scores_user0 = np.clip(cb_scores_user0, 0, None)
# Normalize to [0,1]
cb_norm = (cb_scores_user0 - cb_scores_user0.min()) / (cb_scores_user0.max() - cb_scores_user0.min() + 1e-9)
cf_norm = (cf_scores_user0 - cf_scores_user0.min()) / (cf_scores_user0.max() - cf_scores_user0.min() + 1e-9)

alpha = 0.6  # weight for CF
hybrid_scores = alpha * cf_norm + (1 - alpha) * cb_norm
hybrid_scores[liked_movies] = -1

hybrid_top5 = np.argsort(hybrid_scores)[::-1][:5]
cb_top5     = np.argsort(cb_norm)[::-1][:5]
cf_top5     = np.argsort(cf_norm)[::-1][:5]

print(f'CF-only top-5    : {[movies.iloc[i]["movie_id"] for i in cf_top5]}')
print(f'CB-only top-5    : {[movies.iloc[i]["movie_id"] for i in cb_top5]}')
print(f'Hybrid top-5     : {[movies.iloc[i]["movie_id"] for i in hybrid_top5]}')

overlap_cf_hybrid = len(set(cf_top5) & set(hybrid_top5))
overlap_cb_hybrid = len(set(cb_top5) & set(hybrid_top5))
print(f'\nOverlap CF∩Hybrid: {overlap_cf_hybrid}/5')
print(f'Overlap CB∩Hybrid: {overlap_cb_hybrid}/5')
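Only the weighted hybrid is exercised above. The switching strategy from the bullet list needs just a per-item rating count; the n_ratings argument and the threshold of 5 below are illustrative assumptions:

# Switching hybrid: content-based for cold items, CF once ratings accumulate
MIN_RATINGS = 5

def switching_score(item_idx, n_ratings):
    return cf_norm[item_idx] if n_ratings >= MIN_RATINGS else cb_norm[item_idx]

print(switching_score(0, n_ratings=0))   # cold item: falls back to content-based
print(switching_score(0, n_ratings=50))  # warm item: uses collaborative filtering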

7. Cold Start Advantage — Why Content-Based Beats CF for New Items

When a new movie is added with zero ratings, collaborative filtering cannot recommend it at all (it has no latent factors). Content-based filtering works immediately from metadata.

# Simulate a brand-new item (no ratings yet)
new_item = {
    'movie_id'   : 'movie_new',
    'title'      : 'The New Release',
    'genres'     : 'Sci-Fi|Thriller',
    'description': 'A thrilling sci-fi story about brave heroes who must fight against dark villains.',
    'director'   : 'Director_3',
    'cast'       : 'Actor_1|Actor_7',
    'year'       : 2024,
    'avg_rating' : 0.0,
    'n_votes'    : 0,
}

# Extract features for the new item using the SAME fitted transformers
new_desc_vec  = tfidf.transform([new_item['description']]).toarray() * W_DESC
new_genre_vec = np.zeros((1, genre_dummies.shape[1]))
for g in new_item['genres'].split('|'):
    if g in genre_dummies.columns:
        new_genre_vec[0, list(genre_dummies.columns).index(g)] = 1
new_genre_vec *= W_GENRE

new_dir_vec = np.zeros((1, director_dummies.shape[1]))
dir_col = f"dir_{new_item['director']}"
if dir_col in director_dummies.columns:
    new_dir_vec[0, list(director_dummies.columns).index(dir_col)] = 1
new_dir_vec *= W_DIR

# Year is real; the rating (3.0) and vote count (100) are neutral placeholders,
# since a brand-new item has neither
new_num = scaler.transform([[new_item['year'], 3.0, np.log1p(100)]]) * W_NUM

new_feature_vec = np.hstack([new_desc_vec, new_genre_vec, new_dir_vec, new_num])  # (1, n_feats)
new_scores = cos_sim(new_feature_vec, feature_matrix).flatten()
new_top5   = np.argsort(new_scores)[::-1][:5]

print('New item genres:', new_item['genres'], '  director:', new_item['director'])
print('\nContent-based similar items (works with ZERO ratings):')
res = movies.iloc[new_top5][['movie_id', 'genres', 'director']].copy()
res['sim'] = new_scores[new_top5]
print(res.to_string())
print('\nCF would return nothing (no ratings → no latent factors). CB works immediately.')

8. Cheat Sheet + Exercises

Content-Based Filtering Cheat Sheet

Step                  Tool              Code
--------------------  ----------------  --------------------------------------------
Text similarity       TF-IDF + cosine   TfidfVectorizer + linear_kernel
Categorical features  One-hot           pd.get_dummies / str.get_dummies(sep='|')
Numeric features      Normalize         MinMaxScaler
User profile          Weighted avg      (features[liked] * weights[:, None]).sum(0)
Recommend             Nearest neighbor  cosine_similarity(profile, features)

# Full CB pipeline in 5 lines (given a DataFrame df with 'description' and
# 'genres' columns, and liked_ids: indices of the user's liked items)
tfidf_matrix = TfidfVectorizer(max_features=500).fit_transform(df['description'])
genre_feats  = df['genres'].str.get_dummies('|')
features = np.hstack([tfidf_matrix.toarray(), genre_feats.values])
profile  = features[liked_ids].mean(axis=0)
scores   = cosine_similarity(profile.reshape(1, -1), features).flatten()

Exercises

  1. Weight tuning: Systematically vary the weights (W_DESC, W_GENRE, W_DIR, W_NUM) using a grid search. Define a simple evaluation: for users with known liked movies, how often does the top-5 contain at least one movie of the same primary genre?

  2. Word embeddings: Replace TF-IDF with averaged word embeddings from gensim (or use a pre-trained sentence-transformers model on the descriptions). Compare the quality of retrieved similar movies.

  3. Decay over time: Add a temporal decay to the user profile — older liked movies should contribute less. Implement an exponential decay: weight ∝ exp(-λ·days_since_rating). Test with λ ∈ {0.001, 0.01, 0.1}.

  4. Diversity: A top-5 of nearly identical movies is boring. Implement Maximal Marginal Relevance (MMR): iteratively select the next recommendation that maximizes λ·relevance - (1-λ)·max_similarity_to_already_selected. Tune λ to balance relevance vs diversity.

  5. Hybrid weight learning: Treat alpha in the hybrid model as a learnable parameter. For users who have enough ratings, find the alpha that minimizes RMSE on a held-out set using scipy.optimize.minimize_scalar.