Sparse Encoders: SPLADE and Learned Sparse Representations
What: Load a SPLADE sparse encoder model and generate sparse embeddings where most dimensions are zero, then compute similarity and measure sparsity statistics.
Why: Dense embeddings (like those from Sentence Transformers) represent text as compact vectors in which nearly every dimension is non-zero. Sparse embeddings take the opposite approach: they produce vectors with the same dimensionality as the vocabulary (e.g., 30,522 for BERT's WordPiece vocabulary) but with 99%+ of the values being zero. Each non-zero dimension corresponds to a vocabulary term that the model considers relevant to the input, weighted by learned importance. This combines the semantic understanding of neural models with the interpretability and efficiency of traditional keyword-based methods like BM25.
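To make the sparsity concrete, here is a minimal sketch with made-up numbers (the term ids and weights are hypothetical, not real model output): a vocabulary-sized vector where only a handful of entries are non-zero.

```python
import numpy as np

vocab_size = 30522  # BERT's WordPiece vocabulary size
embedding = np.zeros(vocab_size)

# Hypothetical non-zero entries: term id -> learned importance weight
for term_id, weight in {2000: 1.8, 3500: 0.9, 7777: 2.3, 15000: 0.4}.items():
    embedding[term_id] = weight

active = np.count_nonzero(embedding)
sparsity = 1.0 - active / vocab_size
print(f"Active dims: {active}, sparsity: {sparsity:.4%}")
# Active dims: 4, sparsity: 99.9869% zeros
```

Storing only the (index, weight) pairs instead of the full 30,522-dimensional vector is what makes sparse embeddings cheap to index with inverted-index infrastructure.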
How: SPLADE (SParse Lexical AnD Expansion) applies a log-saturated activation to the MLM logits: \(w_j = \log(1 + \text{ReLU}(z_j))\), where \(z_j\) is the logit for vocabulary term \(j\), pooled over the input tokens. This produces a sparse vector in which only relevant terms have non-zero weights. The model also performs term expansion: it can assign non-zero weight to related terms that do not appear in the original text (e.g., "automobile" for an input containing "car").
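The activation itself is easy to sketch on toy logits (toy numbers; a real model produces one logit per vocabulary term per input token and then pools over tokens):

```python
import numpy as np

logits = np.array([4.0, 0.5, -2.0, -0.1, 7.5])   # toy MLM logits z_j
weights = np.log1p(np.maximum(logits, 0.0))      # w_j = log(1 + ReLU(z_j))

print(weights)
# Negative logits map to exactly zero -- this is where the sparsity comes from.
# The log saturates large logits, so no single term can dominate the vector.
```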
Connection: Sparse encoders are increasingly used alongside dense embeddings in hybrid search systems. The sparse component provides keyword-level precision (exact term matching) while the dense component provides semantic recall (meaning-based matching). Retrieval stacks that pair a sparse model like SPLADE with a dense or late-interaction model like ColBERT, as in Pinecone's hybrid search, combine both for state-of-the-art retrieval.
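One common way to combine the two signals is a convex combination of the sparse and dense scores. A minimal sketch for a single query/document pair follows; the mixing weight `alpha` is a hypothetical tuning parameter, and real systems may instead use reciprocal rank fusion or a learned combination:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense vectors, unit-normalized so the dot product is cosine similarity
q_dense, d_dense = rng.normal(size=64), rng.normal(size=64)
q_dense /= np.linalg.norm(q_dense)
d_dense /= np.linalg.norm(d_dense)

# Toy sparse term-weight vectors; only term id 101 overlaps
q_sparse = np.zeros(30522); q_sparse[[101, 2050]] = [1.2, 0.7]
d_sparse = np.zeros(30522); d_sparse[[101, 9999]] = [0.9, 1.5]

dense_score = float(q_dense @ d_dense)     # semantic (meaning-based) match
sparse_score = float(q_sparse @ d_sparse)  # exact term overlap: 1.2 * 0.9

alpha = 0.5  # hypothetical mixing weight between the two signals
hybrid_score = alpha * dense_score + (1 - alpha) * sparse_score
print(hybrid_score)
```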
# https://sbert.net/docs/quickstart.html#sparse-encoder
from sentence_transformers import SparseEncoder
# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# The sentences to encode
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions
# 3. Calculate the embedding similarities (using dot product by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 35.629, 9.154, 0.098],
# [ 9.154, 27.478, 0.019],
# [ 0.098, 0.019, 29.553]])
# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}") # Typically >99% zeros
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")
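Because each dimension indexes a vocabulary term, a sparse embedding can be read directly as a weighted bag of words. The vocabulary and weights below are made up for illustration; with a real model you would map indices through its tokenizer's vocabulary (sentence-transformers also ships decoding helpers for this):

```python
import numpy as np

vocab = ["[PAD]", "weather", "sunny", "lovely", "rain", "stadium"]  # toy vocab
embedding = np.array([0.0, 2.1, 0.0, 1.4, 0.6, 0.0])               # toy weights

# Keep only non-zero dimensions and sort by descending weight
nonzero = np.flatnonzero(embedding)
decoded = sorted(((vocab[i], float(embedding[i])) for i in nonzero),
                 key=lambda t: -t[1])
print(decoded)
# [('weather', 2.1), ('lovely', 1.4), ('rain', 0.6)]
```

This interpretability is a key advantage over dense embeddings, whose individual dimensions have no human-readable meaning.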