Semantic Search With Sentence Transformers

What: Build a semantic search system that finds the most relevant documents in a corpus for a given natural-language query, using all-MiniLM-L6-v2 embeddings and cosine similarity ranking.

Why: Semantic search goes beyond keyword matching by understanding the meaning of queries and documents. A query like "How do artificial neural networks work?" will match documents about "neural networks" and "deep learning" even without exact word overlap. The approach uses asymmetric search: queries and documents may differ in length and style, but the shared embedding space places semantically related content nearby.
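The ranking metric that makes "nearby" precise is cosine similarity: the dot product of two vectors divided by the product of their norms. A minimal dependency-free sketch with toy 3-dimensional vectors (real MiniLM embeddings are 384-dimensional; the numbers here are illustrative only):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 for identical directions, 0.0 for orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": doc1 points in nearly the same direction as the query,
# doc2 does not - mimicking related vs. unrelated documents.
query = [1.0, 2.0, 0.5]
doc1 = [0.9, 2.1, 0.4]
doc2 = [-1.0, 0.1, 2.0]

print(cosine_similarity(query, doc1))  # close to 1.0
print(cosine_similarity(query, doc2))  # much lower
```

Because cosine similarity ignores vector length, a short query and a long document can still score highly when they point in the same semantic direction.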

How: The corpus is encoded once using encode_document() and stored as tensors (optionally on GPU). At query time, encode_query() produces a query embedding, and model.similarity() computes cosine similarity against all corpus embeddings. torch.topk() efficiently extracts the top-\(k\) results without sorting the entire list.
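The take-the-k-largest step can be illustrated without torch using the standard library's `heapq`; the scores below are made-up numbers standing in for `similarity_scores`:

```python
import heapq

scores = [0.12, 0.87, 0.33, 0.91, 0.05]

# Pair each score with its corpus index, then keep only the k largest pairs.
# Like torch.topk, this avoids fully sorting the whole list.
top = heapq.nlargest(2, enumerate(scores), key=lambda pair: pair[1])
print(top)  # [(3, 0.91), (1, 0.87)]
```

torch.topk does the same thing on tensors (and on GPU), returning the top scores and their indices in one call.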

Connection: This is the core retrieval mechanism in RAG (Retrieval Augmented Generation) pipelines. In production, the corpus embeddings would be stored in a vector database (FAISS, Pinecone, Qdrant) for sub-millisecond retrieval over millions of documents.

# https://sbert.net/examples/sentence_transformer/applications/semantic-search/README.html

"""
This is a simple application for sentence embeddings: semantic search

We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.

For each query, this script prints the top 5 most similar sentences in the corpus.
"""

import torch

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Corpus with example documents
corpus = [
    "Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.",
    "Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning.",
    "Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains.",
    "Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments.",
    "The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy.",
    "SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond.",
    "Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities.",
    "Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time.",
    "Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground.",
]
# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
corpus_embeddings = embedder.encode_document(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
    "How do artificial neural networks work?",
    "What technology is used for modern space exploration?",
    "How can we address climate change challenges?",
]

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode_query(query, convert_to_tensor=True)

    # We use cosine similarity and torch.topk to find the top_k highest scores
    similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]
    scores, indices = torch.topk(similarity_scores, k=top_k)

    print("\nQuery:", query)
    print("Top 5 most similar sentences in corpus:")

    for score, idx in zip(scores, indices):
        print(f"(Score: {score:.4f})", corpus[idx])

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarty + topk
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """