Paraphrase Mining With Sentence Transformers

What: Use sentence_transformers.util.paraphrase_mining() to automatically discover pairs of semantically similar sentences within a corpus.

Why: Paraphrase mining addresses a common problem: given a large collection of sentences, find all pairs that express the same meaning in different words. The paraphrase_mining() function encodes all sentences, computes pairwise cosine similarities, and returns the top-scoring pairs sorted by similarity score. For \(n\) sentences, a brute-force comparison covers \(n(n-1)/2\) pairs, which is \(O(n^2)\); the library processes the comparisons in chunks, which keeps corpora of tens of thousands of sentences tractable.

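How: Each sentence is encoded into a dense vector using all-MiniLM-L6-v2, then pairwise cosine similarities are computed. The highest-scoring pairs are returned as [score, index_i, index_j] triplets, sorted by descending similarity. The function does not apply a fixed similarity threshold: the number of results is bounded by its top_k and max_pairs parameters, so filter on the score afterwards if you need a cutoff.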
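
To make the pairwise step concrete, here is a minimal pure-Python sketch of brute-force mining over toy vectors. The helper names cosine and brute_force_pairs and the three-dimensional vectors are hypothetical stand-ins for real model embeddings; this is not the library's implementation, only an illustration of scoring every pair and sorting by similarity.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def brute_force_pairs(vectors):
    """Score every unordered pair (i < j) and sort by descending similarity."""
    pairs = [
        (cosine(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


# Toy "embeddings" (assumed values, not produced by any model)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

pairs = brute_force_pairs(toy_vectors)
for score, i, j in pairs:
    print(f"{i} {j} score={score:.4f}")
```

With four vectors the sketch scores 4 * 3 / 2 = 6 pairs, and the two nearly parallel vectors (indices 0 and 1) come out on top. The quadratic growth of that pair count is why chunked processing matters at scale.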

Connection: Paraphrase mining is used in production for duplicate question detection (Stack Overflow, customer support), plagiarism checking, data deduplication before model training, and clustering semantically equivalent user intents in chatbot systems.
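
For the deduplication use case, one possible follow-up step is to merge mined pairs into groups of near-duplicates. The sketch below is an assumption-laden illustration: group_duplicates is a hypothetical helper, the 0.8 threshold is an arbitrary value you would tune on your own data, and the input triplets are made-up scores in the (score, i, j) shape rather than real model output.

```python
def group_duplicates(pairs, n_items, threshold=0.8):
    """Union-find over pairs scoring at or above threshold.

    Returns only groups with more than one member.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for score, i, j in pairs:
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Attach the larger root index under the smaller one
                parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for idx in range(n_items):
        groups.setdefault(find(idx), []).append(idx)
    return [members for members in groups.values() if len(members) > 1]


# Made-up (score, i, j) triplets for illustration only
mined = [(0.95, 0, 1), (0.91, 1, 2), (0.40, 3, 4), (0.85, 5, 6)]
groups = group_duplicates(mined, n_items=7)
print(groups)  # [[0, 1, 2], [5, 6]]
```

Note that merging is transitive: items 0 and 2 land in the same group because each is similar to item 1, even though they were never scored against each other here. That can over-merge on real data, so a stricter threshold or a manual review of large groups may be appropriate.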

# https://sbert.net/examples/sentence_transformer/applications/paraphrase-mining/README.html

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

model = SentenceTransformer("all-MiniLM-L6-v2")

# A single list of sentences; in practice this could hold tens of thousands
sentences = [
    "The cat sits outside",
    "A man is playing guitar",
    "I love pasta",
    "The new movie is awesome",
    "The cat plays in the garden",
    "A woman watches TV",
    "The new movie is so great",
    "Do you like pizza?",
]

# Returns a list of [score, i, j] triplets, sorted by descending score
paraphrases = paraphrase_mining(model, sentences)

# Print the ten highest-scoring pairs
for score, i, j in paraphrases[:10]:
    print(f"{sentences[i]} \t\t {sentences[j]} \t\t Score: {score:.4f}")