Chroma: Local Vector Database
A Developer-Friendly Embedded Vector Store
Chroma is an open-source embedding database designed for rapid prototyping and local development. It runs in-process (no separate server needed), stores data to disk with SQLite, and provides built-in document embedding via Sentence Transformers so you can pass raw text strings directly without managing an external encoder. Chroma supports cosine, L2, and inner-product distance metrics and uses an HNSW index for fast approximate nearest-neighbor search. It is the ideal starting point for learning vector database concepts before moving to distributed, production-grade systems like Qdrant or Milvus.
Installation
Chroma is installed as a single Python package (chromadb) with no external dependencies on running services. The sentence-transformers library is optional but recommended: Chroma can use it as a default embedding function so you can pass plain text strings to add() and query() without pre-computing embeddings yourself.
# !pip install chromadb sentence-transformers
import chromadb
from chromadb.config import Settings

print('✅ Imports successful')
1. Initialize Chroma
The chromadb.PersistentClient constructor creates an in-process database instance that writes data to the given path, so collections survive between sessions. (The older chromadb.Client(Settings(persist_directory=...)) pattern was deprecated in Chroma 0.4 in favor of PersistentClient.) For quick experiments, chromadb.EphemeralClient() runs purely in memory; this is useful for throwaway tests, but data is lost when the process ends. Passing Settings(anonymized_telemetry=False) disables optional usage analytics.
client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False)
)
print(f"Collections: {len(client.list_collections())}")
2. Create Collection
A collection in Chroma is analogous to a table in a relational database: it groups related vectors under a single namespace. The metadata={"hnsw:space": "cosine"} parameter tells the underlying HNSW index to use cosine distance for similarity comparisons. Other options include "l2" (Euclidean) and "ip" (inner product). Using get_or_create_collection is idempotent, so re-running this cell will not overwrite existing data.
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)
print('✅ Collection created')
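To make the metric choice concrete: cosine distance is 1 minus the cosine similarity of two vectors, so it ranges from 0 (same direction) to 2 (opposite direction). A pure-Python sketch of the formula the index applies:

```python
import math

def cosine_distance(a, b):
    """1 - cos(a, b): 0 for identical direction, 2 for opposite direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # identical -> 0.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite  -> 2.0
```

Note that only direction matters: scaling a vector does not change its cosine distance to another vector, which is why this metric is a common default for text embeddings.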
3. Add Documents
When you call collection.add() with a list of documents, Chroma automatically embeds each text string using its default embedding function (Sentence Transformers). You must provide unique ids for each document so you can update or delete them later. Optional metadatas attach structured key-value pairs to each document, enabling metadata-based filtering at query time (e.g., "only search documents where source equals 'doc1'").
documents = [
    "Machine learning is AI",
    "Deep learning uses neural networks",
    "NLP processes text"
]
ids = [f"doc{i}" for i in range(len(documents))]
metadatas = [{"source": f"doc{i}"} for i in range(len(documents))]

collection.add(
    documents=documents,
    ids=ids,
    metadatas=metadatas
)
print(f"✅ Added {len(documents)} documents")
4. Query
Querying in Chroma mirrors the add workflow: you pass a query_texts list and Chroma embeds the query, runs approximate nearest-neighbor search against the collection's HNSW index, and returns the top n_results matches along with their distances and metadata. The returned distances are in the space specified at collection creation (here cosine distance, where 0 means identical and 2 means maximally dissimilar). You can also supply where filters to restrict results by metadata fields, combining semantic similarity with structured constraints.
results = collection.query(
    query_texts=["What is deep learning?"],
    n_results=3
)
for i, (doc, distance) in enumerate(zip(results["documents"][0], results["distances"][0])):
    print(f"{i+1}. Distance: {distance:.4f}\n   {doc}\n")