Milvus – Large-Scale Vector Database¶

Installation¶

The pymilvus package is the official Python SDK for Milvus. It communicates with a Milvus server over gRPC. For local development, you can run Milvus Lite (in-process) or spin up a standalone instance via Docker. The SDK provides high-level collection management, data insertion, index building, and search APIs.

# !pip install pymilvus

from pymilvus import connections, utility, FieldSchema, CollectionSchema, DataType, Collection
import numpy as np

print('✅ Imports successful')

1. Connect to MilvusΒΆ

connections.connect() establishes a gRPC connection to a running Milvus instance. The alias parameter lets you manage multiple connections. The default address is localhost:19530 for a standalone Docker deployment. For Milvus Lite (no Docker), you can use MilvusClient("./milvus_local.db") which runs entirely in-process with local file storage.

connections.connect(
    alias="default",
    host='localhost',
    port='19530'
)

print("✅ Connected to Milvus")

2. Create CollectionΒΆ

Milvus requires an explicit schema with typed fields. Every collection must have a primary key field (INT64 or VARCHAR), at least one FLOAT_VECTOR field with a specified dimensionality, and optionally scalar fields for metadata. The auto_id=True option lets Milvus generate unique primary keys automatically. This schema-based approach ensures data integrity and enables efficient columnar storage under the hood.

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)
]

schema = CollectionSchema(fields=fields, description="Documents")
collection = Collection(name="documents", schema=schema)

print("✅ Collection created")

3. Insert DataΒΆ

Data is inserted as a list of column arrays (one array per field in schema order, omitting the auto-generated primary key). Milvus stores vectors in a columnar format optimized for batch operations. Newly inserted rows land in an in-memory “growing segment”; calling collection.flush() seals the segment and persists it, and counts such as collection.num_entities only reflect flushed data. Primary keys are returned so you can reference specific records for updates or deletes.

entities = [
    ["Machine learning", "Deep learning", "NLP"],
    [np.random.random(384).tolist() for _ in range(3)]
]

insert_result = collection.insert(entities)
print(f"✅ Inserted {len(insert_result.primary_keys)} entities")