Definition
A vector database is a storage and retrieval system optimized for high-dimensional numerical vectors and nearest-neighbor search.
Its purpose is to store embeddings and return the most relevant records for a query vector using similarity search. It is a database for coordinates in high-dimensional space, where original text, metadata, and permissions are attached to those coordinates.
The vector index is the computational structure that makes similarity search feasible at production scale. Without indexing, nearest-neighbor search becomes computationally prohibitive.
Core Data Model
Each stored unit consists of:
```
{
  id,
  vector,
  metadata,
  payload
}
```
Components
- ID: A unique identifier for the chunk (e.g., `doc_143_chunk_08`). Used for updates, deletes, and traceability.
- Vector: A fixed-length numerical array generated by an embedding model (e.g., OpenAI `text-embedding-3-small` at 1536 dims).
- Metadata: Structured fields used for filtering (e.g., `tenantId`, `documentType`). Metadata is used for pre-filtering and access control, not embedding.
- Payload: The original retrievable content (raw chunk text or markdown). This is what the LLM actually processes.
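The four components above can be sketched as a TypeScript type. This is an illustrative shape, not any particular database's schema; the field names mirror the examples in this section.

```typescript
// Hedged sketch of a stored unit; real schemas vary by vector database.
interface VectorRecord {
  id: string;                       // e.g. "doc_143_chunk_08"
  vector: number[];                 // fixed length set by the embedding model
  metadata: Record<string, string>; // filterable fields, not embedded
  payload: string;                  // raw chunk text returned to the LLM
}

const record: VectorRecord = {
  id: "doc_143_chunk_08",
  vector: new Array(1536).fill(0), // placeholder; real values come from the embedder
  metadata: { tenantId: "customer_a", documentType: "contract" },
  payload: "Section 4.2: Termination requires 30 days written notice...",
};

console.log(record.id, record.vector.length); // doc_143_chunk_08 1536
```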
Exact Search vs. Approximate Search
Exact KNN (K-Nearest Neighbors)
The system compares the query vector against every stored vector (brute force).
- Complexity: O(N) distance computations, each costing O(d) for dimension d.
- Use Case: Accurate but expensive; practical only for small datasets or evaluation baselines.
ANN (Approximate Nearest Neighbor)
ANN sacrifices perfect recall for speed by searching likely candidate regions first.
- Performance: The standard production approach, often achieving roughly 95–99% recall at a small fraction of exact-search latency.
- Trade-off: Almost always acceptable for semantic retrieval.
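The exact-KNN baseline above is a straightforward linear scan. A minimal sketch (helper names are my own, not a library API):

```typescript
// Exact KNN: score every stored vector against the query, sort, take k.
// O(N) scans like this are fine for small corpora and evaluation baselines.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function exactKnn(
  query: number[],
  corpus: { id: string; vector: number[] }[],
  k: number
): { id: string; score: number }[] {
  return corpus
    .map((r) => ({ id: r.id, score: dot(query, r.vector) }))
    .sort((a, b) => b.score - a.score) // higher score = more similar
    .slice(0, k);
}

const corpus = [
  { id: "a", vector: [1, 0] },
  { id: "b", vector: [0, 1] },
  { id: "c", vector: [0.9, 0.1] },
];
console.log(exactKnn([1, 0], corpus, 2).map((r) => r.id)); // ["a", "c"]
```

ANN indexes exist precisely to avoid this full scan: they visit only promising candidate regions, which is where the recall/speed trade-off comes from.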
Similarity Metrics
Nearest-neighbor search depends on a distance function matching the embedding model's assumptions.
- Cosine Similarity: Measures angular similarity. Most common for text embeddings where direction matters more than magnitude.
- Dot Product: Measures directional similarity without normalization; for unit-normalized vectors it is equivalent to cosine similarity. Common in Vertex AI and OpenAI systems.
- Euclidean Distance: Measures straight-line geometric distance. More common in recommendation systems and vision embeddings.
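The three metrics are small enough to write out directly (plain implementations, not a library API). Note how cosine and dot product coincide on unit-length vectors, which is why normalized embeddings can use either interchangeably.

```typescript
// The three similarity/distance functions side by side.
const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);
const norm = (a: number[]) => Math.sqrt(dot(a, a));
const cosine = (a: number[], b: number[]) => dot(a, b) / (norm(a) * norm(b));
const euclidean = (a: number[], b: number[]) =>
  Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));

// Both vectors are unit length, so cosine and dot product agree.
const a = [0.6, 0.8]; // ||a|| = 1
const b = [1, 0];     // ||b|| = 1
console.log(cosine(a, b));    // 0.6
console.log(dot(a, b));       // 0.6
console.log(euclidean(a, b)); // ≈ 0.894
```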
Index Structures
- HNSW (Hierarchical Navigable Small World): The dominant production index. Builds a multi-layer graph of vector proximity. Low latency and high recall, but high memory usage.
- IVF (Inverted File Index): Partitions vectors into clusters. Search checks only the most relevant clusters. Efficient for very large datasets with lower memory usage.
- PQ (Product Quantization): Compresses vectors by approximating them with codebooks. Major memory reduction at the cost of retrieval precision.
- Disk-Based Indexes: Keeps portions of the index on disk (e.g., DiskANN). Used when RAM cost becomes the primary constraint at billion-scale.
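To make the IVF idea concrete, here is a deliberately tiny in-memory sketch: vectors are bucketed by their nearest centroid at insert time, and a query scans only the `nProbe` closest buckets instead of the whole corpus. This is illustrative only; real IVF implementations learn centroids via clustering and add quantization.

```typescript
// Minimal IVF-style index sketch (not production code).
type Item = { id: string; vector: number[] };

const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);

class IvfIndex {
  private buckets: Map<number, Item[]> = new Map();
  constructor(private centroids: number[][]) {
    centroids.forEach((_, i) => this.buckets.set(i, []));
  }
  // Rank centroids by dot-product similarity to a vector.
  private nearestCentroids(v: number[], n: number): number[] {
    return this.centroids
      .map((c, i) => ({ i, score: dot(v, c) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, n)
      .map((x) => x.i);
  }
  add(item: Item): void {
    const [best] = this.nearestCentroids(item.vector, 1);
    this.buckets.get(best)!.push(item);
  }
  // Only the nProbe most promising buckets are scanned: this is the
  // recall-for-speed trade described under ANN above.
  search(query: number[], k: number, nProbe = 1): Item[] {
    const candidates = this.nearestCentroids(query, nProbe)
      .flatMap((i) => this.buckets.get(i)!);
    return candidates
      .sort((a, b) => dot(query, b.vector) - dot(query, a.vector))
      .slice(0, k);
  }
}

const index = new IvfIndex([[1, 0], [0, 1]]); // two hand-picked centroids
index.add({ id: "x", vector: [0.9, 0.1] });
index.add({ id: "y", vector: [0.1, 0.9] });
console.log(index.search([1, 0], 1).map((i) => i.id)); // ["x"]
```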
Implementation Ownership
The division of labor between the system and the developer is critical for security and performance.
| Component | Owner | Responsibility |
|---|---|---|
| Vector Quantization | System | Compressing vectors for storage efficiency. |
| Index Construction | System | Maintaining HNSW or IVF graphs for sub-second retrieval. |
| Metadata Schema | Developer | Defining which fields are indexed for filtering (e.g., tenant isolation). |
| Similarity Metric | Developer | Selecting the metric (Cosine, Dot) that matches the embedding model. |
| Update Strategy | Developer | Managing document versions, re-embedding, and vector invalidation. |
Managed vs. Custom Indexing
- Managed (e.g., Pinecone, Vertex AI Vector Search): System handles scaling, high-availability, and indexing algorithms automatically.
- Self-Hosted (e.g., Qdrant, Milvus, pgvector): Developer handles infrastructure tuning, memory allocation, and index parameter optimization (e.g., `ef_construction` in HNSW).
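Self-hosting makes those index parameters explicit. As an illustrative sketch, a Qdrant collection creation request might carry an HNSW config like the following; the parameter values are assumptions to tune per workload, and note that Qdrant spells the construction parameter `ef_construct`:

```json
{
  "vectors": { "size": 1536, "distance": "Cosine" },
  "hnsw_config": {
    "m": 16,
    "ef_construct": 128
  }
}
```

Higher `m` and `ef_construct` raise recall and build cost; managed services make these choices for you.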
Filtering and Multi-Tenancy
Vector similarity alone is insufficient for enterprise systems. Metadata filtering is mandatory.
- Enforcement: Filtering must be applied before or during the vector search to prevent data leakage across tenants.
- Namespace Strategy: Large systems use namespaces (e.g., `prod`, `customer_a`) to isolate indexes for security and operational control.
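The pre-filtering rule can be sketched in a few lines. This in-memory version makes the ordering explicit: the tenant filter narrows the candidate set before any similarity scoring, so one tenant's vectors never enter another tenant's ranking. (Real databases apply the filter inside the index rather than as a separate pass.)

```typescript
// Pre-filtered search sketch: filter first, then score.
type Rec = { id: string; vector: number[]; metadata: { tenantId: string } };

const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);

function tenantSearch(
  query: number[],
  tenantId: string,
  corpus: Rec[],
  k: number
): string[] {
  return corpus
    .filter((r) => r.metadata.tenantId === tenantId) // enforced before scoring
    .sort((a, b) => dot(query, b.vector) - dot(query, a.vector))
    .slice(0, k)
    .map((r) => r.id);
}

const corpus: Rec[] = [
  { id: "a1", vector: [1, 0], metadata: { tenantId: "customer_a" } },
  { id: "b1", vector: [1, 0], metadata: { tenantId: "customer_b" } },
];
console.log(tenantSearch([1, 0], "customer_a", corpus, 5)); // ["a1"]
```

Post-filtering (score first, filter afterwards) is both a leak risk and a recall problem: the top-k may be exhausted by other tenants' records before the filter runs.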
Examples and Tools
Node.js / TypeScript
- Libraries: `langchain`, `llamaindex`, `zod`.
- Databases: `pgvector` (PostgreSQL), `qdrant-js`, `pinecone-sdk`.
Google Stack
- Vertex AI Vector Search: High-scale ANN infrastructure.
- AlloyDB / Cloud SQL: Integrated `pgvector` support for relational + vector joins.
Local Stack
- Qdrant: Efficient HNSW implementation for Docker/local.
- Milvus: Distributed vector database for massive workloads.
- PostgreSQL + pgvector: Best for moderate scale and operational simplicity.