Definition
A vector database is a storage and retrieval system optimized for high-dimensional numerical vectors and nearest-neighbor search.
Its purpose is to store embeddings and return the most relevant records for a query vector using similarity search. It is a database for coordinates in high-dimensional space, where original text, metadata, and permissions are attached to those coordinates.
The vector index is the computational structure that makes similarity search feasible at production scale. Without indexing, nearest-neighbor search becomes computationally prohibitive.
Core Data Model
Each stored unit consists of:
```
{
  id,
  vector,
  metadata,
  payload
}
```
Components
- ID: A unique identifier for the chunk (e.g., `doc_143_chunk_08`). Used for updates, deletes, and traceability.
- Vector: A fixed-length numerical array generated by an embedding model (e.g., OpenAI `text-embedding-3-small` at 1536 dims).
- Metadata: Structured fields used for filtering (e.g., `tenantId`, `documentType`). Metadata is used for pre-filtering and access control, not embedding.
- Payload: The original retrievable content (raw chunk text or markdown). This is what the LLM actually processes.
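The four components above can be sketched as a TypeScript type. This is an illustrative shape, not any particular database's schema; the field names mirror the examples in this section.

```typescript
// Hedged sketch of a stored unit; real schemas vary by vector database.
interface VectorRecord {
  id: string;                       // e.g. "doc_143_chunk_08"
  vector: number[];                 // fixed length set by the embedding model
  metadata: Record<string, string>; // filterable fields, not embedded
  payload: string;                  // raw chunk text returned to the LLM
}

const record: VectorRecord = {
  id: "doc_143_chunk_08",
  vector: new Array(1536).fill(0), // placeholder; real values come from the embedder
  metadata: { tenantId: "customer_a", documentType: "contract" },
  payload: "Section 4.2: Termination requires 30 days written notice...",
};

console.log(record.id, record.vector.length); // doc_143_chunk_08 1536
```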
Exact Search vs. Approximate Search
Exact KNN (K-Nearest Neighbors)
The system compares the query vector against every stored vector (brute force).
- Complexity: O(N) distance computations, each costing O(d) for dimension d.
- Use Case: Accurate but expensive; practical only for small datasets or evaluation baselines.
ANN (Approximate Nearest Neighbor)
ANN sacrifices perfect recall for speed by searching likely candidate regions first.
- Performance: The standard production approach, often achieving roughly 95–99% recall at a small fraction of exact-search latency.
- Trade-off: Almost always acceptable for semantic retrieval.
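The exact-KNN baseline above is a straightforward linear scan. A minimal sketch (helper names are my own, not a library API):

```typescript
// Exact KNN: score every stored vector against the query, sort, take k.
// O(N) scans like this are fine for small corpora and evaluation baselines.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function exactKnn(
  query: number[],
  corpus: { id: string; vector: number[] }[],
  k: number
): { id: string; score: number }[] {
  return corpus
    .map((r) => ({ id: r.id, score: dot(query, r.vector) }))
    .sort((a, b) => b.score - a.score) // higher score = more similar
    .slice(0, k);
}

const corpus = [
  { id: "a", vector: [1, 0] },
  { id: "b", vector: [0, 1] },
  { id: "c", vector: [0.9, 0.1] },
];
console.log(exactKnn([1, 0], corpus, 2).map((r) => r.id)); // ["a", "c"]
```

ANN indexes exist precisely to avoid this full scan: they visit only promising candidate regions, which is where the recall/speed trade-off comes from.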
Similarity Metrics
Nearest-neighbor search depends on a distance function matching the embedding model's assumptions.
- Cosine Similarity: Measures angular similarity. Most common for text embeddings where direction matters more than magnitude.
- Dot Product: Measures directional similarity without normalization; for unit-normalized vectors it is equivalent to cosine similarity. Common in Vertex AI and OpenAI systems.
- Euclidean Distance: Measures straight-line geometric distance. More common in recommendation systems and vision embeddings.
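The three metrics are small enough to write out directly (plain implementations, not a library API). Note how cosine and dot product coincide on unit-length vectors, which is why normalized embeddings can use either interchangeably.

```typescript
// The three similarity/distance functions side by side.
const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);
const norm = (a: number[]) => Math.sqrt(dot(a, a));
const cosine = (a: number[], b: number[]) => dot(a, b) / (norm(a) * norm(b));
const euclidean = (a: number[], b: number[]) =>
  Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));

// Both vectors are unit length, so cosine and dot product agree.
const a = [0.6, 0.8]; // ||a|| = 1
const b = [1, 0];     // ||b|| = 1
console.log(cosine(a, b));    // 0.6
console.log(dot(a, b));       // 0.6
console.log(euclidean(a, b)); // ≈ 0.894
```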
Index Structures
- HNSW (Hierarchical Navigable Small World): The dominant production index. Builds a multi-layer graph of vector proximity. Low latency and high recall, but high memory usage.
- IVF (Inverted File Index): Partitions vectors into clusters. Search checks only the most relevant clusters. Efficient for very large datasets with lower memory usage.
- PQ (Product Quantization): Compresses vectors by approximating them with codebooks. Major memory reduction at the cost of retrieval precision.
- Disk-Based Indexes: Keeps portions of the index on disk (e.g., DiskANN). Used when RAM cost becomes the primary constraint at billion-scale.
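To make the IVF idea concrete, here is a deliberately tiny in-memory sketch: vectors are bucketed by their nearest centroid at insert time, and a query scans only the `nProbe` closest buckets instead of the whole corpus. This is illustrative only; real IVF implementations learn centroids via clustering and add quantization.

```typescript
// Minimal IVF-style index sketch (not production code).
type Item = { id: string; vector: number[] };

const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);

class IvfIndex {
  private buckets: Map<number, Item[]> = new Map();
  constructor(private centroids: number[][]) {
    centroids.forEach((_, i) => this.buckets.set(i, []));
  }
  // Rank centroids by dot-product similarity to a vector.
  private nearestCentroids(v: number[], n: number): number[] {
    return this.centroids
      .map((c, i) => ({ i, score: dot(v, c) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, n)
      .map((x) => x.i);
  }
  add(item: Item): void {
    const [best] = this.nearestCentroids(item.vector, 1);
    this.buckets.get(best)!.push(item);
  }
  // Only the nProbe most promising buckets are scanned: this is the
  // recall-for-speed trade described under ANN above.
  search(query: number[], k: number, nProbe = 1): Item[] {
    const candidates = this.nearestCentroids(query, nProbe)
      .flatMap((i) => this.buckets.get(i)!);
    return candidates
      .sort((a, b) => dot(query, b.vector) - dot(query, a.vector))
      .slice(0, k);
  }
}

const index = new IvfIndex([[1, 0], [0, 1]]); // two hand-picked centroids
index.add({ id: "x", vector: [0.9, 0.1] });
index.add({ id: "y", vector: [0.1, 0.9] });
console.log(index.search([1, 0], 1).map((i) => i.id)); // ["x"]
```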
Implementation Ownership
The division of labor between the system and the developer is critical for security and performance.
| Component | Owner | Responsibility |
|---|---|---|
| Vector Quantization | System | Compressing vectors for storage efficiency. |
| Index Construction | System | Maintaining HNSW or IVF graphs for sub-second retrieval. |
| Metadata Schema | Developer | Defining which fields are indexed for filtering (e.g., tenant isolation). |
| Similarity Metric | Developer | Selecting the metric (Cosine, Dot) that matches the embedding model. |
| Update Strategy | Developer | Managing document versions, re-embedding, and vector invalidation. |
Managed vs. Custom Indexing
- Managed (e.g., Pinecone, Vertex AI Vector Search): System handles scaling, high-availability, and indexing algorithms automatically.
- Self-Hosted (e.g., Qdrant, Milvus, pgvector): Developer handles infrastructure tuning, memory allocation, and index parameter optimization (e.g., `ef_construction` in HNSW).
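Self-hosting makes those index parameters explicit. As an illustrative sketch, a Qdrant collection creation request might carry an HNSW config like the following; the parameter values are assumptions to tune per workload, and note that Qdrant spells the construction parameter `ef_construct`:

```json
{
  "vectors": { "size": 1536, "distance": "Cosine" },
  "hnsw_config": {
    "m": 16,
    "ef_construct": 128
  }
}
```

Higher `m` and `ef_construct` raise recall and build cost; managed services make these choices for you.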
Filtering and Multi-Tenancy
Vector similarity alone is insufficient for enterprise systems. Metadata filtering is mandatory.
- Enforcement: Filtering must be applied before or during the vector search to prevent data leakage across tenants.
- Namespace Strategy: Large systems use namespaces (e.g., `prod`, `customer_a`) to isolate indexes for security and operational control.
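The pre-filtering rule can be sketched in a few lines. This in-memory version makes the ordering explicit: the tenant filter narrows the candidate set before any similarity scoring, so one tenant's vectors never enter another tenant's ranking. (Real databases apply the filter inside the index rather than as a separate pass.)

```typescript
// Pre-filtered search sketch: filter first, then score.
type Rec = { id: string; vector: number[]; metadata: { tenantId: string } };

const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);

function tenantSearch(
  query: number[],
  tenantId: string,
  corpus: Rec[],
  k: number
): string[] {
  return corpus
    .filter((r) => r.metadata.tenantId === tenantId) // enforced before scoring
    .sort((a, b) => dot(query, b.vector) - dot(query, a.vector))
    .slice(0, k)
    .map((r) => r.id);
}

const corpus: Rec[] = [
  { id: "a1", vector: [1, 0], metadata: { tenantId: "customer_a" } },
  { id: "b1", vector: [1, 0], metadata: { tenantId: "customer_b" } },
];
console.log(tenantSearch([1, 0], "customer_a", corpus, 5)); // ["a1"]
```

Post-filtering (score first, filter afterwards) is both a leak risk and a recall problem: the top-k may be exhausted by other tenants' records before the filter runs.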
Examples and Tools
Node.js / TypeScript
- Libraries: `langchain`, `llamaindex`, `zod`.
- Databases: `pgvector` (PostgreSQL), `qdrant-js`, `pinecone-sdk`.
Google Stack
- Vertex AI Vector Search: High-scale ANN infrastructure.
- AlloyDB / Cloud SQL: Integrated `pgvector` support for relational + vector joins.
Local Stack
- Qdrant: Efficient HNSW implementation for Docker/local.
- Milvus: Distributed vector database for massive workloads.
- PostgreSQL + pgvector: Best for moderate scale and operational simplicity.