Vector Databases Explained from Scratch

For Generative AI tools like GPT-4 and Claude, vector databases act as long-term memory. They help computers understand intent instead of just matching words. In this article, I'll explain vector databases from scratch and build a simple semantic search engine in Python, entirely for free.

What is a Vector Database?

Before we talk about databases, we have to talk about vectors. In Data Science and Machine Learning, a vector is simply a list of numbers. These numbers aren't random: they represent the meaning of a piece of text, an image, or an audio file as a point in a multi-dimensional space.

A vector database is designed to store these lists of numbers, called vectors, and to quickly find the ones that are most similar to each other.

This makes semantic search possible, so you can find results based on meaning instead of just matching exact words.
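To make "most similar" concrete before we touch any real model, here is a tiny sketch with hand-made 3-dimensional vectors (the numbers are invented purely for illustration):

import numpy as np

# Hand-made toy vectors: imagine an embedding model produced these
cat    = np.array([0.90, 0.10, 0.00])
kitten = np.array([0.85, 0.15, 0.05])
car    = np.array([0.00, 0.20, 0.95])

def cosine_similarity(a, b):
    # Close to 1.0 means the vectors point in the same direction (similar meaning)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, kitten))  # high: related concepts
print(cosine_similarity(cat, car))     # low: unrelated concepts

A vector database does essentially this comparison, but against millions of stored vectors and much faster than a plain loop.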

Vector Databases Explained

Now, let’s start coding. We’ll build a system that takes a user’s question and finds the most relevant document, even if they don’t use the same words.

To make sure everything stays free, here’s the tech stack we’ll use:

  1. Python: This is the main language used in AI.
  2. Sentence-Transformers (from Hugging Face): This tool turns text into vectors.
  3. ChromaDB: A free, open-source vector database built for AI that runs on your own computer.

Step 1: The Setup

First, install the required libraries. Open your terminal, Jupyter Notebook, or Google Colab:

pip install sentence-transformers chromadb

Step 2: Generating Embeddings

We need a way to turn our text into numbers. We’ll use a lightweight and powerful model called all-MiniLM-L6-v2 from Hugging Face:

from sentence_transformers import SentenceTransformer

# Load the model
# This downloads a small pre-trained model to your local machine.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Let's test it out
sentence = "The future of AI is bright."
vector = model.encode(sentence)

print(f"Dimension of the vector: {len(vector)}")
print(f"First 5 numbers: {vector[:5]}")
Output:

Dimension of the vector: 384
First 5 numbers: [-0.03986629 -0.03020746 0.03228509 -0.00241172 0.01722387]

Here, SentenceTransformer loads our embedding model. The model.encode function turns the text into a vector of 384 floating-point numbers (returned as a NumPy array). These numbers capture the main meaning of the sentence.
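To get an intuition for what these 384 numbers encode, here is a small, optional sketch that compares a few sentences with cosine similarity. It reuses the model loaded above, and the example sentences are my own, chosen only for illustration:

from sentence_transformers import util

# Two sentences with related meanings and one unrelated sentence
vec_a = model.encode("The future of AI is bright.")
vec_b = model.encode("Artificial intelligence has a promising future.")
vec_c = model.encode("I had pasta for dinner last night.")

# Cosine similarity: closer to 1 means closer in meaning
print(util.cos_sim(vec_a, vec_b).item())  # relatively high
print(util.cos_sim(vec_a, vec_c).item())  # noticeably lower

Sentences that mean roughly the same thing end up close together in this 384-dimensional space, even when they share few words.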

Step 3: Setting up the Vector Database

Next, we need somewhere to store these vectors and search them quickly:

import chromadb

# Initialize a local client
client = chromadb.Client()

# Create a collection (think of this like a table in SQL)
collection = client.create_collection(name="my_knowledge_base")

The chromadb.Client() command starts an in-memory database. This means the data disappears when you close the script, which is great for learning. The create_collection function makes a container to hold our data.
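The in-memory client is all we need here. If you later want the data to survive between runs, recent versions of ChromaDB also offer a persistent client that writes to a local folder. Here is a minimal sketch (the folder path is just an example):

import chromadb

# Store the database on disk instead of in memory
persistent_client = chromadb.PersistentClient(path="./chroma_data")
persistent_collection = persistent_client.get_or_create_collection(name="my_knowledge_base")

The rest of the tutorial works the same with either client.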

Step 4: Adding Data

Let’s add some data to our database. Imagine we’re building a documentation helper for a tech company:

documents = [
    "Machine learning is a subset of AI that focuses on data.",
    "A neural network mimics the human brain to learn patterns.",
    "Python is a popular programming language for data science.",
    "Docker helps developers containerize applications.",
    "Vegetables are good for your health and provide vitamins."
]

ids = ["doc1", "doc2", "doc3", "doc4", "doc5"]

# Embed the documents
embeddings = model.encode(documents)

# Add them to ChromaDB
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=ids
)

print("Documents added to the Vector Database!")

We made a list of raw text strings. The fifth document (doc5) isn't related to tech, so it acts as a control to test whether the search can ignore it. We then turned all the documents into vectors with model.encode.

Finally, we stored the text, the vectors, and an ID for each document in ChromaDB.
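As an optional sanity check, you can ask the collection how many items it now holds and look up a stored entry by its ID:

# Optional sanity check on the collection
print(collection.count())            # should print 5
print(collection.get(ids=["doc5"]))  # returns the stored text for doc5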

Step 5: The Semantic Search

Now comes the real test. We’ll query the database using a question that doesn’t include the words “machine learning” or “data”:

query = "How do computers learn from experience?"

# 1. Convert the query into a vector (embedding)
query_vector = model.encode(query)

# 2. Search the database for the closest vectors
results = collection.query(
    query_embeddings=[query_vector],
    n_results=2 # Return the top 2 matches
)

# 3. Print the results
for i in range(len(results['documents'][0])):
    print(f"Result {i+1}: {results['documents'][0][i]}")
    print(f"Distance: {results['distances'][0][i]}")
    print("---")

The database can’t understand plain English. It needs the question turned into a vector first.

When you use collection.query, Chroma compares the query vector to every document vector in the database using a distance metric, typically Euclidean (L2) distance or Cosine distance. A lower distance means the meanings are closer.

When you run this, you should see something like:

Result 1: A neural network mimics the human brain to learn patterns.
Distance: 1.1007020473480225
---
Result 2: Machine learning is a subset of AI that focuses on data.
Distance: 1.2410591840744019
---

The question "How do computers learn from experience?" doesn't use the words "machine," "neural," or "subset." Still, the system recognized that learning from experience is closest in meaning to the documents about neural networks and Machine Learning, and the unrelated vegetable document never made it into the top results.
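If you're curious where those distance numbers come from, here is a rough sketch that recomputes squared Euclidean distance (Chroma's default metric, to the best of my knowledge) directly with NumPy. It assumes model, documents, and query_vector from the earlier steps are still defined, and the exact values may differ slightly from Chroma's output:

import numpy as np

# Recompute the query-to-document distances by hand
doc_vectors = model.encode(documents)
for doc, vec in zip(documents, doc_vectors):
    squared_l2 = float(np.sum((np.asarray(query_vector) - vec) ** 2))
    print(f"{squared_l2:.4f}  {doc}")

The documents about learning should come out with the smallest distances, matching the ranking Chroma returned.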

Closing Thoughts

For decades, computing was rigid. You had to know the exact password, syntax, and keyword. It demanded that humans speak the language of machines.

Vector databases represent a shift where machines are learning to speak the language of humans. They allow us to search for vibes, concepts, and ideas.

If you found this article useful, you can follow me on Instagram for daily AI tips and practical resources. You might also like my latest book, Hands-On GenAI, LLMs & AI Agents. It’s a step-by-step guide to help you get ready for jobs in today’s AI field.
