Build Your First RAG System From Scratch

The single biggest problem with the AI everyone is so excited about is that it confidently lies to you. You ask a powerful LLM a simple question about your data: a PDF, a company doc, or even just this morning’s news, and it either invents a plausible-sounding, completely wrong answer or just gives up. We call this hallucination, and it’s what holds AI back from being truly useful. The fix is surprisingly simple: we don’t need a bigger model; we need to give our model an open-book exam. That is the core idea behind a technique called Retrieval-Augmented Generation (RAG). In this article, I’ll show you exactly how to build your first RAG system from scratch using Python.

Let’s Build Your First RAG System From Scratch

Forget expensive APIs and proprietary databases. We’re doing this with free, open-source tools that real engineers use to build powerful, scalable systems. Here’s what we’ll be using:

  1. transformers (Hugging Face): To get our powerful, free LLM.
  2. sentence-transformers: The easiest way to get a top-tier embedding model.
  3. faiss-cpu: Facebook AI’s blazing-fast, free vector search library. It’s our vector store.
  4. langchain: We’ll only use its text splitter, which is a smart shortcut that saves us hours of regex pain.

Open a Google Colab notebook and let’s get set up:

!pip install transformers sentence-transformers faiss-cpu langchain

Step 1: Our Data

First, we need some custom knowledge. Let’s create a simple text file named my_knowledge.txt. Put the following inside it and upload it to your Colab notebook:

Company Policy Manual:
- WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays, and Thursdays. Mondays and Fridays are optional remote days.
- PTO Policy: Full-time employees receive 20 days of Paid Time Off (PTO) per year. PTO accrues monthly.
- Tech Stack: The official backend language is Python, and the official frontend framework is React. For mobile development, we use React Native.

This is our book. The LLM has no idea this information exists.
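
If you’d rather not upload the file by hand, here is an optional snippet that writes the same content straight from the notebook (it produces the my_knowledge.txt file we load in the next step):

# Optional: create my_knowledge.txt from the notebook instead of uploading it
knowledge = """Company Policy Manual:
- WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays, and Thursdays. Mondays and Fridays are optional remote days.
- PTO Policy: Full-time employees receive 20 days of Paid Time Off (PTO) per year. PTO accrues monthly.
- Tech Stack: The official backend language is Python, and the official frontend framework is React. For mobile development, we use React Native.
"""

with open("my_knowledge.txt", "w") as f:
    f.write(knowledge)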

Step 2: Chunking

We can’t feed the whole book to the model at once. We need to split it into index cards (chunks). Don’t just split by \n (newlines). You’ll cut sentences in half. We’ll use a smart splitter:

import os
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load our document
with open("my_knowledge.txt") as f:
    knowledge_text = f.read()

# 1. Initialize the Text Splitter
# This splitter is smart. It tries to split on paragraphs ("\n\n"),
# then newlines ("\n"), then spaces (" "), to keep semantically
# related text together as much as possible.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,  # Max size of a chunk
    chunk_overlap=20, # Overlap to maintain context between chunks
    length_function=len
)

# 2. Create the chunks
chunks = text_splitter.split_text(knowledge_text)

print(f"We have {len(chunks)} chunks:")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
We have 4 chunks:
--- Chunk 1 ---
Company Policy Manual: - WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays,

--- Chunk 2 ---
Wednesdays, and Thursdays. Mondays and Fridays are optional remote days. - PTO Policy: Full-time employees receive 20 days of Paid Time Off (PTO) per

--- Chunk 3 ---
Time Off (PTO) per year. PTO accrues monthly. - Tech Stack: The official backend language is Python, and the official frontend framework is React.

--- Chunk 4 ---
framework is React. For mobile development, we use React Native.

You’ll see it intelligently broke our file into small, overlapping pieces.
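
If you’re curious whether the splitter respected our settings, a quick optional check prints each chunk’s length against the 150-character limit:

# Optional: confirm every chunk stays within our chunk_size of 150
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {len(chunk)} characters")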

Step 3: Embeddings

Now we turn those text chunks into numbers (vectors). We’ll use a popular, lightweight sentence-transformer model. It’s brilliant at understanding the meaning of a sentence:

from sentence_transformers import SentenceTransformer

# 1. Load the embedding model
# 'all-MiniLM-L6-v2' is a fantastic, fast, and small model.
# It runs 100% on your local machine.
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Embed all our chunks
# This will take a moment as it "reads" and "understands" each chunk.
chunk_embeddings = model.encode(chunks)

print(f"Shape of our embeddings: {chunk_embeddings.shape}")
Shape of our embeddings: (4, 384)
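
To get a feel for what these 384 numbers capture, here is an optional sketch of my own (using only NumPy) that compares a test sentence against each chunk with cosine similarity; the WFH chunk should score the highest:

import numpy as np

# Optional: see which chunk a test sentence is semantically closest to.
# Cosine similarity = dot product of the vectors divided by their norms.
test_sentence = "Which days can I work from home?"
test_embedding = model.encode(test_sentence)

for i, chunk_embedding in enumerate(chunk_embeddings):
    score = np.dot(test_embedding, chunk_embedding) / (
        np.linalg.norm(test_embedding) * np.linalg.norm(chunk_embedding)
    )
    print(f"Chunk {i+1}: similarity = {score:.3f}")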

Step 4: Vector Store with FAISS

We have our vectors. Now we need a database that stores them in a way we can search by similarity. This is where FAISS comes in. Don’t be intimidated; it’s just a few lines of code:

import faiss
import numpy as np

# Get the dimension of our vectors (e.g., 384)
d = chunk_embeddings.shape[1]

# 1. Create a FAISS index
# IndexFlatL2 is the simplest, most basic index. It calculates
# the exact distance (L2 distance) between our query and all vectors.
index = faiss.IndexFlatL2(d)

# 2. Add our chunk embeddings to the index
# We must convert to float32 for FAISS
index.add(np.array(chunk_embeddings).astype('float32'))

print(f"FAISS index created with {index.ntotal} vectors.")

FAISS index created with 4 vectors.

That’s it. You just created an in-memory vector database.
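
Before wiring up the LLM, you can sanity-check retrieval on its own. This is an optional sketch; the test query is just an example:

# Optional: test retrieval by itself before adding the generative model
test_query = "How many PTO days do employees get?"
test_vector = model.encode([test_query]).astype('float32')

# Ask the index for the 2 nearest chunks (smaller L2 distance = more similar)
distances, indices = index.search(test_vector, 2)
for rank, idx in enumerate(indices[0]):
    print(f"Match {rank+1} (distance {distances[0][rank]:.2f}): {chunks[idx]}")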

Step 5: Retrieve, Augment, Generate

This is the final part, where the user asks a question. Let’s trace the full pipeline:

from transformers import pipeline

# 1. Load a generative model for the "G" in RAG.
# We'll use flan-t5-small, a small, instruction-tuned model from Google.
# 'text2text-generation' is the pipeline task for seq2seq models like T5.
generator = pipeline('text2text-generation', model='google/flan-t5-small')

# --- This is our RAG pipeline function ---
def answer_question(query):
    # 1. RETRIEVE
    # Embed the user's query
    query_embedding = model.encode([query]).astype('float32')

    # Search the FAISS index for the top k (e.g., k=2) most similar chunks
    k = 2
    distances, indices = index.search(query_embedding, k)

    # Get the actual text chunks from our original 'chunks' list
    retrieved_chunks = [chunks[i] for i in indices[0]]
    context = "\n\n".join(retrieved_chunks)

    # 2. AUGMENT
    # This is the "magic prompt." We combine the retrieved context
    # with the user's query.
    prompt_template = f"""
    Answer the following question using *only* the provided context.
    If the answer is not in the context, say "I don't have that information."

    Context:
    {context}

    Question:
    {query}

    Answer:
    """

    # 3. GENERATE
    # Feed the augmented prompt to our generative model
    answer = generator(prompt_template, max_length=100)
    print(f"--- CONTEXT ---\n{context}\n")
    return answer[0]['generated_text']

Now, let’s ask our system some questions:

query_1 = "What is the WFH policy?"
print(f"Query: {query_1}")
print(f"Answer: {answer_question(query_1)}\n")
Query: What is the WFH policy?

--- CONTEXT ---
Company Policy Manual: - WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays,

Wednesdays, and Thursdays. Mondays and Fridays are optional remote days. - PTO Policy: Full-time employees receive 20 days of Paid Time Off (PTO) per

Answer: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays, Wednesdays, and Thursdays. Mondays and Fridays are optional remote days. - PTO Policy: Full-time employees receive 20 days of Paid Time Off (PTO)

It worked! It didn’t just guess; it found the exact text and synthesized an answer. (With a tiny model like flan-t5-small the wording is a little rough, but every word is grounded in our document.)

Now, let’s ask a question the context cannot answer:

query_2 = "What is the company's dental plan?"
print(f"Query: {query_2}")
print(f"Answer: {answer_question(query_2)}\n")
Query: What is the company's dental plan?

--- CONTEXT ---
Company Policy Manual: - WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays,

Wednesdays, and Thursdays. Mondays and Fridays are optional remote days. - PTO Policy: Full-time employees receive 20 days of Paid Time Off (PTO) per

Answer: I don't have that information.

This is critical. Because of our prompt (“only use the provided context”), the LLM didn’t hallucinate. It correctly stated it couldn’t find the answer.

Final Words

Take a step back. What you just built in a few dozen lines of Python is the foundation of the next generation of AI. You tackled three of the biggest problems with LLMs:

  1. Hallucinations: You grounded the model in reality.
  2. Stale Knowledge: You can update the knowledge! Just re-run the indexing (Steps 1-4) on new documents, as sketched after this list.
  3. Data Privacy: No data left your environment. The embedding model and the LLM both ran locally in the notebook, with no calls to a third-party API.
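
For point 2, refreshing the knowledge base is just a small function around Steps 2-4. Here is a minimal sketch that reuses the splitter, embedding model, and FAISS setup from above; the helper name rebuild_index is my own:

# A minimal re-indexing sketch (hypothetical helper, reusing the objects above)
def rebuild_index(file_path):
    with open(file_path) as f:
        text = f.read()

    new_chunks = text_splitter.split_text(text)               # Step 2: chunk
    new_embeddings = model.encode(new_chunks)                  # Step 3: embed

    new_index = faiss.IndexFlatL2(new_embeddings.shape[1])     # Step 4: index
    new_index.add(np.array(new_embeddings).astype('float32'))
    return new_chunks, new_index

# Refresh the knowledge base after editing my_knowledge.txt
chunks, index = rebuild_index("my_knowledge.txt")
print(f"Rebuilt index with {index.ntotal} vectors.")

If you also want the index to survive a notebook restart, FAISS can persist it to disk with faiss.write_index(index, "policy.index") and load it back later with faiss.read_index("policy.index").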

This blueprint is how you chat with your codebase, summarize your legal documents, or ask questions about your 1,000 unread emails.

I hope you liked this article on how to build your first RAG system from scratch using Python. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.
