Building a Multi-Document RAG System

RAG connects the powerful reasoning of an LLM with the unique information in your own documents. Today, I’ll teach you how to build a Multi-Document RAG System using Python. By the end, you’ll have an app that reads a folder of documents and answers your questions accurately.

Multi-Document RAG System: Getting Started

We are going to build a Multi-Document RAG system from scratch using Python, LangChain, and Ollama. It sounds complex, but I promise you, it’s just a series of logical steps.

We’ll use LangChain for orchestration, Chroma for storage, and Ollama to run the Llama 3 model locally.

First, install these libraries. In your terminal, run:

pip install langchain langchain-community langchain-huggingface langchain-chroma langchain-ollama pypdf

You’ll also need Ollama running locally with the Llama 3 model. After installing Ollama, run:

ollama pull llama3

Step 1: Loading the Raw Knowledge

First, gather your source materials. We need to extract text from PDF files. PyPDFLoader is a good choice because it handles the tricky formatting of PDFs well:

import os
from langchain_community.document_loaders import PyPDFLoader

def load_documents(folder_path: str):
    if not os.path.exists(folder_path):
        raise FileNotFoundError(f"Folder '{folder_path}' does not exist")

    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(folder_path, filename)
            print(f"📄 Loading: {filename}")
            try:
                loader = PyPDFLoader(file_path)
                documents.extend(loader.load())
            except Exception as e:
                print(f"❌ Error loading {filename}: {e}")
    return documents

Data is rarely perfect. Make sure your loading logic skips non-PDFs and handles errors, so your pipeline keeps running even if one file is bad.

Step 2: Chunking

You can’t give a 100-page document to an LLM all at once because it exceeds the model’s context window. So, we need to break it into smaller parts.

We use RecursiveCharacterTextSplitter, which tries to split text by paragraphs first, then by sentences, so related text stays together:

from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_text(documents):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    chunks = splitter.split_documents(documents)
    print(f"✂️ Created {len(chunks)} chunks")
    return chunks

Pay attention to chunk_overlap=200. This setting is important because it creates a sliding window, making sure you don’t lose context if a sentence is split between chunks.
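To see what the overlap does, here’s a simplified, pure-Python sketch of the sliding-window idea. This is not how RecursiveCharacterTextSplitter actually works (it splits on paragraph and sentence separators first), just an illustration of why consecutive chunks share text:

```python
def naive_chunk(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Slide a fixed-size window over the text, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the window already covers the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(2500))
chunks = naive_chunk(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])             # → [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # → True: the last 200 chars repeat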

Step 3: Embeddings

Computers work with numbers, not words. So, we need to turn each text chunk into a list of numbers, called a vector or embedding.

If two chunks have similar meanings, like “dog” and “puppy,” their vectors end up close together in this numeric space:

from langchain_huggingface import HuggingFaceEmbeddings

embedding_function = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

We’ll use all-MiniLM-L6-v2, a lightweight, open-source model that runs quickly on your CPU.
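To build intuition for what “close” means here, this toy sketch measures closeness with cosine similarity, the metric vector stores typically use. The vectors below are invented for illustration; real MiniLM embeddings have 384 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only.
dog     = [0.9, 0.8, 0.1]
puppy   = [0.85, 0.75, 0.2]
invoice = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, puppy))    # close to 1.0
print(cosine_similarity(dog, invoice))  # much lower
```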

Step 4: The Vector Store

Now we need somewhere to store these vectors for fast searching. A regular SQL database isn’t built for similarity search over high-dimensional vectors, so we’ll use a vector database. We’ll use Chroma:

from langchain_chroma import Chroma

def create_vector_store(chunks):
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_function,
        persist_directory="./chroma_db",
        collection_name="rag_docs"
    )
    return vector_store

This function saves the database in a folder called ./chroma_db. That way, you don’t have to rebuild the database every time you restart the app; it stays saved.

Step 5: The Brain

This is the most important part. This function links the user, the database, and the LLM:

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def query_rag_system(query_text, vector_store):
    llm = ChatOllama(model="llama3") # Make sure you have Ollama installed and running!

    retriever = vector_store.as_retriever(search_kwargs={"k": 3})

    prompt = ChatPromptTemplate.from_template(
        """
        You are a helpful assistant.
        Answer ONLY using the context below.
        If the answer is not present, say "I don't know."

        Context:
        {context}

        Question:
        {question}
        """
    )

    chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )

    return chain.invoke(query_text)

First, it takes the user’s question and retrieves the top 3 most relevant chunks (k=3). Then, it places those chunks inside a strict prompt: “Answer ONLY using the context below.” Grounding the answer in retrieved context like this helps stop the AI from making things up.
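Under the hood, “finding the top 3 chunks” is just a nearest-neighbor search over the embeddings. Here’s a toy, pure-Python sketch of that idea with made-up 2-D vectors; Chroma does this far more efficiently, but the principle is the same:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, chunk_vecs, k=3):
    """Rank chunk indices by cosine similarity to the query and keep the best k."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Made-up 2-D vectors standing in for real chunk embeddings.
chunk_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [0.1, 0.9]]
print(top_k([1.0, 0.05], chunk_vectors, k=3))  # → [0, 1, 3]
```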

Step 6: Putting It All Together

Finally, the main loop checks if a database already exists. If it doesn’t, it processes the PDFs. Then, it starts a chat loop so you can ask questions:

def main():
    folder_path = "/Users/amankharwal/aiagent/data" # CHANGE THIS to your folder path

    if not os.path.exists("./chroma_db"):
        print("📦 No vector DB found. Creating one...")
        docs = load_documents(folder_path)
        chunks = split_text(docs)
        vector_store = create_vector_store(chunks)
        print("Vector database created")
    else:
        print("📦 Loading existing vector DB...")
        vector_store = Chroma(
            persist_directory="./chroma_db",
            embedding_function=embedding_function,
            collection_name="rag_docs"
        )

    while True:
        query = input("\n❓ Ask a question (or type 'exit'): ")
        if query.lower() == "exit":
            break

        print("🤔 Thinking...")
        answer = query_rag_system(query, vector_store)
        print("\n🧠 Answer:\n", answer)

if __name__ == "__main__":
    main()

Here’s the answer I got when I asked about my and my friends’ resumes:

[Screenshot: Multi-Document RAG System final output]

Closing Thoughts

Building systems like this shows me that AI isn’t meant to replace our curiosity; it helps fuel it. When it’s easier to find answers, we can ask better, deeper, and more creative questions.

Don’t be afraid to experiment with this code. Try changing the chunk size, swap llama3 for Mistral, or use a different embedding model. That’s the best way to learn.

If you found this article useful, you can follow me on Instagram for daily AI tips and practical resources. You might also like my latest book, Hands-On GenAI, LLMs & AI Agents. It’s a step-by-step guide to help you get ready for jobs in today’s AI field.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
