Build a Local RAG System with Open-Source LLMs

You can’t send sensitive company data, such as HR guidelines, financial records, or unreleased product specs, to a public API. Data privacy is the main reason many companies hesitate to adopt generative AI. If you want to build AI tools that organizations trust, you need to know how to build systems that work completely offline. In this article, I’ll show you how to build a local RAG system with open-source LLMs that runs entirely offline.

Local RAG System with Open-Source LLMs: Getting Started

Today, we’ll build a Local Retrieval-Augmented Generation (RAG) system. We’ll use open-source tools like Ollama and ChromaDB to safely get answers from your own documents, all on your local computer. You won’t need paid APIs or cloud accounts, and you won’t have to worry about data leaks.

Before running the code, you need to install the required libraries. First, install Python dependencies:

pip install langchain langchain-community langchain-core
pip install langchain-text-splitters
pip install langchain-huggingface
pip install langchain-chroma
pip install langchain-ollama
pip install chromadb
pip install pypdf

Next, install Ollama to run open-source LLMs locally. Once installed, pull the Llama 3 model:

ollama pull llama3

This downloads the model so it can run right on your computer. Now, let’s start building our local RAG system with open-source LLMs.

Step 1: Load the PDF Document

First, we need to load the data. We’ll use LangChain’s PyPDFLoader to read a PDF file from your computer:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("/Users/amankharwal/local_rag/Community-Guidelines.pdf")
documents = loader.load()

print(f"Loaded {len(documents)} pages")

In this example, we are loading a “Community-Guidelines.pdf” file.

Step 2: Split Documents into Chunks

A common mistake in early AI projects is putting a whole document into an LLM at once. This can overload the model’s context window and hurt accuracy. Instead, we break the text into smaller chunks:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

We use an overlap of 200 characters so that a sentence or idea that falls on a chunk boundary still appears intact in at least one chunk.
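
To see what chunk overlap actually does, here is a minimal sketch of a fixed-size splitter. It is a simplification: the real RecursiveCharacterTextSplitter also tries to break on paragraph, sentence, and word boundaries before falling back to raw character positions.

```python
def split_with_overlap(text, chunk_size, overlap):
    # Simplified fixed-size splitter: each new chunk starts
    # (chunk_size - overlap) characters after the previous one,
    # so the last `overlap` characters of a chunk are repeated
    # at the start of the next chunk.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("a" * 2500, chunk_size=1000, overlap=200)
print(len(chunks))      # 4 chunks for 2,500 characters
print(len(chunks[0]))   # each full chunk is 1,000 characters
```

Because of the 200-character overlap, the tail of one chunk reappears at the head of the next, which is what keeps boundary-spanning sentences searchable.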

Step 3: Create Embeddings

To search through our text chunks, we need to turn them into vector embeddings:

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en"
)

We used HuggingFaceEmbeddings with the BAAI/bge-small-en model. It’s lightweight, fast, and works completely offline, which is great for local setups.
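
Embeddings let us compare texts by comparing vectors, usually with cosine similarity. Here is a stdlib-only sketch with toy 3-dimensional vectors (the real bge-small-en model outputs 384-dimensional vectors, and the topic names are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 for vectors pointing the same way,
    # close to 0.0 for unrelated (near-orthogonal) vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for three pieces of text.
refund_policy = [0.9, 0.1, 0.0]
return_rules  = [0.8, 0.2, 0.1]
office_hours  = [0.0, 0.1, 0.9]

print(cosine_similarity(refund_policy, return_rules))   # high: related topics
print(cosine_similarity(refund_policy, office_hours))   # low: unrelated topics
```

Semantically similar texts end up with nearby vectors, which is exactly the property the retriever in the next steps relies on.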

Step 4: Create the Vector Database

Next, we’ll save these embeddings in Chroma, an open-source vector database:

from langchain_chroma import Chroma

vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./local_chroma_db"
)

print("Vector database created")

By setting a persist_directory, we save the database to your computer. This way, you won’t need to process the PDF again each time you run the script.

Step 5: Set Up the Retriever

The retriever acts as the search engine for your RAG system:

retriever = vector_store.as_retriever(
    search_kwargs={"k": 3}
)

When you ask a question, it searches the vector database and returns the top three most relevant chunks.
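
Conceptually, top-k retrieval is just "rank every chunk by similarity to the query and keep the best k." This stdlib sketch shows the idea with a toy in-memory store and hypothetical 2-dimensional embeddings (Chroma does the same ranking, but with indexed search over real embeddings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, store, k=3):
    # store maps chunk text -> embedding vector.
    # Sort chunks by similarity to the query, highest first, and keep k.
    ranked = sorted(store.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

store = {
    "Harassment is not allowed.":    [0.9, 0.1],
    "Be authentic in your content.": [0.5, 0.5],
    "Hate speech leads to removal.": [0.8, 0.2],
    "Upload videos in 1080p.":       [0.1, 0.9],
}

# A query vector pointing toward the "policy" direction in this toy space.
print(top_k([1.0, 0.0], store, k=3))
```

With k=3, the off-topic chunk is dropped, which is the same filtering our retriever performs before anything reaches the LLM.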

Step 6: Connect to the Local LLM

Here, we connect LangChain to the Ollama instance running Llama 3 on your computer:

from langchain_ollama import OllamaLLM

llm = OllamaLLM(
    model="llama3"
)

This is the part that reads the retrieved context and creates the final answer.

Step 7: Define the Prompt Template

A clear prompt keeps the model grounded in the retrieved context:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
"""
You are a helpful AI assistant.

Use only the following retrieved context to answer the question.

Context:
{context}

Question:
{question}

Answer concisely in no more than three sentences.
"""
)

We tell the LLM to use only the retrieved context to answer the question and to keep the answer short, no more than three sentences.
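
Under the hood, a prompt template is just string substitution. This sketch fills the same placeholders with a hypothetical retrieved chunk and question, to show roughly what Llama 3 actually receives:

```python
TEMPLATE = """You are a helpful AI assistant.

Use the following retrieved context to answer the question.

Context:
{context}

Question:
{question}

Answer concisely in no more than three sentences."""

# Hypothetical values standing in for the retrieved chunks and user input.
filled = TEMPLATE.format(
    context="Hate speech, harassment, and bullying are prohibited.",
    question="What behavior is prohibited?",
)
print(filled)
```

ChatPromptTemplate adds validation and chat-message handling on top, but the substitution step works the same way.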

Step 8: Build the RAG Pipeline

This is the most important part of the project:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

We use LangChain Expression Language (LCEL) to connect all the parts. The pipeline takes your question, finds the documents, puts them into one string, sends the prompt to the LLM, and then gives you a clean answer.
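
If the `|` syntax looks opaque, note that the chain is just function composition. Here is a plain-Python sketch of the same data flow, where retrieve_stub and llm_stub are placeholders for the real retriever and Llama 3 call:

```python
def retrieve_stub(question):
    # Stand-in for the vector-store retriever: returns canned chunks.
    return ["Chunk about harassment rules.", "Chunk about authenticity."]

def format_docs(docs):
    # Join the retrieved chunks into one context string, as in the real chain.
    return "\n\n".join(docs)

def build_prompt(context, question):
    # Mirrors the prompt template: fill {context} and {question}.
    return f"Context:\n{context}\n\nQuestion:\n{question}"

def llm_stub(prompt):
    # Stand-in for the local Llama 3 call.
    return f"[answer based on {prompt.count('Chunk')} chunks]"

def rag_pipeline(question):
    # question -> retrieve -> format -> prompt -> LLM -> answer string
    context = format_docs(retrieve_stub(question))
    return llm_stub(build_prompt(context, question))

print(rag_pipeline("What are the rules?"))  # [answer based on 2 chunks]
```

LCEL performs the same composition, while also handling streaming, batching, and passing the raw question through to the prompt via RunnablePassthrough.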

Finally, we’ll put our pipeline in a simple while loop to make an interactive terminal app:

print("\nLocal RAG System Ready!")
print("Type 'exit' to quit\n")

while True:
    question = input("Ask a question: ")

    if question.lower() == "exit":
        break

    response = rag_chain.invoke(question)

    print("\nAnswer:")
    print(response)
    print("\n")
Here is what a sample session looks like:

Local RAG System Ready!
Type 'exit' to quit

Ask a question: Explain the key community guidelines from the document

Answer:
According to the YouTube Community Guidelines, the key guidelines are:

* Respect others' freedom of speech and opinion.
* Don't engage in hate speech, harassment, or bullying.
* Be truthful and authentic in your content.

These guidelines aim to create a safe and respectful environment for users on the platform.

Closing Thoughts

That’s how you can build a Local RAG system with open-source LLMs. Building it from scratch helps you learn how AI systems really work, not just how to send a JSON payload to an external API.

When you learn how chunking affects retrieval, how vector distance shows semantic similarity, and how context windows work, you move from being just an AI user to becoming an AI engineer.

If you found this article helpful, you can follow me on Instagram for daily AI tips and practical resources. You may also be interested in my latest book, Hands-On GenAI, LLMs & AI Agents, a step-by-step guide to prepare you for careers in today’s AI industry.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
