You can’t send sensitive company data to a public API. This includes things like HR guidelines, financial records, or unreleased product specs. Data privacy is the main reason many companies hesitate to use generative AI. If you want to create AI tools that organizations trust, you need to know how to build systems that work completely offline. In this article, I’ll show you how to build a local RAG system with open-source LLMs that runs offline.
Local RAG System with Open-Source LLMs: Getting Started
Today, we’ll build a local Retrieval-Augmented Generation (RAG) system. We’ll use open-source tools like Ollama and ChromaDB to safely get answers from your own documents, entirely on your local computer. You won’t need paid APIs or cloud sign-ups, and you won’t have to worry about data leaks.
Before running the code, you need to install the required libraries. First, install Python dependencies:
pip install langchain langchain-community langchain-core
pip install langchain-text-splitters
pip install langchain-huggingface
pip install langchain-chroma
pip install langchain-ollama
pip install chromadb
pip install pypdf
Next, install Ollama to run open-source LLMs locally. Once installed, pull the Llama 3 model:
ollama pull llama3
This downloads the model so it can run right on your computer. Now, let’s start building our local RAG system with open-source LLMs.
Step 1: Load the PDF Document
First, we need to load the data. We’ll use LangChain’s PyPDFLoader to read a PDF file from your computer:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = PyPDFLoader("/Users/amankharwal/local_rag/Community-Guidelines.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
In this example, we are loading a “Community-Guidelines.pdf” file.
Step 2: Split Documents into Chunks
A common mistake in early AI projects is putting a whole document into an LLM at once. This can overload the model’s context window and hurt accuracy. Instead, we break the text into smaller chunks:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
We use an overlap of 200 characters to make sure sentences or ideas aren’t cut off between chunks.
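To see what the overlap actually does, here’s a minimal pure-Python sketch of character-window splitting. The `split_with_overlap` helper is hypothetical and much cruder than LangChain’s splitter, which prefers to break on paragraph and sentence boundaries, but it shows how consecutive chunks share text:

```python
def split_with_overlap(text, chunk_size=10, overlap=4):
    """Naive character-window splitter: each new chunk starts
    (chunk_size - overlap) characters after the previous one,
    so the last `overlap` characters are repeated."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

parts = split_with_overlap("abcdefghijklmnop", chunk_size=10, overlap=4)
print(parts)  # ['abcdefghij', 'ghijklmnop']
```

Note how `ghij` appears at the end of the first chunk and the start of the second: that shared window is what keeps a sentence from being split in a way that loses its meaning.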
Step 3: Create Embeddings
To search through our text chunks, we need to turn them into vector embeddings:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en"
)
We used HuggingFaceEmbeddings with the BAAI/bge-small-en model. It’s lightweight, fast, and works completely offline, which is great for local setups.
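Under the hood, similarity search compares these vectors, typically with cosine similarity. Here is a minimal sketch with toy 3-dimensional vectors (real embeddings from models like bge-small-en have hundreds of dimensions, but the math is the same):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means
    they point the same way (semantically similar), 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 2.0, 0.0]
chunk_vec = [2.0, 4.0, 0.0]   # same direction, different magnitude
other_vec = [0.0, 0.0, 3.0]   # orthogonal: no shared meaning

print(cosine_similarity(query_vec, chunk_vec))  # ~1.0
print(cosine_similarity(query_vec, other_vec))  # 0.0
```

This is what “vector distance shows semantic similarity” means in practice: chunks whose embeddings point in nearly the same direction as the question’s embedding are the ones the retriever will return.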
Step 4: Create the Vector Database
Next, we’ll save these embeddings in Chroma, an open-source vector database:
from langchain_chroma import Chroma
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./local_chroma_db"
)
print("Vector database created")
By setting a persist_directory, we save the database to your computer. This way, you won’t need to process the PDF again each time you run the script.
Step 5: Set Up the Retriever
The retriever acts as the search engine for your RAG system:
retriever = vector_store.as_retriever(
    search_kwargs={"k": 3}
)
When you ask a question, it searches the vector database and returns the top three most relevant chunks.
Step 6: Connect to the Local LLM
Here, we connect LangChain to the Ollama instance running Llama 3 on your computer:
from langchain_ollama import OllamaLLM
llm = OllamaLLM(
    model="llama3"
)
This is the part that reads the retrieved context and creates the final answer.
Step 7: Define the Prompt Template
A clear prompt is important, because it keeps the model grounded in the retrieved context:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(
    """
    You are a helpful AI assistant.
    Use the following retrieved context to answer the question.
    Context:
    {context}
    Question:
    {question}
    Answer concisely in no more than three sentences.
    """
)
We tell the LLM to use only the retrieved context to answer the question and to keep the answer short, no more than three sentences.
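Template filling itself is just string substitution. Here’s a quick sketch of what happens when the chain supplies `{context}` and `{question}`, using plain `str.format` (ChatPromptTemplate does the same substitution, plus chat-message handling on top):

```python
template = """You are a helpful AI assistant.
Use the following retrieved context to answer the question.
Context:
{context}
Question:
{question}
Answer concisely in no more than three sentences."""

# The context string below is a made-up stand-in for the retrieved chunks.
filled = template.format(
    context="Members must treat each other with respect.",
    question="What is the first community rule?",
)
print(filled)
```

The fully substituted string is what the local LLM actually receives, so everything the model is allowed to use must appear in it.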
Step 8: Build the RAG Pipeline
This is the most important part of the project:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
We use LangChain Expression Language (LCEL) to connect all the parts. The pipeline takes your question, retrieves the relevant chunks, joins them into one string, fills the prompt, sends it to the LLM, and parses out a clean answer.
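LCEL’s `|` operator is essentially function composition. To make that concrete, here is a pure-Python sketch of the same data flow, with a hypothetical stand-in retriever and a fake LLM so it runs without a model:

```python
def fake_retriever(question):
    # Stand-in for the vector search: always returns two chunks.
    return ["Be respectful to other members.", "No harassment is allowed."]

def format_docs(docs):
    # Same role as format_docs in the real chain: join chunks into one string.
    return "\n\n".join(docs)

def build_prompt(context, question):
    # Same role as the prompt template: fill in context and question.
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"

def fake_llm(prompt_text):
    # Stand-in for Llama 3: just echoes the first context line as its "answer".
    return prompt_text.split("\n")[1]

def rag_pipeline(question):
    # The whole chain is nothing more than these calls composed in order.
    context = format_docs(fake_retriever(question))
    return fake_llm(build_prompt(context, question))

print(rag_pipeline("What is the first rule?"))
# Be respectful to other members.
```

Each `|` in the LCEL chain corresponds to one of these nested calls; LCEL simply gives you a declarative way to write the composition.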
Finally, we’ll put our pipeline in a simple while loop to make an interactive terminal app:
print("\nLocal RAG System Ready!")
print("Type 'exit' to quit\n")
while True:
    question = input("Ask a question: ")
    if question.lower() == "exit":
        break
    response = rag_chain.invoke(question)
    print("\nAnswer:")
    print(response)
    print("\n")
Running the script produces an interactive session like this:
Local RAG System Ready!
Type 'exit' to quit
Ask a question: Explain the key community guidelines from the document
Answer:
According to the YouTube Community Guidelines, the key guidelines are:
* Respect others' freedom of speech and opinion.
* Don't engage in hate speech, harassment, or bullying.
* Be truthful and authentic in your content.
These guidelines aim to create a safe and respectful environment for users on the platform.
Closing Thoughts
That’s how you can build a Local RAG system with open-source LLMs. Building it from scratch helps you learn how AI systems really work, not just how to send a JSON payload to an external API.
When you learn how chunking affects retrieval, how vector distance shows semantic similarity, and how context windows work, you move from being just an AI user to becoming an AI engineer.
If you found this article helpful, you can follow me on Instagram for daily AI tips and practical resources. You may also be interested in my latest book, Hands-On GenAI, LLMs & AI Agents, a step-by-step guide to prepare you for careers in today’s AI industry.