Building a RAG Pipeline for LLMs

Large Language Models (LLMs) are powerful, but they have a major limitation: their knowledge is static and limited to the data they were trained on. This is where Retrieval-Augmented Generation (RAG) comes in. So, if you want to learn how to enhance LLMs by retrieving relevant external knowledge before generating responses, this article is for you. In this article, I’ll take you through building a RAG Pipeline for LLMs using Hugging Face Transformers and Python.

So, What is a RAG Pipeline?

A Retrieval-Augmented Generation (RAG) pipeline consists of two key components:

  1. Retriever: Searches a knowledge base for relevant documents based on the user’s query.
  2. Generator: Uses retrieved documents as context to generate accurate and relevant responses.
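
Conceptually, the two components fit together like this (a minimal sketch; the concrete retriever and generator are built step by step in the rest of this article):

def rag_answer(query, retriever, qa_generator):
    # 1. Retriever: find the documents most relevant to the query
    documents = retriever(query)
    # 2. Generator: answer the query using the retrieved documents as context
    context = " ".join(documents)
    return qa_generator(question=query, context=context)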

RAG improves LLMs by grounding responses in retrieved, real-world context, which reduces hallucinations and makes answers more factually accurate. It also keeps responses up to date by retrieving the latest knowledge, so the model does not need frequent retraining. Incorporating external data sources in this way makes LLMs more reliable and context-aware.

Building a RAG Pipeline for LLMs: Getting Started

In our implementation, we will:

  1. Use Wikipedia as our external knowledge source.
  2. Employ Sentence Transformers for embedding text and FAISS for efficient similarity search.
  3. Utilize Hugging Face’s question-answering pipeline to extract answers from retrieved documents.

Let’s import the necessary Python libraries to get started:

import wikipedia
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
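
If any of these libraries are missing from your environment, install them first; the package names below are the commonly used ones for these imports (note that the FAISS library is typically published on PyPI as faiss-cpu):

# install the required packages (assumed package names)
# pip install wikipedia transformers sentence-transformers faiss-cpu numpy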

Step 1: Retrieving Knowledge

To simulate an external knowledge base, we’ll fetch relevant Wikipedia articles based on a given topic:

def get_wikipedia_content(topic):
    try:
        page = wikipedia.page(topic)
        return page.content
    except wikipedia.exceptions.PageError:
        return None
    except wikipedia.exceptions.DisambiguationError as e:
        # handle cases where the topic is ambiguous
        print(f"Ambiguous topic. Please be more specific. Options: {e.options}")
        return None

# user input
topic = input("Enter a topic to learn about: ")
document = get_wikipedia_content(topic)

if not document:
    print("Could not retrieve information.")
    exit()
Enter a topic to learn about: Apple Computers

Here, we are retrieving Wikipedia content for a user-provided topic using the wikipedia package. If the topic matches a page, the function returns its content; if the topic is ambiguous, it prints the possible options, and if no matching page is found, it returns None, in which case the script prints a message and exits.
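
If you want the retrieval step to be more forgiving about how the topic is typed, the same wikipedia package also provides a search helper that resolves a rough topic name to the closest matching page title first. Here is a small optional sketch using it:

# optional: resolve a fuzzy topic name to the closest matching Wikipedia title
search_results = wikipedia.search(topic)
if search_results:
    document = get_wikipedia_content(search_results[0])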

Since Wikipedia articles can be long, we will split the text into smaller overlapping chunks for better retrieval:

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

def split_text(text, chunk_size=256, chunk_overlap=20):
    tokens = tokenizer.tokenize(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokenizer.convert_tokens_to_string(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - chunk_overlap
    return chunks

chunks = split_text(document)
print(f"Number of chunks: {len(chunks)}")

Here, we are tokenizing the retrieved Wikipedia content and splitting it into smaller overlapping chunks for efficient retrieval. We used a pre-trained tokenizer (all-mpnet-base-v2) to break the text into tokens, then divided it into fixed-size segments (256 tokens each) with an overlap of 20 tokens to maintain context between chunks.
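
As a quick, optional sanity check, you can print the tail of one chunk next to the head of the following chunk to confirm that the 20-token overlap is preserving context across chunk boundaries:

# optional: verify that consecutive chunks share overlapping text
if len(chunks) > 1:
    print("End of chunk 0:   ...", chunks[0][-100:])
    print("Start of chunk 1:", chunks[1][:100], "...")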

Step 2: Storing and Retrieving Knowledge

To efficiently search for relevant chunks, we will use Sentence Transformers to convert text into embeddings and store them in a FAISS index:

embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = embedding_model.encode(chunks)

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

Here, we converted the text chunks into numerical embeddings using the Sentence Transformer model (all-mpnet-base-v2), which captures their semantic meaning. We then created a FAISS index with an L2 (Euclidean) distance metric and stored the embeddings in it. This will allow us to efficiently retrieve the most relevant chunks based on a user’s query.
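
If you plan to query the same topic repeatedly, the index does not have to be rebuilt on every run: FAISS can write it to disk and load it back later. The file name below is just an example, and you would also need to store the chunks list alongside it, since FAISS only holds the vectors:

# optional: persist the FAISS index so it can be reused across runs
faiss.write_index(index, "wiki_chunks.index")        # example file name
# later: index = faiss.read_index("wiki_chunks.index")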

Step 3: Querying the RAG Pipeline

Now, we will take user input for the RAG pipeline. When a user asks a question, we will:

  1. Convert the query into an embedding.
  2. Retrieve the top-k most relevant chunks using FAISS.
  3. Use an LLM-powered question-answering model to generate the answer.

query = input("Ask a question about the topic: ")
query_embedding = embedding_model.encode([query])

k = 3
distances, indices = index.search(np.array(query_embedding), k)
retrieved_chunks = [chunks[i] for i in indices[0]]
print("Retrieved chunks:")
for chunk in retrieved_chunks:
    print("- " + chunk)
Ask a question about the topic: Legal Cases Against Apple Computers
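
FAISS also returns the L2 distances of the retrieved chunks, which you can use as a rough relevance filter before passing them to the generator. The threshold below is an arbitrary example and would need tuning for your embedding model:

# optional: drop retrieved chunks whose distance from the query is too large
distance_threshold = 1.0   # example value; tune for your data and embedding model
filtered_chunks = [chunks[i] for d, i in zip(distances[0], indices[0]) if d < distance_threshold]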

Step 4: Answering the Question with an LLM

Now, we will use a pre-trained question-answering model to extract the final answer from the retrieved context:

qa_model_name = "deepset/roberta-base-squad2"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)
qa_pipeline = pipeline("question-answering", model=qa_model, tokenizer=qa_tokenizer)

context = " ".join(retrieved_chunks)
answer = qa_pipeline(question=query, context=context)
print(f"Answer: {answer['answer']}")
Answer: siri assistant violated user privacy
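
One thing to note: deepset/roberta-base-squad2 is an extractive QA model, so the answer is always a short span copied from the retrieved context. If you would rather have a generated, full-sentence answer, you can swap the final step for a text-to-text model. Here is a minimal sketch assuming google/flan-t5-base as the generator; any similar instruction-tuned model would work:

# optional: generate a full-sentence answer instead of extracting a span
gen_pipeline = pipeline("text2text-generation", model="google/flan-t5-base")  # assumed model choice
prompt = f"Answer the question using only the context.\n\nContext: {context}\n\nQuestion: {query}"
generated = gen_pipeline(prompt, max_new_tokens=100)
print(f"Answer: {generated[0]['generated_text']}")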

So, this is how you can build a fully functional RAG pipeline for LLMs and use it in real-world AI applications.

Summary

In this article, we built a Retrieval-Augmented Generation (RAG) pipeline for LLMs using:

  1. Wikipedia as an external knowledge base
  2. Sentence Transformers for embedding generation
  3. FAISS for fast and efficient retrieval
  4. Hugging Face’s QA pipeline to extract final answers

I hope you liked this article on building a RAG Pipeline for LLMs. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.
