Build a GraphRAG Pipeline for Smart Retrieval

Standard RAG uses vector search, which works like searching a library by general meaning: it returns the passages most similar to your query. It’s good for finding specific facts, but not for connecting facts that live in different passages. That’s where GraphRAG comes in. Instead of seeing your data as separate documents, GraphRAG views it as a network of connected facts. In this article, I’ll show you how to build a GraphRAG pipeline for smarter retrieval.

What is GraphRAG?

Picture yourself at a dinner party. Vector RAG is like the guest who has memorized many encyclopedias. If you ask, “Who is Sam Altman?”, they simply recite his biography.

GraphRAG is like the guest who knows all the connections between people. If you ask, “Who is Sam Altman?”, they reply, “He started OpenAI, which got billions from Microsoft. By the way, Microsoft also owns GitHub.”

GraphRAG uses a knowledge graph, which is a network of entities (nodes) and relationships (edges). This lets it move from one fact to another and surface connections that vector search alone tends to miss.
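
To make the idea of nodes, edges, and hops concrete, here is a minimal sketch of the dinner-party example as a plain Python dictionary. The relation names (`started`, `funded_by`, `owns`) and the helper `facts_within` are illustrative assumptions, not part of the pipeline built later.

```python
# Toy graph for the dinner-party example: node -> list of (relation, neighbor).
# Relation names are made up for illustration.
graph = {
    "Sam Altman": [("started", "OpenAI")],
    "OpenAI": [("funded_by", "Microsoft")],
    "Microsoft": [("owns", "GitHub")],
    "GitHub": [],
}


def facts_within(graph, start, max_hops):
    """Collect facts reachable from `start` by following up to `max_hops` outgoing edges."""
    facts = []
    frontier = [start]
    seen = {start}
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for relation, neighbor in graph.get(node, []):
                facts.append(f"{node} {relation} {neighbor}")
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return facts


# One hop only reaches the direct fact about Sam Altman.
print(facts_within(graph, "Sam Altman", 1))
# Three hops reach GitHub, which is never mentioned next to Sam Altman.
print(facts_within(graph, "Sam Altman", 3))
```

With one hop you get only the biography-style fact. With three hops the walk reaches GitHub, which is the kind of indirect connection the second dinner guest volunteers.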

Let’s get a better understanding by building a GraphRAG pipeline from the ground up.

Build a GraphRAG Pipeline

We are going to build a pipeline that:

Reads text.
Extracts relationships (Subject -> Predicate -> Object).
Builds a Graph using NetworkX.
Retrieves context by walking the graph (Multi-hop reasoning).
Answers a question based on that deep context.

You’ll need a few libraries installed. In your terminal, run:

pip install networkx langchain langchain-ollama

Step 1: Loading the LLM

We need an LLM to do the main work: reading text and extracting the relationships in it. Here, we’ll use Ollama to run Mistral on your own machine. It’s free, runs locally, and is capable enough for the extraction and question-answering steps in this tutorial.

Before starting, make sure Ollama is installed, then run this command in your terminal:

ollama pull mistral

Now, let’s load the LLM:

import networkx as nx
from langchain_ollama import ChatOllama
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# 1. Load Local LLM
# Temperature=0 is crucial here. We want facts, not creativity.
llm = ChatOllama(model="mistral", temperature=0)

For production, you might use GPT-4o or Claude 3.5 Sonnet for better accuracy. But for learning, Mistral is a great choice.

Step 2: Turning Text into Data

This step is the most important. We can’t put raw text straight into a graph; we need triples. A triple is the basic unit of a knowledge graph: (Head) -> [Relation] -> (Tail).

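We’ll use a JsonOutputParser to turn the LLM’s response into Python objects (a list of dictionaries) instead of free-form conversational text. The prompt asks for JSON only, and the parser handles the decoding: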

# 2. Prompt for Extracting Graph Triples
extract_prompt = PromptTemplate(
    template="""
You are an expert knowledge graph builder.
Extract entities and relationships from the text.
Return ONLY a JSON list. Each item must contain:
- "head": source entity
- "relation": relationship
- "tail": target entity

Text:
{text}

Output JSON:
""",
    input_variables=["text"],
)

extraction_chain = extract_prompt | llm | JsonOutputParser()
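
Even when the response parses as JSON, nothing guarantees that every item has all three keys, especially with a small local model. The sketch below is one possible way to filter the parsed output before building the graph. The helper name `clean_triples` is hypothetical, and the sample input is made up to show the failure cases.

```python
def clean_triples(raw):
    """Keep only well-formed, de-duplicated triples from parsed LLM output.

    Hypothetical helper: expects a list of dicts with "head", "relation", "tail".
    """
    if not isinstance(raw, list):
        return []
    cleaned = []
    seen = set()
    for item in raw:
        if not isinstance(item, dict):
            continue  # skip stray strings or numbers in the list
        values = [item.get(key) for key in ("head", "relation", "tail")]
        if not all(isinstance(v, str) and v.strip() for v in values):
            continue  # skip items with a missing or empty field
        head, relation, tail = (v.strip() for v in values)
        key = (head, relation, tail)
        if key not in seen:
            seen.add(key)
            cleaned.append({"head": head, "relation": relation, "tail": tail})
    return cleaned


# Made-up sample covering a duplicate, a missing field, and a non-dict item.
sample = [
    {"head": "OpenAI", "relation": "developed", "tail": "GPT-4"},
    {"head": " OpenAI ", "relation": "developed", "tail": "GPT-4"},
    {"head": "GPT-4", "relation": "powers"},
    "not a dict",
]
print(clean_triples(sample))
```

In the pipeline, you could pass the result of `extraction_chain.invoke(...)` through a filter like this before Step 4.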

Step 3: The Data Source

Let’s test our system with a short example about the AI industry. The facts are in separate sentences, and our goal is to connect them:

# 3. Enterprise Knowledge Example
company_text = """
OpenAI was founded by Sam Altman and Elon Musk.
OpenAI developed GPT-4.
GPT-4 powers ChatGPT.
Microsoft partnered with OpenAI.
Microsoft invested 10 billion dollars in OpenAI.
ChatGPT is used by millions of users worldwide.
"""

print("\n Extracting knowledge graph triples...\n")
triples = extraction_chain.invoke({"text": company_text})
print(triples)

Extracting knowledge graph triples...

[{'head': 'OpenAI', 'relation': 'founded_by', 'tail': 'Sam Altman'}, {'head': 'OpenAI', 'relation': 'founded_by', 'tail': 'Elon Musk'}, {'head': 'OpenAI', 'relation': 'developed', 'tail': 'GPT-4'}, {'head': 'GPT-4', 'relation': 'powers', 'tail': 'ChatGPT'}, {'head': 'OpenAI', 'relation': 'partnered_with', 'tail': 'Microsoft'}, {'head': 'Microsoft', 'relation': 'invested_in', 'tail': 'OpenAI'}, {'head': 'Microsoft', 'relation': 'invested_amount', 'tail': '10 billion dollars'}, {'head': 'ChatGPT', 'relation': 'used_by', 'tail': 'millions of users worldwide'}]

Step 4: Building the Graph

Now we’ll use NetworkX, a Python library for working with graphs. We’ll take the JSON triples from Step 3 and actually create the connections:

# 4. Build Knowledge Graph
kg = nx.DiGraph() # DiGraph means "Directed Graph" (arrows point one way)

def build_knowledge_graph(triples):
    for item in triples:
        head = item.get("head")
        tail = item.get("tail")
        relation = item.get("relation")

        if head and tail:
            kg.add_node(head)
            kg.add_node(tail)
            kg.add_edge(head, tail, label=relation)

build_knowledge_graph(triples)

print("\n Nodes in Graph:")
print(list(kg.nodes()))

 Nodes in Graph:
['OpenAI', 'Sam Altman', 'Elon Musk', 'GPT-4', 'ChatGPT', 'Microsoft', '10 billion dollars', 'millions of users worldwide']
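
One design choice worth knowing about: a DiGraph stores at most one edge per ordered pair of nodes, so a second relation with the same head and tail replaces the first one's label. The sample data avoids this only because the two Microsoft and OpenAI relations point in opposite directions. The sketch below uses two made-up triples to show the difference between `nx.DiGraph` and `nx.MultiDiGraph`.

```python
import networkx as nx

# Two different relations between the same ordered pair (made up for illustration).
pairs = [
    ("Microsoft", "OpenAI", "partnered_with"),
    ("Microsoft", "OpenAI", "invested_in"),
]

simple = nx.DiGraph()
multi = nx.MultiDiGraph()
for head, tail, relation in pairs:
    simple.add_edge(head, tail, label=relation)
    multi.add_edge(head, tail, label=relation)

# DiGraph: the second add_edge overwrote the first label.
print(simple.number_of_edges())                 # 1
print(simple["Microsoft"]["OpenAI"]["label"])   # invested_in

# MultiDiGraph: both relations are kept as parallel edges.
labels = [data["label"] for _, _, data in multi.edges(data=True)]
print(labels)                                   # ['partnered_with', 'invested_in']
```

If you switch to a MultiDiGraph, the retrieval code in the next step would need adjusting, because `get_edge_data` on a multigraph returns one attribute dictionary per parallel edge instead of a single one.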

Step 5: Multi-Hop

This is where smart retrieval happens. In standard RAG, searching for “ChatGPT” gives you the sentence “ChatGPT is used by millions.” With GraphRAG, we start at “ChatGPT” and explore its connections:

Start at ChatGPT.
Look backward: “Powered by GPT-4”.
Walk to GPT-4: “Developed by OpenAI”.
Walk to OpenAI: “Invested in by Microsoft”.

Now we can see that Microsoft is linked to ChatGPT, even though they were never mentioned together in the same sentence in the original text.

Here’s how to implement it:

# 5. MULTI-HOP RETRIEVAL
def retrieve_graph_context(entity, max_depth=2):
    context = set()
    visited_nodes = set()

    def dfs(node, depth):
        if depth > max_depth:
            return
        visited_nodes.add(node)

        # 1. Check Outgoing edges (What does this node do?)
        for neighbor in kg.successors(node):
            relation = kg.get_edge_data(node, neighbor)["label"]
            context.add(f"{node} {relation} {neighbor}")
            if neighbor not in visited_nodes:
                dfs(neighbor, depth + 1)

        # 2. Check Incoming edges (Who interacts with this node?)
        for predecessor in kg.predecessors(node):
            relation = kg.get_edge_data(predecessor, node)["label"]
            context.add(f"{predecessor} {relation} {node}")
            if predecessor not in visited_nodes:
                dfs(predecessor, depth + 1)

    if entity in kg.nodes:
        dfs(entity, 1) # Start the traversal

    return ". ".join(context)
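
Two details of the function above are worth keeping in mind. The start node counts as depth 1, so `max_depth` is not quite the same as a hop count. And because the facts are collected in a set, the order of the joined string can change between runs. A depth-limited DFS with a shared visited set can also, in principle, reach a node first through a long path and then skip it on a shorter one. As an alternative, here is a minimal breadth-first sketch that works directly on triples. The helper `retrieve_bfs` and its `max_hops` parameter are assumptions for illustration; the triples are copied from the Step 3 output.

```python
from collections import defaultdict, deque

# Triples copied from the extraction output shown in Step 3.
TRIPLES = [
    ("OpenAI", "founded_by", "Sam Altman"),
    ("OpenAI", "founded_by", "Elon Musk"),
    ("OpenAI", "developed", "GPT-4"),
    ("GPT-4", "powers", "ChatGPT"),
    ("OpenAI", "partnered_with", "Microsoft"),
    ("Microsoft", "invested_in", "OpenAI"),
    ("Microsoft", "invested_amount", "10 billion dollars"),
    ("ChatGPT", "used_by", "millions of users worldwide"),
]


def retrieve_bfs(triples, start, max_hops=3):
    """Collect facts on edges within `max_hops` of `start` (hypothetical helper)."""
    # Index each triple under both endpoints so edges can be walked either way.
    incident = defaultdict(list)
    for head, relation, tail in triples:
        incident[head].append((head, relation, tail))
        incident[tail].append((head, relation, tail))

    if start not in incident:
        return []

    facts = []          # a list keeps the context order reproducible
    seen_facts = set()
    distance = {start: 0}
    queue = deque([start])

    while queue:
        node = queue.popleft()
        if distance[node] >= max_hops:
            continue  # hop budget used up: do not expand this node
        for head, relation, tail in incident[node]:
            fact = f"{head} {relation} {tail}"
            if fact not in seen_facts:
                seen_facts.add(fact)
                facts.append(fact)
            other = tail if node == head else head
            if other not in distance:
                # BFS reaches each node by a shortest path first.
                distance[other] = distance[node] + 1
                queue.append(other)
    return facts


context = retrieve_bfs(TRIPLES, "ChatGPT", max_hops=3)
print(". ".join(context))
```

With `max_hops=3` the walk reaches Microsoft through OpenAI and includes the `invested_in` fact. Raising it to 4 also expands Microsoft itself and picks up the investment amount.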

Step 6: The Final Answer

Finally, we feed that rich, interconnected context back to the LLM to answer the user’s question:

# 6. Final RAG Prompt
final_prompt = PromptTemplate(
    template="""
Answer the question using ONLY the context below.

Context:
{context}

Question:
{question}

Answer:
""",
    input_variables=["context", "question"]
)

rag_chain = final_prompt | llm

# 7. Ask a Multi-hop Reasoning Question
entity = "ChatGPT"

# We ask for a depth of 3 to catch distant connections
graph_context = retrieve_graph_context(entity, max_depth=3) 

print("\n Retrieved Graph Context:\n")
print(graph_context)

question = "Which company invested in the company that built ChatGPT?"

response = rag_chain.invoke({
    "context": graph_context,
    "question": question
})

print("\n Final Answer:\n")
print(response.content)

 Retrieved Graph Context:

GPT-4 powers ChatGPT. OpenAI founded_by Sam Altman. Microsoft invested_amount 10 billion dollars. OpenAI developed GPT-4. Microsoft invested_in OpenAI. ChatGPT used_by millions of users worldwide. OpenAI founded_by Elon Musk. OpenAI partnered_with Microsoft

 Final Answer:

 Microsoft invested in the company that built ChatGPT.

The model correctly identifies Microsoft. This works because the graph context includes the chain: Microsoft -> invested_in -> OpenAI -> developed -> GPT-4 -> powers -> ChatGPT.
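
The run above hardcodes `entity = "ChatGPT"`. In a real application you would need to pick the starting node from the question itself. A minimal sketch, assuming simple case-insensitive substring matching against the node names from Step 4, is shown below. The helper `find_entities` is hypothetical, and substring matching is a deliberately naive stand-in for proper entity linking.

```python
def find_entities(question, node_names):
    """Return graph nodes whose names appear in the question (case-insensitive)."""
    lowered = question.lower()
    matches = [name for name in node_names if name.lower() in lowered]
    # Longer names first, so more specific matches are tried before short ones.
    return sorted(matches, key=len, reverse=True)


# Node names copied from the Step 4 output.
nodes = [
    "OpenAI", "Sam Altman", "Elon Musk", "GPT-4", "ChatGPT",
    "Microsoft", "10 billion dollars", "millions of users worldwide",
]

question = "Which company invested in the company that built ChatGPT?"
print(find_entities(question, nodes))  # ['ChatGPT']
```

Each match could then be passed to `retrieve_graph_context`, and the resulting contexts combined before calling the final chain. This approach will miss paraphrases and aliases, so treat it as a starting point only.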

Closing Thoughts

You might think you could just read the text to find this out. That’s true for a few lines, but imagine having 10,000 PDF reports.

In fraud detection, GraphRAG can connect a suspicious phone number to an address, a previous claim, and a known fraudster. Vector search on its own would likely miss these links, because no single passage contains all of them. In medical research, it can connect Drug X to Protein Y to Disease Z across thousands of research papers.



Aman Kharwal
