If you’re building an LLM app, making it run is just the first step. The real test is making sure it works well, reliably, and at scale. That’s why you need an LLM evaluation pipeline.
Many beginners skip this step and just do a few manual tests. That might be fine for demos, but it quickly fails in real use. If your app uses retrieval (RAG), agents, or structured outputs, you need to measure quality all the time, not just once.
In this article, I’ll show you how to build a full evaluation pipeline for an LLM app using Python, and you won’t need any paid APIs or tools.
LLM Evaluation Pipeline: Getting Started
Simply put, an evaluation pipeline is a system that:
- Runs your LLM app on a predefined dataset.
- Compares outputs against expected results (or evaluates quality heuristically).
- Produces metrics you can track over time.
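In code, that loop can be tiny. Here is a minimal sketch of the shape we'll flesh out in the rest of this article (the app and metric below are trivial placeholders, not real implementations):

def my_llm_app(question):
    return "stub answer"                        # stand-in for your real LLM app

def score(prediction, expected):
    return float(prediction == expected)        # stand-in for a real metric

def evaluate(dataset):
    rows = []
    for example in dataset:                     # 1. run the app on every example
        prediction = my_llm_app(example["question"])
        rows.append({
            "question": example["question"],
            "prediction": prediction,
            "score": score(prediction, example["answer"]),  # 2. compare to the expected answer
        })
    return rows                                 # 3. aggregate these rows into metrics you track over time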
Here, I’ll build a simple but realistic pipeline for a RAG-based question-answering system.
To get started, install these required libraries:
pip install transformers sentence-transformers faiss-cpu pandas scikit-learn
Now, let’s get started with building an LLM evaluation pipeline step-by-step.
If you want to go beyond simple LLM demos and learn how to build production-ready AI systems, I’ve covered it step-by-step in my book: Hands-On GenAI, LLMs & AI Agents.
Step 1: Create a Small Evaluation Dataset
In real projects, this dataset is your most valuable asset. Start with a small but well-organized set:
import pandas as pd
data = [
    {
        "question": "What does a data scientist do?",
        "context": "A data scientist analyzes data to extract insights using statistics and machine learning.",
        "answer": "A data scientist analyzes data to extract insights."
    },
    {
        "question": "What is machine learning?",
        "context": "Machine learning is a subset of AI that enables systems to learn from data.",
        "answer": "Machine learning allows systems to learn from data."
    }
]

df = pd.DataFrame(data)

Each row should include the input question, the ground truth context, and the expected answer.
Step 2: Build a Simple Retriever
We’ll turn the contexts into embeddings and use FAISS to search for similar ones:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = df["context"].tolist()
doc_embeddings = model.encode(documents)
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings))

Here’s the Retriever function:
def retrieve(query, k=1):
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [documents[i] for i in indices[0]]
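Before moving on, it helps to sanity-check the retriever. The exact output depends on the embedding model, but with our two documents a machine-learning question should return the machine-learning context:

print(retrieve("What is machine learning?"))
# Expected (roughly): ['Machine learning is a subset of AI that enables systems to learn from data.']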
Step 3: Add an Open-Source LLM

We’ll pick a lightweight model from Hugging Face:
from transformers import pipeline
generator = pipeline("text-generation", model="distilgpt2")
def generate_answer(query, context):
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_length=100, num_return_sequences=1)
    return output[0]["generated_text"]
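One thing to be aware of: by default, the Hugging Face text-generation pipeline returns the prompt together with the generated continuation, so the prediction will contain the context and question as well as the answer. If you want only the continuation, you can strip the prompt yourself (or pass return_full_text=False to the pipeline call). A minimal sketch of the stripping approach:

def generate_answer_only(query, context):
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_length=100, num_return_sequences=1)
    full_text = output[0]["generated_text"]
    # Drop the echoed prompt so only the model's continuation remains
    return full_text[len(prompt):].strip()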
Step 4: Run the Pipeline

Now, combine retrieval and generation:
def rag_pipeline(question):
    retrieved_context = retrieve(question)[0]
    answer = generate_answer(question, retrieved_context)
    return retrieved_context, answer

Here’s how to run an evaluation:
results = []
for _, row in df.iterrows():
    context, prediction = rag_pipeline(row["question"])
    results.append({
        "question": row["question"],
        "ground_truth": row["answer"],
        "prediction": prediction,
        "retrieved_context": context
    })

results_df = pd.DataFrame(results)

Step 5: Define LLM Evaluation Metrics
This is where many people make mistakes. You don’t need complicated metrics at first; you just need useful ones.
Start with Semantic Similarity (Answer Quality):
from sklearn.metrics.pairwise import cosine_similarity
def similarity_score(a, b):
    emb1 = model.encode([a])
    emb2 = model.encode([b])
    return cosine_similarity(emb1, emb2)[0][0]

results_df["similarity"] = results_df.apply(
    lambda row: similarity_score(row["ground_truth"], row["prediction"]),
    axis=1
)

Next, we need to check Context Relevance:
results_df["context_score"] = results_df.apply(
lambda row: similarity_score(row["retrieved_context"], row["ground_truth"]),
axis=1
)Finally, we will check the Groundedness (Simple Heuristic):
def groundedness(answer, context):
    return int(any(word in context for word in answer.split()))

results_df["groundedness"] = results_df.apply(
    lambda row: groundedness(row["prediction"], row["retrieved_context"]),
    axis=1
)
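Note that this heuristic is very forgiving: a single shared word (even "a" or "the") is enough to mark an answer as grounded. If you want something slightly stricter, one option is to require that a minimum fraction of the answer's words appear in the context. A possible variant (the 0.5 threshold is an arbitrary choice, not a standard value):

def groundedness_strict(answer, context, threshold=0.5):
    # Fraction of lowercased answer words that also appear in the context
    words = [w.lower() for w in answer.split()]
    if not words:
        return 0
    overlap = sum(w in context.lower() for w in words) / len(words)
    return int(overlap >= threshold)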
print("Average Similarity:", results_df["similarity"].mean())
print("Average Context Score:", results_df["context_score"].mean())
print("Groundedness Rate:", results_df["groundedness"].mean())Average Similarity: 0.6464711
Average Context Score: 0.89787185
Groundedness Rate: 1.0
These results suggest the retrieval side is working well. The high Context Score (0.89) means the retriever usually finds the right information. The Groundedness Rate of 1.0 indicates that, at least by our simple heuristic, the answers draw on the provided context rather than invented facts. The Similarity Score (0.64), however, means the answers are only somewhat close to what we expect, so the generation step is where there is the most room to improve.
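Since the whole point of the pipeline is to track quality over time, not just compute it once, it is worth persisting each run's aggregates. A minimal sketch, assuming you are happy appending to a local CSV (the file name eval_history.csv is just an example):

import os
from datetime import datetime

run_summary = pd.DataFrame([{
    "timestamp": datetime.now().isoformat(),
    "avg_similarity": results_df["similarity"].mean(),
    "avg_context_score": results_df["context_score"].mean(),
    "groundedness_rate": results_df["groundedness"].mean(),
}])

# Append this run to a history file so you can compare runs after every change
run_summary.to_csv(
    "eval_history.csv",
    mode="a",
    header=not os.path.exists("eval_history.csv"),
    index=False,
)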
Now you have something most beginners miss: a way to measure quality.
Closing Thoughts
Building an LLM app without an evaluation pipeline is like launching a model without checking if it works. You’re guessing instead of engineering.
This is what turns LLM experiments into real AI engineering.
I hope you found this article helpful for building a complete evaluation pipeline for an LLM.
For more AI and machine learning tips, follow me on Instagram. My book, Hands-On GenAI, LLMs & AI Agents, can also help you grow your AI career.