If you’re building an LLM app, making it run is just the first step. The real test is making sure it works well, reliably, and at scale. That’s why you need an LLM evaluation pipeline.
Many beginners skip this step and just do a few manual tests. That might be fine for demos, but it quickly fails in real use. If your app uses retrieval (RAG), agents, or structured outputs, you need to measure quality all the time, not just once.
In this article, I’ll show you how to build a full evaluation pipeline for an LLM app using Python, and you won’t need any paid APIs or tools.
LLM Evaluation Pipeline: Getting Started
Simply put, an evaluation pipeline is a system that:
- Runs your LLM app on a predefined dataset.
- Compares outputs against expected results (or evaluates quality heuristically).
- Produces metrics you can track over time.
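In code, that loop can be tiny. Here is a minimal sketch of the shape we'll flesh out in the rest of this article (the app and metric below are trivial placeholders, not real implementations):

def my_llm_app(question):
    return "stub answer"                        # stand-in for your real LLM app

def score(prediction, expected):
    return float(prediction == expected)        # stand-in for a real metric

def evaluate(dataset):
    rows = []
    for example in dataset:                     # 1. run the app on every example
        prediction = my_llm_app(example["question"])
        rows.append({
            "question": example["question"],
            "prediction": prediction,
            "score": score(prediction, example["answer"]),  # 2. compare to the expected answer
        })
    return rows                                 # 3. aggregate these rows into metrics you track over time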
Here, I’ll build a simple but realistic pipeline for a RAG-based question-answering system.
To get started, install these required libraries:
pip install transformers sentence-transformers faiss-cpu pandas scikit-learn
Now, let’s get started with building an LLM evaluation pipeline step-by-step.
If you want to go beyond simple LLM demos and learn how to build production-ready AI systems, I’ve covered it step-by-step in my book: Hands-On GenAI, LLMs & AI Agents.
Step 1: Create a Small Evaluation Dataset
In real projects, this dataset is your most valuable asset. Start with a small but well-organized set:
import pandas as pd
data = [
    {
        "question": "What does a data scientist do?",
        "context": "A data scientist analyzes data to extract insights using statistics and machine learning.",
        "answer": "A data scientist analyzes data to extract insights."
    },
    {
        "question": "What is machine learning?",
        "context": "Machine learning is a subset of AI that enables systems to learn from data.",
        "answer": "Machine learning allows systems to learn from data."
    }
]

df = pd.DataFrame(data)

Each row should include the input question, the ground truth context, and the expected answer.
Step 2: Build a Simple Retriever
We’ll turn the contexts into embeddings and use FAISS to search for similar ones:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = df["context"].tolist()
doc_embeddings = model.encode(documents)
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings))

Here’s the Retriever function:
def retrieve(query, k=1):
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [documents[i] for i in indices[0]]
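Before moving on, it helps to sanity-check the retriever. The exact output depends on the embedding model, but with our two documents a machine-learning question should return the machine-learning context:

print(retrieve("What is machine learning?"))
# Expected (roughly): ['Machine learning is a subset of AI that enables systems to learn from data.']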
Step 3: Add an Open-Source LLM

We’ll pick a lightweight model from Hugging Face:
from transformers import pipeline
generator = pipeline("text-generation", model="distilgpt2")
def generate_answer(query, context):
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_length=100, num_return_sequences=1)
    return output[0]["generated_text"]
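One thing to be aware of: by default, the Hugging Face text-generation pipeline returns the prompt together with the generated continuation, so the prediction will contain the context and question as well as the answer. If you want only the continuation, you can strip the prompt yourself (or pass return_full_text=False to the pipeline call). A minimal sketch of the stripping approach:

def generate_answer_only(query, context):
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_length=100, num_return_sequences=1)
    full_text = output[0]["generated_text"]
    # Drop the echoed prompt so only the model's continuation remains
    return full_text[len(prompt):].strip()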
Step 4: Run the Pipeline

Now, combine retrieval and generation:
def rag_pipeline(question):
    retrieved_context = retrieve(question)[0]
    answer = generate_answer(question, retrieved_context)
    return retrieved_context, answer

Here’s how to run an evaluation:
results = []
for _, row in df.iterrows():
    context, prediction = rag_pipeline(row["question"])
    results.append({
        "question": row["question"],
        "ground_truth": row["answer"],
        "prediction": prediction,
        "retrieved_context": context
    })

results_df = pd.DataFrame(results)

Step 5: Define LLM Evaluation Metrics
This is where many people make mistakes. You don’t need complicated metrics at first; you just need useful ones.
Start with Semantic Similarity (Answer Quality):
from sklearn.metrics.pairwise import cosine_similarity
def similarity_score(a, b):
    emb1 = model.encode([a])
    emb2 = model.encode([b])
    return cosine_similarity(emb1, emb2)[0][0]

results_df["similarity"] = results_df.apply(
    lambda row: similarity_score(row["ground_truth"], row["prediction"]),
    axis=1
)

Next, we need to check Context Relevance:
results_df["context_score"] = results_df.apply(
lambda row: similarity_score(row["retrieved_context"], row["ground_truth"]),
axis=1
)Finally, we will check the Groundedness (Simple Heuristic):
def groundedness(answer, context):
    return int(any(word in context for word in answer.split()))

results_df["groundedness"] = results_df.apply(
    lambda row: groundedness(row["prediction"], row["retrieved_context"]),
    axis=1
)
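Note that this heuristic is very forgiving: a single shared word (even "a" or "the") is enough to mark an answer as grounded. If you want something slightly stricter, one option is to require that a minimum fraction of the answer's words appear in the context. A possible variant (the 0.5 threshold is an arbitrary choice, not a standard value):

def groundedness_strict(answer, context, threshold=0.5):
    # Fraction of lowercased answer words that also appear in the context
    words = [w.lower() for w in answer.split()]
    if not words:
        return 0
    overlap = sum(w in context.lower() for w in words) / len(words)
    return int(overlap >= threshold)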
print("Average Similarity:", results_df["similarity"].mean())
print("Average Context Score:", results_df["context_score"].mean())
print("Groundedness Rate:", results_df["groundedness"].mean())Average Similarity: 0.6464711
Average Context Score: 0.89787185
Groundedness Rate: 1.0
These results suggest the retrieval side is working well. The high Context Score (0.89) means the retriever usually finds the right information. The Groundedness Rate of 1.0 indicates that, at least by our simple heuristic, the answers draw on the provided context rather than invented facts. The Similarity Score (0.64), however, means the answers are only somewhat close to what we expect, so the generation step is where there is the most room to improve.
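Since the whole point of the pipeline is to track quality over time, not just compute it once, it is worth persisting each run's aggregates. A minimal sketch, assuming you are happy appending to a local CSV (the file name eval_history.csv is just an example):

import os
from datetime import datetime

run_summary = pd.DataFrame([{
    "timestamp": datetime.now().isoformat(),
    "avg_similarity": results_df["similarity"].mean(),
    "avg_context_score": results_df["context_score"].mean(),
    "groundedness_rate": results_df["groundedness"].mean(),
}])

# Append this run to a history file so you can compare runs after every change
run_summary.to_csv(
    "eval_history.csv",
    mode="a",
    header=not os.path.exists("eval_history.csv"),
    index=False,
)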
Now you have something most beginners miss: a way to measure quality.
Closing Thoughts
Building an LLM app without an evaluation pipeline is like launching a model without checking if it works. You’re guessing instead of engineering.
This is what turns LLM experiments into real AI engineering.
I hope you found this article helpful for building a complete evaluation pipeline for an LLM.
For more AI and machine learning tips, follow me on Instagram. My book, Hands-On GenAI, LLMs & AI Agents, can also help you grow your AI career.