One of the coolest things you can do with RAG is chat with a dataset stored in a CSV file. The great news is you don’t need a big budget or any paid APIs to do this. With free, open-source Python libraries, you can build a solid AI chatbot that works with your documents. In this article, I’ll show you how to turn any CSV file into an AI chatbot using Python.
Turn Any CSV into an AI Chatbot: Getting Started
Today, we’ll walk through how to turn a regular CSV file into a searchable vector database. We’ll process the data, turn it into numbers the computer can understand, and search it based on meaning, all with free tools.
Before we start coding, make sure your environment is ready. You’ll need to install a few common open-source libraries:
pip install pandas sentence-transformers faiss-cpu numpy
Let’s jump into the coding steps together.
Step 1: Loading and Serializing the CSV
First, we’ll load the data. For this example, I’m using a stock market dataset called sensex.csv. We’ll use Pandas to read the file, then turn each row into a single string, with column values separated by a pipe (|) character:
import pandas as pd
df = pd.read_csv("/content/sensex.csv")
# convert rows to text
documents = df.astype(str).apply(
lambda row: " | ".join(row), axis=1
).tolist()
print(documents[:3])
Output:
['Date | nan | nan | nan | nan | nan', '1997-07-01 | 4300.85986328125 | 4301.77001953125 | 4247.66015625 | 4263.10986328125 | 0.0', '1997-07-02 | 4333.89990234375 | 4395.31005859375 | 4295.39990234375 | 4302.9599609375 | 0.0']
If you just feed tabular data straight into an AI model, it often doesn’t work well. By joining each row into a structured string, we help the model understand how the data points in that row are connected.
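One thing worth handling early: real CSVs often arrive with blank or partial rows (notice the "nan | nan" entry in the output above). A minimal sketch of dropping fully empty rows before serialization, using a tiny made-up DataFrame in place of sensex.csv:

```python
import pandas as pd

# A tiny stand-in for sensex.csv (hypothetical values, for illustration only)
df = pd.DataFrame({
    "Date": ["1997-07-01", None],
    "Close": [4263.11, None],
})

# Drop rows that are entirely NaN before serializing, so junk strings
# like "nan | nan | nan" never reach the embedding model
df = df.dropna(how="all")

documents = df.astype(str).apply(lambda row: " | ".join(row), axis=1).tolist()
print(documents)  # -> ['1997-07-01 | 4263.11']
```

You could also use dropna(thresh=...) to keep rows that are only partially filled, depending on how messy your data is.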
Step 2: Generating Local Embeddings
Next, we need to turn our text strings into vectors. We’ll use the SentenceTransformer library and the all-MiniLM-L6-v2 model for this:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents)
The all-MiniLM-L6-v2 model is fast and lightweight, so it runs easily on a regular computer. It turns text into 384-dimensional vectors.
If you’re working in production, you might use a bigger model if you have the resources. But for testing things out locally, MiniLM is a popular choice.
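Encoding is the slowest part of the pipeline on re-runs, so one optional improvement is caching the embedding matrix to disk with NumPy. A sketch using a zero-filled stand-in array with the same shape model.encode would return (the filename is my own choice, not part of the pipeline above):

```python
import numpy as np

# Stand-in for model.encode(documents): a (rows x 384) float32 array,
# matching the shape all-MiniLM-L6-v2 would produce for 3 documents
embeddings = np.zeros((3, 384), dtype="float32")

# Cache to disk so later runs can skip the (comparatively slow) encoding step
np.save("embeddings.npy", embeddings)
loaded = np.load("embeddings.npy")
print(loaded.shape)  # -> (3, 384)
```

On the next run, you can check for the file first and only call model.encode when it is missing.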
Step 3: Indexing with FAISS
Now that we have our vectors, we need a way to store and search them quickly. We’ll use FAISS (Facebook AI Similarity Search), a library made for fast vector searches:
import faiss
import numpy as np
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))
For a small CSV with about 100 rows, you don’t really need a vector database; a simple cosine similarity script would work. But FAISS is great if you want to scale up.
IndexFlatL2 calculates the exact Euclidean distance between vectors. By starting with FAISS, your code will be ready to handle millions of rows later. We set up the index with the right dimension (384) and add our data to it.
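For the curious, here is what that "simple cosine similarity script" could look like, with toy 4-dimensional vectors standing in for the real 384-dimensional ones:

```python
import numpy as np

# Toy document "embeddings" (4-dim for readability; real ones are 384-dim)
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0]], dtype="float32")
query = np.array([1.0, 0.0, 0.0, 0.0], dtype="float32")

# Cosine similarity = dot product of L2-normalized vectors
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
scores = docs_n @ query_n

best = int(np.argmax(scores))
print(best)  # -> 0 (the row most similar to the query)
```

The same idea carries over to FAISS: if you normalize your embeddings first, an inner-product index (IndexFlatIP) gives you cosine similarity instead of Euclidean distance.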
Step 4: Querying the Data
Finally, we’ll try out the chatbot’s main feature: answering a user’s question. We’ll take a natural language query, turn it into a vector, and search our FAISS index for the top k most relevant rows:
query = "What was the highest closing price of Sensex in the dataset?"
query_vector = model.encode([query])
D, I = index.search(query_vector, k=3)
results = [documents[i] for i in I[0]]
print(results)
Output:
['2013-10-04 | 19915.94921875 | 20052.0 | 19833.169921875 | 19870.0 | 12900.0']
The index.search function gives us two arrays: D (the distances) and I (the indices of the matching rows). We use those indices to get the original text from our documents list. One caveat: similarity search retrieves rows whose wording resembles the question, so for an aggregate question like "highest closing price" the retrieved rows aren’t guaranteed to contain the true maximum. For that kind of query, computing the answer directly with Pandas is more reliable.
In a complete chatbot, you’d take this results array, put it into a prompt, and send it to an open-source LLM like Llama 3 or Mistral to create the final answer.
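As a sketch, that last step might look like the following. The retrieved row is the example from above, the prompt wording is my own, and the Ollama endpoint in the comment is just one illustrative way to reach a local LLM:

```python
# Hypothetical retrieved rows (same pipe-joined format as our documents list)
results = [
    "2013-10-04 | 19915.94921875 | 20052.0 | 19833.169921875 | 19870.0 | 12900.0",
]
query = "What was the highest closing price of Sensex in the dataset?"

# Stuff the retrieved context into a prompt for the LLM
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(results) + "\n\n"
    "Question: " + query
)
print(prompt)

# From here you would call your LLM of choice, e.g. a local Llama 3 via Ollama:
# requests.post("http://localhost:11434/api/generate",
#               json={"model": "llama3", "prompt": prompt})
```

Grounding the model in the retrieved rows like this is what keeps the answers tied to your actual data rather than the model’s general knowledge.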
Closing Thoughts
That’s the process for building a RAG app that turns any CSV file into an AI chatbot with Python.
When you build your own retrieval pipeline with tools like Pandas, SentenceTransformers, and FAISS, you get to see how everything works under the hood. You’ll see how high-dimensional vectors are represented in Python, learn how to organize messy data, and get a feel for latency, vector math, and indexing.
If you found this article helpful, you can follow me on Instagram for daily AI tips and practical resources. You may also be interested in my latest book, Hands-On GenAI, LLMs & AI Agents, a step-by-step guide to prepare you for careers in today’s AI industry.