Many people still handle Exploratory Data Analysis (EDA) by hand. Picture this instead: you put a CSV file in a folder, and an AI agent reads the schema, writes Python code to analyze it, creates statistical summaries, builds visualizations, and explains the insights as a senior data scientist would. That’s what an Agentic AI pipeline to automate EDA can do. In this article, we’ll build one using a local LLM, with no paid APIs or cloud services needed.
Understanding the Agentic EDA Concept
Here, Agentic AI doesn’t mean a sentient machine. It means setting up a workflow where an LLM gets a clear role, the right tools, and some boundaries so it can handle a multi-step task on its own.
Normally, a person checks the data, writes the code, and interprets the results. In an agentic pipeline, things work differently:
- Python extracts the structural blueprint of your data.
- The LLM acts as a Coder Agent, generating the exact Python scripts needed to visualize that specific structure.
- The LLM then acts as an Analyst Agent, reviewing statistical summaries to highlight anomalies, patterns, and modeling recommendations.
With this approach, your job changes from writing code to managing the whole system.
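The division of labor above can be sketched as one small orchestration function. This is only an illustrative skeleton, not library code: `run_agentic_eda` and the stub LLM are names I've made up, and the stub stands in for a real model so the sketch runs anywhere.

```python
# Minimal sketch of the agentic flow. `llm` is any callable that takes a
# prompt string and returns text; here it is stubbed so the sketch runs
# without a model installed.

def run_agentic_eda(schema_info, summary_stats, llm):
    """Route one LLM through the two agent roles."""
    # Coder Agent: turn the data blueprint into analysis code
    code = llm(f"Write EDA code for this schema: {schema_info}")
    # Analyst Agent: turn statistical output into insights
    insights = llm(f"Interpret these summary statistics: {summary_stats}")
    return code, insights

# Stand-in for a real model such as Mistral served via Ollama
fake_llm = lambda prompt: f"[model response to: {prompt[:40]}...]"

code, insights = run_agentic_eda(
    {"columns": ["Open", "Close"]}, "mean, std, quartiles", fake_llm
)
print(code)
```

The rest of the article fills in the real pieces: the schema extraction, the prompts, and the local Mistral model behind `llm`.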
Building an Agentic AI Pipeline to Automate EDA
Everything will run on your own machine. Start by installing these dependencies:
pip install langchain langchain-community pandas matplotlib seaborn
Next, install Ollama from its official website. Then, download the model:
ollama pull mistral
We’ll use the open-weight Mistral model for this. It’s quick, lightweight, and works well for tasks like code generation and reasoning, especially when running locally.
In this tutorial, I’ll use a stock market index dataset, but you can use any CSV file with this pipeline.
Step 1: Initialize the Local LLM
First, set up the connection to your local model. We’ll use LangChain to manage this, since it makes it easier to work with different LLMs:
from langchain_community.llms import Ollama
# Initialize the Mistral model running locally via Ollama
llm = Ollama(model="mistral")

Mistral is great for this purpose. It’s a powerful 7B-parameter model that runs well on most modern laptops and does a good job writing Python code and formatting structured text.
Step 2: Extracting the Data Blueprint
A common mistake is giving the LLM the whole dataset as a prompt. A large file can overflow the model’s context window, and the extra tokens are wasted anyway.
The LLM doesn’t need every row to write code. It only needs the metadata:
import pandas as pd
# Load your dataset
df = pd.read_csv("sensex.csv")
# Extract only the structural information
schema_info = {
"columns": df.columns.tolist(),
"dtypes": df.dtypes.astype(str).to_dict(),
"missing_values": df.isnull().sum().to_dict(),
"shape": df.shape
}
print(schema_info)

By providing schema_info, we give the LLM the column names, data types, and missing-value counts. This is the same context a human data scientist would use to choose which plots to make.
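To see the blueprint idea end to end without the stock dataset, here is the same extraction run on a tiny inline DataFrame. The column names and values are invented purely for illustration:

```python
import pandas as pd

# Tiny stand-in DataFrame; in the real pipeline this is your loaded CSV
demo = pd.DataFrame({
    "Open": [100.5, 101.2, None],   # one missing value on purpose
    "Volume": [5000, 5200, 4900],
})

# Same structural extraction as in the main pipeline
demo_schema = {
    "columns": demo.columns.tolist(),
    "dtypes": demo.dtypes.astype(str).to_dict(),
    "missing_values": demo.isnull().sum().to_dict(),
    "shape": demo.shape,
}
print(demo_schema)
```

Whatever the size of the CSV, the LLM only ever sees these four short fields, which keeps prompts small and cheap.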
Step 3: The Coder Agent
Now, we’ll tell the LLM to write the EDA code for us. We’ll use a very specific prompt to control its output:
prompt = f"""
You are a data scientist.
Here is dataset metadata:
{schema_info}
Write Python code using pandas, matplotlib, and seaborn to:
1. Generate summary statistics
2. Plot distributions for numerical columns
3. Plot correlation heatmap
4. Identify missing values visually
Only return executable Python code.
"""
# Generate the code
generated_code = llm.invoke(prompt)
print(generated_code)

Pay attention to the instruction: “Only return executable Python code.”
If the LLM prepends conversational text like “Here is your code!”, passing the output to exec() will raise a SyntaxError. We want clean, usable code that matches the dataset’s schema exactly.
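Even with that instruction, local models sometimes wrap their answer in markdown fences or chatter. A defensive cleanup step helps; this helper is a sketch of my own (the function name and regex are not part of LangChain):

```python
import re

def extract_python(llm_output: str) -> str:
    """Pull code out of a ```python fenced block if present;
    otherwise return the raw text stripped of whitespace."""
    match = re.search(r"```(?:python)?\s*(.*?)```", llm_output, re.DOTALL)
    return match.group(1).strip() if match else llm_output.strip()

# Simulated messy model output
raw = "Here is your code!\n```python\nprint('hello')\n```"
clean = extract_python(raw)
print(clean)  # → print('hello')
```

Running `extract_python(generated_code)` before exec() makes the pipeline far more robust to formatting quirks.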
Step 4: The Analyst Agent
Writing code is important, but real value comes from interpretation. Next, we’ll use standard Pandas functions to get summary statistics, then send those results to the LLM for expert analysis:
# Generate a statistical summary of the dataset
analysis_summary = df.describe(include="all").to_string()
insight_prompt = f"""
You are a senior data scientist.
Here are summary statistics:
{analysis_summary}
Provide:
- Key patterns
- Potential data quality issues
- Interesting correlations
- Recommendations for modeling
"""
# Generate human-readable insights
insights = llm.invoke(insight_prompt)
print(insights)
Here, the LLM isn’t guessing. It reviews real statistical outputs like mean, standard deviation, min, max, and quartiles, then gives a qualitative analysis. For example, it might point out a big outlier or note that some features have zero variance and should be dropped before training a machine learning model.
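To close the loop, both agents’ outputs can be persisted as a single report. The wrapper below is a sketch (the file name and section layout are my own choices), shown with placeholder strings where the pipeline would pass the real generated_code and insights:

```python
from pathlib import Path

def write_report(generated_code: str, insights: str,
                 path: str = "eda_report.md") -> str:
    """Combine the Coder and Analyst outputs into one markdown report."""
    report = (
        "# Automated EDA Report\n\n"
        "## Generated Analysis Code\n\n"
        + generated_code + "\n\n"
        "## Analyst Insights\n\n"
        + insights + "\n"
    )
    Path(path).write_text(report)
    return report

# Placeholder strings stand in for llm.invoke(...) output
report = write_report("df.describe()", "No zero-variance columns found.")
print(report.splitlines()[0])  # → # Automated EDA Report
```

In the full pipeline you would call it as `write_report(generated_code, insights)` after both agents have run, giving you a shareable artifact per dataset.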
Closing Thoughts
That’s how you, as a data scientist, can build an Agentic AI pipeline to automate EDA.
In today’s industry, your value isn’t about how quickly you can write Matplotlib code from memory. It’s about designing systems and connecting tools, models, and logic to solve problems on a larger scale.
If you found this article helpful, you can follow me on Instagram for daily AI tips and practical resources. You may also be interested in my latest book, Hands-On GenAI, LLMs & AI Agents, a step-by-step guide to prepare you for careers in today’s AI industry.