If you’re a Data Scientist, chances are you already use Large Language Models (LLMs) like GPT-4, Claude, or Gemini in some form. But here’s what I’ve learned from mentoring fellow Data Scientists: most of us still write subpar prompts. Your instructions should be clear and structured enough that the model works like an extension of your own thinking. In this article, I’ll walk you through a practical guide to prompt engineering for Data Scientists that helps you avoid common mistakes.
Practical Guide to Prompt Engineering for Data Scientists
Let’s start with the core mindset shift. Before you write a prompt, you need to understand one thing: LLMs are like bright interns, brilliant with guidance but confused without context.
You wouldn’t hand your intern a vague task like “Fix the dataset.” So don’t tell the model: “Clean this data.” Tell it:
You are a data cleaning assistant. Remove duplicate rows, impute missing values in numerical columns using median, and encode categorical variables using one-hot encoding.
Prompting is not about being clever; it’s about being clear and concise.
Prompt Engineering for Common Data Science Tasks
Let’s break down the actual prompts you can use in day-to-day work, with examples and lessons from the field.
Data Cleaning with LLMs
Bad Prompt: Clean this dataset.
Why it fails: It’s too vague. The model doesn’t know your rules, your goals, or what “clean” even means in your context.
Better Prompt:
You are a Data Scientist. The dataset below contains customer transaction data with some missing values and inconsistent formatting. Your job is to:
1. Drop duplicates
2. Fill missing values in 'Age' with median
3. Strip whitespace from all string columns
4. Standardize date formats to YYYY-MM-DD
Provide the cleaned dataset as a pandas DataFrame.
Always specify the task, desired format, and step-by-step rules.
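To see why the step-by-step rules matter, note that each rule in the better prompt maps almost one-to-one onto a pandas operation. The sketch below is one way the requested code could look; it is not output from any particular model. The function name clean_transactions, the column name transaction_date, and the sample rows are assumptions made up for illustration (only 'Age' comes from the prompt), and ambiguous dates such as 03/15/2024 are assumed to be month-first.

```python
import pandas as pd


def _strip_if_str(value):
    """Strip surrounding whitespace from strings; leave other values untouched."""
    return value.strip() if isinstance(value, str) else value


def _to_iso_date(value):
    """Parse one date value and format it as YYYY-MM-DD; missing values pass through."""
    if pd.isna(value):
        return value
    return pd.to_datetime(value).strftime("%Y-%m-%d")


def clean_transactions(df: pd.DataFrame, date_col: str = "transaction_date") -> pd.DataFrame:
    # 1. Drop exact duplicate rows
    out = df.drop_duplicates().reset_index(drop=True)
    # 2. Fill missing values in 'Age' with the column median
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # 3. Strip whitespace from string values in every non-numeric column
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].map(_strip_if_str)
    # 4. Standardize dates; parsing value by value tolerates mixed input formats
    out[date_col] = out[date_col].map(_to_iso_date)
    return out


# Hypothetical sample rows, invented for this example
raw = pd.DataFrame({
    "customer": [" Ana ", "Ben", "Ben", "Cy "],
    "Age": [30.0, 40.0, 40.0, None],
    "transaction_date": ["03/15/2024", "2024-03-16", "2024-03-16", "2024/03/17"],
})
print(clean_transactions(raw))
```

The steps run in the order the prompt lists them, so rows that differ only by stray whitespace are not treated as duplicates. If that matters for your data, say so in the prompt (for example, by asking for whitespace to be stripped first).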
Feature Engineering Ideas
LLMs are often misused for feature engineering. Here’s an example of an effective prompt for this task:
You are a Feature Engineering expert helping a Data Scientist build features for a credit risk model.
Based on the following columns: ['loan_amount', 'income', 'loan_term', 'employment_status', 'credit_history'], suggest 5 new features that could improve predictive power.
Return the ideas in a table with columns: Feature Name, Description, Why It Helps.
This works because it:
- Gives the model context (credit risk).
- Asks for structured output.
- States the goal: predictive power.
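If you reuse this prompt across projects, it can help to generate it from the column list instead of retyping it. Here is a minimal sketch of that idea; the function name build_feature_prompt and its parameters are hypothetical, and the wording simply mirrors the prompt above.

```python
def build_feature_prompt(columns, model_goal, n_features=5):
    """Assemble the feature engineering prompt from a list of column names."""
    if not columns:
        raise ValueError("columns must contain at least one column name")
    column_list = ", ".join(f"'{name}'" for name in columns)
    return (
        # Context: who the model is and what the features are for
        "You are a Feature Engineering expert helping a Data Scientist "
        f"build features for {model_goal}.\n"
        # Goal: predictive power
        f"Based on the following columns: [{column_list}], suggest {n_features} "
        "new features that could improve predictive power.\n"
        # Structured output: a table with fixed columns
        "Return the ideas in a table with columns: "
        "Feature Name, Description, Why It Helps."
    )


prompt = build_feature_prompt(
    ["loan_amount", "income", "loan_term", "employment_status", "credit_history"],
    model_goal="a credit risk model",
)
print(prompt)
```

Keeping the context, goal, and output format as separate lines in the template makes it harder to drop one of them by accident when you adapt the prompt.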
Code Generation
Don’t just say: Write Python code to train a model.
Instead, guide the model like this:
You are a Python expert helping a Data Scientist.
Generate code to:
1. Split data into train/test (80/20)
2. Train a RandomForestClassifier
3. Evaluate using accuracy, precision, recall, and ROC-AUC
4. Include comments for each block
Use sklearn and pandas.
Be explicit with libraries, steps, and evaluation metrics. Treat it like you’re giving instructions to a junior teammate.
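For reference, here is roughly what code satisfying that prompt could look like. Treat it as a sketch under assumptions, not as what a model will actually return: the data is synthetic (generated with sklearn’s make_classification), the target is assumed to be binary, and the function name train_and_evaluate is made up for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(X: pd.DataFrame, y: pd.Series, seed: int = 42) -> dict:
    # 1. Split data into train/test (80/20), keeping class proportions similar
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )

    # 2. Train a RandomForestClassifier
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    # 3. Evaluate: hard predictions for accuracy/precision/recall,
    #    predicted probability of the positive class for ROC-AUC
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }


# Synthetic placeholder data so the example runs on its own
features, labels = make_classification(n_samples=500, n_features=8, random_state=0)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(8)])
y = pd.Series(labels, name="target")

for name, value in train_and_evaluate(X, y).items():
    print(f"{name}: {value:.3f}")
```

Because the prompt names the split ratio, the model class, and the four metrics, each of them can be checked against the generated code line by line.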
Summarization of Insights
You can use LLMs to summarize dashboards, model outputs, or even entire reports. Here’s how to write an ideal prompt to summarize insights:
You are a Data Science Lead summarizing insights for a business stakeholder.
Based on the following model performance report and confusion matrix, summarize:
- Key metrics in plain English
- What’s working well
- What needs improvement
- One next step
Keep the summary under 150 words, non-technical.
This prompt frames the audience (a business stakeholder), the tone (non-technical), and the format (four parts, with a maximum of 150 words).
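In practice, the report has to be pasted into the prompt alongside the instructions. The sketch below shows one possible way to do that: it derives the key metrics from confusion matrix counts and appends them to the summarization prompt. The helper names and the counts are hypothetical and chosen only for illustration.

```python
def metrics_from_confusion(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive accuracy, precision, and recall from binary confusion matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def build_summary_prompt(metrics: dict, max_words: int = 150) -> str:
    """Combine the fixed instructions with the report the model should summarize."""
    report = "\n".join(f"- {name}: {value:.1%}" for name, value in metrics.items())
    return (
        "You are a Data Science Lead summarizing insights for a business stakeholder.\n"
        "Based on the following model performance report, summarize:\n"
        "- Key metrics in plain English\n"
        "- What's working well\n"
        "- What needs improvement\n"
        "- One next step\n"
        f"Keep the summary under {max_words} words, non-technical.\n\n"
        f"Model performance report:\n{report}"
    )


# Hypothetical counts, invented for this example
metrics = metrics_from_confusion(tp=40, fp=10, fn=20, tn=30)
print(build_summary_prompt(metrics))
```

Computing the numbers yourself and passing them in as text means the model only has to explain the metrics, not calculate them.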
Final Words
So, a good prompt usually has these five components:
- Role: “You are a Data Scientist…”
- Task: “Your job is to clean the data by…”
- Context: “This dataset is from a churn prediction project…”
- Constraints: “Only use pandas and sklearn…”
- Output Format: “Return a pandas DataFrame and a short explanation…”
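One way to make these five components a habit is to treat them as required fields. The following is a minimal sketch of that idea, assuming a small helper class; PromptSpec and its render method are hypothetical names, and the example strings are placeholders written for this illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class PromptSpec:
    """The five components of a prompt, in the order they are rendered."""
    role: str
    task: str
    context: str
    constraints: str
    output_format: str

    def render(self) -> str:
        parts = asdict(self)  # keeps the field declaration order
        # Refuse to build a prompt with an empty component
        missing = [name for name, text in parts.items() if not text.strip()]
        if missing:
            raise ValueError(f"Missing prompt components: {', '.join(missing)}")
        return "\n".join(text.strip() for text in parts.values())


# Placeholder values, written only to show the shape of a complete prompt
spec = PromptSpec(
    role="You are a Data Scientist.",
    task="Your job is to clean the data by dropping duplicate rows and filling missing numeric values with the median.",
    context="This dataset is from a churn prediction project.",
    constraints="Only use pandas and sklearn.",
    output_format="Return Python code that produces a pandas DataFrame, followed by a short explanation.",
)
print(spec.render())
```

The check in render is the point of the sketch: a prompt with a blank role, context, or output format fails loudly instead of silently going to the model.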
I hope you liked this article on a practical guide to prompt engineering for Data Scientists. Feel free to ask valuable questions in the comments section below.





