How I Decide Whether a Dataset Is Worth Using

The saying ‘Garbage In, Garbage Out’ is common in Data Science, and for good reason. From my experience, picking the right dataset matters more than picking the right algorithm. You can spend hours tuning a Random Forest, but if your data is flawed, no amount of tweaking will help. In this article, I’ll share the approach I use to decide if a dataset is worth working with.

Here is the mental framework and practical strategy I use to decide if a dataset is gold or just fool’s gold.

The Context Check

Before I even start working with the data, I pause and ask myself: Does this data really reflect the situation I want to model?

This is a common place where students struggle. We tend to pick the cleanest dataset we can find, but real-world problems are rarely neat. Usually, the data we have is just a stand-in for what we actually want.

For example, imagine you want to predict Customer Satisfaction, and you have a dataset that includes ‘Call Duration’ from customer support logs.

The issue is that a short call could mean the problem was solved quickly and the customer is happy, or it could mean the customer got frustrated and hung up. If you use ‘Call Duration’ as a direct stand-in for satisfaction without considering the context, your model won’t be reliable.
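One way to test a proxy like this is to break it down by an outcome you trust. The sketch below is only an illustration: the data is made up, and the `resolved` column is an assumption about what a support log might contain, not something every dataset will have.

```python
import pandas as pd

# Hypothetical support-call log. `resolved` is an assumed column that
# records whether the issue was closed during the call.
calls = pd.DataFrame({
    "call_duration_min": [2, 3, 2, 15, 20, 3, 18, 2],
    "resolved": [True, False, True, True, False, False, True, False],
})

# Label each call as short or long (the 5-minute cutoff is arbitrary),
# then look at the resolution rate inside each group instead of
# assuming that short means satisfied.
calls["length"] = calls["call_duration_min"].le(5).map(
    {True: "short", False: "long"}
)
resolution_by_length = calls.groupby("length")["resolved"].mean()
print(resolution_by_length)
```

In this toy log, fewer than half of the short calls ended with the issue resolved, so call duration on its own says very little about satisfaction.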

Set a rule for yourself: If you can’t clearly explain how the data was collected and what it means, don’t use it.

The Health Inspection

After a dataset passes the context check, I load it, but I don’t jump into modeling right away. Instead, I look closely to see if the data is solid and reliable.

Check for missing values. Real-world datasets almost always have some. The reason data is missing matters more than how much of it is missing:

  1. MCAR (Missing Completely at Random): The gaps are unrelated to any value in the data. This is usually okay; you can fill in these values.
  2. MNAR (Missing Not at Random): The chance of a value being missing depends on the value itself. This is risky. For example, in a salary survey, high earners might not report their income. If you fill in those missing values with the average of the reported salaries, your model will be biased and underestimate salaries.
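The salary example can be reproduced in a few lines. This is a toy sketch with invented numbers. In a real dataset you never see the true values, so treat it as an illustration of the bias rather than a test you can run on your own data.

```python
import pandas as pd

# Made-up salary survey (in thousands). The two highest earners did not
# report their income, which mimics the MNAR case.
true_salary = pd.Series([40, 45, 50, 55, 60, 150, 200], dtype="float64")
reported = true_salary.copy()
reported[true_salary > 100] = float("nan")

# Step 1: how much is missing?
missing_share = reported.isna().mean()

# Step 2: what happens if we fill the gaps with the observed mean?
imputed = reported.fillna(reported.mean())

print(f"missing share:         {missing_share:.2f}")
print(f"true mean salary:      {true_salary.mean():.1f}")
print(f"mean after imputation: {imputed.mean():.1f}")
```

The imputed mean lands well below the true mean, because the values that went missing were exactly the ones that would have pulled the average up.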

Next, check the distribution of your main variables. I always plot histograms right away and look for a few key things:

  1. Zero Variance: If a column has the same value in every row, remove it.
  2. Impossible Values: A person’s age listed as 200, or a house price of $0.
  3. Severe Skew (class imbalance): If I’m predicting fraud, and only 0.01% of my rows are fraud, I need to know that immediately. A 99.9% accuracy score is meaningless here, because a model that always predicts ‘not fraud’ would score 99.99%.
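The three checks above can also be run as code before any plotting. Here is a minimal sketch with pandas. The table, its column names, and the valid age range of 0 to 120 are all assumptions made for illustration; set the ranges from your own domain knowledge.

```python
import pandas as pd

# Hypothetical transactions table; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 41, 200, 37, 52, 29, 33, 48, 61, 45],
    "country": ["IN"] * 10,  # same value in every row
    "amount": [120.0, 80.5, 300.0, 0.0, 45.0, 60.0, 75.0, 20.0, 99.0, 10.0],
    "is_fraud": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# 1. Zero variance: a column with one distinct value carries no signal.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# 2. Impossible values: the valid range is a domain assumption.
bad_age_rows = df[(df["age"] < 0) | (df["age"] > 120)]

# 3. Class imbalance: the share of the largest class is the accuracy you
#    get by always predicting that class, so it is the baseline to beat.
class_share = df["is_fraud"].value_counts(normalize=True)
majority_baseline = class_share.max()

print("constant columns:", constant_cols)
print("rows with impossible age:", bad_age_rows.index.tolist())
print("majority-class baseline accuracy:", majority_baseline)
```

Any accuracy score you report later should be compared against `majority_baseline`, not against zero.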

I once spent a week trying to predict server outages, only to find out that whoever prepared the dataset had filtered out all the outage logs as errors. I was trying to predict something that wasn’t even in the data. Always check your data’s distribution, starting with the target.

The Leakage Audit

This is the most technical and important step. Data leakage happens when your training data includes information that wouldn’t be available when you actually make predictions.

If you’re predicting the future, like stock prices or sales, you can’t use data from the future to train your model.

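For example, using ‘Total Sales 2024’ to predict ‘Sales Q1 2024’ is a mistake: the annual total already contains the Q1 figure, plus three quarters that had not happened yet at prediction time. If your data is time-based, always split your training and validation sets by time instead of randomly.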
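A time-based split can be as simple as sorting by date and cutting at a chosen point. The sketch below uses made-up monthly figures; the `date` and `sales` column names and the cutoff date are assumptions for the example.

```python
import pandas as pd

# Hypothetical monthly sales; `date` and `sales` are assumed column names.
sales = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=12, freq="MS"),
    "sales": [100, 110, 105, 120, 130, 125, 140, 150, 145, 160, 170, 165],
})

# Sort chronologically, then cut at a date: rows before the cutoff train
# the model, rows from the cutoff onward validate it.
sales = sales.sort_values("date").reset_index(drop=True)
cutoff = pd.Timestamp("2023-10-01")
train = sales[sales["date"] < cutoff]
valid = sales[sales["date"] >= cutoff]

# Sanity check: no training row may be later than any validation row.
assert train["date"].max() < valid["date"].min()
print(len(train), "training rows,", len(valid), "validation rows")
```

A random split of the same table would mix later months into the training set, which is exactly the leak described above.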

If I get a model with 99% accuracy on the first try, I don’t celebrate. It usually means I’ve accidentally included information that gives away the answer.
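When a score looks too good, one quick first check is to see whether any single feature almost perfectly tracks the target. This is a rough sketch, not a complete leakage test: the churn data is invented, `refund_issued` is a hypothetical column that only gets filled in after a customer has churned, and the 0.95 threshold is arbitrary.

```python
import pandas as pd

# Hypothetical churn data. `refund_issued` is an assumed column that is
# recorded only after a customer has churned, so it leaks the answer.
df = pd.DataFrame({
    "monthly_spend": [20, 35, 50, 28, 42, 31, 55, 25],
    "support_calls": [1, 0, 3, 2, 4, 1, 0, 2],
    "refund_issued": [0, 1, 0, 1, 1, 0, 0, 1],
    "churned": [0, 1, 0, 1, 1, 0, 0, 1],
})

# Absolute correlation of each feature with the target. A value near 1.0
# for a single feature is a cue to ask when that column gets recorded.
corr = df.drop(columns="churned").corrwith(df["churned"]).abs()
suspects = corr[corr > 0.95].index.tolist()  # 0.95 is an arbitrary cutoff

print(corr.round(2))
print("possible leakage:", suspects)
```

A flagged column is not automatically a leak, and a leak will not always show up as a high linear correlation, so the final call still comes from knowing how and when each column was collected.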

Closing Thoughts

Rejecting a dataset is tough. It often means starting over, collecting your own data, or working hard to get access to better internal databases.

A data scientist isn’t a magician who creates truth from numbers. We’re more like translators who find truth in what we observe. If the data is flawed, the results will be misleading.

So, don’t hesitate to say that the data isn’t good enough.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
