How to Clean Any Dataset in 5 Steps

Data scientists spend about 80% of their time cleaning data, and there’s a good reason for that. Cleaning data isn’t just a task before the main work; it is the main work. Even the best AI model will give poor results if the data is messy. In this article, I’ll show you a simple 5-step strategy to clean any dataset.

Clean Any Dataset in 5 Steps

To make this practical, we’ll use a raw dataset of Netflix viewership numbers from January to June. It’s messy enough to be interesting. You can download the dataset here.

Step 1: The Initial Audit

Before making any changes, take time to understand what you’re working with.

In Python, especially with Pandas, the .head() and .info() functions are very helpful.

Always start looking for three things:

  1. Data Types: Are numbers actually stored as numbers?
  2. Missing Values: Where are the holes?
  3. Junk Columns: What don’t we need?

Let’s look at our Netflix data:

import pandas as pd
df = pd.read_csv('/content/jan to june netflix.csv')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18214 entries, 0 to 18213
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 0 non-null float64
1 Title 18214 non-null object
2 Available Globally? 18214 non-null object
3 Release Date 4855 non-null object
4 Hours Viewed 18214 non-null object
dtypes: float64(1), object(4)
memory usage: 711.6+ KB
None

Right away, we notice some problems. The Unnamed: 0 column is completely empty (0 non-null). The Hours Viewed column is stored as text instead of a number. The Release Date column is missing about 75% of its data.

Step 2: Trim the Fat

Keeping things simple is best. If a column doesn’t add value or insight, it’s time to delete it.

Empty or useless columns add unnecessary noise. They make your dataframe larger and harder to read.

In our dataset, the Unnamed: 0 column is completely empty. It was probably left over from a previous save. Let’s remove it:

# dropping the empty column
df.drop(columns=['Unnamed: 0'], inplace=True)

Step 3: Fix Identity Crises

This is a common mistake for beginners. You might try to calculate the average viewership, but Python gives an error because the numbers are stored as text.

Computers take things literally. For example, “1,000” is seen as a word (string) because of the comma. We need to remove the formatting and tell Python to treat it as a number.

Take the Hours Viewed column as an example. The values appear as “81,21,00,000”. These commas prevent us from doing calculations. Here’s how to fix it:

# Remove commas and convert to numeric
df['Hours Viewed'] = df['Hours Viewed'].str.replace(',', '')
df['Hours Viewed'] = pd.to_numeric(df['Hours Viewed'])

Now, “81,21,00,000” is converted to the integer 812100000. This allows us to calculate means, medians, and correlations.

Step 4: Handle the Void

When data is missing, you have to decide what to do: delete the row, estimate the value (impute), or leave it as is.

If your dataset has millions of rows and only a few are missing, you can simply drop those rows. If it’s a numeric value, you might fill it with the average (mean) or with 0.

In our dataset, the Release Date is missing for almost 14,000 rows. Dropping those rows would remove 75% of our data, which is too much. Instead, we can convert the available dates to the correct format and keep the missing ones as NaT (Not a Time) so we can still analyze the Titles:

# Convert to datetime, errors='coerce' turns unparseable data into NaT
df['Release Date'] = pd.to_datetime(df['Release Date'],
                                    errors='coerce')

Step 5: Standardization

Finally, we need to make the data consistent. Text data can be messy, with issues like inconsistent capitalization, extra spaces, and unusual symbols.

There is a column called “Available Globally?” with values “Yes” and “No”. We can convert this to a Boolean (True/False) to make filtering easier later:

# Convert Yes/No to True/False
df['Available Globally?'] = df['Available Globally?'].map({'Yes': True, 'No': False})

Now, instead of matching strings, we can just use df[df[‘Available Globally?’]] to find global shows.

Closing Thoughts

That’s the 5-step strategy for cleaning any dataset. We began with a file full of extra columns, numbers stored as text, and missing dates. Now, we have a clean, well-structured dataset ready for analysis.

Data cleaning is more than just fixing mistakes; it’s about taking pride in your work. It means getting to know your data, understanding its details, and preparing it to reveal useful insights.

If you found this article useful, you can follow me on Instagram for daily AI tips and practical resources. You might also like my latest book, Hands-On GenAI, LLMs & AI Agents. It’s a step-by-step guide to help you get ready for jobs in today’s AI field.

Aman Kharwal
Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.

Articles: 2116

Leave a Reply

Discover more from AmanXai by Aman Kharwal

Subscribe now to keep reading and get access to the full archive.

Continue reading