Data Cleaning Methods You Should Know

In Data Science, cleaning data is a crucial step as it ensures the accuracy and quality of the data for analysis. There are some data cleaning methods you should know as a Data Scientist/Analyst to clean your data for analysis. So, in this article, I’ll take you through a practical guide to data cleaning methods you should know with implementation using Python.

Data Cleaning Methods You Should Know

Below are all the data cleaning methods you should know:

  1. Removing Duplicates
  2. Handling Missing Values
  3. Dealing with Outliers
  4. Converting Data Types
  5. Handling Categorical Data
  6. Dealing with Inconsistent Data

Let’s understand all these methods practically using Python. To implement all these techniques, I’ll be using a dataset based on food delivery costs. You can download the dataset from here.

Now, let’s understand all these data cleaning methods using Python.

Removing Duplicates

Always make sure your data doesn’t have any duplicate records, as duplicate records can mislead the results you are aiming to find from your data. Here’s how to check whether your data has any duplicate records and remove them:

import pandas as pd

data = pd.read_csv("food_orders_new_delhi.csv")

# check for duplicate rows
duplicate_rows = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

data = data.drop_duplicates()
Number of duplicate rows: 0

Handling Missing Values

Identify missing values and decide on a strategy:

  • Remove: Delete rows/columns with missing values.
  • Fill: Impute missing values using the mean, median, mode, or constant value.
  • Interpolate: Use interpolation methods to estimate missing values.

In our dataset, only the discount and offers column has missing values. To handle missing values in the Discounts and Offers column, we need to decide on an appropriate strategy. Here are a few common approaches:

  1. Fill with a default value: This could be ‘None’ or ‘0%’ indicating no discount.
  2. Fill with the most frequent value: Use the mode of the column.

Given that Discounts and Offers is a categorical column, filling with a default value like ‘None’ might be the most straightforward and sensible approach if we assume that missing values imply no discount was applied. Here’s how to implement this strategy:

data['Discounts and Offers'].fillna('None', inplace=True)

Identify Outliers

Identify outliers using statistical methods (IQR, Z-score) or visualization (box plots). Then, decide on a strategy:

  • Remove: Delete rows with outliers.
  • Cap: Limit outliers to a specific range.
  • Transform: Use transformations like log or square root.

Please note that removing outliers is not a crucial step every time. Only remove outliers if they affect the interpretation of the data or a machine learning model. Here’s how to deal with outliers using Python:

numerical_columns = ['Order Value', 'Delivery Fee', 'Commission Fee', 'Payment Processing Fee', 'Refunds/Chargebacks']

def cap_outliers(series):
    lower_cap = series.quantile(0.01)
    upper_cap = series.quantile(0.99)
    return series.clip(lower_cap, upper_cap)

for col in numerical_columns:
    data[col] = cap_outliers(data[col])

Converting Data Types

Identify columns with incorrect data types and convert data types to appropriate formats (e.g., strings to dates, floats to integers). Here’s how to convert the date columns in our dataset to the datetime data type:

data['Order Date and Time'] = pd.to_datetime(data['Order Date and Time'])
data['Delivery Date and Time'] = pd.to_datetime(data['Delivery Date and Time'])

Handling Categorical Data

Identify categorical columns. If the problem demands, convert categorical data to numerical format using one-hot encoding or label encoding. Here’s how you can convert categorical data into a numerical format using Python:

data = pd.get_dummies(data, columns=['Payment Method'], drop_first=True)

Handling Inconsistent Data

Identify inconsistencies (e.g., different formats, typos) and correct inconsistencies manually or using predefined rules. In our dataset, the discount and offers column has data inconsistencies, which makes it a challenge to calculate the exact discount amount. Here’s how to handle the inconsistent data in the discount and offers column:

def extract_discount(discount_str):
    if 'off' in discount_str:
        # fixed amount off
        try:
            return float(discount_str.split(' ')[0]), 'fixed'
        except ValueError:
            return 0.0, 'none'
    elif '%' in discount_str:
        # percentage off
        try:
            return float(discount_str.split('%')[0]), 'percentage'
        except ValueError:
            return 0.0, 'none'
    else:
        # no discount
        return 0.0, 'none'

# apply the function to create new 'Discount Percentage' and 'Discount Type' columns
data['Discount Value'], data['Discount Type'] = zip(*data['Discounts and Offers'].apply(extract_discount))

# calculate the discount amount based on the order value and discount type
def calculate_discount_amount(row):
    if row['Discount Type'] == 'percentage':
        return row['Order Value'] * row['Discount Value'] / 100
    elif row['Discount Type'] == 'fixed':
        return row['Discount Value']
    else:
        return 0.0

data['Discount Amount'] = data.apply(calculate_discount_amount, axis=1)
print(data.head())
   Order ID Customer ID Restaurant ID Order Date and Time  \
0 1 C8270 R2924 2024-02-01 01:11:52
1 2 C1860 R2054 2024-02-02 22:11:04
2 3 C6390 R2870 2024-01-31 05:54:35
3 4 C6191 R2642 2024-01-16 22:52:49
4 5 C6734 R2799 2024-01-29 01:19:30

Delivery Date and Time Order Value Delivery Fee Discounts and Offers \
0 2024-02-01 02:39:52 1914 0 5% on App
1 2024-02-02 22:46:04 986 40 10%
2 2024-01-31 06:52:35 937 30 15% New User
3 2024-01-16 23:38:49 1463 50 None
4 2024-01-29 02:48:30 1974 30 50 off Promo

Commission Fee Payment Processing Fee Refunds/Chargebacks \
0 150.0 47 0
1 198.0 23 0
2 195.0 45 0
3 146.0 27 0
4 130.0 50 0

Payment Method_Credit Card Payment Method_Digital Wallet Discount Value \
0 True False 5.0
1 False True 10.0
2 False False 15.0
3 False False 0.0
4 False False 50.0

Discount Type Discount Amount
0 percentage 95.70
1 percentage 98.60
2 percentage 140.55
3 none 0.00
4 fixed 50.00

Summary

So, below are all the data cleaning methods you should know:

  1. Removing Duplicates
  2. Handling Missing Values
  3. Dealing with Outliers
  4. Converting Data Types
  5. Handling Categorical Data
  6. Dealing with Inconsistent Data

I hope you liked this article on data cleaning methods you should know and how to implement them using Python. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.

Aman Kharwal
Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.

Articles: 2116

One comment

  1. I just wanted to ask if the categorical column have more than 9 to 10 varities how to categorize them for example bike company thier model then how to bifurcate the data

Leave a Reply

Discover more from AmanXai by Aman Kharwal

Subscribe now to keep reading and get access to the full archive.

Continue reading