Data Cleaning Methods You Should Know

In Data Science, cleaning data is a crucial step as it ensures the accuracy and quality of the data for analysis. There are some data cleaning methods you should know as a Data Scientist/Analyst to clean your data for analysis. So, in this article, I’ll take you through a practical guide to data cleaning methods you should know with implementation using Python.

Data Cleaning Methods You Should Know

Below are all the data cleaning methods you should know:

Removing Duplicates
Handling Missing Values
Dealing with Outliers
Converting Data Types
Handling Categorical Data
Dealing with Inconsistent Data

Let’s understand all these methods practically using Python. To implement all these techniques, I’ll be using a dataset based on food delivery costs. You can download the dataset from here.

Now, let’s understand all these data cleaning methods using Python.

Removing Duplicates

Always make sure your data doesn’t have any duplicate records, as duplicate records can mislead the results you are aiming to find from your data. Here’s how to check whether your data has any duplicate records and remove them:

import pandas as pd

data = pd.read_csv("food_orders_new_delhi.csv")

# check for duplicate rows
duplicate_rows = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

data = data.drop_duplicates()

Number of duplicate rows: 0

Handling Missing Values

Identify missing values and decide on a strategy:

Remove: Delete rows/columns with missing values.
Fill: Impute missing values using the mean, median, mode, or constant value.
Interpolate: Use interpolation methods to estimate missing values.

In our dataset, only the discount and offers column has missing values. To handle missing values in the Discounts and Offers column, we need to decide on an appropriate strategy. Here are a few common approaches:

Fill with a default value: This could be ‘None’ or ‘0%’ indicating no discount.
Fill with the most frequent value: Use the mode of the column.

Given that Discounts and Offers is a categorical column, filling with a default value like ‘None’ might be the most straightforward and sensible approach if we assume that missing values imply no discount was applied. Here’s how to implement this strategy:

data['Discounts and Offers'].fillna('None', inplace=True)

Identify Outliers

Identify outliers using statistical methods (IQR, Z-score) or visualization (box plots). Then, decide on a strategy:

Remove: Delete rows with outliers.
Cap: Limit outliers to a specific range.
Transform: Use transformations like log or square root.

Please note that removing outliers is not a crucial step every time. Only remove outliers if they affect the interpretation of the data or a machine learning model. Here’s how to deal with outliers using Python:

numerical_columns = ['Order Value', 'Delivery Fee', 'Commission Fee', 'Payment Processing Fee', 'Refunds/Chargebacks']

def cap_outliers(series):
    lower_cap = series.quantile(0.01)
    upper_cap = series.quantile(0.99)
    return series.clip(lower_cap, upper_cap)

for col in numerical_columns:
    data[col] = cap_outliers(data[col])

Converting Data Types

Identify columns with incorrect data types and convert data types to appropriate formats (e.g., strings to dates, floats to integers). Here’s how to convert the date columns in our dataset to the datetime data type:

data['Order Date and Time'] = pd.to_datetime(data['Order Date and Time'])
data['Delivery Date and Time'] = pd.to_datetime(data['Delivery Date and Time'])

Handling Categorical Data

Identify categorical columns. If the problem demands, convert categorical data to numerical format using one-hot encoding or label encoding. Here’s how you can convert categorical data into a numerical format using Python:

data = pd.get_dummies(data, columns=['Payment Method'], drop_first=True)

Handling Inconsistent Data

Identify inconsistencies (e.g., different formats, typos) and correct inconsistencies manually or using predefined rules. In our dataset, the discount and offers column has data inconsistencies, which makes it a challenge to calculate the exact discount amount. Here’s how to handle the inconsistent data in the discount and offers column:

def extract_discount(discount_str):
    if 'off' in discount_str:
        # fixed amount off
        try:
            return float(discount_str.split(' ')[0]), 'fixed'
        except ValueError:
            return 0.0, 'none'
    elif '%' in discount_str:
        # percentage off
        try:
            return float(discount_str.split('%')[0]), 'percentage'
        except ValueError:
            return 0.0, 'none'
    else:
        # no discount
        return 0.0, 'none'

# apply the function to create new 'Discount Percentage' and 'Discount Type' columns
data['Discount Value'], data['Discount Type'] = zip(*data['Discounts and Offers'].apply(extract_discount))

# calculate the discount amount based on the order value and discount type
def calculate_discount_amount(row):
    if row['Discount Type'] == 'percentage':
        return row['Order Value'] * row['Discount Value'] / 100
    elif row['Discount Type'] == 'fixed':
        return row['Discount Value']
    else:
        return 0.0

data['Discount Amount'] = data.apply(calculate_discount_amount, axis=1)
print(data.head())

   Order ID Customer ID Restaurant ID Order Date and Time  \
0         1       C8270         R2924 2024-02-01 01:11:52   
1         2       C1860         R2054 2024-02-02 22:11:04   
2         3       C6390         R2870 2024-01-31 05:54:35   
3         4       C6191         R2642 2024-01-16 22:52:49   
4         5       C6734         R2799 2024-01-29 01:19:30   

  Delivery Date and Time  Order Value  Delivery Fee Discounts and Offers  \
0    2024-02-01 02:39:52         1914             0            5% on App   
1    2024-02-02 22:46:04          986            40                  10%   
2    2024-01-31 06:52:35          937            30         15% New User   
3    2024-01-16 23:38:49         1463            50                 None   
4    2024-01-29 02:48:30         1974            30         50 off Promo   

   Commission Fee  Payment Processing Fee  Refunds/Chargebacks  \
0           150.0                      47                    0   
1           198.0                      23                    0   
2           195.0                      45                    0   
3           146.0                      27                    0   
4           130.0                      50                    0   

   Payment Method_Credit Card  Payment Method_Digital Wallet  Discount Value  \
0                        True                          False             5.0   
1                       False                           True            10.0   
2                       False                          False            15.0   
3                       False                          False             0.0   
4                       False                          False            50.0   

  Discount Type  Discount Amount  
0    percentage            95.70  
1    percentage            98.60  
2    percentage           140.55  
3          none             0.00  
4         fixed            50.00

Summary

So, below are all the data cleaning methods you should know:

Removing Duplicates
Handling Missing Values
Dealing with Outliers
Converting Data Types
Handling Categorical Data
Dealing with Inconsistent Data

I hope you liked this article on data cleaning methods you should know and how to implement them using Python. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.