SQL Data Cleaning Methods

Data cleaning is an essential step in the process of solving a Data Science problem to ensure the data’s quality and reliability. If you are learning Data Science and want to understand how to clean your data using SQL, this article is for you. In this article, I’ll take you through a detailed practical guide on SQL data cleaning methods you should know for Data Science.

SQL Data Cleaning Methods

Below are some essential SQL data cleaning methods you should know for Data Science:

Identifying Missing Data
Removing Duplicate Records
Handling Missing Data
Standardizing Data
Correcting Data Entry Errors

Let’s understand all these SQL data cleaning methods in detail one by one.

Identifying Missing Data

Before cleaning, you must identify missing or NULL values. You can use the IS NULL condition to find missing data. For example, look at the table given below:

SaleID	ProductName	SaleAmount	SaleDate
1	Laptop	1000	2024-08-01
2	NULL	500	2024-08-02
3	Phone	NULL	2024-08-03
4	Tablet	300	NULL

Table: Sales

Here’s how to identify missing data from the above table:

SELECT *
FROM Sales
WHERE ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL;

SaleID	ProductName	SaleAmount	SaleDate
2	NULL	500	2024-08-02
3	Phone	NULL	2024-08-03
4	Tablet	300	NULL

Output

Removing Duplicate Records

Duplicate records can skew analysis. Use DISTINCT or ROW_NUMBER() with PARTITION BY to identify and remove duplicates. For example, look at the table given below:

OrderID	CustomerID	OrderDate
1	101	2024-08-01
2	102	2024-08-02
3	101	2024-08-01
4	103	2024-08-03

Table: Orders

Here’s how to identify duplicate records from the table:

SELECT OrderID, CustomerID, OrderDate,
       COUNT(*) OVER (PARTITION BY CustomerID, OrderDate) AS DuplicateCount
FROM Orders;

OrderID	CustomerID	OrderDate	DuplicateCount
1	101	2024-08-01	2
3	101	2024-08-01	2
2	102	2024-08-02	1
4	103	2024-08-03	1

Output

And, here’s how to remove the duplicate records:

WITH CTE AS (
    SELECT OrderID, CustomerID, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate ORDER BY OrderID) AS RowNum
    FROM Orders
)
DELETE FROM Orders
WHERE OrderID IN (SELECT OrderID FROM CTE WHERE RowNum > 1);

OrderID	CustomerID	OrderDate
1	101	2024-08-01
2	102	2024-08-02
4	103	2024-08-03

Output

Handling Missing Data

There are several strategies for handling missing data, including deletion, imputation, or replacement with default values. Look at the table given below, it has missing values in all the rows except the first row:

SaleID	ProductName	SaleAmount	SaleDate
1	Laptop	1000	2024-08-01
2	NULL	500	2024-08-02
3	Phone	NULL	2024-08-03
4	Tablet	300	NULL

Table: Sales

Here’s how to delete rows with missing data:

DELETE FROM Sales
WHERE ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL;

SaleID	ProductName	SaleAmount	SaleDate
1	Laptop	1000	2024-08-01

Output

And if you want to replace missing values with a default value, here’s an example:

UPDATE Sales
SET ProductName = 'Unknown'
WHERE ProductName IS NULL;

UPDATE Sales
SET SaleAmount = 0
WHERE SaleAmount IS NULL;

UPDATE Sales
SET SaleDate = '2024-01-01'
WHERE SaleDate IS NULL;

SaleID	ProductName	SaleAmount	SaleDate
1	Laptop	1000	2024-08-01
2	Unknown	500	2024-08-02
3	Phone	0	2024-08-03
4	Tablet	300	2024-01-01

Output

Standardizing Data

Standardizing data involves converting data to a consistent format. For example, converting all text to lowercase or uppercase. Look at the example table below, the format of the country column is inconsistent:

CustomerID	CustomerName	Country
1	John Doe	USA
2	jane SMITH	usA
3	Michael Brown	Canada
4	linda JOHNSON	CANADA

Table: Customers

Here’s how to correct the inconsistencies in our data:

UPDATE Customers
SET Country = UPPER(Country);

CustomerID	CustomerName	Country
1	John Doe	USA
2	jane SMITH	USA
3	Michael Brown	CANADA
4	linda JOHNSON	CANADA

Output

Correcting Data Entry Errors

Data entry errors, like misspelt words or incorrect values, can be corrected using SQL. For example, look at the table given below, the spelling of Phone is incorrect in the table:

ProductID	ProductName	Price
1	Laptop	1000
2	Phonne	500
3	Tablet	300
4	Desktop	800

Table: Products

Here’s how you can correct it:

UPDATE Products
SET ProductName = 'Phone'
WHERE ProductName = 'Phonne';

ProductID	ProductName	Price
1	Laptop	1000
2	Phone	500
3	Tablet	300
4	Desktop	800

Output

Summary

Below are some essential SQL data cleaning methods you should know for Data Science:

Identifying Missing Data: Use IS NULL to find missing data.
Removing Duplicate Records: Use DISTINCT or ROW_NUMBER() to identify and remove duplicates.
Handling Missing Data: Delete, impute, or replace missing data with default values.
Standardizing Data: Convert data to a consistent format.
Correcting Data Entry Errors: Use UPDATE to correct misspelled words or incorrect values.

I hope you liked this article on SQL data cleaning methods you should know for Data Science. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.