SQL Data Cleaning Methods

Data cleaning is an essential step in solving any Data Science problem because it ensures the quality and reliability of your data. If you are learning Data Science and want to understand how to clean your data using SQL, this article is for you. In this article, I’ll take you through a practical guide to the SQL data cleaning methods you should know for Data Science.

SQL Data Cleaning Methods

Below are some essential SQL data cleaning methods you should know for Data Science:

  1. Identifying Missing Data
  2. Removing Duplicate Records
  3. Handling Missing Data
  4. Standardizing Data
  5. Correcting Data Entry Errors

Let’s understand all these SQL data cleaning methods in detail one by one.

Identifying Missing Data

Before cleaning, you must identify missing or NULL values. You can use the IS NULL condition to find missing data. In standard SQL, a comparison such as ProductName = NULL never matches, so IS NULL is required. For example, look at the table given below:

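| SaleID | ProductName | SaleAmount | SaleDate   |
|--------|-------------|------------|------------|
| 1      | Laptop      | 1000       | 2024-08-01 |
| 2      | NULL        | 500        | 2024-08-02 |
| 3      | Phone       | NULL       | 2024-08-03 |
| 4      | Tablet      | 300        | NULL       |

Table: Sales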

Here’s how to identify missing data from the above table:

SELECT *
FROM Sales
WHERE ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL;
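
| SaleID | ProductName | SaleAmount | SaleDate   |
|--------|-------------|------------|------------|
| 2      | NULL        | 500        | 2024-08-02 |
| 3      | Phone       | NULL       | 2024-08-03 |
| 4      | Tablet      | 300        | NULL       |

Output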
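
Beyond listing the affected rows, it can help to know how many values are missing in each column. Here is a minimal sketch that runs the SQL through Python's built-in sqlite3 module on an in-memory copy of the Sales table. The column types and the alias names (MissingProductName and so on) are my own assumptions for illustration:

```python
import sqlite3

# In-memory copy of the Sales table from the example (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (SaleID INTEGER PRIMARY KEY, ProductName TEXT, "
    "SaleAmount REAL, SaleDate TEXT)"
)
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [
        (1, "Laptop", 1000, "2024-08-01"),
        (2, None, 500, "2024-08-02"),
        (3, "Phone", None, "2024-08-03"),
        (4, "Tablet", 300, None),
    ],
)

# COUNT(*) counts every row, while COUNT(column) skips NULLs,
# so the difference is the number of missing values in that column.
null_counts = conn.execute(
    """
    SELECT COUNT(*) - COUNT(ProductName) AS MissingProductName,
           COUNT(*) - COUNT(SaleAmount)  AS MissingSaleAmount,
           COUNT(*) - COUNT(SaleDate)    AS MissingSaleDate
    FROM Sales
    """
).fetchone()

print(null_counts)  # (1, 1, 1)
```

A per-column count like this is a quick way to decide whether a column is worth repairing or is too sparse to use.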

Removing Duplicate Records

Duplicate records can skew analysis. Use DISTINCT to return only unique rows in a query, or ROW_NUMBER() with PARTITION BY to identify and remove duplicates from the table itself. For example, look at the table given below:

| OrderID | CustomerID | OrderDate  |
|---------|------------|------------|
| 1       | 101        | 2024-08-01 |
| 2       | 102        | 2024-08-02 |
| 3       | 101        | 2024-08-01 |
| 4       | 103        | 2024-08-03 |

Table: Orders

Here’s how to identify duplicate records from the table, treating rows with the same CustomerID and OrderDate as duplicates:

SELECT OrderID, CustomerID, OrderDate,
       COUNT(*) OVER (PARTITION BY CustomerID, OrderDate) AS DuplicateCount
FROM Orders
ORDER BY CustomerID, OrderID;
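
| OrderID | CustomerID | OrderDate  | DuplicateCount |
|---------|------------|------------|----------------|
| 1       | 101        | 2024-08-01 | 2              |
| 3       | 101        | 2024-08-01 | 2              |
| 2       | 102        | 2024-08-02 | 1              |
| 4       | 103        | 2024-08-03 | 1              |

Output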
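
If you only want to see the groups that are actually duplicated, GROUP BY with HAVING is a common alternative to the window function. The sketch below is not part of the original example; it runs on an in-memory SQLite copy of the Orders table, with column types assumed:

```python
import sqlite3

# In-memory copy of the Orders table from the example (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, "
    "CustomerID INTEGER, OrderDate TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        (1, 101, "2024-08-01"),
        (2, 102, "2024-08-02"),
        (3, 101, "2024-08-01"),
        (4, 103, "2024-08-03"),
    ],
)

# Keep only the (CustomerID, OrderDate) groups that occur more than once.
duplicate_groups = conn.execute(
    """
    SELECT CustomerID, OrderDate, COUNT(*) AS DuplicateCount
    FROM Orders
    GROUP BY CustomerID, OrderDate
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicate_groups)  # [(101, '2024-08-01', 2)]
```

This returns one row per duplicated group instead of one row per order, which is easier to scan on large tables.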

And here’s how to remove the duplicate records, keeping the row with the lowest OrderID in each group:

WITH CTE AS (
    SELECT OrderID, CustomerID, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate ORDER BY OrderID) AS RowNum
    FROM Orders
)
DELETE FROM Orders
WHERE OrderID IN (SELECT OrderID FROM CTE WHERE RowNum > 1);

| OrderID | CustomerID | OrderDate  |
|---------|------------|------------|
| 1       | 101        | 2024-08-01 |
| 2       | 102        | 2024-08-02 |
| 4       | 103        | 2024-08-03 |

Output
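
DISTINCT is the lighter option when you only need unique rows in a query result and do not want to delete anything. Here is a small sketch, again using an in-memory SQLite copy of the Orders table with assumed column types:

```python
import sqlite3

# In-memory copy of the Orders table from the example (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, "
    "CustomerID INTEGER, OrderDate TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        (1, 101, "2024-08-01"),
        (2, 102, "2024-08-02"),
        (3, 101, "2024-08-01"),
        (4, 103, "2024-08-03"),
    ],
)

# OrderID is left out on purpose: it is unique per row, so including it
# would make every row distinct and nothing would be collapsed.
unique_orders = conn.execute(
    """
    SELECT DISTINCT CustomerID, OrderDate
    FROM Orders
    ORDER BY CustomerID
    """
).fetchall()

print(unique_orders)
# [(101, '2024-08-01'), (102, '2024-08-02'), (103, '2024-08-03')]
```

The Orders table itself is unchanged here; DISTINCT only affects what the query returns.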

Handling Missing Data

There are several strategies for handling missing data, including deletion, imputation, or replacement with default values. Look at the table given below. Every row except the first has a missing value:

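| SaleID | ProductName | SaleAmount | SaleDate   |
|--------|-------------|------------|------------|
| 1      | Laptop      | 1000       | 2024-08-01 |
| 2      | NULL        | 500        | 2024-08-02 |
| 3      | Phone       | NULL       | 2024-08-03 |
| 4      | Tablet      | 300        | NULL       |

Table: Sales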

Here’s how to delete rows with missing data:

DELETE FROM Sales
WHERE ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL;

| SaleID | ProductName | SaleAmount | SaleDate   |
|--------|-------------|------------|------------|
| 1      | Laptop      | 1000       | 2024-08-01 |

Output
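
Deleting here removes three of the four rows, so it is worth checking how much data a DELETE would remove before running it. The following is a minimal sketch of that check on an in-memory SQLite copy of the Sales table. The column types and the variable names are assumptions for illustration:

```python
import sqlite3

# In-memory copy of the Sales table from the example (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (SaleID INTEGER PRIMARY KEY, ProductName TEXT, "
    "SaleAmount REAL, SaleDate TEXT)"
)
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [
        (1, "Laptop", 1000, "2024-08-01"),
        (2, None, 500, "2024-08-02"),
        (3, "Phone", None, "2024-08-03"),
        (4, "Tablet", 300, None),
    ],
)

# Same condition as the DELETE statement, reused in a read-only COUNT.
missing = "ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL"

total_rows = conn.execute("SELECT COUNT(*) FROM Sales").fetchone()[0]
rows_to_delete = conn.execute(
    f"SELECT COUNT(*) FROM Sales WHERE {missing}"
).fetchone()[0]

print(f"{rows_to_delete} of {total_rows} rows would be deleted")
# 3 of 4 rows would be deleted
```

When the share of affected rows is this high, replacing or imputing the missing values is usually a better choice than deletion.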

And if you want to replace missing values with a default value instead, here’s an example that starts again from the original Sales table:

UPDATE Sales
SET ProductName = 'Unknown'
WHERE ProductName IS NULL;

UPDATE Sales
SET SaleAmount = 0
WHERE SaleAmount IS NULL;

UPDATE Sales
SET SaleDate = '2024-01-01'
WHERE SaleDate IS NULL;
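
| SaleID | ProductName | SaleAmount | SaleDate   |
|--------|-------------|------------|------------|
| 1      | Laptop      | 1000       | 2024-08-01 |
| 2      | Unknown     | 500        | 2024-08-02 |
| 3      | Phone       | 0          | 2024-08-03 |
| 4      | Tablet      | 300        | 2024-01-01 |

Output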
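
Imputation is the third strategy mentioned above. One simple form is to fill a missing number with the average of the known values. The sketch below shows this on an in-memory SQLite copy of the original Sales table. Using the mean is my assumption here, not a rule, and the column types are assumed too:

```python
import sqlite3

# In-memory copy of the original Sales table (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (SaleID INTEGER PRIMARY KEY, ProductName TEXT, "
    "SaleAmount REAL, SaleDate TEXT)"
)
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [
        (1, "Laptop", 1000, "2024-08-01"),
        (2, None, 500, "2024-08-02"),
        (3, "Phone", None, "2024-08-03"),
        (4, "Tablet", 300, None),
    ],
)

# AVG() ignores NULLs, so the average is (1000 + 500 + 300) / 3 = 600.
conn.execute(
    """
    UPDATE Sales
    SET SaleAmount = (SELECT AVG(SaleAmount) FROM Sales)
    WHERE SaleAmount IS NULL
    """
)

imputed_amount = conn.execute(
    "SELECT SaleAmount FROM Sales WHERE SaleID = 3"
).fetchone()[0]

print(imputed_amount)  # 600.0
```

Unlike a default of 0, the average keeps the column's mean unchanged, which matters if SaleAmount feeds into later aggregates.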

Standardizing Data

Standardizing data means converting it to a consistent format, for example, converting all text to lowercase or uppercase. In the example table below, the format of the Country column is inconsistent:

| CustomerID | CustomerName  | Country |
|------------|---------------|---------|
| 1          | John Doe      | USA     |
| 2          | jane SMITH    | usA     |
| 3          | Michael Brown | Canada  |
| 4          | linda JOHNSON | CANADA  |

Table: Customers

Here’s how to correct the inconsistencies in the Country column:

UPDATE Customers
SET Country = UPPER(Country);

| CustomerID | CustomerName  | Country |
|------------|---------------|---------|
| 1          | John Doe      | USA     |
| 2          | jane SMITH    | USA     |
| 3          | Michael Brown | CANADA  |
| 4          | linda JOHNSON | CANADA  |

Output
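
A quick way to confirm that standardization worked is to count the distinct values before and after the update. Here is a minimal sketch on an in-memory SQLite copy of the Customers table. The TRIM() call is an extra assumption of mine to guard against stray spaces, which the sample data does not contain:

```python
import sqlite3

# In-memory copy of the Customers table from the example (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, "
    "CustomerName TEXT, Country TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [
        (1, "John Doe", "USA"),
        (2, "jane SMITH", "usA"),
        (3, "Michael Brown", "Canada"),
        (4, "linda JOHNSON", "CANADA"),
    ],
)

# SQLite compares text case-sensitively by default, so each spelling
# counts as its own value before the cleanup.
variants_before = conn.execute(
    "SELECT COUNT(DISTINCT Country) FROM Customers"
).fetchone()[0]

# Trim surrounding whitespace (assumed safeguard) and uppercase in one pass.
conn.execute("UPDATE Customers SET Country = UPPER(TRIM(Country))")

countries_after = conn.execute(
    """
    SELECT Country, COUNT(*) AS Customers
    FROM Customers
    GROUP BY Country
    ORDER BY Country
    """
).fetchall()

print(variants_before)  # 4
print(countries_after)  # [('CANADA', 2), ('USA', 2)]
```

Four spellings collapse into two countries, which is what any later GROUP BY on Country needs in order to give correct totals.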

Correcting Data Entry Errors

Data entry errors, like misspelled words or incorrect values, can be corrected using SQL. For example, in the table given below, Phone is misspelled as Phonne:

| ProductID | ProductName | Price |
|-----------|-------------|-------|
| 1         | Laptop      | 1000  |
| 2         | Phonne      | 500   |
| 3         | Tablet      | 300   |
| 4         | Desktop     | 800   |

Table: Products

Here’s how you can correct it:

UPDATE Products
SET ProductName = 'Phone'
WHERE ProductName = 'Phonne';

| ProductID | ProductName | Price |
|-----------|-------------|-------|
| 1         | Laptop      | 1000  |
| 2         | Phone       | 500   |
| 3         | Tablet      | 300   |
| 4         | Desktop     | 800   |

Output
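
When there are many misspellings, writing one UPDATE per error gets tedious. One possible approach is to keep the known fixes in a small lookup table and apply them in a single statement. In the sketch below, the NameCorrections table, its WrongName and RightName columns, and the 'Lapttop' entry are all hypothetical; only 'Phonne' comes from the example above:

```python
import sqlite3

# In-memory copy of the Products table from the example (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, "
    "ProductName TEXT, Price REAL)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [(1, "Laptop", 1000), (2, "Phonne", 500), (3, "Tablet", 300), (4, "Desktop", 800)],
)

# Hypothetical lookup table that maps each known misspelling to its fix.
conn.execute("CREATE TABLE NameCorrections (WrongName TEXT PRIMARY KEY, RightName TEXT)")
conn.executemany(
    "INSERT INTO NameCorrections VALUES (?, ?)",
    [("Phonne", "Phone"), ("Lapttop", "Laptop")],  # 'Lapttop' is a made-up entry
)

# Update only the rows whose name appears in the lookup table,
# pulling the corrected name with a correlated subquery.
conn.execute(
    """
    UPDATE Products
    SET ProductName = (
        SELECT RightName
        FROM NameCorrections
        WHERE WrongName = Products.ProductName
    )
    WHERE ProductName IN (SELECT WrongName FROM NameCorrections)
    """
)

product_names = [
    row[0]
    for row in conn.execute("SELECT ProductName FROM Products ORDER BY ProductID")
]

print(product_names)  # ['Laptop', 'Phone', 'Tablet', 'Desktop']
```

The WHERE clause matters: without it, every product that has no entry in the lookup table would have its name set to NULL.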

Summary

Here is a quick recap of the SQL data cleaning methods covered in this article:

  • Identifying Missing Data: Use IS NULL to find missing data.
  • Removing Duplicate Records: Use DISTINCT or ROW_NUMBER() to identify and remove duplicates.
  • Handling Missing Data: Delete, impute, or replace missing data with default values.
  • Standardizing Data: Convert data to a consistent format.
  • Correcting Data Entry Errors: Use UPDATE to correct misspelled words or incorrect values.

I hope you liked this article on SQL data cleaning methods you should know for Data Science.

Aman Kharwal