Data cleaning is an essential step in the process of solving a Data Science problem to ensure the data’s quality and reliability. If you are learning Data Science and want to understand how to clean your data using SQL, this article is for you. In this article, I’ll take you through a detailed practical guide on SQL data cleaning methods you should know for Data Science.
SQL Data Cleaning Methods
Below are some essential SQL data cleaning methods you should know for Data Science:
- Identifying Missing Data
- Removing Duplicate Records
- Handling Missing Data
- Standardizing Data
- Correcting Data Entry Errors
Let’s understand all these SQL data cleaning methods in detail one by one.
Identifying Missing Data
Before cleaning, you must identify missing or NULL values. You can use the IS NULL condition to find missing data. For example, look at the table given below:
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 1 | Laptop | 1000 | 2024-08-01 |
| 2 | NULL | 500 | 2024-08-02 |
| 3 | Phone | NULL | 2024-08-03 |
| 4 | Tablet | 300 | NULL |
Here’s how to identify missing data from the above table:
SELECT * FROM Sales WHERE ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL;
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 2 | NULL | 500 | 2024-08-02 |
| 3 | Phone | NULL | 2024-08-03 |
| 4 | Tablet | 300 | NULL |
Removing Duplicate Records
Duplicate records can skew analysis. Use DISTINCT or ROW_NUMBER() with PARTITION BY to identify and remove duplicates. For example, look at the table given below:
| OrderID | CustomerID | OrderDate |
|---|---|---|
| 1 | 101 | 2024-08-01 |
| 2 | 102 | 2024-08-02 |
| 3 | 101 | 2024-08-01 |
| 4 | 103 | 2024-08-03 |
Here’s how to identify duplicate records from the table:
SELECT OrderID, CustomerID, OrderDate,
COUNT(*) OVER (PARTITION BY CustomerID, OrderDate) AS DuplicateCount
FROM Orders;| OrderID | CustomerID | OrderDate | DuplicateCount |
|---|---|---|---|
| 1 | 101 | 2024-08-01 | 2 |
| 3 | 101 | 2024-08-01 | 2 |
| 2 | 102 | 2024-08-02 | 1 |
| 4 | 103 | 2024-08-03 | 1 |
And, here’s how to remove the duplicate records:
WITH CTE AS (
SELECT OrderID, CustomerID, OrderDate,
ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate ORDER BY OrderID) AS RowNum
FROM Orders
)
DELETE FROM Orders
WHERE OrderID IN (SELECT OrderID FROM CTE WHERE RowNum > 1);| OrderID | CustomerID | OrderDate |
|---|---|---|
| 1 | 101 | 2024-08-01 |
| 2 | 102 | 2024-08-02 |
| 4 | 103 | 2024-08-03 |
Handling Missing Data
There are several strategies for handling missing data, including deletion, imputation, or replacement with default values. Look at the table given below, it has missing values in all the rows except the first row:
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 1 | Laptop | 1000 | 2024-08-01 |
| 2 | NULL | 500 | 2024-08-02 |
| 3 | Phone | NULL | 2024-08-03 |
| 4 | Tablet | 300 | NULL |
Here’s how to delete rows with missing data:
DELETE FROM Sales WHERE ProductName IS NULL OR SaleAmount IS NULL OR SaleDate IS NULL;
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 1 | Laptop | 1000 | 2024-08-01 |
And if you want to replace missing values with a default value, here’s an example:
UPDATE Sales SET ProductName = 'Unknown' WHERE ProductName IS NULL; UPDATE Sales SET SaleAmount = 0 WHERE SaleAmount IS NULL; UPDATE Sales SET SaleDate = '2024-01-01' WHERE SaleDate IS NULL;
| SaleID | ProductName | SaleAmount | SaleDate |
|---|---|---|---|
| 1 | Laptop | 1000 | 2024-08-01 |
| 2 | Unknown | 500 | 2024-08-02 |
| 3 | Phone | 0 | 2024-08-03 |
| 4 | Tablet | 300 | 2024-01-01 |
Standardizing Data
Standardizing data involves converting data to a consistent format. For example, converting all text to lowercase or uppercase. Look at the example table below, the format of the country column is inconsistent:
| CustomerID | CustomerName | Country |
|---|---|---|
| 1 | John Doe | USA |
| 2 | jane SMITH | usA |
| 3 | Michael Brown | Canada |
| 4 | linda JOHNSON | CANADA |
Here’s how to correct the inconsistencies in our data:
UPDATE Customers SET Country = UPPER(Country);
| CustomerID | CustomerName | Country |
|---|---|---|
| 1 | John Doe | USA |
| 2 | jane SMITH | USA |
| 3 | Michael Brown | CANADA |
| 4 | linda JOHNSON | CANADA |
Correcting Data Entry Errors
Data entry errors, like misspelt words or incorrect values, can be corrected using SQL. For example, look at the table given below, the spelling of Phone is incorrect in the table:
| ProductID | ProductName | Price |
|---|---|---|
| 1 | Laptop | 1000 |
| 2 | Phonne | 500 |
| 3 | Tablet | 300 |
| 4 | Desktop | 800 |
Here’s how you can correct it:
UPDATE Products SET ProductName = 'Phone' WHERE ProductName = 'Phonne';
| ProductID | ProductName | Price |
|---|---|---|
| 1 | Laptop | 1000 |
| 2 | Phone | 500 |
| 3 | Tablet | 300 |
| 4 | Desktop | 800 |
Summary
Below are some essential SQL data cleaning methods you should know for Data Science:
- Identifying Missing Data: Use IS NULL to find missing data.
- Removing Duplicate Records: Use DISTINCT or ROW_NUMBER() to identify and remove duplicates.
- Handling Missing Data: Delete, impute, or replace missing data with default values.
- Standardizing Data: Convert data to a consistent format.
- Correcting Data Entry Errors: Use UPDATE to correct misspelled words or incorrect values.
I hope you liked this article on SQL data cleaning methods you should know for Data Science. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.





