A Data Scientist’s day often starts and ends with data cleaning. Because cleaning is so central to the data pipeline, many practical Data Science interview questions focus on its concepts. So, if you’re preparing for Data Science interviews and want to master essential data cleaning techniques, this article is for you. In it, I’ll walk you through the essential data cleaning concepts asked in Data Science interviews, with example questions and their solutions.
Data Cleaning Concepts for Interviews
Here are must-know data cleaning concepts for Data Science interviews, each explained in detail with an example question and its solution in Python.
Handling Missing Data with Advanced Imputation Techniques
Missing data can skew analyses and reduce model accuracy. Advanced imputation techniques, such as K-Nearest Neighbors (KNN) imputation, regression imputation, and iterative imputation, help in filling in missing values by considering patterns within the data rather than simple mean or median substitution.
Example Question: You have a dataset with missing values in the age column. Use KNN imputation to fill in the missing age values based on similarities in other columns like income and education level.
Here’s how to solve this problem using Python:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
# sample data
data = {'age': [25, np.nan, 28, 32, np.nan, 30],
        'income': [50000, 55000, 62000, 60000, 64000, 59000],
        'education_level': [1, 2, 2, 3, 3, 2]}
df = pd.DataFrame(data)
# KNN imputer for missing age values
imputer = KNNImputer(n_neighbors=2)
df[['age', 'income', 'education_level']] = imputer.fit_transform(
    df[['age', 'income', 'education_level']])
print(df)
    age   income  education_level
0  25.0  50000.0              1.0
1  31.0  55000.0              2.0
2  28.0  62000.0              2.0
3  32.0  60000.0              3.0
4  30.0  64000.0              3.0
5  30.0  59000.0              2.0
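Iterative imputation, also mentioned above, models each column with missing values as a function of the other columns and refines the estimates over several rounds. Here is a minimal sketch using scikit-learn's IterativeImputer on the same sample data; note that this estimator is still marked experimental and must be enabled explicitly before import:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# same sample data as above
data = {'age': [25, np.nan, 28, 32, np.nan, 30],
        'income': [50000, 55000, 62000, 60000, 64000, 59000],
        'education_level': [1, 2, 2, 3, 3, 2]}
df = pd.DataFrame(data)

# regress each column with missing values on the others, iterating until convergence
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)

Unlike KNN imputation, which averages the nearest neighbors, this approach fits a regression model per column, so it can capture linear relationships between age, income, and education level even for rows with no close neighbors.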
Outlier Detection and Treatment
Outliers can distort analysis, which makes it essential to detect and handle them appropriately. Common techniques include using the Interquartile Range (IQR), Z-score, or more advanced methods like Isolation Forests for multivariate outlier detection.
Example Question: You have a dataset of monthly incomes where some entries appear to be outliers. Use the IQR method to detect and handle these outliers by capping them within a specified range.
Here’s how to solve this problem using Python:
# sample income data with outliers
income_data = {'monthly_income': [3000, 3200, 3100, 15000, 2800, 2700, 3400, 2500, 35000]}
df = pd.DataFrame(income_data)
# calculate IQR
Q1 = df['monthly_income'].quantile(0.25)
Q3 = df['monthly_income'].quantile(0.75)
IQR = Q3 - Q1
# define bounds and cap outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df['monthly_income'] = np.where(
    df['monthly_income'] > upper_bound, upper_bound,
    np.where(df['monthly_income'] < lower_bound, lower_bound, df['monthly_income']))
print(df)
   monthly_income
0          3000.0
1          3200.0
2          3100.0
3          4300.0
4          2800.0
5          2700.0
6          3400.0
7          2500.0
8          4300.0
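The Z-score method mentioned above flags values that lie more than a chosen number of standard deviations from the mean. A quick sketch on the same income data (the threshold of 2 is an arbitrary choice for illustration):

import numpy as np
import pandas as pd

income_data = {'monthly_income': [3000, 3200, 3100, 15000, 2800, 2700, 3400, 2500, 35000]}
df = pd.DataFrame(income_data)

# z-score: how many standard deviations each value sits from the mean
z = (df['monthly_income'] - df['monthly_income'].mean()) / df['monthly_income'].std()
outliers = df[np.abs(z) > 2]
print(outliers)

Notice that on this small, heavily skewed sample only the 35000 entry is flagged: the extreme value inflates the standard deviation enough to mask the milder 15000 outlier. This masking effect is one reason the IQR method above is often preferred for skewed data.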
Handling Duplicates with Fuzzy Matching
Duplicate records can affect analysis and lead to bias. While exact duplicates are easy to remove, real-world data may have near-duplicates. Fuzzy matching, using libraries like fuzzywuzzy, helps identify and clean near-duplicate records by comparing string similarity.
Example Question: You have a dataset with customer names that may contain near-duplicates due to typos (e.g., “John Doe” and “Jon Doe”). Use fuzzy matching to identify these near-duplicates and retain only one unique entry per group.
Here’s how to solve this problem using Python:
# install fuzzywuzzy: pip install fuzzywuzzy (the project is now maintained as thefuzz)
from fuzzywuzzy import fuzz, process
# sample data with similar names
data = {'customer_name': ['John Doe', 'Jon Doe', 'Jane Doe', 'Janet Doe', 'Jake Doe']}
df = pd.DataFrame(data)
# identify potential duplicates using fuzzy matching
unique_names = []
for name in df['customer_name']:
    match = process.extractOne(name, unique_names, scorer=fuzz.token_sort_ratio)
    if match and match[1] > 85:  # similarity threshold
        print(f"Duplicate found: {name} -> {match[0]}")
    else:
        unique_names.append(name)
print("Unique names:", unique_names)
Duplicate found: Jon Doe -> John Doe
Duplicate found: Janet Doe -> Jane Doe
Duplicate found: Jake Doe -> Jane Doe
Unique names: ['John Doe', 'Jane Doe']
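For comparison, the exact-duplicate case mentioned above needs no fuzzy matching at all; pandas handles it directly with drop_duplicates:

import pandas as pd

df = pd.DataFrame({'customer_name': ['John Doe', 'John Doe', 'Jane Doe']})
# keep only the first occurrence of each exact duplicate
deduped = df.drop_duplicates(subset='customer_name', keep='first')
print(deduped)

In practice, a common pattern is to remove exact duplicates first, then run fuzzy matching on the much smaller set of remaining names.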
Text Data Cleaning and Normalization
Text data often requires cleaning and normalization, including removing special characters, converting text to lowercase, and handling contractions or slang terms. This preparation is crucial for Natural Language Processing (NLP) tasks.
Example Question: You have a dataset with user reviews. The text contains various issues like inconsistent capitalization, special characters, and extra whitespace. Write a function to clean and normalize this text data for analysis.
Here’s how to solve this problem using Python:
import re
# sample text data
reviews = ["Great Product!!", " loved it... would buy again!!!", "Not BAD, but could be BETTER :)", "awesome!!!"]
# function to clean and normalize text
def clean_text(text):
    text = text.lower()                       # convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text)      # remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace
    return text
# apply cleaning function
cleaned_reviews = [clean_text(review) for review in reviews]
print(cleaned_reviews)
['great product', 'loved it would buy again', 'not bad but could be better', 'awesome']
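Contractions, mentioned above, are best expanded before punctuation is stripped, since removing the apostrophe first would turn "don't" into "dont". A simple lookup-table sketch (the mapping below is a small illustrative subset, not an exhaustive list):

import re

# small illustrative mapping; real pipelines use a fuller dictionary or a library
contractions = {"don't": "do not", "can't": "cannot", "it's": "it is", "won't": "will not"}

def expand_contractions(text):
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in contractions) + r")\b")
    return pattern.sub(lambda m: contractions[m.group(1)], text.lower())

print(expand_contractions("It's great, but I don't love the price"))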
Scaling and Normalizing Numeric Data
Scaling and normalization are essential for models that are sensitive to feature ranges, especially distance-based models. Min-Max scaling rescales each feature to fit within a specified range, typically 0 to 1. Z-score standardization (often loosely called normalization) instead centers each feature around its mean and scales it to unit variance. Both techniques are crucial for preparing data, yet they serve slightly different purposes: the choice determines whether features share a bounded range or share a common center and spread.
Example Question: You have a dataset with numerical features of varying scales (e.g., age and income). Apply Min-Max scaling to transform each feature to a range of 0 to 1.
Here’s how to solve this problem using Python:
from sklearn.preprocessing import MinMaxScaler
# sample data
data = {'age': [20, 30, 40, 50, 60], 'income': [20000, 30000, 50000, 80000, 120000]}
df = pd.DataFrame(data)
# apply min-max scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df)
    age  income
0  0.00     0.0
1  0.25     0.1
2  0.50     0.3
3  0.75     0.6
4  1.00     1.0
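Z-score standardization, mentioned above, follows the same scikit-learn pattern with StandardScaler, which centers each feature at 0 with unit variance:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# same sample data as above
data = {'age': [20, 30, 40, 50, 60], 'income': [20000, 30000, 50000, 80000, 120000]}
df = pd.DataFrame(data)

# subtract each column's mean, then divide by its standard deviation
scaler = StandardScaler()
standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(standardized)

Unlike Min-Max scaling, the result is not bounded to [0, 1], but each feature now has mean 0 and unit variance, which suits models that assume centered inputs.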
Summary
Data cleaning is a foundational skill for any Data Scientist. It directly impacts the accuracy, interpretability, and quality of insights derived from data. I hope you liked this article on data cleaning concepts asked in Data Science interviews. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.





