Data Preprocessing Techniques You Should Know

Data preprocessing is a critical step in the data analysis and machine learning workflow. It involves transforming raw data into a clean, understandable format suitable for analysis. It includes tasks like cleaning, transformation, normalization, and feature extraction. So, if you want to learn about the essential data preprocessing techniques, this article is for you. In this article, I’ll take you through the data preprocessing techniques you should know and how to implement them using Python.

Data Preprocessing Techniques You Should Know

Below are some of the data preprocessing techniques that you should know for any Data Science job:

  1. Handling Missing Values
  2. Handling Outliers
  3. Feature Selection
  4. Principal Component Analysis
  5. Feature Scaling
  6. Hyperparameter Tuning
  7. SMOTE

Let’s go through each of these data preprocessing techniques in detail.

Handling Missing Values

Missing values occur when no data is stored for certain observations within a variable. Missing data can arise due to various reasons, such as errors in data collection, non-response in surveys, or deliberate omission. Handling missing values is crucial as most machine learning algorithms do not support data with missing values.

You should address missing values before applying any machine learning model. The method of handling missing data depends on the nature of the data and the percentage of missing values. Below are some techniques to address missing values:

  • Imputation: Replacing missing values with statistical measures (mean, median, mode) for numerical data or a specific value for categorical data.
  • Prediction Models: Using algorithms like k-Nearest Neighbors or regression models to predict and fill in missing values.
  • Deletion: Removing records with missing values, which is only advisable when the number of such records is relatively small.

Here’s an example of a Python function to replace missing values using the imputation method (using mean):

from sklearn.impute import SimpleImputer
import pandas as pd

def impute_missing_values(data, strategy='mean', fill_value=None):
    # strategy can be 'mean', 'median', 'most_frequent', or 'constant'.
    if strategy == 'constant' and fill_value is None:
        raise ValueError("fill_value must be specified for strategy='constant'")

    imputer = SimpleImputer(strategy=strategy, fill_value=fill_value)
    return pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
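The prediction-based approach mentioned above can be sketched in a similar way with scikit-learn’s KNNImputer, which fills each missing value from the most similar rows; the function name and the default of 5 neighbours are my own choices here:

from sklearn.impute import KNNImputer
import pandas as pd

def impute_missing_values_knn(data, n_neighbors=5):
    # Each missing value is filled with the average of that feature
    # across the n_neighbors most similar rows.
    imputer = KNNImputer(n_neighbors=n_neighbors)
    return pd.DataFrame(imputer.fit_transform(data), columns=data.columns)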

You can learn more about handling missing values in detail here.

Handling Outliers

Outliers are data points that deviate significantly from the rest of the data. They can occur due to variability in measurement or experimental errors. Outliers can distort statistical analyses and models.

Identifying and handling outliers before training a model is crucial, especially for algorithms sensitive to outlier values, such as linear regression. Below are some techniques you can use to address outliers in your dataset:

  • Z-score: Identifying outliers as data points that lie more than a certain number of standard deviations away from the mean.
  • IQR (Interquartile Range) Method: Removing data points that lie more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile.

Here’s an example of a Python function to remove outliers using the IQR method:

def handle_outliers_iqr(data, threshold=1.5):
    # Compute the interquartile range of each column.
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - (threshold * IQR)
    upper_bound = Q3 + (threshold * IQR)

    # Keep only the rows where every value lies within the bounds.
    return data[~((data < lower_bound) | (data > upper_bound)).any(axis=1)]
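And here is a minimal sketch of the Z-score approach, assuming a numerical pandas DataFrame; the function name and the cut-off of 3 standard deviations are illustrative choices:

def handle_outliers_zscore(data, threshold=3):
    # Standardize each column and keep only the rows whose values
    # all lie within `threshold` standard deviations of the mean.
    z_scores = (data - data.mean()) / data.std()
    return data[(z_scores.abs() < threshold).all(axis=1)]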

You can learn more about outlier detection in detail here.

Feature Selection

Feature selection involves selecting a subset of relevant features (variables, predictors) to use for model construction. It can reduce overfitting, improve accuracy, and shorten training time.

Use feature selection when you have a dataset with a large number of features, some of which might be irrelevant or redundant for predicting the output variable. Below are some recommended techniques for selecting features:

  • Filter Methods: Use statistical techniques to evaluate the relationship between each feature and the target variable (e.g., correlation coefficient, Chi-square test).
  • Wrapper Methods: Use an algorithm to search for the best combination of features (e.g., forward selection, backward elimination, recursive feature elimination).
  • Embedded Methods: Algorithms that perform feature selection as part of the model training process (e.g., LASSO regression).

Here’s an example of a Python function for feature selection using the wrapper method (recursive feature elimination):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def select_features_rfe(data, target, n_features_to_select=5):
    # Recursively drop the weakest features until only
    # n_features_to_select remain.
    model = LogisticRegression(solver='liblinear')
    rfe = RFE(model, n_features_to_select=n_features_to_select)
    fit = rfe.fit(data, target)

    selected_features = [f for f, s in zip(data.columns, fit.support_) if s]
    return selected_features
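For comparison, here’s a quick sketch of a filter method using scikit-learn’s SelectKBest with the Chi-square test (the Chi-square test assumes non-negative features, and the function name is my own):

from sklearn.feature_selection import SelectKBest, chi2

def select_features_chi2(data, target, k=5):
    # Score every feature against the target with the Chi-square test
    # and keep the k highest-scoring features.
    selector = SelectKBest(score_func=chi2, k=k)
    selector.fit(data, target)
    return [f for f, s in zip(data.columns, selector.get_support()) if s]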

One of my favourite techniques to select features for Machine Learning models is to perform EDA step by step to find the relationship of each feature with the target variable. You can learn more about it here.

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that transforms a large set of variables into a smaller one that still contains most of the information in the large set. It’s used to simplify the dataset, improve visualization, or improve model performance by reducing overfitting.

Apply PCA when you have a high-dimensional dataset, and you want to reduce the number of variables without losing much information. Here’s an example of a Python function to implement PCA:

from sklearn.decomposition import PCA
import pandas as pd

def apply_pca(data, n_components=2):
    # Project the data onto the first n_components principal components.
    pca = PCA(n_components=n_components)
    principal_components = pca.fit_transform(data)
    return pd.DataFrame(data=principal_components, columns=[f'PC{i+1}' for i in range(n_components)])
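To decide how many components to keep, you can look at how much of the original variance the components explain; here’s a brief sketch (keeping enough components for roughly 95% of the variance is a common rule of thumb, not a fixed rule):

from sklearn.decomposition import PCA

def explained_variance_report(data):
    # Fit PCA with all components and return the cumulative share of
    # variance explained, e.g. to see how many components retain ~95%.
    pca = PCA()
    pca.fit(data)
    return pca.explained_variance_ratio_.cumsum()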

Feature Scaling

Feature scaling involves standardizing the range of independent variables or features of data. In the absence of scaling, machine learning algorithms can behave poorly or converge slowly.

Scaling is necessary for algorithms that compute distances between data points, such as k-Nearest Neighbors, and for optimization algorithms like gradient descent. Below are some techniques used for feature scaling:

  • Normalization (Min-Max Scaling): Transforms features to be on a similar scale by converting values to a range between 0 and 1.
  • Standardization (Z-score Normalization): Transforms features to have a mean of 0 and a standard deviation of 1.

Here’s an example of a Python function to implement feature scaling using the standardization method:

from sklearn.preprocessing import StandardScaler
import pandas as pd

def standardize_features(data):
    # Rescale every feature to zero mean and unit standard deviation.
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)
    return pd.DataFrame(scaled_data, columns=data.columns)
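If you prefer normalization (Min-Max scaling) instead, the sketch is almost identical, using scikit-learn’s MinMaxScaler (the function name is my own):

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

def normalize_features(data):
    # Rescale every feature to the [0, 1] range.
    scaler = MinMaxScaler()
    scaled_data = scaler.fit_transform(data)
    return pd.DataFrame(scaled_data, columns=data.columns)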

You can learn more about feature scaling in detail here.

Hyperparameter Tuning

Hyperparameter tuning involves finding the combination of hyperparameters for a learning algorithm that performs the best under a specific performance metric. Hyperparameters are the configuration settings used to structure the learning process and must be set before training the model.

Use hyperparameter tuning to improve model performance by optimizing the learning process. It applies to nearly all machine learning algorithms. Below are some techniques you can use for hyperparameter tuning:

  • Grid Search: Exhaustively searches through a manually specified subset of the hyperparameter space.
  • Random Search: Randomly searches through the hyperparameter space, providing a more efficient and less exhaustive search method.
  • Bayesian Optimization: Uses a probabilistic model to guide the search for the best hyperparameters.

Here’s an example of a Python function to implement hyperparameter tuning using the grid search method:

from sklearn.model_selection import GridSearchCV

def tune_hyperparameters(model, param_grid, X_train, y_train, cv=5):
    # Try every combination in param_grid with cv-fold cross-validation
    # and return the best-scoring combination.
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv)
    grid_search.fit(X_train, y_train)
    return grid_search.best_params_
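A similar sketch for random search uses scikit-learn’s RandomizedSearchCV, which samples a fixed number of candidate combinations instead of trying them all (the n_iter value and random_state here are illustrative):

from sklearn.model_selection import RandomizedSearchCV

def tune_hyperparameters_random(model, param_distributions, X_train, y_train, cv=5, n_iter=20):
    # Sample n_iter random combinations from param_distributions and
    # keep the combination with the best cross-validated score.
    random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions,
                                       n_iter=n_iter, cv=cv, random_state=42)
    random_search.fit(X_train, y_train)
    return random_search.best_params_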

You can learn more about hyperparameter tuning in detail here.

SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is an oversampling technique used to address class imbalance by creating synthetic examples of the minority class. It helps improve model performance on imbalanced datasets by balancing the class distribution.

SMOTE is particularly useful when dealing with highly imbalanced datasets, where the number of instances in one class significantly outnumbers the instances in the other classes. Here’s an example of a Python function to implement SMOTE to address class imbalance:

from imblearn.over_sampling import SMOTE

def apply_smote(X, y):
    # Generate synthetic minority-class samples until the classes are balanced.
    smote = SMOTE()
    X_res, y_res = smote.fit_resample(X, y)
    return X_res, y_res
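A quick way to check that it worked is to compare the class distribution before and after resampling; here’s a hypothetical usage example with a synthetic 90/10 imbalanced dataset:

from sklearn.datasets import make_classification
import pandas as pd

# Hypothetical example: a synthetic dataset with roughly 90% of one class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_res, y_res = apply_smote(X, y)

print(pd.Series(y).value_counts())      # before: heavily imbalanced
print(pd.Series(y_res).value_counts())  # after: roughly balanced classes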

You can learn more about class imbalance and SMOTE in detail here.

Summary

So, these are the essential data preprocessing techniques you should know. Data preprocessing is a critical step in the data analysis and machine learning workflow: it transforms raw data into a clean, understandable format suitable for analysis, through tasks like cleaning, transformation, normalization, and feature extraction.

I hope you liked this article on the data preprocessing techniques you should know. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.
