Categorical Data Handling Techniques

Handling categorical data is a crucial step in data preprocessing before training Machine Learning models. One-hot encoding is the best-known technique, but it is not the only one, and it is not always the right choice. So, in this article, I’ll take you through the techniques you can use for categorical data handling and how to implement them using Python.

Categorical Data Handling Techniques

Here are the techniques used to handle categorical data that you should know:

Label Encoding
One-Hot Encoding
Ordinal Encoding

Let’s go through all these techniques with implementation using Python.

Label Encoding

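Label Encoding converts categorical labels into integer values. Each category gets a unique integer; scikit-learn’s LabelEncoder sorts the categories (alphabetically for strings) and numbers them from 0. The integers only identify the category and say nothing about rank, so label encoding fits nominal data better than ordinal data. Keep in mind that some models (linear models, for example) will still treat the numbers as if they were ordered. Also note that scikit-learn intends LabelEncoder for target labels (y) rather than input features, although it is often applied to a single feature column as shown here.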

Here’s an example of handling categorical data using label encoding with Python:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# sample data
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# using label encoder
le = LabelEncoder()

df['Color_encoded'] = le.fit_transform(df['Color'])

print(df)

   Color  Color_encoded
0    Red              2
1   Blue              0
2  Green              1
3   Blue              0
4    Red              2
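To check which integer stands for which category, you can inspect the fitted encoder. The following is a minimal sketch that assumes the same Color column as above; the mapping and decoded names are only illustrative, while classes_ and inverse_transform are part of scikit-learn’s LabelEncoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

le = LabelEncoder()
codes = le.fit_transform(df['Color'])

# classes_ holds the categories in sorted order; each position is the code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Because the codes follow the sorted order of the categories, Blue gets 0, Green gets 1 and Red gets 2, which matches the output table above.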

One-Hot Encoding

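One-Hot Encoding creates one binary column for each category, with a 1 in the column that matches the row’s category and 0 everywhere else. Use this method when the categorical feature is nominal (i.e., categories do not have a meaningful order) and the number of categories is small, because every extra category adds another column.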

Here’s an example of handling categorical data using one-hot encoding with Python:

from sklearn.preprocessing import OneHotEncoder

data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# using the OneHotEncoder
# (sparse_output replaced the older sparse argument in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)

encoded = encoder.fit_transform(df[['Color']])

df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Color']))

print(df_encoded)

   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         1.0          0.0        0.0
4         0.0          0.0        1.0
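For quick exploration, pandas can produce the same kind of columns without scikit-learn. This is a minimal sketch of that alternative, not a replacement for the OneHotEncoder example; the dummies and reduced names are only illustrative. Unlike a fitted OneHotEncoder, get_dummies does not remember the categories it saw, so it is less suited to encoding new data later:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# one indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df['Color'], prefix='Color', dtype=int)
print(dummies)

# drop_first=True drops the first category's column, which is redundant
# because it can be inferred from the remaining columns
reduced = pd.get_dummies(df['Color'], prefix='Color', drop_first=True, dtype=int)
print(list(reduced.columns))
```

Dropping one column is a common choice for linear models, where a full set of indicator columns is perfectly collinear with the intercept.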

Ordinal Encoding

Ordinal Encoding assigns each category an integer that reflects its position in the order of the categories. Use this method when the categorical feature has a meaningful order (ordinal data) and the number of categories is small enough to write the order down explicitly.

Here’s an example of handling categorical data using ordinal encoding:

data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)

# define the order of the categories
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}

# ordinal Encoding
df['Size_encoded'] = df['Size'].map(size_mapping)

print(df)

     Size  Size_encoded
0   Small             1
1  Medium             2
2   Large             3
3  Medium             2
4   Small             1
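The same idea can be expressed with scikit-learn’s OrdinalEncoder, which is convenient when the encoding has to be part of a preprocessing pipeline. This is a minimal sketch assuming the same Size column; note that it numbers the categories from 0 and returns floats, so the values differ from the manual mapping above, which starts at 1:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# categories lists the levels from lowest to highest for each column;
# without it, the encoder would fall back to sorted (alphabetical) order
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

# fit_transform expects a 2D input and returns a 2D float array
df['Size_encoded'] = encoder.fit_transform(df[['Size']]).ravel()

print(df)
```

Here Small becomes 0.0, Medium becomes 1.0 and Large becomes 2.0, so the relative order is the same as in the manual mapping.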

You may wonder what the difference is between label encoding and ordinal encoding. Label encoding assigns a unique integer to each category without regard to any order or ranking among the categories (the numbering is arbitrary, typically alphabetical), which makes it suitable for nominal data. In contrast, ordinal encoding is specifically used for ordinal data where categories have a meaningful order or ranking, such as low, medium, and high, and assigns integers that reflect this order.
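Running both approaches on the same ordered column makes the difference visible. This is a minimal sketch assuming the Size data and the size_mapping dictionary from the ordinal encoding example; the label_encoded and ordinal_encoded column names are only illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# LabelEncoder numbers the categories in sorted order:
# Large=0, Medium=1, Small=2, which reverses the real size order
df['label_encoded'] = LabelEncoder().fit_transform(df['Size'])

# an explicit mapping keeps the real order: Small < Medium < Large
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
df['ordinal_encoded'] = df['Size'].map(size_mapping)

print(df)
```

In the label-encoded column, Small ends up with the largest number, so a model that reads the integers as magnitudes would learn the wrong ranking.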

So, each method has its strengths and is suited to different types of categorical data. Label encoding can be applied to any categorical column, but it does not preserve order, so it is best kept for nominal data and target labels. One-hot encoding can be used for nominal data when you prefer to create a new column for each category, and ordinal encoding can be used for ordinal data when you can specify the order of the categories.

Summary

So, here are the techniques used to handle categorical data that you should know:

Label Encoding
One-Hot Encoding
Ordinal Encoding

I hope you liked this article on categorical data handling techniques.