Automate Feature Selection using Python

Feature selection is a critical step in machine learning that helps improve model performance by removing irrelevant or redundant features. One effective method for feature selection is using Decision Trees, which rank features based on their importance in predicting the target variable. In this article, I’ll explain how to automate feature selection with Decision Trees using Python.

What is Automated Feature Selection?

Automated feature selection works like a pipeline step that scores every feature and drops the irrelevant or redundant ones, which can lead to the following:

Faster training times
Reduced overfitting
Improved model accuracy
Better model interpretability

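Decision tree-based feature importance is a popular method for feature selection because a fitted tree naturally assigns an importance score to each feature: the total reduction in the split criterion (for a regressor, the prediction error) contributed by the splits on that feature, normalized so that the scores of all features sum to 1.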
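
To make the idea of importance scores concrete, here is a minimal sketch on synthetic data. The arrays X_demo and y_demo and the coefficients used to build them are assumptions made up for illustration; they are not part of the Dynamic Pricing dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: the target depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 10 * X_demo[:, 0] + 1 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# A shallow tree is enough to show how the scores are distributed
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X_demo, y_demo)

# One score per column; the scores are normalized, so they sum to 1
importances = tree.feature_importances_
print(importances)
print(importances.sum())
```

The column that drives most of the variation in the target should receive most of the importance, which is the same pattern we will look for in the real dataset.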

Let’s understand how to automate feature selection using Python step-by-step. For this task, we will be using a dataset based on Dynamic Pricing, which can be downloaded from here.

Automate Feature Selection using Python

First, import the data and analyze its structure to begin automating feature selection:

import pandas as pd

df = pd.read_csv("/content/dynamic_pricing.csv")
print(df.head())

The target variable, Historical_Cost_of_Ride, is the cost of the ride, while the remaining columns are a mix of numerical and categorical features. Let’s start automating feature selection step by step.

Step 1: Encoding Categorical Features

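We will use LabelEncoder to convert each categorical column into integer codes. (scikit-learn designs LabelEncoder for target labels and provides OrdinalEncoder for feature columns, but for a tree-based model the resulting integer codes serve the same purpose.)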

from sklearn.preprocessing import LabelEncoder

# identify categorical columns
categorical_cols = ['Location_Category', 'Customer_Loyalty_Status', 'Time_of_Booking', 'Vehicle_Type']  # Replace with actual categorical columns

# apply Label Encoding
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

Now, all features are numerical and ready for feature selection.
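
If you would rather not hard-code the list of categorical columns, one possible approach is to detect them from the column dtypes. The sketch below is an assumption-laden illustration: the small demo frame is hypothetical and only borrows a few column names from the dataset, and it uses OrdinalEncoder in place of LabelEncoder.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature frame standing in for the real dataset
demo = pd.DataFrame({
    "Vehicle_Type": ["Premium", "Economy", "Premium", "Economy"],
    "Location_Category": ["Urban", "Rural", "Suburban", "Urban"],
    "Number_of_Riders": [90, 58, 42, 89],
})

# treat every non-numeric column as categorical
auto_categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

# encode all detected columns in one call
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(demo[auto_categorical_cols])

# write the integer codes back, one column at a time
for i, col in enumerate(auto_categorical_cols):
    demo[col] = encoded[:, i]

print(auto_categorical_cols)
print(demo.dtypes)
```

This only works when categorical columns are stored as text. A category stored as integers (for example, a numeric ID) would be missed and has to be listed by hand.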

Step 2: Training a Decision Tree for Feature Selection

Now, we will train a Decision Tree Regressor and extract Feature Importances:

X = df.drop(columns=['Historical_Cost_of_Ride'])
y = df['Historical_Cost_of_Ride']

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=42)
model.fit(X, y)

feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
})

# sort by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

print(feature_importances)

                   Feature  Importance
8   Expected_Ride_Duration    0.877188
5          Average_Ratings    0.033423
4     Number_of_Past_Rides    0.021423
1        Number_of_Drivers    0.019312
0         Number_of_Riders    0.017676
7             Vehicle_Type    0.013145
2        Location_Category    0.009164
6          Time_of_Booking    0.005836
3  Customer_Loyalty_Status    0.002833

Key Insight: Expected_Ride_Duration is by far the most important feature, accounting for about 88% of the total importance, while no other feature exceeds 4%.
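
Keep in mind that these scores are computed on the training data, so it can be worth cross-checking the ranking on held-out data. One option is permutation importance, a different technique from the impurity-based scores above: it shuffles one feature at a time and measures how much the model's score drops. The sketch below runs on synthetic data, and the column names signal, noise_a and noise_b are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: only "signal" drives the target
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise_a": rng.normal(size=400),
    "noise_b": rng.normal(size=400),
})
y_demo = 5 * X_demo["signal"] + rng.normal(scale=0.5, size=400)

# hold out a test set so importances are measured on unseen rows
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# shuffle each feature 10 times and record the average drop in R^2
result = permutation_importance(
    tree, X_test, y_test, n_repeats=10, random_state=42
)

# put both kinds of importance side by side
perm_importances = pd.DataFrame({
    "Feature": X_demo.columns,
    "Impurity_Importance": tree.feature_importances_,
    "Permutation_Importance": result.importances_mean,
}).sort_values(by="Permutation_Importance", ascending=False)

print(perm_importances)
```

If both rankings agree on the top features, that is some reassurance that the selection is not an artifact of the tree fitting noise in the training data.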

Step 3: Automate Feature Selection

We can now automate feature selection by keeping only features with importance > 1%:

# define threshold
threshold = 0.01  # keep features with importance > 1%

# select important features
selected_features = feature_importances[feature_importances['Importance'] > threshold]['Feature'].tolist()

# filter dataset
X_selected = X[selected_features]

Now, X_selected contains only the features whose importance is above the threshold, ready for training the final model. Here are the features that passed the threshold:

# print one selected feature name per line
print(*X_selected.columns, sep="\n")

Expected_Ride_Duration
Average_Ratings
Number_of_Past_Rides
Number_of_Drivers
Number_of_Riders
Vehicle_Type
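
The same thresholding can also be expressed with scikit-learn's SelectFromModel, which makes it easy to place the selection step inside a Pipeline and evaluate it with cross-validation. This is a sketch under stated assumptions, not the exact procedure used above: the data is synthetic, and SelectFromModel keeps features whose importance is greater than or equal to the threshold rather than strictly greater.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: columns 0 and 1 matter, columns 2-4 are noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# step 1 selects features by tree importance, step 2 trains the final model
pipeline = Pipeline([
    ("select", SelectFromModel(DecisionTreeRegressor(random_state=42), threshold=0.01)),
    ("model", DecisionTreeRegressor(random_state=42)),
])

# selection is re-fitted inside every fold, so no fold sees its own test rows
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
print(scores.mean())

# fit once on all rows to inspect which columns were kept
pipeline.fit(X_demo, y_demo)
kept_mask = pipeline.named_steps["select"].get_support()
print(kept_mask)
```

Running the selection inside the pipeline means the threshold is applied separately within each training fold, which gives a more honest estimate of how the selected feature set performs on unseen data.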

Summary

So, Decision Trees offer an effective way to automate feature selection: train a tree, rank the features by their importance in predicting the target variable, and keep only those above a chosen threshold. I hope you liked this article on how to automate feature selection using Python. Feel free to ask valuable questions in the comments section below.