Feature selection is a critical step in machine learning that helps improve model performance by removing irrelevant or redundant features. One effective method for feature selection is using Decision Trees, which rank features based on their importance in predicting the target variable. In this article, I’ll explain how to automate feature selection with Decision Trees using Python.
What is Automated Feature Selection?
Automated feature selection is a pipeline step that removes irrelevant or redundant features, leading to the following:
- Faster training times
- Reduced overfitting
- Improved model accuracy
- Better model interpretability
Decision Tree-based feature importance is one of the most effective methods for feature selection because decision trees naturally assign an importance score to each feature based on its contribution to reducing prediction error.
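As a minimal illustration (using synthetic data, not the article's dataset), a fitted decision tree exposes these scores through its feature_importances_ attribute. The scores are non-negative and sum to 1, and a feature the target doesn't depend on receives a score near zero:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic data: y depends strongly on column 0, weakly on column 1,
# and not at all on column 2
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = 5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(random_state=42, max_depth=5)
tree.fit(X_demo, y_demo)

# one non-negative score per feature; the scores sum to 1
importances = tree.feature_importances_
print(importances)
```

Here the first feature dominates the ranking, mirroring what we will see with the ride-pricing data below.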
Let’s understand how to automate feature selection using Python step-by-step. For this task, we will be using a dataset based on Dynamic Pricing, which can be downloaded from here.
Automate Feature Selection using Python
First, import the data and analyze its structure to begin automating feature selection:
import pandas as pd
df = pd.read_csv("/content/dynamic_pricing.csv")
print(df.head())

The target variable, Historical_Cost_of_Ride, represents the cost of the ride, while the remaining feature variables include various numerical and categorical values. Let’s start automating feature selection step by step.
Step 1: Encoding Categorical Features
We will use LabelEncoder to convert categorical variables into numerical values:
from sklearn.preprocessing import LabelEncoder
# identify categorical columns
categorical_cols = ['Location_Category', 'Customer_Loyalty_Status', 'Time_of_Booking', 'Vehicle_Type'] # Replace with actual categorical columns
# apply Label Encoding
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

Now, all features are numerical and ready for feature selection.
Step 2: Training a Decision Tree for Feature Selection
Now, we will train a Decision Tree Regressor and extract Feature Importances:
X = df.drop(columns=['Historical_Cost_of_Ride'])
y = df['Historical_Cost_of_Ride']
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X, y)
feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
})
# sort by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)
print(feature_importances)

Feature  Importance
8 Expected_Ride_Duration 0.877188
5 Average_Ratings 0.033423
4 Number_of_Past_Rides 0.021423
1 Number_of_Drivers 0.019312
0 Number_of_Riders 0.017676
7 Vehicle_Type 0.013145
2 Location_Category 0.009164
6 Time_of_Booking 0.005836
3 Customer_Loyalty_Status 0.002833
Key Insight: Expected_Ride_Duration is by far the most important feature, accounting for roughly 88% of the total importance.
Step 3: Automate Feature Selection
We can now automate feature selection by keeping only features with importance > 1%:
# define threshold
threshold = 0.01  # keep features with importance > 1%

# select important features
selected_features = feature_importances[feature_importances['Importance'] > threshold]['Feature'].tolist()

# filter dataset
X_selected = X[selected_features]
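The same thresholding can also be done with scikit-learn's SelectFromModel utility, which wraps a fitted (or to-be-fitted) estimator and keeps features whose importance exceeds a threshold. Here is a sketch on synthetic data (not the ride-pricing dataset), where only the first two columns carry signal:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectFromModel

# synthetic data: columns 0 and 1 are informative, columns 2 and 3 are noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# fit the selector: it trains the tree, then applies the importance threshold
selector = SelectFromModel(DecisionTreeRegressor(random_state=42), threshold=0.01)
selector.fit(X_demo, y_demo)

mask = selector.get_support()       # boolean mask of kept features
X_reduced = selector.transform(X_demo)
print(mask, X_reduced.shape)
```

This is equivalent to the manual filtering above, but plugs directly into scikit-learn pipelines.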
Now, X_selected contains only the most relevant features for training the final model. Here are the features that were selected:
print(X_selected.columns)
- Expected_Ride_Duration
- Average_Ratings
- Number_of_Past_Rides
- Number_of_Drivers
- Number_of_Riders
- Vehicle_Type
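As a final step, you would train a model on the selected features and check that it holds up on held-out data. Here is a sketch of that step; the synthetic stand-in data, column names, and model settings are assumptions for illustration, not from the article:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for the ride-pricing data: two informative columns,
# as if the noise columns had already been dropped by the threshold step
rng = np.random.default_rng(7)
X_all = pd.DataFrame(rng.normal(size=(400, 2)),
                     columns=['duration', 'ratings'])
y = 4 * X_all['duration'] + X_all['ratings'] + rng.normal(scale=0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y, test_size=0.25, random_state=42)

final_model = DecisionTreeRegressor(random_state=42, max_depth=6)
final_model.fit(X_train, y_train)

# R^2 on held-out data as a sanity check of the selected feature set
r2 = final_model.score(X_test, y_test)
print(f"R^2 on held-out data: {r2:.3f}")
```

If the score drops sharply compared with a model trained on all features, the importance threshold was probably too aggressive.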
Summary
In summary, Decision Trees provide an effective method for feature selection because they rank features by their importance in predicting the target variable, and a simple importance threshold turns that ranking into an automated selection step. I hope you liked this article on how to automate feature selection using Python. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.