How I Do Feature Selection with 500+ Columns

I still remember loading a dataset so big that my laptop fan started roaring. When I checked its shape, it had 20,000 rows and 512 columns. In data science, having more data isn’t always an advantage; what matters is having better data. In this article, I’ll show you how I handle feature selection when there are over 500 columns.

Feature Selection with 500+ Columns

I don’t just use Principal Component Analysis and hope it works. PCA makes the results hard to explain: I can’t tell stakeholders that a 10% drop in Principal Component 4 caused a problem.

Instead, I use a funnel approach: start with everything and narrow it down in three clear steps:

  1. Remove data that technically exists but holds no information.
  2. Remove features that are telling you the same story as other features.
  3. Ask a model which features actually help it make a decision.

Let’s walk through an example to see how this works.

The Problem

Imagine we’re predicting housing prices, but our dataset includes everything from Square Footage to Number of Doorknobs and Colour of the Mailbox.

We’ll use scikit-learn to create this messy dataset and then clean it up.

Step 0: Create the Dataset

First, we’ll generate a dataset with 500 features, but only about 20 of them are actually useful:

import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

# Generate a dataset with 1000 samples and 500 features
X, y = make_classification(
    n_samples=1000, 
    n_features=500, 
    n_informative=20, 
    n_redundant=50, 
    n_repeated=0, 
    random_state=42
)

# Convert to DataFrame for realism
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(500)])
print(f"Original Shape: {df.shape}")
Original Shape: (1000, 500)

Step 1: The Variance Threshold

If a column holds the same value in every row, like Country always being India, it has zero variance, and zero variance means no information. Near-constant columns, say the same value in 99% of rows, carry almost none either, so we remove both kinds first:

from sklearn.feature_selection import VarianceThreshold

# Filter features with low variance (constants or near-constants)
# Threshold 0 means drop columns where all values are the same
# We can set a small threshold (e.g., 0.01) to drop near-constants
selector_variance = VarianceThreshold(threshold=0.01)
selector_variance.fit(df)

# Get the remaining columns
selected_cols_var = df.columns[selector_variance.get_support()]
df_clean = df[selected_cols_var]

print(f"Shape after Variance Filter: {df_clean.shape}")
Shape after Variance Filter: (1000, 500)

If you see this output, don’t worry. It just means you don’t have any columns where every row is identical. The real issue is likely hidden deeper, in the relationships between your data.
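To sanity-check that the filter does fire when a genuinely constant column exists, here is a minimal sketch on a toy frame (the column names are made up for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# A tiny frame with one constant column and one varying column
toy = pd.DataFrame({
    'country': [1, 1, 1, 1],        # same value in every row -> zero variance
    'sqft':    [800, 950, 1200, 700],
})

# Threshold 0.0 drops only columns whose variance is exactly zero
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)

kept = list(toy.columns[selector.get_support()])
print(kept)  # ['sqft'] -- the constant 'country' column is gone
```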

Step 2: The Correlation Filter

If “Feature A” and “Feature B” are 95% correlated, you don’t need to keep both. Keeping both can confuse linear models by causing multicollinearity and can also slow down tree models.

We’ll go through the correlation matrix and drop one feature from any pair that has a correlation above a set threshold, like 0.90:

# Calculate correlation matrix
corr_matrix = df_clean.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.90
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]

# Drop them
df_uncorrelated = df_clean.drop(to_drop, axis=1)

print(f"Dropped {len(to_drop)} correlated features.")
print(f"Shape after Correlation Filter: {df_uncorrelated.shape}")
Dropped 0 correlated features.
Shape after Correlation Filter: (1000, 500)

Zero drops here isn’t a bug. In this dataset, the redundant columns are linear combinations of several informative features, so no single pair crosses the 0.90 threshold, and the pure noise columns are independent of everything else. We’re still left with 500 columns, so it’s time to stop hunting for duplicates and start asking the model directly.
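For contrast, here is a minimal sketch showing the same filter catching a real near-duplicate; `sqft_twin` is a made-up column that copies `sqft` plus a little noise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=200)

toy = pd.DataFrame({
    'sqft':      a,
    'sqft_twin': a + rng.normal(scale=0.01, size=200),  # near-duplicate of sqft
    'bedrooms':  rng.normal(size=200),                  # independent feature
})

# Same logic as above: scan the upper triangle for pairs above 0.90
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if any(upper[c] > 0.90)]
print(to_drop)  # ['sqft_twin']
```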

Step 3: Model-Based Selection

Now that we’ve removed the junk and the duplicates, we have a clean set. But is it actually useful? We can ask a Random Forest model.

Tree-based models naturally calculate Feature Importance, which shows how much each feature decreases impurity. We can train a quick model and have it drop anything that doesn’t help much:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Initialize the model (The Judge)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# SelectFromModel keeps features whose importance is at or above the median importance
selector_model = SelectFromModel(estimator=rf, threshold='median')
selector_model.fit(df_uncorrelated, y)

# Transform the dataset
X_final = selector_model.transform(df_uncorrelated)

# Get final features
selected_features = df_uncorrelated.columns[selector_model.get_support()]

print(f"Final Shape after Model Selection: {X_final.shape}")
print(f"We reduced features from 500 to {X_final.shape[1]}!")
Final Shape after Model Selection: (1000, 250)
We reduced features from 500 to 250!
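If you want to see how the judge ranked the candidates, the estimator fitted inside SelectFromModel exposes `feature_importances_`. A self-contained sketch on a smaller synthetic set (sizes chosen arbitrarily for speed):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median',
)
selector.fit(df, y)

# Rank every feature by the importance the fitted forest assigned it
importances = pd.Series(selector.estimator_.feature_importances_,
                        index=df.columns).sort_values(ascending=False)
print(importances.head(5))
print(f"Kept {selector.get_support().sum()} of {df.shape[1]} features")
```

The ranked Series is what I show stakeholders: unlike a principal component, “Square Footage drove the prediction” is a sentence people can act on.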

I like this approach because it’s similar to how people make decisions:

  1. Ignore the obvious nonsense (Variance Threshold).
  2. Stop repeating yourself (Correlation).
  3. Focus on what actually impacts the outcome (Model Importance).

This method is robust, easy to interpret, and efficient because we remove the simple stuff before running the more complex model-based selection.
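If you want the whole funnel as one object, the variance and model-based steps chain naturally in a scikit-learn Pipeline. The correlation filter isn’t a built-in transformer, so this sketch assumes you apply it separately (or skip it, as it dropped nothing here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.pipeline import Pipeline

# Same synthetic dataset as before
X, y = make_classification(n_samples=1000, n_features=500,
                           n_informative=20, n_redundant=50,
                           n_repeated=0, random_state=42)

# Step 1 and Step 3 of the funnel, chained
funnel = Pipeline([
    ('variance', VarianceThreshold(threshold=0.01)),
    ('model', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold='median')),
])

X_reduced = funnel.fit_transform(X, y)
print(X_reduced.shape)  # (1000, 250), matching the step-by-step run above
```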

Closing Thoughts

That’s how I handle feature selection with over 500 columns. Data science is often seen as the art of adding more data, more layers, more compute. But the senior engineers I know, the ones whose models last, focus on the art of subtraction.

When you’re faced with 500 columns, don’t be afraid to cut. Your model will be lighter, faster, and honestly, smarter.

If you found this article helpful, you can follow me on Instagram for daily AI tips and practical resources. You may also be interested in my latest book, Hands-On GenAI, LLMs & AI Agents, a step-by-step guide to prepare you for careers in today’s AI industry.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
