A Guide to PCA for Data Scientists

Many people think of PCA (Principal Component Analysis) as simply dropping columns to reduce a dataset’s dimensionality. In fact, PCA works by creating new features, called principal components, that are linear combinations of the original ones. If you are new to PCA, this article is for you. In it, I’ll take you through a guide to PCA for Data Scientists.

PCA is like taking a big, messy dataset and rotating it to see it from an angle where the patterns are clearer. It’s a dimensionality reduction technique, but it’s not about dropping columns at random. It’s about preserving as much of the data’s variance as possible while discarding directions that carry mostly noise or redundancy.

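With PCA, you create new features (called principal components) that are linear combinations of the original ones. The first component captures as much variance in the data as possible, and each subsequent component captures as much of the remaining variance as possible while staying uncorrelated with the earlier ones.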
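To make the ideas of "rotation" and "linear combination" concrete, here is a minimal sketch on a small synthetic 2-D dataset. The data and the variable names (x1, x2, Z_manual, and so on) are made up for illustration and are not part of the example used later in this article.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data with two correlated features (illustrative assumption)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# Keep both components so nothing is discarded yet
pca = PCA(n_components=2)
Z = pca.fit_transform(X)

# Each principal component is a linear combination of the centred features;
# the weights are the rows of pca.components_
X_centred = X - pca.mean_
Z_manual = X_centred @ pca.components_.T

# The weight matrix is orthonormal, so keeping every component only rotates
# (or reflects) the data and leaves the total variance unchanged
W = pca.components_
is_orthonormal = np.allclose(W @ W.T, np.eye(2))
total_var_before = X.var(axis=0, ddof=1).sum()
total_var_after = Z.var(axis=0, ddof=1).sum()

print("Matches manual projection:", np.allclose(Z, Z_manual))
print("Weights are orthonormal:", is_orthonormal)
print("Total variance before/after:", total_var_before, total_var_after)
```

Dimensionality reduction only happens once you keep fewer components than there are original features. Until then, PCA is just a change of axes.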

Here’s what PCA does under the hood:

Centre the Data: Subtract the mean of each feature so everything’s centred around zero.
Compute the Covariance Matrix: Understand how features vary together.
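Calculate Eigenvectors and Eigenvalues: The eigenvectors of the covariance matrix give the directions of maximum variance, and the eigenvalues give the amount of variance along each direction.
Sort and Project: Sort the eigenvectors by eigenvalue in descending order, keep the top k, and project the centred data onto them.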

The final result is a new dataset with fewer, uncorrelated features known as the principal components.
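The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally; the function name pca_from_scratch and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # 1. Centre the data: subtract the mean of each feature
    X_centred = X - X.mean(axis=0)

    # 2. Covariance matrix of the features (n_features x n_features)
    cov = np.cov(X_centred, rowvar=False)

    # 3. Eigen-decomposition; eigh is meant for symmetric matrices and
    #    returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort by eigenvalue (largest first), keep the top k, and project
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    return X_centred @ components, eigenvalues[order]

# Toy data (illustrative): 200 rows, 5 features driven by 2 hidden factors
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

Z, variances = pca_from_scratch(X, k=2)
Z_cov = np.cov(Z, rowvar=False)

print("Shape after PCA:", Z.shape)
print("Covariance of the new features:")
print(np.round(Z_cov, 6))
```

The off-diagonal entries of the printed covariance matrix are zero up to floating-point error, which is what "uncorrelated" means here, and the diagonal entries equal the two largest eigenvalues.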

Implementing PCA to Reduce the Dimensionality of a Dataset

Now, let’s see how to use PCA to reduce the dimensionality of a dataset using Python. To understand the implementation, we will apply PCA to the Forest Cover Type dataset, which has 54 features and over half a million rows. We’ll create 2 new features from it, the first two principal components, and then we will visualize the transformation.

Let’s load the data and apply PCA to reduce the dimensionality of the dataset:

from sklearn.datasets import fetch_covtype
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = fetch_covtype()
X = data.data
y = data.target

# standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# reduce to 2 dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Here, X_pca contains our new features. Each component is a weighted combination of all original features, and they are sorted by how much variance they capture.
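To check how much variance each component captures and which original features it weights most heavily, you can inspect the fitted PCA object. The sketch below uses a small synthetic stand-in (X_demo, an assumption made so the example runs without downloading anything) rather than the Forest Cover Type data; explained_variance_ratio_ and components_ are standard attributes of scikit-learn’s PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the real data (assumption: 1,000 rows, 6 features)
rng = np.random.default_rng(1)
latent = rng.normal(size=(1000, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(1000, 6))

# Same preprocessing as above: standardize, then fit PCA with 2 components
X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of the total variance captured by each component, largest first
ratios = pca_demo.explained_variance_ratio_
print("Variance explained per component:", ratios.round(3))
print("Variance explained in total:", ratios.sum().round(3))

# Weight of every original feature in each component:
# one row per component, one column per original feature
weights = pca_demo.components_
print("Weights shape:", weights.shape)
```

The same two attributes are available on the pca object fitted above, where pca.components_ has one row per component and 54 columns, one per original feature.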

Now, let’s see how the data looks before and after PCA:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Original Data (First 2 Features)", "After PCA (First 2 Principal Components)"),
    specs=[[{"type": "scatter"}, {"type": "scatter"}]]
)

fig.add_trace(go.Scattergl(
    x=X_scaled[:, 0],
    y=X_scaled[:, 1],
    mode='markers',
    marker=dict(
        color=y,
        colorscale='Viridis',
        size=3,
        opacity=0.6
    ),
    showlegend=False
), row=1, col=1)

fig.add_trace(go.Scattergl(
    x=X_pca[:, 0],
    y=X_pca[:, 1],
    mode='markers',
    marker=dict(
        color=y,
        colorscale='Viridis',
        size=3,
        opacity=0.6
    ),
    showlegend=False
), row=1, col=2)

fig.update_layout(
    title="PCA: From Original Features to Principal Components",
    height=500,
    width=1000,
    margin=dict(l=40, r=40, t=80, b=60),
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='white'
)

fig.update_xaxes(title_text="Feature 1", row=1, col=1)
fig.update_yaxes(title_text="Feature 2", row=1, col=1)
fig.update_xaxes(title_text="PC1", row=1, col=2)
fig.update_yaxes(title_text="PC2", row=1, col=2)

fig.show()

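[Figure: PCA: From Original Features to Principal Components. Left: original data (first 2 features). Right: after PCA (first 2 principal components).]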

This output shows how PCA transforms the original feature space. On the left, the data plotted against the first two standardized features appears densely packed and lacks clear structure or separation.

After applying PCA (right plot), the data plotted along the first two principal components shows more distinct patterns and variation. By construction, these two axes are the directions of maximum variance in the standardized data, so they capture at least as much variance as any two original features plotted on their own. This can make the underlying structure more visible and potentially more useful for tasks like clustering or classification, although PCA ignores the class labels, so separation between classes is not guaranteed.

Summary

To summarize, PCA is a dimensionality reduction technique, but it’s not about dropping columns at random. It’s about preserving the most important information while discarding noise or redundancy. I hope you liked this guide to PCA for Data Scientists. Feel free to ask questions in the comments section below. You can follow me on Instagram for many more resources.