As a Data Scientist, you’re often tasked with determining whether a difference in outcomes or a trend in the data is significant, or simply the result of random variation. This is where hypothesis testing becomes essential. It provides a structured, statistical framework to validate assumptions, compare groups, and make confident, data-driven decisions. So, in this article, I’ll take you through a practical guide to Hypothesis Testing for Data Scientists with Python.
Hypothesis Testing for Data Scientists with Python: Getting Started
We’ve been given a dataset of 1000 employees, with information on:
- Age, Department, Education, Experience
- Whether they attended a training program
- Their performance scores (scaled from 0 to 100)
We want to evaluate whether the training program improved performance, on average, compared to employees who didn’t attend the training. You can find the dataset here.
Step 1: Define the Hypotheses
In hypothesis testing, we start by stating two opposing claims:
- Null Hypothesis (H₀): There is no difference in average performance scores between trained and untrained employees.
- Alternative Hypothesis (H₁): Trained employees have a higher average performance score than untrained employees.
This is a one-tailed test, as we’re specifically interested in improvement.
Now, before the second step, we will import the dataset:
import pandas as pd
df = pd.read_csv('/content/Employee_Training_and_Performance_Dataset.csv')
df.head()
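If you don’t have the CSV handy, a synthetic stand-in with the same column names lets you follow along. This is a sketch, not the real dataset: the column values, the size of the training boost, and the noise level are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

training = rng.choice(['Yes', 'No'], n)
# Give trained employees a modest average boost (assumption for illustration)
scores = rng.normal(70, 10, n) + np.where(training == 'Yes', 5.0, 0.0)

# Column names mirror those used in the article's code
df = pd.DataFrame({
    'Age': rng.integers(22, 60, n),
    'Department': rng.choice(['Sales', 'IT', 'HR', 'Finance'], n),
    'Education': rng.choice(['Bachelors', 'Masters', 'PhD'], n),
    'Experience': rng.integers(0, 30, n),
    'TrainingAttended': training,
    'PerformanceScore': np.clip(scores, 0, 100),
})
print(df.head())
```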
Step 2: Prepare the Groups
Next, we will split the dataset into two groups based on whether employees attended the training:
group_yes = df[df['TrainingAttended'] == 'Yes']['PerformanceScore']
group_no = df[df['TrainingAttended'] == 'No']['PerformanceScore']
Step 3: Check for Normality
Most parametric tests, including the t-test, assume that the data is approximately normally distributed. So, we will use the Shapiro-Wilk Test to check this for both groups. If the p-value > 0.05, we fail to reject the null hypothesis of normality:
from scipy import stats
# Shapiro-Wilk becomes overly sensitive on very large samples, so cap the sample size
sample_size = min(len(group_yes), len(group_no), 300)
shapiro_yes = stats.shapiro(group_yes.sample(sample_size, random_state=1))
shapiro_no = stats.shapiro(group_no.sample(sample_size, random_state=1))
print("Shapiro Test (Training = Yes):", shapiro_yes)
print("Shapiro Test (Training = No):", shapiro_no)

Shapiro Test (Training = Yes): ShapiroResult(statistic=np.float64(0.9947190464566062), pvalue=np.float64(0.3910129582664982))
Shapiro Test (Training = No): ShapiroResult(statistic=np.float64(0.99501435026432), pvalue=np.float64(0.44369527154076494))
Both groups are approximately normal, so we can proceed with the t-test.
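Had either Shapiro-Wilk p-value come out below 0.05, a non-parametric alternative such as the Mann-Whitney U test, which makes no normality assumption, would be the safer choice. A minimal sketch using placeholder data in place of the real groups:

```python
from scipy import stats
import numpy as np

rng = np.random.default_rng(0)
# Placeholder scores standing in for the trained and untrained groups
group_yes = rng.normal(75, 10, 300)
group_no = rng.normal(70, 10, 300)

# One-sided Mann-Whitney U: is group_yes stochastically greater than group_no?
u_stat, p_val = stats.mannwhitneyu(group_yes, group_no, alternative='greater')
print("Mann-Whitney U p-value:", p_val)
```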
Step 4: Check for Equal Variance
Before running a t-test, we need to determine whether the two groups have equal variances. We use Levene’s Test for this:
levene = stats.levene(group_yes, group_no)
print("Levene’s Test:", levene)

Levene’s Test: LeveneResult(statistic=np.float64(3.6987757209752585), pvalue=np.float64(0.05473666933558896))
This p-value (≈0.055) sits just above 0.05, so the result is borderline. To be cautious, we assume unequal variances and use Welch’s t-test, which remains valid even when the two groups have different variances.
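This decision can also be made programmatically: run Levene’s test first and pass its verdict to `equal_var`. A sketch with placeholder data (the group values and threshold are assumptions for illustration):

```python
from scipy import stats
import numpy as np

rng = np.random.default_rng(1)
# Placeholder data with deliberately different spreads
group_yes = rng.normal(75, 12, 300)
group_no = rng.normal(70, 9, 300)

# Pick the t-test variant based on Levene's result:
# equal variances -> Student's t-test, unequal -> Welch's t-test
levene_p = stats.levene(group_yes, group_no).pvalue
equal_var = levene_p > 0.05
t_stat, p_val = stats.ttest_ind(group_yes, group_no, equal_var=equal_var)
print("Levene p:", levene_p, "| t-test p:", p_val)
```

For borderline Levene results like ours, defaulting to Welch’s test (`equal_var=False`) is the conservative choice, since it costs little power when variances happen to be equal.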
Step 5: Perform Welch’s T-Test
Now, we will perform the actual hypothesis test:
t_stat, p_val = stats.ttest_ind(group_yes, group_no, equal_var=False)
print("T-test statistic:", t_stat)
print("T-test p-value:", p_val)

T-test statistic: 9.187893626181372
T-test p-value: 2.8582551803382495e-19
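One detail worth noting: `ttest_ind` reports a two-sided p-value by default, while our alternative hypothesis is one-tailed. With SciPy 1.6+ you can request the one-sided version directly via the `alternative` parameter. A sketch with placeholder data standing in for the two groups:

```python
from scipy import stats
import numpy as np

rng = np.random.default_rng(2)
# Placeholder scores standing in for the trained and untrained groups
group_yes = rng.normal(75, 10, 300)
group_no = rng.normal(70, 10, 300)

# One-sided Welch's t-test: H1 says the trained group's mean is higher
t_stat, p_one = stats.ttest_ind(group_yes, group_no,
                                equal_var=False, alternative='greater')
print("One-sided p-value:", p_one)
```

For a positive t-statistic, the one-sided p-value is half the two-sided one, so in our case the conclusion is unchanged.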
A p-value this small means that, if training truly had no effect, a difference this large would be extremely unlikely to arise by chance.
Since the p-value is far below 0.05, we reject the null hypothesis. We now have strong statistical evidence that employees who attended training perform significantly better, on average, than those who did not.
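Statistical significance alone doesn’t tell us how large the improvement is. A quick effect-size check, such as Cohen’s d, complements the p-value; here is a sketch computed on placeholder data rather than the real groups:

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder scores standing in for the trained and untrained groups
group_yes = rng.normal(75, 10, 300)
group_no = rng.normal(70, 10, 300)

# Cohen's d: mean difference scaled by the pooled standard deviation
n1, n2 = len(group_yes), len(group_no)
s_pooled = np.sqrt(((n1 - 1) * group_yes.var(ddof=1) +
                    (n2 - 1) * group_no.var(ddof=1)) / (n1 + n2 - 2))
d = (group_yes.mean() - group_no.mean()) / s_pooled
print("Cohen's d:", d)
```

As a rough convention, d ≈ 0.2 is a small effect, 0.5 medium, and 0.8 large, which helps translate a significant result into practical terms.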
Here’s a visual comparison of both groups:
import plotly.express as px
fig = px.box(
df,
x='TrainingAttended',
y='PerformanceScore',
title='Performance Score by Training Attendance',
labels={
'TrainingAttended': 'Training Attended',
'PerformanceScore': 'Performance Score'
},
color='TrainingAttended',
points='all',
)
fig.update_layout(
plot_bgcolor='rgba(0,0,0,0)',
paper_bgcolor='white',
margin=dict(l=40, r=40, t=80, b=60),
showlegend=False
)
fig.show()
Summary
So, hypothesis testing is a powerful tool that enables Data Scientists to move beyond intuition and make statistically sound decisions. In this case, we used Python to validate a real-world assumption, followed the appropriate testing steps, and uncovered clear evidence that training positively impacts employee performance. I hope you liked this article on hypothesis testing for Data Scientists with Python. Feel free to ask questions in the comments section below. You can follow me on Instagram for many more resources.