Statistical Equations for Data Science

Statistical equations are mathematical expressions that describe relationships, summarize data, or allow inferences and predictions. Some statistical equations form the backbone of many Data Science techniques. So, if you want to know the statistical equations you should learn as a data science professional, this article is for you. In this article, I’ll take you through some essential statistical equations for Data Science with practical implementation using Python.

Here are some of the essential statistical equations for Data Science you should know:

Mean
Standard Deviation
Correlation Coefficient
Linear Regression
ANOVA

Let’s go through these statistical equations in detail one by one.

Mean (Average)

The mean, often referred to as the average, is a fundamental statistical measure used to find the central tendency of a dataset. The mean of a dataset is calculated by summing up all the values in the dataset and then dividing by the number of values. The formula for the mean is:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Where:

\bar{x} is the sample mean
x_i is the ith value in the dataset
n is the number of values in the dataset

For example, if you had a dataset of the following five numbers: 2, 4, 6, 8, and 10, you would calculate the sample mean as follows:

$$\bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6$$

And here’s how you can calculate the mean using Python, this time for the values 1 to 5:

data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)
print(mean)

Output: 3.0
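
To tie the code back to the worked example, here is a small sketch that computes the mean of 2, 4, 6, 8, and 10 both by hand and with Python's standard library. The variable names are just illustrative choices, and `statistics.fmean` assumes Python 3.8 or newer.

```python
import statistics

# Dataset from the worked example in the text
values = [2, 4, 6, 8, 10]

# Manual mean: sum of the values divided by how many there are
manual_mean = sum(values) / len(values)

# Same calculation using the standard library (returns a float)
library_mean = statistics.fmean(values)

print(manual_mean, library_mean)  # 6.0 6.0
```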

Standard Deviation

Standard deviation is a measure of the amount of variation or dispersion in a set of values. It is a very common statistical calculation that gives a sense of the typical distance between the values of a dataset and the mean of the dataset. A low standard deviation means that the values tend to be close to the mean, while a high standard deviation means that the values are spread out over a wider range.

The standard deviation (σ) for a population of size N is calculated using the formula:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

Here’s the breakdown of the formula:

σ: represents the population standard deviation
Σ: is the summation operator; here it adds up the squared deviations from the mean for all values in the population
(x_i - μ): represents the deviation of each individual value (x_i) from the population mean (μ)
N: represents the total number of values in the population

Here’s how to calculate the population standard deviation using Python:

import numpy as np
data = [1, 2, 3, 4, 5]
std_dev = np.std(data)
print(std_dev)

Output: 1.4142135623730951
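
The formula can also be written out step by step, which makes it easier to see what NumPy is doing. The sketch below is only an illustration: it computes the population standard deviation by hand and, as an addition not covered above, the sample version that divides by N - 1. By default `np.std` uses `ddof=0` (population); passing `ddof=1` gives the sample version.

```python
import math
import numpy as np

data = [1, 2, 3, 4, 5]
mu = sum(data) / len(data)  # mean of the data

# Squared deviation of each value from the mean
squared_devs = [(x - mu) ** 2 for x in data]

# Population formula: divide the summed squared deviations by N
pop_std = math.sqrt(sum(squared_devs) / len(data))

# Sample version (an addition here): divide by N - 1 instead
sample_std = math.sqrt(sum(squared_devs) / (len(data) - 1))

# Compare the manual results with NumPy
print(pop_std, np.std(data))             # population (ddof=0 is the default)
print(sample_std, np.std(data, ddof=1))  # sample (ddof=1)
```

If your data is a sample drawn from a larger population, the `ddof=1` version is usually the one you want.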

Correlation Coefficient

The correlation coefficient is a statistical measure of the strength and direction of the linear relationship between two variables. The values range between -1.0 and 1.0. A calculated number greater than 0 indicates a positive relationship; a number less than 0 signifies a negative relationship; and a number equal to 0 implies no linear relationship between the variables (a non-linear relationship may still exist).

There are several types of correlation coefficients, but the most common is Pearson’s correlation coefficient. One standard way to write it, using sums over the paired values, is:

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

Here’s the breakdown of the formula:

r: represents the Pearson correlation coefficient, which can range from -1 to 1.
n: represents the number of paired data points in the dataset.
Σx and Σy: represent the sums of the x values and of the y values.
Σxy: represents the sum of the products of each paired x and y value.
Σx²: represents the sum of the squared x values.
Σy²: represents the sum of the squared y values.

Here’s how to calculate the correlation coefficient using Python:

import numpy as np
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
correlation = np.corrcoef(x, y)[0, 1]
print(correlation)

Output: 0.9999999999999999
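
To see what goes into the coefficient, here is a rough sketch that builds Pearson’s r from the raw sums (n, Σx, Σy, Σxy, Σx², Σy²) and compares it with `np.corrcoef`. The y values are made up for illustration so that the relationship is not perfectly linear.

```python
import math
import numpy as np

# Hypothetical data with an imperfect linear relationship
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Raw sums used by the computational form of Pearson's r
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a ** 2 for a in x)
sum_y2 = sum(b ** 2 for b in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_manual = numerator / denominator

# Cross-check against NumPy
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Both values should agree up to floating-point rounding, and they fall below 1 because the points no longer lie exactly on a line.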

Linear Regression

Linear regression is a fundamental statistical and machine learning technique used to predict the value of a dependent variable (often denoted by Y) based on the value of one or more independent variables (often denoted by X). The goal of linear regression is to find the best-fitting straight line through the data points that minimizes the errors in prediction.

The equation for a simple linear regression (with one independent variable) is:

Y = β0 + β1X + ϵ

Where:

Y is the dependent variable.
X is the independent variable.
β0 is the intercept of the regression line (the value of Y when X=0).
β1 is the slope of the regression line (the change in Y for a one-unit change in X).
ϵ is the error term (the difference between the observed values and the values predicted by the model).

Here’s how to implement simple linear regression from scratch using Python:

X = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Means of X and y, needed for the intercept
mean_x = sum(X) / len(X)
mean_y = sum(y) / len(y)

# Sums used by the closed-form least-squares slope
n = len(X)
sum_xy = sum(xi * yi for xi, yi in zip(X, y))
sum_x = sum(X)
sum_y = sum(y)
sum_x_squared = sum(xi ** 2 for xi in X)

# Slope (β1)
beta_1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x ** 2)

# Intercept (β0): the fitted line passes through (mean_x, mean_y)
beta_0 = mean_y - beta_1 * mean_x

print(beta_0, beta_1)

Output: 0.0 2.0

The linear regression model results in an intercept (β0) of 0.0 and a slope (β1) of 2.0. This means the best-fit line for the given data is:

Y = 2.0X

This is a simple linear relationship where each unit increase in X results in a 2-unit increase in Y. The from-scratch calculation above is useful for understanding the equation, but on real-world datasets it is better to use a well-tested implementation such as scikit-learn’s LinearRegression, which also handles multiple independent variables. Here’s how you can use it:

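from sklearn.linear_model import LinearRegression
X = [[1], [2], [3], [4], [5]]  # scikit-learn expects a 2D array of features
y = [2, 4, 6, 8, 10]
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # fitted β0 and β1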
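
The error term ϵ is easier to picture with data that does not sit exactly on a line. The sketch below is only an illustration: the y values are invented, and it uses NumPy’s `polyfit` (a degree-1 least-squares fit) rather than scikit-learn to keep the example short. The residuals are the differences between the observed values and the values predicted by the fitted line.

```python
import numpy as np

# Hypothetical noisy data, made up for illustration
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Degree-1 least-squares fit; polyfit returns [slope, intercept]
beta_1, beta_0 = np.polyfit(X, y, 1)

# Predicted values from the fitted line and the resulting residuals
y_hat = beta_0 + beta_1 * X
residuals = y - y_hat

print(beta_0, beta_1)
print(residuals)
```

For a least-squares line with an intercept, the residuals sum to approximately zero, which is a quick sanity check on the fit.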

ANOVA

ANOVA, which stands for Analysis of Variance, is a statistical method used to compare the means of three or more samples to understand whether at least one sample mean is significantly different from the others. It is particularly useful when dealing with multiple groups and when wanting to understand the influence of a single or multiple independent categorical variables on a continuous dependent variable.

Three commonly used types of ANOVA are:

One-Way ANOVA (Single Factor): Used when there is one independent variable and one dependent variable. It compares the means between the groups that have been split on one independent variable.
Two-Way ANOVA (Factorial): Used when there are two independent variables. It can also consider interaction effects between the independent variables on the dependent variable.
N-Way ANOVA (Multifactorial): Used when there are three or more independent variables.

The formula for one-way ANOVA focuses on the concept of variance between the groups and within the groups. The basic idea is to compare the variance (or variation) expressed between the groups with the variance expressed within the groups.

The ANOVA formula for the F-statistic is:

$$F = \frac{\text{MSB}}{\text{MSW}}$$

Here’s the breakdown of the formula:

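Mean Square Between (MSB): The sum of squared deviations of each group mean from the grand mean, weighted by group size, divided by the between-group degrees of freedom (k - 1, where k is the number of groups).
Mean Square Within (MSW): The sum of squared deviations of each value from its own group mean, divided by the within-group degrees of freedom (N - k, where N is the total number of observations).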

Here’s a Python implementation of one-way ANOVA:

import scipy.stats as stats
data1 = [1, 2, 3]
data2 = [4, 5, 6]
data3 = [7, 8, 9]
f_val, p_val = stats.f_oneway(data1, data2, data3)
print(f_val, p_val)

Output: 27.0 0.0010000000000000002
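
To connect the F-statistic back to MSB and MSW, here is a minimal sketch that computes both by hand for the same three groups and compares the result with `stats.f_oneway`. The variable names (`ss_between`, `ss_within`, and so on) are just illustrative choices.

```python
import numpy as np
from scipy import stats

groups = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

msb = ss_between / (k - 1)        # Mean Square Between
msw = ss_within / (n_total - k)   # Mean Square Within
f_manual = msb / msw

# Cross-check against SciPy
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
```

Here the between-group sum of squares is 54 with 2 degrees of freedom and the within-group sum of squares is 6 with 6 degrees of freedom, so MSB = 27, MSW = 1, and F = 27, matching the output above.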

So, these were some statistical equations that you should know for Data Science.

Summary

So, below are some of the essential statistical equations for Data Science you should know:

Mean
Standard Deviation
Correlation Coefficient
Linear Regression
ANOVA

I hope you liked this article on statistical equations for Data Science.