Probability and Statistics for Data Science

Probability provides the theoretical foundation needed to make statistical inferences, while statistics applies those theories to analyze and make sense of real-world data. Understanding probability and statistics is crucial for learning data science as these concepts form the backbone of data analysis, machine learning algorithms, and their interpretation. In this article, I’ll take you through some key concepts of probability and statistics, with implementations in Python, that you should know for Data Science.

Probability and Statistics for Data Science

Let’s understand all the key concepts of probability and statistics for Data Science. I’ll also explain the concepts with implementation using Python when required.

Descriptive Statistics

Descriptive statistics summarize the main features of a dataset, providing a quick overview of the sample. They consist of:

Measures of Central Tendency: These are statistical metrics that represent the centre point or typical value of data. The most common measures are:

  • Mean: The average of the data.
  • Median: The middle value in a sorted list.
  • Mode: The most frequently occurring value.

Measures of Spread: These metrics indicate how spread out the data points are in a dataset. Common measures include:

  • Range: Difference between the highest and lowest values.
  • Variance: The average of the squared deviations from the mean.
  • Standard Deviation: Square root of the variance.

Skewness and Kurtosis: These are measures of the shape of the distribution of the data:

  • Skewness: Measure of the asymmetry of the probability distribution.
  • Kurtosis: Measure of the “tailedness” of the distribution.

Here’s how we can calculate the descriptive statistics of data using Python:

import numpy as np
import pandas as pd
from scipy import stats

# Generating a simple dataset
np.random.seed(0)
data = np.random.normal(50, 15, 100)  # 100 data points, mean=50, std=15

# Descriptive Statistics
mean = np.mean(data)
median = np.median(data)
range_data = np.ptp(data)
variance = np.var(data)
std_dev = np.std(data)
skewness = stats.skew(data)
kurtosis = stats.kurtosis(data)

(mean, median, range_data, variance, std_dev, skewness, kurtosis)
(50.89712023301727,
51.41144179156997,
72.34116659732528,
228.56098932335956,
15.118233670748696,
0.005171839713550985,
-0.37835455663313455)

We can also visualize them for a better understanding:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Creating the plots
plt.figure(figsize=(18, 6))

# Histogram
plt.subplot(1, 3, 1)
sns.histplot(data, kde=True)
plt.title('Histogram with Kernel Density Estimate')
plt.axvline(np.mean(data), color='r', linestyle='--', label='Mean')
plt.axvline(np.median(data), color='g', linestyle='-', label='Median')
plt.legend()

# Box Plot
plt.subplot(1, 3, 2)
sns.boxplot(x=data)
plt.title('Box Plot')

# Violin Plot
plt.subplot(1, 3, 3)
sns.violinplot(x=data)
plt.title('Violin Plot')

plt.show()

Histogram with Kernel Density Estimate (KDE): The red dashed line represents the mean. The green solid line represents the median. This plot gives a sense of the data distribution, mean, and median.

Box Plot: Illustrates the range, median, and interquartile range (IQR). The central line in the box represents the median, while the box’s edges represent the lower and upper quartiles. The “whiskers” extend to the most extreme data points not considered outliers.

Violin Plot: Combines aspects of the box plot with a KDE. The width of the plot at different values indicates the density of the data at that value, giving a sense of the skewness and kurtosis.

Probability

Probability measures the likelihood that an event will occur. Key concepts include:

  • Probability rules: Including the addition and multiplication rules.
  • Conditional probability: Probability of an event given that another event has occurred.
  • Discrete Distributions: Such as the Binomial and Poisson distributions.
  • Continuous Distributions: Such as the Normal and Uniform distributions.
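A quick way to build intuition for conditional probability is to enumerate outcomes directly. The sketch below (a minimal example of my own, not from the die example that follows) computes P(sum = 8 | first die is even) for two fair dice:

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# Event A: the two dice sum to 8
# Event B: the first die shows an even number
a_and_b = [o for o in outcomes if o[0] + o[1] == 8 and o[0] % 2 == 0]
b = [o for o in outcomes if o[0] % 2 == 0]

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = len(a_and_b) / len(b)
print(p_a_given_b)  # 3/18 ≈ 0.1667
```

Of the 18 outcomes where the first die is even, only (2, 6), (4, 4), and (6, 2) sum to 8, so the conditional probability is 3/18 = 1/6.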

Let’s create a discrete probability distribution to illustrate these concepts. I’ll consider a simple example: rolling a six-sided die:

# Probabilities of rolling a six-sided die
die_rolls = np.arange(1, 7)  # Possible outcomes: 1, 2, 3, 4, 5, 6
probabilities = np.full(6, 1/6)  # Each outcome has an equal probability

# Creating a DataFrame for better visualization
die_probability_distribution = pd.DataFrame({
    'Outcome': die_rolls,
    'Probability': probabilities
})

print(die_probability_distribution)
   Outcome  Probability
0        1     0.166667
1        2     0.166667
2        3     0.166667
3        4     0.166667
4        5     0.166667
5        6     0.166667

In this case, each outcome (rolling a 1, 2, 3, 4, 5, or 6) has an equal probability of 1/6.
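To see this distribution emerge empirically, we can also simulate a large number of rolls and compare the observed frequencies with the theoretical 1/6 (a simple sketch; the exact frequencies will vary with the random seed):

```python
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)  # 100,000 simulated die rolls

# Observed relative frequency of each face
values, counts = np.unique(rolls, return_counts=True)
frequencies = counts / len(rolls)

for face, freq in zip(values, frequencies):
    print(f"Face {face}: {freq:.4f}")  # each should be close to 1/6 ≈ 0.1667
```

With 100,000 rolls, each observed frequency lands within a fraction of a percentage point of the theoretical probability.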

Inferential Statistics

Inferential statistics allow us to make predictions or inferences about a population based on a sample. Some key concepts you should know are:

Sampling:
  • Random Sampling: Ensuring each member has an equal chance of being selected.
  • Sampling Distribution: Distribution of a statistic over many samples.
Hypothesis Testing:
  • Null Hypothesis (H0): A statement of no effect, difference, or relationship in the population. It’s the hypothesis that a researcher tries to disprove or reject.
  • Alternative Hypothesis (H1): Contrasts the null hypothesis and represents the outcome the researcher is trying to demonstrate or prove. It indicates the presence of an effect, difference, or relationship.
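To make the idea of a sampling distribution concrete, the sketch below (a minimal illustration with an assumed exponential population) draws many samples from a skewed population and looks at the distribution of the sample means; by the central limit theorem, those means cluster tightly around the population mean:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # skewed population, mean ≈ 2

# Draw 1,000 random samples of size 50 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(1000)]

print(np.mean(sample_means))  # close to the population mean
print(np.std(sample_means))   # close to population std / sqrt(50)
```

Even though the population is strongly skewed, the sampling distribution of the mean is approximately normal, with a spread that shrinks as the sample size grows.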

You can learn more about the implementation of hypothesis testing using Python from here.
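As a quick illustration of the mechanics (a minimal sketch, not a full tutorial), here is a one-sample t-test with scipy, checking whether a sample is consistent with a hypothesized population mean of 50:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
sample = np.random.normal(55, 15, 100)  # sample drawn with true mean 55

# H0: the population mean is 50; H1: it is not
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% level")
else:
    print("Fail to reject the null hypothesis at the 5% level")
```

Because the sample was drawn with a true mean well above 50, the test yields a small p-value and the null hypothesis is rejected.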

Correlation and Regression

Correlation measures the strength and direction of the relationship between two variables. Regression models that relationship so that the value of a dependent variable can be predicted from an independent variable.

Next, let’s look at correlation and regression. Suppose we have another set of data (let’s call it data2), and we want to understand the relationship between data (that we used for descriptive statistics) and data2. We’ll first compute the correlation and then perform a simple linear regression:

# Generating another dataset
np.random.seed(1)
data2 = np.random.normal(30, 10, 100)  # 100 data points, mean=30, std=10

# Correlation
correlation, _ = stats.pearsonr(data, data2)

# Simple Linear Regression
from sklearn.linear_model import LinearRegression

# Reshaping data for regression model
X = data.reshape(-1, 1)
Y = data2.reshape(-1, 1)

# Creating and fitting the model
model = LinearRegression()
model.fit(X, Y)

# Coefficients
slope = model.coef_[0]
intercept = model.intercept_

print((correlation, slope, intercept))
(0.14939503462531986, array([0.08746918]), array([26.15389939]))

The correlation coefficient of 0.149 suggests a weak positive linear relationship between the two datasets. A correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.

The linear regression model, defined by the equation Y = 0.087 * X + 26.154, provides a way to predict values of data2 based on data. The slope of 0.087 indicates that for every unit increase in data, data2 increases by 0.087 units, on average.
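Once fitted, the model can also be used for prediction. The sketch below is self-contained (it regenerates the same two datasets with the same seeds) and predicts data2 for a new value of data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the two datasets used above
np.random.seed(0)
data = np.random.normal(50, 15, 100)
np.random.seed(1)
data2 = np.random.normal(30, 10, 100)

# Fit the same simple linear regression
model = LinearRegression()
model.fit(data.reshape(-1, 1), data2)

# Predict data2 for a new observation of data
new_x = np.array([[60.0]])
prediction = model.predict(new_x)
print(prediction[0])  # ≈ 0.087 * 60 + 26.154 ≈ 31.4
```

This simply evaluates the fitted line Y = 0.087 * X + 26.154 at X = 60.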

Bayesian Statistics

Lastly, let’s touch on Bayesian statistics with a simple example. Bayesian statistics involves updating the probability estimate as more evidence becomes available. It combines prior beliefs with the likelihood of the observed data.

Suppose you have a prior belief that the probability of an event (like a coin being biased towards heads) is 50%. After observing 10 coin flips, 7 of which are heads, you want to update this probability.

We’ll use Bayes’ Theorem for this, which states:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where:

  • P(A|B) is the posterior probability (probability of the hypothesis after considering the evidence).
  • P(B|A) is the likelihood (probability of the evidence given the hypothesis is true).
  • P(A) is the prior probability (initial probability of the hypothesis).
  • P(B) is the marginal likelihood (probability of the evidence under all possible hypotheses).

We’ll assume a binomial likelihood, taking the biased coin to have a 70% chance of heads and the fair coin 50%. Let’s calculate the posterior probability that the coin is biased:

from scipy.stats import binom

# Prior probability that the coin is biased towards heads
prior = 0.5

# Likelihood of observing 7 heads in 10 flips under each hypothesis
# ("biased" is taken here to mean a 70% chance of heads)
likelihood_biased = binom.pmf(7, 10, 0.7)
likelihood_fair = binom.pmf(7, 10, 0.5)

# Marginal likelihood: total probability of the evidence across both hypotheses
marginal_likelihood = likelihood_biased * prior + likelihood_fair * (1 - prior)

# Applying Bayes' Theorem
posterior = (likelihood_biased * prior) / marginal_likelihood

print(round(posterior, 4))
0.6948

After observing 7 heads out of 10 coin flips, the posterior probability of the coin being biased towards heads, calculated using Bayes’ Theorem, is approximately 0.695 (or 69.5%).

This posterior probability is higher than our prior probability of 50%. It suggests that, given the observed data (7 heads out of 10 flips), the belief in the coin being biased towards heads increases, because 7 heads is more likely under the biased hypothesis than under the fair one.

So, these were some essential statistics and probability concepts for data science, along with practical Python examples.

Summary

Probability provides the theoretical foundation needed to make statistical inferences, while statistics applies those theories to analyze and make sense of real-world data. I hope you liked this article on essential probability and statistics concepts for Data Science. Feel free to ask valuable questions in the comments section below.

Aman Kharwal
AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
