If you are preparing for data science interviews, you need to practice interview problems based on SQL, pandas, NumPy, statistics, and many more foundational concepts in data science. So, if you are looking for Data Science interview problems, this article is for you. In this article, I’ll take you through 20 problems to help you crack Data Science interviews.
20 Problems to Crack Data Science Interviews
Below are 20 problems to help you crack Data Science interviews, along with solutions in Python and SQL.
Problems based on SQL
Problem 1: You have a sales table with columns: sale_id, sale_date, store_id, and amount. Write a query to calculate the running total of sales for each store ordered by date.
Here’s how to solve this SQL problem:
SELECT
store_id,
sale_date,
amount,
SUM(amount) OVER (PARTITION BY store_id ORDER BY sale_date) AS running_total
FROM
sales
ORDER BY
store_id, sale_date;
The output will show each store’s store_id, sale_date, amount, and a cumulative running_total of sales up to each date, ordered by store_id and sale_date.
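Interviewers often follow up by asking for the pandas equivalent of a window function. A minimal sketch, assuming a small made-up DataFrame with the same columns as the sales table:

```python
import pandas as pd

# hypothetical rows mirroring the sales table
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2],
    "sale_date": pd.to_datetime(["2024-01-01", "2024-01-02",
                                 "2024-01-01", "2024-01-03"]),
    "amount": [100.0, 50.0, 200.0, 75.0],
})

# sort by date within each store, then take a cumulative sum per store
sales = sales.sort_values(["store_id", "sale_date"])
sales["running_total"] = sales.groupby("store_id")["amount"].cumsum()
print(sales)
```

Here groupby + cumsum plays the role of SUM(...) OVER (PARTITION BY ... ORDER BY ...).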
Problem 2: You have an employees table with columns: employee_id, manager_id, and salary. Write a query to calculate the total salary expense for each manager, including their direct reports.
Here’s how to solve this SQL problem:
WITH ManagerSalaries AS (
SELECT
manager_id,
SUM(salary) AS total_salary
FROM
employees
GROUP BY
manager_id
)
SELECT
m.manager_id,
e.employee_id,
e.salary,
m.total_salary
FROM
ManagerSalaries m
JOIN
employees e
ON
m.manager_id = e.manager_id;
The output will show each manager’s manager_id, their employees’ employee_id and salary, and the total salary of that manager’s direct reports. Note that the query sums only the direct reports’ salaries; to include the manager’s own salary, you would also join in the manager’s row from employees.
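The same aggregation can be sketched in pandas with groupby and transform, using hypothetical employee rows:

```python
import pandas as pd

# hypothetical rows mirroring the employees table
employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "manager_id": [10, 10, 20, 20],
    "salary": [50000, 60000, 55000, 45000],
})

# transform broadcasts each manager's total back onto every report's row,
# matching the CTE-plus-join shape of the SQL solution
employees["total_salary"] = employees.groupby("manager_id")["salary"].transform("sum")
print(employees)
```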
Problem 3: You have a departments table with columns: department_id and parent_department_id. Write a query to find all sub-departments under a given department.
Here’s how to solve this SQL problem:
WITH RecursiveDept AS (
SELECT
department_id,
parent_department_id
FROM
departments
WHERE
department_id = 1 -- Starting point
UNION ALL
SELECT
d.department_id,
d.parent_department_id
FROM
departments d
INNER JOIN
RecursiveDept r
ON
d.parent_department_id = r.department_id
)
SELECT
*
FROM
RecursiveDept;
The output will show all departments and their parent-child relationships starting from the specified department (department_id = 1), including all of its sub-departments recursively.
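If the interviewer asks how you would do this outside SQL, a recursive CTE maps naturally onto a breadth-first traversal in Python. A sketch with hypothetical (department_id, parent_department_id) rows:

```python
from collections import defaultdict, deque

# hypothetical (department_id, parent_department_id) rows
rows = [(1, None), (2, 1), (3, 1), (4, 2), (5, 4), (6, None)]

# index children by parent, like the join condition in the recursive CTE
children = defaultdict(list)
for dept, parent in rows:
    children[parent].append(dept)

def sub_departments(root):
    """Collect every department below root, breadth-first."""
    found, queue = [], deque([root])
    while queue:
        for child in children[queue.popleft()]:
            found.append(child)
            queue.append(child)
    return found

print(sub_departments(1))  # [2, 3, 4, 5]
```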
Problem 4: You have a user_data table with a user_info column containing JSON data. Write a query to extract the email field and count users by their country.
Here’s how to solve this SQL problem:
SELECT
user_info->>'country' AS country,
COUNT(*) AS user_count
FROM
user_data
GROUP BY
user_info->>'country';
The output will display each country extracted from the JSON column user_info and the corresponding count of users from that country. The ->> operator is PostgreSQL syntax for extracting a JSON field as text; the email field can be extracted the same way with user_info->>'email'.
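In pandas, the same counting can be sketched by parsing the JSON strings first (the rows below are made up for illustration):

```python
import json
import pandas as pd

# hypothetical rows mimicking the user_data table's JSON column
user_data = pd.DataFrame({
    "user_info": [
        '{"email": "a@example.com", "country": "US"}',
        '{"email": "b@example.com", "country": "IN"}',
        '{"email": "c@example.com", "country": "US"}',
    ]
})

# parse each JSON string, then pull out the fields of interest
parsed = user_data["user_info"].apply(json.loads)
user_data["email"] = parsed.apply(lambda d: d["email"])
user_data["country"] = parsed.apply(lambda d: d["country"])

counts = user_data["country"].value_counts()
print(counts)
```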
Problem 5: You have a sales table with columns: store_id, sale_date, and amount. Write a query to show the total sales for each store, with separate columns for each month.
Here’s how to solve this SQL problem:
SELECT
store_id,
SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 1 THEN amount ELSE 0 END) AS january_sales,
SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 2 THEN amount ELSE 0 END) AS february_sales,
SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 3 THEN amount ELSE 0 END) AS march_sales
FROM
sales
GROUP BY
store_id;
The output will show each store’s store_id along with total sales for January, February, and March as separate columns (january_sales, february_sales, march_sales).
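In pandas, the same month-to-columns pivot can be sketched with pivot_table (the data below is made up):

```python
import pandas as pd

# made-up sales rows
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2],
    "sale_date": pd.to_datetime(["2024-01-05", "2024-02-10",
                                 "2024-01-20", "2024-03-15"]),
    "amount": [100, 200, 300, 400],
})
sales["month"] = sales["sale_date"].dt.month_name()

# one row per store, one column per month, zeros where a store had no sales
pivot = sales.pivot_table(index="store_id", columns="month",
                          values="amount", aggfunc="sum", fill_value=0)
print(pivot)
```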
Problems based on Statistics
Problem 6: You are analyzing the performance of three different advertising strategies (A, B, C) based on the number of product purchases. Use ANOVA to determine if the strategies lead to significantly different results.
Here’s how to solve this problem using Python:
from scipy.stats import f_oneway
# sample data: number of purchases for three strategies
strategy_A = [30, 28, 35, 29, 34]
strategy_B = [25, 22, 27, 24, 30]
strategy_C = [40, 42, 45, 41, 46]
# perform ANOVA
f_stat, p_value = f_oneway(strategy_A, strategy_B, strategy_C)
print("F-Statistic:", f_stat)
print("P-Value:", p_value)
# interpretation
if p_value < 0.05:
    print("Significant differences exist among strategies.")
else:
    print("No significant differences found.")
F-Statistic: 44.91828793774321
P-Value: 2.6771009397609933e-06
Significant differences exist among strategies.
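To show you understand what f_oneway computes, it helps to be able to derive the F-statistic by hand from the between-group and within-group sums of squares:

```python
import numpy as np
from scipy.stats import f_oneway

groups = [np.array([30, 28, 35, 29, 34]),
          np.array([25, 22, 27, 24, 30]),
          np.array([40, 42, 45, 41, 46])]

grand_mean = np.concatenate(groups).mean()

# between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)

# F = mean square between / mean square within
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, _ = f_oneway(*groups)
print(f_manual, f_scipy)  # both ~44.92
```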
Problem 7: You are running an email spam filter. Initially, you assume a 50% chance an email is spam. If 80% of spam emails contain the word “sale” and 20% of non-spam emails also contain “sale”, calculate the updated probability that an email containing “sale” is spam.
Here’s how to solve this problem using Python:
# Bayesian Inference Formula
# P(Spam | Sale) = [P(Sale | Spam) * P(Spam)] / P(Sale)
# probabilities
P_spam = 0.5
P_sale_given_spam = 0.8
P_sale_given_not_spam = 0.2
P_not_spam = 1 - P_spam
# total probability of 'Sale'
P_sale = (P_sale_given_spam * P_spam) + (P_sale_given_not_spam * P_not_spam)
# posterior probability
P_spam_given_sale = (P_sale_given_spam * P_spam) / P_sale
print("Updated Probability of Spam given 'Sale':", P_spam_given_sale)
Updated Probability of Spam given 'Sale': 0.8
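A quick Monte Carlo simulation is a good way to sanity-check the Bayes result: generate emails with the stated probabilities and measure the empirical share of spam among emails containing “sale”.

```python
import random

random.seed(0)

# simulate 100,000 emails with the stated probabilities
spam_and_sale = not_spam_and_sale = 0
for _ in range(100_000):
    is_spam = random.random() < 0.5
    has_sale = random.random() < (0.8 if is_spam else 0.2)
    if has_sale:
        if is_spam:
            spam_and_sale += 1
        else:
            not_spam_and_sale += 1

estimate = spam_and_sale / (spam_and_sale + not_spam_and_sale)
print("Empirical P(Spam | Sale):", estimate)  # close to 0.8
```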
Problem 8: Estimate the 95% confidence interval for the mean income from a dataset of 1,000 individuals using bootstrapping.
Here’s how to solve this problem using Python:
import numpy as np
# sample data: income of 1,000 individuals
np.random.seed(42)
incomes = np.random.normal(50000, 15000, 1000)
# bootstrapping
bootstrap_means = [np.mean(np.random.choice(incomes, size=len(incomes), replace=True)) for _ in range(1000)]
# confidence interval
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print("95% Confidence Interval for Mean Income:", (ci_lower, ci_upper))
95% Confidence Interval for Mean Income: (49410.580818461574, 51275.35146924302)
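As a cross-check, the classical t-based interval should land close to the bootstrap interval for a sample this large:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
incomes = np.random.normal(50000, 15000, 1000)

# classical t-based 95% interval for the mean
mean = incomes.mean()
sem = stats.sem(incomes)
t_lower, t_upper = stats.t.interval(0.95, df=len(incomes) - 1,
                                    loc=mean, scale=sem)
print("t-based 95% CI:", (t_lower, t_upper))
```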
Problem 9: You have data on customer churn (time in months before they stopped using the service). Calculate the survival probability for each month using the Kaplan-Meier estimator.
Here’s how to solve this problem using Python:
# use: pip install lifelines
import numpy as np
from lifelines import KaplanMeierFitter

# sample churn data: time in months and event occurrence
time = [5, 6, 6, 2, 4, 8, 10, 3, 5, 7]
event = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # 1 = churned, 0 = censored

# Kaplan-Meier estimator
kmf = KaplanMeierFitter()
kmf.fit(time, event_observed=event)

# survival probabilities
kmf.plot_survival_function()
print(kmf.survival_function_)
          KM_estimate
timeline
0.0               1.0
2.0               0.9
3.0               0.8
4.0               0.7
5.0               0.5
6.0               0.4
7.0               0.4
8.0               0.4
10.0              0.0
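lifelines hides the arithmetic, but the estimator itself is simple enough to code directly, which makes a good whiteboard exercise: at each event time t, multiply the running survival probability by (1 - deaths / at_risk).

```python
# Kaplan-Meier survival probabilities computed by hand
time = [5, 6, 6, 2, 4, 8, 10, 3, 5, 7]
event = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # 1 = churned, 0 = censored

pairs = sorted(zip(time, event))
at_risk = len(pairs)
survival = 1.0
curve = {}
i = 0
while i < len(pairs):
    t = pairs[i][0]
    deaths = censored = 0
    # group all subjects that share this time point
    while i < len(pairs) and pairs[i][0] == t:
        if pairs[i][1] == 1:
            deaths += 1
        else:
            censored += 1
        i += 1
    if deaths:
        survival *= 1 - deaths / at_risk
    curve[t] = survival
    at_risk -= deaths + censored

print(curve)
```

The resulting probabilities match the lifelines table above (0.9 at month 2, 0.5 at month 5, and so on).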
Problem 10: You have a dataset with 10 features. Use PCA to reduce its dimensionality while retaining at least 95% of the variance.
Here’s how to solve this problem using Python:
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
# generate sample data
X, _ = make_classification(n_samples=500, n_features=10, random_state=42)
# apply PCA
pca = PCA(n_components=0.95) # retain at least 95% variance
X_pca = pca.fit_transform(X)
# results
print("Number of Components:", pca.n_components_)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Number of Components: 8
Explained Variance Ratio: [0.22552466 0.20268672 0.11512407 0.10510183 0.09660526 0.08947361
0.08811409 0.07736976]
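Note that passing a float to n_components sets a variance threshold, which is why 8 components are kept here. If the interviewer insists on exactly 2 components, pass an integer instead and report how much variance those 2 retain:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# same synthetic data as above
X, _ = make_classification(n_samples=500, n_features=10, random_state=42)

# an integer n_components fixes the count instead of a variance threshold
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X)
var2 = pca2.explained_variance_ratio_.sum()
print(X_2d.shape, var2)
```

With this data the top two components capture only about 43% of the variance (the first two ratios printed above), which is exactly the trade-off worth pointing out in an interview.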
Problems based on Data Cleaning
Problem 11: You have a dataset with missing values in the age column. Use KNN imputation to fill in the missing age values based on similarities in other columns like income and education level.
Here’s how to solve this problem using Python:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
# sample data
data = {'age': [25, np.nan, 28, 32, np.nan, 30],
'income': [50000, 55000, 62000, 60000, 64000, 59000],
'education_level': [1, 2, 2, 3, 3, 2]}
df = pd.DataFrame(data)
# KNN imputer for missing age values
imputer = KNNImputer(n_neighbors=2)
cols = ['age', 'income', 'education_level']
df[cols] = imputer.fit_transform(df[cols])
print(df)
    age   income  education_level
0  25.0  50000.0              1.0
1  31.0  55000.0              2.0
2  28.0  62000.0              2.0
3  32.0  60000.0              3.0
4  30.0  64000.0              3.0
5  30.0  59000.0              2.0
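One caveat worth mentioning in an interview: KNN imputation is distance-based, so a large-scale feature like income dominates the neighbour search. A common remedy (sketched here, not part of the original solution) is to standardize before imputing and invert the scaling afterwards:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'age': [25, np.nan, 28, 32, np.nan, 30],
                   'income': [50000, 55000, 62000, 60000, 64000, 59000],
                   'education_level': [1, 2, 2, 3, 3, 2]})

# scale first so income does not dominate the distance metric,
# then impute, then invert the scaling (StandardScaler ignores NaNs in fit)
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
imputed = KNNImputer(n_neighbors=2).fit_transform(scaled)
df_imputed = pd.DataFrame(scaler.inverse_transform(imputed), columns=df.columns)
print(df_imputed)
```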
Problem 12: You have a dataset of monthly incomes where some entries appear to be outliers. Use the IQR method to detect and handle these outliers by capping them within a specified range.
Here’s how to solve this problem using Python:
# sample income data with outliers
income_data = {'monthly_income': [3000, 3200, 3100, 15000, 2800, 2700, 3400, 2500, 35000]}
df = pd.DataFrame(income_data)
# calculate IQR
Q1 = df['monthly_income'].quantile(0.25)
Q3 = df['monthly_income'].quantile(0.75)
IQR = Q3 - Q1
# define bounds and cap outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df['monthly_income'] = np.where(df['monthly_income'] > upper_bound, upper_bound,
np.where(df['monthly_income'] < lower_bound, lower_bound, df['monthly_income']))
print(df)
   monthly_income
0          3000.0
1          3200.0
2          3100.0
3          4300.0
4          2800.0
5          2700.0
6          3400.0
7          2500.0
8          4300.0
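pandas also has a built-in clip method that expresses the same capping more concisely than nested np.where calls:

```python
import pandas as pd

df = pd.DataFrame({'monthly_income': [3000, 3200, 3100, 15000, 2800,
                                      2700, 3400, 2500, 35000]})

Q1, Q3 = df['monthly_income'].quantile([0.25, 0.75])
IQR = Q3 - Q1

# clip() caps values outside the IQR fences in one call
df['monthly_income'] = df['monthly_income'].clip(lower=Q1 - 1.5 * IQR,
                                                 upper=Q3 + 1.5 * IQR)
print(df)
```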
Problem 13: You have a dataset with customer names that may contain near-duplicates due to typos (e.g., “John Doe” and “Jon Doe”). Use fuzzy matching to identify these near-duplicates and retain only one unique entry per group.
Here’s how to solve this problem using Python:
# install fuzzywuzzy: pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process
# sample data with similar names
data = {'customer_name': ['John Doe', 'Jon Doe', 'Jane Doe', 'Janet Doe', 'Jake Doe']}
df = pd.DataFrame(data)
# identify potential duplicates using fuzzy matching
unique_names = []
for name in df['customer_name']:
    match = process.extractOne(name, unique_names, scorer=fuzz.token_sort_ratio)
    if match and match[1] > 85:  # similarity threshold
        print(f"Duplicate found: {name} -> {match[0]}")
    else:
        unique_names.append(name)

print("Unique names:", unique_names)
Duplicate found: Jon Doe -> John Doe
Duplicate found: Janet Doe -> Jane Doe
Duplicate found: Jake Doe -> Jane Doe
Unique names: ['John Doe', 'Jane Doe']
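If installing fuzzywuzzy isn’t an option, the standard library’s difflib gives a similar similarity score; here is a sketch of the same dedup loop (note the threshold is on a 0–1 scale here):

```python
from difflib import SequenceMatcher

names = ['John Doe', 'Jon Doe', 'Jane Doe', 'Janet Doe', 'Jake Doe']

def similarity(a, b):
    # ratio() returns a score in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# keep a name only if it is not too similar to one already kept
unique_names = []
for name in names:
    if not any(similarity(name, kept) > 0.85 for kept in unique_names):
        unique_names.append(name)

print("Unique names:", unique_names)
```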
Problem 14: You have a dataset with user reviews. The text contains various issues like inconsistent capitalization, special characters, and extra whitespace. Write a function to clean and normalize this text data for analysis.
Here’s how to solve this problem using Python:
import re
# sample text data
reviews = ["Great Product!!", " loved it... would buy again!!!", "Not BAD, but could be BETTER :)", "awesome!!!"]
# function to clean and normalize text
def clean_text(text):
    text = text.lower()                       # convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text)      # remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # remove extra whitespace
    return text
# apply cleaning function
cleaned_reviews = [clean_text(review) for review in reviews]
print(cleaned_reviews)
['great product', 'loved it would buy again', 'not bad but could be better', 'awesome']
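In practice the same cleaning is usually applied to a pandas column; the .str accessor vectorizes each step:

```python
import pandas as pd

reviews = pd.Series(["Great Product!!", " loved it... would buy again!!!",
                     "Not BAD, but could be BETTER :)", "awesome!!!"])

# the same three cleaning steps, vectorized over the whole column
cleaned = (reviews.str.lower()
                  .str.replace(r'[^a-z\s]', '', regex=True)
                  .str.replace(r'\s+', ' ', regex=True)
                  .str.strip())
print(cleaned.tolist())
```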
Problem 15: You have a dataset with numerical features of varying scales (e.g., age and income). Apply Min-Max scaling to transform each feature to a range of 0 to 1.
Here’s how to solve this problem using Python:
from sklearn.preprocessing import MinMaxScaler
# sample data
data = {'age': [20, 30, 40, 50, 60], 'income': [20000, 30000, 50000, 80000, 120000]}
df = pd.DataFrame(data)
# apply min-max scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df)
    age  income
0  0.00     0.0
1  0.25     0.1
2  0.50     0.3
3  0.75     0.6
4  1.00     1.0
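Min-Max scaling is also easy to compute directly in NumPy, which makes the underlying formula (x - min) / (max - min) explicit:

```python
import numpy as np

# same age/income data as above, as a float array
data = np.array([[20, 20000], [30, 30000], [40, 50000],
                 [50, 80000], [60, 120000]], dtype=float)

# (x - min) / (max - min), computed per column via broadcasting
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)
print(scaled)
```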
Problems based on NumPy
Problem 16: You have a 2D array of daily temperatures for 5 cities over 7 days and a 1D array of daily adjustment offsets. Use NumPy broadcasting to apply the offsets to every city’s temperatures.
Here’s how to solve this problem using NumPy and Python:
import numpy as np
# 2D array of temperatures (5 cities, 7 days)
temperatures = np.array([
[20, 21, 19, 22, 18, 20, 19],
[25, 26, 24, 23, 22, 27, 26],
[15, 17, 16, 18, 15, 16, 17],
[30, 31, 29, 32, 30, 28, 27],
[22, 23, 21, 25, 24, 26, 25]
])
# 1D array with daily offsets
offsets = np.array([1, -1, 0, 2, -2, 1, 0])
# adjust temperatures using broadcasting
adjusted_temperatures = temperatures + offsets
print(adjusted_temperatures)
[[21 20 19 24 16 21 19]
[26 25 24 25 20 28 26]
[16 16 16 20 13 17 17]
[31 30 29 34 28 29 27]
[23 22 21 27 22 27 25]]
Problem 17: You have an array of distances in kilometres and need to convert them to miles without using a loop. Use vectorized operations in NumPy for the conversion (1 km = 0.621371 miles).
Here’s how to solve this problem using NumPy and Python:
# array of distances in kilometers
distances_km = np.array([5, 10, 15, 20, 25])

# convert to miles using a vectorized operation
distances_miles = distances_km * 0.621371
print(distances_miles)
[ 3.106855 6.21371 9.320565 12.42742 15.534275]
Problem 18: You have an array of ages, and you want to find all ages that are above 18 and increase them by 1 to simulate a birthday. Use Boolean indexing to achieve this.
Here’s how to solve this problem using NumPy and Python:
# array of ages
ages = np.array([15, 20, 17, 19, 21, 18, 23, 16])

# increase ages above 18 by 1 using boolean indexing
ages[ages > 18] += 1
print(ages)
[15 21 17 20 22 18 24 16]
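Boolean indexing mutates the array in place; if you need to keep the original ages, np.where builds a new array instead (a small variation, not part of the original solution):

```python
import numpy as np

ages = np.array([15, 20, 17, 19, 21, 18, 23, 16])

# np.where returns a new array, leaving ages unchanged
bumped = np.where(ages > 18, ages + 1, ages)
print(bumped)  # [15 21 17 20 22 18 24 16]
```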
Problem 19: Given a 3D array representing monthly sales data (axis 0: stores, axis 1: months, axis 2: products), calculate the total sales for each product across all stores for each month.
Here’s how to solve this problem using NumPy and Python:
# sample 3D array: 2 stores, 3 months, 4 products
sales_data = np.array([
[[200, 150, 100, 80], [210, 160, 110, 90], [220, 170, 120, 95]],
[[180, 140, 90, 70], [190, 150, 100, 85], [200, 160, 110, 80]]
])
# sum sales for each product across all stores for each month
product_sales_monthly = np.sum(sales_data, axis=0)
print(product_sales_monthly)
[[380 290 190 150]
[400 310 210 175]
[420 330 230 175]]
Problem 20: You have a 2D array representing grayscale pixel values (8×8 matrix). Reshape it into a 4D array where each 4×4 sub-array represents a quadrant of the image.
Here’s how to solve this problem using NumPy and Python:
# sample 8x8 array representing grayscale pixel values
pixels = np.arange(64).reshape(8, 8)

# reshape into a 4D array (2, 2, 4, 4) so each 4x4 block is one quadrant
quadrants = pixels.reshape(2, 4, 2, 4).transpose(0, 2, 1, 3)
print(quadrants)
[[[[ 0 1 2 3]
[ 8 9 10 11]
[16 17 18 19]
[24 25 26 27]]
[[ 4 5 6 7]
[12 13 14 15]
[20 21 22 23]
[28 29 30 31]]]
[[[32 33 34 35]
[40 41 42 43]
[48 49 50 51]
[56 57 58 59]]
[[36 37 38 39]
[44 45 46 47]
[52 53 54 55]
[60 61 62 63]]]]
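A good way to convince an interviewer that the reshape/transpose combination is correct is to compare each quadrant against plain slicing:

```python
import numpy as np

pixels = np.arange(64).reshape(8, 8)
quadrants = pixels.reshape(2, 4, 2, 4).transpose(0, 2, 1, 3)

# each quadrant should match the corresponding 4x4 slice
assert (quadrants[0, 0] == pixels[:4, :4]).all()  # top-left
assert (quadrants[0, 1] == pixels[:4, 4:]).all()  # top-right
assert (quadrants[1, 0] == pixels[4:, :4]).all()  # bottom-left
assert (quadrants[1, 1] == pixels[4:, 4:]).all()  # bottom-right
print("all four quadrants verified")
```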
So, these are 20 problems you can practice to crack Data Science interviews. All of them are popular in Data Science interviews, especially for freshers.
Summary
So, if you are preparing for data science interviews, practice these interview problems based on SQL, pandas, NumPy, statistics, and other foundational concepts in data science. I hope you liked this article on 20 problems to crack Data Science interviews. This list will be updated every week with more questions. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.