SQL Concepts for Data Science Interviews

Aman Kharwal
November 26, 2024

SQL Concepts for Data Science Interviews

Data Scientists use SQL to solve complex data manipulation, aggregation, and transformation problems. In Data Science interviews, Interviewers often test your SQL knowledge to work with structured and semi-structured data, create readable queries, and handle real-world challenges like hierarchy, JSON, and summarization. So, if you are preparing for Data Science interviews and looking for practice problems based on SQL, this article is for you. In this article, I’ll take you through a guide to essential SQL concepts for Data Science interviews, with example questions.

SQL Concepts for Data Science Interviews

Below are must-know SQL concepts for Data Science interviews, each explained in detail with an example question.

Window Functions

Window functions perform calculations across a set of rows related to the current row without collapsing the result into a single output. They are useful for running totals, rankings, and moving averages.

Example Question: You have a sales table with columns: sale_id, sale_date, store_id, and amount. Write a query to calculate the running total of sales for each store ordered by date.

Here’s how to solve this SQL problem:

SELECT 
    store_id,
    sale_date,
    amount,
    SUM(amount) OVER (PARTITION BY store_id ORDER BY sale_date) AS running_total
FROM 
    sales
ORDER BY 
    store_id, sale_date;

The output will show each store’s store_id, sale_date, amount, and a cumulative running_total of sales up to each date, ordered by store_id and sale_date.

Common Table Expressions (CTEs)

CTEs allow you to create temporary result sets that can be referred to within the same query. They improve query readability and can be used for recursive queries.

Example Question: You have an employees table with columns: employee_id, manager_id, and salary. Write a query to calculate the total salary expense for each manager, including their direct reports.

Here’s how to solve this SQL problem:

WITH ManagerSalaries AS (
    SELECT 
        manager_id,
        SUM(salary) AS total_salary
    FROM 
        employees
    GROUP BY 
        manager_id
)
SELECT 
    m.manager_id,
    e.employee_id,
    e.salary,
    m.total_salary
FROM 
    ManagerSalaries m
JOIN 
    employees e
ON 
    m.manager_id = e.manager_id;

The output will show each manager’s manager_id, their employees’ employee_id and salary, and the total salary managed by that manager.

Recursive Queries

Recursive queries use CTEs to process hierarchical or sequential data. They are particularly useful for analyzing tree-like structures, such as organizational charts or file systems.

Example Question: You have a departments table with columns: department_id and parent_department_id. Write a query to find all sub-departments under a given department.

Here’s how to solve this SQL problem:

WITH RecursiveDept AS (
    SELECT 
        department_id,
        parent_department_id
    FROM 
        departments
    WHERE 
        department_id = 1 -- Starting point

    UNION ALL

    SELECT 
        d.department_id,
        d.parent_department_id
    FROM 
        departments d
    INNER JOIN 
        RecursiveDept r
    ON 
        d.parent_department_id = r.department_id
)
SELECT 
    * 
FROM 
    RecursiveDept;

The output will show all departments and their parent-child relationships starting from the specified department (department_id = 1), including all its sub-departments recursively.

Handling JSON Data

Modern databases like PostgreSQL and MySQL support JSON data types, allowing you to store and query semi-structured data. It is critical for handling APIs or logs.

Example Question: You have a user_data table with a user_info column containing JSON data. Write a query to extract the email field and count users by their country.

Here’s how to solve this SQL problem:

SELECT 
    user_info->>'country' AS country,
    COUNT(*) AS user_count
FROM 
    user_data
GROUP BY 
    user_info->>'country';

The output will display each country extracted from the JSON column user_info and the corresponding count of users from that country.

Pivoting Data with CASE Statements

Pivoting transforms rows into columns, which helps create reports or summarize data. It can be achieved using CASE statements or database-specific pivot functions.

Example Question: You have a sales table with columns: store_id, sale_date, and amount. Write a query to show the total sales for each store, with separate columns for each month.

Here’s how to solve this SQL problem:

SELECT 
    store_id,
    SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 1 THEN amount ELSE 0 END) AS january_sales,
    SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 2 THEN amount ELSE 0 END) AS february_sales,
    SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 3 THEN amount ELSE 0 END) AS march_sales
FROM 
    sales
GROUP BY 
    store_id;

The output will show each store’s store_id along with total sales for January, February, and March as separate columns (january_sales, february_sales, march_sales).

Summary

So, in Data Science interviews, Interviewers often test your SQL knowledge to work with structured and semi-structured data, create readable queries, and handle real-world challenges like hierarchy, JSON, and summarization. I hope you liked this article on SQL concepts for Data Science interviews. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.

Leave a ReplyCancel reply