Guide for Data Science Interview-SQL

Priyanka Banerjee
5 min readJul 1, 2021
Source: Google Image

Technical interview rounds for the role of a Data Scientist require lots of preparations. I have given several interviews and what I have observed is every company has their own set of rules of conducting it. Some are programming and ML approach driven, others are metric driven. The area of focuses are SQL, Statistics, Programming, Dashboarding, Machine Learning.

I will try to cover one topic in each article.

To read about Python Interview Questions, check this out.

To start with your data science journey, you can visit this link.

  1. (Amazon | Role: Business Intelligence)

Given the transactions table,

column — type: id — integer, use_id — integer, created_at — datetime, product_id — integer, quantity — integer

write a query to get the average quantity of each product purchased for each transaction, every year. The output should include the year, product_id, and average quantity for that product sorted by year and product_id ascending. Round avg_quantity to two decimal places.

Output: column — type: year— integer, product_id— integer, avg_quantity— integer,

Answer:

SELECT YEAR(t.created_at) as year, t.product_id, round(avg(t.quantity),2) as avg_quantity
FROM transactions AS t
GROUP BY YEAR(t.created_at), product_id

2. (Facebook, Amazon | Roles: Product Analyst, Data Scientist, Business Intelligence)

We have a table that represents the total number of messages sent between two users by date on messenger.

column — type: id — integer, date — date, user1 — integer, user2 — integer, msg_count — integer

a. What are some insights that could be derived from this table?

b. Write a query to get the distribution of the number of conversations created by each user by day in the year 2020.

Output: column — type: num_conversations — integer, frequency — integer

Answer:

a. Insights are : Avg no. of messages per conversation, Total number of messages sent per day, Number of conversations started, Comparison of number of messages exchanged now vs six months back, etc

b. WITH cte as
(
SELECT user1, DATE(date), COUNT(DISTINCT user2) AS num_conversations
FROM messages
WHERE YEAR(date) = '2020'
GROUP BY user1,date
)
SELECT num_conversations, COUNT(*) AS frequency
FROM cte
GROUP BY num_conversations

3. (Facebook, Amazon, Splunk, Nextdoor, Oracle | Roles: Data Scientist, Data Analyst, Business Analyst)

Users table : columns — type: id — int, name — varchar, neighborhood_id — int, created_at — datetime

Neighborhoods table : columns — type: id — int, name — varchar, city_id — int

Write a query that returns all of the neighborhoods that have 0 users

Answer:

SELECT n.name as name
FROM users as u JOIN neighborhoods as n
on u.neighborhood_id = n.id
WHERE u.id IS NULL

4. (Amazon | Role: Business Intelligence)

columns — type: id — int, name — varchar, created_at — datetime

Given a users table, write a query to get the cumulative number of new users added by day, with the total reset every month.

Output table

Answer:

Here we have to make sure that the total amount of users on March 31st rolls over back to 0 on April 1st

/* Daily count of users*/

WITH daily_total AS
(
SELECT DATE(created_at) AS dt, COUNT(*) AS cnt
FROM users
GROUP BY dt
)

/* Now do a self join where we set left table frequency index more than the right.*/

SELECT t.dt AS date, SUM(u.cnt) AS monthly_cumulative
FROM daily_total AS t
LEFT JOIN daily_total AS u
ON t.dt >= u.dt
AND MONTH(t.dt) = MONTH(u.dt)
AND YEAR(t.dt) = YEAR(u.dt)
GROUP BY date

5. (Amazon | Role: Business Intelligence)

Write a query to return pairs of projects where the end date of one project matches the start date of another project.

Input Table
Output Table

Answer:

SELECT pje.title AS project_title_end, pjs.title AS project_title_start, pje.end_date AS date
FROM projects AS pje join projects as ps ON DATE(pjs.start_date) = DATE(pje.end_date)

6. (Amazon | Role: Business Intelligence)

Write a query to identify customers who placed more than three transactions each in both 2019 and 2020.

Input two tables

Output: column — type: customer_name — string

Answer:

WITH transact_ct AS
(
SELECT u.id,u.name,
SUM(CASE WHEN YEAR(t.created_at)= ‘2019’ THEN 1 ELSE 0 END) AS t_2019,
SUM(CASE WHEN YEAR(t.created_at)= ‘2020’ THEN 1 ELSE 0 END) AS t_2020
FROM transactions t
JOIN users u ON u.id = user_id
GROUP BY u.id,u.name
HAVING t_2019 > 3 AND t_2020 > 3
)

SELECT tc.name AS customer_name
FROM transact_ct tc

7. (Microsoft, Facebook, Uber, Amazon | Roles: Data Analyst, data Scientist, Business Intelligence, Data Engineer)

Input Tables

Given the tables above, select the top 3 departments with at least ten employees and rank them according to the percentage of their employees making over 100K in salary.

Output Table

Answer:

SELECT CAST(SUM(CASE WHEN salary > 100000 THEN 1 ELSE 0 END) AS DECIMAL)/COUNT(*) AS percentage_over_100K,
d.name as department_name, COUNT(*) AS number_of_employees
FROM departments AS d
LEFT JOIN employees AS e ON d.id = e.department_id
GROUP BY d.name
HAVING COUNT(*) >= 10
ORDER BY 2 DESC
LIMIT 3

Conclusion:

Technical interviews can be tough no matter what the role is. Hope this study guide helps you!

Note: This document will be updated with time and I will try to cover most of the important questions.

--

--

Priyanka Banerjee

Sr. Data Scientist | Work in Finance & Health Domain | Keep Learning, Keep Sharing.