Guide for Data Science Interview-SQL

5 min readJul 1, 2021

Technical interview rounds for the role of a Data Scientist require lots of preparations. I have given several interviews and what I have observed is every company has their own set of rules of conducting it. Some are programming and ML approach driven, others are metric driven. The area of focuses are SQL, Statistics, Programming, Dashboarding, Machine Learning.

I will try to cover one topic in each article.

To read about Python Interview Questions, check this out.

To start with your data science journey, you can visit this link.

(Amazon | Role: Business Intelligence)

Given the transactions table,

column — type: id — integer, use_id — integer, created_at — datetime, product_id — integer, quantity — integer

write a query to get the average quantity of each product purchased for each transaction, every year. The output should include the year, product_id, and average quantity for that product sorted by year and product_id ascending. Round avg_quantity to two decimal places.

Output: column — type: year— integer, product_id— integer, avg_quantity— integer,

Answer:

SELECT YEAR(t.created_at) as year, t.product_id, round(avg(t.quantity),2) as avg_quantity FROM transactions AS t GROUP BY YEAR(t.created_at), product_id

2. (Facebook, Amazon | Roles: Product Analyst, Data Scientist, Business Intelligence)

We have a table that represents the total number of messages sent between two users by date on messenger.

column — type: id — integer, date — date, user1 — integer, user2 — integer, msg_count — integer

a. What are some insights that could be derived from this table?

b. Write a query to get the distribution of the number of conversations created by each user by day in the year 2020.

Output: column — type: num_conversations — integer, frequency — integer

Answer:

a. Insights are : Avg no. of messages per conversation, Total number of messages sent per day, Number of conversations started, Comparison of number of messages exchanged now vs six months back, etc

b. WITH cte as ( SELECT user1, DATE(date), COUNT(DISTINCT user2) AS num_conversations FROM messages WHERE YEAR(date) = '2020' GROUP BY user1,date ) SELECT num_conversations, COUNT(*) AS frequency FROM cte GROUP BY num_conversations

3. (Facebook, Amazon, Splunk, Nextdoor, Oracle | Roles: Data Scientist, Data Analyst, Business Analyst)

Users table : columns — type: id — int, name — varchar, neighborhood_id — int, created_at — datetime

Neighborhoods table : columns — type: id — int, name — varchar, city_id — int

Write a query that returns all of the neighborhoods that have 0 users

Answer:

SELECT n.name as name FROM users as u JOIN neighborhoods as n on u.neighborhood_id = n.id WHERE u.id IS NULL

4. (Amazon | Role: Business Intelligence)

columns — type: id — int, name — varchar, created_at — datetime

Given a users table, write a query to get the cumulative number of new users added by day, with the total reset every month.

Answer:

Here we have to make sure that the total amount of users on March 31st rolls over back to 0 on April 1st

/* Daily count of users*/

WITH daily_total AS ( SELECT DATE(created_at) AS dt, COUNT(*) AS cnt FROM users GROUP BY dt )

/* Now do a self join where we set left table frequency index more than the right.*/

SELECT t.dt AS date, SUM(u.cnt) AS monthly_cumulative FROM daily_total AS t LEFT JOIN daily_total AS u ON t.dt >= u.dt AND MONTH(t.dt) = MONTH(u.dt) AND YEAR(t.dt) = YEAR(u.dt) GROUP BY date

5. (Amazon | Role: Business Intelligence)

Write a query to return pairs of projects where the end date of one project matches the start date of another project.

Answer:

SELECT pje.title AS project_title_end, pjs.title AS project_title_start, pje.end_date AS date FROM projects AS pje join projects as ps ON DATE(pjs.start_date) = DATE(pje.end_date)

6. (Amazon | Role: Business Intelligence)

Write a query to identify customers who placed more than three transactions each in both 2019 and 2020.

Output: column — type: customer_name — string

Answer:

WITH transact_ct AS ( SELECT u.id,u.name, SUM(CASE WHEN YEAR(t.created_at)= ‘2019’ THEN 1 ELSE 0 END) AS t_2019, SUM(CASE WHEN YEAR(t.created_at)= ‘2020’ THEN 1 ELSE 0 END) AS t_2020 FROM transactions t JOIN users u ON u.id = user_id GROUP BY u.id,u.name HAVING t_2019 > 3 AND t_2020 > 3 )

SELECT tc.name AS customer_name FROM transact_ct tc

7. (Microsoft, Facebook, Uber, Amazon | Roles: Data Analyst, data Scientist, Business Intelligence, Data Engineer)

Given the tables above, select the top 3 departments with at least ten employees and rank them according to the percentage of their employees making over 100K in salary.

Answer:

SELECT CAST(SUM(CASE WHEN salary > 100000 THEN 1 ELSE 0 END) AS DECIMAL)/COUNT(*) AS percentage_over_100K, d.name as department_name, COUNT(*) AS number_of_employees FROM departments AS d LEFT JOIN employees AS e ON d.id = e.department_id GROUP BY d.name HAVING COUNT(*) >= 10 ORDER BY 2 DESC LIMIT 3

Conclusion:

Technical interviews can be tough no matter what the role is. Hope this study guide helps you!

Note: This document will be updated with time and I will try to cover most of the important questions.

Guide for Data Science Interview-SQL

Conclusion:

Written by Priyanka Banerjee