Guide for Data Science Interview-Python

Priyanka Banerjee
6 min read · Jul 1, 2021

Here we will cover important questions on Python for the role of a Data Scientist, Data Analyst, Business Intelligence Engineer, etc.

Visit this to learn Python.

To learn about SQL interview problems, read it here.

1. Write a function that checks if a list of integers is normally distributed. Specifically, given a list of 100 numbers, write a function that returns a score that measures the deviation from normality, i.e. a normally distributed list of integers would return 0. (Amazon | Data Scientist)

Answer:

I have used the three-sigma rule of thumb for the normal distribution (the 68-95-99.7 rule): values within one standard deviation of the mean account for about 68% of the list, values within two standard deviations for about 95%, and values within three standard deviations for about 99.7%.

a. Calculate the mean and standard deviation of the list

def normality_test(sample_list):
    l = len(sample_list)
    avg_num = sum(sample_list) / l
    sigma_num = (sum((x - avg_num) ** 2 for x in sample_list) / l) ** .5

b. Check the fraction of values within one, two and three standard deviations of the mean

    num1 = num2 = num3 = 0
    for x in sample_list:
        if (x < sigma_num + avg_num) and (x > avg_num - sigma_num):
            num1 += 1
        if (x < (2 * sigma_num) + avg_num) and (x > avg_num - (2 * sigma_num)):
            num2 += 1
        if (x < (3 * sigma_num) + avg_num) and (x > avg_num - (3 * sigma_num)):
            num3 += 1
    d1 = num1 / l
    d2 = num2 / l
    d3 = num3 / l

c. Now that we have the fraction of values within 1SD, 2SD and 3SD, we can compare them with the expected 0.68, 0.95 and 0.997. The closer each fraction is to its expected value, the closer the list is to a normal distribution. Returning the total absolute difference gives a score that is close to 0 for a normally distributed list and grows as the list deviates from normality.

    return abs(d1 - 0.68) + abs(d2 - 0.95) + abs(d3 - 0.997)
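
As a quick sanity check on the 68%, 95% and 99.7% figures used above, the exact probabilities for a normal distribution can be computed with the error function: P(|Z| < k) = erf(k / sqrt(2)). This is a minimal sketch using only the standard library; the helper name within_k_sigma is made up for illustration.

```python
import math

def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Prints 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3.
    print(k, round(within_k_sigma(k), 4))
```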

2. Write a function that can take a string and return a list of bigrams. (Indeed, Microsoft, Facebook | Roles: Research Scientist, ML Engineer, Data Scientist)

Answer:

text = ['Have free hours and love children?', 'Drive kids to school', 'soccer practice and other activities.']
output = []
for l in text:
    for b in zip(l.split(" ")[:-1], l.split(" ")[1:]):
        output.append(b)
print(output)
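
Since the question asks for a function that takes a single string, the same zip idea can be wrapped up as follows. This is a sketch, not the only way to do it; the function name bigrams is my own choice, and it splits on any whitespace rather than on single spaces.

```python
def bigrams(sentence):
    """Return the list of adjacent word pairs in a string."""
    words = sentence.split()
    # Pair each word with the word that follows it.
    return list(zip(words, words[1:]))

print(bigrams('Drive kids to school'))
# [('Drive', 'kids'), ('kids', 'to'), ('to', 'school')]
```

A string with fewer than two words simply returns an empty list.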

3. Let’s say you’re given a list of standardized test scores from high schoolers from grades 9 to 12.

Given the dataset, write code in Pandas to return the cumulative percentage of students that received scores within the buckets of <50, <75, <90, <100. (Google | Data Scientist)

Input Dataframe
Output Dataframe

Answer:

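import pandas as pd

data = {'user_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        'grade': [10, 10, 11, 10, 11, 10, 10, 10, 10, 10, 10, 10],
        'test_score': [85, 60, 90, 30, 99, 44, 84, 93, 90, 98, 89, 78]
        }
df = pd.DataFrame(data, columns=['user_id', 'grade', 'test_score'])
# print(df)

# Bucket the scores. pd.cut uses right-closed intervals by default,
# so a score of exactly 90 falls in the '<90' bucket.
bins = [0, 50, 75, 90, 100]
labels = ['<50', '<75', '<90', '<100']
df['test_score'] = pd.cut(df['test_score'], bins, labels=labels)

# Share of each grade's students per bucket (observed=False keeps empty buckets).
numer = df.groupby(['grade', 'test_score'], observed=False)['user_id'].count()
denom = df.groupby(['grade'])['user_id'].count()
df = numer.div(denom, level='grade')
df = df.reset_index()

# Running total of the shares within each grade, shown as a percentage.
cum_pct = df.groupby(['grade'])['user_id'].cumsum() * 100
df['Percentage'] = cum_pct.round(1).astype('str') + "%"
df = df[['grade', 'test_score', 'Percentage']]
print(df)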

4. There are two lists of dictionaries representing friendship beginnings and endings: friends_added and friends_removed. Each dictionary contains the user_ids and the created_at time of the friendship beginning/ending.

Write a function to generate an output which lists the pairs of friends with their corresponding timestamps of the friendship beginning and then the timestamp of the friendship ending.

Note: There can be multiple instances over time when two people became friends and unfriended; only output lists when a corresponding friendship was removed. (Facebook | Data Scientist)

Input:

friends_added = [{'user_ids': [1, 2], 'created_at': '2020-01-01'},
{'user_ids': [3, 2], 'created_at': '2020-01-02'},
{'user_ids': [2, 1], 'created_at': '2020-02-02'},
{'user_ids': [4, 1], 'created_at': '2020-02-02'}]

friends_removed = [{'user_ids': [2, 1], 'created_at': '2020-01-03'},
{'user_ids': [2, 3], 'created_at': '2020-01-05'},
{'user_ids': [1, 2], 'created_at': '2020-02-05'}]

Output:

friendships = [{
'user_ids': [1, 2],
'start_date': '2020-01-01',
'end_date': '2020-01-03'
},
{
'user_ids': [1, 2],
'start_date': '2020-02-02',
'end_date': '2020-02-05'
},
{
'user_ids': [2, 3],
'start_date': '2020-01-02',
'end_date': '2020-01-05'
},
]

Answer:

It is important to note that we are only looking for friendships that have an end date, so every friendship in our final output is present in the friends_removed list. If we start by iterating through friends_removed, we get the id pair and the end date of each entry in the final output. We then need to find the corresponding start date for each end date: the entry in friends_added for the same pair of user_ids whose created_at date is not later than the created_at date in friends_removed.

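def friendship_timeline(fd_added, fd_removed):
    fd_added = list(fd_added)  # work on a copy so the caller's list is not modified
    friendship = []
    for rem in fd_removed:
        for add in fd_added:
            same_pair = sorted(rem['user_ids']) == sorted(add['user_ids'])
            # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
            if same_pair and add['created_at'] <= rem['created_at']:
                fd_added.remove(add)  # each beginning is matched at most once
                friendship.append({
                    'user_ids': sorted(rem['user_ids']),
                    'start_date': add['created_at'],
                    'end_date': rem['created_at']
                })
                break
    return sorted(friendship, key=lambda x: (x['user_ids'], x['start_date']))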
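
The nested loop above compares every removal with every addition. As an alternative sketch (not part of the original answer), the additions can first be indexed by friend pair so each removal is matched in one lookup. The function name friendship_timeline_indexed is hypothetical, and the code assumes the dates are ISO strings (YYYY-MM-DD) so they sort correctly as text.

```python
from collections import defaultdict, deque

def friendship_timeline_indexed(friends_added, friends_removed):
    # Queue the start dates of every pair, oldest first.
    starts = defaultdict(deque)
    for add in sorted(friends_added, key=lambda d: d['created_at']):
        starts[tuple(sorted(add['user_ids']))].append(add['created_at'])

    friendships = []
    for rem in sorted(friends_removed, key=lambda d: d['created_at']):
        pair = tuple(sorted(rem['user_ids']))
        queue = starts[pair]
        # Match the removal with the oldest unmatched start that precedes it.
        if queue and queue[0] <= rem['created_at']:
            friendships.append({
                'user_ids': list(pair),
                'start_date': queue.popleft(),
                'end_date': rem['created_at'],
            })
    return sorted(friendships, key=lambda f: (f['user_ids'], f['start_date']))

friends_added = [{'user_ids': [1, 2], 'created_at': '2020-01-01'},
                 {'user_ids': [3, 2], 'created_at': '2020-01-02'},
                 {'user_ids': [2, 1], 'created_at': '2020-02-02'},
                 {'user_ids': [4, 1], 'created_at': '2020-02-02'}]
friends_removed = [{'user_ids': [2, 1], 'created_at': '2020-01-03'},
                   {'user_ids': [2, 3], 'created_at': '2020-01-05'},
                   {'user_ids': [1, 2], 'created_at': '2020-02-05'}]
print(friendship_timeline_indexed(friends_added, friends_removed))
```

The pair [4, 1] never appears in friends_removed, so it is left out of the result.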

5. Every night between 7pm and midnight, two computing jobs from two different sources are randomly started with each one lasting an hour. Unfortunately, when the jobs simultaneously run, they cause a failure in some of the company’s other nightly jobs, resulting in downtime for the company that costs $1000.

The CEO needs a single number representing the annual (365 days) cost of this problem. Write a function to simulate this problem and output an estimated cost. Bonus: how can this be solved using probability? (Wealthfront | Roles: Data Analysts, Data Scientist, Business Intelligence)

Answer: Between 7pm and midnight there are 5*60*60 seconds. We can run a simulation to approximate the probability that the two jobs overlap: generate two random start times, check whether the jobs overlap, and record 1 or 0 for each trial. The mean of these values estimates the probability of an overlap on a given night, and multiplying it by 365 nights and $1000 gives the estimated annual cost.

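import numpy as np
import pandas as pd

n_trials = 10000000
# Start time of each job, in seconds after 7pm.
task1 = np.random.randint(0, 5*60*60, size=n_trials)
task2 = np.random.randint(0, 5*60*60, size=n_trials)
data = pd.DataFrame(
    {'task1_start': task1,
     'task2_start': task2,
     })
# Each job runs for 3600 seconds, so the jobs overlap when they start within an hour of each other.
data['overlap'] = np.where(np.abs(data.task1_start - data.task2_start) <= 3600, 1, 0)
p_overlap = data['overlap'].mean()
print(p_overlap)
print(p_overlap * 365 * 1000)  # estimated annual cost in dollars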
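
For the bonus question, here is one way to get the probability directly. It is a sketch that assumes, as the simulation does, that each start time is uniform over the five-hour window and independent of the other. Measuring start times in hours, the pair of start times is a uniform point in a 5×5 square. The jobs do not overlap when the starts are at least an hour apart, which corresponds to two corner triangles that together form a 4×4 square. So P(overlap) = 1 - 16/25 = 9/25 = 0.36. The variable names below are my own.

```python
from fractions import Fraction

window_hours = 5         # jobs can start any time between 7pm and midnight
job_hours = 1            # each job runs for one hour
cost_per_failure = 1000  # dollars
days_per_year = 365

# Area of the non-overlap region divided by the area of the whole square.
p_no_overlap = Fraction(window_hours - job_hours, window_hours) ** 2
p_overlap = 1 - p_no_overlap
annual_cost = p_overlap * cost_per_failure * days_per_year

print(p_overlap, float(p_overlap), annual_cost)
# 9/25 0.36 131400
```

The simulated probability should land close to 0.36, which gives an expected annual cost of about $131,400.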

6. Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point. (Amazon, Adobe | Roles: Data Scientist, Data Analyst, Research Scientist)

Answer:

import datetime

class Solution:
    def weekly_aggregate(self, ts):
        if not ts:
            return []
        start = datetime.datetime.strptime(ts[0], "%Y-%m-%d")
        output = [[ts[0]]]

        for idx in range(1, len(ts)):
            # Timestamps in the same 7-day window join the current group.
            if self.WkNum(ts[idx], start) == self.WkNum(ts[idx - 1], start):
                output[-1].append(ts[idx])
            else:
                output.append([ts[idx]])
        return output

    def WkNum(self, time_text, start):
        # Number of whole 7-day windows elapsed since the first timestamp.
        t = datetime.datetime.strptime(time_text, "%Y-%m-%d")
        wk = (t - start).days // 7
        return wk

ts = ['2019-01-01',
      '2019-01-02',
      '2019-01-08',
      '2019-02-01',
      '2019-02-02',
      '2019-02-05']
if __name__ == "__main__":
    s = Solution()
    print(s.weekly_aggregate(ts))
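
The same grouping can be written more compactly with itertools.groupby, which collects consecutive items that share a key. This is a sketch under the question's assumption that the timestamps are already in sequential order. The function name group_by_week is hypothetical, and each 7-day window is counted from the first timestamp.

```python
from datetime import date
from itertools import groupby

def group_by_week(timestamps):
    if not timestamps:
        return []
    start = date.fromisoformat(timestamps[0])

    def week_index(text):
        # Whole 7-day windows elapsed since the first timestamp.
        return (date.fromisoformat(text) - start).days // 7

    return [list(group) for _, group in groupby(timestamps, key=week_index)]

ts = ['2019-01-01', '2019-01-02', '2019-01-08',
      '2019-02-01', '2019-02-02', '2019-02-05']
print(group_by_week(ts))
# [['2019-01-01', '2019-01-02'], ['2019-01-08'], ['2019-02-01', '2019-02-02'], ['2019-02-05']]
```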

7. Write a function that takes in a list of dictionaries with a key and list of integers and returns a dictionary with the standard deviation of each list. (Snapchat, Tinder | Roles: Data Scientist, Data Analyst)

Answer:

import numpy as np

def std(data):
    # Population standard deviation: square root of the mean squared deviation.
    mu = np.average(data)
    output = np.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
    return output

def get_stddev(records):
    new_dict = {}
    for x in records:
        new_dict[x.get('key')] = std(x.get('values'))
    return new_dict
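
To cross-check the result without NumPy, the standard library's statistics.pstdev computes the same population standard deviation. This is a minimal sketch; the function name get_stddev_stdlib and the sample input are made up for illustration, and the input is assumed to use the 'key' and 'values' fields as above.

```python
import statistics

def get_stddev_stdlib(records):
    """Map each record's key to the population standard deviation of its values."""
    return {r['key']: statistics.pstdev(r['values']) for r in records}

sample = [
    {'key': 'list1', 'values': [2, 4, 4, 4, 5, 5, 7, 9]},  # mean 5, variance 4
    {'key': 'list2', 'values': [1, 1, 1, 1]},              # no spread at all
]
print(get_stddev_stdlib(sample))
```

Note that statistics.stdev (without the p) divides by n - 1 instead of n, so it gives the sample standard deviation and will not match the function above.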

8. Write a function to calculate the root mean squared error of a regression model. The function should take in two lists, one that represents the predictions and another with the target values. (Snapchat | Roles: Data Scientist, ML Engineer)

Answer:

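import math

def calculate_rmse(y_true, y_pred):
    # Both lists must be non-empty and of the same length to be compared pairwise.
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    sq = sum((x - y) ** 2 for x, y in zip(y_true, y_pred))
    output_rmse = math.sqrt(sq / len(y_true))
    return output_rmse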

9. Let’s say we have a five-by-five matrix where each row is a company and each column represents a department. Each cell of the matrix displays the number of employees working in that particular department at each company.

Write a program to return a five by five matrix that contains the percentage of employees employed in each department compared to the total number of employees at each company. (Google | Data Scientist)

Answer:

employees_by_department = [
    [10, 20, 30, 30, 10],
    [15, 15, 5, 10, 5],
    [150, 50, 100, 150, 50],
    [300, 200, 300, 100, 100],
    [1, 5, 1, 1, 2]
]
a = []
for row in employees_by_department:
    row_sum = 0
    for j in row:
        row_sum += j
    # Shares are expressed as fractions of 1; multiply by 100 for percent.
    new_row = [j / row_sum for j in row]
    a.append(new_row)
print("percentage_by_department =", a)

10. Matrix Transpose

Answer:

matrix = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
# Build one output row per column of the input, whatever its width.
transposed = [[row[i] for row in matrix] for i in range(len(matrix[0]))]
print(transposed)
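
A common shorthand for the same operation is zip(*matrix), which unpacks the rows and groups their elements by position. This is a small sketch of that idiom, assuming all rows have the same length.

```python
matrix = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]

# zip(*matrix) yields one tuple per column; convert each back to a list.
transposed = [list(column) for column in zip(*matrix)]
print(transposed)
# [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```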

Conclusion:

Technical interviews can be tough no matter what the role is. Hope this study guide helps you!

Note: This document will be updated with time and I will try to cover most of the important questions.

Priyanka Banerjee

Sr. Data Scientist | Work in Finance & Health Domain | Keep Learning, Keep Sharing.