A Crisp Guide to Pandas for Data Engineering — Part 2

Priyanka Banerjee
6 min read · Oct 20, 2020

In my last blog, I discussed why certain pandas capabilities are essential for a Data Engineer.

In this story I will cover some important functions you cannot ignore while working with data.

Contents:

  1. Aggregating
  2. Reshaping
  3. Duplicate Handling
  4. Dataframe Transformations
  5. Lambda function
  6. Apply function

Aggregating :

GroupBy is an immensely useful function.

# Consider a dataframe for a mobile operator company with the following columns:
# 'date', 'duration', 'item', 'network', 'network_type', 'month'
# Let us get the sum of the durations per month
df.groupby('month')['duration'].sum()
Output:
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
Name: duration, dtype: float64
# For each network and each month, let us get the sum of durations and count of items
stat = df.groupby(['month', 'network'], as_index=False).agg({'duration': 'sum', 'item': 'count'})
stat.head(6)
Output:
month network duration item
0 2014-11 Meteor 1531.000 23
1 2014-11 Tesco 4048.000 23
2 2014-11 Three 12483.000 64
3 2014-11 Vodafone 4371.000 79
4 2014-11 data 998.441 29
5 2014-11 landline 2906.000 5
# Let us get the sum of durations, for calls only for each network
df[df['item'] == 'call'].groupby('network')['duration'].sum()
Output:
network
Meteor 7200.0
Tesco 13828.0
Three 36464.0
Vodafone 14621.0
landline 18433.0
voicemail 1775.0
Name: duration, dtype: float64
# Let us see how many calls, data entries, and sms are present in each month
df.groupby(['month', 'item'])['date'].count()
Output:
month item
2014-11 call 107
data 29
sms 94
2014-12 call 79
data 30
sms 48
2015-01 call 88
data 31
sms 86
# Let us group the data frame by month and item and extract a number of stats from each group
df.groupby(['month', 'item']).agg({'duration':'sum','network_type': "count",'date': 'first'})
Output:

duration network_type date
month item
2014-11 call 25547.000 107 15/10/14 06:58
data 998.441 29 15/10/14 06:58
sms 94.000 94 16/10/14 22:18
2014-12 call 13561.000 79 14/11/14 17:24
data 1032.870 30 13/11/1406:58
sms 48.000 48 14/11/14 17:28
2015-01 call 17070.000 88 15/12/14 20:03
data 1067.299 31 13/12/14 06:58
# Based on the requirement, you can use different statistical measures, like:
df.groupby(['month']).agg({'duration':'mean'})
df.groupby(['month']).agg({'duration':'std'})
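As an aside, newer pandas versions (0.25+) also support named aggregation, which lets you name the output columns directly instead of passing a dict. A minimal sketch on a made-up toy frame (the column names mirror the example above, but the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy call log, just to demonstrate the syntax
df = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12'],
    'duration': [10.0, 20.0, 5.0],
    'item': ['call', 'sms', 'call'],
})

# Named aggregation: output_column = (input column, aggregation function)
stat = df.groupby('month').agg(
    total_duration=('duration', 'sum'),
    item_count=('item', 'count'),
)
print(stat)
```

This avoids ambiguity when you want two different aggregations on the same input column.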

Reshaping :

  • Wide format to long format using melt() function:
# creating a dataframe
df = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Sally', 3: 'Alexa', 4: 'Alan', 5: 'Emma'},
                   'Course': {0: 'Masters', 1: 'Graduate', 2: 'Graduate', 3: 'Graduate', 4: 'Graduate', 5: 'Masters'},
                   'Age': {0: 27, 1: 23, 2: 21, 3: 22, 4: 20, 5: 19}})
Output:
Name Course Age
0 John Masters 27
1 Bob Graduate 23
2 Sally Graduate 21
3 Alexa Graduate 22
4 Alan Graduate 20
5 Emma Masters 19
df1 = pd.melt(df, id_vars =['Name'], value_vars =['Course','Age'])
df1
Output:
Name variable value
0 John Course Masters
1 Bob Course Graduate
2 Sally Course Graduate
3 Alexa Course Graduate
4 Alan Course Graduate
5 Emma Course Masters
6 John Age 27
7 Bob Age 23
8 Sally Age 21
9 Alexa Age 22
10 Alan Age 20
11 Emma Age 19
  • Wide format to long format using stack() function:
df_stacked = df.stack()
print(df_stacked.head(15))
Output:
0 Name John
Course Masters
Age 27
1 Name Bob
Course Graduate
Age 23
2 Name Sally
Course Graduate
Age 21
3 Name Alexa
Course Graduate
Age 22
4 Name Alan
Course Graduate
Age 20
  • Long format to wide format using pivot() function:
# Let us convert the above long-format dataframe into wide format:
df2 = df1.pivot(index='Name', columns='variable', values='value')
df2
Output:
variable Age Course
Name
Alan 20 Graduate
Alexa 22 Graduate
Bob 23 Graduate
Emma 19 Masters
John 27 Masters
Sally 21 Graduate
  • The unstack() function basically reverses stack():
df = df_stacked.unstack()
df
Output:
Name Course Age
0 John Masters 27
1 Bob Graduate 23
2 Sally Graduate 21
3 Alexa Graduate 22
4 Alan Graduate 20
5 Emma Masters 19

Duplicate Handling :

An important part of data analysis is identifying duplicate values and removing them. We can do so using the drop_duplicates() function.

# Removing rows with the same "First Name". Here "First Name" is the column name
sample_df.drop_duplicates(subset="First Name", keep=False, inplace=True)
# Dropping duplicates across multiple columns
sample_df.drop_duplicates(subset=['First Name', 'Age'], keep=False)
'''
keep : {‘first’, ‘last’, False}, default ‘first’
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
'''
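To see how the three keep options differ in practice, here is a small self-contained sketch (the frame and names are made up for illustration):

```python
import pandas as pd

# Illustrative frame with a repeated "First Name"
sample_df = pd.DataFrame({
    'First Name': ['Alice', 'Bob', 'Alice'],
    'Age': [30, 25, 31],
})

first = sample_df.drop_duplicates(subset='First Name', keep='first')  # keeps rows 0 and 1
last = sample_df.drop_duplicates(subset='First Name', keep='last')    # keeps rows 1 and 2
none = sample_df.drop_duplicates(subset='First Name', keep=False)     # keeps row 1 only
print(len(first), len(last), len(none))  # -> 2 2 1
```

Note that keep=False removes every occurrence of a duplicated key, which is useful when only truly unique records should survive.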

Dataframe Transformation :

  • Convert List to Pandas Dataframe :
people_name = ['John','Mark','Sally','Jack','Jill']
df = pd.DataFrame(people_name,columns=['First_Name'])
print (df)
# Usage of transpose()
people_list = [['John','Mark','Sally','Jack','Jill'],['Smith','Brown','Lee','Jones','Ford'],[21,38,42,28,55]]
df = pd.DataFrame(people_list).transpose()
df.columns = ['First_Name','Last_Name','Age']
Output:
First_Name Last_Name Age
0 John Smith 21
1 Mark Brown 38
2 Sally Lee 42
3 Jack Jones 28
4 Jill Ford 55
# A different way of representation
people_list = [['John','Mark','Sally','Jack','Jill'],['Smith','Brown','Lee','Jones','Ford'],[21, 38, 42, 28, 55]]
df = pd.DataFrame(people_list, index=['First_Name','Last_Name','Age'], columns=['a','b','c','d','e'])
Output:
a b c d e
First_Name John Mark Sally Jack Jill
Last_Name Smith Brown Lee Jones Ford
Age 21 38 42 28 55
  • Convert Dictionary to Pandas Dataframe :
my_dict = {'John':21,'Mark':38,'Sally':42,'Jack':28,'Jill':55}
df = pd.DataFrame(list(my_dict.items()),columns = ['Name','Age'])
print (df)
Output:
Name Age
0 John 21
1 Mark 38
2 Sally 42
3 Jack 28
4 Jill 55
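If you'd rather keep the dictionary keys as the index instead of a column, pd.DataFrame.from_dict with orient='index' does that directly; a minimal sketch with the same made-up names:

```python
import pandas as pd

my_dict = {'John': 21, 'Mark': 38, 'Sally': 42, 'Jack': 28, 'Jill': 55}

# orient='index' turns each dictionary key into a row label
df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['Age'])
print(df)
```

This is handy when the keys are natural row identifiers, such as names or IDs.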
  • Convert Array to Pandas Dataframe :
import numpy as np
# creating the Numpy array
array = np.array([[1, 1, 1], [2, 4, 8], [3, 9, 27],[4, 16, 64], [5, 25, 125], [6, 36, 216], [7, 49, 343]])
# You may or may not create a list of index names. If you don't create one, a default will be created.
# index_values = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh']
# creating a list of column names
column_values = ['number', 'squares', 'cubes']
# creating the dataframe
df = pd.DataFrame(data = array,
# index = index_values,
columns = column_values)
print(df)
Output(without specified index column):
number squares cubes
0 1 1 1
1 2 4 8
2 3 9 27
3 4 16 64
4 5 25 125
5 6 36 216
6 7 49 343

Lambda Function :

A lambda function comes in handy when building complex logic for a new column, a filter, and much more. It can take multiple arguments but contains only one expression.

x = lambda a, b : a * b
print(x(5, 6))
Output: 30
# Use a lambda inside a function that always doubles the number you take as input:
def myfunc(n):
return lambda a : a * n
doubler = myfunc(2) # doubler = lambda a: a*2
print(doubler(11)) # for a=11, compute 11*2
Output: 22
# Use a lambda to perform a calculation and create a new column in a dataframe:
# Create a dataframe with students' names and their total marks
values = [['Rohan',455],['Elvish',250],['Deepak',495],['Soni',400],['Radhika',350],['Vansh',450]]
df = pd.DataFrame(values,columns=['Name','Total_Marks'])
# Compute percentage of 'Total_Marks' column using df.assign()
df = df.assign(Percentage = lambda x: (x['Total_Marks'] /500 * 100))
Output:
Name Total_Marks Percentage
0 Rohan 455 91.0
1 Elvish 250 50.0
2 Deepak 495 99.0
3 Soni 400 80.0
4 Radhika 350 70.0
5 Vansh 450 90.0
# Usage of the apply function along with lambda:
df = df.set_index('Name')
df = df.apply(lambda x: np.square(x) if x.name == 'Deepak' else x, axis=1)
Output:
Total_Marks Percentage
Name
Rohan 455.0 91.0
Elvish 250.0 50.0
Deepak 245025.0 9801.0
Soni 400.0 80.0
Radhika 350.0 70.0
Vansh 450.0 90.0

Lambda functions are often passed as arguments to built-in functions like filter() and map().

  • Usage of lambda function with filter()
# The function filter() takes in a function and a list as arguments.
# Program to filter out only the even items from a list
my_list = [1, 5, 4, 6, 8, 11, 3, 12]
new_list = list(filter(lambda x: (x%2 == 0) , my_list))
print(new_list)
Output: [4, 6, 8, 12]
  • Usage of lambda function with map()
# Program to double each item in a list using map()

my_list = [1, 5, 4, 6, 8, 11, 3, 12]
new_list = list(map(lambda x: x * 2 , my_list))
print(new_list)
Output: [2, 10, 8, 12, 16, 22, 6, 24]

Apply Function :

As the name suggests, the apply() function lets you pass a function and apply it to every value of a Series (or to every row or column of a dataframe).

# Usage of apply() to add 5 to each value in a series, returning a new series:
new = s.apply(lambda num: num + 5)
# Remove the $ present in the values of the "salary" column and convert to float:
def money_to_float(money_str):
return float(money_str.replace("$","").replace(",",""))

df['new_salary'] = df['salary'].apply(money_to_float)
  • Usage of apply() function to every single row of a dataframe
# Use the "height" and "width" columns to calculate the "area"
def calculate_area(row):
return row['height'] * row['width']

rectangles_df['area'] = rectangles_df.apply(calculate_area, axis=1)
Output:

height width area
0 40.0 10 400.0
1 20.0 9 180.0
2 3.4 4 13.6
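Worth noting: for simple arithmetic like this, a vectorized column operation is usually much faster than apply with axis=1, since it avoids a per-row Python function call. A sketch using the same made-up values:

```python
import pandas as pd

rectangles_df = pd.DataFrame({'height': [40.0, 20.0, 3.4], 'width': [10, 9, 4]})

# Vectorized: multiplies whole columns at once, no per-row Python call
rectangles_df['area'] = rectangles_df['height'] * rectangles_df['width']
print(rectangles_df['area'].tolist())  # -> [400.0, 180.0, 13.6]
```

Reserve apply(..., axis=1) for logic that genuinely cannot be expressed as column-wise operations.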

Conclusion :

As you’ve observed, lambda and apply() take care of a lot of complex functionality; you can find plenty more to practice on the internet.

I never suggest any particular course, since I believe requirements vary for each person.

Follow me on Medium if you found this useful and want to come across more such posts in future.
