A Crisp Guide to Pandas for Data Engineering — Part 2

Priyanka Banerjee
6 min read · Oct 20, 2020

In my last blog, I discussed why certain pandas capabilities are essential for a Data Engineer.

In this story I will cover some important functions you cannot ignore while working with data.

Contents:

  1. Aggregating
  2. Reshaping
  3. Duplicate Handling
  4. Dataframe Transformations
  5. Lambda function
  6. Apply function

Aggregating :

GroupBy is an immensely useful function.

# Consider a dataframe for a mobile operator company with the following columns:
# 'date', 'duration', 'item', 'network', 'network_type', 'month'
# Let us get the sum of the durations per month
df.groupby('month')['duration'].sum()
Output:
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
Name: duration, dtype: float64
# For each network and each month, let us get the sum of durations and count of items
stat = df.groupby(['month', 'network'], as_index=False).agg({'duration': 'sum', 'item': 'count'})
stat.head(6)
Output:
month network duration item
0 2014-11 Meteor 1531.000 23
1 2014-11 Tesco 4048.000 23
2 2014-11 Three 12483.000 64
3 2014-11 Vodafone 4371.000 79
4 2014-11 data 998.441 29
5 2014-11 landline 2906.000 5
# Let us get the sum of durations, for calls only for each network
df[df['item'] == 'call'].groupby('network')['duration'].sum()
Output:
network
Meteor 7200.0
Tesco 13828.0
Three 36464.0
Vodafone 14621.0
landline 18433.0
voicemail 1775.0
Name: duration, dtype: float64
# Let us see how many calls, data entries, and sms are present in each month
df.groupby(['month', 'item'])['date'].count()
Output:
month item
2014-11 call 107
data 29
sms 94
2014-12 call 79
data 30
sms 48
2015-01 call 88
data 31
sms 86
# Let us group the data frame by month and item and extract a number of stats from each group
df.groupby(['month', 'item']).agg({'duration':'sum','network_type': "count",'date': 'first'})
Output:

duration network_type date
month item
2014-11 call 25547.000 107 15/10/14 06:58
data 998.441 29 15/10/14 06:58
sms 94.000 94 16/10/14 22:18
2014-12 call 13561.000 79 14/11/14 17:24
data 1032.870 30 13/11/1406:58
sms 48.000 48 14/11/14 17:28
2015-01 call 17070.000 88 15/12/14 20:03
data 1067.299 31 13/12/14 06:58
# Based on the requirement, you can use different statistical measures, like:
df.groupby(['month']).agg({'duration':'mean'})
df.groupby(['month']).agg({'duration':'std'})
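As an aside, newer pandas versions (0.25+) also support named aggregation, which lets you name the output columns directly instead of passing a dict. A minimal sketch on a made-up toy frame (the column names mirror the example above, but the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy call log, just to demonstrate the syntax
df = pd.DataFrame({
    'month': ['2014-11', '2014-11', '2014-12'],
    'duration': [10.0, 20.0, 5.0],
    'item': ['call', 'sms', 'call'],
})

# Named aggregation: output_column = (input column, aggregation function)
stat = df.groupby('month').agg(
    total_duration=('duration', 'sum'),
    item_count=('item', 'count'),
)
print(stat)
```

This avoids ambiguity when you want two different aggregations on the same input column.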

Reshaping :

  • Wide format to long format using melt() function:
# creating a dataframe
df = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Sally', 3: 'Alexa', 4: 'Alan', 5: 'Emma'},
                   'Course': {0: 'Masters', 1: 'Graduate', 2: 'Graduate', 3: 'Graduate', 4: 'Graduate', 5: 'Masters'},
                   'Age': {0: 27, 1: 23, 2: 21, 3: 22, 4: 20, 5: 19}})
Output:
Name Course Age
0 John Masters 27
1 Bob Graduate 23
2 Sally Graduate 21
3 Alexa Graduate 22
4 Alan Graduate 20
5 Emma Masters 19
df1 = pd.melt(df, id_vars =['Name'], value_vars =['Course','Age'])
df1
Output:
Name variable value
0 John Course Masters
1 Bob Course Graduate
2 Sally Course Graduate
3 Alexa Course Graduate
4 Alan Course Graduate
5 Emma Course Masters
6 John Age 27
7 Bob Age 23
8 Sally Age 21
9 Alexa Age 22
10 Alan Age 20
11 Emma Age 19
  • Wide format to long format using stack() function:
df_stacked = df.stack()
print(df_stacked.head(15))
Output:
0 Name John
Course Masters
Age 27
1 Name Bob
Course Graduate
Age 23
2 Name Sally
Course Graduate
Age 21
3 Name Alexa
Course Graduate
Age 22
4 Name Alan
Course Graduate
Age 20
  • Long format to wide format using pivot() function:
# Let us convert the above long-format dataframe into wide format:
df2 = df1.pivot(index='Name', columns='variable', values='value')
df2
Output:
variable Age Course
Name
Alan 20 Graduate
Alexa 22 Graduate
Bob 23 Graduate
Emma 19 Masters
John 27 Masters
Sally 21 Graduate
  • The unstack() function basically reverses stack():
df = df_stacked.unstack()
df
Output:
Name Course Age
0 John Masters 27
1 Bob Graduate 23
2 Sally Graduate 21
3 Alexa Graduate 22
4 Alan Graduate 20
5 Emma Masters 19

Duplicate Handling :

An important part of data analysis is identifying duplicate values and removing them. We can do so using the drop_duplicates() function.

# Removing rows with the same "First Name". Here "First Name" is the column name
sample_df.drop_duplicates(subset="First Name", keep=False, inplace=True)
# Dropping duplicates across multiple columns
sample_df.drop_duplicates(subset=['First Name', 'Age'], keep=False)
'''
keep : {‘first’, ‘last’, False}, default ‘first’
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
'''
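To see how the three keep options differ in practice, here is a small self-contained sketch (the frame and names are made up for illustration):

```python
import pandas as pd

# Illustrative frame with a repeated "First Name"
sample_df = pd.DataFrame({
    'First Name': ['Alice', 'Bob', 'Alice'],
    'Age': [30, 25, 31],
})

first = sample_df.drop_duplicates(subset='First Name', keep='first')  # keeps rows 0 and 1
last = sample_df.drop_duplicates(subset='First Name', keep='last')    # keeps rows 1 and 2
none = sample_df.drop_duplicates(subset='First Name', keep=False)     # keeps row 1 only
print(len(first), len(last), len(none))  # -> 2 2 1
```

Note that keep=False removes every occurrence of a duplicated key, which is useful when only truly unique records should survive.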

Dataframe Transformation :

  • Convert List to Pandas Dataframe :
people_name = ['John','Mark','Sally','Jack','Jill']
df = pd.DataFrame(people_name,columns=['First_Name'])
print (df)
# Usage of transpose()
people_list = [['John','Mark','Sally','Jack','Jill'],['Smith','Brown','Lee','Jones','Ford'],[21,38,42,28,55]]
df = pd.DataFrame(people_list).transpose()
df.columns = ['First_Name','Last_Name','Age']
Output:
First_Name Last_Name Age
0 John Smith 21
1 Mark Brown 38
2 Sally Lee 42
3 Jack Jones 28
4 Jill Ford 55
# A different way of representation
people_list = [['John','Mark','Sally','Jack','Jill'],['Smith','Brown','Lee','Jones','Ford'],[21, 38, 42, 28, 55]]
df = pd.DataFrame(people_list, index=['First_Name','Last_Name','Age'], columns=['a','b','c','d','e'])
Output:
a b c d e
First_Name John Mark Sally Jack Jill
Last_Name Smith Brown Lee Jones Ford
Age 21 38 42 28 55
  • Convert Dictionary to Pandas Dataframe :
my_dict = {'John':21,'Mark':38,'Sally':42,'Jack':28,'Jill':55}
df = pd.DataFrame(list(my_dict.items()),columns = ['Name','Age'])
print (df)
Output:
Name Age
0 John 21
1 Mark 38
2 Sally 42
3 Jack 28
4 Jill 55
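If you'd rather keep the dictionary keys as the index instead of a column, pd.DataFrame.from_dict with orient='index' does that directly; a minimal sketch with the same made-up names:

```python
import pandas as pd

my_dict = {'John': 21, 'Mark': 38, 'Sally': 42, 'Jack': 28, 'Jill': 55}

# orient='index' turns each dictionary key into a row label
df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['Age'])
print(df)
```

This is handy when the keys are natural row identifiers, such as names or IDs.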
  • Convert Array to Pandas Dataframe :
import numpy as np
# creating the Numpy array
array = np.array([[1, 1, 1], [2, 4, 8], [3, 9, 27],[4, 16, 64], [5, 25, 125], [6, 36, 216], [7, 49, 343]])
# You may or may not create a list of index names. If you don't create one, a default will be created.
# index_values = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh']
# creating a list of column names
column_values = ['number', 'squares', 'cubes']
# creating the dataframe
df = pd.DataFrame(data = array,
# index = index_values,
columns = column_values)
print(df)
Output(without specified index column):
number squares cubes
0 1 1 1
1 2 4 8
2 3 9 27
3 4 16 64
4 5 25 125
5 6 36 216
6 7 49 343

Lambda Function :

A lambda function comes in handy when building complex logic for a new column, a filter, and much more. It can take multiple arguments but contains only one expression.

x = lambda a, b : a * b
print(x(5, 6))
Output: 30
# Use a lambda inside a function that always doubles the number you take as input:
def myfunc(n):
return lambda a : a * n
doubler = myfunc(2) # doubler = lambda a: a*2
print(doubler(11)) # for a=11, compute 11*2
Output: 22
# Use a lambda to perform a calculation and create a new column in a dataframe:
# Create a dataframe with students' names and their total marks
values = [['Rohan',455],['Elvish',250],['Deepak',495],['Soni',400],['Radhika',350],['Vansh',450]]
df = pd.DataFrame(values,columns=['Name','Total_Marks'])
# Compute percentage of 'Total_Marks' column using df.assign()
df = df.assign(Percentage = lambda x: (x['Total_Marks'] /500 * 100))
Output:
Name Total_Marks Percentage
0 Rohan 455 91.0
1 Elvish 250 50.0
2 Deepak 495 99.0
3 Soni 400 80.0
4 Radhika 350 70.0
5 Vansh 450 90.0
# Usage of the apply function along with lambda:
df = df.set_index('Name')
df = df.apply(lambda x: np.square(x) if x.name == 'Deepak' else x, axis=1)
Output:
Total_Marks Percentage
Name
Rohan 455.0 91.0
Elvish 250.0 50.0
Deepak 245025.0 9801.0
Soni 400.0 80.0
Radhika 350.0 70.0
Vansh 450.0 90.0

Lambda functions are often passed as arguments to built-in functions like filter() and map().

  • Usage of lambda function with filter()
# The function filter() takes in a function and a list as arguments.
# Program to filter out only the even items from a list
my_list = [1, 5, 4, 6, 8, 11, 3, 12]
new_list = list(filter(lambda x: (x%2 == 0) , my_list))
print(new_list)
Output: [4, 6, 8, 12]
  • Usage of lambda function with map()
# Program to double each item in a list using map()

my_list = [1, 5, 4, 6, 8, 11, 3, 12]
new_list = list(map(lambda x: x * 2 , my_list))
print(new_list)
Output: [2, 10, 8, 12, 16, 22, 6, 24]

Apply Function :

As the name suggests, the apply() function lets you pass a function and apply it to every value of a Series (or to every row or column of a dataframe).

# Usage of apply() to add 5 to each value in a series, returning a new series:
new = s.apply(lambda num: num + 5)
# Remove the $ present in the values of the "salary" column and convert to float:
def money_to_float(money_str):
return float(money_str.replace("$","").replace(",",""))

df['new_salary'] = df['salary'].apply(money_to_float)
  • Usage of apply() function to every single row of a dataframe
# Use the "height" and "width" columns to calculate the "area"
def calculate_area(row):
return row['height'] * row['width']

rectangles_df['area'] = rectangles_df.apply(calculate_area, axis=1)
Output:

height width area
0 40.0 10 400.0
1 20.0 9 180.0
2 3.4 4 13.6
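Worth noting: for simple arithmetic like this, a vectorized column operation is usually much faster than apply with axis=1, since it avoids a per-row Python function call. A sketch using the same made-up values:

```python
import pandas as pd

rectangles_df = pd.DataFrame({'height': [40.0, 20.0, 3.4], 'width': [10, 9, 4]})

# Vectorized: multiplies whole columns at once, no per-row Python call
rectangles_df['area'] = rectangles_df['height'] * rectangles_df['width']
print(rectangles_df['area'].tolist())  # -> [400.0, 180.0, 13.6]
```

Reserve apply(..., axis=1) for logic that genuinely cannot be expressed as column-wise operations.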

Conclusion :

As you’ve observed, lambda and apply() take care of a lot of complex functionality; you can find plenty more to practice on the internet.

I never suggest any particular course, since I believe requirements vary for each person.

Follow me on Medium if you found this useful and want to come across more such posts in future.
