What is the Application of Programming in Data Science

Priyanka Banerjee · Published in Analytics Vidhya · Jul 6, 2021

Mathematics and Programming are the backbone of Data Science.

Why is Programming essential?

We know that Data Science is a domain where you work with data, and probability plays an extremely vital role. Yet we see so many advertisements from different platforms suggesting that anybody can become a Data Scientist even without a Mathematics or Programming background. Is that the REALITY?

Well, the answer can’t be given in one word or as a binary Yes/No.

To understand it, let us first see why Data Science is essential to a company:

  1. Just as in our childhood we learnt the science behind chemical components in CHEMISTRY and biological components in BIOLOGY, Data Science is the branch of science where we understand data as a component.
  2. Data Science is the sector that gives us practical applications of Mathematics and Statistics. I will not be surprised if, after a few years, a Mathematics Lab is introduced in classes 11 and 12 with Data Science as the subject.
  3. With the introduction of the internet and the advancement of technology, whatever we do gets stored in the form of data. This data became easy to access with cloud technology. As people started doing more research, they realised that this data has hidden patterns inside it which, if utilised properly, can help increase revenue. Thus the science behind understanding data, or Data Science, came into existence.

What problems does Data Science solve? Some common ones are:

  1. Predicting upcoming revenue. This helps companies estimate how much raw material should be bought, saving money that can be used for better marketing strategies (a rough sketch of this follows the list below).
  2. Targeting customers. This helps companies understand customer behaviour, such as what kind of products they like and what they don’t.
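As a rough illustration of the first point, here is a minimal sketch of a revenue forecast. The monthly figures, the column layout and the use of scikit-learn’s LinearRegression are my own assumptions for illustration, not taken from this article.

# Hypothetical sketch: fit a straight line to past monthly revenue and
# extrapolate one month ahead. All numbers below are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5], [6]])        # month index
revenue = np.array([110, 118, 125, 131, 140, 149])       # revenue in thousands (dummy data)

model = LinearRegression()
model.fit(months, revenue)

next_month = np.array([[7]])
print("Forecast for month 7:", model.predict(next_month)[0])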

Eg: Let us consider that you go to a shop to buy a dress.

Scenario 1: You didn’t tell the shopkeeper the colour or size, just that the dress is for yourself. He shows you dresses of various colours apart from black, and of various sizes which may or may not fit you. No matter how good they are, you might not end up buying one because none of them is specific to your requirement.

Scenario 2: Now suppose he specifically asks you questions: which colour do you want? What type of dress do you want? What size are you looking for? What price range are you looking for? After listening to your requirements he shows you 5–6 dresses, and the chances of you selecting one among them are much higher than in Scenario 1, because here all the dresses match your requirement.

Now suppose you are a regular shopper at that shop, and whenever you buy anything the shop keeps details of it. These details are the DATA. Let us say the shop has stored your details 4 times. When you arrive for the 5th time, based on your previous shopping the shopkeeper has learnt your choices and shows you exactly the dresses you prefer to buy: black dress, size S, price range Rs.1000-Rs.2000. You’re happy that you didn’t have to waste time searching and choosing from a huge collection. Now that you’re a happy and satisfied customer, you will visit the store again and again, which benefits them by increasing their growth and revenue.

Here the question arises: how will the store know my choices, and where should they save them?

Well, all this data is stored in the cloud. Cloud platforms like Google Cloud Platform, AWS and Microsoft Azure help you store data which you can manipulate later to determine the patterns in it. If you notice closely, you will see the data is stored in a place that can be accessed only if you know computer programming.

Eg: Let us consider that the data is stored in GCP, spread across several tables. One table stores the sizes of dresses customers buy, another stores colours, another prices, and so on. To understand the buying pattern of one customer, you have to join these tables together and extract the complete information. Here comes SQL: you can use BigQuery to perform this task. Once you get the data, you can build a data analysis workflow or use the AutoML feature of GCP. To analyse the data and implement the workflow you have to know Python/R.
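To make the idea concrete, here is a minimal sketch of what such a join could look like from Python using the google-cloud-bigquery client. The dataset and table names (shop.sizes, shop.colours, shop.prices) and the customer_id/purchase_id columns are hypothetical, purely for illustration.

# Hypothetical sketch: join per-attribute tables in BigQuery and pull the
# result into a pandas DataFrame for analysis. Table/column names are made up.
from google.cloud import bigquery

client = bigquery.Client()  # assumes GCP credentials are already configured

query = """
SELECT s.customer_id, s.size, c.colour, p.price
FROM `shop.sizes`   AS s
JOIN `shop.colours` AS c ON s.purchase_id = c.purchase_id
JOIN `shop.prices`  AS p ON s.purchase_id = p.purchase_id
WHERE s.customer_id = @customer_id
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("customer_id", "STRING", "C-042")]
)
df = client.query(query, job_config=job_config).to_dataframe()
print(df.head())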

Till now it seems simple, and you might think that all these steps don’t require extensive programming or mathematics knowledge.

Well, this is just the tip of the iceberg. As a Data Scientist in an organisation, you are expected to know how to engineer and extract the data, analyse it, visualise it, build models on it for prediction, choose and deploy the right model, and test it on future data. To perform all these tasks, good knowledge of Python is expected.
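As a rough sketch of that end-to-end flow, assuming pandas and scikit-learn are available (the file name, column names and model choice below are my own illustration, not from this article):

# Hypothetical end-to-end sketch: extract, clean, engineer features, model, test.
# Columns (size, colour, price, bought) and the file purchases.csv are invented.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("purchases.csv")                      # extract
df = df.dropna()                                       # basic cleaning
X = pd.get_dummies(df[["size", "colour", "price"]])    # feature engineering
y = df["bought"]                                       # did the customer buy? (0/1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)        # choose a model
model.fit(X_train, y_train)                            # train
print("Accuracy on held-out data:", accuracy_score(y_test, model.predict(X_test)))  # test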

Eg: Let us consider that you are required to find the total quantity a customer has bought till now and convert it into a dataframe for further manipulation.

Raw data from tables:

Date: 1/1/2021, 21/1/2021, 3/3/2021, 14/4/2021

Quantities: 1, 2, 3, 4

Dataframe:

Days: 4

Total Quantities: 10

Now when you look at the data, you find a hidden pattern in the quantities bought. Your dataframe should have aggregated values from the raw table, i.e., the total count of dates as “Days” and the total quantity bought as “Total Quantities”. Suppose here you cannot use pandas or any other Python-specific library. Then you will need to write something like:

qtys = [1, 2, 3, 4]          # quantities bought on each visit
n = len(qtys)
total = 0
for i in range(n):           # loop over every visit
    total += qtys[i]
print("Total quantities bought:", total)

You can write the above program only when you’ve understood the logic. Now, do you think the above method is an optimised way of obtaining the result? No. If the customer has visited the shop 1 lakh times, it will loop 1 lakh times (O(n)), which makes it inefficient. What could be the alternative?

qtys = [1, 2, 3, 4]                                    # quantities follow the pattern 1, 2, ..., n
n = len(qtys)
print("Total quantities bought:", n * (n + 1) // 2)    # closed-form sum, O(1)

You might ask why n*(n+1)/2. Well, if you observe closely you can notice there’s a hidden pattern in the quantities bought (1, 2, 3, 4): they are the natural numbers from 1 to 4. Here comes Mathematics: the sum of the first n terms of an arithmetic progression, where a is the first term and d is the common difference between consecutive terms, is

Sum = (n/2) * (2*a + (n-1)*d)

We’re using the above AP series formula with a = 1 and d = 1, which reduces to n*(n+1)/2. If the buying pattern of quantities had been 3, 6, 9, 12, we could have applied the same logic with a = 3 and d = 3. This is the optimised way of writing a program, which any organisation will look for while interviewing you. If you come across a different pattern, you will have to formulate it so that it can be computed in O(1) instead of O(n).
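A small sketch of that generalisation, assuming the quantities really do follow an arithmetic progression (the helper name ap_sum is my own, not from the article):

# Hypothetical helper: O(1) sum of an arithmetic progression with first term a,
# common difference d and n terms, i.e. Sum = (n/2) * (2*a + (n-1)*d).
def ap_sum(a, d, n):
    return n * (2 * a + (n - 1) * d) // 2

print(ap_sum(1, 1, 4))   # 1 + 2 + 3 + 4  = 10
print(ap_sum(3, 3, 4))   # 3 + 6 + 9 + 12 = 30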

I hope you have realised how important programming and mathematics are in the domain of Data Science. Hence, do not pay attention to any hearsay claiming that you can become an expert in the Data Science domain without proper Programming or Mathematics knowledge.

If you’re planning to embark on your journey in the field of Data Science/Machine Learning, you can check out PyLambda.

You can also check out my blog on the interview guide:
