The ultimate guide to building a predictive analytics project from scratch using Python!

Predictive analytics is one of the most useful applications of machine learning in a business setting. This guide walks you step by step through building a predictive analytics project from scratch using Python, and it comes with a link to the GitHub documentation so that you can run the code yourself for practice!

The GitHub link can be found here:

The first step to building your project is to collect your data. There are several ways to collect data, including databases such as MySQL, surveys, and APIs such as Twitter’s. For this project we used data that is directly available on Kaggle, the popular machine learning website.

The data that we are using to build this project is called Human Resource Analytics and can be downloaded from here:

Step 1: Understand your data.

When you get a new dataset, you first want to be clear about your objective and goals for it. You should also ask yourself what you are curious to learn from the dataset. As an example, you want to create the following three subsections:

Next we want to import the data and explore the first 5 rows of the data. We can do this using the code shown below:
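As a minimal sketch of this step, the snippet below reads a CSV with pandas and prints the first 5 rows. The embedded sample rows and their column names are assumptions made so that the example runs on its own; with the real Kaggle download you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded Kaggle CSV: a few rows with
# assumed column names, embedded here so the example runs on its own.
SAMPLE_CSV = """satisfaction_level,last_evaluation,number_project,left,salary
0.38,0.53,2,1,low
0.80,0.86,5,1,medium
0.11,0.88,7,1,medium
0.72,0.87,5,1,low
0.37,0.52,2,1,low
0.58,0.74,4,0,low
0.82,0.67,2,0,high
"""

# With the real file, replace io.StringIO(...) with the path to the CSV.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# head() returns the first 5 rows by default.
first_rows = df.head()
print(first_rows)
```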

Step 2: Explore, clean and visualize your data

We then want to explore the summary statistics of the numerical columns of the dataset so that we get a quick idea of the numerical quality of the data that we are working with. We can do this using the code shown below:
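One way to get these summary statistics is pandas' `describe()`. The sketch below runs it on a small synthetic frame; the column names and values are made up for illustration and are not the real HR data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the HR data (column names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, size=100).round(2),
    "average_monthly_hours": rng.integers(96, 311, size=100),
    "department": rng.choice(["sales", "technical", "support"], size=100),
})

# describe() reports count, mean, std, min, quartiles and max.
# By default it covers the numerical columns only, so the text
# column "department" is left out of the result.
summary = df.describe()
print(summary)
```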

This results in a table as shown below: 

Next we want to clean the data. Cleaning data is a guide in itself, and here’s a link to “The ultimate guide to cleaning data in Python.”

For this project, though, the main things we noticed were:

  • A couple of columns were incorrectly named.
  • The data did not have any NULL values.
  • The categorical variables were stored as ‘int64’.
  • There were duplicate rows of data.

The code that we used to clean the data is given below:
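The four observations listed above map onto four pandas operations. The sketch below applies them to a tiny hypothetical frame; the column names, the misspelling and the duplicate row are assumptions chosen to reproduce each issue, not a copy of the original cleaning code.

```python
import pandas as pd

# Tiny hypothetical frame that reproduces the observations above.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.72],
    "average_montly_hours": [157, 262, 157, 223],  # misspelled column name
    "Work_accident": [0, 0, 0, 1],                 # 0/1 flag stored as int64
    "left": [1, 1, 1, 0],                          # 0/1 flag stored as int64
})

# 1. Fix incorrectly named columns.
df = df.rename(columns={"average_montly_hours": "average_monthly_hours"})

# 2. Confirm that there are no missing values in any column.
missing_per_column = df.isnull().sum()

# 3. Convert integer-coded categorical variables to the category dtype.
for column in ["Work_accident", "left"]:
    df[column] = df[column].astype("category")

# 4. Drop exact duplicate rows (the third row repeats the first one here).
df = df.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(df.dtypes)
print(len(df))
```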

After this we would like to visualize our data. This, too, is a guide in itself, and you can learn more about it using this link: “How do you visualize data in Python?”

For the project under study, the code used to create all our plots is given below:
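The original plots are not reproduced here, so the following is only a sketch of two typical exploratory plots with matplotlib: a histogram of a continuous variable and a bar chart of the target classes. The data is synthetic and the column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for two HR columns (names are assumptions).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, size=200),
    "left": rng.integers(0, 2, size=200),
})

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Distribution of a continuous variable.
ax_hist.hist(df["satisfaction_level"], bins=20)
ax_hist.set_xlabel("satisfaction_level")
ax_hist.set_ylabel("number of employees")

# Number of employees in each class of the target.
counts = df["left"].value_counts().sort_index()
ax_bar.bar(counts.index.astype(str), counts.values)
ax_bar.set_xlabel("left (0 = stayed, 1 = left)")
ax_bar.set_ylabel("number of employees")

fig.tight_layout()
```

In a notebook the figure renders on its own; in a script you could call `fig.savefig(...)` with a file name of your choice.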

The explanations and the inferences for each of the plots mentioned above can be found on the GitHub link here:

Step 3: Statistical Modeling

The aim of statistical modeling is to verify the quality and authenticity of the data that we received from the HR department. We can also test the various hypotheses that we formed during data exploration.

This step is VERY important in any predictive analytics project as we can statistically verify information such as:

  • The kind of distributions the various continuous variables have.
  • How stable the summary statistics would be if the data collection were repeated 100,000 times, which we can estimate using bootstrapping.
  • Whether the variables are correlated, which we can check using hypothesis testing.

The code that we executed to carry out statistical modeling can be found below:
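To make the bootstrapping and hypothesis testing ideas concrete, here is a minimal NumPy sketch. It bootstraps a confidence interval for a mean and then runs a permutation test for correlation. The two variables are synthetic, their names are assumptions, and the permutation test is just one reasonable choice of test, not necessarily the one used in the original project.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for two continuous HR variables (assumed names).
satisfaction = rng.uniform(0.1, 1.0, size=500)
evaluation = 0.3 * satisfaction + rng.normal(0.6, 0.1, size=500)

# Bootstrapping: resample with replacement many times and recompute the mean.
n_resamples = 5_000  # the guide mentions 100,000; fewer keeps this sketch quick
samples = rng.choice(satisfaction, size=(n_resamples, satisfaction.size), replace=True)
boot_means = samples.mean(axis=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# Permutation test. Null hypothesis: the two variables are uncorrelated.
# Shuffling one variable breaks any real link between the pairs.
observed_r = np.corrcoef(satisfaction, evaluation)[0, 1]
n_permutations = 2_000
permuted_r = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(satisfaction)
    permuted_r[i] = np.corrcoef(shuffled, evaluation)[0, 1]

# Share of shuffled correlations at least as extreme as the observed one.
p_value = np.mean(np.abs(permuted_r) >= abs(observed_r))

print(f"95% bootstrap CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
print(f"observed r = {observed_r:.3f}, p-value = {p_value:.4f}")
```

A small p-value here means that a correlation as strong as the observed one rarely appears when the pairing is random, which is evidence against the variables being uncorrelated.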

The explanations for the lines of code mentioned above can be found on the GitHub link:

Step 4: Machine Learning.

Without machine learning, a predictive analytics project is just analytics. The project under study is a classification problem, as we want to predict whether an employee will leave the firm or stay, based on various factors.

Three algorithms are compared here:

  • K Nearest Neighbors Classification
  • Logistic Regression
  • Decision Trees

Each model is trained and tested with the same train-test split of the original dataframe. The ROC Curve is plotted and the AUC is calculated for each algorithm.

Before running the machine learning models, the hyperparameters for each of the models have to be chosen. In this case the model that requires tuning is the KNN algorithm, as we would like to determine the ideal number of neighbors, K. We can do this using the code shown below:
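A minimal sketch of this search with scikit-learn follows. It scores each candidate K with 5-fold cross-validation on the training data, which is an assumption on our part about how to compare the candidates. The data comes from `make_classification` as a stand-in for the HR features, so the best K it prints will not necessarily match the value found on the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the HR features
# and the stay/leave target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

neighbors = list(range(1, 16))
cv_accuracy = []

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data only.
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    cv_accuracy.append(scores.mean())

best_k = neighbors[int(np.argmax(cv_accuracy))]
print(f"best k = {best_k}, cross-validated accuracy = {max(cv_accuracy):.3f}")
```

Scoring on cross-validation folds rather than on the test set keeps the test set untouched for the final model comparison.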

We determined that 8 neighbors resulted in the best-performing model.

Next we can build all 3 machine learning models:
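The sketch below fits the three classifiers on one shared train-test split using scikit-learn. The data is synthetic, and apart from the 8 neighbors mentioned above, the settings (`max_iter`, `random_state`, the 70/30 split) are assumptions rather than the original configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR features and the stay/leave target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

# One split, shared by all three models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=8),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Fit every model on the same training data.
for model in models.values():
    model.fit(X_train, y_train)

# Predicted class (0 or 1) for every row of the test set.
predictions = {name: model.predict(X_test) for name, model in models.items()}
```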

Next we want to determine the accuracy of all three models:
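Accuracy is the share of test rows that a model classifies correctly. A minimal sketch with scikit-learn's `accuracy_score` is shown here, again on synthetic data with assumed settings, so the printed numbers say nothing about the real HR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR features and the stay/leave target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=8),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Compare predicted labels with the true labels of the held-out rows.
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.3f}")
```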

We will then plot our models’ performance using the ROC curve and determine the AUC scores for each of the three models:
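The ROC curve is built from each model's predicted probability for the positive class, and the AUC summarizes the curve in one number. The following sketch uses scikit-learn's `roc_curve` and `roc_auc_score` with matplotlib for the plot; the data and model settings are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR features and the stay/leave target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=8),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

fig, ax = plt.subplots()
auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class, used as the ranking score.
    positive_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, positive_proba)
    auc_scores[name] = roc_auc_score(y_test, positive_proba)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc_scores[name]:.2f})")

# Diagonal reference line: the expected curve of a random classifier.
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```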

We then scale our data and compare the performance of our models after scaling to see whether the model scores have improved or not.
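One way to run this comparison is to wrap each model in a scikit-learn pipeline with `StandardScaler`, so that the scaler is fitted on the training data only. In the sketch below, one synthetic feature is multiplied by 1,000 to mimic columns on very different scales (say, monthly hours next to a 0 to 1 score); that choice, like the data itself, is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR features and the stay/leave target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X[:, 0] *= 1000  # put one feature on a much larger scale than the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)


def make_models():
    """Return fresh, unfitted copies of the three classifiers."""
    return {
        "knn": KNeighborsClassifier(n_neighbors=8),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=0),
    }


# Test accuracy on the raw features.
unscaled_scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in make_models().items()
}

# Test accuracy with standardization applied inside a pipeline.
scaled_scores = {
    name: make_pipeline(StandardScaler(), model).fit(X_train, y_train).score(X_test, y_test)
    for name, model in make_models().items()
}

for name in unscaled_scores:
    print(f"{name}: unscaled = {unscaled_scores[name]:.3f}, scaled = {scaled_scores[name]:.3f}")
```

Distance-based models such as KNN are the ones most likely to react to scaling, while decision trees split on one feature at a time and are largely unaffected by it.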

The next step is an important one. We need to carry out a feature selection test to determine which features in our dataset have the most influence on the final predictions. We can do this using the code shown below:
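The guide does not say which feature selection test was used, so the sketch below shows just one possible approach: ranking features by the impurity-based importances of a fitted decision tree. The feature names are hypothetical placeholders and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names; the real dataset has its own column names.
feature_names = [f"feature_{i}" for i in range(8)]

# Synthetic data: with shuffle=False the first 4 columns are the informative ones.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# One row per feature, sorted from most to least important.
importance_table = (
    pd.DataFrame({"feature": feature_names, "importance": tree.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_table)
```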

This generates a table for us with the most important features in the dataset:

We then choose which variables to keep in our machine learning model based on domain experience and the table shown above and then test the performance of our models after feature selection:
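As a rough sketch of this last check, the snippet below keeps a subset of features and retrains the three models on it. Keeping the 4 features with the highest decision-tree importance is an arbitrary stand-in for the domain-driven choice described above, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR features and the stay/leave target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Rank features with a decision tree fitted on the training data only.
ranker = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
n_keep = 4  # arbitrary; in practice guided by domain experience
top_features = np.argsort(ranker.feature_importances_)[::-1][:n_keep]

# Keep only the selected columns in both splits.
X_train_selected = X_train[:, top_features]
X_test_selected = X_test[:, top_features]

models = {
    "knn": KNeighborsClassifier(n_neighbors=8),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Test accuracy of each model when trained on the reduced feature set.
selected_scores = {
    name: model.fit(X_train_selected, y_train).score(X_test_selected, y_test)
    for name, model in models.items()
}
for name, score in selected_scores.items():
    print(f"{name}: {score:.3f}")
```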

We then conclude with what we observed while we built this project and what our final conclusions are:

With this we have come to the end of one of the longest guides published on LinearData. We usually aim to keep our guides to around 1,000 words, but this guide wouldn’t be the “Ultimate” guide to predictive analytics if we had stuck to that.

Happy Predictions!