How do you solve a classification machine learning problem using R?

Machine learning algorithms are often applied by trial and error, without a proper understanding of how they work. This guide introduces the fundamentals of the classification machine learning problem using R and shows how you can apply it to data sets in a way that generates value for your business.

What is a classification machine learning problem in a nutshell?

Classification:

Goal: To predict which of two categories an object belongs to.

Example: To predict if an employee stays with or leaves a firm.

The Classification Problem:

In order to illustrate the example below, we have used the Human Resource Analytics Dataset found on Kaggle.

HR Analytics Dataset: https://www.kaggle.com/ludobenistant/hr-analytics

In this classification problem we want to predict which employees will leave the firm and which will stay, based on various factors.

We are going to be using a simple decision tree algorithm to solve this problem.

The first step is to install and load the packages that are required to implement this algorithm.

The main package is “rpart”, whose name stands for recursive partitioning.
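As a minimal sketch of this step, the snippet below loads rpart and installs it only if it is missing. The second package, rpart.plot, is an assumption on our part: it is one common way to draw boxed tree plots like the one discussed in this guide, and the sketch treats it as optional.

```r
# rpart is a "recommended" package bundled with standard R installs,
# so the install step only runs if it is somehow missing.
if (!requireNamespace("rpart", quietly = TRUE)) {
  install.packages("rpart")
}
library(rpart)

# Assumption: rpart.plot is optional and only used for drawing the tree.
# Run install.packages("rpart.plot") once if you want it.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  library(rpart.plot)
}
```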

Before we go any further into this guide we want to understand how exactly the decision tree algorithm works.  Take a look at the plot below:

The box on the top of the tree is called the ‘root’. The algorithm automatically picks as the root the feature, and the split on that feature, that best separates the classes. In the example above, “satisfaction_level” was picked to be the root of our tree. Now you’re probably wondering what exactly those numbers inside the boxes are. Let’s take a close look at them:

 

The 100% is the share of the training data that reaches this box. All of the training data passes through the root, so the “satisfaction_level” box shows 100%. As we go further down the tree, this percentage shrinks because the data gets filtered by the condition below each box. In this case the condition is “satisfaction_level >= 0.46”.

The 0.17 is the proportion of the training observations in this box that belong to class 1, that is, employees who left. It is not the share of the data that satisfies the condition. Because 0.17 is below 0.5, most employees in this box stayed, so the box is labeled with the majority class, 0. If more than 50% of the observations in the box belonged to class 1, the number at the top of the box would be 1.

The condition, not the label, decides which branch an observation takes. Observations that satisfy “satisfaction_level >= 0.46” take the “yes” branch to the left, and the rest take the “no” branch to the right.
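To see these numbers on a tree of your own, here is a rough sketch that fits a small tree and prints its nodes. It does not use the Kaggle file: the data frame is a synthetic stand-in that only borrows a few column names, and the rule that generates “left” is made up for illustration. The plotting call assumes the optional rpart.plot package.

```r
library(rpart)

# Assumption: synthetic stand-in for the HR data so the sketch runs on its own.
set.seed(42)
n_rows <- 1000
hr <- data.frame(
  satisfaction_level = runif(n_rows),
  last_evaluation    = runif(n_rows),
  number_project     = sample(2:7, n_rows, replace = TRUE)
)
# Made-up rule: dissatisfied employees are more likely to leave.
hr$left <- rbinom(n_rows, 1, ifelse(hr$satisfaction_level < 0.46, 0.7, 0.1))

model <- rpart(left ~ ., data = hr, method = "class")

# Each printed line shows: node), split, n, loss, yval, (yprob).
# n is how many rows reach the node, yval is the majority class,
# and yprob holds the class proportions inside the node.
print(model)

# Feature chosen for the root split.
root_feature <- as.character(model$frame$var[1])
print(root_feature)

# Optional boxed plot, only if rpart.plot is installed.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(model)
}
```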

Now that we understand how the tree works, let’s shuffle our HR analytics data and split it into training and test sets.

We first store the total number of rows in the data frame in a variable “n”. We then shuffle the rows randomly by calling the sample() function on “n”, which returns the row numbers 1 to n in random order. Finally, we split the shuffled data into training and test sets, with 70% of the rows assigned to the training set and the remaining 30% assigned to the test set.
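The steps just described could look roughly like the sketch below. The “hr” data frame is again a synthetic stand-in with made-up values, and the variable names other than “n”, “train” and “test” are our own choices.

```r
# Assumption: synthetic stand-in for the HR data so the sketch runs on its own.
set.seed(42)
hr <- data.frame(
  satisfaction_level = runif(1000),
  last_evaluation    = runif(1000)
)
hr$left <- rbinom(1000, 1, ifelse(hr$satisfaction_level < 0.46, 0.7, 0.1))

# Total number of rows.
n <- nrow(hr)

# sample(n) returns the numbers 1..n in random order, which shuffles the rows.
shuffled <- hr[sample(n), ]

# First 70% of the shuffled rows for training, the rest for testing.
train_size <- round(0.7 * n)
train <- shuffled[1:train_size, ]
test  <- shuffled[(train_size + 1):n, ]
```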

The 70-30 ratio is arbitrary. A 60-40 split works the same way: assign the first 60% of the shuffled rows to the training set and the remaining 40% to the test set.
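One way to make the ratio easy to change is to wrap the split in a small helper. The function name “split_data” and the 100-row demo data frame are hypothetical and only serve to illustrate the idea.

```r
# Hypothetical helper: shuffle a data frame and split it by a given fraction.
split_data <- function(df, train_fraction) {
  stopifnot(train_fraction >= 0, train_fraction <= 1)
  n <- nrow(df)
  shuffled <- df[sample(n), , drop = FALSE]
  train_size <- round(train_fraction * n)
  list(
    train = head(shuffled, train_size),      # first part of the shuffled rows
    test  = tail(shuffled, n - train_size)   # everything that is left
  )
}

# Demo on a made-up data frame with 100 rows.
set.seed(42)
demo <- data.frame(id = 1:100)
parts <- split_data(demo, 0.6)   # 60-40 split
nrow(parts$train)
nrow(parts$test)
```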

The next step is to build the classification model using the rpart() function.

The first argument, “left ~ .”, is a formula that models the “left” column of the data frame against all the other columns present in the data frame. We don’t have to model the column of interest against every other column. To model it against specific columns only, we list those columns on the right-hand side of the formula in place of the dot. For example, we could build the model by factoring in only the Last Evaluation and Satisfaction Level of the employees.

The next argument is the training data frame, “train”. The last argument, “method”, specifies the kind of model we are building. Since this is a classification problem, we set the method argument to “class”.
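Both variants of the model could be sketched as follows. The training data is a synthetic stand-in, and the column names “last_evaluation” and “number_project” are assumed to match the Kaggle file. The object names “model_all” and “model_two” are our own.

```r
library(rpart)

# Assumption: synthetic stand-in for the training set so the sketch runs on its own.
set.seed(42)
train <- data.frame(
  satisfaction_level = runif(700),
  last_evaluation    = runif(700),
  number_project     = sample(2:7, 700, replace = TRUE)
)
train$left <- rbinom(700, 1, ifelse(train$satisfaction_level < 0.46, 0.7, 0.1))

# Model "left" against every other column in the data frame.
model_all <- rpart(left ~ ., data = train, method = "class")

# Model "left" against two chosen columns only.
model_two <- rpart(left ~ last_evaluation + satisfaction_level,
                   data = train, method = "class")

# Predictors each formula makes available to the tree.
attr(model_all$terms, "term.labels")
attr(model_two$terms, "term.labels")
```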

The next step is to take the model trained on the “train” dataset and use it to predict the outcome for the “test” dataset.

We then build a confusion matrix that compares these predictions with the actual outcomes in the test set.
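A rough sketch of the prediction and confusion matrix steps is shown here. It runs on synthetic stand-in data, so the counts it prints will not match the figures quoted in this guide.

```r
library(rpart)

# Assumption: synthetic stand-in for the HR data so the sketch runs on its own.
set.seed(42)
n_rows <- 1000
hr <- data.frame(
  satisfaction_level = runif(n_rows),
  last_evaluation    = runif(n_rows)
)
hr$left <- rbinom(n_rows, 1, ifelse(hr$satisfaction_level < 0.46, 0.7, 0.1))

# 70-30 split on shuffled rows.
shuffled <- hr[sample(n_rows), ]
train <- shuffled[1:700, ]
test  <- shuffled[701:n_rows, ]

model <- rpart(left ~ ., data = train, method = "class")

# type = "class" returns the predicted label for each test row.
pred <- predict(model, test, type = "class")

# Rows are the actual outcomes, columns are the predictions.
conf <- table(actual = test$left, predicted = pred)
print(conf)
```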

I bet you’re confused now. What exactly is this confusion matrix?

Let’s take a look at the output of the confusion matrix that we got as a result of the code written above:

A confusion matrix is simply a table (like the one shown above) that is used to show the performance of a classification model.

Let us simplify the above table a little bit further so that you can make sense of it: 

The purpose of the confusion matrix is to measure the accuracy of our classifier. To do this we look at the True Positives and True Negatives. The True Positives are the count at the intersection of Predicted Yes and Actual Yes: 535. The True Negatives are the count at the intersection of Predicted No and Actual No: 2975.

The formula for accuracy is: (True Positives + True Negatives) / (Total number of observations in the confusion matrix)

So in this case: (535 + 2975) / 3597 = 97.6% approx.

Seems like our model was pretty good at predicting who would leave the firm! We can also compute the accuracy from the confusion matrix in code.
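Here is a minimal sketch of that calculation. The correct predictions sit on the diagonal of the confusion matrix, so accuracy is the diagonal sum divided by the total. The function name and the small “toy” matrix are made up for illustration; only 535, 2975 and 3597 come from the guide.

```r
# Hypothetical helper: accuracy = correct predictions / all predictions.
accuracy_from_matrix <- function(conf) {
  sum(diag(conf)) / sum(conf)
}

# The figures quoted in this guide.
article_accuracy <- (535 + 2975) / 3597
round(article_accuracy, 3)   # about 0.976

# Made-up 2x2 matrix, purely to exercise the function.
toy <- matrix(c(50, 5, 10, 35), nrow = 2,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
accuracy_from_matrix(toy)    # (50 + 35) / 100 = 0.85
```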

We hope this guide has given you a good understanding of how classification algorithms work in R and how you can build one yourself!

Happy Classification!

 
