How do you solve a linear regression machine learning problem in Python?

Automated machine learning may be taking the spotlight in data science, but that does not mean a data scientist should ignore the underlying mechanisms of the algorithms that he or she uses. Knowing how the linear regression algorithm works will help you diagnose your model and fine-tune the parameters that may lead to a better solution.

To illustrate how the linear regression algorithm works, we use the Human Resource Analytics Dataset, which can be accessed using this link:

The first thing to understand is that linear regression is a form of supervised machine learning. This means that we use the algorithm on labelled data (data that has a target variable and features).

Goal: To predict the future values of the target variable based on the features present in the data.

Example:  To predict the satisfaction levels of the employees in the HR data based on all the other features present in the data.

Before we go further into how you can apply a linear regression model to a dataset, let’s understand how it works on the inside.

The Linear Regression Algorithm: 

With a single predictor, linear regression amounts to fitting a straight line through a set of data points in two dimensions. The variable on the y-axis is called the “Target Variable” and the variable on the x-axis is called the “Predictor“. With several predictors the same idea applies, but the line becomes a plane or hyperplane. In linear regression we use the predictors to predict a value for the target variable.

Since we are fitting a line through the data, the model takes the form y = mx + c, where “y” is the target variable, “x” is the predictor, “m” is the slope of the line and “c” is the intercept. To get the line that best fits the data points, we need to choose the values of “m” and “c” that minimize the “Loss function“.

So what exactly is this loss function? Take a look at the image below:

The vertical distance between a point and the linear regression line is called the “Residual“. It is positive if the point is above the line and negative if it is below the line. We square each residual (Residual^2) and then add up the squared residuals for all the data points. This total is called the Sum of the Squares of the Residuals. Finding the line by minimizing this sum is called Ordinary Least Squares, or OLS.
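To make the loss function concrete, here is a minimal sketch that computes the sum of squared residuals for two candidate lines. The data points and the helper name `sum_squared_residuals` are made up for illustration and are not taken from the HR dataset.

```python
import numpy as np

# Made-up data points for illustration (not from the HR dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_residuals(m, c):
    """Sum of squared vertical distances between the points and the line y = m*x + c."""
    residuals = y - (m * x + c)  # positive above the line, negative below it
    return float(np.sum(residuals ** 2))

ssr_good = sum_squared_residuals(2.0, 0.0)  # a line that passes close to the points
ssr_bad = sum_squared_residuals(1.0, 1.0)   # a line that fits poorly
print(ssr_good, ssr_bad)
```

The line that hugs the points produces a much smaller sum, which is exactly what the loss function is meant to reward.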

To get the line that best fits the data points, we need to minimize the loss function, which for OLS is the sum of the squares of the residuals. The best line will rarely pass through every point; it is simply the one with the smallest sum.
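For a single predictor, the values of “m” and “c” that minimize the sum of squared residuals can be written in closed form. The sketch below uses the standard textbook formulas on made-up data and compares the result with NumPy's `np.polyfit` as a sanity check; it is an illustration, not the code used for the HR model.

```python
import numpy as np

# Made-up data points for illustration (not from the HR dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS solution for one predictor:
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   c = y_mean - m * x_mean
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean

# np.polyfit with deg=1 fits the same least-squares line; it returns [slope, intercept].
m_ref, c_ref = np.polyfit(x, y, deg=1)
print(m, c)
```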

Now that you have a decent idea of how the algorithm works, let’s build our first linear regression model. The first step is to import all the packages that we need.

Next, we want to import the packages that we will need to build the model:
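A plausible set of imports for the steps that follow, assuming pandas and NumPy for data handling and scikit-learn for the model (the library choice is an assumption based on the arguments used later, such as `test_size` and `random_state`):

```python
# Data handling
import numpy as np
import pandas as pd

# Model building and evaluation (scikit-learn is assumed here)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
```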

In this model, we are going to predict the satisfaction levels of the employees based on all the predictors present in the dataset. To do this, we need to tell the model which variable is the target and which are the predictors. Additionally, these variables should be passed as NumPy arrays. To implement this in Python we use the code shown below:
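A minimal sketch of this step, using a small synthetic DataFrame in place of the HR data. The column names (`satisfaction_level`, `number_project`, `average_monthly_hours`) and the variable names `df`, `X` and `y` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the HR data; the column names are hypothetical.
df = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, size=100),
    "number_project": rng.integers(2, 8, size=100),
    "average_monthly_hours": rng.integers(90, 310, size=100),
})

# Target variable: the column we want to predict.
y = df["satisfaction_level"].values

# Predictors: every other column in the DataFrame.
X = df.drop(columns="satisfaction_level").values

print(X.shape, y.shape)
```

Any non-numeric columns would need to be encoded as numbers (for example with `pd.get_dummies`) before this step, since the model only accepts numeric input.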

In the code above, the .values at the end of the code converts the dataframe into a NumPy array.

Next, we are going to split our data into training and test sets. We train the model on the training set and make predictions on the test set. The code that we need to implement this is found below:
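A minimal sketch of the split, assuming scikit-learn's `train_test_split` and random stand-in arrays for the predictors and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Random stand-in data: 100 rows, 3 predictors.
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Hold out 30% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)

# Splitting again with the same random_state reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=42
)
same_split = np.array_equal(X_test, X_test2)
```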

The argument “test_size = 0.3” splits the data into training and test sets in a 70-30 ratio. This number is arbitrary, but we would always recommend having a bigger training set compared to the test set. The “random_state” argument ensures that the split is reproducible if we use the same number, 42, the next time we build this model.

We then build the model and fit it to the training set using the code shown below:
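A minimal sketch of building and fitting the model, assuming scikit-learn's `LinearRegression`. The data is synthetic, generated from known coefficients so that the fitted values can be checked; the name `reg` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: a known linear relationship plus a little noise.
true_coefs = np.array([0.5, -0.2, 0.1])
true_intercept = 0.6
X = rng.normal(size=(200, 3))
y = X @ true_coefs + true_intercept + rng.normal(scale=0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Build the model and fit it to the training set only.
reg = LinearRegression()
reg.fit(X_train, y_train)

# The fitted slopes (one per predictor) and intercept.
print(reg.coef_, reg.intercept_)
```

Because the data was generated from known values, the printed coefficients should land close to `true_coefs` and `true_intercept`.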

We can then make a prediction on the test set using the code shown below:
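A minimal sketch of the prediction step, again assuming scikit-learn and synthetic stand-in data. It also shows that a prediction is nothing more than the fitted line equation applied to the test predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: a known linear relationship plus a little noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.6 + rng.normal(scale=0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)

# Predict the target for rows the model has not seen during training.
y_pred = reg.predict(X_test)

# The same predictions computed by hand: slopes times predictors plus intercept.
y_manual = X_test @ reg.coef_ + reg.intercept_
print(y_pred[:5])
```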

Finally we can compute the R squared score for the model we have built to evaluate how well the model performs using the code shown below:
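A minimal sketch of computing the R squared score, assuming scikit-learn and synthetic stand-in data. Both the model's `score` method and `r2_score` from `sklearn.metrics` give the same number.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: a known linear relationship plus a little noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.6 + rng.normal(scale=0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)

# R squared on the held-out test set, computed two equivalent ways.
r2 = reg.score(X_test, y_test)
r2_alt = r2_score(y_test, reg.predict(X_test))
print(r2)
```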

The R-Squared score is at most 1 and usually falls between 0 and 1, although on a test set it can be negative if the model predicts worse than simply using the mean of the target. A higher score means that your model fits the data points better. You have to be skeptical about this score, because a high R-Squared score does not necessarily mean that you have a great model. This is a topic that I will discuss in a forthcoming guide.

Another metric that you can calculate from your model is the Root Mean Squared Error or the RMSE. The code to calculate this value is given below:
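A minimal sketch of computing the RMSE, assuming scikit-learn's `mean_squared_error` and synthetic stand-in data. The square root is taken with NumPy so that the code does not depend on any particular scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: a known linear relationship plus a little noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.6 + rng.normal(scale=0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# RMSE: square root of the mean of the squared prediction errors.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
```

The RMSE is expressed in the same units as the target variable, which makes it easier to interpret than the raw mean squared error.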

The goal here is to keep the value of the RMSE as low as possible, as this is indicative of a model with smaller errors in its predictions.

In conclusion, linear regression has been one of the most popular algorithms among statisticians and data scientists alike, and a proper understanding of it forms a foundation for learning more complex algorithms such as neural networks.

Happy Machine Learning!