Understanding Correlation and Regression using R – A beginners guide


Correlation is NOT Causation.

These words echo in the mind of every data scientist out there, and before we look into correlation and regression we need to understand what makes correlation different from causation.


Suppose that after plotting a scatter plot between two variables we see that a change in the value of the variable along the x axis is accompanied by a change in the value of the variable along the y axis; we then say that there is correlation between x and y. Correlation implies that while we see a relationship between two variables, we CANNOT confirm that the change in the value of one variable causes the change in the value of the other variable. For example, while we can correlate the growing trend of data science with the emergence of technologies that can manipulate all this data, that is not the only reason why data science has seen an increase in popularity over the last 5 years.


Causation implies that the change in the value of the variable along the y axis has taken place entirely because of the change in the value of the variable along the x axis. In simpler terms, x causes changes in y.

The most common form of correlation is the bi-variate relationship, that is, the relationship between two variables. In order to visualize a bi-variate relationship in R we can use the ggplot2 package along with the code shown below:
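A minimal sketch of such a plot, assuming the video game sales data has been loaded into a data frame called `video_games` with columns `User_Score` and `Global_Sales` (these names are illustrative assumptions, not taken from the original snippet):

```r
# Sketch only: `video_games`, `User_Score` and `Global_Sales` are assumed names.
library(ggplot2)

# Scatter plot of user score (x axis) against global sales (y axis)
ggplot(data = video_games, aes(x = User_Score, y = Global_Sales)) +
  geom_point()
```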

Here the data argument specifies the dataset of interest. Inside aes() we have the variables that we want to plot on the x and y axes respectively. The output of these lines of code was:

Here we can see a moderately positive relationship between user score and global sales of video games, which was to be expected because hey, when a game’s pretty good you have to buy it.

One of the most interesting things we might notice about the plot above is the lone outlier sitting right at the top, with a Global Sales value of around 80. This is something we need to filter out of the dataset, and we should figure out why the outlier was present in the first place.

Correlation Strength:

The strength of the correlation can vary from -1 to +1. Here, -1 indicates a perfect negative relationship (when x goes up, y goes down) while +1 indicates a perfect positive relationship (when x goes up, y goes up). We can compute the strength of the correlation for our analysis in R using the code shown below:
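A sketch of that computation with dplyr, again assuming a `video_games` data frame where `User_Score` is stored as character and `Global_Sales` is numeric (assumed names, as before):

```r
library(dplyr)

video_games %>%
  summarize(
    N = n(),  # number of observations
    r = cor(as.numeric(User_Score), Global_Sales,
            use = "pairwise.complete.obs")  # skip pairs containing NA values
  )
```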

In the above code we use the summarize() function along with the pipe ‘%>%’ from the dplyr package to obtain the correlation coefficient. When computing the correlation coefficient we have to make sure that the variables of interest contain numeric data only, so I converted the user scores to numeric since that column contained NA values stored as strings. In addition, we set the ‘use’ argument to “pairwise.complete.obs” in the cor() function because we had NA values.

The output of the code above is shown below: 

Here N = 16719 is the number of observations in the two variables of interest, while ‘r’ = 0.088 is the value of the correlation coefficient. This means that there was only a weak positive correlation between the two variables.

Remember how we initially assumed there was a moderately strong relationship? Turns out we were wrong. This shows why it is important to compute the value of the correlation coefficient.


Linear Regression

Building a linear regression model means fitting the line that best fits the points in a scatter plot. This enables us to make predictions. In order to fit a line of best fit we can use the code shown below:
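A sketch of the plot with the fitted line added, reusing the assumed `video_games` data frame from before:

```r
library(ggplot2)

# Same scatter plot as before, now with a fitted regression line on top
ggplot(data = video_games, aes(x = as.numeric(User_Score), y = Global_Sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # line of best fit, no error ribbon
```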

The geom_smooth() function generates the line of best fit, where the argument method = “lm” selects the linear regression method and se = FALSE indicates that we don’t want the standard error ribbon around the line. The output of the code looks something like this:

Understanding the difference between regression and regression to the mean is important in any data analysis.

Regression to the mean indicates that extreme values tend to be followed by values closer to the mean of the dataset. For example, tall parents tend to produce tall children, but those children’s heights will, on average, be closer to the mean height of the population than their parents’ heights were.

Let’s now try constructing a linear regression model. Consider the code snippet shown below that we are going to be using to construct the regression model:
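A sketch of the model call, assuming columns `Global_Sales` and `NA_Sales` in the `video_games` data frame (assumed names, consistent with the coefficient discussed below):

```r
# Regress global sales on North American sales
model <- lm(Global_Sales ~ NA_Sales, data = video_games)
summary(model)  # inspect the fitted coefficients
```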

We use the lm() function to construct the linear regression model. The first argument is the formula (y ~ x reads as “build a regression model for y based on the variable x”). The second argument is the dataset these variables belong to. The output of the regression model is shown below:

We can see that the coefficient for NA_Sales is 1.79. This means that for every 1-unit increase in North American (NA) sales, Global Sales increase by about 1.79 units. Interesting. Seems like America accounts for a big chunk of worldwide video game sales. This was to be expected since they have a lot of video game companies that produce very popular titles!

This guide should have given you a solid introduction to the world of correlation and linear regression and how you can apply these concepts using R. Don’t forget that statistics forms the basis of any data science project, and you should never neglect it!

Happy Coding!
