The **ggplot2** package in R offers some of the most versatile and powerful data visualization tools available to data scientists. This guide introduces the package's core plot types and shows how to build each one.

For reference, the examples use the popular Human Resource Analytics dataset from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics
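Every snippet in this guide refers to a data frame named HR. A minimal sketch of creating it is given here. The file name HR_comma_sep.csv is an assumption about how the downloaded file is named, and the simulated fallback (made-up values, with column names taken from the code in this guide) exists only so the snippet runs without the file.

```r
# Assumed file name for the Kaggle download; adjust to your local copy.
hr_path <- "HR_comma_sep.csv"

if (file.exists(hr_path)) {
  HR <- read.csv(hr_path)
} else {
  # Fallback: simulated stand-in with made-up values, NOT the real data.
  set.seed(42)
  n <- 200
  HR <- data.frame(
    satisfaction_level   = round(runif(n, 0.1, 1), 2),
    last_evaluation      = round(runif(n, 0.4, 1), 2),
    average_montly_hours = sample(96:310, n, replace = TRUE),
    left                 = rbinom(n, 1, 0.25),
    salary               = sample(0:2, n, replace = TRUE),
    department           = sample(c("sales", "technical", "support"), n, replace = TRUE)
  )
}

str(HR)  # check column names and types before plotting
```

Running str() first is a cheap way to spot columns that are stored as numbers but should be treated as categories.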

The first step is to install and load the ggplot2 package. We can do this using the code shown below:

    # Installing the package
    install.packages("ggplot2")
    # Loading the package
    library(ggplot2)

The first plot that we are going to introduce is the scatterplot. We can construct a basic scatter plot using the code shown below:

    ggplot(data = HR, aes(x = satisfaction_level, y = last_evaluation)) +
      geom_point()

Fundamentally, ggplot2 implements the grammar of graphics on top of R's grid graphics system, and every plot is built up from layers. In the code above, aes() defines the **aesthetics**: the mapping from data columns to visual properties such as the x and y positions. geom_point() then adds a layer that draws each row as a point, which gives us a scatter plot.

The code above resulted in a scatter plot as shown below:
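One way to see the layering idea is to build a plot step by step and inspect the object. The sketch below uses a small simulated data frame (hypothetical values, not the HR data) so that it runs on its own.

```r
library(ggplot2)

# Simulated stand-in for the HR data (made-up values).
set.seed(1)
df <- data.frame(
  satisfaction_level = runif(50, 0.1, 1),
  last_evaluation    = runif(50, 0.4, 1)
)

# Data and aesthetic mappings only: no geom yet, so no points are drawn.
base <- ggplot(data = df, aes(x = satisfaction_level, y = last_evaluation))
length(base$layers)

# Adding geom_point() appends one layer to the plot object.
scatter <- base + geom_point()
length(scatter$layers)
class(scatter$layers[[1]]$geom)[1]

print(scatter)
```

Because each `+` returns a new plot object, a base plot can be stored once and reused with different geoms.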

We can further modify the scatter plot to add a regression line and a few other refinements using the code shown below:

    ggplot(data = HR, aes(x = satisfaction_level, y = last_evaluation)) +
      geom_point(alpha = 0.8) +
      scale_x_log10() +
      scale_y_log10() +
      # linewidth sets line thickness in ggplot2 >= 3.4.0; use size = 1 on older versions
      stat_smooth(method = "lm", se = FALSE, linewidth = 1, col = "red")

Here we put both the x and y axes on a log10 scale. The argument alpha = 0.8 makes the data points slightly transparent. The stat_smooth() layer fits a linear model (method = "lm") to draw the line of best fit; se = FALSE hides the standard-error band, and col = "red" colours the regression line red.

The resulting plot is shown below:
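Because ggplot2 applies scale transformations before it computes statistics, the regression line on log10 axes is a linear fit to the log10-transformed values. A rough equivalent outside ggplot2 is sketched below, using simulated data (an assumption for illustration, not the HR dataset).

```r
# Simulated data with a loose positive relationship (made-up values).
set.seed(2)
df <- data.frame(satisfaction_level = runif(100, 0.1, 1))
df$last_evaluation <- 0.5 + 0.3 * df$satisfaction_level + runif(100, 0, 0.1)

# The model that stat_smooth(method = "lm") fits once both axes are log10-scaled.
fit <- lm(log10(last_evaluation) ~ log10(satisfaction_level), data = df)
coef(fit)  # intercept and slope on the log-log scale
```

The slope here describes a relationship between logarithms, so it should not be read as a change in raw evaluation score per unit of satisfaction.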

We can also construct the scatter plot by coloring it by a categorical column with the code shown below:

    # factor() makes ggplot2 treat a numerically coded column as categorical
    ggplot(data = HR, aes(x = satisfaction_level, y = last_evaluation, col = factor(left))) +
      geom_point(alpha = 0.8)

The “col” aesthetic specifies the categorical column that determines the colour of each point. The resulting plot is shown below:
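If the grouping column is stored as a number, ggplot2 treats it as continuous and draws a colour gradient instead of discrete colours. The sketch below assumes `left` is coded 0/1 (simulated here) and converts it to a labelled factor first; the labels "stayed" and "left" are illustrative names, not part of the dataset.

```r
library(ggplot2)

# Simulated stand-in for the HR data (made-up values).
set.seed(3)
df <- data.frame(
  satisfaction_level = runif(100, 0.1, 1),
  last_evaluation    = runif(100, 0.4, 1),
  left               = rbinom(100, 1, 0.3)
)

# Turn the 0/1 code into a factor so each group gets its own colour.
df$left_f <- factor(df$left, levels = c(0, 1), labels = c("stayed", "left"))

p <- ggplot(df, aes(x = satisfaction_level, y = last_evaluation, col = left_f)) +
  geom_point(alpha = 0.8)
print(p)
```

Labelling the factor levels also gives the legend readable entries instead of 0 and 1.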

We can now take this one step further and facet our scatter plot by another categorical column of interest with the code shown below:

    ggplot(data = HR, aes(x = satisfaction_level, y = last_evaluation, col = left)) +
      geom_point(alpha = 0.8) +
      facet_grid(. ~ salary)

Here we used “facet_grid()” to split the scatter plot into one panel per level of the categorical variable “salary”, which has three levels: 0, 1 and 2. The resulting plot is shown below:
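The formula passed to facet_grid() has the form rows ~ columns, where a dot means "no split in this direction". The sketch below splits the panels by two variables at once, on simulated data (made-up values with the same column names).

```r
library(ggplot2)

# Simulated stand-in for the HR data (made-up values).
set.seed(4)
df <- data.frame(
  satisfaction_level = runif(150, 0.1, 1),
  last_evaluation    = runif(150, 0.4, 1),
  left               = factor(rbinom(150, 1, 0.3)),
  salary             = factor(sample(0:2, 150, replace = TRUE))
)

# Rows are split by left, columns by salary: a 2 x 3 grid of panels.
p <- ggplot(df, aes(x = satisfaction_level, y = last_evaluation)) +
  geom_point(alpha = 0.8) +
  facet_grid(left ~ salary)
print(p)
```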

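The next plot that we will learn about is the bar plot. geom_bar() counts the rows in each category on the x axis, and the fill aesthetic splits each bar by a second categorical column. We can build one using the code shown below: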

    ggplot(data = HR, aes(x = factor(department), fill = factor(left))) +
      geom_bar()

The resulting plot is shown below:

The next plot is the simple but incredibly useful histogram. The code to construct a histogram in R is shown below:

    ggplot(data = HR, aes(x = average_montly_hours)) +
      geom_histogram()

This results in the plot shown below:
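By default geom_histogram() uses 30 bins and prints a message suggesting that you pick a better value. A short sketch with an explicit binwidth follows, on simulated hours (made-up values, not the HR data).

```r
library(ggplot2)

# Simulated monthly hours (made-up values); the column name follows the guide's code.
set.seed(5)
df <- data.frame(average_montly_hours = round(rnorm(300, mean = 200, sd = 40)))

# binwidth is in the units of x: here each bar spans 10 hours.
p <- ggplot(df, aes(x = average_montly_hours)) +
  geom_histogram(binwidth = 10, fill = "steelblue", col = "white")
print(p)
```

Trying a few bin widths is worthwhile, since a histogram's apparent shape can change noticeably with the choice.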

We can take this one step further and facet the histogram by a categorical column with the code shown below:

    ggplot(data = HR, aes(x = average_montly_hours)) +
      geom_histogram() +
      facet_grid(. ~ salary)

Faceting it by the salary column resulted in the plot as shown below:

We can draw a pair plot, which shows how the columns in our dataset are related to and correlated with one another, using the GGally package and the code shown below:

    install.packages("GGally")
    library(GGally)
    ggpairs(HR)

The resulting plot is shown below:
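On a data frame with many columns a pair plot can be slow to draw and hard to read, so it can help to look at the numbers first. The sketch below computes the pairwise Pearson correlations of the numeric columns with base R; the data are simulated stand-ins, not the HR dataset.

```r
# Simulated stand-in for the HR data (made-up values).
set.seed(6)
df <- data.frame(
  satisfaction_level   = runif(100, 0.1, 1),
  last_evaluation      = runif(100, 0.4, 1),
  average_montly_hours = sample(96:310, 100, replace = TRUE),
  salary               = sample(c("low", "medium", "high"), 100, replace = TRUE)
)

# Keep numeric columns only; cor() does not accept character columns.
num_cols <- df[sapply(df, is.numeric)]
cor_mat  <- round(cor(num_cols), 2)
cor_mat
```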

Density plots show how the data are distributed. We can construct a two-dimensional density plot in R using the code shown below:

    install.packages("viridis")
    library(viridis)
    ggplot(data = HR, aes(x = satisfaction_level, y = last_evaluation)) +
      # after_stat(density) is the current form of ..density.. (needs ggplot2 >= 3.3.0)
      stat_density_2d(geom = "tile", aes(fill = after_stat(density)), contour = FALSE) +
      scale_fill_viridis()

The resulting plot shows us where the data points are most concentrated and where they are sparse.

We can plot an alternative, contour-line version of the same density plot with the code shown below:

    ggplot(data = HR, aes(x = satisfaction_level, y = last_evaluation)) +
      geom_density_2d()

This results in the plot shown below:
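Both 2D density layers rest on a two-dimensional kernel density estimate, which ggplot2 computes with MASS::kde2d(). A minimal sketch calling that function directly is shown here on simulated data (made-up values); MASS is one of the recommended packages normally installed with R.

```r
library(MASS)

# Simulated stand-ins for the two plotted columns (made-up values).
set.seed(7)
satisfaction_level <- runif(200, 0.1, 1)
last_evaluation    <- runif(200, 0.4, 1)

# Estimate the density on a 50 x 50 grid covering the data range.
dens <- kde2d(satisfaction_level, last_evaluation, n = 50)
dim(dens$z)  # rows follow dens$x, columns follow dens$y

# Grid location with the highest estimated density.
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
c(x = dens$x[peak[1, 1]], y = dens$y[peak[1, 2]])
```

The tile version colours each grid cell by its value in `z`, while the contour version draws lines of equal `z`.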

Violin plots are a useful way to compare the distribution of a continuous variable across the levels of a categorical one. In order to construct a violin plot we use the code shown below:

    ggplot(data = HR, aes(x = factor(salary), y = last_evaluation, fill = factor(salary))) +
      geom_violin(col = NA)

The resulting plot is shown below:

Density plots that show the distribution of a single continuous variable, split by a categorical variable, can be constructed using the code shown below:

    ggplot(data = HR, aes(x = average_montly_hours, fill = factor(salary))) +
      geom_density(alpha = 0.35, col = NA)

The resulting plot is shown below:

The next plot that we will construct is the box plot, which summarizes a distribution by its median and quartiles. The code is shown below:

    ggplot(data = HR, aes(x = factor(salary), y = satisfaction_level)) +
      geom_boxplot()

The resulting box plot is shown below:
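Each box runs from the first to the third quartile, with a line at the median. The sketch below computes those three numbers per group with base R so they can be checked against the plot; the data are simulated (made-up values with the same column names).

```r
# Simulated stand-in for the HR data (made-up values).
set.seed(8)
df <- data.frame(
  salary             = factor(sample(0:2, 150, replace = TRUE)),
  satisfaction_level = runif(150, 0.1, 1)
)

# One column per salary level; rows are the 25%, 50% and 75% quantiles.
q <- sapply(
  split(df$satisfaction_level, df$salary),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
round(q, 3)
```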

We might want to use a different theme to make our plots look more polished. The ggthemes package provides several, and we can apply one using the code shown below:

    install.packages("ggthemes")
    library(ggthemes)
    plot <- ggplot(data = HR, aes(x = factor(salary), y = satisfaction_level)) +
      geom_boxplot()
    plot + theme_economist()

The same box plot that we plotted earlier using the default theme now looks like this as a result of using the “economist” theme:
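ggplot2 also ships with several built-in themes that need no extra package. A brief sketch using theme_minimal(), with one further tweak through theme(), is shown here on simulated data (made-up values).

```r
library(ggplot2)

# Simulated stand-in for the HR data (made-up values).
set.seed(9)
df <- data.frame(
  salary             = factor(sample(0:2, 150, replace = TRUE)),
  satisfaction_level = runif(150, 0.1, 1)
)

p <- ggplot(df, aes(x = salary, y = satisfaction_level)) +
  geom_boxplot()

# A complete theme first, then individual overrides with theme().
styled <- p +
  theme_minimal(base_size = 14) +
  theme(panel.grid.minor = element_blank())
print(styled)
```

Order matters here: a complete theme such as theme_minimal() resets earlier theme() settings, so overrides go after it.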

Sometimes we might want to overlay summary statistics, such as the mean and its 95% confidence interval, on the data points. We can do this in ggplot2 using the code shown below:

    install.packages("Hmisc")
    library(Hmisc)
    ggplot(data = HR, aes(x = factor(salary), y = satisfaction_level)) +
      geom_point() +
      stat_summary(fun.data = "mean_cl_normal", geom = "crossbar", width = 0.2, col = "red")

The “stat_summary()” layer computes summary statistics for each group. Here “mean_cl_normal” returns the mean with its 95% confidence limits, which are drawn as a red crossbar. The resulting plot looks like this:
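The crossbar limits are a t-based confidence interval around the group mean. The same numbers can be sketched by hand for a single group; the values below are simulated, not taken from the HR data.

```r
# Simulated satisfaction scores for one group (made-up values).
set.seed(10)
x <- runif(40, 0.1, 1)

n    <- length(x)
m    <- mean(x)
# Half-width of the 95% interval: t quantile times the standard error.
half <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)

ci <- c(lower = m - half, mean = m, upper = m + half)
round(ci, 3)
```

The interval describes uncertainty about the group mean, not the spread of individual points, which is why it is much narrower than the point cloud.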

While there are plenty of functions left for you to explore in the ggplot2 package, this guide should serve as a solid starting point. With these fundamentals in place, you can keep expanding your data visualization skills.

Happy Visualization!