How do you visualize data in Python?


Python has excellent packages such as ‘matplotlib’ and ‘seaborn’ for producing attractive and statistically meaningful visualizations. This guide explores what a data scientist can accomplish with these packages. To help you understand both effectively, the guide is divided into two parts. Part one covers ‘matplotlib’ and part two covers ‘seaborn’, in that order because ‘seaborn’ is built on top of ‘matplotlib’.

PART ONE: Matplotlib

The first thing we need to do is install and import the package. To install it, run ‘pip install matplotlib’ in your machine’s terminal.

To import the package in your Python IDE, the usual convention is ‘import matplotlib.pyplot as plt’.
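As a quick check that the installation worked, a minimal setup sketch might look like the following. Installing and importing ‘pandas’ alongside matplotlib is an assumption made here, because the later examples hold the dataset in a pandas DataFrame.

```python
# Install from a terminal first (not from inside Python):
#     pip install matplotlib pandas
import matplotlib
import matplotlib.pyplot as plt  # 'plt' is the conventional alias
import pandas as pd              # assumed here for loading the dataset

# Printing the versions confirms that both packages were found.
print("matplotlib", matplotlib.__version__)
print("pandas", pd.__version__)
```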

In this guide we are going to use the HR analytics dataset found on Kaggle. Feel free to use the same dataset for practice.

HR Analytics Dataset: https://www.kaggle.com/ludobenistant/hr-analytics

In order to plot a simple histogram of the “number_project” column we use the code shown below:
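The original snippet is not reproduced here, so the following is only a sketch of one way to draw this histogram. It uses a tiny made-up DataFrame as a stand-in for the Kaggle data; only the column name ‘number_project’ comes from the guide, and the values and the choice of six bins are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the Kaggle data; in practice you would load the
# downloaded CSV into a DataFrame with pd.read_csv instead.
df = pd.DataFrame({"number_project": [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]})

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges and the drawn bars.
counts, bin_edges, _ = ax.hist(df["number_project"], bins=6)
plt.show()
```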

The resulting output is shown below:

You can plot histograms by a categorical variable as shown below:
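One possible way to do this with plain matplotlib is to draw one histogram per level of the categorical column, side by side. This is a sketch on made-up data, not necessarily the approach used in the original snippet; using ‘left’ as the grouping column is an assumption based on the box plot example later in this guide.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data: 'left' is 1 for employees who left, 0 otherwise.
df = pd.DataFrame({
    "number_project": [2, 3, 4, 4, 5, 2, 2, 6, 7, 3],
    "left":           [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

groups = df.groupby("left")["number_project"]
# One subplot per level of 'left', sharing axes so the panels are comparable.
fig, axes = plt.subplots(1, groups.ngroups, sharex=True, sharey=True)
for ax, (level, values) in zip(axes, groups):
    ax.hist(values, bins=list(range(2, 9)))  # one bin per project count
    ax.set_title(f"left = {level}")
    ax.set_xlabel("number_project")
plt.show()
```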

The output of which is shown below:

Next, let’s look at how to plot a simple box plot. A box plot gives you a lot of statistically useful information, such as the 1st, 2nd and 3rd quartiles as well as outliers.
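A minimal box plot sketch is given here. The guide does not say which column was plotted, so the column name ‘time_spend_company’ and its values are assumptions chosen to include one clear outlier.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values; the column name is an assumption, not taken from the guide.
df = pd.DataFrame({"time_spend_company": [2, 3, 3, 3, 4, 4, 5, 5, 6, 10]})

fig, ax = plt.subplots()
# boxplot returns a dict of the drawn artists (boxes, medians, fliers, ...).
result = ax.boxplot(df["time_spend_company"].to_numpy())
ax.set_ylabel("time_spend_company")
plt.show()
```

The box spans the 1st to 3rd quartile, the line inside it marks the median (2nd quartile), and points beyond the whiskers are drawn individually as outliers.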

The output of the code is shown below:

You can also plot box plots by categorizing by a particular column as shown below:
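A sketch of one way to group box plots by a column with plain matplotlib: pass one array per level. The data is made up, and plotting ‘number_project’ split by ‘left’ is an assumption about what the original snippet showed.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "number_project": [3, 4, 4, 5, 3, 2, 2, 6, 7, 6],
    "left":           [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

levels = sorted(df["left"].unique())
# One array of values per level of 'left'; each becomes its own box.
samples = [df.loc[df["left"] == lv, "number_project"].to_numpy() for lv in levels]

fig, ax = plt.subplots()
result = ax.boxplot(samples)
ax.set_xticklabels([str(lv) for lv in levels])
ax.set_xlabel("left")
ax.set_ylabel("number_project")
plt.show()
```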

The code produces two box plots side by side because the column ‘left’ has two levels – ‘0’ and ‘1’.

The next kind of plot is the scatter plot. Scatterplots show you the relationship between two variables X and Y.
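A scatter plot sketch using the pandas plotting wrapper around matplotlib, which accepts ‘xlim’ and ‘ylim’ as keyword arguments. Whether the original snippet used this wrapper is an assumption, and so are the column names ‘satisfaction_level’ and ‘last_evaluation’ and the made-up values.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data; both column names are assumptions.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42],
    "last_evaluation":    [0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53],
})

# xlim and ylim fix the visible range of each axis.
ax = df.plot(kind="scatter", x="satisfaction_level", y="last_evaluation",
             xlim=(0, 1), ylim=(0.4, 1.05))
plt.show()
```

With plain matplotlib axes, the equivalent calls are ax.set_xlim and ax.set_ylim.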

Notice the additional argument called “xlim”. The “xlim” and “ylim” arguments set custom limits for the x and y axes of the plot, so that we can zoom in and out as required. The output of the code is shown below:

We can also apply custom styles to our plots using matplotlib. For example, to use the ‘ggplot’ style for our plots, we can use the code shown below:
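A minimal sketch of switching styles with plt.style.use. The histogram values are made up and serve only to show the style in effect.

```python
import matplotlib.pyplot as plt

# plt.style.available lists every style name that can be passed to style.use.
print(plt.style.available)

# Applies to every figure created from this point on.
plt.style.use("ggplot")

fig, ax = plt.subplots()
ax.hist([2, 3, 3, 4, 4, 4, 5, 5, 6, 7], bins=6)
plt.show()
```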

The output that we get is shown below:

You can add legends to your plots using the code shown below:
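A sketch of adding a legend: each plotted series gets a ‘label’, and ax.legend collects those labels. The two made-up series and their label names are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Made-up project counts for two groups of employees.
stayed = [3, 3, 4, 4, 4, 5]
left = [2, 2, 6, 6, 7]
bins = list(range(2, 9))

fig, ax = plt.subplots()
# The label of each series is what appears in the legend.
ax.hist(stayed, bins=bins, alpha=0.6, label="stayed")
ax.hist(left, bins=bins, alpha=0.6, label="left")
legend = ax.legend(loc="upper right")
plt.show()
```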

The ‘loc’ argument stands for the location on the plot at which you want your legend to be displayed. The location can be ‘upper center’, ‘upper right’, ‘upper left’, ‘lower left’, ‘lower center’, ‘lower right’, ‘center left’, ‘center’ and ‘center right’. You can also use ‘best’ to obtain the best location for your legend.

We can add labels along the x and y axis as well as titles to our plots using the code below:
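A sketch of labelling a plot with set_xlabel, set_ylabel and set_title. The data and the label wording are made up for illustration.

```python
import matplotlib.pyplot as plt

projects = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]  # made-up values

fig, ax = plt.subplots()
ax.hist(projects, bins=6)
ax.set_xlabel("Number of projects")
ax.set_ylabel("Number of employees")
ax.set_title("Distribution of projects per employee")
plt.show()
```

When working through the pyplot interface instead of an Axes object, plt.xlabel, plt.ylabel and plt.title do the same job.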

 

PART TWO: Seaborn

Seaborn is fundamentally a layer over matplotlib and offers a wider array of statistical plots than matplotlib does. Let’s explore what we can do with seaborn:

To use seaborn, we need to install and import it as shown below:
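A minimal setup sketch, assuming installation through pip. The alias ‘sns’ is the common convention rather than a requirement.

```python
# Install from a terminal first (not from inside Python):
#     pip install seaborn
import seaborn as sns            # 'sns' is the conventional alias
import matplotlib.pyplot as plt  # still used to show or adjust the figures

# Printing the version confirms that the package was found.
print("seaborn", sns.__version__)
```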

We can plot a scatterplot coupled with a linear regression line of best fit by using the code shown below:
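One seaborn function that draws a scatter plot together with a fitted regression line is lmplot; whether the original snippet used it is an assumption. The sketch below runs on randomly generated stand-in data, and the column names ‘satisfaction_level’ and ‘last_evaluation’ are assumptions as well.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Randomly generated stand-in data with a roughly linear relationship.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.1, 1.0, n),
    "left": np.repeat([0, 1], n // 2),
})
df["last_evaluation"] = 0.4 + 0.5 * df["satisfaction_level"] + rng.normal(0, 0.1, n)

# One regression line per level of 'left', colored from the 'Set1' palette.
g = sns.lmplot(data=df, x="satisfaction_level", y="last_evaluation",
               hue="left", palette="Set1")
plt.show()
```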

The ‘hue’ argument groups the points by a categorical variable, while the ‘palette’ argument colors the groups using the ‘Set1’ palette.

The output of the plot is displayed below:

Stripplots are useful to visualize relationships between continuous and categorical variables. We can plot a strip plot using the code shown below:
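A strip plot sketch on randomly generated stand-in data. Placing the categorical ‘left’ column on the x axis and an assumed continuous column ‘satisfaction_level’ on the y axis is an illustrative choice, not taken from the original snippet.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Randomly generated stand-in data.
rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({
    "left": np.repeat([0, 1], n // 2),
    "satisfaction_level": rng.uniform(0.1, 1.0, n),
})

# Each point is one employee; points are jittered sideways to reduce overlap.
ax = sns.stripplot(data=df, x="left", y="satisfaction_level")
plt.show()
```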

The output of the plot is shown below:

Pairplots are a great way to visualize the relationship between all the numerical variables in your dataset. In order to plot a pair plot we need to use the code as shown below:

The code above has the ‘hue’ argument set to ‘left’ to split the pair plot into the people who left and the people who stayed in the company.

Joint plots are useful because they help you visualize the relationship between two variables X and Y while also giving you the individual histogram distributions of the two variables. The code to generate a joint plot is shown below:

The output that is generated looks like this:

The most useful aspect of joint plots is that you can use an argument called ‘kind’ to specify the kind of plot you want. For example, if we use kind = ‘kde’ we get a density plot as shown below:

Besides ‘kde’, the ‘kind’ argument can be set to:

  • ‘scatter’ – For simple scatter plots
  • ‘hex’ – For a hex bin of the plot distribution
  • ‘reg’ – For a linear regression plot
  • ‘resid’ – For a residual plot

In conclusion, the world of data visualization is a vast one, filled with many tools that you can explore. The plots covered in this guide are some of the most popular ones used in the field of data science. Read the matplotlib and seaborn documentation to find out about all the plots you can use to make beautiful and meaningful visualizations.

Link for Matplotlib Documentation:  https://matplotlib.org/contents.html

Link for seaborn Documentation:  https://seaborn.pydata.org

Happy Visualization!

 

 

 
