The ultimate guide to cleaning data in R

Underestimating R relative to Python when it comes to cleaning data would be a mistake. R has a comprehensive set of tools designed specifically for cleaning data effectively.

STEP 1: Initial Exploratory Analysis

The first step in the data cleaning process is an initial exploration of the data frame that you have just imported into R. You need to know how to import data into R and save it as a data frame. A guide to importing data into R is available here: http://lineardata.net/how-do-you-import-data-into-rstudio/

The first thing that you should do is check the class of your data frame:
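As a minimal sketch, assuming a small made-up data frame called ratings (the name and its columns are placeholders, not the original dataset), the check could look like this:

```r
# Hypothetical stand-in for the imported data (names and values are made up)
ratings <- data.frame(
  title = c("Show A", "Show B", "Show C"),
  user_rating_score = c(82, 91, 67)
)

# class() reports what kind of object we are working with
class(ratings)  # "data.frame"
```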

The output confirms that our dataset is saved as a data frame.

Next, we want to check the number of rows and columns the data frame has:
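A sketch of the dimension check, using a simulated data frame built to have 1000 rows and 7 columns. Every column name and value here is an assumption made for illustration:

```r
# Hypothetical stand-in data: 1000 rows, 7 columns (all names and values are made up)
set.seed(42)
n <- 1000
ratings <- data.frame(
  id                = seq_len(n),
  title             = paste("Show", seq_len(n)),
  rating            = sample(c("G", "PG", "TV-14"), n, replace = TRUE),
  release_year      = sample(2000:2017, n, replace = TRUE),
  user_rating_score = round(runif(n, 50, 100)),
  user_rating_size  = sample(c(80, 81, 82), n, replace = TRUE),
  genre             = sample(c("Drama", "Comedy"), n, replace = TRUE)
)

dim(ratings)   # rows first, then columns: 1000 7
nrow(ratings)  # rows only
ncol(ratings)  # columns only
```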

The output shows that the data frame has 1000 rows and 7 columns.

After this we want to load the “dplyr” package and glimpse the data frame as shown below:
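Here is a short sketch of that step. It assumes dplyr is installed, and the ratings data frame is again a made-up placeholder:

```r
library(dplyr)

# Hypothetical stand-in data (names and values are made up)
ratings <- data.frame(
  title = c("Show A", "Show B", "Show C"),
  release_year = c(2015L, 2016L, 2017L),
  user_rating_score = c(82, 91, 67)
)

# glimpse() prints one line per column: its name, its type and the first few values
glimpse(ratings)
```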

Glimpsing the data frame gives us a brief idea of the type of each column as well as the data it contains.

Next, we want to look at the first five and the last five observations of the data frame in table form. We can do this using the code shown below:
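One way to do this in base R is with head() and tail(). The data frame below is a made-up example with 12 rows:

```r
# Hypothetical stand-in data with 12 rows (values are made up)
ratings <- data.frame(
  id = 1:12,
  user_rating_score = seq(60, 93, by = 3)
)

head(ratings, 5)  # first five observations
tail(ratings, 5)  # last five observations
```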

We can view the summary statistics for all the columns of the data frame using the code shown below:
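A minimal sketch with base R's summary(), again on a made-up data frame:

```r
# Hypothetical stand-in data (names and values are made up)
ratings <- data.frame(
  title = c("Show A", "Show B", "Show C", "Show D"),
  release_year = c(2014, 2015, 2016, 2017),
  user_rating_score = c(82, 91, 67, 75)
)

# Numeric columns get min, quartiles, median, mean and max;
# character columns get their length and class
summary(ratings)
```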

STEP 2: Visual Exploratory Analysis

There are three types of plots that you should use during the cleaning process: the histogram, the boxplot and the scatter plot.

1. The Histogram

The histogram is very useful for visualizing the overall distribution of a numeric column. We can see whether the data looks normal, unimodal, bimodal or follows some other distribution of interest. We can also use histograms to spot outliers in the numeric column under study. To plot a histogram for a column we use the code shown below:
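A sketch using base R's hist() on simulated scores. The scores vector and the number of breaks are assumptions chosen for illustration:

```r
# Simulated scores standing in for a numeric column (values are made up)
set.seed(1)
scores <- round(rnorm(200, mean = 80, sd = 8))

# hist() draws the plot and returns the bin breaks and counts
h <- hist(scores,
          breaks = 20,
          main = "Distribution of user rating scores",
          xlab = "User rating score")
```

With a real data frame you would pass the column itself, for example hist(ratings$user_rating_score).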

2. The Boxplot

Boxplots are very useful because they show the median along with the first and third quartiles. They are also one of the most direct ways of spotting outliers in a data frame, since points that fall beyond the whiskers are drawn individually. To draw a boxplot we use the code shown below:
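A sketch with base R's boxplot(). The scores are made up, and one deliberately low value is included so that it shows up as an outlier:

```r
# Made-up scores with one suspiciously low value
scores <- c(70, 72, 75, 78, 80, 82, 85, 88, 90, 20)

# boxplot() draws the plot and returns the statistics behind it
b <- boxplot(scores, main = "User rating score", ylab = "Score")

b$stats  # lower whisker, first quartile (hinge), median, third quartile (hinge), upper whisker
b$out    # points beyond the whiskers, here the value 20
```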

3. The Scatter Plot

Scatter plots help us visualize bivariate and multivariate relationships. Why is this useful? Simple: we can see how two or more variables are correlated. We can also use these plots to look for outliers. To draw a scatter plot we use the code shown below:
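A sketch with base R's plot(). Both vectors are made up, and the last point is placed far above the rest so that it lands in the top right corner:

```r
# Made-up values; the last observation is an intentional outlier
release_year      <- c(2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017)
user_rating_score <- c(70, 72, 71, 75, 74, 78, 77, 140)

plot(release_year, user_rating_score,
     main = "Score by release year",
     xlab = "Release year",
     ylab = "User rating score")

# A single number summarising the linear relationship
cor(release_year, user_rating_score)
```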

Notice how we could locate an outlier at the extreme top right corner of the plot?

STEP 3: Correcting the errors!

This step focuses on the methods that you can use to correct the errors that you found during the data exploration in steps 1 and 2.

If we want to change the name of our data frame, we can do so using the code shown below:
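R has no dedicated rename command for objects, so one simple approach is to assign the data frame to the new name and remove the old one. The names ratings and netflix below are placeholders:

```r
# Hypothetical stand-in data (names and values are made up)
ratings <- data.frame(
  title = c("Show A", "Show B"),
  user_rating_score = c(82, 91)
)

netflix <- ratings  # copy the data frame under the new name
rm(ratings)         # drop the old name so only one copy remains
```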

If we see an incorrect column name we can change it using the code shown below:
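A base R sketch that renames a column by matching on its current name. The column names are assumptions based on the description in this guide:

```r
# Hypothetical stand-in data (names and values are made up)
ratings <- data.frame(
  title = c("Show A", "Show B"),
  user_rating_score = c(82, 91)
)

# Find the column by its current name and overwrite that name
names(ratings)[names(ratings) == "user_rating_score"] <- "rating"

names(ratings)  # "title" "rating"
```

If you already have dplyr loaded, rename(ratings, rating = user_rating_score) does the same job.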

In the code above we renamed the user rating score column to simply “rating”.

Sometimes columns have an incorrect type associated with them. For example, a column of numbers may have been read in as text. In such a case we can change the type of the column by using the code shown below:
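One common case is numbers that were read in as text. A sketch of the conversion, on made-up data:

```r
# Hypothetical stand-in data: the scores were read in as text
ratings <- data.frame(
  title = c("Show A", "Show B", "Show C"),
  user_rating_score = c("82", "91", "67"),
  stringsAsFactors = FALSE
)

class(ratings$user_rating_score)  # "character"

# Convert the column in place
ratings$user_rating_score <- as.numeric(ratings$user_rating_score)

class(ratings$user_rating_score)  # "numeric"
mean(ratings$user_rating_score)   # 80
```

Values that cannot be parsed as numbers become NA with a warning, so it is worth checking for new missing values after converting.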

There is a wide array of type conversions you can carry out in R, including as.character(), as.numeric(), as.integer(), as.factor() and as.logical().

String manipulation in R comes in handy when you are working with datasets that have a lot of text-based elements.

In order to change all the text in a particular column to uppercase or lowercase we need to execute the code shown below:
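A base R sketch using toupper() and tolower() on a made-up column:

```r
# Hypothetical stand-in data with inconsistent casing
ratings <- data.frame(
  title = c("Show A", "show b", "SHOW C"),
  stringsAsFactors = FALSE
)

ratings$title_upper <- toupper(ratings$title)  # "SHOW A" "SHOW B" "SHOW C"
ratings$title_lower <- tolower(ratings$title)  # "show a" "show b" "show c"
```

The stringr package offers str_to_upper() and str_to_lower() as equivalents.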

If we want to trim the whitespace from the text in a column we need to use the code shown below:
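A sketch with base R's trimws(), which removes leading and trailing whitespace. The data is made up:

```r
# Hypothetical stand-in data with stray spaces around the titles
ratings <- data.frame(
  title = c("  Show A", "Show B  ", "  Show C  "),
  stringsAsFactors = FALSE
)

# which = "both" is the default; "left" and "right" trim one side only
ratings$title <- trimws(ratings$title, which = "both")

ratings$title  # "Show A" "Show B" "Show C"
```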

If we want to replace a particular word or letter under a column we can do so using the code below:
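A sketch with base R's gsub(), which replaces every match within each string. The column and the values being replaced are made up:

```r
# Hypothetical stand-in data
ratings <- data.frame(
  rating = c("TV-14", "TV-PG", "PG"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats the pattern as plain text rather than a regular expression
ratings$rating <- gsub("TV-", "", ratings$rating, fixed = TRUE)

ratings$rating  # "14" "PG" "PG"
```

Use sub() instead of gsub() if only the first match in each string should be replaced.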

The next section will show you how to deal with your missing values:
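The sketch below shows a few common base R options for missing values: counting them, dropping incomplete rows, and filling gaps with the column mean. Which option is appropriate depends on your data, and the data frame here is made up:

```r
# Hypothetical stand-in data with one missing score
ratings <- data.frame(
  title = c("Show A", "Show B", "Show C", "Show D"),
  user_rating_score = c(82, NA, 67, 91)
)

# Count the missing values in each column
colSums(is.na(ratings))

# Option 1: drop every row that contains a missing value
complete <- na.omit(ratings)

# Option 2: fill the gaps with the mean of the observed values
filled <- ratings
missing <- is.na(filled$user_rating_score)
filled$user_rating_score[missing] <- mean(ratings$user_rating_score, na.rm = TRUE)
```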

Suppose we want to unite two columns in our data frame. We can do so using the code shown below:
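A sketch using unite() from the tidyr package, assuming tidyr is installed. The people data frame and its column names are made up:

```r
library(tidyr)

# Hypothetical stand-in data
people <- data.frame(
  first_name = c("Ada", "Alan"),
  last_name  = c("Lovelace", "Turing"),
  stringsAsFactors = FALSE
)

# sep is an optional extra argument; without it the values are joined with "_"
united <- unite(people, "full_name", first_name, last_name, sep = " ")

united$full_name  # "Ada Lovelace" "Alan Turing"
```

By default the two original columns are dropped from the result; pass remove = FALSE to keep them.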

The unite() function takes 4 arguments: the data frame, the new column name, and the names of the first and second columns that you want to unite.

Conversely, we can also separate a column as shown below:
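A sketch using separate() from tidyr, assuming tidyr is installed. The data is made up, and each value is assumed to contain exactly one space:

```r
library(tidyr)

# Hypothetical stand-in data
people <- data.frame(
  full_name = c("Ada Lovelace", "Alan Turing"),
  stringsAsFactors = FALSE
)

# Split full_name at the space into two new columns
split_up <- separate(people, full_name,
                     into = c("first_name", "last_name"),
                     sep = " ")

split_up$first_name  # "Ada" "Alan"
split_up$last_name   # "Lovelace" "Turing"
```

Recent tidyr releases mark separate() as superseded in favour of newer functions, but it still works as shown.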

The separate() function takes 4 arguments: the data frame, the column that we want to separate, the names of the new columns and the separator at which we want the column to be split.

Following steps 1 to 3 above should give you a relatively clean dataset. Remember to go through the documentation of packages such as “stringr” and “tidyr” to find out about the other tools that you could use to clean messy data, and keep exploring new ways to clean your data.

Happy Cleaning!