The ultimate guide to cleaning data in Python!

A data scientist spends roughly 70% of their time cleaning messy, unstructured data. If you are going to spend that much time on one task, you might as well know how to do it well. This guide gives you a comprehensive, step-by-step procedure for cleaning your data in Python.

STEP 1: Initial exploratory analysis

Inspect the head and tail of your dataset using the code shown below:
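
A minimal sketch, assuming you are loading a CSV file ('hr_data.csv' is a placeholder for your own dataset):

```python
import pandas as pd

# Load the dataset into a DataFrame ('hr_data.csv' is a placeholder path)
df = pd.read_csv('hr_data.csv')

# Inspect the first 5 and last 5 rows
print(df.head())
print(df.tail())
```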

This shows the first 5 rows and the last 5 rows of your dataset so that you can briefly inspect it. Note: 'df' stands for DataFrame; we saved our dataset into a DataFrame called 'df' using pandas prior to this. Please read our guide at http://lineardata.net/how-do-you-import-data-into-python/ to learn how you can import data into Python.

Next, you will want to check whether all the column names are consistent. Column names might contain capital letters we want to convert to lowercase, white space we should eliminate, meaningless names that bear no relation to the data in the column, and many other common errors. To view all the column names we can use:
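
A minimal example:

```python
# List every column name in the DataFrame
print(df.columns)
```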

This returns a list of all the column names for you to inspect. Now assume we have a column named 'Work_tym', where 'time' is obviously misspelled as 'tym'. To fix this, we use the code shown below:
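
One way to do this, matching the rename-then-drop approach explained next ('Work_time' is the assumed corrected name):

```python
# Copy the misspelled column into a correctly named one
# ('Work_time' is the assumed corrected name)
df['Work_time'] = df['Work_tym']

# Remove the misspelled column; axis=1 tells pandas to drop a column
df = df.drop('Work_tym', axis=1)
```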

We need the df.drop() function to remove the misspelled column; otherwise it would remain in your DataFrame as an additional column. We specify 1 because 1 indicates we want to drop a column, while 0 indicates we want to drop rows.

Next, we want to drop the rows that have duplicate data. We can do this using the code shown below:
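
A minimal example:

```python
# Remove rows that are exact duplicates of an earlier row
df = df.drop_duplicates()
```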

After this, we want to check that every column has the right data type associated with it. For example, a column that holds numeric data might be stored as the 'object' type because a missing value was recorded as the string 'missing'. To spot such issues, we first need to list the column types:
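
For example:

```python
# Show the data type of every column
print(df.dtypes)
```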

This lists each column alongside its type, such as float64, int64, or object.

With this, we can see what type each column is. We can now change the type of any particular column using the code below:
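
Using the satisfaction_level column from this guide's examples:

```python
# Convert satisfaction_level from float64 to category
df['satisfaction_level'] = df['satisfaction_level'].astype('category')
```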

The code above converts the satisfaction_level column from 'float64' to 'category'.

Two other useful commands for initial exploration are df.describe(), which returns summary statistics such as the count, mean, standard deviation, and quartiles (including the median) of every numeric column, and df.shape, which returns the number of rows and columns in the dataset.
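
A minimal example of both, continuing with the same df:

```python
# Summary statistics (count, mean, std, min, quartiles, max) of numeric columns
print(df.describe())

# (number of rows, number of columns)
print(df.shape)
```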

Another useful metric you could add to your initial exploration is the frequency count of each value in a particular column. We can implement this using the code shown below:
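
Using the 'salary' column (referenced again in Step 3) as an example:

```python
# Frequency count of each unique value in the 'salary' column,
# including missing values
print(df['salary'].value_counts(dropna=False))
```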

The output lists each unique value in the column alongside the number of times it appears.

STEP 2: Visualize your data

I’m not asking you to visualize your data in great detail, but I do want you to check whether there are errors such as outliers. There are 3 fundamental visualizations that you must carry out:

1. The histogram

The histogram gives you a basic frequency count and allows you to determine whether a particular column is normally distributed. The code to plot a histogram is shown below:
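
A minimal sketch using matplotlib; it assumes satisfaction_level is still numeric (i.e. before the category conversion shown in Step 1):

```python
import matplotlib.pyplot as plt

# Histogram of satisfaction_level (assumes the column is numeric)
df['satisfaction_level'].plot(kind='hist')
plt.xlabel('satisfaction_level')
plt.show()
```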

2. The boxplot

A box plot is the best way to figure out whether you have outliers in your dataset. It shows the 1st, 2nd, and 3rd quartiles (the 2nd quartile is the median) and marks points beyond the whiskers as potential outliers. To plot a box plot, we use the code below:
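
A minimal sketch, again assuming satisfaction_level is a numeric column:

```python
import matplotlib.pyplot as plt

# Box plot of satisfaction_level; points beyond the whiskers are
# candidate outliers
df.boxplot(column='satisfaction_level')
plt.show()
```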

3. The scatterplot

The scatterplot helps us visualize the relationship between two variables. The code needed to produce a scatterplot is shown below:
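
A sketch of a scatterplot; 'last_evaluation' is a hypothetical second numeric column, so substitute one from your own dataset:

```python
import matplotlib.pyplot as plt

# Scatter plot of two numeric columns
# ('last_evaluation' is a hypothetical column name)
df.plot(kind='scatter', x='satisfaction_level', y='last_evaluation')
plt.show()
```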

STEP 3: Deal with your missing values

Dealing with missing values is crucial because it does not make sense to feed missing values into a machine learning algorithm or a neural network. There are a couple of ways you could deal with them. The first method is to drop them completely, either from the entire dataset or from a particular column. The second method is to fill them with a summary statistic, a number, or a string, depending on the problem at hand.

In order to drop all the missing values from a DataFrame, we use the code shown below:
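
A minimal example:

```python
# Drop every row that contains at least one missing value
df = df.dropna()
```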

If you want to drop the missing values from a particular column only, we can use the code shown below:
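
For example:

```python
# Drop only the rows where 'salary' is missing
df = df.dropna(subset=['salary'])
```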

Note that the column we are dropping the missing values from is called ‘salary’ in the code above.

The next way you could deal with missing values is to fill them. In the code below, I have filled the missing values in the 'salary' column with the value 0:
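
A minimal version using fillna():

```python
# Replace missing salaries with 0
df['salary'] = df['salary'].fillna(0)
```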

You could also fill in missing values with strings as shown below:
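
For instance, filling with the placeholder string 'missing':

```python
# Replace missing values with a placeholder string
df['salary'] = df['salary'].fillna('missing')
```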

The final way you could fill missing values is to use a summary statistic like the mean. You can implement this using the code shown below:
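
A sketch using the column mean; note this only makes sense when the column is numeric:

```python
# Replace missing values with the column mean
# (only meaningful when the column is numeric)
df['salary'] = df['salary'].fillna(df['salary'].mean())
```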

STEP 4: Merge datasets

In most data analyses you are going to have multiple datasets. The idea here is to follow steps 1 to 3 for each dataset under study; once they are all cleaned, you can merge them together.

Assume we have two data frames, 'DF1' and 'DF2', that we want to merge. We simply use the code shown below:
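
A self-contained sketch; 'employee_id' is a hypothetical join key, so substitute the column your frames actually share:

```python
import pandas as pd

# Two small example frames; 'employee_id' is a hypothetical shared key
DF1 = pd.DataFrame({'employee_id': [1, 2, 3],
                    'salary': [40000, 52000, 61000]})
DF2 = pd.DataFrame({'employee_id': [1, 2, 3],
                    'satisfaction_level': [0.4, 0.8, 0.6]})

# Merge the two frames on the shared key column
merged = pd.merge(DF1, DF2, on='employee_id')
print(merged)
```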

In conclusion, following steps 1 to 4 should yield a relatively clean dataset, depending on the complexity of the dataset at hand. There are techniques like 'melting' that will give you much greater control over the data cleaning process, but those will be covered in another guide in which manipulating data frames using pandas is the key focus.

Happy Cleaning!

 
