The Christmas guide to Natural Language Processing!


In the last guide of this year, we explore how to extract information from large volumes of text-based data and make sense of it using the NLTK package in Python!

Let’s start by importing all the necessary packages required for natural language processing in Python:
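A minimal sketch of the imports, assuming pandas is used to load the CSV and matplotlib to render the plots later on (the post does not name these packages explicitly):

    import nltk
    import pandas as pd
    import matplotlib.pyplot as plt

    # Download the tokenizer models used by nltk.word_tokenize() further below
    nltk.download('punkt')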

In order to take you through this guide we are going to use the popular Twitter dataset that was used for sentiment analysis on Kaggle. The dataset can be downloaded here – Tweets

When we import the dataset into our Jupyter Notebook we find that it has a column called ‘text’ which contains the raw text of every tweet. We need to isolate and extract the ‘text’ column using the code shown below:
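A sketch of loading the dataset and isolating the column, assuming the file was saved as ‘Tweets.csv’ in the working directory:

    # Load the Kaggle dataset and pull out the column of raw tweet text
    df = pd.read_csv('Tweets.csv')
    tweets = df['text']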

We need to ensure that the ‘text’ column is in string format. Let’s convert the column to strings and view the 1500th tweet using the code shown below:
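For example (note that index 1500 is zero-based):

    # Force every entry to a Python string, then inspect the tweet at index 1500
    tweets = tweets.astype(str)
    print(tweets[1500])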

The output of the code is illustrated below: 

We now want to convert our tweets into an NLTK Text object so that we can perform some fun tasks on it in order to extract some useful information. We can do this using the code shown below:
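A sketch, assuming the tokens from every tweet are pooled into one flat list:

    # Tokenize each tweet and flatten the results into a single list of tokens
    tokens = [token for tweet in tweets for token in nltk.word_tokenize(tweet)]

    # Wrap the tokens in an NLTK Text object, which provides concordance(),
    # similar(), common_contexts() and friends
    text = nltk.Text(tokens)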

In the code above we convert each word in a tweet into an object known as a ‘token’. We then convert all these tokens into an NLTK text object using the nltk.Text() function.

Let’s now view a concordance for the word – ‘angry’, which shows every occurrence of the word together with its surrounding context:
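A one-line sketch using the Text object built above:

    # Print each occurrence of 'angry' together with its surrounding context
    text.concordance('angry')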

The output of the code is illustrated below:

Thus we see that the 4 tweets above contain the word – ‘angry’ shown in context.

Let’s now check and see the contexts that the words ‘delay’ and ‘flight’ have in common:
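A sketch using the same Text object:

    # Show the contexts in which both 'delay' and 'flight' appear
    text.common_contexts(['delay', 'flight'])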

The output of the code is illustrated below: 

These are the contexts shared by the words ‘delay’ and ‘flight’.

Next, let’s find the words that are used in contexts similar to the word ‘delay’:
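Again a one-liner on the Text object:

    # List words that appear in contexts similar to those of 'delay'
    text.similar('delay')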

The output of the code is illustrated below: 

These are the words that appear in contexts similar to ‘delay’.

Let’s now proceed to find the 50 most commonly used words:
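A sketch using NLTK’s frequency distribution:

    # Build a frequency distribution over all tokens
    fdist = nltk.FreqDist(text)

    # The 50 most common tokens and their counts
    print(fdist.most_common(50))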

The output of the code is illustrated below: 

The output of the fdist.most_common() function shows us the most frequently occurring words along with their frequencies.

Let’s now plot the frequencies of the 50 most common words:
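A one-line sketch, reusing the fdist object from above (matplotlib must be installed for the plot to render):

    # Plot the counts of the 50 most frequent tokens
    fdist.plot(50)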

The output of the code is illustrated below: 

We can take this a step further and plot a cumulative frequency distribution plot using the code shown below:
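The same call with the cumulative flag switched on:

    # Same plot, but accumulating the counts from left to right
    fdist.plot(50, cumulative=True)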

The output of the code is illustrated below: 

Hapaxes are the words that occur only once. We can view these words using the code shown below:
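A one-line sketch, again reusing the fdist object:

    # Tokens that occur exactly once in the text
    print(fdist.hapaxes())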

The output of the code is illustrated below: 

Next we will proceed to remove all the stop words. Stop words are words that occur very frequently in the English language. These words do not provide much insight when it comes to making sense of the text at hand. Words such as ‘the’, ‘and’ and ‘but’ are examples of stop words. We can remove stop words using the code shown below:
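A sketch of the filtering step; the corpus is held in a variable named stop_words here so it does not shadow the imported stopwords module:

    # Download and load NLTK's corpus of English stop words
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))

    # Keep only the tokens whose lower-cased form is not a stop word
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]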

Here we first download the ‘stopwords’ package and then store the corpus of English stop words in a variable. We then use a simple list comprehension to keep only the tweet tokens whose lower-cased form is not part of the stop-word corpus.

Next we want to determine the lexical diversity of our word tokens. Lexical diversity is simply how unique or diverse a set of words is. It is defined as the ratio of the number of unique tokens to the total number of tokens. We can find this number using the code shown below:
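A sketch following the definition above (the 13.3 reported below suggests the original code may have printed the inverse ratio, i.e. the average number of uses per unique token, rather than the fraction itself):

    # Lexical diversity: unique tokens divided by total tokens
    lexical_diversity = len(set(filtered_tokens)) / len(filtered_tokens)
    print(lexical_diversity)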

The output of the code is shown below: 

From the output we can see that the lexical diversity score in this case was 13.3. A higher lexical diversity score usually indicates that a more diverse set of words is contained within your text.

This concludes the fundamentals of what you can achieve using the NLTK package in Python. To get a more in-depth view of what the package can do, have a look at the documentation, which can be found here – NLTK documentation.

Happy Natural Language Processing and Merry Christmas from everyone here at LinearData!
