The ultimate guide to sentiment analysis using Python!

Author:  Sarthak Anand

Sentiment analysis is a fundamental way of assessing how your customers feel about your product, or how people are reacting to a situation more generally. In this guide we explore how SafeCity, a non-profit committed to making cities safer for women through a collection of resources ranging from safety products to mobile applications, is using sentiment analysis to achieve this goal.

Safecity is a platform that crowdsources personal stories of sexual harassment and abuse in public spaces. This data, which may be anonymous, gets aggregated as hot spots on a map, indicating trends at a local level.

Their mobile application is presently available to download for free on the Google Play Store and the iOS App Store.

Below are the steps required to perform a thorough sentiment analysis on Twitter:

Step 1: Data Scraping

The most important part of sentiment analysis is collecting and pre-processing the text data. We can use the Twitter streaming API, which allows us to stream tweets in real time. Even better, there is a great Python library called Tweepy, a wrapper built around the Twitter API, which makes our task much easier.

The code to import the required libraries is displayed below:

If you don’t have any of these libraries, try installing them using pip. Note: preprocess_text is not a library but a function from another Python file used for pre-processing text.
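As a sketch, the dependency block might look like this (preprocess_text comes from a local preprocess.py file, as noted above, so pip cannot install it):

```python
import csv                         # saving the collected tweets to disk
import tweepy                      # wrapper around the Twitter API
import pandas as pd                # tabular analysis of the saved tweets
# preprocess_text lives in a local preprocess.py (sketched in Step 2),
# not on PyPI.
from preprocess import preprocess_text
```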

Next we want to write the code to authenticate with the Twitter API:

You can get your access keys after signing up for the Twitter API. After signing up you will receive the required tokens (do not share these tokens).
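With the tokens in hand, authentication with Tweepy looks roughly like this (the credential values here are placeholders, never real keys):

```python
import tweepy

# Placeholders: substitute the four credentials from your own
# Twitter developer dashboard. Never commit real tokens to version control.
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
```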

Next we want to initialize a couple of empty lists for the parameters we want to collect from the tweets:

All the tweets, along with related information such as username, location, and language, are stored in lists that will later be used to save all the data to a CSV file.
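A minimal sketch, assuming we keep the tweet text, username, location, and language:

```python
# One list per field we want to keep from each streamed tweet.
tweet_texts = []
usernames = []
locations = []
languages = []
```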

Next we want to create a stream listener. The stream listener streams tweets using Twitter’s API until there is a connection error or you hit the rate limit. Streaming can be terminated by pressing Cmd+Esc.

The code for the stream listener is shown below:
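A sketch using the Tweepy 3.x StreamListener class (the API changed in Tweepy v4). It assumes the lists and the auth handler from the steps above; the track keyword "safecity" is illustrative:

```python
import tweepy

class SafecityStreamListener(tweepy.StreamListener):
    """Collects fields from each streamed tweet into the lists above."""

    def on_status(self, status):
        tweet_texts.append(status.text)
        usernames.append(status.user.screen_name)
        locations.append(status.user.location)
        languages.append(status.lang)

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. on rate limiting (420).
        if status_code == 420:
            return False

listener = SafecityStreamListener()
stream = tweepy.Stream(auth=auth, listener=listener)
stream.filter(track=["safecity"])   # blocks until interrupted
```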

We then write code to store the streamed data into a CSV file:
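One way to do this with Python’s built-in csv module; the sample data here stands in for the streamed lists, and the filenames tweets.csv and tweets.txt are assumptions:

```python
import csv

# Sample data standing in for the lists filled during streaming.
tweet_texts = ["Reported an incident near the station", "Stay safe out there"]
usernames = ["user_a", "user_b"]
locations = ["Mumbai", "Delhi"]
languages = ["en", "en"]

# One CSV row per tweet, for further analysis.
with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "username", "location", "language"])
    writer.writerows(zip(tweet_texts, usernames, locations, languages))

# A plain-text dump of the tweet bodies, for the word cloud.
with open("tweets.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tweet_texts))
```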

This will generate a CSV file and a text file. The text file can be used to generate a word cloud, and the CSV file can be used for further analysis.

Step 2: Data Cleaning

Data cleaning is an important part of textual data analysis. For analysing the tweets we don’t need emojis, other symbols, or hyperlinks; removing this type of data makes our task much simpler.

While importing the dependencies, the function preprocess_text was imported from preprocess. It is simply a custom function that removes the characters above using simple regex techniques.
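A minimal sketch of what such a preprocess.py might contain; the exact patterns the author used are not shown, so these regexes are illustrative:

```python
import re

def preprocess_text(text):
    """Strip hyperlinks, @mentions, '#' symbols, and non-ASCII characters
    (emoji and other symbols) so only plain words remain."""
    text = re.sub(r"https?://\S+", "", text)   # hyperlinks
    text = re.sub(r"@\w+", "", text)           # @mentions
    text = re.sub(r"#", "", text)              # keep the tag word, drop '#'
    text = text.encode("ascii", "ignore").decode("ascii")  # emoji / symbols
    return " ".join(text.split())              # collapse extra whitespace
```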

Step 3: Data Visualization 

Seaborn is a Python visualization library based on matplotlib, usually imported as sns. It provides a high-level interface for drawing attractive statistical graphics.

The code below creates a bar plot of the number of tweets versus user location. This is just one use case of seaborn; it allows you to visualise data in many forms.
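A sketch of such a bar plot, assuming the tweets were saved to tweets.csv with a location column as in Step 1:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("tweets.csv")

# Count tweets per user location and plot the ten most frequent ones.
top_locations = df["location"].value_counts().head(10)
sns.barplot(x=top_locations.values, y=top_locations.index)
plt.xlabel("Number of tweets")
plt.ylabel("User location")
plt.tight_layout()
plt.show()
```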

The output is a bar chart of tweet counts for each user location.

Step 4: Sentiment Analysis 

Python has a toolkit for natural language processing called NLTK. If you only want to classify text as positive, negative, or neutral, it ships with a pre-trained analyzer that returns the polarity of your text.
For example:
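NLTK’s pre-trained analyzer is VADER; a minimal example, where the compound score runs from -1 (most negative) to +1 (most positive):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")   # one-time lexicon download

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("I feel much safer walking here at night now!")
print(scores)   # dict with 'neg', 'neu', 'pos', and 'compound' scores
```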

It is a pre-trained model that performs with decent accuracy and is very easy to use. But if you want to classify sentiment for a custom task, you have to build your own model and train it as well.

Training Your Custom Model Using LSTMs

Sentences are sequences of words, and the last few words help determine the next one. Recurrent Neural Networks (RNNs) can encode this information in a vector, but since they are hard to train, we can use LSTMs, which perform much better than plain RNNs.

Char-Based Approach:
In this approach we use one-hot encoding of characters. For example, say the vocabulary contains only { ‘a’, ‘b’, ‘c’ } (the set of unique characters); then c = [0, 0, 1] and b = [0, 1, 0].

Similarly, we represent our sentences, assuming all sentences have a fixed number of characters, like 140 on Twitter. For example, with vocabulary { ‘a’, ‘b’, ‘c’, ‘d’ } and the sentence “aab”, the encoding is sent = [[1,0,0,0], [1,0,0,0], [0,1,0,0]], where the first vector is the one-hot vector of ‘a’ and the last is the one-hot vector of ‘b’.

Since a tweet has a maximum of 140 characters, and assuming n_unique unique characters, every tweet can be represented as a matrix of shape (140, n_unique).
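A small sketch of this encoding; the toy vocabulary and max_len=5 are just for illustration, and for real tweets you would use the full character set and max_len=140:

```python
import numpy as np

def one_hot_tweet(tweet, vocab, max_len=140):
    """Encode a tweet as a (max_len, len(vocab)) one-hot matrix,
    zero-padding tweets shorter than max_len."""
    char_to_idx = {c: i for i, c in enumerate(vocab)}
    encoded = np.zeros((max_len, len(vocab)), dtype=np.float32)
    for pos, char in enumerate(tweet[:max_len]):
        if char in char_to_idx:            # unknown characters stay all-zero
            encoded[pos, char_to_idx[char]] = 1.0
    return encoded

vocab = sorted(set("aabc"))                # toy vocabulary ['a', 'b', 'c']
x = one_hot_tweet("aab", vocab, max_len=5)
```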

Once the input is ready, the model can be trained with a loss function such as categorical cross-entropy and any optimizer (Adam is preferred, though).

Coding the model in Keras:
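A minimal sketch of such a model; the layer sizes and the vocabulary size N_UNIQUE are assumptions, not the author’s exact architecture:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

MAX_LEN = 140     # characters per tweet (padded/truncated)
N_UNIQUE = 60     # size of the character vocabulary (an assumption)
N_CLASSES = 3     # positive / negative / neutral

model = Sequential([
    LSTM(128, input_shape=(MAX_LEN, N_UNIQUE)),
    Dense(N_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
model.summary()
# Once the one-hot data is prepared:
# model.fit(X_train, y_train, epochs=..., batch_size=...)
```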

I hope this guide has given you the ultimate introduction to performing a basic sentiment analysis for your business.

For more details on the work that SafeCity does to keep cities safe using data science, you can reach out to: Vihang Jumle, lead of data analytics at SafeCity.

Happy analysis!