Sentiment Analysis with Python: Bag of Words

By | October 12, 2020

Sentiment Analysis Overview

Sentiment Analysis (also known as opinion mining or emotion AI) is a common task in NLP (Natural Language Processing). It involves identifying or quantifying the sentiment expressed in a given sentence, paragraph, or document.

Sentiment Analysis techniques are widely applied to customer feedback data (e.g., reviews, survey responses, social media posts). Sentiment Analysis has proved important for deriving business insights such as:

  • Better product Recommendations
  • Behavioral Analysis of the market
  • Understanding public opinions through social media
  • Effectiveness of customer-support staff

In this article, we will train a traditional machine learning sentiment classification model from scratch, using the bag-of-words technique to create features. With bag-of-words features, we will experiment with the following three machine learning algorithms and compare the results:

  1. Logistic Regression
  2. Linear SVM
  3. Multinomial Naive Bayes

The rest of the article is divided into the following sub-sections:

  1. Dataset Overview
  2. Data Preprocessing
  3. Bag of Words features
  4. Logistic Regression
  5. Linear Support Vector Machine (LSVM)
  6. Naive Bayes
  7. Summary
Sentiment Analysis with Python: Bag of Words | Image by Tengyart | Image Source

1. Dataset Overview

IMDB movie review dataset for sentiment analysis

The IMDB Movie Review dataset contains 50K movie reviews for natural language processing or text analytics. Every review is labeled with its true sentiment value (positive or negative). The dataset is well balanced, with 25K examples of each sentiment class.

Let’s quickly peek into the dataset.

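import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
import nltk
import sklearn
# `import sklearn` alone does not load its submodules; import the ones used below
import sklearn.feature_extraction.text
import sklearn.linear_model
import sklearn.metrics
import sklearn.svm
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
%matplotlib inline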

You can download this dataset from Kaggle (the URL is provided in the references below). You can read the data into your Python notebook with the following snippet:

data = pd.read_csv("data/IMDB Dataset.csv")
print (data.shape)
data.head(10)

Each of the 50K reviews is tagged (or labeled) with its true sentiment value. Let’s look at the distribution of sentiments in this dataset:

data.sentiment.value_counts()
Out[4]: positive    25000
        negative    25000
        Name: sentiment, dtype: int64

The dataset is perfectly balanced: each sentiment value is associated with an equal number of examples (reviews in this case).


2. Data Preprocessing

Unstructured datasets are often noisy. So, the very first step is to preprocess the dataset and make it ready (consumable) for machine learning algorithms. We will apply the following preprocessing techniques, all of which are common in NLP:

  1. Data Cleaning
  2. Stop Words Removal
  3. Stemming

Data Cleaning

Data preprocessing steps depend on the nature of the problem you are solving, and so does the kind of data cleaning you need to do.

For sentiment analysis, mostly the words themselves matter, so it makes sense to remove special characters, symbols, and numbers from the text, as they contribute little to the sentiment of a sentence or paragraph.

Let’s remove the HTML tags and special characters from the data, as they do not add value to the sentiment of a review. Additionally, let’s convert all the reviews to lowercase so that ‘Happy’ and ‘happy’ are treated as the same word by the algorithm.

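def remove_html(text):
    # Strip HTML tags; pad with spaces so every word is surrounded by spaces
    bs = BeautifulSoup(text, "html.parser")
    return ' ' + bs.get_text() + ' '

def keep_only_letters(text):
    # Drop digits, punctuation, and other symbols
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

def convert_to_lowercase(text):
    return text.lower()

def clean_reviews(text):
    text = remove_html(text)
    text = keep_only_letters(text)
    text = convert_to_lowercase(text)
    return text

# Apply the cleaning pipeline to every review
data['review'] = data['review'].apply(clean_reviews)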
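
To see what the cleaning steps do to a single review, here is a minimal sketch that uses only the standard library. It swaps BeautifulSoup for a simple regex that replaces tags with a space (an assumption made to keep the sketch dependency-free; it is not a robust HTML parser). The sample review and the name `clean_review_sketch` are made up for illustration.

```python
import re

def clean_review_sketch(text):
    # Stand-in for remove_html: replace anything that looks like a tag with a space
    text = ' ' + re.sub(r'<[^>]+>', ' ', text) + ' '
    # Same idea as keep_only_letters: keep only letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Same idea as convert_to_lowercase
    return text.lower()

sample = "Loved it!<br />10/10, would watch again."  # hypothetical review
cleaned = clean_review_sketch(sample)
print(cleaned.split())  # ['loved', 'it', 'would', 'watch', 'again']
```

One design note: `get_text()` called without a separator joins the text on both sides of a tag directly, so words around a `<br />` can end up glued together. The sketch inserts a space instead; whether that matters for your data is worth checking on a few reviews.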

Stop Words Removal

Stop words are words (often very common words in a particular language) that do not add much to the meaning of a sentence or paragraph. Whether a given word should be treated as a stop word again depends on the problem you are solving.

For sentiment analysis, common words like ‘You’, ‘This’, ‘That’, and ‘The’ do not help in determining the sentiment of a given sentence. These words are very frequent in English sentences, so it makes sense to remove them beforehand to reduce the complexity of our model. Note that the NLTK list also includes negations such as ‘not’ and ‘no’, which can carry sentiment; we remove the full list here for simplicity.

Natural Language Toolkit (nltk) comes with a pre-defined list of common stop words for the English language. You can also define your own custom set of stop words.

We will remove NLTK’s English stop words from our dataset (179 words in the version used here; the first 20 are printed below):

# The stop-word corpus must be downloaded once: nltk.download('stopwords')
english_stop_words = nltk.corpus.stopwords.words('english')
print(len(english_stop_words))
print(english_stop_words[:20])
179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

Let’s remove these stop words from our clean reviews.

def remove_stop_words(text):
    for stopword in english_stop_words:
        stopword = ' ' + stopword + ' '
        text = text.replace(stopword, ' ')
    return text

data['review'] = data['review'].apply(remove_stop_words)
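
The `str.replace` approach above relies on every word being surrounded by single spaces, and it can leave a stop word behind when the same word occurs twice in a row. A token-based filter avoids both issues. The sketch below is one possible alternative, not the code used to produce the results in this article; the tiny stop-word set and the sample review are made up for illustration.

```python
# Hypothetical, hand-picked stop words; the article uses the NLTK English list instead
STOP_WORDS = {"the", "this", "that", "is", "a"}

def remove_stop_words_replace(text, stop_words):
    # Same strategy as remove_stop_words above: replace ' word ' with a single space
    for stopword in stop_words:
        text = text.replace(' ' + stopword + ' ', ' ')
    return text

def remove_stop_words_tokens(text, stop_words):
    # Split on whitespace and keep only the tokens that are not stop words
    return ' '.join(token for token in text.split() if token not in stop_words)

review = " the the movie is great "
by_replace = remove_stop_words_replace(review, STOP_WORDS)
by_tokens = remove_stop_words_tokens(review, STOP_WORDS)
print(repr(by_replace))  # ' the movie great ' (one 'the' survives)
print(repr(by_tokens))   # 'movie great'
```

Using a set also makes each membership check fast, which helps when the same filter runs over 50K reviews.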

Stemming

In the English language, words can take multiple forms depending on where and how we use them. Stemming is the process of reducing the different forms of a word to a common root form so that the machine treats them as the same word.

For example, {‘keep’, ‘keeping’, ‘keeps’} are all reduced to the single stem ‘keep’. Stemming is rule-based and not perfect: with the Porter stemmer used below, ‘keeper’ stays ‘keeper’.

NLTK comes with a pre-defined stemming utility. Let’s use it on our dataset:

# Create the stemmer once and reuse it for every review
stemmer = nltk.stem.PorterStemmer()

def text_stemming(text):
    stemmed = ' '.join([stemmer.stem(token) for token in text.split()])
    return stemmed

data['review'] = data['review'].apply(text_stemming)

With this, our basic preprocessing (cleaning, stop-word removal, and stemming) is complete, and the processed data is ready to be passed to machine learning algorithms. You can inspect the processed reviews with data.head().

3. Bag of Words features (BOW)

Our preprocessed dataset is now ready. One last step is to convert it to numerical form, as machine learning algorithms operate on numbers. In this article, we will apply the bag-of-words technique to do that.

Bag of Words is a natural language processing (NLP) technique that represents a text document in numerical form by considering the occurrence of words in that document. It considers only two things: (1) a vocabulary of words, and (2) the presence (or frequency) of each vocabulary word in a given document, ignoring the order of the words (or grammar).
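
The two ingredients, a vocabulary and per-document word counts, can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how CountVectorizer is implemented; the two toy documents and the helper name `bag_of_words` are made up.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie"]  # toy documents

# 1. Vocabulary: every distinct word, mapped to a column index (sorted for a stable order)
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def bag_of_words(doc, vocabulary):
    # 2. Count how often each vocabulary word occurs; word order is ignored
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)  # {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
print(vectors)     # [[0, 2, 1, 1], [1, 0, 1, 0]]

# Shuffling the words of a document gives exactly the same vector
shuffled = bag_of_words("plot good movie good", vocabulary)
```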

Before applying bag-of-words, let’s first divide our dataset into training and test sets. The first 40K reviews are used for training, while the remaining 10K reviews are kept as the test dataset.

imdb_train = data[:40000]
imdb_test = data[40000:]

We will use CountVectorizer from the sklearn package to get the bag-of-words representation of our training and testing dataset.

Note: We will only consider training dataset to define the vocabulary and use the same vocabulary to represent the test dataset (as test data is supposed to be hidden).

Thus, we will fit our vectorizer on the training data and use it to transform the test data:

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,1))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
(40000, 150374) (10000, 150374)
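
To make the fit-then-transform idea concrete, here is a small pure-Python sketch. It only mimics the behavior described above (vocabulary from training data, unseen test words ignored) and is not the library’s implementation; the documents and the `transform` helper are made up for illustration.

```python
from collections import Counter

train_docs = ["great film", "boring film"]
test_docs = ["great soundtrack"]  # 'soundtrack' never appears in the training data

# "fit": the vocabulary comes from the training documents only
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def transform(doc):
    # "transform": count only the words that exist in the training vocabulary
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

train_vectors = [transform(doc) for doc in train_docs]
test_vectors = [transform(doc) for doc in test_docs]
print(vocabulary)     # ['boring', 'film', 'great']
print(train_vectors)  # [[0, 1, 1], [1, 1, 0]]
print(test_vectors)   # [[0, 0, 1]]
```

Because the unseen word has no column, training and test vectors always have the same width, which is why both shapes printed above end in the same number.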

(40000, 150374) means that there are 40,000 training reviews and 150,374 unique tokens in our vocabulary (derived from the training dataset), with each token represented by its own column.

For each review in our dataset, the frequency of each token (term frequency) is stored in a vocabulary vector of size 150,374. That’s why we have 40K such vectors in our training set and, similarly, 10K vectors of the same size in our test dataset.

Note: binary=False argument means that we fill the vocabulary vector with term-frequency. If binary=True, the vocabulary vector is filled by the presence of words (1 if the word is present and 0 otherwise).
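
The difference between the two settings is easy to see on a single toy document. The sketch below imitates both encodings in plain Python; the vocabulary and the document are made up for illustration.

```python
from collections import Counter

vocabulary = ["bad", "good", "movie"]  # toy vocabulary
doc = "good good movie"

counts = Counter(doc.split())
# Term frequency, analogous to binary=False
term_frequency = [counts[word] for word in vocabulary]
# Presence only, analogous to binary=True
presence = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(term_frequency)  # [0, 2, 1]
print(presence)        # [0, 1, 1]
```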

Let’s also convert our output labels into numerical form. Positive sentiment is represented by 1, and negative sentiment by 0.

train_labels = [1 if sentiment=='positive' else 0 for sentiment in imdb_train['sentiment']]
test_labels = [1 if sentiment=='positive' else 0 for sentiment in imdb_test['sentiment']]
print (len(train_labels), len(test_labels))
40000 10000

4. Logistic Regression for sentiment analysis

Now that we have converted our dataset into numerical format, we are ready to train classification models. We will start with Logistic Regression classifier and apply it on three different kinds of feature sets:

  1. UniGram bag-of-words features
  2. (UniGram + BiGram) bag-of-words features
  3. (UniGram + BiGram + TriGram) bag-of-words features

UniGrams: All unique single words in a document

BiGrams: All sequences of two consecutive words in a document

TriGrams: All sequences of three consecutive words in a document
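
As a quick illustration of these definitions, the sketch below extracts n-grams by sliding a window over a token list. The helper name `ngrams` and the sample sentence are made up; this is only meant to show what ends up in the vocabulary, not how the vectorizer works internally.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not a good movie".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['not', 'a', 'good', 'movie']
print(bigrams)   # ['not a', 'a good', 'good movie']
print(trigrams)  # ['not a good', 'a good movie']
```

A feature set described as UniGram + BiGram simply contains both lists, so word pairs such as ‘good movie’ get their own column next to the single words.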

UniGram bag-of-words features

When the Bag of Words algorithm considers only single unique words in the vocabulary, the feature set is said to be UniGram. Let’s train a Logistic Regression classifier on UniGram features:

clf = sklearn.linear_model.LogisticRegression()
clf.fit(tf_features_train, train_labels)
print (clf)

Default state of the classifier-

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Here is how we can get predictions on our test set and calculate the accuracy and confusion matrix:

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

The following result shows that our model predicts the test set with 88% accuracy:

              precision    recall  f1-score   support

    Negative       0.88      0.88      0.88      4993
    Positive       0.88      0.88      0.88      5007

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

[[4398  595]
 [ 581 4426]]
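
The numbers in the report can be recomputed by hand from the confusion matrix, which helps when reading it. In scikit-learn’s confusion matrix, rows are the true labels and columns are the predicted labels, so with labels=[0, 1] the layout is [[TN, FP], [FN, TP]]. A minimal sketch using the values printed above (the variable names are our own):

```python
# Confusion matrix from the UniGram Logistic Regression run above
tn, fp = 4398, 595   # reviews that are truly Negative
fn, tp = 581, 4426   # reviews that are truly Positive

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of all predicted Positive, how many are correct
recall_pos = tp / (tp + fn)      # of all truly Positive, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
```

The report rounds to two decimals, so the unrounded accuracy of 0.8824 appears there as 0.88.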

Unigrams + Bigrams

Let’s repeat the same exercise with UniGram + BiGram features. This time our Bag-of-Words vocabulary also contains pairs of consecutive words in addition to single words. We can calculate these features by simply changing the ngram_range parameter to (1,2).

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,2))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)

clf = sklearn.linear_model.LogisticRegression()
clf.fit(tf_features_train, train_labels)

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

The feature set grows from 150,374 to 2,494,028 columns because we now include BiGrams as well. With this extra information the model performs a little better: accuracy on the test set is now 90%.

(40000, 2494028) (10000, 2494028)

              precision    recall  f1-score   support

    Negative       0.90      0.89      0.90      4993
    Positive       0.89      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

[[4457  536]
 [ 496 4511]]

Unigrams + Bigrams + Trigrams

We repeat the same exercise after adding TriGram features to our feature set. This time the vocabulary also includes sequences of three consecutive words.

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,3))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)

clf = sklearn.linear_model.LogisticRegression()
clf.fit(tf_features_train, train_labels)

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

This time we don’t see any significant increase in the accuracy-

(40000, 6802553) (10000, 6802553)

              precision    recall  f1-score   support

    Negative       0.90      0.89      0.90      4993
    Positive       0.89      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

[[4452  541]
 [ 500 4507]]

5. Linear Support Vector Machine (LSVM) for sentiment analysis

We are going to repeat the same exercise with a Linear Support Vector Machine (LSVM) classifier in order to check which algorithm gives the best results.

UniGrams

Here is the first iteration, with the UniGram feature set.

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,1))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)

clf = sklearn.svm.LinearSVC()
clf.fit(tf_features_train, train_labels)

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Our model achieves 86% accuracy on the test dataset, which is slightly lower than what Logistic Regression achieved:

(40000, 150374) (10000, 150374)
              precision    recall  f1-score   support

    Negative       0.86      0.86      0.86      4993
    Positive       0.86      0.86      0.86      5007

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000

[[4308  685]
 [ 685 4322]]

UniGrams + BiGrams

Let’s check whether adding BiGram features gives any significant improvement over the previous version:

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,2))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)

clf = sklearn.svm.LinearSVC()
clf.fit(tf_features_train, train_labels)

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

And yes, this time the model achieves an accuracy of 90% on the test set. Our LSVM is now very close to the Logistic Regression results.

(40000, 2494028) (10000, 2494028)
              precision    recall  f1-score   support

    Negative       0.90      0.90      0.90      4993
    Positive       0.90      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

[[4469  524]
 [ 509 4498]]

UniGrams + BiGrams + TriGrams

Finally, let’s feed in Tri-Grams also and check the impact on our LSVM classifier-

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,3))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)

clf = sklearn.svm.LinearSVC()
clf.fit(tf_features_train, train_labels)

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Again, as with Logistic Regression, the model does not improve after adding TriGram features. Accuracy stays at 90%:

(40000, 6802553) (10000, 6802553)
              precision    recall  f1-score   support

    Negative       0.90      0.89      0.89      4993
    Positive       0.89      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

[[4459  534]
 [ 513 4494]]

6. Naive Bayes for sentiment analysis

So far, Logistic Regression and LSVM give almost the same performance on our test set, reaching an accuracy of 90% with UniGram + BiGram features. We will run the same iterations for the Multinomial Naive Bayes (MNB) algorithm:

UniGram

Again, let’s start with unigram feature-set only and train Multinomial Naive Bayes classifier-

from sklearn.naive_bayes import MultinomialNB

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,1))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)

clf = MultinomialNB()
clf.fit(tf_features_train, train_labels)

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

The results are less encouraging this time: at 85%, the MNB classifier gives the lowest UniGram accuracy on our test data. Let’s add more information and check whether it improves:

(40000, 150374) (10000, 150374)
              precision    recall  f1-score   support

    Negative       0.84      0.87      0.86      4993
    Positive       0.87      0.83      0.85      5007

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000

[[4359  634]
 [ 837 4170]]

Unigram+BiGrams

Let’s train it again with added BiGram features-

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,2))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)

clf = MultinomialNB()
clf.fit(tf_features_train, train_labels)

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Accuracy has improved by 3 percentage points over the previous iteration, but it is still 2 points below what the other two approaches achieved. The gap is small, though:

(40000, 2494028) (10000, 2494028)
              precision    recall  f1-score   support

    Negative       0.87      0.89      0.88      4993
    Positive       0.89      0.87      0.88      5007

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

[[4452  541]
 [ 650 4357]]

UniGrams + BiGrams + TriGrams

Finally the last iteration with added Tri-Gram features. Let’s see if we get anything better this time-

vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,3))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)

clf = MultinomialNB()
clf.fit(tf_features_train, train_labels)

predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

And no, this model is not any better than the last iteration. Accuracy does not improve after adding TriGrams to the feature set; it stays at 88%:

(40000, 6802553) (10000, 6802553)
              precision    recall  f1-score   support

    Negative       0.88      0.89      0.89      4993
    Positive       0.89      0.88      0.88      5007

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

[[4456  537]
 [ 614 4393]]

7. Summary

This article gives an overview of basic natural language processing (NLP) techniques using the IMDB movie reviews dataset as an example for the task of Sentiment Analysis.

We started by applying common data preprocessing techniques and experimented with three machine learning classification algorithms on bag-of-words features.

The following table shows a comparison of the results achieved on our test dataset (last 10K reviews).

Classifier                 UniGram    UniGram + BiGram    UniGram + BiGram + TriGram
Logistic Regression        88%        90%                 90%
Linear SVM                 86%        90%                 90%
Multinomial Naive Bayes    85%        88%                 88%

Results: Sentiment Analysis with Python: Bag of Words

Github repo: https://github.com/kartikgill/SentimentAnalysis

We can see that Logistic Regression and LSVM perform equally well and achieve an accuracy of 90%, while the MNB classifier gives a slightly lower accuracy of 88%. A Logistic Regression or LSVM model with UniGram + BiGram bag-of-words features can be considered the best model in this case study.
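
Since the reports round everything to two decimals, it can help to recompute the accuracies from the confusion matrices printed in the earlier sections. The sketch below does only that arithmetic; the dictionary layout and names are our own, and the matrices are copied from the outputs above in [[TN, FP], [FN, TP]] form.

```python
# Confusion matrices copied from the outputs in this article
results = {
    ("Logistic Regression", "UniGram"): [[4398, 595], [581, 4426]],
    ("Logistic Regression", "Uni+Bi"): [[4457, 536], [496, 4511]],
    ("Logistic Regression", "Uni+Bi+Tri"): [[4452, 541], [500, 4507]],
    ("Linear SVM", "UniGram"): [[4308, 685], [685, 4322]],
    ("Linear SVM", "Uni+Bi"): [[4469, 524], [509, 4498]],
    ("Linear SVM", "Uni+Bi+Tri"): [[4459, 534], [513, 4494]],
    ("Multinomial Naive Bayes", "UniGram"): [[4359, 634], [837, 4170]],
    ("Multinomial Naive Bayes", "Uni+Bi"): [[4452, 541], [650, 4357]],
    ("Multinomial Naive Bayes", "Uni+Bi+Tri"): [[4456, 537], [614, 4393]],
}

accuracies = {}
for key, ((tn, fp), (fn, tp)) in results.items():
    # Accuracy = correct predictions / all predictions
    accuracies[key] = (tn + tp) / (tn + fp + fn + tp)

for (model, features), accuracy in accuracies.items():
    print(f"{model:<26}{features:<12}{accuracy:.4f}")
```

At this precision, the two best runs are almost tied (0.8968 for Logistic Regression and 0.8967 for Linear SVM, both with UniGram + BiGram features), which is consistent with the rounded table.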

This article only covers Bag-of-Words features. In the next article, we will experiment with TF-IDF features (a richer form of textual features).

Thanks for reading! Hope this article was helpful. Kindly provide your valuable feedback by commenting below. See you in the next article.



References (sentiment analysis)

  1. Dataset Citation: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
  2. Downloaded from: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
