Sentiment Analysis Overview
Sentiment analysis (also known as opinion mining or emotion AI) is a common task in NLP (Natural Language Processing). It involves identifying or quantifying the sentiment expressed in a given sentence, paragraph, or document.
Sentiment analysis techniques are widely applied to customer feedback data (e.g., reviews, survey responses, social media posts). They have proved valuable in deriving business insights such as:
- Better product Recommendations
- Behavioral Analysis of the market
- Understanding public opinions through social media
- Effectiveness of customer-support staff
In this article, we will train a traditional machine learning sentiment classification model from scratch, using the bag-of-words technique to create features. With these features, we will experiment with the following three machine learning algorithms and compare the results:
- Logistic Regression
- Linear SVM
- Multinomial Naive Bayes
The rest of the article is divided into the following sub-sections:
- Dataset Overview
- Data Preprocessing
- Bag of Words features
- Logistic Regression
- Linear Support Vector Machine (LSVM)
- Naive Bayes
- Summary
1. Dataset Overview
IMDB movie review dataset for sentiment analysis
The IMDB Movie Review dataset contains 50K movie reviews for natural language processing and text analytics. Each review is labeled with its true sentiment value (positive or negative). The dataset is well balanced, with 25K examples of each sentiment class.
Let’s quickly peek into the dataset.
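import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
import nltk
# Import the scikit-learn submodules used below explicitly;
# a bare `import sklearn` does not reliably expose them as attributes
import sklearn
import sklearn.feature_extraction.text
import sklearn.linear_model
import sklearn.metrics
import sklearn.svm
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
%matplotlib inline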
You can download this dataset from Kaggle (the URL is provided in the references below) and read it into your Python notebook with the following snippet:
data = pd.read_csv("data/IMDB Dataset.csv")
print (data.shape)
data.head(10)
Each of the 50K reviews is tagged (labeled) with its true sentiment value. Let’s look at the distribution of sentiments in this dataset:
data.sentiment.value_counts()
positive    25000
negative    25000
Name: sentiment, dtype: int64
The dataset is perfectly balanced, as each sentiment value is associated with an equal number of examples (reviews in this case).
2. Data Preprocessing
Unstructured datasets are often noisy. So the very first step is to preprocess the dataset and make it consumable for machine learning algorithms. We will apply the following preprocessing techniques, all of which are common in NLP:
- Data Cleaning
- Stop Words Removal
- Stemming
Data Cleaning
Data preprocessing steps depend upon the nature of the problem you are solving, and so does the kind of data cleaning you need.
For sentiment analysis, it is mostly the words themselves that carry the signal, so it makes sense to remove special characters, symbols, and numbers from the text, as they usually contribute little to the sentiment of a sentence or paragraph.
Let’s remove the HTML tags and special characters from the data, as they do not add value to the sentiment of a review. Additionally, let’s convert all the reviews to lowercase so that ‘Happy’ and ‘happy’ are treated as the same word by the algorithm.
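def remove_html(text):
    # Strip HTML tags (e.g. <br />) and keep only the visible text
    bs = BeautifulSoup(text, "html.parser")
    return ' ' + bs.get_text() + ' '
def keep_only_letters(text):
    # Drop everything except letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text
def convert_to_lowercase(text):
    return text.lower()
def clean_reviews(text):
    text = remove_html(text)
    text = keep_only_letters(text)
    text = convert_to_lowercase(text)
    return text
# Apply the cleaning steps to every review
data['review'] = data['review'].apply(clean_reviews)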
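To see what the cleaning steps do to a single review, here is a minimal, self-contained sketch. It is not the code used for the results in this article: `strip_tags` is a hypothetical regex stand-in for the BeautifulSoup step (so the snippet runs with the standard library only), and the sample review is made up.

```python
import re

def strip_tags(text):
    # Assumption: a simple regex stand-in for BeautifulSoup's get_text().
    # It handles simple tags such as <br /> but is not a general HTML parser.
    return re.sub(r'<[^>]+>', ' ', text)

def clean(text):
    text = strip_tags(text)                   # drop HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)   # keep only letters and whitespace
    return text.lower()                       # normalize case

# Made-up review, for illustration only
sample = "I LOVED it!<br /><br />10/10, would watch again."
print(clean(sample).split())
```

Note that the rating "10/10" disappears completely, which is the price of dropping all digits and symbols.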
Stop Words Removal
Stop words are words (often very common words in a particular language) that add little to the meaning of a sentence or paragraph. Whether a given word should be treated as a stop word again depends upon the problem you are solving.
For sentiment analysis, common words like ‘you’, ‘this’, ‘that’, and ‘the’ do not help in determining the sentiment of a given sentence. These words occur very frequently in English sentences, so it makes sense to remove them beforehand to reduce the complexity of our model.
The Natural Language Toolkit (NLTK) comes with a predefined list of common stop words for the English language. You can also define your own custom set of stop words.
We will remove the following 179 stop words from our dataset (the exact count can differ between NLTK versions):
nltk.download('stopwords')  # one-time download of the stop-word corpus
english_stop_words = nltk.corpus.stopwords.words('english')
print(len(english_stop_words))
print(english_stop_words[:20])
179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']
Let’s remove these stop words from our clean reviews.
def remove_stop_words(text):
    # Pad each stop word with spaces so that only whole words are replaced
    for stopword in english_stop_words:
        stopword = ' ' + stopword + ' '
        text = text.replace(stopword, ' ')
    return text
data['review'] = data['review'].apply(remove_stop_words)
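One subtlety of the replace-based approach is that it can miss a stop word that directly follows another copy of itself, because the shared space has already been consumed by the first match. The sketch below illustrates this with a hypothetical four-word stop list (`STOP_WORDS` is made up for the example; the article uses the NLTK list) and compares it with a token-based alternative. The results reported in this article come from the replace-based version above.

```python
# Hypothetical mini stop-word list for illustration; the article uses NLTK's list
STOP_WORDS = {"the", "this", "is", "a"}

def remove_stop_words_replace(text):
    # Mirrors the replace-based approach: pad each stop word with spaces
    for stopword in sorted(STOP_WORDS):
        text = text.replace(' ' + stopword + ' ', ' ')
    return text

def remove_stop_words_tokens(text):
    # Token-based alternative: split on whitespace and filter each token
    return ' '.join(tok for tok in text.split() if tok not in STOP_WORDS)

sample = " the the movie is a a delight "
print(repr(remove_stop_words_replace(sample)))  # ' the movie a delight '
print(repr(remove_stop_words_tokens(sample)))   # 'movie delight'
```

For real reviews, repeated stop words are rare, so the difference is likely small; the token-based version is simply easier to reason about.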
Stemming
In the English language, words can take multiple different forms depending upon where and how we use them. Stemming is the process of reducing the different forms of a word to a common root form so that the machine treats them as the same word.
For example, {‘keep’, ‘keeping’, ‘keeps’} will all be reduced to the single word ‘keep’.
NLTK comes with a predefined stemming utility. Let’s use that on our dataset:
stemmer = nltk.stem.PorterStemmer()  # create once and reuse for every review
def text_stemming(text):
    stemmed = ' '.join([stemmer.stem(token) for token in text.split()])
    return stemmed
data['review'] = data['review'].apply(text_stemming)
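To get a feel for what the stemmer does, it can help to run it on a handful of words before applying it to the whole dataset. This is a small sketch using NLTK's `PorterStemmer`; the word list is made up for illustration. Stems are not always dictionary words, which is fine here because they only serve as vocabulary keys.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up sample words; print each word next to its stem
for word in ["keep", "keeping", "keeps", "movies", "loved"]:
    print(word, "->", stemmer.stem(word))
```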
With this, our basic preprocessing of the data (cleaning, stop-word removal, and stemming) is complete, and we are ready to pass the processed data to machine learning algorithms.
3. Bag of Words features (BOW)
Our preprocessed dataset is now ready. One last step is to convert it to numerical form, as machine learning algorithms operate on numbers rather than raw text. In this article, we will apply the bag-of-words technique to convert the dataset into numerical form.
Bag of Words is a natural language processing (NLP) technique that represents a text document in numerical form by considering the occurrence of words in the document. It considers only two things: (1) a vocabulary of words, and (2) the presence (or frequency) of each word in a given document, ignoring the order of the words (and grammar).
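The idea can be sketched by hand in a few lines of plain Python. This is only an illustration on two made-up documents, not how the features in this article are computed (that is done with `CountVectorizer` below); the names `docs`, `vocabulary`, and `to_vector` are introduced just for this example.

```python
from collections import Counter

# Two made-up documents, already cleaned and lowercased
docs = ["good movie good plot", "bad movie"]

# 1. The vocabulary: every unique word across all documents, in sorted order
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_vector(doc):
    # 2. One count per vocabulary word; word order in the document is ignored
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(vectors)     # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each document becomes a vector with one entry per vocabulary word, which is why the feature matrix grows as wide as the vocabulary.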
Before applying bag-of-words, let’s first divide our dataset into training and test sets. The first 40K reviews are used for training, while the remaining 10K reviews are kept as the test dataset.
imdb_train = data[:40000]
imdb_test = data[40000:]
We will use CountVectorizer from the sklearn package to get the bag-of-words representation of our training and test datasets.
Note: We will only use the training dataset to define the vocabulary, and then use the same vocabulary to represent the test dataset (as test data is supposed to be hidden).
Thus we will fit our vectorizer on the training data and use it to transform the test data:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,1))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
(40000, 150374) (10000, 150374)
(40000, 150374) means that there are 150374 unique tokens (stemmed words) in our vocabulary, derived from the training dataset, and each token is represented by its own column in the feature matrix.
For each review in our dataset, the frequency of words (term frequency) is represented by a vocabulary vector of size 150374. That’s why we have 40K such vectors in our training set and, similarly, 10K vectors of the same size in our test dataset.
Note: The binary=False argument means that we fill the vocabulary vector with term frequencies. If binary=True, the vocabulary vector is filled by the presence of words (1 if the word is present and 0 otherwise).
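The effect of the `binary` flag is easiest to see on a tiny example. The following sketch runs `CountVectorizer` on two made-up documents with both settings; `CountVectorizer` orders its vocabulary alphabetically, so the columns here are bad, good, movie.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents; "good" appears twice in the first one
docs = ["good good movie", "bad movie"]

count_vectorizer = CountVectorizer(binary=False)
term_frequency = count_vectorizer.fit_transform(docs).toarray().tolist()

binary_vectorizer = CountVectorizer(binary=True)
presence = binary_vectorizer.fit_transform(docs).toarray().tolist()

print(sorted(count_vectorizer.vocabulary_))  # ['bad', 'good', 'movie']
print(term_frequency)                        # [[0, 2, 1], [1, 0, 1]]
print(presence)                              # [[0, 1, 1], [1, 0, 1]]
```

Only the entry for the repeated word differs between the two representations.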
Let’s also convert our output labels into numerical form. Positive sentiment is represented by 1, while negative sentiment is represented by 0.
train_labels = [1 if sentiment=='positive' else 0 for sentiment in imdb_train['sentiment']]
test_labels = [1 if sentiment=='positive' else 0 for sentiment in imdb_test['sentiment']]
print (len(train_labels), len(test_labels))
40000 10000
4. Logistic Regression for sentiment analysis
Now that we have converted our dataset into numerical format, we are ready to train classification models. We will start with a Logistic Regression classifier and apply it to three different kinds of feature sets:
- UniGram bag-of-words features
- (UniGram + BiGram) bag-of-words features
- (UniGram + BiGram + TriGram) bag-of-words features
Unigrams: all unique words in a document
Bigrams: all sequences of two consecutive words in a document
Trigrams: all sequences of three consecutive words in a document
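A short sketch makes the three definitions concrete. The helper `ngrams` below is hypothetical and written only for illustration; it uses a plain whitespace split, whereas `CountVectorizer` applies its own tokenizer before building n-grams.

```python
def ngrams(text, n):
    # Slide a window of n consecutive tokens over the text
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "not very good movie"
print(ngrams(sentence, 1))  # ['not', 'very', 'good', 'movie']
print(ngrams(sentence, 2))  # ['not very', 'very good', 'good movie']
print(ngrams(sentence, 3))  # ['not very good', 'very good movie']
```

The bigram 'not very' and the trigram 'not very good' capture context that the single word 'good' loses, which is the motivation for trying larger n-grams.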
UniGram bag-of-words features
When the Bag of Words algorithm considers only single unique words in the vocabulary, the feature set is said to be UniGram. Let’s train a Logistic Regression classifier on unigram features:
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tf_features_train, train_labels)
print (clf)
Default state of the classifier (the exact defaults depend on the installed scikit-learn version):
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
Here is how we can get predictions on our test set and calculate the accuracy and confusion matrix:
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
The following result shows that our model is able to predict with 88% accuracy:
              precision    recall  f1-score   support
    Negative       0.88      0.88      0.88      4993
    Positive       0.88      0.88      0.88      5007
    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000
[[4398  595]
 [ 581 4426]]
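The numbers in the report can be recomputed from the confusion matrix by hand. With labels=[0, 1], scikit-learn puts the true classes on the rows and the predicted classes on the columns, so the matrix reads [[TN, FP], [FN, TP]] when Positive is the class of interest. A minimal sketch of the arithmetic, using the values printed above:

```python
# Confusion matrix from the unigram Logistic Regression run above
tn, fp = 4398, 595   # true Negative reviews: predicted Negative / Positive
fn, tp = 581, 4426   # true Positive reviews: predicted Negative / Positive

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of correct predictions
precision_pos = tp / (tp + fp)               # of predicted Positive, how many are right
recall_pos = tp / (tp + fn)                  # of true Positive, how many are found

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision_pos:.4f}")
print(f"recall    = {recall_pos:.4f}")
```

The row sums (4993 and 5007) match the support column of the classification report.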
Unigrams + Bigrams
Let’s repeat the same exercise with UniGram + BiGram features. This time our Bag-of-Words algorithm also adds pairs of consecutive words to the vocabulary, along with the unique words. We can calculate these features by simply changing the ngram_range parameter to (1,2).
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,2))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tf_features_train, train_labels)
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Our feature set grows because we are now also considering bigrams. This time our model performs a little better, as we have passed it more information. Accuracy on the test set is now 90%.
(40000, 2494028) (10000, 2494028)
              precision    recall  f1-score   support
    Negative       0.90      0.89      0.90      4993
    Positive       0.89      0.90      0.90      5007
    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
[[4457  536]
 [ 496 4511]]
Unigrams + Bigrams + Trigrams
We repeat the same exercise after adding trigram features to our feature set. This time we also add sequences of three consecutive words to our vocabulary.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,3))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tf_features_train, train_labels)
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
This time we don’t see any significant increase in accuracy:
(40000, 6802553) (10000, 6802553)
              precision    recall  f1-score   support
    Negative       0.90      0.89      0.90      4993
    Positive       0.89      0.90      0.90      5007
    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
[[4452  541]
 [ 500 4507]]
5. Linear Support Vector Machine (LSVM) for sentiment analysis
We are going to repeat the same exercise with a Linear Support Vector Machine (LSVM) classifier in order to check which algorithm gives the best results.
UniGrams
Here is the first iteration, with the unigram feature set.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,1))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
clf = sklearn.svm.LinearSVC()
clf.fit(tf_features_train, train_labels)
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Our model achieves 86% accuracy on the test dataset, which is slightly lower than the 88% that Logistic Regression achieved:
(40000, 150374) (10000, 150374)
              precision    recall  f1-score   support
    Negative       0.86      0.86      0.86      4993
    Positive       0.86      0.86      0.86      5007
    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000
[[4308  685]
 [ 685 4322]]
UniGrams + BiGrams
Let’s check if adding bigram features gives any significant improvement over the previous version:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,2))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
clf = sklearn.svm.LinearSVC()
clf.fit(tf_features_train, train_labels)
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
And yes, this time the model achieves an accuracy of 90% on the test set. Our LSVM is now very close to the Logistic Regression results.
(40000, 2494028) (10000, 2494028)
              precision    recall  f1-score   support
    Negative       0.90      0.90      0.90      4993
    Positive       0.90      0.90      0.90      5007
    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
[[4469  524]
 [ 509 4498]]
UniGrams + BiGrams + TriGrams
Finally, let’s feed in trigrams as well and check the impact on our LSVM classifier:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,3))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
clf = sklearn.svm.LinearSVC()
clf.fit(tf_features_train, train_labels)
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Again, similar to the Logistic Regression results, the model does not improve after the addition of trigram features. Accuracy stays at 90%.
(40000, 6802553) (10000, 6802553)
              precision    recall  f1-score   support
    Negative       0.90      0.89      0.89      4993
    Positive       0.89      0.90      0.90      5007
    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
[[4459  534]
 [ 513 4494]]
6. Naive Bayes for sentiment analysis
We have seen so far that Logistic Regression and LSVM give almost the same performance on our test set, both achieving an accuracy of 90% with UniGram + BiGram feature sets. We will run similar iterations for the Multinomial Naive Bayes algorithm as well.
UniGram
Again, let’s start with the unigram feature set only and train a Multinomial Naive Bayes classifier:
from sklearn.naive_bayes import MultinomialNB
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,1))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
clf = MultinomialNB()
clf.fit(tf_features_train, train_labels)
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
The results are not very encouraging this time. We get the lowest accuracy so far with the MNB classifier on our test data. Let’s try adding more information to check if it improves on the existing 85% accuracy.
(40000, 150374) (10000, 150374)
              precision    recall  f1-score   support
    Negative       0.84      0.87      0.86      4993
    Positive       0.87      0.83      0.85      5007
    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000
[[4359  634]
 [ 837 4170]]
Unigram+BiGrams
Let’s train it again with added bigram features:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,2))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
clf = MultinomialNB()
clf.fit(tf_features_train, train_labels)
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Accuracy has improved by 3 percentage points over the previous iteration, but it is still 2 points below the results of the other two approaches. The gap is small, though:
(40000, 2494028) (10000, 2494028)
              precision    recall  f1-score   support
    Negative       0.87      0.89      0.88      4993
    Positive       0.89      0.87      0.88      5007
    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000
[[4452  541]
 [ 650 4357]]
UniGrams + BiGrams + TriGrams
Finally, the last iteration, with added trigram features. Let’s see if we get anything better this time:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary=False,ngram_range=(1,3))
tf_features_train = vectorizer.fit_transform(imdb_train['review'])
tf_features_test = vectorizer.transform(imdb_test['review'])
print (tf_features_train.shape, tf_features_test.shape)
clf = MultinomialNB()
clf.fit(tf_features_train, train_labels)
predictions = clf.predict(tf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
And no, this model is not any better than the last iteration. We don’t see any accuracy improvement after adding trigrams to the feature set. The result stays at 88%.
(40000, 6802553) (10000, 6802553)
              precision    recall  f1-score   support
    Negative       0.88      0.89      0.89      4993
    Positive       0.89      0.88      0.88      5007
    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000
[[4456  537]
 [ 614 4393]]
Summary (Sentiment Analysis)
This article gives an overview of basic natural language processing (NLP) techniques using the IMDB movie reviews dataset as an example for the task of Sentiment Analysis.
We started by applying common data preprocessing techniques and experimented with three machine learning classification algorithms on bag-of-words features.
The following table shows a comparison of the results achieved on our test dataset (last 10K reviews).
Classifier | UniGram features | (UniGram+Bi-Gram) features | (UniGram+Bi-Gram+Tri-Gram) features |
Logistic Regression | 88% | 90% | 90% |
Linear SVM | 86% | 90% | 90% |
MultiNomial Naive Bayes | 85% | 88% | 88% |
Github repo: https://github.com/kartikgill/SentimentAnalysis
We can see that Logistic Regression and LSVM perform equally well, both achieving an accuracy of 90%, while the MNB classifier gives a slightly lower accuracy of 88%. A Logistic Regression or LSVM model with UniGram + BiGram bag-of-words features can be considered the best model from this case study, since adding trigrams enlarges the feature set without improving accuracy.
This article only covers Bag-of-Words features. In the next article, we will experiment with TF-IDF features (a richer form of textual features).
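The nine experiments above all follow the same pattern, so they could also be written as one loop over classifiers and n-gram ranges. The sketch below is one possible way to organize that, not the code used to produce the table: it uses scikit-learn's `make_pipeline` and `accuracy_score`, and runs on a tiny made-up corpus so that it is self-contained. To reproduce the table, the toy lists would be replaced by the preprocessed reviews and labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus so the sketch runs on its own (assumption: in the real
# experiment these would be imdb_train['review'], train_labels, and so on)
train_texts = ["great film loved it", "wonderful acting great story",
               "terrible film hated it", "awful acting boring story"]
train_labels = [1, 1, 0, 0]
test_texts = ["great story", "boring film"]
test_labels = [1, 0]

classifiers = {
    "Logistic Regression": LogisticRegression,
    "Linear SVM": LinearSVC,
    "Multinomial Naive Bayes": MultinomialNB,
}
ngram_ranges = {
    "UniGram": (1, 1),
    "UniGram+BiGram": (1, 2),
    "UniGram+BiGram+TriGram": (1, 3),
}

results = {}
for clf_name, clf_class in classifiers.items():
    for feat_name, ngram_range in ngram_ranges.items():
        # The pipeline fits the vectorizer on training text only,
        # then reuses the same vocabulary for the test text
        model = make_pipeline(
            CountVectorizer(binary=False, ngram_range=ngram_range),
            clf_class(),
        )
        model.fit(train_texts, train_labels)
        predictions = model.predict(test_texts)
        results[(clf_name, feat_name)] = accuracy_score(test_labels, predictions)

for (clf_name, feat_name), acc in results.items():
    print(f"{clf_name:25s} {feat_name:25s} {acc:.2f}")
```

Wrapping the vectorizer and classifier in a pipeline also makes it harder to accidentally fit the vocabulary on test data.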
Thanks for reading! Hope this article was helpful. Kindly provide your valuable feedback by commenting below. See you in the next article.
References (sentiment analysis)
- Dataset Citation: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
- Downloaded from: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews