In my previous article, 'Sentiment Analysis with Python: Bag of Words', we compared the results of three traditional machine learning sentiment classification algorithms using bag-of-words features built from scratch. This second article continues that series: this time we will experiment with TFIDF features for the task of sentiment analysis on English text data.
As I have already covered some common data preprocessing techniques in my last article, we will start working directly on TFIDF feature creation in this one.
Just like in the last post, we will experiment with the same three machine learning classification algorithms and compare their results, both against each other and against the bag-of-words results on the same test data.
The rest of the article follows this outline-
- What are the TFIDF features?
- TFIDF feature extraction
- Logistic Regression
- Linear Support Vector Machine (LSVM)
- Multinomial Naive Bayes (MNB)
- Result Comparison (TFIDF features + BOW features)
- Summary
1. What are the TFIDF features?
TFIDF (or tf-idf) stands for 'term frequency-inverse document frequency'. Unlike the bag-of-words (BOW) feature extraction technique, TFIDF does not rely on term frequencies alone: it also takes the 'inverse document frequency' of each term into account.
Term Frequency
Term frequency refers to the count of occurrences of a given word in a given document. If a particular word occurs very frequently in a particular document, that document is considered relevant for that word (query), and this count is known as the term frequency.
The problem with using the term-frequency value alone is that some irrelevant words (like 'the', 'and', 'or', etc.) occur very frequently in English text documents. These words receive a high weight in terms of term frequency even though they carry little information about the meaning of a sentence or paragraph.
Inverse Document Frequency (IDF)
Inverse document frequency (IDF), on the other hand, looks at the presence of a query word across all documents. If a word occurs in only a few documents, it gets a higher IDF value; if it occurs in most of the documents (and is therefore not discriminative), it gets a lower IDF value.
In this way, infrequent but important words are highlighted, and frequent, uninformative words are penalized by the inverse-document-frequency value.
IDF thus solves the issue with frequent irrelevant words like 'the', but it is still not ideal on its own, because it does not rank documents by how often the query word occurs within them. In other words, it does not care about the frequency of a word within a single document.
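For intuition, take some toy numbers (assumed purely for illustration): suppose a corpus contains 10,000 documents, 'the' appears in 9,500 of them, and 'masterpiece' appears in only 100. With the usual logarithmic IDF definition,

$$\operatorname{idf}(\text{the}) = \ln\frac{10000}{9500} \approx 0.05, \qquad \operatorname{idf}(\text{masterpiece}) = \ln\frac{10000}{100} \approx 4.61$$

so the rare, informative word is weighted roughly ninety times higher than the stop word.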
TFIDF features
TF-IDF is the product of the 'term frequency' and 'inverse document frequency' statistics. It therefore solves both of the issues described above and yields a score that ranks documents on both criteria.
The TF-IDF score tells us the importance of a given word in a given document (within a larger collection of documents). In other words, for a given query word, you can rank documents by relevance using their tf-idf scores.
The tf-idf score of a term t, in a given document d, with respect to a set of documents D, is defined as-
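$$\operatorname{tf\text{-}idf}(t, d, D) = \operatorname{tf}(t, d) \times \operatorname{idf}(t, D), \qquad \operatorname{idf}(t, D) = \log\frac{N}{\lvert\{\, d \in D : t \in d \,\}\rvert}$$

where tf(t, d) is the number of times t occurs in d and N is the total number of documents in D. (For reference, sklearn's TfidfVectorizer, which we use below, computes a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and L2-normalizes each document vector by default.)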
In many NLP tasks, it makes more sense to use TFIDF features in place of bag-of-words features, as they provide a better view of word relevance within a given document corpus.
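To make this concrete before we move to the real dataset, here is a minimal sketch on a made-up three-review corpus (the sentences are invented for illustration), showing how TfidfVectorizer gives the rare word 'masterpiece' a higher weight than the ubiquitous 'movie':

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: 'movie' appears in every document, 'masterpiece' in only one
corpus = [
    "this movie is a masterpiece",
    "this movie is boring",
    "that movie is average",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Inspect the tf-idf weights of the first review
for word, weight in zip(vectorizer.get_feature_names_out(), tfidf_matrix[0].toarray()[0]):
    if weight > 0:
        print(f"{word}: {weight:.3f}")
# 'masterpiece' gets the highest weight; 'movie' and 'is' the lowest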
2. TFIDF feature extraction
Just like the previous article on sentiment analysis, we will work on the same dataset of 50K IMDB movie reviews.
Quick dataset background: the IMDB movie review dataset is a collection of 50K movie reviews tagged with their true sentiment values. 25K of these reviews belong to the 'positive' category and the remaining 25K to the 'negative' category.
You can download this dataset from Kaggle (URL is provided in the references below). Here is a quick peek into the data-
import pandas as pd

# Load the IMDB reviews dataset
data = pd.read_csv("data/IMDB Dataset.csv")
print(data.shape)
data.head(10)
Out of these 50K reviews, we will take the first 40K as the training dataset, leaving the remaining 10K as the test dataset. We will use this test dataset to compare the different classifiers.
Here is how we can extract TFIDF features for our dataset using TfidfVectorizer from sklearn.
import sklearn.feature_extraction.text

imdb_train = data[:40000]
imdb_test = data[40000:]

# Map the 'sentiment' column ('positive'/'negative') to the integer labels (1/0) used below
train_labels = (imdb_train['sentiment'] == 'positive').astype(int)
test_labels = (imdb_test['sentiment'] == 'positive').astype(int)

vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,1))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)
(40000, 150374) (10000, 150374)
3. Logistic Regression
Time to feed these tf-idf features into the classification algorithms. We will run three experiments with each classification algorithm, where the features will be-
- UniGram tf-idf features
- UniGram + BiGram tf-idf features
- UniGram + BiGram + TriGram tf-idf features
UniGrams: all unique words in a document
BiGrams: all pairs of consecutive words in a document
TriGrams: all sequences of three consecutive words in a document (see the short illustration below)
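As a quick illustration (with a made-up sentence), sklearn's own analyzer shows exactly which tokens each ngram_range setting produces:

from sklearn.feature_extraction.text import TfidfVectorizer

# Build the analyzer that TfidfVectorizer would apply to each document
analyzer = TfidfVectorizer(ngram_range=(1, 3)).build_analyzer()
print(analyzer("the movie was surprisingly good"))
# ['the', 'movie', 'was', 'surprisingly', 'good',
#  'the movie', 'movie was', 'was surprisingly', 'surprisingly good',
#  'the movie was', 'movie was surprisingly', 'was surprisingly good']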
Unigrams
Keeping the argument value ngram_range=(1,1), we get the tf-idf matrix with unigram features only. Let's fit the logistic regression model on these features and check how well it performs-
import sklearn.linear_model
import sklearn.metrics

# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,1))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Our classifier achieves a good accuracy of 89% on our test dataset with just unigram tf-idf features.
(40000, 150374) (10000, 150374)
precision recall f1-score support
Negative 0.90 0.88 0.89 4993
Positive 0.88 0.90 0.89 5007
accuracy 0.89 10000
macro avg 0.89 0.89 0.89 10000
weighted avg 0.89 0.89 0.89 10000
[[4399 594]
[ 516 4491]]
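Since every remaining experiment repeats the same vectorize/train/evaluate steps, they could all be wrapped in one small helper. Here is a convenience sketch (the run_experiment name and structure are mine, not part of sklearn); the rest of the post keeps the explicit step-by-step blocks for readability:

import sklearn.feature_extraction.text
import sklearn.metrics

def run_experiment(classifier, ngram_range, train_df, test_df, train_labels, test_labels):
    """Vectorize reviews with tf-idf, train the given classifier, and print the evaluation."""
    vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=ngram_range)
    features_train = vectorizer.fit_transform(train_df['review'])
    features_test = vectorizer.transform(test_df['review'])
    classifier.fit(features_train, train_labels)
    predictions = classifier.predict(features_test)
    print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
    print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

# Usage, e.g. for the next experiment:
# run_experiment(sklearn.linear_model.LogisticRegression(), (1, 2), imdb_train, imdb_test, train_labels, test_labels)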
UniGrams + BiGrams
Let's see if we gain any accuracy by adding bigram features as well. We can include bigram features by setting ngram_range=(1,2).
# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,2))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Umm, we don't see any improvement in the results; test-set accuracy is again stuck at 89%.
(40000, 2494028) (10000, 2494028)
precision recall f1-score support
Negative 0.90 0.88 0.89 4993
Positive 0.88 0.90 0.89 5007
accuracy 0.89 10000
macro avg 0.89 0.89 0.89 10000
weighted avg 0.89 0.89 0.89 10000
[[4395 598]
[ 505 4502]]
UniGrams + BiGrams + TriGrams
Finally, let's check the model performance after adding tri-gram features to our existing feature set.
# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,3))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Surprisingly, accuracy has dropped a little with the addition of tri-gram features. (This means that the tf-idf representation of tri-gram features is not adding any value for sentiment identification.)
(40000, 6802553) (10000, 6802553)
precision recall f1-score support
Negative 0.89 0.88 0.88 4993
Positive 0.88 0.89 0.88 5007
accuracy 0.88 10000
macro avg 0.88 0.88 0.88 10000
weighted avg 0.88 0.88 0.88 10000
[[4372 621]
[ 549 4458]]
4. Linear Support Vector Machine (LSVM)
Logistic Regression gives its best accuracy of 89% on the test dataset with just unigram features. Let's see how a linear support vector machine (LSVM) classifier performs on the same tf-idf features.
Again, we will run the same three experiments-
Unigrams
Let's check how LSVM performs when the model sees only unigram-level tf-idf features-
import sklearn.svm

# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,1))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.svm.LinearSVC()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Just like Logistic Regression, LSVM also achieves an accuracy of 89% on the test dataset.
(40000, 150374) (10000, 150374)
precision recall f1-score support
Negative 0.90 0.89 0.89 4993
Positive 0.89 0.90 0.89 5007
accuracy 0.89 10000
macro avg 0.89 0.89 0.89 10000
weighted avg 0.89 0.89 0.89 10000
[[4425 568]
[ 512 4495]]
UniGrams + BiGrams
Time to add BiGram features as well and see whether they improve the LSVM model a little bit-
# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,2))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.svm.LinearSVC()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Yes, a slight improvement! With bigram tf-idf features, our LSVM model achieves 90% accuracy on the test dataset.
(40000, 2494028) (10000, 2494028)
precision recall f1-score support
Negative 0.91 0.89 0.90 4993
Positive 0.90 0.91 0.90 5007
accuracy 0.90 10000
macro avg 0.90 0.90 0.90 10000
weighted avg 0.90 0.90 0.90 10000
[[4461 532]
[ 457 4550]]
UniGrams + BiGrams + TriGrams
What if we add tri-gram features as well? Will accuracy improve again? Let's check-
# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,3))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.svm.LinearSVC()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
And no, accuracy is almost the same as in the previous experiment, so tri-gram features are not adding any value in this case either.
(40000, 6802553) (10000, 6802553)
precision recall f1-score support
Negative 0.90 0.89 0.90 4993
Positive 0.89 0.91 0.90 5007
accuracy 0.90 10000
macro avg 0.90 0.90 0.90 10000
weighted avg 0.90 0.90 0.90 10000
[[4444 549]
[ 468 4539]]
5. Multinomial Naive Bayes (MNB)
In the last two experiments, we saw that LSVM performs slightly better than Logistic Regression, beating LR's best result (89%) by 1% and achieving an accuracy of 90% on the test dataset. Let's see how the Multinomial Naive Bayes model performs with tf-idf features.
Unigrams
Fitting the MNB classifier on unigram tf-idf features from our training dataset and evaluating on the test set-
import sklearn.naive_bayes

# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,1))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.naive_bayes.MultinomialNB()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
This is the lowest accuracy we have seen so far, but the results are still not bad; 86% test accuracy is decent.
(40000, 150374) (10000, 150374)
precision recall f1-score support
Negative 0.85 0.88 0.86 4993
Positive 0.87 0.84 0.86 5007
accuracy 0.86 10000
macro avg 0.86 0.86 0.86 10000
weighted avg 0.86 0.86 0.86 10000
[[4388 605]
[ 779 4228]]
UniGrams + BiGrams
Can it beat the other classifiers when bigrams are added to the feature list? Let's check-
# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,2))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.naive_bayes.MultinomialNB()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
Not bad! After adding bigrams, it comes close to the LR and LSVM results and achieves an accuracy of 89% on the test set.
(40000, 2494028) (10000, 2494028)
precision recall f1-score support
Negative 0.88 0.90 0.89 4993
Positive 0.90 0.87 0.88 5007
accuracy 0.89 10000
macro avg 0.89 0.89 0.89 10000
weighted avg 0.89 0.89 0.89 10000
[[4483 510]
[ 631 4376]]
UniGrams + BiGrams + TriGrams
Again, the last iteration, with tri-gram features added. Let's see if it improves the performance further-
# Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,3))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

# Train model
clf = sklearn.naive_bayes.MultinomialNB()
clf.fit(tfidf_features_train, train_labels)

# Evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))
No improvements! Just like the other two models, adding tri-gram features does not help, and the accuracy of the MNB model is stuck at 89% on our test dataset.
precision recall f1-score support
Negative 0.88 0.89 0.89 4993
Positive 0.89 0.88 0.89 5007
accuracy 0.89 10000
macro avg 0.89 0.89 0.89 10000
weighted avg 0.89 0.89 0.89 10000
[[4464 529]
[ 604 4403]]
6. Result Comparison (TFIDF features and BOW features)
Our experiments are done; it's time to compare the performance of the models. The following table shows the consolidated results of all three models on tf-idf features-
Result Comparison (TFIDF features)

| Classifier | UniGram features | UniGram + BiGram features | UniGram + BiGram + TriGram features |
|---|---|---|---|
| Logistic Regression | 89% | 89% | 88% |
| LSVM | 89% | 90% | 90% |
| MNB | 86% | 89% | 89% |

All three models are doing almost equally well here, so it is hard to pick a single best model. If I still had to choose one, I would go with LSVM.
Result Comparison (BOW features)
To put everything on the same page, here is the result table with bag-of-words (BOW) features on the same test dataset. These results are from my previous post on sentiment analysis-
| Classifier | UniGram features | UniGram + BiGram features | UniGram + BiGram + TriGram features |
|---|---|---|---|
| Logistic Regression | 88% | 90% | 90% |
| Linear SVM | 86% | 90% | 90% |
| MultiNomial Naive Bayes | 85% | 88% | 88% |
Github link: https://github.com/kartikgill/SentimentAnalysis
Summary
In this article, we discussed the tf-idf feature extraction method in natural language processing (NLP).
We extracted tf-idf features on IMDB movie reviews dataset and fed them into three different machine learning classification algorithms to train a sentiment classification model. We further compared the sentiment classification results for all experiments.
I hope this article gives beginners a starting point to jump into NLP tasks.
Thanks for reading! Kindly give your valuable feedback by commenting below. See you in the next article.
Read Next >>
- Sentiment Analysis with Python: Bag of Words
- Sentiment Classification with Deep Learning: RNN, LSTM, and CNN
- Boosting your Sequence Generation Performance with ‘Beam Search + Language model’ decoding
- Optimizing TensorFlow models with Quantization Techniques
- Deep Learning with PyTorch: Introduction
- Deep Learning with PyTorch: First Neural Network
- 1D-CNN based Fully Convolutional Model for Handwriting Recognition
References
- Dataset Citation: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
- Downloaded from: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews