Sentiment Analysis with Python: TFIDF features

October 13, 2020

In my previous article, ‘Sentiment Analysis with Python: Bag of Words’, we compared the results of three traditional machine learning sentiment classification algorithms using bag-of-words features (built from scratch). This is my second article on sentiment analysis, continuing that series; this time we experiment with TFIDF features for the task of sentiment analysis on English text data.

As I already covered some common data preprocessing techniques in my last article, we will jump straight into TFIDF feature creation in this one.

Just like the last post, we will experiment with the same three machine learning classification algorithms and compare results. We will also compare results with the bag-of-words model results on the same test data.

The rest of the article follows this outline-

  1. What are the TFIDF features?
  2. TFIDF feature extraction
  3. Logistic Regression
  4. Linear Support Vector Machine (LSVM)
  5. Multinomial Naive Bayes (MNB)
  6. Result Comparison (TFIDF features + BOW features)
  7. Summary
[Image: Sentiment Analysis with Python: TFIDF features | Image by Apaha Spi]

1. What are the TFIDF features?

TFIDF (or tf-idf) stands for ‘term frequency-inverse document frequency’. Unlike the bag-of-words (BOW) feature extraction technique, TFIDF does not consider term frequencies alone; it also takes the ‘inverse document frequency’ into account.

Term Frequency

Term frequency refers to the number of times a given word occurs in a given document. If a particular word occurs very frequently in a particular document, that document is considered relevant for that word (query), and this count is known as the term frequency.

The problem with using the term-frequency value alone is that some uninformative words (like ‘the’, ‘and’, ‘or’, etc.) occur very frequently in English text, so they receive high weight in terms of term frequency even though they say little about the meaning of a sentence or paragraph.

Inverse Document Frequency (IDF)

Inverse document frequency (IDF), on the other hand, looks at the presence of a query word across all documents. If a word occurs in only a few documents, it gets a higher IDF value; if it occurs in most of the documents (and is therefore not discriminative), it gets a lower IDF value.

In this way, infrequent but important words are highlighted, and frequent, non-useful words are penalized by the inverse-document-frequency value.

IDF thus solves the issue with frequent irrelevant words like ‘the’, but it is still not ideal on its own, as it does not rank documents by the frequency of the query word. In other words, it does not care how often a word occurs within a single document.

TFIDF features

TF-IDF is the product of the ‘term frequency’ and ‘inverse document frequency’ statistics. It therefore addresses both of the issues described above and gives a score that can rank documents using both signals.

The TFIDF score tells us the importance of a given word in a given document (within a larger collection of documents). In other words, for a given query word you can rank documents by relevance using their tf-idf scores.

The tf-idf score of a term t in a given document d, with respect to a set of documents D, is defined as-

tfidf(t, d, D) = tf(t, d) · idf(t, D)

Source: Wikipedia
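To make the formula concrete, here is a minimal sketch that computes tf-idf by hand for a toy three-document corpus. It uses raw counts for tf and the plain log(N/df) form of idf; libraries such as scikit-learn use smoothed variants, so their exact numbers will differ, but the ranking behaviour is the same.

import math
from collections import Counter

docs = [
    "the movie was great",
    "the movie was terrible",
    "the film was a great great film",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tfidf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]                   # raw count in this document
    df = sum(1 for d in tokenized if term in d)      # number of documents containing the term
    return tf * math.log(N / df)                     # tf * idf

for i, doc in enumerate(tokenized):
    print(i, "great:", round(tfidf("great", doc), 3), "| the:", round(tfidf("the", doc), 3))

Since ‘the’ occurs in all three documents, its idf is log(3/3) = 0 and its tf-idf score vanishes everywhere, while ‘great’ scores highest in the last document, where it occurs twice.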

In many NLP tasks, it makes more sense to use TFIDF features in place of bag-of-words features, as they provide a better view of word relevance within a given document corpus.


2. TFIDF feature extraction

Just like the previous article on sentiment analysis, we will work on the same dataset of 50K IMDB movie reviews.

Quick dataset background: the IMDB movie review dataset is a collection of 50K movie reviews, each tagged with its true sentiment value. 25K of the reviews belong to the ‘positive’ category and the remaining 25K to the ‘negative’ category.

You can download this dataset from Kaggle (URL is provided in the references below). Here is a quick peek into the data-

import pandas as pd

# load the 50K-review CSV downloaded from Kaggle
data = pd.read_csv("data/IMDB Dataset.csv")
print(data.shape)
data.head(10)
[Image: first 10 rows of the IMDB dataset, showing the ‘review’ and ‘sentiment’ columns]
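Before splitting the data, it is worth sanity-checking the 25K/25K class balance mentioned above (the Kaggle CSV stores the label in a ‘sentiment’ column with values ‘positive’ and ‘negative’):

# verify the class balance of the dataset
print(data['sentiment'].value_counts())
# positive    25000
# negative    25000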

Out of these 50K reviews, we will take the first 40K as the training dataset and leave the remaining 10K out as the test dataset. We will use this test dataset to compare the different classifiers.

Here is how we can extract TFIDF features for our dataset using TfidfVectorizer from sklearn.

import sklearn.feature_extraction.text

# first 40K reviews for training, remaining 10K held out for testing
imdb_train = data[:40000]
imdb_test = data[40000:]

# fit the vocabulary and idf weights on the training split only,
# then reuse the fitted vectorizer to transform the test split
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,1))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)
(40000, 150374) (10000, 150374)
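If you are curious about what the vectorizer learned, you can map feature indices back to terms and inspect the idf weights. A small sketch (get_feature_names_out() requires scikit-learn >= 1.0; older releases call it get_feature_names()):

import numpy as np

# map feature indices back to vocabulary terms
terms = np.array(vectorizer.get_feature_names_out())

# lowest idf = most common terms (near-stopwords); highest idf = rarest terms
order = np.argsort(vectorizer.idf_)
print("most common terms:", terms[order[:5]])
print("rarest terms:", terms[order[-5:]])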

3. Logistic Regression

Time to feed these tf-idf features into the classification algorithms. We will run three experiments with each classification algorithm, where the features will be-

  1. Unigram tf-idf features
  2. UniGrams + BiGram tf-idf features
  3. UniGrams + BiGrams + TriGram tf-idf features

Unigrams: all unique words in a document

BiGrams: all pairs of consecutive words in a document

TriGrams: all triples of consecutive words in a document
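To see exactly which tokens each setting produces, you can call the vectorizer’s analyzer directly on a sample sentence (a quick sketch using scikit-learn’s build_analyzer()):

from sklearn.feature_extraction.text import TfidfVectorizer

sample = "this movie was surprisingly good"
for n in (1, 2, 3):
    analyzer = TfidfVectorizer(ngram_range=(1, n)).build_analyzer()
    print(f"ngram_range=(1,{n}):", analyzer(sample))

# (1,1) yields single words; (1,2) adds pairs like 'this movie';
# (1,3) additionally adds triples like 'this movie was'.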

Unigram

Keeping the argument value ngram_range=(1,1), we get the tf-idf matrix with unigram features only. Let’s fit the logistic regression model on these features and check how well it performs-

#Create features
import sklearn.feature_extraction.text
import sklearn.linear_model
import sklearn.metrics

vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True, ngram_range=(1,1))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print(tfidf_features_train.shape, tfidf_features_test.shape)

#labels: the 'sentiment' column holds 'positive'/'negative' strings; encode them as 0/1
train_labels = imdb_train['sentiment'].map({'negative': 0, 'positive': 1})
test_labels = imdb_test['sentiment'].map({'negative': 0, 'positive': 1})

#train model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Our classifier achieves a solid accuracy of 89% on the test dataset with just unigram tf-idf features.

(40000, 150374) (10000, 150374)

              precision    recall  f1-score   support

    Negative       0.90      0.88      0.89      4993
    Positive       0.88      0.90      0.89      5007

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

[[4399  594]
 [ 516 4491]]
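As a side note, a nice property of linear models is interpretability: pairing clf.coef_ with the vectorizer vocabulary shows which terms push a review toward each class. A sketch, assuming the fitted unigram vectorizer and classifier from the block above (with label 1 = positive):

import numpy as np

terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])   # one weight per tf-idf feature

print("strongest negative indicators:", terms[order[:10]])
print("strongest positive indicators:", terms[order[-10:]])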

UniGrams + BiGrams

Let’s see if we get any accuracy gain by also adding bi-gram features. We can include them by setting ngram_range=(1,2).

#Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,2))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print (tfidf_features_train.shape, tfidf_features_test.shape)

#train model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Umm, we don’t see any improvement in the results. Test-set accuracy is stuck at 89% again.

(40000, 2494028) (10000, 2494028)

              precision    recall  f1-score   support

    Negative       0.90      0.88      0.89      4993
    Positive       0.88      0.90      0.89      5007

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

[[4395  598]
 [ 505 4502]]

UniGrams + BiGrams + TriGrams

Finally, let’s check the model’s performance when tri-gram features are added to our existing feature set.

#Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,3))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print (tfidf_features_train.shape, tfidf_features_test.shape)

#train model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Surprisingly, accuracy drops a little when tri-gram features are added. (This suggests that the tf-idf representation of tri-gram features adds no value for sentiment identification here.)

(40000, 6802553) (10000, 6802553)

              precision    recall  f1-score   support

    Negative       0.89      0.88      0.88      4993
    Positive       0.88      0.89      0.88      5007

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

[[4372  621]
 [ 549 4458]]

4. Linear Support Vector Machine (LSVM)

Logistic Regression gives its best accuracy of 89% on the test dataset with just unigram features. Let’s see how a linear support vector machine (LSVM) classifier performs on the same tf-idf features.

Again, we will run the same three experiments-

Unigrams

Let’s check how LSVM performs when it is shown only unigram-level tf-idf features-

import sklearn.svm

#Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,1))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print (tfidf_features_train.shape, tfidf_features_test.shape)

#train model
clf = sklearn.svm.LinearSVC()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Just like Logistic Regression, LSVM also achieves an accuracy of 89% on the test dataset.

(40000, 150374) (10000, 150374)
              precision    recall  f1-score   support

    Negative       0.90      0.89      0.89      4993
    Positive       0.89      0.90      0.89      5007

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

[[4425  568]
 [ 512 4495]]

UniGrams + BiGrams

Time to add bigram features as well and see if they improve the LSVM model a little bit-

#Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,2))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print (tfidf_features_train.shape, tfidf_features_test.shape)

#train model
clf = sklearn.svm.LinearSVC()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Yes! A slight improvement. With bigram tf-idf features, our LSVM model achieves 90% accuracy on the test dataset.

(40000, 2494028) (10000, 2494028)
              precision    recall  f1-score   support

    Negative       0.91      0.89      0.90      4993
    Positive       0.90      0.91      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

[[4461  532]
 [ 457 4550]]

UniGrams + BiGrams + TriGrams

What if we also add tri-gram features? Will accuracy improve again? Let’s check-

#Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,3))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print (tfidf_features_train.shape, tfidf_features_test.shape)

#train model
clf = sklearn.svm.LinearSVC()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

And no, accuracy is almost the same as in the previous experiment, so tri-gram features are not adding any value in this case.

(40000, 6802553) (10000, 6802553)
              precision    recall  f1-score   support

    Negative       0.90      0.89      0.90      4993
    Positive       0.89      0.91      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

[[4444  549]
 [ 468 4539]]

5. Multinomial Naive Bayes (MNB)

In the last two sections, we saw that LSVM performs slightly better than Logistic Regression, beating LR’s best result (89%) by 1% and achieving an accuracy of 90% on the test dataset. Let’s see how the Multinomial Naive Bayes model performs with tf-idf features.

Unigrams

Fitting the MNB classifier on unigram tf-idf features from our training dataset and evaluating it on the test set-

import sklearn.naive_bayes

#Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,1))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print (tfidf_features_train.shape, tfidf_features_test.shape)

#train model
clf = sklearn.naive_bayes.MultinomialNB()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

This is the lowest accuracy we have seen so far, but the results are still not bad; 86% test accuracy is still quite good.

(40000, 150374) (10000, 150374)
              precision    recall  f1-score   support

    Negative       0.85      0.88      0.86      4993
    Positive       0.87      0.84      0.86      5007

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000

[[4388  605]
 [ 779 4228]]

UniGrams + BiGrams

Can it beat the other classifiers when bigrams are added to the feature list? Let’s check-

#Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,2))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print (tfidf_features_train.shape, tfidf_features_test.shape)

#train model
clf = sklearn.naive_bayes.MultinomialNB()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

Not bad! After adding bigrams, it comes close to the LR and LSVM results, achieving an accuracy of 89% on the test set.

(40000, 2494028) (10000, 2494028)
              precision    recall  f1-score   support

    Negative       0.88      0.90      0.89      4993
    Positive       0.90      0.87      0.88      5007

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

[[4483  510]
 [ 631 4376]]

UniGrams + BiGrams + TriGrams

Again, one last iteration with tri-gram features added. Let’s see if they improve performance further-

#Create features
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,3))
tfidf_features_train = vectorizer.fit_transform(imdb_train['review'])
tfidf_features_test = vectorizer.transform(imdb_test['review'])
print (tfidf_features_train.shape, tfidf_features_test.shape)

#train model
clf = sklearn.naive_bayes.MultinomialNB()
clf.fit(tfidf_features_train, train_labels)

#evaluation
predictions = clf.predict(tfidf_features_test)
print(sklearn.metrics.classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(sklearn.metrics.confusion_matrix(test_labels, predictions, labels=[0, 1]))

No improvement! Just like with the other two models, adding tri-gram features does not help, and the accuracy of the MNB model is stuck at 89% on our test dataset.

              precision    recall  f1-score   support

    Negative       0.88      0.89      0.89      4993
    Positive       0.89      0.88      0.89      5007

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

[[4464  529]
 [ 604 4403]]

6. Result Comparison (TFIDF features and BOW features)

Our experiments are done; it’s time to compare the models’ performance. The following table shows the consolidated results of all three models with tf-idf features-

Result comparison (TFIDF features)

It looks like all three models do roughly equally well here, and it’s hard to pick a single best model. If I still have to choose, I will go with LSVM.

Classifier            | unigram features | unigram+bigram features | unigram+bigram+trigram features
Logistic Regression   | 89%              | 89%                     | 88%
LSVM                  | 89%              | 90%                     | 90%
MNB                   | 86%              | 89%                     | 89%

Result comparison: Sentiment Analysis with Python: TFIDF features
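For completeness, the whole grid of nine experiments above can be reproduced with a short loop instead of nine copy-pasted blocks (a sketch assuming the imdb_train/imdb_test splits and the 0/1 train_labels/test_labels arrays defined earlier):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

classifiers = {"LR": LogisticRegression, "LSVM": LinearSVC, "MNB": MultinomialNB}

for name, Clf in classifiers.items():
    for n in (1, 2, 3):
        # refit the vectorizer for each n-gram range, then train and score
        vec = TfidfVectorizer(use_idf=True, ngram_range=(1, n))
        X_train = vec.fit_transform(imdb_train['review'])
        X_test = vec.transform(imdb_test['review'])
        preds = Clf().fit(X_train, train_labels).predict(X_test)
        print(f"{name} ngram_range=(1,{n}): {accuracy_score(test_labels, preds):.2%}")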

Result Comparison (BOW features)

To put everything on the same page, below is the result table with bag-of-words (BOW) features on the same test dataset. These results are from my previous post on sentiment analysis-

Classifier              | UniGram features | UniGram+BiGram features | UniGram+BiGram+TriGram features
Logistic Regression     | 88%              | 90%                     | 90%
Linear SVM              | 86%              | 90%                     | 90%
Multinomial Naive Bayes | 85%              | 88%                     | 88%

Results: Sentiment Analysis with Python: Bag of Words

Github link: https://github.com/kartikgill/SentimentAnalysis


7. Summary

In this article, we discussed the tf-idf feature extraction method in natural language processing (NLP).

We extracted tf-idf features from the IMDB movie reviews dataset and fed them into three different machine learning classification algorithms to train sentiment classification models, and we compared the classification results across all experiments.

I hope this article gives beginners a starting point for jumping into NLP tasks.

Thanks for reading! Kindly give your valuable feedback by commenting below. See you in the next article.


Read Next >>

  1. Sentiment Analysis with Python: Bag of Words
  2. Sentiment Classification with Deep Learning: RNN, LSTM, and CNN
  3. Boosting your Sequence Generation Performance with ‘Beam Search + Language model’ decoding
  4. Optimizing TensorFlow models with Quantization Techniques
  5. Deep Learning with PyTorch: Introduction
  6. Deep Learning with PyTorch: First Neural Network
  7. 1D-CNN based Fully Convolutional Model for Handwriting Recognition

References

  1. Dataset Citation: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
  2. Downloaded from: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
