How to deal with Imbalanced data in classification?

August 24, 2020

If you have some experience solving classification problems, you have almost certainly encountered imbalanced data. An imbalanced dataset is one in which the examples are unevenly distributed across the classes.

Consider a binary classification problem with two classes, 1 and 0, where more than 90% of the training examples belong to one of the classes. If you train a classification model on this data, the model will be biased towards the majority class, simply because machine learning models learn from examples and most of the examples in your dataset belong to a single class.

This situation is very common in classification and is known as the class imbalance (or imbalanced data) problem.

There are quite a few ways to handle imbalanced data in classification problems. In this article, we will get into the details of the following six techniques that are commonly used to handle imbalanced data in machine learning classification.

  1. Random under-sampling
  2. Random over-sampling
  3. Synthetic over-sampling: SMOTE
  4. Choose the algorithm wisely
  5. Play with the loss function
  6. Solve an anomaly detection problem

1. Random under-sampling

Random under-sampling is a simple technique for handling class imbalance (or imbalanced data). It is generally used when you have a large amount of training data. The technique works by randomly eliminating samples from the majority class until the classes are balanced in the remaining dataset.

Related Read:

Sampling Techniques in Statistics for Machine Learning

[Figure: Under-sampling for imbalanced data | Image Source]

Though this technique is simple and reduces model complexity, runtime, and storage requirements by shrinking the training data, it comes with some known disadvantages. Randomly eliminating samples from the majority class may discard useful information that the model should have learned from. Secondly, the reduced training set may no longer be a representative sample of the population, so a model trained on it may not generalize well to the test dataset (unseen data).
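
As an illustration, here is a minimal sketch of random under-sampling using scikit-learn and the imbalanced-learn library (the toy dataset, class ratio, and parameter values are placeholder assumptions, not from the original article):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 90% of examples in class 0, 10% in class 1.
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.9, 0.1], random_state=42
)
print("Before:", Counter(y))

# Randomly drop majority-class samples until both classes are the same size.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("After: ", Counter(y_res))
```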


2. Random over-sampling

Random over-sampling is very similar to random under-sampling. This time, instead of reducing the samples of the majority class, we focus on increasing the examples of the minority class. This technique is more suitable when the training data is small and we cannot afford to discard samples through under-sampling.

In this technique, we increase the number of minority-class instances by randomly replicating the samples that are already present. Because no samples are discarded, there is no information loss, and in practice it often outperforms under-sampling.

[Figure: Over-sampling for imbalanced data | Image Source]

The only catch is that replicating minority-class samples increases the chances of overfitting. One important thing to remember is that the train/test split should be done before applying over-sampling; otherwise, copies of the same sample can end up in both sets and leak information into the test set.
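
Here is a minimal sketch of the same idea with imbalanced-learn's RandomOverSampler, applied only to the training split so replicated samples cannot leak into the test set (the dataset and parameter values are again placeholder assumptions):

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(
    n_samples=5_000, n_features=20, weights=[0.93, 0.07], random_state=0
)

# Split first, then over-sample only the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

ros = RandomOverSampler(random_state=0)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

print("Train before:", Counter(y_train))
print("Train after: ", Counter(y_train_res))  # minority class replicated up to majority size
```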


3. Synthetic over-sampling: SMOTE

The Synthetic Minority Oversampling Technique, better known as SMOTE, is a clever way to over-sample the minority class while avoiding the overfitting that random over-sampling suffers from. In SMOTE, new synthetic data points are generated by interpolating between existing minority-class samples and their nearest minority-class neighbours. These synthetic points are then added to the original training dataset as additional examples of the minority class.

SMOTE overcomes the overfitting problem of random over-sampling because no examples are exactly replicated. Secondly, since no data points are removed from the dataset, there is no loss of useful information.

[Figure: SMOTE for imbalanced data | Image Source]
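
Below is a minimal sketch using the SMOTE implementation from imbalanced-learn; the `k_neighbors` value and the toy dataset are placeholder assumptions:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(
    n_samples=5_000, n_features=20, weights=[0.95, 0.05], random_state=1
)
print("Before:", Counter(y))

# Each synthetic point is interpolated between a minority-class sample
# and one of its k nearest minority-class neighbours.
smote = SMOTE(k_neighbors=5, random_state=1)
X_res, y_res = smote.fit_resample(X, y)
print("After: ", Counter(y_res))
```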

Despite its advantages, SMOTE has a few limitations as well. Because vanilla SMOTE does not consider nearby majority-class samples while creating synthetic minority-class examples, it can increase class overlap and introduce additional noise into the training dataset. Secondly, SMOTE is not very effective on high-dimensional datasets.

To overcome these problems, a modified version of SMOTE called MSMOTE (Modified Synthetic Minority Oversampling Technique) was introduced. While generating synthetic examples, MSMOTE selects the minority-class samples more carefully so as to avoid introducing additional noise into the training dataset.


4. Choose the algorithm wisely

Although it is always good practice to try multiple classification algorithms and see which one performs best, choosing the algorithm wisely becomes especially important when dealing with imbalanced data.

[Figure: Random forest for imbalanced data | Image Source]

Generally, decision tree-based algorithms perform well on imbalanced datasets. Similarly, bagging- and boosting-based ensembles such as random forests and gradient boosting are good choices for imbalanced classification problems.
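
For example, a plain random forest can serve as a strong baseline on imbalanced data; the dataset and hyperparameters below are placeholder assumptions, and the per-class report matters far more than plain accuracy here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.9, 0.1], random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

# A tree-based ensemble as a baseline for the imbalanced problem.
clf = RandomForestClassifier(n_estimators=200, random_state=7)
clf.fit(X_train, y_train)

# Inspect per-class precision and recall rather than overall accuracy.
print(classification_report(y_test, clf.predict(X_test), digits=3))
```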

You might be interested in reading my article on:

Bagging, Boosting and Stacking in Machine Learning


5. Play with the loss function

With minor modifications to existing algorithms, you can make them work for your imbalanced data.

What you need to do is add an extra cost every time your model misclassifies the minority class. This forces the model to pay more attention to the minority class and learn to make fewer mistakes on it.

Every model applies this trick in its own way, and choosing the penalty values can be tricky; you may need to try several weighting schemes and check what works best for your case.
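
For instance, scikit-learn exposes this idea through the `class_weight` parameter, which scales each class's contribution to the loss; the specific weights below are placeholder assumptions you would tune for your own data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.9, 0.1], random_state=3
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)

# Misclassifying the minority class (label 1) costs 10x more than the majority class.
# class_weight="balanced" would instead derive the weights from class frequencies.
clf = LogisticRegression(class_weight={0: 1.0, 1: 10.0}, max_iter=1000)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test), digits=3))
```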


6. Solve an anomaly detection problem

Another interesting way to deal with highly imbalanced data is to treat the classification as an anomaly detection problem.

Anomaly detection refers to problems concerned with predicting rare events, such as system failures or fraud. Anomaly detection methods treat the minority class (the rare events) as outliers and apply various approaches to detect them.

We can do a similar thing for our classification problem by treating the minority class as outliers. Reframing the problem this way might give you new ideas for solving it efficiently.
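
As a rough sketch, an unsupervised outlier detector such as scikit-learn's IsolationForest can be fitted on the data and its outlier predictions mapped back to the minority class; the `contamination` value below is a placeholder assumption roughly matching the minority fraction:

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report

X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.95, 0.05], random_state=5
)

# Treat the minority class as "anomalies": contamination ~= minority fraction.
iso = IsolationForest(contamination=0.05, random_state=5)
iso.fit(X)

# IsolationForest predicts +1 for inliers and -1 for outliers;
# map -1 (outlier) to the minority label 1 and +1 (inlier) to 0.
y_pred = np.where(iso.predict(X) == -1, 1, 0)

print(classification_report(y, y_pred, digits=3))
```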


Conclusion

In this article, we discussed six different ways to handle imbalanced data in machine learning classification problems. Each of these methods has its own strengths and weaknesses. Depending on your dataset and the problem you are solving, you can choose the appropriate way to handle your imbalanced data.

Or, you can try them all and choose the one giving the best results.

You might be interested in reading my article on:

Mining Interpretable Rules from Classification Models

Thanks for reading, hope you have enjoyed the article. Kindly provide your feedback by commenting below.