This is one of the major issues I have seen in the projects of undergraduate students at my university. So, I decided to clear up the confusion by writing a blog about using accuracy as a performance measure for machine learning models when the dataset has a class imbalance problem. This blog covers the problem in the following sub-sections:
- What is class imbalance?
- How to spot a class imbalance problem?
- Is accuracy a good measure for datasets with class imbalance? Why?
- What are the solutions or alternate measures?
What is Class Imbalance?
In classification problems, we use a labeled dataset in which the class of each set of attributes (features) is provided, so that the model can learn patterns from the training data and use them for future predictions. Class imbalance occurs when the instances of one class (the majority class) in the training dataset heavily outnumber those of the other classes (the rare classes). For example, in a binary classification problem one class might make up 95% of the examples. Some common examples of class imbalance are:
- A dataset of legitimate and fraudulent transactions from any real-world scenario would have a class imbalance problem, with legitimate transactions making up more than 99% of the data. At the same time, correctly predicting a rare fraudulent transaction is worth more than correctly predicting a legitimate one.
- On an assembly line manufacturing any product, defective products are rare compared with normal ones. Again, correctly detecting a defective product matters more than correctly detecting a normal one.
- Most medical datasets used to detect a disease in patients also suffer from the class imbalance problem. For example, if we take 1000 random chest x-rays to detect a disease, what are the chances of a positive outcome? Clearly, a positive result is far less likely than a normal x-ray.
How to spot class imbalance?
Spotting class imbalance in a dataset is not difficult. There are multiple ways to do it, and one of the most common is to display the value counts of the class column in the dataset.
The value counts show how many instances of each class the class column contains, and a simple percentage calculation follows from them. There is no fixed rule; however, if any single class accounts for more than 90% of the instances, the dataset almost certainly has a class imbalance problem.
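As a rough sketch of that check: with pandas, the usual call is `value_counts()` on the class column (passing `normalize=True` returns fractions instead of counts). The version below uses only the standard library so it runs anywhere. The `labels` list is made-up data standing in for the class column, and the 90% cutoff is the rule of thumb mentioned here, not a fixed rule.

```python
from collections import Counter

# Hypothetical class column: 990 legitimate rows and 10 fraud rows.
labels = ["legitimate"] * 990 + ["fraud"] * 10

def class_shares(labels):
    """Return {class: fraction of rows} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

shares = class_shares(labels)
for cls, share in shares.items():
    print(f"{cls}: {share:.1%}")

# Rule of thumb: one class above 90% signals class imbalance.
is_imbalanced = max(shares.values()) > 0.90
print("Imbalanced:", is_imbalanced)
```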
Secondly, the importance of detecting the rare class is also a factor when considering class imbalance. If the rare class has no significance in the final prediction system, its occurrences can be treated as outliers and we need not worry about them as much.
Class Imbalance VS Accuracy
Accuracy is widely used as a performance measure for comparing multiple machine learning algorithms or various instances of the same algorithm. However, if the dataset has a class imbalance, it is a kind of crime to use only accuracy. Let’s understand it with an example:
Let’s say you have a dataset of symptom values from which you predict whether a person has cancer or not. The dataset has 1000 instances, of which 950 are normal cases and 50 are cancer cases. Suppose your algorithm classifies every instance as normal and fails to detect even a single case of cancer. By the accuracy calculation, you have a 95% accurate model, which looks excellent. In reality, your prediction system is about as bad as it can be: it does nothing. Furthermore, if the rare class is more important than the majority class, this becomes even more dangerous. For example, if your prediction system detects cancer in a healthy patient, we can run further tests to confirm it. But if it classifies a cancer patient as a non-cancer patient, the hospital may overlook that patient, which can lead to his or her early death.
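The arithmetic in this example is easy to verify. The snippet below is a small sketch in plain Python: `y_true` and `y_pred` are made-up lists that mirror the 950/50 split, and the "model" simply predicts normal (0) for everyone.

```python
# 1 = cancer (rare class), 0 = normal (majority class).
y_true = [0] * 950 + [1] * 50

# A "model" that labels every patient as normal.
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

acc = accuracy(y_true, y_pred)
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {acc:.0%}")
print(f"Cancer cases detected: {detected} of {sum(y_true)}")
```

The accuracy comes out at 95% even though not one of the 50 cancer cases is found.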
The above example clearly shows how little accuracy tells us when the dataset has a class imbalance problem. Additionally, the parameters used to train a model should be tuned with the rare cases in mind rather than the majority cases.
Solutions to Class imbalance
It is evident that accuracy alone cannot be relied on for datasets with class imbalance, so we need alternative measures or metrics for comparing the performance of machine learning algorithms. Let us assume that the rare class is the positive class in binary classification and that the confusion matrix is laid out as shown in the figure below.
For the cancer example discussed above, the confusion matrix would look like the figure below.
Because our classification algorithm classified all instances as non-cancer, the confusion matrix has TP = 0 and FP = 0 (no instance is predicted as positive), while FN = 50 and TN = 950. If we now focus on the other parameters discussed below, we get a better idea of the model's performance.
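The four cells of the confusion matrix can be counted directly from the true and predicted labels. The sketch below does this in plain Python for the cancer example. The helper name `confusion_counts` and the label lists are illustrative, not from any library (scikit-learn offers `confusion_matrix` in `sklearn.metrics` for the same purpose).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN, treating `positive` as the rare class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1  # rare case correctly detected
        elif t == positive:
            fn += 1  # rare case missed
        elif p == positive:
            fp += 1  # normal case flagged by mistake
        else:
            tn += 1  # normal case correctly passed
    return tp, fn, fp, tn

# Cancer example: 950 normal (0), 50 cancer (1), everything predicted normal.
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```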
True Positive Rate (Sensitivity)
This parameter tells us how many positive instances we correctly predicted out of all positive instances. The higher the TPR, the better our classification model. When we treat the rare class as the positive class, this parameter is a better indication of how well rare instances are classified. The formula is
TPR = TP /(TP + FN)
TPR = 0/(0+50) = 0
As TPR is zero in our example, our classification model is useless at classifying the rare class and should be replaced with some other model. Similarly, we can calculate the True Negative Rate (aka Specificity), TNR = TN / (TN + FP), for the negative class. It tells us how well the model classifies the negative class correctly.
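Both rates can be checked in a few lines. This is a sketch that plugs in the counts from the cancer example (TP = 0, FN = 50, FP = 0, TN = 950) and uses the standard definition of specificity, TN / (TN + FP).

```python
# Counts for the all-normal model in the cancer example.
tp, fn, fp, tn = 0, 50, 0, 950

tpr = tp / (tp + fn)  # sensitivity: share of cancer cases detected
tnr = tn / (tn + fp)  # specificity: share of normal cases passed

print(f"TPR = {tpr:.2f}")
print(f"TNR = {tnr:.2f}")
```

The model gets every normal case right while catching none of the rare ones, which is exactly what the 95% accuracy figure hides.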
Recall and Precision
Recall and precision are widely used parameters when the dataset has a class imbalance problem. Both focus only on the positive class, so we should make sure that our rare class is the positive class in the confusion matrix. Precision tells us how many of the instances predicted as positive are actually positive; hence it depends on TP and FP only. Recall, on the other hand, is the same as TPR: the fraction of actual positive instances that are correctly predicted. The formulas for both parameters are as follows
Precision, p = TP / (TP + FP)
Recall, r = TPR = TP / (TP + FN)
The key motivation when choosing a classification algorithm is that it should achieve both high Precision and high Recall, given that the positive class is our class of concern.
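To make the two formulas concrete, here is a hedged sketch in plain Python. The first call uses the counts of the all-normal model from the cancer example, where precision is undefined because nothing is predicted positive (0/0). The second call uses made-up counts for a hypothetical second model, purely for comparison.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); None marks an undefined 0/0 ratio."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# All-normal model from the cancer example: TP=0, FP=0, FN=50.
print(precision_recall(tp=0, fp=0, fn=50))

# Hypothetical second model (made-up counts): TP=40, FP=60, FN=10, TN=890.
p, r = precision_recall(tp=40, fp=60, fn=10)
print(f"precision={p:.2f}, recall={r:.2f}")
```

With these made-up counts the second model's accuracy is (40 + 890) / 1000 = 93%, lower than the 95% of the all-normal model, yet it finds 40 of the 50 cancer cases. Precision and recall surface that difference; accuracy does not.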
Receiver Operating Characteristic (ROC) Curve
If we focus on the rare class, the two parameters we care about most are TPR and FPR (the False Positive Rate, FP / (FP + TN)). The ROC curve gives us a graphical representation of the trade-off between TPR and FPR. The figure below shows a sample ROC curve taken from the Supervised Learning using SCIKIT-Learn course provided by datacamp.com. It has FPR on the x-axis and TPR on the y-axis. The closer the ROC curve bends toward the top-left corner, the better the classification model, because our model should have a high TPR and a low FPR.
The ROC curve for any given dataset can be plotted using the following code. All variables in the code have their usual meaning. The output is shown below the code.
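```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Probability of the positive class; assumes `model` is a fitted classifier
# with the public predict_proba method (e.g. LogisticRegression).
y_pred_prob = model.predict_proba(X_test)[:, 1]

# FPR and TPR at every decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot([0, 1], [0, 1], 'k--')  # diagonal = random guessing
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.savefig('roc curve.png', dpi=300)
# plt.show()
```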
We can also compute the area under the ROC curve (AUC) to compare multiple classification models.
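One way to read this area: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as half. scikit-learn provides `roc_auc_score` in `sklearn.metrics` for this. The dependency-free sketch below computes the same quantity by pairwise comparison, using made-up labels and scores, and the helper name `auc_score` is illustrative only.

```python
def auc_score(y_true, y_score):
    """Area under the ROC curve via pairwise positive/negative comparison."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # A positive "wins" when it outscores a negative; a tie counts as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for four instances.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to the diagonal of random guessing and 1.0 to a perfect ranking, so when comparing models the one with the larger area is generally preferred. The pairwise loop is quadratic, so it is meant for illustration on small data.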
Machine learning and data mining also offer methods for dealing with the class imbalance problem itself. However, they are a topic for other blogs. Some of the most common methods are
- Cost Sensitive Learning
- Sampling of datasets
If I ever write about these two techniques, I will link to those articles here.
Class imbalance is a common problem, and we may encounter many datasets with an imbalance between the rare class and the majority class. The main issue arises when we have to compare the performance of multiple classification algorithms/models and find the best one. Here, accuracy, if used alone, may lead to bad judgment. The main point of this blog is that we should be aware of whether our dataset has a class imbalance problem and, if it does, use alternative performance metrics like precision and recall.
Furthermore, we should make a habit of calculating TPR, TNR, precision, and recall for all of our classification algorithms. Last but not least, the ROC curve can also be used for comparing multiple models/algorithms.