Why confusion matrix evaluation indexes such as Accuracy cannot be used in binary classification

Indexes such as Accuracy are often used to evaluate the performance of machine learning models, but this is not the case in the financial sector.

Before explaining why they are not used in the financial sector, let me explain the evaluation indexes, including the confusion matrix. If you are already familiar with these terms, please skip to “Problem” and “Conclusion”.

Confusion matrix

A confusion matrix summarizes predictions and their actual results in matrix form.

In binary classification with machine learning, the output is a predicted probability and the predicted classification derived from it.

For example, in credit scoring, becoming delinquent and not becoming delinquent are represented as 0 and 1. In matrix form it looks as follows:

| | Prediction (will not become delinquent) – Positive | Prediction (will become delinquent) – Negative |
|---|---|---|
| Result (is not delinquent) – Positive | TP (True Positive) | FN (False Negative) |
| Result (is delinquent) – Negative | FP (False Positive) | TN (True Negative) |

When the prediction was “will not become delinquent” and the result was also “is not delinquent”, the case falls into TP; if the result was instead “is delinquent”, it falls into FP.

Whenever the prediction was correct, the case falls into either TP or TN.
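
As a minimal sketch of how the four cells are counted, the following Python snippet tallies TP, FN, FP, and TN from two lists of labels. The function name `confusion_counts`, the label strings, and the sample data are assumptions made for illustration. Positive means “not delinquent”, as in the matrix above.

```python
def confusion_counts(results, predictions, positive="not delinquent"):
    """Count TP, FN, FP, TN from actual results and predicted labels."""
    tp = fn = fp = tn = 0
    for result, prediction in zip(results, predictions):
        if result == positive and prediction == positive:
            tp += 1  # predicted Positive, result Positive
        elif result == positive:
            fn += 1  # predicted Negative, result Positive
        elif prediction == positive:
            fp += 1  # predicted Positive, result Negative
        else:
            tn += 1  # predicted Negative, result Negative
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical sample of five customers
results = ["not delinquent", "not delinquent", "delinquent",
           "delinquent", "not delinquent"]
predictions = ["not delinquent", "delinquent", "not delinquent",
               "delinquent", "not delinquent"]

counts = confusion_counts(results, predictions)
print(counts)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 1}
```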

Evaluation Indexes

Several evaluation indexes are derived from the confusion matrix, and each has its own characteristics. For all of them, a higher value indicates better performance.

1. Accuracy

Accuracy is a frequently used index. It is the share of all predictions that were correct.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

2. Precision

Precision is the share of cases predicted as Positive (not delinquent) whose result was also Positive.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

3. Recall

Recall is the share of cases whose result was Positive that were correctly predicted as Positive.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

4. Specificity

Specificity is the share of cases whose result was Negative that were correctly predicted as Negative.

$$\mathrm{Specificity} = \frac{TN}{FP + TN}$$
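
The four indexes can be written as small Python functions that take the confusion matrix counts directly. This is only a sketch: the function names, the helper `_ratio`, the choice to return NaN when a denominator is zero, and the sample counts are all assumptions for illustration.

```python
def _ratio(numerator, denominator):
    """Divide safely; return NaN when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else float("nan")


def accuracy(tp, fn, fp, tn):
    return _ratio(tp + tn, tp + fp + fn + tn)


def precision(tp, fp):
    return _ratio(tp, tp + fp)


def recall(tp, fn):
    return _ratio(tp, tp + fn)


def specificity(fp, tn):
    return _ratio(tn, fp + tn)


# Hypothetical counts, not taken from any real data set
tp, fn, fp, tn = 40, 10, 5, 45

print("Accuracy   ", accuracy(tp, fn, fp, tn))  # 85 / 100
print("Precision  ", precision(tp, fp))         # 40 / 45
print("Recall     ", recall(tp, fn))            # 40 / 50
print("Specificity", specificity(fp, tn))       # 45 / 50
```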

Problems and characteristics of evaluation indexes

For example, suppose the prediction results were as follows:

| | Prediction (will not become delinquent) – Positive | Prediction (will become delinquent) – Negative |
|---|---|---|
| Result (is not delinquent) – Positive | 980 (TP) | 0 (FN) |
| Result (is delinquent) – Negative | 20 (FP) | 0 (TN) |

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{980}{1000} = 0.98$$

Therefore, the value of Accuracy is high.

However, looking at the details, every prediction is Positive and not a single case was predicted as Negative. Depending on how biased the data is, Accuracy can be high even when every case is blindly predicted as Positive.

Using Specificity will provide another aspect:

$$\mathrm{Specificity} = \frac{TN}{FP + TN} = \frac{0}{20} = 0$$

As you can see, none of the cases whose result was Negative (delinquent) were predicted correctly.
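
For completeness, the remaining two indexes for the same matrix can be worked out from their definitions, using TP = 980, FN = 0, FP = 20, and TN = 0:

$$\mathrm{Precision} = \frac{TP}{TP + FP} = \frac{980}{980 + 20} = 0.98, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} = \frac{980}{980 + 0} = 1$$

Like Accuracy, both values are high even though not a single delinquent case was identified.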

Other Indexes

Having several evaluation indexes, each with its own characteristics, can make evaluation rather complicated. In such cases, an index called the F-measure (also known as F-score or F1 score) is used. It is the harmonic mean of Precision and Recall.

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
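
As a rough illustration, the snippet below applies this formula to the 980 / 0 / 20 / 0 matrix from the previous section. The function name `f1_score` and the convention of returning 0.0 when Precision and Recall are both zero are assumptions of this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of Precision and Recall.

    Returns 0.0 when both are zero (a convention assumed for this sketch).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Counts from the example matrix: all 1000 cases predicted as Positive
tp, fn, fp, tn = 980, 0, 20, 0

precision = tp / (tp + fp)  # 0.98
recall = tp / (tp + fn)     # 1.0
f1 = f1_score(precision, recall)

print(round(f1, 4))  # 0.9899
```

In this example the F1 score is also high, because it is built only from Precision and Recall and does not involve TN.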

Problem

The evaluation indexes described above are not used in the finance field. The principle of the confusion matrix is to categorize the data into Positive and Negative and to evaluate the accuracy of that classification.

One problem arises when the data is biased. When the original data is 99% Positive and 1% Negative, the value of the evaluation index will most likely be biased as well.
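
The effect of the class ratio can be sketched in a few lines of Python. The function name `always_positive_accuracy` and the case counts are hypothetical; the snippet only computes the Accuracy obtained by a rule that predicts Positive for every case.

```python
def always_positive_accuracy(n_positive, n_negative):
    """Accuracy obtained when every case is predicted as Positive."""
    tp, fn = n_positive, 0  # every actual Positive is predicted Positive
    fp, tn = n_negative, 0  # every actual Negative is also predicted Positive
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical data sets with increasing bias toward Positive
for n_pos, n_neg in [(50, 50), (90, 10), (99, 1)]:
    acc = always_positive_accuracy(n_pos, n_neg)
    print(f"{n_pos}% Positive / {n_neg}% Negative -> Accuracy {acc:.2f}")
```

With a 99% / 1% split, this rule reaches an Accuracy of 0.99 without distinguishing any case from another.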

Another issue is how to set the threshold. The ratio of Positive to Negative changes depending on the threshold, and whether a given threshold is adequate can remain ambiguous.

For example, in credit scoring, suppose company A does not contract with people whose predicted probability of delinquency is more than 5%, so its threshold is set at 5%, while company B sets its threshold at 10% and does not contract with people classified as Negative. In this case, can we say that none of the people classified as Positive will become delinquent? Not necessarily. Even when all the given information (such as attributes and transaction histories) is the same, some people become delinquent and others do not.
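
The dependence on the threshold can be sketched as follows. The function name `classify` and the list of predicted probabilities are hypothetical; only the 5% and 10% thresholds come from the example above.

```python
def classify(delinquency_probabilities, threshold):
    """Label a case Negative when its predicted delinquency probability exceeds the threshold."""
    return ["Negative" if p > threshold else "Positive"
            for p in delinquency_probabilities]


# Hypothetical predicted probabilities of delinquency for six applicants
probabilities = [0.01, 0.03, 0.06, 0.08, 0.12, 0.30]

labels_a = classify(probabilities, threshold=0.05)  # company A
labels_b = classify(probabilities, threshold=0.10)  # company B

print(labels_a.count("Positive"), labels_a.count("Negative"))  # 2 4
print(labels_b.count("Positive"), labels_b.count("Negative"))  # 4 2
```

The applicants at 6% and 8% are Negative for company A but Positive for company B, although their predicted risk is unchanged.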

The most important point in credit scoring is what the risk level is as a percentage, not how accurately people can be classified as Positive or Negative, because the product is designed around risk levels.

Conclusion

If accurate binary classification has higher priority and risk levels need not be considered, the evaluation indexes described above can be very important.

In conclusion, the index should be selected carefully; otherwise the evaluation result will be assessed incorrectly on the basis of a meaningless value.

For methods of explaining the reasons behind a model's evaluation or prediction, please refer to the following:
