Wanna become a data scientist? Checkout Beyond Machine!

Matthews correlation coefficient: A metric for imbalanced class problems

Sometimes in data science and machine learning we encounter problems with imbalanced classes. These are problems in which one class has many more instances than another. This makes accuracy a bad metric. If class A makes up 90% of our sample and class B the remaining 10%, then we can get 90% accuracy simply by always predicting class A.
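To make this concrete, here is a minimal sketch in plain Python. The data is made up for illustration (90 instances of class A and 10 of class B), and the "classifier" is just a rule that always predicts A.

```python
# Made-up labels for illustration: 90 instances of class "A", 10 of class "B".
y_true = ["A"] * 90 + ["B"] * 10

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = ["A"] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on class "B": share of true "B" instances that were predicted as "B".
recall_b = sum(t == p == "B" for t, p in zip(y_true, y_pred)) / y_true.count("B")

print(accuracy)  # 0.9
print(recall_b)  # 0.0
```

The accuracy looks good, yet the model never finds a single instance of class B.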

One metric that helps with this problem is the Matthews correlation coefficient (MCC), which was introduced in the binary setting by Matthews in 1975. Before we show the calculation for the MCC, let’s first revisit the concept of a confusion matrix. As you can see in the image below, a confusion matrix has 4 cells, created by crossing the predicted values with the real values. Two of those cells represent correct predictions (True Positives and True Negatives), and the other two represent incorrect predictions (False Positives and False Negatives).

[Image: confusion matrix]

The Matthews correlation coefficient is calculated as follows:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
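The calculation can be sketched in a few lines of Python, working directly from the four cells of the confusion matrix. The function name `mcc_from_counts` and the counts below are made up for illustration. The handling of a zero denominator is an assumption: the formula is undefined in that case, and returning 0 is one common convention.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Compute the MCC from the four confusion-matrix cells."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: if a whole row or column of the matrix is empty, the formula
    # is 0/0. Returning 0.0 here is a common convention, not the only choice.
    return numerator / denominator if denominator else 0.0

# Always predicting the majority (negative) class on 90 negatives, 10 positives.
print(mcc_from_counts(tp=0, tn=90, fp=0, fn=10))   # 0.0

# Every prediction correct.
print(mcc_from_counts(tp=10, tn=90, fp=0, fn=0))   # 1.0

# Every prediction wrong.
print(mcc_from_counts(tp=0, tn=0, fp=90, fn=10))   # -1.0
```

Note that the "always predict the majority class" model, which reaches 90% accuracy, scores 0 here.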

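The MCC takes values between -1 and 1. A score of 1 indicates perfect agreement between the predictions and the real values, 0 means the predictions are no better than random guessing, and -1 indicates total disagreement. But how does the MCC compare against other popular metrics for imbalanced classes?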

Matthews correlation coefficient vs the F1-score

The F1-score is another very popular metric for imbalanced class problems. The F1-score is calculated as:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

So, it is simply the harmonic mean of precision and recall. According to a paper, the MCC has two advantages over the F1-score.

  1. F1 changes when the classes are swapped, while the MCC stays the same if the positive class is renamed negative and vice versa.
  2. F1 does not depend on the number of samples correctly classified as negative (the True Negatives), while the MCC uses all four cells of the confusion matrix.

Therefore, it is argued that the MCC is a more complete measure, compared to F1.
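Both points can be checked with a small sketch. The confusion-matrix counts are made up for illustration, and the helper names `f1_from_counts` and `mcc_from_counts` are my own, not part of any library. Swapping the classes means that True Positives trade places with True Negatives, and False Positives with False Negatives.

```python
import math

def f1_from_counts(tp, tn, fp, fn):
    # tn is accepted for a uniform signature but never used:
    # F1 ignores the True Negatives entirely.
    return 2 * tp / (2 * tp + fp + fn)

def mcc_from_counts(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator

# Made-up counts: the positive class is the rare one.
original = dict(tp=5, tn=85, fp=5, fn=5)

# Same predictions, but with the positive and negative labels renamed.
swapped = dict(tp=85, tn=5, fp=5, fn=5)

# Same errors as the original, but with many more True Negatives.
more_negatives = dict(tp=5, tn=8500, fp=5, fn=5)

print(round(f1_from_counts(**original), 3))   # 0.5
print(round(f1_from_counts(**swapped), 3))    # 0.944
print(round(mcc_from_counts(**original), 3))  # 0.444
print(round(mcc_from_counts(**swapped), 3))   # 0.444

# F1 does not react to the extra True Negatives; the MCC does.
print(f1_from_counts(**more_negatives) == f1_from_counts(**original))    # True
print(mcc_from_counts(**more_negatives) == mcc_from_counts(**original))  # False
```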

Matthews correlation coefficient vs Cohen’s Kappa

Cohen’s kappa is one of my favourite measures, and one that we’ve written about on this blog. It’s a great metric for imbalanced class problems. This paper compares the two metrics. An argument is made in favour of the MCC, but I personally believe that it’s too theoretical. I’ve found that in practice both metrics give similar results, and I use both in all my projects.
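As a rough illustration of how close the two can be, here is a sketch that computes both from the same confusion matrix. The counts are made up, the helper names are my own, and a single example does not prove anything in general. Kappa is computed in the usual way, as observed agreement corrected for the agreement expected by chance.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator

def kappa_from_counts(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    # Observed agreement: share of cases where prediction and truth match.
    p_observed = (tp + tn) / n
    # Chance agreement: from the predicted and actual class frequencies.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (p_observed - p_chance) / (1 - p_chance)

# Made-up counts for an imbalanced problem (10 positives, 90 negatives).
counts = dict(tp=6, tn=80, fp=10, fn=4)

kappa = kappa_from_counts(**counts)
mcc = mcc_from_counts(**counts)

print(round(kappa, 3))  # 0.386
print(round(mcc, 3))    # 0.4
```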

Using the MCC in Python and R

Using the MCC in Python is very easy. You can just use scikit-learn’s metrics API: the MCC is computed by the function matthews_corrcoef. In R, you can use the function mcc from the mltools package.
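A minimal usage sketch for Python, assuming scikit-learn is installed. The labels are made up for illustration.

```python
from sklearn.metrics import matthews_corrcoef

# Made-up labels: 3 positives and 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Predictions with one missed positive and one false alarm.
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

score = matthews_corrcoef(y_true, y_pred)
print(round(score, 3))  # 0.524
```

Here TP = 2, TN = 6, FP = 1 and FN = 1, so the formula gives (2 × 6 − 1 × 1) / √(3 × 3 × 7 × 7) = 11 / 21.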

Dr. Stylianos Kampakis is the owner and author of The Data Scientist.