Confusion matrix#

The confusion matrix is a central concept for evaluating classification models.

import numpy as np
import pandas as pd

import sklearn
from sklearn.datasets import make_classification

Example task#

Consider a binary classification problem. We have two classes, Positive and Negative.

Let us denote:

  • \(P\) is the number of positive observations in the sample;

  • \(N\) is the number of negative observations in the sample.

The following code generates possible outputs of the classification task:

  • \(y\) - real targets, where 0 corresponds to the negative class and 1 to the positive class;

  • \(t_i\) - scores that indicate how likely a particular object is to belong to the positive class;

  • \(\hat{y}=\left[t_i>T\right]\) - final predictions, which depend on the cut-off threshold \(T\). Different choices of \(T\) fill the confusion matrix differently - this is discussed in the following sections. For the example below, \(T\) is the mean of \(t_i\), \(i=\overline{1,n}\).

scores, y = make_classification(
    n_features=1,
    n_informative=1,
    n_redundant=0,
    n_repeated=0,
    n_clusters_per_class=1,
    flip_y=0.3,
    random_state=2
)
scores = scores.ravel()
pd.DataFrame({
    "$$y$$" : y, 
    "$$t_i$$" : scores, 
    "$$\hat{y}$$":(scores>np.mean(scores)).astype("int")
}).head()
$$y$$ $$t_i$$ $$\hat{y}$$
0 0 -0.644251 0
1 0 -1.162201 0
2 1 -0.624533 0
3 1 2.033877 1
4 1 -1.012203 0

Idea#

Now, suppose we have trained some classifier. We have the following groups of observations.

  • True positive: observations that were positive in the sample and we correctly predicted them as positive. We will denote their number as \(TP\);

  • True negative: observations that were negative in the sample and we correctly predicted them as negative. We will denote their number as \(TN\);

  • False positive: observations that were negative in the sample, but which we mistakenly predicted to be positive. We will denote their number as \(FP\);

  • False negative: observations that were positive in the sample, but which we mistakenly predicted to be negative. We will denote their number as \(FN\).

So, if you put the actual classes on the rows and the predicted classes on the columns, you get the confusion matrix.

|              | Predicted \(N\) | Predicted \(P\) |
|--------------|-----------------|-----------------|
| Actual \(N\) | \(TN\)          | \(FP\)          |
| Actual \(P\) | \(FN\)          | \(TP\)          |
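
The same four counts can be computed directly with NumPy, which makes the definitions concrete. Here is a minimal sketch; it assumes the y and scores arrays generated above and the same cut-off threshold, the mean of the scores:

# predictions for the cut-off threshold T = mean of the scores
y_hat = (scores > np.mean(scores)).astype("int")

# count each group by combining actual and predicted labels
TP = np.sum((y == 1) & (y_hat == 1))
TN = np.sum((y == 0) & (y_hat == 0))
FP = np.sum((y == 0) & (y_hat == 1))
FN = np.sum((y == 1) & (y_hat == 0))

print(f"TN={TN}, FP={FP}, FN={FN}, TP={TP}")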

Also valuable is the representation of the confusion matrix using relative values.

Let us denote:

  • \(P^* = TP + FP\) - number of observations from the sample predicted as positive;

  • \(N^* = TN + FN\) - number of observations from the sample predicted as negative;

  • \(TNR = TN/N^*\) - true negative rate, the proportion of correct predictions among observations that are predicted negative;

  • \(FNR = FN/N^*\) - false negative rate, the proportion of incorrect predictions among observations that are predicted to be negative;

  • \(TPR = TP/P^*\) - true positive rate, the proportion of correct predictions among observations that are predicted to be positive;

  • \(FPR = FP/P^*\) - false positive rate, the proportion of incorrect predictions among observations that are predicted to be positive.

So using these notations the confusion matrix can also be written:

|              | Predicted \(N\) | Predicted \(P\) |
|--------------|-----------------|-----------------|
| Actual \(N\) | \(TNR\)         | \(FPR\)         |
| Actual \(P\) | \(FNR\)         | \(TPR\)         |

Here is an example of calculating the confusion matrix using sklearn.metrics.confusion_matrix:

from sklearn import metrics
metrics.confusion_matrix(y, scores > np.mean(scores))
array([[39,  7],
       [17, 37]])
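
With a reasonably recent scikit-learn (the normalize argument appeared in version 0.22), the relative form of the matrix can be obtained directly: normalize="pred" divides each column by the number of observations predicted in that class, which corresponds to the \(N^*\) and \(P^*\) normalisation used above. A sketch:

# each column is divided by its total: the first column contains
# TNR and FNR, the second column contains FPR and TPR
metrics.confusion_matrix(
    y, scores > np.mean(scores), normalize="pred"
)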

Confusion table#

Idea#

Many classification models can return a score that indicates the probability that a particular object belongs to the positive class. You can select a threshold above which you consider the object under consideration to be positive. Different threshold values will consequently produce different confusion matrices.

The table that associates each selected threshold with its corresponding confusion matrix will be called the confusion table.

| \(T\)   | \(TN\)   | \(FP\)   | \(FN\)   | \(TP\)   |
|---------|----------|----------|----------|----------|
| \(t_1\) | \(TN_1\) | \(FP_1\) | \(FN_1\) | \(TP_1\) |
| \(t_2\) | \(TN_2\) | \(FP_2\) | \(FN_2\) | \(TP_2\) |
| ...     | ...      | ...      | ...      | ...      |
| \(t_i\) | \(TN_i\) | \(FP_i\) | \(FN_i\) | \(TP_i\) |
| ...     | ...      | ...      | ...      | ...      |
| \(t_n\) | \(TN_n\) | \(FP_n\) | \(FN_n\) | \(TP_n\) |

Realisation#

I didn’t find a ready realisation of this concept, so here is my own.

The first thing that comes to mind is to use sklearn.metrics.confusion_matrix for each needed \(t_i\). But this solution is extremely slow - an estimate of the complexity of such an algorithm is \(O(nT')\), where \(n\) is the number of samples and \(T'\) is the number of thresholds to check.
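
For reference, a brute-force sketch of that slow approach could look like the hypothetical helper below; it simply calls metrics.confusion_matrix once per threshold and is written only for illustration, it is not used in the rest of the text:

def naive_confusion_table(y_true, y_score, thresholds):
    # one full pass over the data for every threshold: O(nT')
    rows = []
    for t in thresholds:
        # labels=[0, 1] fixes the matrix layout even if a threshold
        # leaves only one predicted class
        tn, fp, fn, tp = metrics.confusion_matrix(
            y_true, (y_score >= t).astype("int"), labels=[0, 1]
        ).ravel()
        rows.append([t, tn, fp, fn, tp])
    return np.array(rows)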

The following is a description of an algorithm that runs in \(O(n + T')\), assuming that the observations are sorted in ascending order of \(s_i\) and all thresholds are also sorted in ascending order:

Text description#

As input we have three arrays:

  • \(\left\{y_1,y_2, ..., y_n\right\}\) - real classes of the observations, where \(y_i = \begin{cases} 0, & \text{if the } i\text{-th observation is negative};\\ 1, & \text{if the } i\text{-th observation is positive}. \end{cases}\)

  • \(\left\{s_1,s_2, ..., s_n\right\}\) - scores of the observations;

  • \(\left\{t_1, t_2, ..., t_{T'}\right\}\) - the thresholds we’re interested in.

Let’s introduce cursors for the values we are interested in and initialise them with the values that correspond to the smallest reasonable threshold for the given problem:

  • \(TP'=P=\sum_{i=1}^n{y_i}\) - with the lowest threshold, all positive outcomes are correctly classified as positive;

  • \(FP'=N = n -\sum_{i=1}^n{y_i}\) - with the lowest threshold, all negative outcomes are mistakenly classified as positive;

  • \(TN'=0, FN'=0\) - with the lowest threshold there are no observations classified as negative;

  • \(i=1\) indexes \(y_i\) and \(s_i\);

  • \(j=1\) indexes \(t_j\).

Now let’s talk about the iterative procedure:

At each step we compare \(s_i\) and \(t_j\):

  • If \(s_i < t_j\):

    • If \(y_i=1\), then we add one to \(FN'\) and subtract one from \(TP'\);

    • If \(y_i=0\), then we add one to \(TN'\) and subtract one from \(FP'\);

    • Move to the next observation - add one to \(i\);

  • If \(s_i \geq t_j\):

    • Add the current values of \(TP', FP', TN', FN'\) to the confusion table row corresponding to \(t_j\);

    • Move to the next threshold - add one to \(j\);

  • The algorithm stops when we have processed all thresholds, i.e. when \(j>T'\).

Python realisation#

This section contains a Python function that implements the algorithm described above.

import numpy as np

def get_sorted_by_scores(y_true, y_score):
    '''
    Sort y_true and y_score, whose elements
    correspond to each other, by y_score.

    Parameters
    ----------
    y_true : numpy.array
        array of outcomes;
    y_score : numpy.array
        array of scores that defines the order
        of the outputs.

    Returns
    ----------
    y_true : numpy.array
        y_true sorted according to y_score;
    y_score : numpy.array
        sorted y_score.
    '''
    sorting = np.argsort(y_score)
    return y_true[sorting], y_score[sorting]


def get_confusion_table(
    y_true,
    y_score,
    thresholds=None
):
    '''
    Get the confusion table - a set of confusion
    matrices for a given set of thresholds.

    Parameters
    ----------
    y_true : numpy.array
        binary array that represents the real classes
        of the observations;
    y_score : numpy.array
        array whose elements correspond to y_true
        and represent the scores of the classifier;
    thresholds : numpy.array
        set of thresholds. By default the same as y_score.

    Returns
    ---------
    out : numpy.array
        array where each row is the confusion matrix for
        a particular threshold. Each row contains the values
        threshold, TN, FP, FN, TP.
    '''
    # it's crucial for y_true and
    # y_score to have the same size
    n = len(y_score)
    if n != len(y_true):
        raise ValueError(
            "y_score and y_true must "
            "have the same size. But got "
            f"{n} and {len(y_true)}"
        )

    y_true, y_score = get_sorted_by_scores(y_true, y_score)
    # by default the sorted scores themselves serve as thresholds;
    # thresholds have to be processed in ascending order
    thresholds = y_score if thresholds is None else np.sort(thresholds)

    T_len = len(thresholds)

    # setting up initial values for the cursors:
    # with the lowest threshold everything is predicted positive
    TP = np.sum(y_true)
    FP = n - TP
    TN = 0; FN = 0
    i = 0; j = 0

    res = []

    while True:
        if i >= n:
            # all observations are below the remaining thresholds,
            # so every one of them is predicted negative
            res.append([thresholds[j], TN, FP, FN, TP])
            j += 1
        elif y_score[i] < thresholds[j]:
            if y_true[i]: FN += 1; TP -= 1
            else: TN += 1; FP -= 1
            i += 1
        else:
            res.append([thresholds[j], TN, FP, FN, TP])
            j += 1

        if j >= T_len:
            return np.array(res)

So let’s see how it works with the example introduced earlier:

result = get_confusion_table(y, scores)

pd.DataFrame(
    result[:, 1:], 
    index=pd.Index(result[:,0], name="$$T$$"),
    columns=["$$TN$$", "$$FP$$", "$$FN$$", "$$TP$$"]
).astype("int")
$$TN$$ $$FP$$ $$FN$$ $$TP$$
$$T$$
-1.528063 0 46 0 54
-1.463307 0 46 1 53
-1.405160 1 45 1 53
-1.388968 2 44 1 53
-1.308240 3 43 1 53
... ... ... ... ...
1.814459 45 1 50 4
1.834142 45 1 51 3
1.892121 45 1 52 2
1.917869 46 0 52 2
2.033877 46 0 53 1

100 rows × 4 columns
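
As an additional sanity check, any row of this table should agree with metrics.confusion_matrix computed at the same threshold. A sketch, assuming the result, y and scores variables from above; the row index 10 is arbitrary:

# rebuild one row of the table with sklearn:
# predict positive when score >= T
T = result[10, 0]
tn, fp, fn, tp = metrics.confusion_matrix(
    y, (scores >= T).astype("int"), labels=[0, 1]
).ravel()
print([T, tn, fp, fn, tp])
print(result[10])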

Tests#

For the example under consideration everything looks great, but it is good practice to build a simple example that can easily be checked by hand. Here is a set of unit tests, first described and then implemented, for the function above.

Consider case:

  • \(y_i\in \left\{1,0,0,1,0\right\}\) - real classes of the observations;

  • \(s_i\in \left\{4,1,1,2,3\right\}\) - scores of the observations;

  • \(t_j \in \left\{2,3\right\}\) - the thresholds we’re interested in.

Let us go through each of the specified cut-off thresholds and mark with × the cell of the confusion matrix that each observation falls into at that threshold.

For \(t_1=2\):

| \(i\) | \(s_i\) | \(y_i\) | \(TN\) | \(FP\) | \(FN\) | \(TP\) |
|-------|---------|---------|--------|--------|--------|--------|
| 1     | 4       | 1       |        |        |        | ×      |
| 2     | 1       | 0       | ×      |        |        |        |
| 3     | 1       | 0       | ×      |        |        |        |
| 4     | 2       | 1       |        |        |        | ×      |
| 5     | 3       | 0       |        | ×      |        |        |
| Total |         |         | 2      | 1      | 0      | 2      |

For \(t_2=3\):

| \(i\) | \(s_i\) | \(y_i\) | \(TN\) | \(FP\) | \(FN\) | \(TP\) |
|-------|---------|---------|--------|--------|--------|--------|
| 1     | 4       | 1       |        |        |        | ×      |
| 2     | 1       | 0       | ×      |        |        |        |
| 3     | 1       | 0       | ×      |        |        |        |
| 4     | 2       | 1       |        |        | ×      |        |
| 5     | 3       | 0       |        | ×      |        |        |
| Total |         |         | 2      | 1      | 1      | 1      |

So let’s see if the results of our hand calculations are the same as the function’s output.

test_y =      np.array([1,0,0,1,0])
test_scores = np.array([4,1,1,2,3])
test_thresholds = np.array([2,3])

result = get_confusion_table(test_y, test_scores, test_thresholds)

pd.DataFrame(
    result[:, 1:], 
    index=pd.Index(result[:,0], name="$$T$$"),
    columns=["$$TN$$", "$$FP$$", "$$FN$$", "$$TP$$"]
).astype("int")
$$TN$$ $$FP$$ $$FN$$ $$TP$$
$$T$$
2 2 1 0 2
3 2 1 1 1

Here are automated test cases for the cases discussed earlier and some additional ones, so you don’t have to re-check everything by hand if you want to change something in the implementation of the function:

import unittest

class TestConfusionTable(unittest.TestCase):
    def test_array_sizes(self):
        '''
        If y_true and y_score have different
        sizes it should raise ValueError.
        '''
        with self.assertRaises(ValueError):
            get_confusion_table(
                y_true=[1,2,3,4], 
                y_score=[3,2,1,4,3]
            )

    def test_basic_output(self):
        '''
        This is a basic case for which
        we know the actual answer.
        '''
        test_y = np.array([1,0,0,1,0])
        test_scores = np.array([4,1,1,2,3])
        test_thresholds = np.array([2,3])

        exp_ans = np.array([
            [2, 2, 1, 0, 2],
            [3, 2, 1, 1, 1]
        ])
        act_ans = get_confusion_table(
            test_y, test_scores, test_thresholds
        )
        self.assertTrue(
            (act_ans == exp_ans).all()
        )
ans = unittest.main(argv=[''], verbosity=2, exit=False)
test_array_sizes (__main__.TestConfusionTable)
If y_true and y_score have different ... ok
test_basic_output (__main__.TestConfusionTable)
This is a basic case for which ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.002s

OK