recall@k#

\(recall_j@k\) measures how many of the relevant items appear in the top \(k\) recommendations, out of all the relevant items, where \(k\) is the number of recommendations generated for the \(j\)-th object. More formally:

\[recall_j@k = \frac{\sum_{i=1}^k r_{ij}}{\sum_{i=1}^n r_{ij}}\]

Where:

  • items are sorted by the scores that the model under consideration predicts for the \(j\)-th object;

  • \(\sum_{i=1}^k r_{ij}\) is the number of relevant items among the first \(k\) items;

  • \(\sum_{i=1}^n r_{ij}\) is the total number of relevant items for the \(j\)-th object.
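
For instance, suppose that for some toy object the relevance flags of five items, sorted by descending model score, are \(1, 0, 1, 0, 1\) (a made-up illustration, not taken from the dataset used below). Then

\[recall@2 = \frac{1 + 0}{1 + 0 + 1 + 0 + 1} = \frac{1}{3}\]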

import numpy as np
import pandas as pd

import unittest
from IPython.display import HTML

R_frame = pd.read_parquet("example.parquet")
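
For orientation: the rest of the page assumes that R_frame contains one row per object-item pair with a binary relevance flag and the score columns of the two compared models; the column names in the comment below are the ones used by the code that follows, and the peek is just a convenience check.

# Quick look at the data. The code on this page uses the columns
# "object", "item", "relevant", "Random scores" and "KNN scores".
display(R_frame.head())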

Consider a specific object#

Let’s examine a specific object to get a clear picture of the situation and calculate recall at 3 (\(recall@3\)) for it, comparing the models to see how they differ. In the following cell we extract a subframe for the chosen object and sort it by each model’s scores. The example was selected to highlight the disparity in \(recall@3\) between the models:

k = 3
obj = 4

# items ranked for the chosen object by the first model's (random) scores
model1_tab = R_frame.loc[
    R_frame["object"] == obj,
    [
        "item",
        "relevant",
        "Random scores"
    ]
].sort_values(
    "Random scores", 
    ascending=False
).set_index("item")

# items ranked for the chosen object by the second model's (KNN) scores
model2_tab = R_frame.loc[
    R_frame["object"] == obj,
    [
        "item",
        "relevant",
        "KNN scores"
    ]
].sort_values(
    "KNN scores", 
    ascending=False
).set_index("item")

# share of relevant items in the top k out of all relevant items
model1_recall = (
    model1_tab["relevant"].iloc[:k].sum()/
    model1_tab["relevant"].sum()
)
model2_recall = (
    model2_tab["relevant"].iloc[:k].sum()/
    model2_tab["relevant"].sum()
)

display(HTML(
    f"""
    <div style='display: flex;justify-content: space-around;'>
        <div>
            {model1_tab.to_html()}
            <p style='font-size:20px'>
                recall@{k} - {round(model1_recall*100,2)}%
            </p>
        </div>
        <div>
            {model2_tab.to_html()}
            <p style='font-size:20px'>
                recall@{k} - {round(model2_recall*100,2)}%
            </p>
        </div>
    </div>
    """
))
item    relevant    Random scores
3       1            2.465325
18      1            1.985386
8       0            1.656717
25      0            1.614408
19      1            1.447166
4       0            1.383232
16      1            1.339926
2       1            1.236205
29      0            1.134973
6       0            1.022516
9       0            0.667890
24      0            0.377753
5       0            0.346233
28      0            0.332350
13      0            0.313831
7       1            0.166810
17      0            0.029310
22      1           -0.048041
15      0           -0.221793
10      1           -0.229947
20      1           -0.287629
27      1           -0.388728
23      1           -0.480787
0       1           -0.573113
12      0           -0.639963
26      0           -1.123104
11      0           -1.129551
14      1           -1.225836
1       0           -1.320448
21      0           -1.359311

recall@3 - 15.38%

item    relevant    KNN scores
0       1            0.913773
14      1            0.792041
3       1            0.779723
20      1            0.737135
16      1            0.735866
8       0            0.654573
10      1            0.653200
29      0            0.648329
4       0            0.646239
27      1            0.643070
9       0            0.641759
23      1            0.631478
7       1            0.561225
2       1            0.561225
19      1            0.551094
18      1            0.548907
22      1            0.534151
25      0            0.465849
15      0            0.453531
26      0            0.440866
11      0            0.410943
1       0            0.360683
17      0            0.354840
13      0            0.354840
28      0            0.354639
6       0            0.352548
24      0            0.278986
5       0            0.250546
12      0            0.190521
21      0            0.172158

recall@3 - 23.08%

Python code#

The following function is a Python implementation of \(recall@k\).

def recall_k(relevance_array, pred_score, k):
    '''
    Compute recall@k: the proportion of relevant items
    that appear within the top k recommendations, out of
    all relevant items. It reflects the ability to identify
    and include relevant items in the first k recommendations.
    
    Parameters
    ----------
    relevance_array : numpy.array
        binary array marking observations that are relevant;
    pred_score : numpy.array
        predicted scores, expected to be higher
        the more relevant the item is;
    k : int
        number of top-scored items to consider.

    Returns
    ----------
    out : float
        value of the metric.
    '''
    if len(relevance_array) != len(pred_score):
        raise ValueError(
            "`relevance_array` and `pred_score` must be the same size"
        )
    elif len(relevance_array) < k:
        raise ValueError(
            "k is greater than the number of observations"
        )
    
    # relevance flags sorted by descending predicted score, truncated to top k
    relevant_in_k = np.sum(
        relevance_array[np.argsort(pred_score)[::-1]][:k]
    )
    relevant_total = np.sum(relevance_array)
    return relevant_in_k/relevant_total
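
As a quick sanity check, the call below uses made-up arrays (not part of the dataset): the two highest-scored items are the first two, and only one of them is relevant, so the function should return \(0.5\):

# Toy check: scores 0.9 and 0.8 put the first two items on top;
# only one of them is relevant, so recall@2 = 1/2.
print(recall_k(
    relevance_array=np.array([0, 1, 1, 0]),
    pred_score=np.array([0.9, 0.8, 0.1, 0.4]),
    k=2
))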

Here are some unit tests for the function defined above:

class TestRecall(unittest.TestCase):
    def test_different_sizes(self):
        '''
        We must check that if the sizes of arrays with 
        relevance and prediction differ, an error must 
        be raised.
        '''
        with self.assertRaises(ValueError):
            recall_k(
                np.array([1, 1, 0]),
                np.array([0.3, 0.2, 0.3, 0.2]),
                1
            )

    def test_k_more_obs(self):
        '''
        K cannot be more than the number of observations 
        we are considering.
        '''
        with self.assertRaises(ValueError):
            recall_k(
                np.array([1, 1, 0, 0, 1]),
                np.array([0.4, 0.1, 0.2, 0.5, 0.3]),
                10
            )
    
    def test_computations(self):
        '''
        Just basic test with known result
        '''
        real_ans = recall_k(
            np.array([1, 1, 0, 0, 1]),
            np.array([0.4, 0.1, 0.2, 0.5, 0.3]),
            3
        )
        exp_ans = 2/3
        self.assertAlmostEqual(real_ans, exp_ans, delta=0.000001)
ans = unittest.main(argv=[''], verbosity=2, exit=False)
del TestRecall
test_computations (__main__.TestRecall)
Just basic test with known result ... ok
test_different_sizes (__main__.TestRecall)
We must check that if the sizes of arrays with ... ok
test_k_more_obs (__main__.TestRecall)
K cannot be more than the number of observations ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.003s

OK

The following cell shows the code that calculates the recall for our example: it is computed for each object (with \(k=4\)) and the results are then averaged.

# per-object recall@4 for both models
show = R_frame.groupby("object").apply(
    lambda object: pd.Series({
        "recall for model 1" : recall_k(
            relevance_array=object["relevant"].to_numpy(),
            pred_score=object["Random scores"].to_numpy(),
            k=4
        ),
        "recall for model2" : recall_k(
            relevance_array=object["relevant"].to_numpy(),
            pred_score=object["KNN scores"].to_numpy(),
            k=4
        )
    }),
    include_groups=False
)
display(show)
display(show.mean().rename("mean value").to_frame().T)
object    recall for model 1    recall for model 2
0         0.000000              0.307692
1         0.190476              0.095238
2         0.062500              0.250000
3         0.058824              0.235294
4         0.153846              0.307692
5         0.153846              0.230769
6         0.166667              0.222222
7         0.125000              0.250000
8         0.153846              0.153846
9         0.105263              0.210526

              recall for model 1    recall for model 2
mean value    0.117027              0.226328