Collaborative filtering#

This is a type of model that requires only the matrix \(R\).

import numpy as np
import pandas as pd

Memory based#

User based#

Aka user-user filtering.

This method is based on the idea that if two users rated their common items similarly, it makes sense to recommend them the same items in the future:

|  | \(i_1\) | \(i_2\) | \(\dots\) | \(i_n\) |
|---|---|---|---|---|
| \(u_1\) | \(r_{11}\) | \(r_{12}\) | \(\dots\) | \(r_{1n}\) |
| \(u_2\) | \(r_{21}\) | \(r_{22}\) | \(\dots\) | \(r_{2n}\) |
| \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
| \(u_m\) | \(r_{m1}\) | \(r_{m2}\) | \(\dots\) | \(r_{mn}\) |

Consider a simple example:

|  | \(i_{1}\) | \(i_{2}\) | \(i_{3}\) | \(i_{4}\) | \(i_{5}\) | \(i_{6}\) | \(i_{7}\) | \(i_{8}\) | \(i_{9}\) | \(i_{10}\) |
|---|---|---|---|---|---|---|---|---|---|---|
| \(u_{1}\) | 10 | 5 | 1 | 2 | 10 | 1 | 2 | 9 | 10 | 1 |
| \(u_{2}\) | 10 | 4 | 2 | 1 | 9 | 1 | 3 | 8 | 10 | 2 |
| \(u_{3}\) | 1 | 4 | 8 | 9 | 1 | 10 | 8 | 1 | 1 | 9 |

Without any special techniques, you can see that user 1 and user 2 have very similar preferences. But user 3 has completely different preferences.

In the following cell, the \(R\) matrix under consideration is defined as a numpy array.

arr = np.array([
    [10,  5,  1,  2, 10,  1,  2,  9, 10,  1],
    [10,  4,  2,  1,  9,  1,  3,  8, 10,  2],
    [ 1,  4,  8,  9,  1, 10,  8,  1,  1,  9]
])

So let’s try to formalise their similarity; in this case we’ll use the Pearson correlation coefficient:

corrmatrix = np.corrcoef(arr)
print("Similarity between first and second user:", corrmatrix[0,1])
print("Similarity between second and third user:", corrmatrix[1,2])
Similarity between first and second user: 0.9802681400094546
Similarity between second and third user: -0.9650036744440637

So we have confirmed that the first user is close to the second, while the third is very different. According to our approach, we are more likely to recommend to the second user the items that the first one liked, and vice versa.
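As a small illustration of how this similarity can be used, the cell below picks, for each user, the most similar other user from the correlation matrix computed above (the `nearest_neighbour` array is introduced here purely for illustration):

similarity = corrmatrix.copy()
np.fill_diagonal(similarity, -np.inf)  # exclude self-similarity
# index of the most similar user for each user
nearest_neighbour = similarity.argmax(axis=1)
print(nearest_neighbour)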

Item based#

Aka item-item filtering.

The item-based approach is technically the same as the user-based one. Here we look for items that received similar ratings \(r_{ij}, i=\overline{1,m}\) from the same users.

So the \(R\) matrix is just transposed compared to the user based approach:

|  | \(u_1\) | \(u_2\) | \(\dots\) | \(u_m\) |
|---|---|---|---|---|
| \(i_1\) | \(r_{11}\) | \(r_{12}\) | \(\dots\) | \(r_{1m}\) |
| \(i_2\) | \(r_{21}\) | \(r_{22}\) | \(\dots\) | \(r_{2m}\) |
| \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
| \(i_n\) | \(r_{n1}\) | \(r_{n2}\) | \(\dots\) | \(r_{nm}\) |

Note that, in contrast to the definition given above, the order of indexing of the elements of the matrix \(R\) is changed here (\(i\) and \(j\) are swapped): the entry in the row of item \(i\) and the column of user \(j\) corresponds to \(r_{ji}\) of the original matrix.
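Since the item-based case is just the user-based case applied to the transposed matrix, a minimal sketch with the example matrix defined above is to compute the correlations between the rows of the transposed array, i.e. between the items:

item_corrmatrix = np.corrcoef(arr.T)  # item-item similarities
print("Similarity between first and fifth items:", item_corrmatrix[0, 4])
print("Similarity between first and sixth items:", item_corrmatrix[0, 5])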

Generalised algorithm#

Here we take a general look at how collaborative filtering works. The operations are identical in the user-based and item-based cases, so we will only consider the abstract matrix \(R\), where the target value \(r_{ij}\) corresponds to the combination of row \(i\) and column \(j\). For the sake of simplicity, rows will sometimes be referred to as objects.

Estimate similarities. For each pair of rows, say rows \(k\) and \(t\), \(t\neq k\), we compute a metric that estimates their similarity \(Sim(k,t)\), for example the Pearson correlation coefficient.

An important nuance is that some \(r_{ij}\) can be undefined, which means that the combination \(i,j\) did not occur. Let \(I_k\) and \(I_t\) denote the sets of columns with defined ratings for the \(k\)-th and \(t\)-th rows, respectively. Then the set of columns that have defined ratings for both the \(k\)-th and \(t\)-th rows is \(J=I_k \cap I_t\).

\(Sim(k,t), t\neq k\) is the similarity measure of the \(k\)-th and \(t\)-th rows. The simplest and perhaps most popular choice is the Pearson correlation coefficient:

\[Sim(k,t) = \frac{\sum_{j\in J}(r_{kj}-\overline{r_{k}})(r_{tj}-\overline{r_{t}})} {\sqrt{\sum_{j\in J}(r_{kj}-\overline{r_{k}})^2}\sqrt{\sum_{j \in J}(r_{tj}-\overline{r_{t}})^2}} \]
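The following cell is a minimal sketch of this computation for the case where undefined ratings are stored as `np.nan`; the `pearson_sim` function is introduced here only for illustration. The row means are taken over the common columns \(J\), although taking them over each row's full set of defined ratings is another common choice.

def pearson_sim(R, k, t):
    """Pearson correlation between rows k and t of R, computed
    only over the columns where both rows have defined ratings."""
    J = ~np.isnan(R[k]) & ~np.isnan(R[t])  # J = I_k ∩ I_t
    r_k, r_t = R[k, J], R[t, J]
    r_k_centered = r_k - r_k.mean()
    r_t_centered = r_t - r_t.mean()
    denominator = np.sqrt((r_k_centered**2).sum()) * np.sqrt((r_t_centered**2).sum())
    return (r_k_centered * r_t_centered).sum() / denominator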

Estimate rating. Once we have computed the similarities between all rows, we can predict the relevance of the \(j\)-th column to the \(k'\)-th row. Usually this is a function that uses the similarities between \(k'\) and all other rows that have a rating for the \(j\)-th column.

A really popular option is:

\[\hat{r}_{k'j}=\overline{r_{k'}} + \frac{\sum_{t\in P_{k'}(j)} Sim(k',t)(r_{tj}-\overline{r_{t}})}{\sum_{t\in P_{k'}(j)} |Sim(k',t)|}\]

Where:

  • \(P_{k'}(j)\) - the set of rows closest to the target row \(k'\) that have defined ratings for column \(j\). It can be defined in various ways;

  • \(\hat{r}_{k'j}\) - the estimated relevance of the \(j\)-th column to the \(k'\)-th row.

See the prediction functions page for a more detailed description of this function and other options.

Finally, for the \(k'\)-th row we select the \(j\) with the best \(\hat{r}_{k'j}\), so the final answer is \(j' = \arg\max_{j}\left[\hat{r}_{k'j}\right]\).
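A minimal sketch of this prediction step, assuming the similarities come from the `pearson_sim` function defined above and that \(P_{k'}(j)\) is simply all other rows with a defined rating in column \(j\) (the matrix below is a small toy example):

R = np.array([
    [10.,  5.,  1.,     2.],
    [ 9.,  4.,  2.,     1.],
    [ 1.,  4.,  9., np.nan]
])
k_prime, j = 2, 3  # predict the missing rating in the last row

defined = ~np.isnan(R[:, j])
defined[k_prime] = False                   # exclude the target row itself
neighbours = np.where(defined)[0]          # P_{k'}(j): all other rows that rated column j

sims = np.array([pearson_sim(R, k_prime, t) for t in neighbours])
mean_k = np.nanmean(R[k_prime])            # mean rating of the target row
deviations = np.array([R[t, j] - np.nanmean(R[t]) for t in neighbours])

r_hat = mean_k + (sims * deviations).sum() / np.abs(sims).sum()
print("Estimated rating:", r_hat)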

Model based#

Model-based methods work by building models that attempt to predict the rating for a user-item pair, using the known ratings as features.
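As one possible example of a model-based method (there are many other model families that fit this description), the cell below sketches a tiny matrix-factorisation model: the example ratings matrix defined above is approximated by a product of low-rank user and item factor matrices obtained with a truncated SVD, and the reconstruction serves as the predicted ratings.

U, s, Vt = np.linalg.svd(arr, full_matrices=False)
rank = 2
user_factors = U[:, :rank] * s[:rank]  # 3 x rank matrix of user factors
item_factors = Vt[:rank, :]            # rank x 10 matrix of item factors
# predicted rating for user i and item j is the dot product of their factor vectors
R_hat = user_factors @ item_factors
print(np.round(R_hat, 1))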