Recommender systems: matrix factorization

Minimal matrix factorization

When trying to make a recommender system, one popular (and currently state of the art as of September 2019) approach is to approximate the rating that a user would give to an unrated item using the ratings of the other users. To do so, one can use a low rank approximation of the (mostly empty) rating matrix $R$ where $R_{ui}$ is the rating given by the user $u$ to the item $i$.

This method is used by big names in the recommendation game like Netflix, Deezer, Spotify…

The main modeling idea of matrix factorization is to view the users and items as low dimensional vectors such that the dot product between a user’s vector and an item’s vector gives an approximation of the rating the user would give to that item.

This way, the rating matrix is predicted as the product of two smaller matrices, hence the term matrix factorization.

Mathematically, this translates to :

$$ W \odot (R - U^\top I) \approx 0 $$

where:

  • $U$ and $I$ are the users and items embedding matrices;
  • $W$ is the matrix such that $W_{ui} = 1$ if user $u$ rated item $i$, $0$ otherwise. In plain english that means we ignore (for now) the ratings we don’t know in our optimization algorithm.

As $U$ and $I$ are low dimensional, too small to learn the ratings individually, constraining them on the known ratings will intuitively force the embeddings to encode general information, thus leading to hopefully coherent ratings on the unknown (user, item) pairs.

We can derive an objective function from this equation and use it to tune $U$ and $I$ with a gradient descent approach.

Finding the best $U,I$ is equivalent to finding the $U$ and $I$ that minimize

$$ \lVert W \odot (R - U^\top I)\rVert^2 + \lambda(\lVert U\rVert^2 + \lVert I\rVert^2) $$

where $\lambda(\lVert U\rVert^2 + \lVert I\rVert^2)$ is a regularization term that will prevent $U$ and $I$ from diverging.

In practice

Although not always ideal, an easy way of exploring data and interactively doing stuff on it is through a Jupyter notebook.

You can either use the very nice deepo docker image to setup a local notebook server with all the machine learning tools pre-installed or just use google colab.

I’ll use Pytorch to illustrate the maths with some code.

from torch import nn

class MinimalMatrixFactorization(nn.Module):
    def __init__(self, n_users, n_items, n_features=20):
        super().__init__()
        self.user_features = nn.Embedding(n_users, n_features)
        self.item_features = nn.Embedding(n_items, n_features)
        self.user_features.weight.data.uniform_(-0.01, 0.01)
        self.item_features.weight.data.uniform_(-0.01, 0.01)

    def forward(self, user, item):
        return (self.user_features(user) * self.item_features(item)).sum(1)

def train_model(model, dataloader, criterion, optimizer, n_epochs=15):
    loss_hist = []
    for _ in tqdm(range(n_epochs)):
        curr_loss = 0
        for user_ids, item_ids, ratings in dataloader:
            optimizer.zero_grad()
            
            # Predict and calculate loss
            prediction = model(user_ids, item_ids)
            loss = criterion(prediction, ratings)
            
            curr_loss += loss.item()
                
            # Backpropagate
            loss.backward()
            # Update the parameters
            optimizer.step()

        curr_loss /= len(dataloader)
        loss_hist.append(curr_loss)
        print(f"RMSE: {np.sqrt(curr_loss):.10} \t MSE: {curr_loss:.10}")
    
    plt.plot(loss_hist)

And there you have the simplest form of matrix factorization.

Making recommendations

The process of making recommendations is then simply a similarty measure between users and items. The favored metric for that is the cosine similarity defined as follow:

$$ \text{coss}(u, i) = \frac{\langle u, i \rangle}{\lVert u \rVert \times \lVert i \rVert} $$

it measures the cosine of the angle $\theta$ between the two vectors $u$ and $i$ because by definition $\langle u, i \rangle = \lVert u\rVert \times \lVert i\rVert \cos(\theta)$.

Now we can compute the similarity between a user and the items, ranking them accordingly.

coss = nn.CosineSimilarity(dim=-1, eps=1e-6)
u = model.user_features(torch.as_tensor(42)) # retreive the embeddings of user 42
items = model.item_features.weight

similarities = coss(u, items)
values, indices = similarities.topk(14)

for rank, (v, i) in enumerate(zip(values, indices), 1):
    print(f"{rank}. id: {i.item()}, similarity: {v.item()})

Extending the model

Adding bias

Some user tend to like everything and some items tend to be liked by a majority of users. Introducing user and item biases will help our model learn things that matter.

Let the user/item bias $B_{ui} = \mu + \frac{b_u + \tilde{b}_i}{2}$ where $\mu$ is a constant, $b_u$ a scalar that depends on the user and $\tilde{b}_i$ one that depends on the item.

Intuitively, $\mu$ could be the global average rating and $b_u$ and $\tilde{b}_i$ the specific deviation of the item and the user, i.e. the difference between the global average rating and their own average rating.

$$ \begin{aligned} \mu &= \frac{1}{n}\sum_{u, i \text{ } \in \text{ training set}}{R_{ui}} \\ b_u &= \frac{1}{n_u}\sum_{i \text{ rated by }u}{R_{ui}} - \mu \\ \tilde{b}_i &= \frac{1}{n_i}\sum_{u \text{ that rated }i}{R_{ui}} -\mu \end{aligned} $$

Now we can just add bias matrix to our model equation:

$$ \|W \odot (U^\top I + B - R)\|^2 + \lambda(\|U\|^2 + \|I\|^2) $$

Those intuitive bias proved to be the best in practice, outperforming learned bias by a wide margin.

We can initialize the bias before the training process like so:

class BiasMatrixFactorization(MinimalMatrixFactorization):
    def __init__(self, n_users, n_items, n_features=20):
        super().__init__(n_users, n_items, n_features)
    
    def init_bias(self, train_df):
        self.user_bias = torch.zeros(len(self.user_features.weight))
        self.item_bias = torch.zeros(len(self.item_features.weight))

        # The average known rating
        self.mu = train_df['rating'].mean()

        # The user deviation from the average known rating
        ubias = train_df.groupby(train_df['user_id'])['rating'].mean() - self.mu
        self.user_bias[ubias.index] = torch.as_tensor(ubias.values, dtype=torch.float32) / 2

        # The item deviation from the average known rating
        ibias = train_df.groupby(train_df['item_id'])['rating'].mean() - self.mu
        self.item_bias[ibias.index] = torch.as_tensor(ibias.values, =torch.float32) / 2

    def forward(self, user, item):
        return (self.user_features(user) * self.item_features(item)).sum(1) + \
         self.user_bias[user].squeeze() + self.item_bias[item].squeeze() + self.mu

Folding, about spurious recommendations

“Good” models, i.e. with good performance according to the usual metrics (e.g. RMSE), may create unintentional overlap of disparate groups of users and items. This behaviour is called “folding” and can lead to spurious recommendations.
Furthermore, a single bad suggestion can outweigh many good ones (e.g. horror movies in Disney). Models are especially subject to folding during cold start phase.

To hinder this tendency, one can split the rating matrix in multiple sub-matrices, sperating user according to some metadata (e.g. the country they are from, their age…) but this results in a net loss of usefull information. Other leads are using specialized training approaches such as Bayesian Personalized Ranking or weighting the unknown.

The latter proved successful on the Netflix Prize. $W$ is set to the matrix such that $W_{ui} = 1$ if user $u$ rated item $i$ and $\alpha$ otherwise, where $\alpha$ is the roughly the order of sparsity of $R$.
Doing so shares evenly the total weight between observed and unobserved ratings, which will make the model less likely to mix unrelated groups.

Embedding meta-data

Knowing where the countries, the age groups or the genders are relatively from each other could be usefull to make group-specific recommendations, or to make more accurate recommendations during the cold start phase, when the user haven’t rated anything yet.

To embed these groups in the same space as the users and the items, we may:

  • average the embedding of the group’s users / items;
  • learn the group’s embedding and use it to compute the user’s ratings.

The latter is a recent invention (to our knowledge) of a research group I was in. We used an approach inspired from Natural Language Processing and added a shared “group vector” to the user or the item.

$$ \|W \odot ((U + G)^\top I + B - R)\|^2 + \lambda(\|U\|^2 + \|I\|^2 + \|G\|^2) $$

where $G$’s columns are a reference to the unique group vector corresponding to the user’s one. This approach warrants further research.

Other approaches

In addition to being among the simplest and -in my humble opinion- the most elegant methods, matrix factorization remains one of the most efficient collaborative filtering algorithm; but not the only one.

Neural matrix factorization

In its simplest form, NMF replaces the dot product between the users and items by a function learned with a multilayer perceptron.

$$ \|W \odot (f(U, I) + B - R)\|^2 + \lambda(\|U\|^2 + \|I\|^2) $$

We can either let the model learn everything at the same time or first fit the embeddings using calssical matrix factorization and then train the neural network after that.

Auto-encoders

Auto-encoders recently gathered quite a lot of attention. My findings are that basic auto-encoders are less intuitive and less practical to train than MF algorithms, and while the encoded embeddings did make vectorial sense in my experimentations, the results were worse than MF with a strong tendency to overfit.
They take the user represented as the vector of all of its ratings as input which does not scale well. The item are not naturaly embeddable. A trick can be to input a user with the maximum rating on a specific item and no other rating. This is works better than averaging the embeddings of the users that rated it (weighted by the rating) as it would pack all the items in the same region of the embedding space.

Training strategies

Auto-encoders trained for recommendation must learn to generalize and not limit themselves to reconstructing the (sparse) input vector. To do so, one can randomly mask some entries of the input vector and still penalize the model for getting these masked ratings wrong. This proved to be a key factor in the training of auto-encoders.

Another way to improve the training of auto-encoders for recommendation is to use dense re-feeding, a process in which the prediction (a dense vector) is re-fed to the neural network afer a first gradient update. Indeed, in an idealized scenario $\hat{y} = f(x)$ should be a fixed point of the auto-encoder.

Using contractive or denoising auto-encoders could lead to a better latent space quality.

Even more recently, variationnal auto-encoders were on the more promising side. Indeed, approximating distributions rather than data points makes them more robust than traditional auto-encoders. They also offer the possiblility to disentangle the latent space which could be interesting for recommendation purposes.