Lesson-7: An Introduction to Recommendation Systems and Collaborative Filtering

In this lesson Jeremy takes us through the idea of recommendation systems, collaborative filtering, embeddings, and latent factors. Chapter-8 in the book covers the exact same information as this lesson.
Author: Uzair Tahamid Siam

Published: April 10, 2023

Recommendation systems are a huge topic of research and of particular interest in industry. Let’s look at one of the most prominent algorithms for recommendation systems: collaborative filtering.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from fastai.collab import *
from fastai.tabular.all import *

Download and inspect data


path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user','movie','rating','timestamp'])
ratings.head()
fig, ax = plt.subplots()
ratings.rating.plot(kind='hist', ax=ax)
ax.set_xlabel('Rating');

The movie ids on their own aren’t very informative. Let’s grab the titles.


movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                         usecols=(0,1), names=('movie','title'), header=None)
movies.head()

ratings = ratings.merge(movies)
ratings.head()

The ratings view is not very informative. What we want to see is, for each user, how they rated every movie. To get a table like that we can use the pd.crosstab function, with users as rows, movies as columns, and ratings as values. One thing you will notice is that there are a LOT of NaN values in the table, because most users have not rated most of the movies. This is very common in recommendation datasets, and the entire goal of this exercise is to fill in those gaps, a task known as matrix completion.

# each (user, movie) pair is assumed to have a single rating, so 'first' just picks it
pd.crosstab(ratings.user, ratings.movie, values=ratings.rating, aggfunc='first')
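To make the sparsity concrete, here is a small sketch in plain Python. The ratings triples are made up for illustration (they are not taken from MovieLens); the sketch builds the same kind of user-by-movie table and measures how much of it is actually filled in.

```python
# Hypothetical (user, movie, rating) triples; not taken from the MovieLens data.
triples = [(1, 'A', 5), (1, 'B', 3), (2, 'A', 4), (3, 'C', 1)]

users = sorted({u for u, _, _ in triples})
movies = sorted({m for _, m, _ in triples})

# Build the user x movie table, using None where a user has not rated a movie.
observed = {(u, m): r for u, m, r in triples}
table = [[observed.get((u, m)) for m in movies] for u in users]

# Fraction of cells that actually hold a rating.
density = len(observed) / (len(users) * len(movies))

for u, row in zip(users, table):
    print(u, row)
print(f"filled cells: {density:.0%}")
```

For comparison, MovieLens 100k has 100,000 ratings spread over 943 users and 1,682 movies, so only about 6% of the cells in the full table are filled.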

In order to figure out whether user 1 who liked movie 1 will also like movie 1673 that they haven’t watched, we need to somehow understand what exactly made user 1 like movie 1. Obviously if we had more metadata on the movies and the users, that would be a lot simpler - e.g. if we had the genre, year, and language of all the movies and the age, country, and occupation of the users.

However, we do not have this information. What we can do is LEARN this information from the data to create what are known as latent factors using SGD.

Learning Latents

We break this into four steps:

  • Step 1: Randomly initialize some parameters for both the users and items (in this case movies). This is much like the random initialization of weights in deep learning we have seen so far. How many parameters to use per user and per item is a choice you get to make. More on that later.

  • Step 2: Calculate the dot product between the latent factors of the items and the users. This gives us a sense of similarity between the items and the users. If, for instance, the first latent user factor represents how much the user likes action movies and the first latent movie factor represents whether the movie has a lot of action or not, the product of those will be particularly high if either the user likes action movies and the movie has a lot of action in it, or the user doesn’t like action movies and the movie doesn’t have any action in it.

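  • Step 3: Calculate the loss between the result of the dot product and the actual ratings given in the table. Missing ratings are simply left out of the loss, since there is nothing to compare the prediction against. For the loss function we will use mean squared error (MSE) for now, which is what MSELossFlat computes in the code further down.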

  • Step 4: Optimize the latent factors using SGD or whatever other optimization algorithms you wish.
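The four steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the ratings are made up, there are no biases, and the hand-written SGD loop stands in for what fastai and PyTorch do for us later in this post.

```python
import random

random.seed(0)

# Hypothetical observed ratings: (user index, movie index, rating).
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (2, 1, 5.0), (2, 2, 4.0)]
n_users, n_movies, n_factors = 3, 3, 2

# Step 1: small random latent factors for users and movies.
user_lfs = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
movie_lfs = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_movies)]

def predict(u, m):
    # Step 2: dot product of the user's and the movie's latent factors.
    return sum(a * b for a, b in zip(user_lfs[u], movie_lfs[m]))

def mse():
    # Step 3: mean squared error over the observed ratings only.
    return sum((predict(u, m) - r) ** 2 for u, m, r in ratings) / len(ratings)

loss_before = mse()

# Step 4: plain SGD, one observed rating at a time.
lr = 0.01
for epoch in range(2000):
    for u, m, r in ratings:
        err = predict(u, m) - r
        for k in range(n_factors):
            # Gradients of err**2 with respect to each factor.
            grad_u = 2 * err * movie_lfs[m][k]
            grad_m = 2 * err * user_lfs[u][k]
            user_lfs[u][k] -= lr * grad_u
            movie_lfs[m][k] -= lr * grad_m

loss_after = mse()
print(f"MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Once trained, predict(u, m) can be called for a (user, movie) pair that never appeared in ratings, which is exactly the matrix completion we are after.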

Creating the DataLoaders

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

Let’s create our random latent factors (lfs) with sizes (n_users, n_factors) and (n_items, n_factors) for users and items (movies) respectively.

n_users = len(dls.classes['user'])
n_items = len(dls.classes['title'])
n_factors = 5
user_lfs = torch.randn(n_users, n_factors)
movie_lfs = torch.randn(n_items, n_factors)

To calculate the result for a particular movie and user combination, we have to look up the index of the movie in our movie latent factor matrix, and the index of the user in our user latent factor matrix; then we can do our dot product between the two latent factor vectors. Looking things up by index is not an operation our deep learning models know how to do; they only multiply matrices.

This is where the idea of embeddings comes into play. An embedding is a clever computational trick: multiplying the embedding matrix by a one-hot encoded vector is equivalent to looking up a row in an array/matrix.

Embedding: multiplying by a one-hot-encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. This is quite a fancy word for a very simple concept. The thing that you multiply the one-hot-encoded matrix by (or, using the computational shortcut, index into directly) is called the embedding matrix.


one_hot_3 = one_hot(3, n_users).float()
user_lfs.t() @ one_hot_3
user_lfs[3]

This is quite a simple idea if you think about it. A one-hot encoded vector is sparse: exactly one entry is 1 and the rest are 0. Multiplying by it zeroes out every row (or column) of the matrix except the one you care about, which comes back unchanged.
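The same equivalence can be checked without any library. In this sketch the embedding values are made up, and one_hot and vec_mat are small helper functions written just for this illustration (they are not the fastai or PyTorch functions).

```python
# A tiny embedding matrix: 4 "users", 3 latent factors each (made-up values).
emb = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]

def one_hot(i, n):
    # Vector of length n with a 1.0 at position i and 0.0 elsewhere.
    return [1.0 if j == i else 0.0 for j in range(n)]

def vec_mat(v, m):
    # v @ m: each output entry is a weighted sum over the rows of m.
    return [sum(v[r] * m[r][c] for r in range(len(m))) for c in range(len(m[0]))]

idx = 2
via_matmul = vec_mat(one_hot(idx, len(emb)), emb)
via_lookup = emb[idx]
print(via_matmul == via_lookup)
```

The multiplication touches every entry of the matrix only to throw almost all of them away, which is why an embedding layer indexes directly instead.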

Collaborative Filtering From Scratch

We will use PyTorch to create our own collaborative filtering algorithm before using the one provided by fastai

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_lfs = Embedding(n_users, n_factors)
        self.movie_lfs = Embedding(n_movies, n_factors)
    
    def forward(self, x):
        users = self.user_lfs(x[:, 0])
        movies = self.movie_lfs(x[:, 1])
        return (users * movies).sum(dim = 1)
        

The __init__ method creates the embeddings with the Embedding layer, given their shapes. These layers are callables, which we can see being used in the forward method. The input x is a tensor of shape batch_size x 2 where the first column holds user indices and the second holds movie indices. We multiply the two embeddings elementwise and sum over dim=1, which gives one dot product per row of the batch. We could also have written this as torch.einsum('bf,bf->b', users, movies) instead of doing the multiply-and-sum ourselves.

model = DotProduct(n_users, n_items, n_factors)
learn = Learner(dls, model, loss_func=MSELossFlat())
lr = learn.lr_find().valley
learn.fit_one_cycle(5, lr)

Making the model better

Using biases and restricting output range using sigmoid_range

There are two problems that are easy to solve in our model.

  • There is no way to discriminate between users who like to score everything high vs those who like to score everything low.

  • There is nothing restricting our model outputs to the valid rating range of 1 to 5.

How do we handle these?

  • Add a bias term to the latent factors for both the users and the movies that should allow us to discriminate between those types of users.

  • We have previously restricted outputs to between 0 and 1 using a sigmoid. We can do something similar here with sigmoid_range, which rescales the sigmoid to an arbitrary range.
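As far as I understand it, fastai’s sigmoid_range is just a sigmoid that has been stretched and shifted. The scalar version below is a sketch of that idea in plain Python, not the library code itself.

```python
import math

def sigmoid_range(x, low, high):
    # Squash x into (0, 1) with a sigmoid, then stretch and shift to (low, high).
    return 1 / (1 + math.exp(-x)) * (high - low) + low

for raw in (-10, 0, 10):
    print(raw, round(sigmoid_range(raw, 0, 5.5), 3))
```

A sigmoid only approaches its upper limit and never reaches it, which is why the model below uses y_range = (0, 5.5) rather than stopping at 5: with a little headroom, a prediction of 5 becomes reachable.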

class DotProductBias(Module):
    def __init__(self, n_users, n_items, n_factors, y_range = (0, 5.5)):
        self.user_lfs = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.item_lfs = Embedding(n_items, n_factors)
        self.item_bias = Embedding(n_items, 1)
        self.y_range = y_range
    
    
    def forward(self, x):
        users = self.user_lfs(x[:, 0])
        items = self.item_lfs(x[:, 1])
        res = (users * items).sum(dim=1, keepdim=True)
        res += (self.user_bias(x[:, 0]) + self.item_bias(x[:, 1]))
        return sigmoid_range(res, *self.y_range)
    
model = DotProductBias(n_users, n_items, 5)
learn = Learner(dls, model, loss_func = MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

It seems there is a fair bit of overfitting going on here. The train loss is much lower than the validation loss. What can we do to fix that?

Using weight decay (L2 norm)

Regularization is one of the most common ways of dealing with overfitting. But what is regularization?

Regularization is a technique where we penalize the model not only based on the difference in the prediction and the label but also on the weights. There are many forms of regularization. Here we will use L2 regularization. Mathematically, this is what it looks like:

\[ L = \frac{1}{M}\sum_{m=1}^{M}(\hat{y}_m - y_m)^2 + \lambda \sum_{n=1}^{N} w_n^2 \]

The first part of our sum is just the original loss function, averaged over the \(M\) training examples. The second part is the regularization term, summed over all \(N\) weights \(w_n\). Like the learning rate, \(\lambda\) is a scalar hyperparameter; it scales the regularization term and is called the regularization hyperparameter or weight decay factor.

Looking closely at the second part of the sum, we see that if the weights are too big the loss increases, so the optimizer is pushed to shrink them. Overall, the model will try to keep the weights from becoming too large, but also not too small, since weights of 0 would mean our model has learned nothing.
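A small worked example may help here. The sketch below uses made-up weights and looks only at the penalty term, ignoring the data loss, to show the direction in which the penalty pulls the weights; the function names are my own.

```python
def l2_penalty(weights, wd):
    # The regularization term: wd * sum of squared weights.
    return wd * sum(w * w for w in weights)

def l2_penalty_grad(weights, wd):
    # The derivative of wd * w**2 with respect to w is 2 * wd * w.
    return [2 * wd * w for w in weights]

weights = [3.0, -0.5, 0.0]
wd, lr = 0.1, 0.5

print("penalty:", l2_penalty(weights, wd))

# One SGD step on the penalty alone: every weight moves towards zero,
# and larger weights move more.
stepped = [w - lr * g for w, g in zip(weights, l2_penalty_grad(weights, wd))]
print("after one step:", stepped)
```

Because the gradient of the penalty is proportional to the weight itself, each step shrinks every weight by a constant fraction, which is where the name weight decay comes from.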

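To add weight decay, we just pass the wd argument when fitting. We also re-create the model so that training starts from fresh weights rather than continuing from the previous run.

model = DotProductBias(n_users, n_items, 5)
learn = Learner(dls, model, loss_func = MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd = 0.1)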

Embeddings

We have been using Embeddings that PyTorch or fastai provides us so far. But what exactly is going on in there? Let’s try to recreate our own!

To do this we need a way to:

  • create randomized matrices of a given shape

  • take the gradient with respect to these random matrices

class T(Module):
    def __init__(self): self.a = torch.ones(3)

L(T().parameters())

Simply creating tensors doesn’t store the tensor as a parameter. And in PyTorch anything that is defined as a parameter is automatically tagged with requires_grad=True. To create tensors that are actually parameters we need to wrap the tensors with nn.Parameter.

class T(Module):
    def __init__(self): self.a = nn.Parameter(torch.ones(3))

L(T().parameters())

Now that that’s fixed, let’s create a function that takes a shape as input and returns a randomized tensor sampled from a normal distribution!

def create_parameters(shape):
    return nn.Parameter(torch.zeros(*shape).normal_(0, 0.01))

class DotProductBias(Module):
    def __init__(self, n_users, n_items, n_factors, y_range = (0, 5.5)):
        self.user_lfs = create_parameters([n_users, n_factors])
        self.user_bias = create_parameters([n_users, 1])
        self.item_lfs = create_parameters([n_items, n_factors])
        self.item_bias = create_parameters([n_items, 1])
        self.y_range = y_range
    
    
    def forward(self, x):
        users = self.user_lfs[x[:, 0]]
        items = self.item_lfs[x[:, 1]]
        res = (users * items).sum(dim=1, keepdim=True)
        res += (self.user_bias[x[:, 0]] + self.item_bias[x[:, 1]])
        return sigmoid_range(res, *self.y_range)
    
model = DotProductBias(n_users, n_items, 5)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

Interpreting Embeddings and Biases

Our model is already useful, in that it can provide us with movie recommendations for our users. But it is also interesting to see what parameters it has discovered.

Let’s look at:

  • the biases

  • the latent space the model has learned, using a dimensionality reduction technique called Principal Component Analysis (PCA) to reduce our n_factors-dimensional space to, say, a 3-dimensional space that we can plot!

Model Bias

learn.model.item_bias
movie_bias = learn.model.item_bias.squeeze()
indx = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in indx]

Let’s take a moment to understand what the bias actually means in this context.

The way to interpret the biases is as follows:

Even users who typically enjoy movies like “Children of the Corn: The Gathering (1996)” gave it a low rating. For instance, a user (let’s call them “u”) who typically rates movies with comparable characteristics (as defined by their embeddings) highly would still give “Children of the Corn: The Gathering (1996)” a low rating.

movie_bias = learn.model.item_bias.squeeze()
indx = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in indx]

Conversely,

Even users who typically DO NOT enjoy movies similar to “Titanic (1997)” gave it a high rating.

Embedding or Latent Space representation

Interpreting the bias was easy because it is just a vector. Our embeddings, however, are matrices with many dimensions, and we can’t really interpret them as they are. The best we can do is plot them to understand their relative meanings. This is where dimensionality reduction is useful. We will use an algorithm called PCA to project our embedding space onto a lower-dimensional space. PCA can be computed with a technique from linear algebra called Singular Value Decomposition (SVD). To get an intuition for this technique, take a look at this StatQuest video.
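To get a feel for what PCA does before applying it to the embeddings, here is a toy sketch in plain Python on made-up 2-D points. Note that it finds the first principal direction by power iteration on the covariance matrix rather than by SVD; that is a substitution I made to keep the example dependency-free, and it is not how the pca method used in this post is implemented.

```python
import math

# Made-up 2-D points that lie roughly along the line y = x.
points = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]

# Centre the data so each column has mean zero.
n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
centred = [(x - mx, y - my) for x, y in points]

# Entries of the 2x2 covariance matrix.
cxx = sum(x * x for x, _ in centred) / (n - 1)
cxy = sum(x * y for x, y in centred) / (n - 1)
cyy = sum(y * y for _, y in centred) / (n - 1)

# Power iteration: repeatedly multiply a vector by the covariance matrix
# and normalise it; it converges to the direction of greatest variance.
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*w)
    v = (w[0] / norm, w[1] / norm)

# Project each point onto that direction: 2 dimensions reduced to 1.
projected = [x * v[0] + y * v[1] for x, y in centred]
print("first principal direction:", v)
print("1-D coordinates:", projected)
```

The direction found points along the diagonal, and the 1-D coordinates keep the ordering of the points along it. Reducing the movie embeddings to 3 components follows the same idea, just with more dimensions and more components.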

g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.item_lfs[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

We can see here that the model seems to have discovered a concept of classic versus pop culture movies, or perhaps it is critical acclaim that is represented here.

Using fastai.collab

Now that we have created our own versions, we should feel good about using libraries. Let’s look at the collab_learner from fastai, which does almost the same things (plus a few more).

learn = collab_learner(dls, n_factors = 50, y_range = (0, 5.5))
lr_valley = learn.lr_find().valley;
learn.fit_one_cycle(5, lr_valley, wd=0.15)

Let’s inspect the underlying model used by the collab_learner

learn.model

Let’s inspect the item bias (i_bias) weights.

movie_bias = learn.model.i_bias.weight.squeeze()
sorted_indx = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in sorted_indx]

Finding similar movies

We can look at embedding distances as well. So, given a movie, how similar is it to the rest of the movies? What movie in the dataset is the most similar to it?

movie_factors = learn.model.i_weight.weight
chosen_movie_indx = dls.classes['title'].o2i['Titanic (1997)']

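# cosine similarity between every movie and the chosen movie (higher means more similar)
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[chosen_movie_indx][None])

# sort from most to least similar; skip the first entry, which is the chosen movie itself
similar_movie_indxs = distances.argsort(descending=True)[1:]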


most_similar = dls.classes['title'][similar_movie_indxs[0]]
most_similar
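For readers who want to see what cosine similarity computes, here is a hand-rolled sketch in plain Python. The titles and embedding values are invented for illustration, and cosine_similarity is my own helper rather than the PyTorch module used above.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-factor embeddings; the titles and numbers are invented.
movie_factors = {
    'Movie A': [0.9, 0.1, 0.3],
    'Movie B': [1.8, 0.2, 0.6],   # same direction as A, twice the length
    'Movie C': [-0.5, 0.8, 0.0],
}

chosen = 'Movie A'
scores = {
    title: cosine_similarity(movie_factors[chosen], factors)
    for title, factors in movie_factors.items()
    if title != chosen
}
most_similar = max(scores, key=scores.get)
print(most_similar, scores)
```

Cosine similarity looks only at the direction of the vectors, so Movie B counts as essentially identical to Movie A even though its embedding is twice as long.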

Bootstrapping a Collaborative Filtering Model

Bootstrapping is the biggest problem for any sort of recommendation system. It refers to the question of what to suggest to someone about whom we have absolutely no data, e.g. a brand new user. In fact, if you think about it, this is hard even for humans. How do you suggest a movie to a friend who hasn’t watched any movies before? How do you buy your coworker a gift if they’ve never told you what kind of things they have used before?

There is no magic solution to this problem, and really the solutions are based on the domain, the usual practices in the company/domain, and some critical thinking. At the end of the day it all comes down to the idea of being careful about the machine learning systems we deploy and ensuring that we deploy them gradually with human supervision.

Some ways to handle this problem are:

  • If this isn’t your first ever user, you can use the mean of all the other users to impute values into the row of the new user. Of course, as usual, means are very susceptible to outliers: if some enthusiastic people who watch only certain types of movies are overrepresented, the system will not work well for the new user and may even get worse for others. There is also the problem that a particular combination of averaged factors may be very uncommon, such as a romantic movie with a lot of horror (for instance, if the averages for the romance and horror factors are both very high while everything else is much lower).

  • Use a tabular model based on user metadata to construct your initial embedding vector. When a user signs up, think about what questions you could ask to help you understand their tastes - this is what companies like Netflix do if you haven’t noticed. Then you can create a model in which the dependent variable is a user’s embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup metadata.
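The first option above can be sketched as follows. The user embeddings are made up, and the median is my own addition for comparison (the lesson only discusses the mean), included to show how a single extreme user shifts the result.

```python
import statistics

# Hypothetical learned user embeddings: 4 existing users, 2 latent factors.
# The last user is a deliberately extreme outlier.
user_lfs = [
    [0.1, 0.2],
    [0.2, 0.1],
    [0.3, 0.3],
    [5.0, -4.0],
]

# Column-wise mean: one possible starting embedding for a brand new user.
mean_user = [statistics.mean(col) for col in zip(*user_lfs)]

# Column-wise median: less sensitive to the outlier (an assumption of mine,
# not something the lesson proposes).
median_user = [statistics.median(col) for col in zip(*user_lfs)]

print("mean  :", mean_user)
print("median:", median_user)
```

The mean is dragged far away from where three of the four users actually sit, which is the outlier problem described in the first bullet.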

Recommendation systems are also very susceptible to positive feedback loops. If a small number of your users tend to set the direction of your recommendation system, they are naturally going to end up attracting more people like them to your system. And that will, of course, amplify the original representation bias at an exponential rate.

Deep Learning for Collaborative Filtering

So far, we have looked at recommendation systems without using any neural networks. What we have been doing is formally called probabilistic matrix factorization (PMF).

As we know, deep learning methods love when they get large matrices that they can quickly perform matrix multiplications on using GPUs. We can do Collaborative Filtering using Deep Learning by representing our latent embeddings using large matrices and passing them through linear layers.

However, the first important thing to think about is what do we want our embedding sizes to be? We have used n_factors=5 and n_factors=50 but is there a way to find out the “best possible size?” Well, we would need to do a lot of experiments for that. But the next best thing we can do is probably use the results of other people’s experiments; more specifically fastai’s experiments. fastai has a function get_emb_sz() that takes in a DataLoaders object and returns some recommended sizes. Let’s use that and build our CollabNN() model.
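To my knowledge, the sizes that get_emb_sz recommends come from a simple rule of thumb based on the number of categories. The function below is a sketch of that rule as I understand it; treat the exact constants as an assumption and check the fastai source if they matter to you.

```python
def emb_sz_rule(n_cat):
    # Heuristic: grow the embedding width slowly with the number of
    # categories, and never go above 600.
    return min(600, round(1.6 * n_cat ** 0.56))

for n_cat in (10, 944, 1635):
    print(n_cat, emb_sz_rule(n_cat))
```

The width grows much more slowly than the number of categories, so going from hundreds to thousands of users or movies adds only a few dozen extra factors.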

embs = get_emb_sz(dls)
embs
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_hidden = 100):
        self.user_lfs = Embedding(*user_sz)
        self.item_lfs = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1],n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1)
        )
        self.y_range = y_range
    
    def forward(self, x):
        embeddings = torch.cat([self.user_lfs(x[:, 0]), self.item_lfs(x[:, 1])], dim=1)
        x = self.layers(embeddings)
        return sigmoid_range(x, *self.y_range)
        

model = CollabNN(*embs)
learn = Learner(dls, model, loss_func = MSELossFlat())
lr_valley = learn.lr_find().valley
learn.fit_one_cycle(5, lr_valley, wd=0.01)

Using fastai’s collab_learner for Collab Filtering with Deep Learning

All we have to do is set use_nn to True and specify the layers parameter. Below we train a network with 2 hidden layers!

learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)

The neural network does not seem to have improved performance. In practice, most companies use a hybrid of the DotProduct model and the CollabNN model, so that they get the best of both worlds. In fact, if these companies have more metadata on the users and the movies (which they usually do), the CollabNN approach tends to perform much better, because that metadata can be fed into the network alongside the embeddings.

Conclusion

This was my first time learning about a lot of these ideas, and for some, like embeddings, I came away with a whole new understanding. Next up are some chapters in the book that are not directly covered in the lessons. Two of these chapters are dedicated to computer vision tasks other than binary classification, such as multilabel classification and image segmentation. The other two are mostly about best practices for training state-of-the-art models and using the mid-level API and the DataBlock API effectively.