# Probabilistic models for Recommender systems (part I): Probabilistic Matrix Factorization

In Recommender Systems design we are faced with the following problem: given incomplete information about user preferences, content information, user-item ratings, and contextual information, learn the users' preferences and suggest new items based on features such as:

• the user's previous pattern of item preferences;
• the preferences of users with similar rating patterns;
• contextual information: location, time (of day, week, month, year), social network.

This is usually formulated as a matrix completion problem, using Matrix Factorization techniques to find an optimal solution. In this case, latent features for users and items are inferred from the observed ratings of each user, and from these latent features the missing entries are estimated. One major modelling tool for this problem is probabilistic modelling, and the literature contains many different Probabilistic Matrix Factorization approaches. We will briefly discuss some of these models, starting with the seminal paper: Probabilistic Matrix Factorization (PMF) – [Salakhutdinov and Mnih, 2008, NIPS].

In this model for collaborative filtering, each entry of the user-item rating matrix is the result of a linear combination of the item's and user's latent attributes, perturbed by Gaussian white noise. The latent features are modeled as random vectors generated by a zero-mean isotropic multivariate Gaussian. We will use this model and show one of its key properties: computing its MAP estimator (the mode of the posterior) is equivalent to minimizing the squared Frobenius norm (sum of squared entries) of the error matrix (the difference between estimated and observed values), plus L2 regularization terms on the latent matrices.

Keeping the notation of the original paper, we have N users and M items. Each user i and each item j is represented in a latent space by a D-dimensional vector, $U_i$ and $V_j$ respectively (note that $\text{I}_k$ is the $k \times k$ identity matrix). The generative model with three parameters $(\sigma, \sigma_U, \sigma_V)$ is:

1. for $i \in [N]$ generate i.i.d. samples $U_i \sim \mathcal{N}(0,\sigma_U \text{I}_D)$, i.e., for $k \in [D]$, $U_{ik} \sim \mathcal{N}(0,\sigma_U)$;
2. for $j \in [M]$ generate i.i.d. samples $V_j \sim \mathcal{N}(0,\sigma_V \text{I}_D)$, i.e., for $k \in [D]$, $V_{jk} \sim \mathcal{N}(0,\sigma_V)$;
3. for $i \in [N]$ and $j \in [M]$ generate independent samples $R_{ij} \sim \mathcal{N}(U_i^TV_j,\sigma)$ (a variation would be $R_{ij} \sim \mathcal{N}(g(U_i^TV_j),\sigma)$, where the function $g$ is applied to bound the mean to the scale of the observed ratings).
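The three sampling steps above can be sketched with NumPy. The sizes and variance values below are illustrative assumptions; note that $\sigma$, $\sigma_U$, $\sigma_V$ are variances in this text, so we pass their square roots as standard deviations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and variances (not from the paper).
N, M, D = 5, 4, 2
sigma, sigma_U, sigma_V = 0.1, 1.0, 1.0

# Steps 1-2: latent user and item vectors; row i of U is U_i, row j of V is V_j.
U = rng.normal(0.0, np.sqrt(sigma_U), size=(N, D))
V = rng.normal(0.0, np.sqrt(sigma_V), size=(M, D))

# Step 3: each rating is the inner product U_i^T V_j plus Gaussian noise.
R = U @ V.T + rng.normal(0.0, np.sqrt(sigma), size=(N, M))
```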

With the full description of the model we can calculate the joint probability of a specific matrix $R$ and the latent features, given the parameters of the model.

$p(R,U,V|\sigma, \sigma_U, \sigma_V) = p(R|U,V,\sigma)p(U|\sigma_U)p(V|\sigma_V)$
$p(R,U,V|\sigma, \sigma_U, \sigma_V) = \prod_{i \in [N]} \prod_{j \in [M]} \mathcal{N}(R_{i,j} | U_i^TV_j,\sigma) \prod_{i \in [N]} \mathcal{N}(U_i | 0,\sigma_U\text{I}_D) \prod_{j \in [M]} \mathcal{N}(V_j | 0,\sigma_V\text{I}_D)$

Given that some elements of the rating matrix are not observed, we introduce an indicator variable $I_{ij}$ equal to 1 whenever the value $R_{ij}$ is observed and 0 otherwise. The likelihood conditioned on the parameters and latent features can then be written as
$p(R|U,V,\sigma, \sigma_U, \sigma_V) = \prod_{i \in [N]} \prod_{j \in [M]} \mathcal{N}(R_{i,j} | U_i^TV_j,\sigma)^{I_{ij}}$

The posterior of the U and V latent features can be expressed as
$p(U,V|R,\sigma, \sigma_U, \sigma_V) = \frac{p(R,U,V|\sigma, \sigma_U, \sigma_V)}{p(R|\sigma, \sigma_U, \sigma_V)}$
Taking the log we have
$\text{ln } p(U,V|R,\sigma, \sigma_U, \sigma_V) = \text{ln } p(R,U,V|\sigma, \sigma_U, \sigma_V) - \text{ln } p(R|\sigma, \sigma_U, \sigma_V)$

Since the log is monotonic, maximizing the log posterior is equivalent to maximizing the posterior. And since we are maximizing with respect to the matrices U and V, the term $\text{ln } p(R|\sigma, \sigma_U, \sigma_V)$ does not depend on U and V and can be ignored. So,
$\text{argmax}_{U,V}\text{ } p(U,V|R,\sigma, \sigma_U, \sigma_V) =\text{argmax}_{U,V}\text{ } \text{ln } p(R,U,V|\sigma, \sigma_U, \sigma_V)$

Now we will expand the terms of $\text{ln } p(R,U,V|\sigma, \sigma_U, \sigma_V)$ and show how this optimization can be done using gradient descent.

$\text{ln }\mathcal{N}(V_j | 0,\sigma_V\text{I}_D) = -(1/2)[D\text{ln }(2\pi)+D\text{ln }\sigma_V+\left \| V_j \right \|^2/\sigma_V ]$
$\text{ln }\mathcal{N}(U_i | 0,\sigma_U\text{I}_D) = -(1/2)[D\text{ln }(2\pi)+D\text{ln }\sigma_U+\left \| U_i \right \|^2/\sigma_U ]$
$\text{ln }\mathcal{N}(R_{ij} | U_i^TV_j,\sigma)^{I_{ij}} = -I_{ij}(1/2)[\text{ln }(2\pi)+\text{ln }\sigma+(R_{ij}-U_i^TV_j)^2/\sigma ]$
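As a sanity check, the log-density of the isotropic Gaussian prior can be compared against `scipy.stats.multivariate_normal` (assuming SciPy is available); the dimension, variance, and sample vector below are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

D, sigma_V = 3, 0.5                      # illustrative dimension and variance
v = np.array([0.2, -1.0, 0.7])           # an arbitrary latent vector

# ln N(v | 0, sigma_V I_D), written out term by term as in the text
manual = -0.5 * (D * np.log(2 * np.pi) + D * np.log(sigma_V) + v @ v / sigma_V)

# Reference value from SciPy
reference = multivariate_normal(np.zeros(D), sigma_V * np.eye(D)).logpdf(v)
print(np.isclose(manual, reference))
```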

Ignoring the constants but keeping the parameters, and using the fact that the log of a product is the sum of logs,
$\text{ln } p(R,U,V|\sigma, \sigma_U, \sigma_V) = \sum_{i \in [N], j \in [M]} \text{ln }\mathcal{N}(R_{ij} | U_i^TV_j,\sigma)^{I_{ij}}+\sum_{i \in [N]}\text{ln }\mathcal{N}(U_i | 0,\sigma_U\text{I}_D)+\sum_{j \in [M]}\text{ln }\mathcal{N}(V_j | 0,\sigma_V\text{I}_D)$

$\Rightarrow$ (Eq 1) $-2\text{ln } p(R,U,V|\sigma, \sigma_U, \sigma_V) = \frac{1}{\sigma}\sum_{i \in [N], j \in [M]} I_{ij}(R_{ij}-U_i^TV_j)^2+\frac{1}{\sigma_V}\sum_{j \in [M]}\left \| V_j \right \|^2+\frac{1}{\sigma_U}\sum_{i \in [N]}\left \| U_i \right \|^2+ \text{ln }\sigma \sum_{i \in [N], j \in [M]} I_{ij} + ND\text{ln }\sigma_U+MD\text{ln }\sigma_V$

$\Rightarrow$ (Eq 2) $L(U,V) =(1/2)( \sum_{i \in [N], j \in [M]} I_{ij}(R_{ij}-U_i^TV_j)^2+\frac{\sigma}{\sigma_V}\sum_{j \in [M]}\left \| V_j \right \|^2+\frac{\sigma}{\sigma_U}\sum_{i \in [N]}\left \| U_i \right \|^2 )$
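Equation 2 translates directly into a few lines of NumPy. The function name and signature below are my own; $\sigma$, $\sigma_U$, $\sigma_V$ are treated as variances, as in the derivation:

```python
import numpy as np

def pmf_loss(R, I, U, V, sigma, sigma_U, sigma_V):
    """Equation 2: squared error on observed entries plus L2 penalties."""
    E = I * (R - U @ V.T)          # residuals, zeroed on unobserved entries
    return 0.5 * (np.sum(E**2)
                  + (sigma / sigma_U) * np.sum(U**2)
                  + (sigma / sigma_V) * np.sum(V**2))
```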

So finding the U, V that maximize the posterior distribution of U, V given the data and parameters is equivalent to finding the U, V that minimize Equation 1 (which, for fixed $(\sigma, \sigma_U, \sigma_V)$, simplifies to Equation 2).

The Gradient Descent algorithm consists of updates in the direction that minimizes $L(U,V)$ until convergence.

$U_i^* \leftarrow U_i - \lambda_U\frac{\partial L(U,V)}{\partial U_i}$
$V_j^* \leftarrow V_j - \lambda_V\frac{\partial L(U,V)}{\partial V_j}$
with
$\frac{\partial L(U,V)}{\partial U_i} = -\sum_{j \in [M]} I_{ij}(R_{ij}-U_i^TV_j)V_j+\frac{\sigma}{\sigma_U}U_i$
$\frac{\partial L(U,V)}{\partial V_j} = -\sum_{i \in [N]} I_{ij}(R_{ij}-U_i^TV_j)U_i+\frac{\sigma}{\sigma_V}V_j$
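A minimal batch gradient-descent loop implementing these updates might look as follows. The hyperparameters, initialization scale, and fixed iteration count are illustrative choices, not from the paper; with the masked residual matrix $E$, the per-row gradients stack into matrix products:

```python
import numpy as np

def pmf_gd(R, I, D=2, sigma=0.1, sigma_U=1.0, sigma_V=1.0,
           lr=0.01, n_iters=500, seed=0):
    """Batch gradient descent on L(U, V); returns the fitted factors."""
    rng = np.random.default_rng(seed)
    N, M = R.shape
    U = 0.1 * rng.standard_normal((N, D))   # small random initialization
    V = 0.1 * rng.standard_normal((M, D))
    for _ in range(n_iters):
        E = I * (R - U @ V.T)                     # masked residuals
        grad_U = -E @ V + (sigma / sigma_U) * U   # rows are dL/dU_i
        grad_V = -E.T @ U + (sigma / sigma_V) * V # rows are dL/dV_j
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V
```

On a small, fully observed matrix the fitted product $UV^T$ should approach $R$ up to the shrinkage induced by the regularization terms.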

With Gradient Descent, each update touches all $D(N+M)$ parameters and sums over every observed rating, which is not feasible for large-scale applications. In this case Stochastic Gradient Descent (SGD) may be an alternative. The difference is that in GD we evaluate the gradient over all observed values for each update, whereas in SGD we evaluate the gradient at only one observed value per update.
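A single SGD step, evaluating the gradient at one observed rating, could be sketched as follows (the function and its in-place update style are my own simplification):

```python
import numpy as np

def sgd_step(R, U, V, i, j, sigma, sigma_U, sigma_V, lr):
    """One stochastic update using only the observed rating R[i, j]."""
    e = R[i, j] - U[i] @ V[j]              # residual for this single entry
    u_old = U[i].copy()                    # keep U_i as it was before its update
    U[i] += lr * (e * V[j] - (sigma / sigma_U) * U[i])
    V[j] += lr * (e * u_old - (sigma / sigma_V) * V[j])
```

In practice one sweeps over the observed $(i, j)$ pairs in random order, calling this update once per pair per epoch.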