## WSDM Doctoral Consortium 2017

In February I visited Cambridge to attend the WSDM Doctoral Consortium. It took place on the first day of the conference, in parallel with some of the tutorials. It was a great time: we had excellent discussions about our projects with senior researchers and fellow Ph.D. candidates. Here is a photo for posterity.

And the program: http://www.wsdm-conference.org/2017/doctoral-consortium/

## A lower bound for expected value of log-sum

Lately, I have been working with Poisson Matrix Factorization models, and at some point I needed to work out a lower bound for $\text{E}_q[\log \sum_k X_k]$. After seeing some people use this lower bound without a good explanation, I decided to write this blog post. The derivation is also included as an appendix to my ECML-PKDD 2017 paper about a Poisson factorization model for recommendation.

The function $\log(\cdot)$ is concave, which means that:

$$\log(p_1 x_1+p_2 x_2) \geq p_1\log x_1+p_2 \log x_2, \quad \forall p_1,p_2:p_1+p_2=1,\ p_1,p_2 \geq 0$$

By induction, this property can be generalized to any convex combination of $x_k$ (that is, $\sum_k p_k x_k$ with $\sum_k p_k=1$ and $p_k \geq 0$):

$$\log \sum_k p_k x_k \geq \sum_k p_k\log x_k$$

Now, with random variables, we can create a similar convex combination by multiplying and dividing each random variable $X_k$ by $p_k$, and then apply the linearity of expectation:

$$\text{E}_q[\log \sum_k X_k] = \text{E}_q[\log \sum_k \frac{p_k X_k}{p_k}]$$

$$\log \sum_k p_k\frac{X_k}{p_k} \geq \sum_k p_k\log \frac{X_k}{p_k}$$

$$\Rightarrow\text{E}_q [\log \sum_k p_k\frac{X_k}{p_k}] \geq \sum_k p_k \text{E}_q[\log \frac{X_k}{p_k}]$$

$$\Rightarrow \text{E}_q [\log \sum_k X_k ] \geq \sum_k p_k \text{E}_q[\log X_k]- p_k\log p_k$$

If we want the tightest lower bound of this form, we can use Lagrange multipliers to choose the set of $p_k$ that maximizes the lower bound, subject to the constraint that they sum to 1.

$$L(p_1,\ldots,p_K) = \left(\sum_k p_k \text{E}_q[\log X_k]- p_k\log p_k\right)+\lambda \left(1-\sum_k p_k\right)$$

$$\frac{\partial L}{\partial p_k} =\text{E}_q[\log X_k]-\log p_k-1-\lambda = 0$$

$$\frac{\partial L}{\partial \lambda} =1-\sum_k p_k = 0 \Rightarrow \sum_k p_k = 1$$

$$\Rightarrow\text{E}_q[\log X_k]=\log p_k+1+\lambda$$

$$\Rightarrow \exp\text{E}_q[\log X_k]=p_k \exp(1+\lambda)$$

$$\Rightarrow \sum_k \exp\text{E}_q[\log X_k]=\exp(1+\lambda)\underbrace{\sum_k p_k}_{=1}$$

$$\Rightarrow p_k=\frac{\exp \{\text{E}_q[\log X_k]\}}{\sum_j \exp \{\text{E}_q[\log X_j]\}}$$

The final formula for $p_k$ is exactly the same as the one we find for the parameters of the Multinomial distribution of the auxiliary variables in a Poisson model whose rate parameter is a sum of Gamma-distributed latent variables. Also, using this optimal $p_k$, we can write a tight bound without the auxiliary variables. Substituting $\log p_k = \text{E}_q[\log X_k] - \log \sum_j \exp \{\text{E}_q[\log X_j]\}$, the terms in $\text{E}_q[\log X_k]$ cancel:

$$\text{E}_q [\log \sum_k X_k ] \geq \sum_k \frac{\exp \{\text{E}_q[\log X_k]\}}{\sum_j \exp \{\text{E}_q[\log X_j]\}}\text{E}_q[\log X_k]- \frac{\exp \{\text{E}_q[\log X_k]\}}{\sum_j \exp \{\text{E}_q[\log X_j]\}}\log \frac{\exp \{\text{E}_q[\log X_k]\}}{\sum_j \exp \{\text{E}_q[\log X_j]\}}$$

$$= \sum_k \frac{\exp \{\text{E}_q[\log X_k]\}}{\sum_j \exp \{\text{E}_q[\log X_j]\}} \log \sum_j \exp \{\text{E}_q[\log X_j]\}$$

$$= \log \sum_j \exp \{\text{E}_q[\log X_j]\} \underbrace{ \sum_k \frac{\exp \{\text{E}_q[\log X_k]\}}{\sum_j \exp \{\text{E}_q[\log X_j]\}} }_{=1}$$

This results in:

$$\text{E}_q [\log \sum_k X_k ] \geq \log \sum_k \exp \{\text{E}_q[\log X_k]\}$$
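
As a quick sanity check, the bound can be verified numerically. The sketch below is only an illustration under assumed settings: the Gamma shapes, the sample size and the helper names (`optimal_weights`, `lower_bound`, `log_sum_exp`) are not from the paper, and the expectations are replaced by Monte Carlo averages.

```python
import math
import random


def optimal_weights(expected_logs):
    """p_k proportional to exp(E[log X_k]), computed in a numerically stable way."""
    m = max(expected_logs)
    w = [math.exp(e - m) for e in expected_logs]
    total = sum(w)
    return [x / total for x in w]


def lower_bound(expected_logs, p):
    """sum_k p_k E[log X_k] - p_k log p_k for a given convex combination p."""
    return sum(pk * (e - math.log(pk)) for pk, e in zip(p, expected_logs))


def log_sum_exp(values):
    """log sum_k exp(values_k), the closed form of the optimized bound."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


rng = random.Random(0)
shapes = [1.5, 2.0, 4.0]  # assumed Gamma shapes (scale 1), for illustration only
n = 20_000
draws = [[rng.gammavariate(a, 1.0) for a in shapes] for _ in range(n)]

# Monte Carlo estimates of E[log sum_k X_k] and of each E[log X_k].
lhs = sum(math.log(sum(row)) for row in draws) / n
expected_logs = [sum(math.log(row[k]) for row in draws) / n
                 for k in range(len(shapes))]

p_opt = optimal_weights(expected_logs)
p_uniform = [1.0 / len(shapes)] * len(shapes)
tight = lower_bound(expected_logs, p_opt)      # equals log_sum_exp(expected_logs)
loose = lower_bound(expected_logs, p_uniform)  # any other p gives a smaller value
print(lhs, tight, loose)
```

Because Jensen's inequality holds for every individual draw, the ordering `loose <= tight <= lhs` also holds for the empirical averages, not just in the limit.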

## Paper accepted at European Conference on Machine Learning (ECML-PKDD) 2017

We have a paper accepted at ECML-PKDD 2017: “Content-Based Social Recommendation with Poisson Matrix Factorization” (Eliezer de Souza da Silva, Helge Langseth and Heri Ramampiaro). This is our first full paper resulting from our research on Poisson factorization and the integration of multiple sources of information in a single recommendation model. If you are interested in the paper, please email me and I will be happy to discuss it.

Also, I am uploading the supplement of the paper here (you can also find it on my publications page).

Supplementary material for: “Content-Based Social Recommendation with Poisson Matrix Factorization”

## Hidden Markov Models (part II): forward-backward algorithm for marginal conditional probability of the states

(in the same series HMM (part I): recurrence equations for filtering and prediction)

Consider a Hidden Markov Model (HMM) with hidden states $x_t$ (for $t \in \{1, 2, \cdots, T\}$), initial probability $p(x_1)$, observed states $y_t$, transition probability $p(x_t|x_{t-1})$ and observation model $p(y_t|x_t)$. This model can be factorized as $p(x_{1:T},y_{1:T}) = p(y_1|x_1)p(x_1)\prod_{t=2}^{T}p(y_t|x_t)p(x_t|x_{t-1})$. We will use the notation $X=x_{1:T}$ to represent the set $X=\{x_1,x_2,\cdots,x_T\}$.

In this post we will present the details of the method to find the smoothing distribution $p(x_t|y_{1:T})$ of an HMM, given a set of observations $y_{1:T}$.

Our starting point is the marginal probability $p(x_t|y_{1:T})$ of $x_t$ given all the observations $y_{1:T}$:

$$\begin{aligned} p(x_t|y_{1:T}) &= \frac{p(x_t,y_{1:T})}{p(y_{1:T})} \\ &= \frac{p(x_t,y_{1:t},y_{(t+1):T})}{p(y_{1:T})}\\ &= \frac{p(y_{(t+1):T}|x_t,y_{1:t})\,p(x_t,y_{1:t})}{p(y_{1:T})}\\ &= \underbrace{p(y_{(t+1):T}|x_t)}_{\beta_t(x_t)}\underbrace{p(x_t,y_{1:t})}_{\alpha_t(x_t)}\frac{1}{p(y_{1:T})} \\ &= \frac{\alpha_t(x_t) \beta_t(x_t)}{p(y_{1:T})} \end{aligned}$$

The fourth line uses the fact that, given $x_t$, the future observations $y_{(t+1):T}$ are independent of the past observations $y_{1:t}$.
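
To make the decomposition into $\alpha_t(x_t)$ and $\beta_t(x_t)$ concrete, here is a minimal sketch for an HMM with discrete states and observations. The function name `forward_backward`, the table layout and the toy numbers are assumptions for illustration. The recursions used for $\alpha$ and $\beta$ are the standard ones and are not derived in this excerpt.

```python
def forward_backward(init, trans, emit, obs):
    """Smoothing marginals p(x_t | y_{1:T}) for a discrete HMM.

    init[i]     = p(x_1 = i)
    trans[i][j] = p(x_t = j | x_{t-1} = i)
    emit[i][y]  = p(y_t = y | x_t = i)
    """
    S, T = len(init), len(obs)

    # Forward pass: alpha[t][i] = p(x_t = i, y_{1:t}).
    alpha = [[0.0] * S for _ in range(T)]
    for i in range(S):
        alpha[0][i] = init[i] * emit[i][obs[0]]
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = emit[j][obs[t]] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(S))

    # Backward pass: beta[t][i] = p(y_{(t+1):T} | x_t = i), with beta[T-1] = 1.
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(S))

    evidence = sum(alpha[T - 1])  # p(y_{1:T})
    gamma = [[alpha[t][i] * beta[t][i] / evidence for i in range(S)]
             for t in range(T)]
    return gamma, evidence


# Toy two-state model with binary observations (made-up numbers).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
gamma, evidence = forward_backward(init, trans, emit, obs)
```

This version keeps $\alpha$ and $\beta$ unnormalized to mirror the formula above. For long sequences they underflow, so a practical implementation would rescale them at each step or work in log space.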

## Sampling from Dirichlet Distribution using Gamma distributed samples

There is an algorithm to generate Dirichlet samples using a sampler for the Gamma distribution that works for any $\alpha > 0$ and $\beta > 0$. We will generate Gamma distributed variables $z_k \sim \text{gamma}(\alpha_k,1)$, for $k \in \{1,\cdots,d\}$, and apply the following variable transformation to get Dirichlet samples: $x_k = \frac{z_k}{\sum_j z_j}$. First we should demonstrate that this transformation results in Dirichlet distributed samples.

Consider the following transformation $(z_1,\cdots,z_d) \rightarrow (x_1,\cdots,x_d,v)$, where $x_k = \frac{z_k}{\sum_j z_j}$ and $v = {\sum_j z_j}$. We can rewrite this transformation as $(x_1,\cdots,x_d,v)=h(z_1,\cdots,z_d)$, where $x_k = \frac{z_k}{v}$ and $v = {\sum_j z_j}$. We can also immediately calculate the inverse transformation $(z_1,\cdots,z_d)=h^{-1}(x_1,\cdots,x_d,v)$, with $z_k=v x_k$. From the definition of the transformation we know that ${\sum_{k=1}^d x_k=1}$, implying that $x_d = 1-\sum_{k=1}^{d-1} x_k$ and $z_d=v(1-\sum_{k=1}^{d-1}x_k)$.
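
The sampling recipe itself is short. The sketch below uses Python's standard `random.gammavariate` as the Gamma sampler. The function name `sample_dirichlet`, the concentration values and the sample size are assumptions for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent Gamma draws."""
    z = [rng.gammavariate(a, 1.0) for a in alphas]  # z_k ~ gamma(alpha_k, 1)
    v = sum(z)                                      # v = sum_k z_k
    return [z_k / v for z_k in z]                   # x_k = z_k / v


rng = random.Random(42)
alphas = [1.5, 2.5, 4.0]  # made-up concentration parameters
samples = [sample_dirichlet(alphas, rng) for _ in range(20_000)]

# The Dirichlet mean is alpha_k / sum(alphas); the empirical means should be close.
means = [sum(s[k] for s in samples) / len(samples) for k in range(len(alphas))]
print(means)
```

Every sample lies on the simplex by construction, since the components are positive and are divided by their own sum.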

## Probabilistic models for Recommender systems (part I): Probabilistic Matrix Factorization

In Recommender Systems design we are faced with the following problem: given incomplete information about user preferences, content, user-item ratings and context, learn the users' preferences and suggest new items to them based on features such as:

• the user's previous pattern of item preferences;
• the preferences of users with similar rating patterns;
• contextual information: location, time (of the day, week, month, year), social network.

This is usually formulated as a matrix completion problem, using Matrix Factorization techniques to find an optimal solution. In this case, latent features for users and items are inferred from the observed/rated items of each user, and the missing entries are estimated from these latent features. One major modelling tool for this problem is probabilistic modelling, and there are many proposals in the literature for different Probabilistic Matrix Factorization approaches. We will briefly discuss some of these models, starting with the seminal paper: Probabilistic Matrix Factorization (PMF) – [Salakhutdinov and Mnih, 2008, NIPS].
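
To illustrate the matrix completion view, here is a minimal sketch of fitting user and item latent features to the observed ratings only. It assumes the simplest PMF setting, Gaussian noise with zero-mean Gaussian priors, where MAP estimation reduces to squared error with L2 regularization. The function names, the toy ratings and the hyperparameters are made up, and plain stochastic gradient descent is used here purely for simplicity.

```python
import math
import random


def predict(U, V, u, i):
    """Predicted rating: inner product of user and item latent features."""
    return sum(a * b for a, b in zip(U[u], V[i]))


def train_pmf(ratings, n_users, n_items, k=2, lr=0.01, reg=0.05,
              epochs=500, seed=0):
    """SGD on squared error over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for f in range(k):
                u_f, v_f = U[u][f], V[i][f]  # keep old values for both updates
                U[u][f] += lr * (err * v_f - reg * u_f)
                V[i][f] += lr * (err * u_f - reg * v_f)
    return U, V


def rmse(ratings, U, V):
    """Root mean squared error on a list of (user, item, rating) triples."""
    se = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))


# Toy data: 4 users, 3 items, only some entries observed.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
U0, V0 = train_pmf(ratings, 4, 3, epochs=0)  # random initialization only
U, V = train_pmf(ratings, 4, 3)
print(rmse(ratings, U0, V0), rmse(ratings, U, V))
print(predict(U, V, 0, 2))  # estimate for an unobserved (user, item) pair
```

The last line is the point of the exercise: once the latent features are learned, any missing entry can be estimated from the corresponding inner product.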


## Hidden Markov Models (part I): recurrence equations for filtering and prediction

This semester I will be attending the doctoral course MA8702 – Advanced Modern Statistical Methods with the excellent Prof. Håvard Rue. It will be a course about statistical models defined over sparse structures (chains and graphs). We will start with Hidden Markov Chains and then move on to Gaussian Markov Random Fields, Latent Gaussian Models and approximate inference with the Integrated Nested Laplace Approximation (INLA). All these models are interesting for my research objective of developing sound latent models for recommender systems, and I am really happy to be taking this course with such a great teacher and researcher.

So, I will try to cover some of the course material, starting with what we saw in the first lecture: exact recurrences for Hidden Markov Chains and dynamic programming. In other words, general equations for prediction, filtering, smoothing, sampling, mode and marginal likelihood calculation in state-space models with latent variables. We will start by introducing the general model and specifying how to obtain the prediction and filtering equations.

• Markovian property: $\pi(x_1,x_2,\cdots,x_T)=\pi(x_{1:T})=\prod_{t=1}^{T}\pi(x_t|x_{t-1})$, with $\pi(x_1|x_0)=\pi(x_1)$
• $y_t$ are observed and $x_t$ are latent, so $\pi(y_t|x_t)$ is always known.
• If we know $x_{t-1}$, then no other variable will add any information to the conditional distribution of $x_t$.
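
For a chain with finitely many states, the prediction and filtering recurrences can be sketched as follows. This is only an illustration under assumptions: the excerpt does not state the recurrences, so the code uses the standard ones, prediction $\pi(x_t|y_{1:t-1})=\sum_{x_{t-1}}\pi(x_t|x_{t-1})\pi(x_{t-1}|y_{1:t-1})$ and filtering $\pi(x_t|y_{1:t})\propto\pi(y_t|x_t)\pi(x_t|y_{1:t-1})$. The function name `hmm_filter` and the toy numbers are made up.

```python
import math


def hmm_filter(init, trans, emit, obs):
    """Prediction and filtering distributions for a discrete-state chain.

    init[i]     = pi(x_1 = i)
    trans[i][j] = pi(x_t = j | x_{t-1} = i)
    emit[i][y]  = pi(y_t = y | x_t = i)
    Returns (predicted, filtered, log marginal likelihood).
    """
    S = len(init)
    predicted, filtered = [], []
    log_lik = 0.0
    prior = list(init)
    for t, y in enumerate(obs):
        if t > 0:
            # Prediction step: propagate the previous filtering distribution.
            prior = [sum(filtered[-1][i] * trans[i][j] for i in range(S))
                     for j in range(S)]
        predicted.append(prior)
        # Filtering step: weight by the likelihood of y_t and normalize.
        unnorm = [emit[j][y] * prior[j] for j in range(S)]
        norm = sum(unnorm)  # pi(y_t | y_{1:t-1})
        filtered.append([u / norm for u in unnorm])
        log_lik += math.log(norm)
    return predicted, filtered, log_lik


# Toy two-state model with binary observations (made-up numbers).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
predicted, filtered, log_lik = hmm_filter(init, trans, emit, obs)
```

The normalizing constants are the one-step predictive likelihoods, so summing their logarithms gives the log marginal likelihood of the observations as a by-product.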

## Embedding GIF animations in IPython notebook

A couple of days ago I was thinking about the possibility of generating GIF animations using matplotlib and visualizing these animations directly in IPython notebooks. So I did a quick search for possible solutions and found one to embed videos and another to convert matplotlib animations to GIF. I combined both into a third solution: convert the animation to GIF using imagemagick (you will need to have it installed on your computer), encode the resulting file as a base64 string, and embed this string in an img tag. Using the HTML class from IPython.display, the resulting string will be displayed in the notebook (and saved with the notebook).

I will briefly explain the elements of my solution.
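
The encoding step can be sketched with the standard library alone. The helper name `gif_bytes_to_img_tag` and the file name in the usage comments are assumptions for illustration, not the exact code of the original solution. The matplotlib and IPython calls appear only as comments because they need those packages, plus ImageMagick, to be installed.

```python
import base64


def gif_bytes_to_img_tag(gif_bytes, alt="animation"):
    """Encode raw GIF bytes as base64 and wrap them in an HTML img tag."""
    encoded = base64.b64encode(gif_bytes).decode("ascii")
    return '<img src="data:image/gif;base64,{0}" alt="{1}" />'.format(encoded, alt)


# Usage sketch inside a notebook (assumes matplotlib, IPython and ImageMagick):
#   anim.save("animation.gif", writer="imagemagick", fps=10)  # anim: a FuncAnimation
#   with open("animation.gif", "rb") as f:
#       tag = gif_bytes_to_img_tag(f.read())
#   from IPython.display import HTML
#   HTML(tag)  # as the last expression of a cell, this renders the GIF

# Placeholder input: just the GIF header bytes, not a complete image.
tag = gif_bytes_to_img_tag(b"GIF89a")
print(tag)
```

Because the image data lives inside the tag itself, the animation travels with the notebook file and does not depend on an external GIF being present.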