Sampling from Dirichlet Distribution using Gamma distributed samples

There is an algorithm to generate Dirichlet samples using a sampler for Gamma distribution for any $\alpha > 0$ and $\beta > 0$. We will generate Gamma distributed variables $z_k \sim \text{gamma}(\alpha_k,1)$, for $k \in {1,\cdots,d}$, and do the following variable transformation to get Dirichlet samples $x_k = \frac{z_k}{\sum_k z_k}$. First we should demonstrate that this transformation results in Dirichlet distributed samples.

Consider the following tranformation $(z_1,\cdots,z_d) \leftarrow (x_1,\cdots,x_d,v)$, where $x_k = \frac{z_k}{\sum_k z_k}$ and $v = {\sum_k z_k}$. We can rewrite this transformation as $(x_1,\cdots,x_d,v)=h(z_1,\cdots,z_d)$, where $x_k = \frac{z_k}{v}$ and $v = {\sum_k z_k}$. Also we can imediatly calculate the inverse transformation $(z_1,\cdots,z_d)=h^{-1}(x_1,\cdots,x_d,v)$, with $z_k=v x_k$. From the transformation definition we know that ${\sum_{k=1}^d x_k=1}$, implying that $x_d = 1-\sum_{k=1}^{d-1} x_k$ and $z_d=v(1-\sum_{k=1}^{d-1}x_k)$.

Probabilistic models for Recommender systems (part I): Probabilistic Matrix Factorization

In Recommender Systems design we are faced with the following problem: given incomplete information about users preference, content information, user-items rating and  contextual information, learn the user preference and suggest new items for users based on features as:

• previous pattern of items preference of the user;
• preference of users with similar rating pattern;
• contextual information: location, time (of the day, week, month, year), social network.

This is usually formulated as a matrix completion problem using Matrix Factorization techniques to offer a optimal solution. Is this case latent features for users and item are inferred from the observed/rated items for each user, and from this latent features the missing entries are estimated. One major modelling tool for this problem is probabilistic modelling, and there are many proposals in the literature of different Probabillistic Matrix Factorization approaches. We will briefly discuss some of this models, starting with the seminal paper: Probabilistic Matrix Factorization (PMF) – [Salakhutdinov and Mnih, 2008, NIPS]. Continue reading “Probabilistic models for Recommender systems (part I): Probabilistic Matrix Factorization”

Hidden Markov Models (part I): recurrence equations for filtering and prediction

This semester I will be attending the doctoral course MA8702 – Advanced Modern Statistical Methods  with the excellent Prof. Håvard Rue. It will be course about statistical models defined over sparse structures (chains and graphs). We will start with Hidden Markov Chains and after go to Gaussian Markov Random Fields, Latent Gaussian Models and approximate inference with Integrated Nested Laplace Approximation (INLA). All this models are interesting for my research objective of developing sound latent models for recommender systems and I am really happy of taking this course with this great teacher and researcher. So, I will try to cover some of the material of the course, starting from what we saw in the first lecture: exact recurrence for Hidden Markov Chains and dynamic programming. In other words, general equations for predictions, filtering, smoothing, sampling, mode and marginal likelihood calculation of state-space model with latent variables. We will start by introduction the general model and specifying how to obtain the prediction and filtering equation.

• Markovian property: $\pi(x_1,x_2,\cdots,x_T)=\pi(x_{1:T})=\prod_{t=1}^{t=T}\pi(x_t|x_{t-1})$, with $\pi(x_1|x_0)=\pi(x_1)$
• $y_t$ are observed and $x_t$ are latent, so $\pi(y_t|x_t)$ is always known.
• If we know $x_{t-1}$ than no other variable will add any information to the conditional distribution of $x_t$.

Embedding GIF animations in IPython notebook

A couple of days ago I was thinking about the possibility of generating GIF animations using matplotlib and visualizing this animations directly in IPython notebooks. So I did a quick search to find possible solutions, and found one solution to embed videos, and other to convert matplotlib animations to gif, so I combined both solution in a third solution converting the animation to GIF using imagemagick (you will need to have it installed in your computer), enconding the resulting file in a base64 string and embedding this in a img tag. Using IPython.display HTML class this resulting string will be displayed in the notebook (and saved with the notebook).

I will briefly explain the elements of my solution.

So, as we approach the end of 2015 I want to post some personal and academic updates.

• During the whole year I have been revising the literature on a couple of areas that interest me, that includes: hashing algorithms for massive data analysis, computational geometry (coresets, higher order voronoi diagrams, sketches, high dimensional geometry), computational social science, network science, undecibility of theories, random metric models, random matrix theory, deep learning theory, approximate inference algorithms and graphical models for recommender systems. Some of the future blog posts will take a look on some of these subjects.
• Half of the year I was involved in a very interesting project at Brazilian Institute of Geography and Statistics (IBGE), planning and implementing the supporting systems for surveys research (using Windows Phone). Even though I was excited to move back to academy, it was nice to be in this kind of project.
• I was accepted as a PhD Fellow (stipendiat) in the Department of Computer and Information Science at Norwegian University of Science and Technology (NTNU) with a project about probabilistic models and algorithms for recommender systems, working with Prof. Helge Langseth and Heri Ramampiaro. This was a big change for me and Juliana, who was pregnant, but readily accepted the challenge of leaving hot Rio and coming to cold Trondheim. We were not sure that we would be able to beat the bureaucracy and get all the documents necessary for the visa and her travel. We arrived in Trondheim in August 28th, and I started working on September 1st.
• I had the opportunity to attend RecSys 2015 with departmental funding, even though I had literally just arrived at the department. This helped me a lot in gaining momentum in the writing of the initial research proposal, submitted to the faculty and now already accepted. During RecSys, I had the chance to analyze different research challenges and had some ideas that I am looking forward to try some of this ideas.
• I ran a self-study seminar series titled Approximate and Scalable Inference for Complex Probabilistic Models in Recommender Systems. It consisted on the study of probabilistic models and inference algorithms. It included Probabilistic Matrix Factorization, Poisson Matrix Factorization, Bayesian variants of those models, models with side information (social network, trust network, kernelized models), latent gaussian models, and dynamics models (State-space, tensor-factorization, based on time dependent Dirichlet process). We also studied variational inference and plan to study INLA (Integrated Nested Laplace Approximation) and sampling methods (MCMC, HMC, Gibbs sampling) applied to matrix factorization probabilistic models.
• Last, but not least, we had our first child, Samir, born in Oct. 23 here in Trondheim.

I am really looking forward to this new year, 2015 was a great year with many changes. I hope to consolidate many open threads of research in 2016, and continue to live a happy live with my family.

Exercise: angle between high-dimensional unit vectors

I stumbled across this trigonometric formula for high-dimensional vector in the (d-1)-sphere while proof-reading the last two LSH papers (“beyond LSH”, and “optimal data-dependent LSH”). It is an interesting formula because it present the angle between the points fully characterized by the distance between the points.

So, let’s do it.

Exercise: gradient of soft-max error function

The soft-max regression model can be used in the $k$ classes classification problem. The model consists of composition of probabilities distribution for each $k$ classes. So, the activation function $h_\theta(x^{(i)})$ is given by:

$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; \theta) \\ p(y^{(i)} = 2 | x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k | x^{(i)}; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_k^T x^{(i)} } \\ \end{bmatrix}$
And
$h_{\theta_n}(x^{(i)}) = p(y^{(i)} = n | x^{(i)}; \theta) = \frac{ e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }$

The error function is given by:
$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]=- \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log( h_{\theta_j}(x^{(i)}) )\right]$

Exact Algorithms from Approximation Algorithms? (part 1)

Alex Andoni starts with a reflection of how the theory community has evolved to tackle problems by approximate algorithms, but also algorithms that performs good given some constraints on the structured of the problem. This post is a nice overview of Locality-Sensitive Hashing and an introduction to how we could solve exact near neighbor search using approximate randomized algorithm as LSH too.

One great “soft” challenge in (T)CS I find to be how to go on to find useful algorithms for problems that we believe (or have even proven!) to be hard in general. Let me explain by giving the all-too-common-example:

Practitioner: I need to solve problem X.

Theoretician: Nice question, let me think… Hm, it seems hard. I can even prove that, under certain assumptions, we cannot do anything non-trivial.

Practitioner: Slick… But I still need to solve it. Can you do anything?

Theoretician: If I think more, I can design an algorithm that gives a solution up to factor 2 within the actual solution!

Practitioner: Interesting, but that’s too much error nonetheless… How about 1.05?

Theoretician: Nope, essentially as hard as exact solution.

Practitioner: Thanks. But I still need to deliver the code in a month. I guess I will go and hack something by myself — anyways my instances are…

View original post 536 more words

Parallelization of our Locality-Sensitive Hashing approach for general metric space

These last weeks has been full of work. I am in the critical final phase of my Master degree studies, trying to finish and hoping to submit my dissertation to the committee as soon as possible. Besides, I started in June a job as a software analyst at Brazilian Institute of Geography and Statistics (IBGE), as a public servant. Keeping up the good work in both jobs has been challenging; but there has been also really good times. It is the World Cup (\ironic\ yay)!

Yesterday, for example, I found out that our work on the parallelization of our LSH aproach to generic metric similarity search was accepted at the 7th International Conference on Similarity Search and Applications – SISAP 2014! What a great news. In this work, a colaboration with specialists in distributed and parallel system Dr. George Teodoro and Thiago Teixeira, we insisted in the direct “simplistic” approach (with rather good results) of VoronoiLSH (basically partitioning the space using a set points, random or not, as centroids of the partitions and attributing codes for the partitions) to design a parallel metric nearest neighbor search method using Dataflow programming (breaking the indexing and searching algorithm in five computing stages). It is nice that the approach exploits task, pipeline, replicated and intra-stage parallelism. We evaluated the proposed method in small metric datasets and in a big Euclidean dataset for ANN (1 Billion, 128 dimensional points). More details should be posted as soon. So, here is the abstract:

Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data
Eliezer Silva, Thiago Teixeira, George Teodoro and Eduardo Valle

Under the assumption of uniform access cost to the data, and for the handful of dissimilarities for which locality-sensitive families are available, Locality-Sensitive Hashing (LSH) is known to be one of the most competitive techniques available for similarity search. In this work we propose Parallel Voronoi LSH, an approach that addresses those two limitations of LSH: it makes LSH efficient for distributed-memory architectures, and it works for very general dissimilarities (in particular, it works for all metric dissimilarities). Each hash table of Voronoi LSH works by selecting a sample of the dataset to be used as seeds of a Voronoi diagram. The Voronoi cells are then used to hash the data. Because Voronoi diagrams depend only on the distance, the technique is very general. Implementing LSH in distributed-memory systems is very challenging because it lacks referential locality in its access to the data: if care is not taken, excessive message-passing ruins the index performance. Therefore, another important contribution of this work is the parallel design needed to allow the scalability of the index, which we evaluate in a dataset of a thousand million multimedia features.