## Exercise: angle between high-dimensional unit vectors

I stumbled across this trigonometric formula for high-dimensional unit vectors on the (d−1)-sphere while proofreading the last two LSH papers (“Beyond LSH” and “Optimal Data-Dependent LSH”). It is an interesting formula because the angle between two points is fully characterized by the distance between them.

So, let’s do it.
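For unit vectors $u, v$ at angle $\theta$, the chord–angle relation $\|u - v\| = 2\sin(\theta/2)$ (equivalently $\cos\theta = 1 - \|u - v\|^2/2$) expresses the angle purely in terms of the distance — presumably the relation meant above. A quick numerical sanity check (the dimension and variable names here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100

# two random unit vectors on the (d-1)-sphere
u = rng.normal(size=d); u /= np.linalg.norm(u)
v = rng.normal(size=d); v /= np.linalg.norm(v)

theta = np.arccos(np.clip(u @ v, -1.0, 1.0))  # angle between u and v
dist = np.linalg.norm(u - v)                  # Euclidean distance

# chord-angle relation: ||u - v|| = 2 sin(theta / 2)
assert np.isclose(dist, 2 * np.sin(theta / 2))
```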

## Exercise: gradient of soft-max error function

The soft-max regression model can be used for the $k$-class classification problem. The model is a vector of the membership probabilities for each of the $k$ classes. So, the activation function $h_\theta(x^{(i)})$ is given by:

$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; \theta) \\ p(y^{(i)} = 2 | x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k | x^{(i)}; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_k^T x^{(i)} } \\ \end{bmatrix}$
And, for each class $n$:
$h_{\theta_n}(x^{(i)}) = p(y^{(i)} = n | x^{(i)}; \theta) = \frac{ e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }$

The error function is given by:
$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]=- \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log( h_{\theta_j}(x^{(i)}) )\right]$
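Working out the gradient with $\delta_{y^{(i)}j}$ the Kronecker delta gives the standard result $\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left( \delta_{y^{(i)}j} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right)$. A small numerical check of that expression against finite differences (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 20, 5, 3                 # samples, features, classes
X = rng.normal(size=(m, d))
y = rng.integers(0, k, size=m)
Theta = rng.normal(size=(k, d))    # one parameter row per class

def softmax_probs(Theta):
    logits = X @ Theta.T                           # (m, k)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def J(Theta):
    P = softmax_probs(Theta)
    return -np.log(P[np.arange(m), y]).mean()      # cross-entropy error

def grad_J(Theta):
    P = softmax_probs(Theta)
    D = np.zeros_like(P)
    D[np.arange(m), y] = 1.0                       # delta_{y^(i) j}
    return -(D - P).T @ X / m                      # (k, d), one row per theta_j

# central finite-difference check of the analytic gradient
eps = 1e-6
G = grad_J(Theta)
num = np.zeros_like(Theta)
for idx in np.ndindex(*Theta.shape):
    E = np.zeros_like(Theta); E[idx] = eps
    num[idx] = (J(Theta + E) - J(Theta - E)) / (2 * eps)
assert np.allclose(G, num, atol=1e-5)
```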

## Exact Algorithms from Approximation Algorithms? (part 1)

Alex Andoni starts with a reflection on how the theory community has evolved to tackle problems with approximation algorithms, but also with algorithms that perform well given some constraints on the structure of the problem. This post is a nice overview of Locality-Sensitive Hashing and an introduction to how we could also solve exact near-neighbor search using approximate randomized algorithms such as LSH.

One great “soft” challenge in (T)CS, I find, is how to go about finding useful algorithms for problems that we believe (or have even proven!) to be hard in general. Let me explain by giving an all-too-common example:

Practitioner: I need to solve problem X.

Theoretician: Nice question, let me think… Hm, it seems hard. I can even prove that, under certain assumptions, we cannot do anything non-trivial.

Practitioner: Slick… But I still need to solve it. Can you do anything?

Theoretician: If I think more, I can design an algorithm that gives a solution within a factor of 2 of the actual solution!

Practitioner: Interesting, but that’s too much error nonetheless… How about 1.05?

Theoretician: Nope, essentially as hard as exact solution.

Practitioner: Thanks. But I still need to deliver the code in a month. I guess I will go and hack something by myself — anyways my instances are…


## Parallelization of our Locality-Sensitive Hashing approach for general metric space

These last weeks have been full of work. I am in the critical final phase of my Master’s degree, trying to finish and hoping to submit my dissertation to the committee as soon as possible. Besides that, in June I started a job as a software analyst at the Brazilian Institute of Geography and Statistics (IBGE), as a public servant. Keeping up the good work in both roles has been challenging, but there have also been really good times. It is the World Cup (ironic yay!).

Yesterday, for example, I found out that our work on the parallelization of our LSH approach to generic metric similarity search was accepted at the 7th International Conference on Similarity Search and Applications (SISAP 2014)! What great news. In this work, a collaboration with specialists in distributed and parallel systems, Dr. George Teodoro and Thiago Teixeira, we kept the direct “simplistic” approach (with rather good results) of Voronoi LSH (basically, partitioning the space using a set of points, random or not, as centroids of the partitions and assigning codes to the partitions) and used it to design a parallel metric nearest-neighbor search method based on dataflow programming (breaking the indexing and searching algorithm into five computing stages). It is nice that the approach exploits task, pipeline, replicated and intra-stage parallelism. We evaluated the proposed method on small metric datasets and on a big Euclidean dataset for ANN (1 billion 128-dimensional points). More details should be posted soon. So, here is the abstract:

Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data
Eliezer Silva, Thiago Teixeira, George Teodoro and Eduardo Valle

Under the assumption of uniform access cost to the data, and for the handful of dissimilarities for which locality-sensitive families are available, Locality-Sensitive Hashing (LSH) is known to be one of the most competitive techniques available for similarity search. In this work we propose Parallel Voronoi LSH, an approach that addresses those two limitations of LSH: it makes LSH efficient for distributed-memory architectures, and it works for very general dissimilarities (in particular, it works for all metric dissimilarities). Each hash table of Voronoi LSH works by selecting a sample of the dataset to be used as seeds of a Voronoi diagram. The Voronoi cells are then used to hash the data. Because Voronoi diagrams depend only on the distance, the technique is very general. Implementing LSH in distributed-memory systems is very challenging because it lacks referential locality in its access to the data: if care is not taken, excessive message-passing ruins the index performance. Therefore, another important contribution of this work is the parallel design needed to allow the scalability of the index, which we evaluate in a dataset of a thousand million multimedia features.
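The hashing step of one Voronoi LSH table can be sketched in a few lines — sample some dataset points as seeds and hash each point to the index of its nearest seed. This is a minimal single-machine sketch, assuming Euclidean distance for concreteness (the technique only needs a distance function; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def voronoi_lsh_hash(points, seeds):
    """Hash each point to the index of its nearest seed (its Voronoi cell)."""
    # pairwise distances, shape (n_points, n_seeds)
    dist = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    return dist.argmin(axis=1)

X = rng.normal(size=(1000, 16))
# sample dataset points to serve as the seeds of the Voronoi diagram
seeds = X[rng.choice(len(X), size=8, replace=False)]
codes = voronoi_lsh_hash(X, seeds)
# points sharing a code fall in the same cell and become candidate neighbors
```

Multiple independent tables (fresh seed samples) would be used in practice to boost the probability that true neighbors collide in at least one table.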

## New research paper accepted!

Hot news, folks! In my qualification exam I proposed two ideas for extending Locality-Sensitive Hashing to general metric spaces. Right afterwards I wrote a short paper presenting one of the ideas and early results… and guess what? It has been accepted at SBBD 2013 (the Brazilian Symposium on Databases).

K-medoids LSH: a new locality sensitive hashing in general metric space
Eliezer S. Silva, Eduardo Valle

Abstract. The increasing availability of multimedia content poses a challenge for information retrieval researchers. Users want not only to have access to multimedia documents, but also to make sense of them – the ability to find specific content in extremely large collections of textual and non-textual documents is paramount. At such large scales, Multimedia Information Retrieval systems must rely on the ability to perform search by similarity efficiently. However, multimedia documents are often represented by high-dimensional feature vectors, or by other complex representations in metric spaces. Providing efficient similarity search for that kind of data is extremely challenging. In this article, we explore one of the most cited families of solutions for similarity search, Locality-Sensitive Hashing (LSH), which is based upon the creation of hashing functions that assign, with higher probability, the same key to data that are similar. LSH is available only for a handful of distance functions, but, where available, it has been found to be extremely efficient for architectures with uniform access cost to the data. Most existing LSH functions are restricted to vector spaces. We propose a novel LSH method for generic metric spaces based on K-medoids clustering. We present comparisons with well-established LSH methods in vector spaces and with recent competing methods for metric spaces. Our early results show promise, but also demonstrate how challenging it is to work around these difficulties.
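The core idea can be sketched as: run k-medoids on a precomputed distance matrix, then use the nearest medoid as the hash key. This is an illustrative reconstruction, not the paper’s exact algorithm; the simple alternating k-medoids pass and all names and parameters below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

def k_medoids(D, k, iters=20):
    """Simple alternating k-medoids on a precomputed distance matrix D."""
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = D[:, medoids].argmin(axis=1)       # assign to nearest medoid
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # new medoid = member minimizing total distance in its cluster
                new[c] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new, medoids):            # converged
            break
        medoids = new
    return medoids

X = rng.normal(size=(200, 8))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # any metric would do here
medoids = k_medoids(D, k=5)
codes = D[:, medoids].argmin(axis=1)                 # hash key = nearest medoid
```

Because only the distance matrix is used, the same sketch applies to any metric space, which is the point of the proposal.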

## Airplanes and Neural Nets

November 15th is Proclamation of the Republic Day, a national holiday, and I was really in need of a day of rest. As I couldn’t rest the whole day, I spent some time reading this great book by Dr. Vladimir Vapnik, The Nature of Statistical Learning Theory. I’m not going to write a full review of the book, but I’d like to make some observations regarding a few of his opinions that really struck me.

I really enjoyed the historical background on research into the Learning Problem; as a newcomer and apprentice in the field, I find it useful and wise to be well informed on what has been going on. As Cosma Shalizi points out in this review, Vapnik’s view on the history of research in the field is very idiosyncratic, but nevertheless interesting. He expresses a great deal of criticism towards the motto “Complex theories do not work; simple algorithms do” and places himself on the side of the Theory guys.

At some point he starts describing the “second birth” of research on Neural Nets, and there lies a footnote that helped me remember why I was never too excited about it. Let’s start with the quote, which is in the context of a joint research program declared between Artificial Intelligence scientists and physiologists:

Of course it is very interesting to know how humans can learn. However, this is not necessarily the best for creating an artificial learning machine. It has been noted that the study of birds flying was not very useful for constructing the airplane.

I claim that this remark does not intend to undermine the whole research program on bio-inspired learning algorithms, but rather to elucidate the actual role of this approach in the general setting of the Learning Problem. That is to say, airplanes are not the same as birds: the materials are not the same, the dynamics are not the same, the way they use energy is not the same; we probably wouldn’t have been able to create a flying machine just by copying flying living beings. Nevertheless, it could be that some good ideas later incorporated into flying machines were inspired by birds, as long as we understood the nature of a flying machine as essentially different from that of birds.

The reason I was never too excited about neural nets basically comes from my second undergraduate programming course. Prof. Ayrton Cristo made a small negative comment on the fact that when neural nets function correctly, generalize from the training set, and so on, we usually don’t understand why (or how) they work. I don’t endorse his opinion, but I should say that subconsciously this has always made me wary of Neural Nets and some other heuristics. (Prof. Ayrton Cristo is what you could call an AI guy with a strong foot in NLP, Logic and Functional Programming.)

My personal philosophy leans more toward the “theoretically proven” side than the “as long as it works” side. I know that a lot of good ideas can emerge from heuristics and only a posteriori be well explained by theorists; nevertheless, I’m much more comfortable when I have grounded theory to back me up.

## And so on and so forth!

Hello world!

Let’s get this thing started.

First of all, I’m not used to blogging in English, but I’m really excited about this new experience. This blog is intended to be a somewhat personal, but also kind of professional, account of my journey through a solid education in Computer Science research.

I should have started it sooner, probably in February 2012, when I began my master’s course at the School of Electrical and Computer Engineering, State University of Campinas (FEEC-Unicamp), but better late than never.

By now I am almost done with the 16 mandatory credits, and I’ll concentrate mainly on the research project, which has gone through some changes during the year. Later I’ll write about my research project and the courses I’ve taken.

Looking back, I can say that it has been a mindful experience; I’ve been working with amazing and inspiring people, and I hope my research results will reflect all the great things I’ve learned, experienced and thought.

So join me, and I hope my writing will be as pleasant to you as all of this has been to me.