Exact Algorithms from Approximation Algorithms? (part 1)

Alex Andoni starts with a reflection of how the theory community has evolved to tackle problems by approximate algorithms, but also algorithms that performs good given some constraints on the structured of the problem. This post is a nice overview of Locality-Sensitive Hashing and an introduction to how we could solve exact near neighbor search using approximate randomized algorithm as LSH too.

Windows On Theory

One great “soft” challenge in (T)CS I find to be how to go on to find useful algorithms for problems that we believe (or have even proven!) to be hard in general. Let me explain by giving the all-too-common-example:

Practitioner: I need to solve problem X.

Theoretician: Nice question, let me think… Hm, it seems hard. I can even prove that, under certain assumptions, we cannot do anything non-trivial.

Practitioner: Slick… But I still need to solve it. Can you do anything?

Theoretician: If I think more, I can design an algorithm that gives a solution up to factor 2 within the actual solution!

Practitioner: Interesting, but that’s too much error nonetheless… How about 1.05?

Theoretician: Nope, essentially as hard as exact solution.

Practitioner: Thanks. But I still need to deliver the code in a month. I guess I will go and hack something by myself — anyways my instances are…

View original post 536 more words

Parallelization of our Locality-Sensitive Hashing approach for general metric space

These last weeks has been full of work. I am in the critical final phase of my Master degree studies, trying to finish and hoping to submit my dissertation to the committee as soon as possible. Besides, I started in June a job as a software analyst at Brazilian Institute of Geography and Statistics (IBGE), as a public servant. Keeping up the good work in both jobs has been challenging; but there has been also really good times. It is the World Cup (\ironic\ yay)!

Yesterday, for example, I found out that our work on the parallelization of our LSH aproach to generic metric similarity search was accepted at the 7th International Conference on Similarity Search and Applications – SISAP 2014! What a great news. In this work, a colaboration with specialists in distributed and parallel system Dr. George Teodoro and Thiago Teixeira, we insisted in the direct “simplistic” approach (with rather good results) of VoronoiLSH (basically partitioning the space using a set points, random or not, as centroids of the partitions and attributing codes for the partitions) to design a parallel metric nearest neighbor search method using Dataflow programming (breaking the indexing and searching algorithm in five computing stages). It is nice that the approach exploits task, pipeline, replicated and intra-stage parallelism. We evaluated the proposed method in small metric datasets and in a big Euclidean dataset for ANN (1 Billion, 128 dimensional points). More details should be posted as soon. So, here is the abstract:

Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data
Eliezer Silva, Thiago Teixeira, George Teodoro and Eduardo Valle

Under the assumption of uniform access cost to the data, and for the handful of dissimilarities for which locality-sensitive families are available, Locality-Sensitive Hashing (LSH) is known to be one of the most competitive techniques available for similarity search. In this work we propose Parallel Voronoi LSH, an approach that addresses those two limitations of LSH: it makes LSH efficient for distributed-memory architectures, and it works for very general dissimilarities (in particular, it works for all metric dissimilarities). Each hash table of Voronoi LSH works by selecting a sample of the dataset to be used as seeds of a Voronoi diagram. The Voronoi cells are then used to hash the data. Because Voronoi diagrams depend only on the distance, the technique is very general. Implementing LSH in distributed-memory systems is very challenging because it lacks referential locality in its access to the data: if care is not taken, excessive message-passing ruins the index performance. Therefore, another important contribution of this work is the parallel design needed to allow the scalability of the index, which we evaluate in a dataset of a thousand million multimedia features.