Parallelization of our Locality-Sensitive Hashing approach for general metric space

These last weeks has been full of work. I am in the critical final phase of my Master degree studies, trying to finish and hoping to submit my dissertation to the committee as soon as possible. Besides, I started in June a job as a software analyst at Brazilian Institute of Geography and Statistics (IBGE), as a public servant. Keeping up the good work in both jobs has been challenging; but there has been also really good times. It is the World Cup (\ironic\ yay)!

Yesterday, for example, I found out that our work on the parallelization of our LSH aproach to generic metric similarity search was accepted at the 7th International Conference on Similarity Search and Applications – SISAP 2014! What a great news. In this work, a colaboration with specialists in distributed and parallel system Dr. George Teodoro and Thiago Teixeira, we insisted in the direct “simplistic” approach (with rather good results) of VoronoiLSH (basically partitioning the space using a set points, random or not, as centroids of the partitions and attributing codes for the partitions) to design a parallel metric nearest neighbor search method using Dataflow programming (breaking the indexing and searching algorithm in five computing stages). It is nice that the approach exploits task, pipeline, replicated and intra-stage parallelism. We evaluated the proposed method in small metric datasets and in a big Euclidean dataset for ANN (1 Billion, 128 dimensional points). More details should be posted as soon. So, here is the abstract:

Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data
Eliezer Silva, Thiago Teixeira, George Teodoro and Eduardo Valle

Under the assumption of uniform access cost to the data, and for the handful of dissimilarities for which locality-sensitive families are available, Locality-Sensitive Hashing (LSH) is known to be one of the most competitive techniques available for similarity search. In this work we propose Parallel Voronoi LSH, an approach that addresses those two limitations of LSH: it makes LSH efficient for distributed-memory architectures, and it works for very general dissimilarities (in particular, it works for all metric dissimilarities). Each hash table of Voronoi LSH works by selecting a sample of the dataset to be used as seeds of a Voronoi diagram. The Voronoi cells are then used to hash the data. Because Voronoi diagrams depend only on the distance, the technique is very general. Implementing LSH in distributed-memory systems is very challenging because it lacks referential locality in its access to the data: if care is not taken, excessive message-passing ruins the index performance. Therefore, another important contribution of this work is the parallel design needed to allow the scalability of the index, which we evaluate in a dataset of a thousand million multimedia features.

New research paper accepted!

Hot news folks! In my qualification exam I proposed two ideas for extending Locality-Sensitive Hashing to general metric spaces. Readily afterwards I wrote a short paper presenting one of the ideas and early results… and guess what? It has been accepted at the SBBD2013 (Brazilian Symposium on Databases).

K-medoids LSH: a new locality sensitive hashing in general metric space
Eliezer S. Silva, Eduardo Valle

Abstract. The increasing availability of multimedia content poses a challenge for information retrieval researchers. Users want not only have access to multimedia documents, but also make sense of them – the ability of finding specific content in extremely large collections of textual and non-textual documents is paramount. At such large scales, Multimedia Information Retrieval systems must rely on the ability to perform search by similarity efficiently. However, Multimedia Documents are often represented by high-dimensional feature vectors, or by other complex representations in metric spaces. Providing efficient similarity search for that kind of data is extremely challenging. In this article, we explore one of the most cited family of solutions for similarity search, the Locality-Sensitive Hashing (LSH), which is based upon the creation of hashing functions which assign, with higher probability, the same key for data that are similar. LSH is available only for a handful distance functions, but, where available, it has been found to be extremely efficient for architectures with uniform access cost to the data. Most of existing LSH functions are restricted to vector spaces. We propose a novel LSH method for generic metric space based on K-medoids clustering. We present comparison with well established LSH methods in vector spaces and with recent competing new methods for metric spaces. Our early results show promise, but also demonstrate how challenging is to work around those difficulties.