On the Problem of $p_1^{-1}$ in Locality-Sensitive Hashing

Thomas Dybdahl Ahle. arXiv 2020

A Locality-Sensitive Hash (LSH) function is called $(r, cr, p_1, p_2)$-sensitive if two data points at distance less than $r$ collide with probability at least $p_1$, while data points at distance greater than $cr$ collide with probability at most $p_2$. These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular, one may build a $c$-approximate nearest neighbour data structure with query time $\tilde{O}(n^{\rho} / p_1)$, where $\rho = \frac{\log 1/p_1}{\log 1/p_2} \in (0, 1)$. That is sub-linear time, as long as $p_1$ is not too small. This is significant since most high-dimensional nearest neighbour problems suffer from the curse of dimensionality and cannot be solved exactly faster than a brute-force linear-time scan of the database. Unfortunately, the best LSH functions, including the best functions for Cosine and Jaccard Similarity, tend to have very low collision probabilities $p_1$ and $p_2$. This means that the $n^{\rho} / p_1$ query time of LSH is often not sub-linear after all, even for approximate nearest neighbours! In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to $\tilde{O}(n^{\rho} / p_1^{1 - \rho})$ (and the space usage correspondingly). Since $n^{\rho} p_1^{\rho - 1} < n \Leftrightarrow p_1 > n^{-1}$, our algorithm always obtains sub-linear query time for any collision probability $p_1$ greater than $1/n$. For $p_1$ and $p_2$ small enough, our improvement over all previous methods can be up to a factor $n$ in both query time and space. The improvement comes from a simple change to the Indyk-Motwani algorithm, which can easily be implemented in existing software packages.
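To make the two bounds concrete, the following is a minimal numerical sketch, not the paper's algorithm. It only evaluates the leading terms $n^{\rho}/p_1$ and $n^{\rho}/p_1^{1-\rho}$ with the polylogarithmic factors hidden by $\tilde{O}$ dropped. The function names and the sample values of $n$, $p_1$ and $p_2$ are assumptions chosen for illustration. The sub-linearity condition follows because $n^{\rho} p_1^{\rho-1} < n$ is the same as $(1/p_1)^{1-\rho} < n^{1-\rho}$, and $1 - \rho > 0$.

```python
import math


def rho(p1: float, p2: float) -> float:
    """LSH exponent rho = log(1/p1) / log(1/p2), assuming 0 < p2 < p1 < 1."""
    if not (0.0 < p2 < p1 < 1.0):
        raise ValueError("expected 0 < p2 < p1 < 1")
    return math.log(1.0 / p1) / math.log(1.0 / p2)


def classic_bound(n: int, p1: float, p2: float) -> float:
    """Leading term n^rho / p1 of the classic query time (log factors dropped)."""
    return n ** rho(p1, p2) / p1


def improved_bound(n: int, p1: float, p2: float) -> float:
    """Leading term n^rho / p1^(1 - rho) of the improved query time (log factors dropped)."""
    r = rho(p1, p2)
    return n ** r / p1 ** (1.0 - r)


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate a small-p1 regime.
    n, p1, p2 = 10**6, 1e-3, 1e-6
    print("rho      =", rho(p1, p2))
    print("classic  =", classic_bound(n, p1, p2))
    print("improved =", improved_bound(n, p1, p2))
    print("linear n =", n)
```

With these illustrative values, $\rho = 1/2$, so the classic term is about $10^6$, no better than a linear scan, while the improved term is about $3.2 \times 10^4$. The ratio between the two is $p_1^{-\rho}$.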
