Forestdsh A Universal Hash Design For Discrete Probability Distributions
Davoodi Arash Gholami, Chang Sean, Yoo Hyun Gon, Baweja Anubhav, Mongia Mihir, Mohimani Hosein. DAMI 2019
[Paper]
LSH
Supervised
In this paper, we consider the problem of classification of high
dimensional queries to high dimensional classes
where and are discrete alphabets and the
probabilistic model that relates data to the classes is known. This
problem has applications in various fields including the database search
problem in mass spectrometry. The problem is analogous to the nearest neighbor
search problem, where the goal is to find the data point in a database that is
the most similar to a query point. The state of the art method for solving an
approximate version of the nearest neighbor search problem in high dimensions
is locality sensitive hashing (LSH). LSH is based on designing hash functions
that map near points to the same buckets with a probability higher than random
(far) points. To solve our high dimensional classification problem, we
introduce distribution sensitive hashes that map jointly generated pairs
to the same bucket with probability higher than random pairs
and , where and are the marginal probability
distributions of . We design distribution sensitive hashes using a forest of
decision trees and we show that the complexity of search grows with
\(O(N^{\lambda^(P)})\) where \(\lambda^(P)\) is expressed in an analytical form.
We further show that the proposed hashes perform faster than state of the art
approximate nearest neighbor search methods for a range of probability
distributions, in both theory and simulations. Finally, we apply our method to
the spectral library search problem in mass spectrometry, and show that it is
an order of magnitude faster than the state of the art methods.
Similar Work