Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC
'98, has been an extremely influential framework for nearest neighbor search in
high-dimensional data sets. While theoretical work has focused on the
approximate nearest neighbor problem, in practice LSH data structures with
suitably chosen parameters are used to solve the exact nearest neighbor problem
(with some error probability). Sublinear query time is often possible in
practice even for exact nearest neighbor search, intuitively because the
nearest neighbor tends to be significantly closer to the query than the other
data points are.
However, theory offers little advice on how to choose LSH parameters outside of
pre-specified worst-case settings.
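To make the setting concrete, the following is a minimal sketch of an LSH index of the kind described above, using random-hyperplane hashing for angular distance. The parameters k (hash bits per table) and L (number of tables) are the quantities whose choice the abstract refers to; all names here are illustrative, not from the paper.

```python
import math
import random


def random_hyperplanes(dim, k, rng):
    # k random Gaussian normals, each defining one hyperplane (one hash bit)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]


def lsh_hash(point, planes):
    # Sign pattern of the dot products with each hyperplane -> k-bit bucket key
    return tuple(1 if sum(a * b for a, b in zip(v, point)) >= 0 else 0
                 for v in planes)


class LSHIndex:
    """Toy LSH index: L hash tables, each keyed by a k-bit sketch."""

    def __init__(self, data, k=4, L=10, seed=0):
        rng = random.Random(seed)
        dim = len(data[0])
        self.data = data
        self.tables = []
        for _ in range(L):
            planes = random_hyperplanes(dim, k, rng)
            buckets = {}
            for i, p in enumerate(data):
                buckets.setdefault(lsh_hash(p, planes), []).append(i)
            self.tables.append((planes, buckets))

    def query(self, q):
        # Examine only the points that collide with q in at least one table;
        # with well-chosen k and L this candidate set is small, giving the
        # sublinear query time mentioned above (at some error probability).
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(lsh_hash(q, planes), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: math.dist(q, self.data[i]))
```

The query can fail (return a non-nearest point, or nothing) exactly when the true nearest neighbor collides with the query in none of the L tables; larger k makes each table more selective, while larger L drives the failure probability down at the cost of space and query time.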
We introduce the technique of confirmation sampling for solving the exact
nearest neighbor problem using LSH. First, we give a general reduction that
transforms a sequence of data structures that each find the nearest neighbor
with a small, unknown probability, into a data structure that returns the
nearest neighbor with probability