Fast Similarity Sketching
Dahlgaard Søren, Langhede Mathias Bæk Tejs, Houen Jakob Bæk Tejs, Thorup Mikkel. Arxiv 2017
[Paper]
ARXIV
Supervised
We consider the problem: Given a universe
we want a random function mapping subsets
into vectors of size , such that the Jaccard
similarity between sets and is
preserved. More precisely, define and . We want , and we want
to be strongly concentrated around (i.e.
Chernoff-style bounds). This is a fundamental problem which has found numerous
applications in data mining, large-scale classification, computer vision,
similarity search, etc. via the classic MinHash algorithm. The vectors
are also called . Strong concentration is critical, for
often we want to sketch many sets so that we later, for a
query set , can find (one of) the most similar . It is then critical
that no looks much more similar to due to errors in the sketch.
The seminal algorithm uses random hash
functions , and stores as the sketch of . The main drawback of
MinHash is, however, its running time, and finding a sketch
with similar properties and faster running time has been the subject of several
papers. (continued…)
Similar Work