Set Similarity Search Beyond Minhash
Christiani Tobias, Pagh Rasmus. Arxiv 2016
[Paper]
ARXIV
Independent
LSH
We consider the problem of approximate set similarity search under
Braun-Blanquet similarity . The -approximate
Braun-Blanquet similarity search problem is to preprocess a collection of sets
such that, given a query set , if there exists with , then we can efficiently return
with .
We present a simple data structure that solves this problem with space usage
and query time
where and . Making use of existing lower bounds for
locality-sensitive hashing by O’Donnell et al. (TOCT 2014) we show that this
value of is tight across the parameter space, i.e., for every choice of
constants .
In the case where all sets have the same size our solution strictly improves
upon the value of that can be obtained through the use of
state-of-the-art data-independent techniques in the Indyk-Motwani
locality-sensitive hashing framework (STOC 1998) such as Broder’s MinHash (CCS
1997) for Jaccard similarity and Andoni et al.’s cross-polytope LSH (NIPS 2015)
for cosine similarity. Surprisingly, even though our solution is
data-independent, for a large part of the parameter space we outperform the
currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
Similar Work