Set Similarity Search Beyond Minhash | Awesome Learning to Hash Add your paper to Learning2Hash

Set Similarity Search Beyond Minhash

Christiani Tobias, Pagh Rasmus. Arxiv 2016

[Paper]    
ARXIV Independent LSH

We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x,y)=|xy|/max(|x|,|y|). The (b2,b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists xP with B(q,x)b1, then we can efficiently return xP with B(q,x)>b2. We present a simple data structure that solves this problem with space usage O(n1+ρlogn+xP|x|) and query time O(|q|nρlogn) where n=|P| and ρ=log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O’Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0<b2<b1<1. In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder’s MinHash (CCS 1997) for Jaccard similarity and Andoni et al.’s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).

Similar Work