Set Similarity Search Beyond Minhash

Tobias Christiani, Rasmus Pagh. arXiv 2016

We consider the problem of approximate set similarity search under Braun-Blanquet similarity $B(x, y) = |x \cap y| / \max(|x|, |y|)$. The $(b_{1}, b_{2})$-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets $P$ such that, given a query set $q$, if there exists $x \in P$ with $B(q, x) \geq b_{1}$, then we can efficiently return $x' \in P$ with $B(q, x') > b_{2}$. We present a simple data structure that solves this problem with space usage $O(n^{1 + \rho} \log n + \sum_{x \in P} |x|)$ and query time $O(|q| n^{\rho} \log n)$, where $n = |P|$ and $\rho = \log(1 / b_{1}) / \log(1 / b_{2})$. Making use of existing lower bounds for locality-sensitive hashing by O’Donnell et al. (TOCT 2014), we show that this value of $\rho$ is tight across the parameter space, i.e., for every choice of constants $0 < b_{2} < b_{1} < 1$. In the case where all sets have the same size, our solution strictly improves upon the value of $\rho$ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998), such as Broder’s MinHash (CCS 1997) for Jaccard similarity and Andoni et al.’s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
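To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' data structure. It computes the Braun-Blanquet similarity and the exponent $\rho = \log(1 / b_{1}) / \log(1 / b_{2})$, and compares it with the exponent MinHash would give for sets of equal size. The function names are hypothetical. The comparison assumes the standard MinHash exponent $\log(1 / j_{1}) / \log(1 / j_{2})$ and the identity $J = B / (2 - B)$, which relates Jaccard to Braun-Blanquet similarity when $|x| = |y|$. Returning 0.0 for two empty sets is also an assumption made here for convenience.

```python
import math


def braun_blanquet(x, y):
    """B(x, y) = |x ∩ y| / max(|x|, |y|).

    Assumption: two empty sets are given similarity 0.0 to avoid dividing by zero.
    """
    x, y = set(x), set(y)
    largest = max(len(x), len(y))
    if largest == 0:
        return 0.0
    return len(x & y) / largest


def rho_braun_blanquet(b1, b2):
    """Exponent rho = log(1/b1) / log(1/b2) stated in the abstract."""
    if not 0 < b2 < b1 < 1:
        raise ValueError("requires 0 < b2 < b1 < 1")
    return math.log(1 / b1) / math.log(1 / b2)


def jaccard_from_bb_equal_size(b):
    """For |x| = |y|, Jaccard similarity J = B / (2 - B)."""
    return b / (2 - b)


def rho_minhash_equal_size(b1, b2):
    """Standard MinHash exponent log(1/j1) / log(1/j2), with the thresholds
    translated from Braun-Blanquet to Jaccard under the equal-size assumption."""
    if not 0 < b2 < b1 < 1:
        raise ValueError("requires 0 < b2 < b1 < 1")
    j1 = jaccard_from_bb_equal_size(b1)
    j2 = jaccard_from_bb_equal_size(b2)
    return math.log(1 / j1) / math.log(1 / j2)


if __name__ == "__main__":
    print(braun_blanquet({1, 2, 3, 4}, {3, 4, 5}))  # intersection 2, larger set 4
    print(rho_braun_blanquet(0.5, 0.25))
    print(rho_minhash_equal_size(0.5, 0.25))
```

For $b_{1} = 0.5$ and $b_{2} = 0.25$ the first exponent is $0.5$, while the MinHash exponent under the equal-size assumption is $\log 3 / \log 7 \approx 0.565$, which is consistent with the strict improvement claimed for equal-size sets.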
