Consistent Subset Sampling

Kutzkov Konstantin, Pagh Rasmus. arXiv 2014

Consistent sampling is a technique for specifying, in small space, a subset $S$ of a potentially large universe $U$ such that the elements in $S$ satisfy a suitably chosen sampling condition. Given a subset $I \subseteq U$, it should be possible to quickly compute $I \cap S$, i.e., the elements in $I$ satisfying the sampling condition. Consistent sampling has important applications in similarity estimation and in estimating the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-$k$ subsets occurring in some set in a collection of sets of bounded size $b$, where $k$ is a small integer. This can be done by applying standard consistent sampling to the $k$-subsets of each set, but that approach requires time $\Theta(b^{k})$. Using a carefully designed hash function, for a given sampling probability $p \in (0, 1]$, we show how to improve the time complexity to $\Theta(b^{\lceil k/2 \rceil} \log \log b + p b^{k})$ in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is $\Theta(b^{\lceil k/4 \rceil})$. We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent $k$-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as an incidence stream. Further, building upon recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.
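To make the baseline concrete, the following is a minimal Python sketch of standard consistent sampling and of its naive application to the $k$-subsets of a set, which enumerates all of them and therefore takes $\Theta(b^{k})$ time for a set of size $b$. It is not the hash function designed in the paper. The names `hash_unit`, `consistent_sample`, and `sample_k_subsets`, the `seed` parameter, and the use of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib
from itertools import combinations


def hash_unit(item, seed=0):
    """Deterministically map an item to a value in [0, 1).

    SHA-256 is an illustrative choice, not the paper's hash function.
    """
    data = f"{seed}:{item!r}".encode("utf-8")
    digest = hashlib.sha256(data).digest()
    # Keep 53 bits so the division is exact and the result is strictly < 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def consistent_sample(items, p, seed=0):
    """Return the elements whose hash value falls below p.

    The decision depends only on the element itself, so sampling a
    subset I gives exactly I intersected with the sample of the universe.
    """
    return {x for x in items if hash_unit(x, seed) < p}


def sample_k_subsets(s, k, p, seed=0):
    """Naive baseline: hash every k-subset of s and keep those below p.

    Enumerating all k-subsets costs Theta(b^k) for |s| = b.
    Elements are sorted so each subset has one canonical representation.
    """
    return {c for c in combinations(sorted(s), k) if hash_unit(c, seed) < p}
```

Because the sampling decision is a function of the subset alone, a $k$-subset that occurs in two different sets is either sampled from both or from neither, which is the property that makes the sample consistent across the collection.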
