Consistent Subset Sampling | Awesome Learning to Hash Add your paper to Learning2Hash

Consistent Subset Sampling

Kutzkov Konstantin, Pagh Rasmus. Arxiv 2014

[Paper]    
ARXIV Graph Independent

Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset IU it should be possible to quickly compute IS, i.e., the elements in I satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time Θ(bk). Using a carefully designed hash function, for a given sampling probability p(0,1], we show how to improve the time complexity to Θ(bk/2loglogb+pbk) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is Θ(bk/4). We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.

Similar Work