Consistent Subset Sampling
Kutzkov Konstantin, Pagh Rasmus. Arxiv 2014
[Paper]
ARXIV
Graph
Independent
Consistent sampling is a technique for specifying, in small space, a subset
of a potentially large universe such that the elements in satisfy a
suitably chosen sampling condition. Given a subset it
should be possible to quickly compute , i.e., the elements
in satisfying the sampling condition. Consistent sampling has
important applications in similarity estimation, and estimation of the number
of distinct items in a data stream.
In this paper we generalize consistent sampling to the setting where we are
interested in sampling size- subsets occurring in some set in a collection
of sets of bounded size , where is a small integer. This can be done by
applying standard consistent sampling to the -subsets of each set, but that
approach requires time . Using a carefully designed hash function,
for a given sampling probability , we show how to improve the time
complexity to in expectation,
while maintaining strong concentration bounds for the sample. The space usage
of our method is .
We demonstrate the utility of our technique by applying it to several
well-studied data mining problems. We show how to efficiently estimate the
number of frequent -itemsets in a stream of transactions and the number of
bipartite cliques in a graph given as incidence stream. Further, building upon
a recent work by Campagna et al., we show that our approach can be applied to
frequent itemset mining in a parallel or distributed setting. We also present
applications in graph stream mining.
Similar Work