Is min-wise hashing optimal for summarizing set intersection?
Abstract
Min-wise hashing is an important method for estimating the size of the intersection of sets, based on a succinct summary (a "min-hash") independently computed for each set. One application is estimation of the number of data points that satisfy the conjunction of m ≥ 2 simple predicates, where a min-hash is available for the set of points satisfying each predicate. This has applications in query optimization and for approximate computation of COUNT aggregates. In this paper we address the question: How many bits is it necessary to allocate to each summary in order to get an estimate with 1 ± ε relative error? The state-of-the-art technique for minimizing the encoding size, for any desired estimation error, is b-bit min-wise hashing due to Li and König (Communications of the ACM, 2011). We give new lower and upper bounds: • Using information complexity arguments, we show that b-bit min-wise hashing is space optimal for m = 2 predicates in the sense that the estimator's variance is within a constant factor of the smallest possible among all summaries with the given space usage. But for conjunctions of m > 2 predicates we show that the performance of b-bit min-wise hashing (and more generally any method based on "κ-permutation" min-hash) deteriorates as m grows. • We describe a new summary that nearly matches our lower bound for m > 2. It asymptotically outperform all κ-permutation schemes (by around a factor Ω(m/log m)), as well as methods based on subsampling (by a factor Ω (log nmax), where nmax is the maximum set size). Copyright 2014 ACM.