Beating CountSketch for heavy hitters in insertion streams
Abstract
Given a stream p1,..., pm of items from a universe U, which, without loss of generality we identify with the set of integers {1,2,...,n}, we consider the problem of returning all l2-heavy hitters, i.e., those items j for which fj ≥ ϵ√F2, where fj is the number of occurrences of item j in the stream, and F2 = Σi∈[n] f2i. Such a guarantee is considerably stronger than the l1-guarantee, which finds those j for which fj ≥ ϵm. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such j using Θ(log2 n) bits of space (for constant ϵ > 0). The only known lower bound is Ω(log n) bits of space, which comes from the need to specify the identities of the items found. In this paper we show one can achieve O(log n log log n) bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including: (1) The first algorithm for estimating F2 simultaneously at all points in a stream using only O(log n log log n) bits of space, improving a natural union bound. (2) A way to estimate the l∞ norm of a stream up to additive error ϵ√F2 with O (log n log log n) bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams.