Beyond set disjointness: The communication complexity of finding the intersection
Abstract
We consider the following fundamental communication problem - there is data that is distributed among servers, and the servers want to compute the intersection of their data sets, e.g., the common records in a relational database. They want to do this with as little communication and as few messages (rounds) as possible. They are willing to use randomization, and fail with a tiny probability. Given a protocol for computing the intersection, it can also be used to compute the exact Jaccard similarity, the rarity, the number of distinct elements, and joins between databases. Computing the intersection is at least as hard as the set disjointness problem, which asks whether the intersection is empty. Formally, in the two-server setting, the players hold subsets S, TE [n]. In many realistic scenarios, the sizes of S and T are significantly smaller than n, so we impose the constraint that |S|, |T| < k. We study the minimum number of bits the parties need to communicate in order to compute the intersection set S ∩ T, given a certain number r of messages that are allowed to be exchanged. While O(k log(n/k)) bits is achieved trivially and deterministically with a single message, we ask what is possible with more than one message and with randomization. We give a smooth communication/round tradeoff which shows that with O(log*k) rounds, O(k) bits of communication is possible, which improves upon the trivial protocol by an order of magnitude. This is in contrast to other basic problems such as computing the union or symmetric difference, for which Ω(k log(n/k)) bits of communication is required for any number of rounds. For two players, known lower bounds for the easier problem of set disjointness imply our algorithms are optimal up to constant factors in communication and number of rounds. We extend our protocols to m-player protocols, obtaining an optimal O(mk) bits of communication with a similarly small number of rounds. © 2014 ACM.