CRSI: A compact randomized similarity index for set-valued features
Abstract
We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental for many application areas, including data integration and cleaning, data profiling as well as near duplicate document detection. In this paper, we focus on Jaccard similarity and present estimators that work for arbitrary similarity thresholds based on a single similarity index. We show how to build this similarity index a-priori, without knowledge about query similarity thresholds, based on recently proposed synopses for multiset operations. The index is deployed using existing disk-based inverted indexing implementations and our algorithms exploit available techniques, like skip-lists, to further optimize the query performance. The index has provably small space footprints, is orders of magnitude smaller and faster to create/incrementally maintain than exact solutions, and the algorithms provide approximate answers, with an error that is controlled by a user-specified parameter. We prove the error bounds of our algorithms analytically, and, finally, we demonstrate the performance of the algorithms and verify their accuracy experimentally. © 2012 ACM.