DisCo: Distributed Co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining
Abstract
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easyto- use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map- Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary.In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing, and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware. © 2008 IEEE.