Cloud analytics: Do we really need to reinvent the storage stack?
Abstract
Cloud computing offers a powerful abstraction that provides a scalable, virtualized infrastructure as a service, hiding the complexity of fine-grained resource management from the end user. Running data analytics applications in the cloud on extremely large data sets is gaining traction because the underlying infrastructure can meet their extreme scalability demands. Typically, these applications (e.g., business intelligence, surveillance video searches) leverage the MapReduce framework, which decomposes a large computation into a set of smaller, parallelizable computations. More often than not, the underlying storage architecture for a MapReduce application is an Internet-scale file system, such as GFS, which does not provide a standard (POSIX) interface. In this paper we revisit the debate on the need for a new, non-POSIX storage stack for cloud analytics and argue, based on an initial evaluation, that such a stack can be built on traditional POSIX-based cluster file systems. In the course of the evaluation, we compare the performance of a traditional cluster file system and a specialized Internet file system on a variety of workloads drawn from both traditional and MapReduce-based applications. We present modifications to the cluster file system's allocation and layout information that better support the data-locality requirements of analytics applications. We introduce the concept of a metablock, which enables a larger block granularity for MapReduce applications to coexist with the smaller block granularity required by traditional applications. We show that a cluster file system enhanced with metablocks can not only match the performance of specialized Internet file systems for MapReduce applications but also outperform them for traditional applications.
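To make the metablock idea concrete, the following is a minimal illustrative sketch, not code from the paper: the block sizes and function names are assumptions chosen for illustration. The point it shows is that a metablock groups many small file-system blocks into one large, contiguous allocation unit, so MapReduce splits can be aligned to the large unit while traditional applications continue to operate at the small block size.

```python
# Illustrative sketch only: models a "metablock" that groups many small
# file-system blocks into one large, contiguous unit. The sizes and names
# below are assumptions for illustration, not values taken from the paper.

FS_BLOCK = 512 * 1024          # small block size used by traditional applications
METABLOCK = 64 * 1024 * 1024   # large unit exposed to MapReduce-style splits

BLOCKS_PER_METABLOCK = METABLOCK // FS_BLOCK  # 128 small blocks per metablock


def metablock_of(offset: int) -> int:
    """Return the index of the metablock covering a byte offset in a file."""
    return offset // METABLOCK


def block_range(metablock_index: int) -> range:
    """Small-block indices that would be laid out contiguously on one node,
    so a map task reading this metablock sees local, sequential I/O."""
    start = metablock_index * BLOCKS_PER_METABLOCK
    return range(start, start + BLOCKS_PER_METABLOCK)


if __name__ == "__main__":
    # A 1 GiB file spans 16 metablocks; each map split covers one of them,
    # while traditional applications still read and write 512 KiB blocks.
    file_size = 1 * 1024 * 1024 * 1024
    print(file_size // METABLOCK, "metablocks,",
          BLOCKS_PER_METABLOCK, "small blocks each")
```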