Unifying HDFS and GPFS: Enabling analytics on software-defined storage
Abstract
Distributed file systems built for Big Data analytics and cluster file systems built for traditional applications have very different functionality requirements, resulting in separate storage silos. Enterprises often need to run analytics on data that is generated by traditional applications and stored on cluster file systems. The absence of a single data store that can serve both classes of applications leads to data duplication and hence increased storage costs, along with the cost of moving data between the two kinds of file systems. Unifying these two classes of file systems is difficult because the applications that use them have very different requirements for performance, data layout, consistency, and fault tolerance. In this paper, we examine the design differences between two file systems, IBM's GPFS [1] and the open-source Hadoop Distributed File System (HDFS) [2], and propose a way to reconcile these differences. We design and implement a shim layer over GPFS that analytical applications can use to efficiently access data stored in GPFS. Our evaluation provides quantitative results showing that our system performs on par with HDFS for most Big Data applications, while retaining all the guarantees provided by traditional cluster file systems.