Large-scale fast Fourier transform on a heterogeneous multi-core system
Abstract
As interest in hybrid computing systems increases, people are eager to find new ways to exploit the unique and efficient computational power of the heterogeneous multi-core systems. Although there has been much interest in implementing high-performance fast Fourier transform (FFT) libraries on this kind of system, most existing libraries focus on small-scale FFTs whose data can fit in the local storage of a single accelerator. Real-world FFT applications often require much larger scale FFTs, but it is extremely challenging for heterogeneous multi-core system with distributed architectures to make efficient large FFT implementations. In this paper, we introduce the first known FFT library for the heterogeneous multi-core system with distributed architecture that can solve one-dimensional FFTs larger than what fits in a single accelerator. Our implementation achieves 67% performance improvement of FFTW 3.2.2 (Fastest Fourier Transform in the West) and sustains over 36 single precision FFT GFLOPs 'end-to-end' Achieving such high performance requires novel schemes for large-scale FFT factorization, data permutation and all-to-all exchanges, and buffer designs to maximize use of the local storage while minimizing communication overhead. One important finding in this paper is that large-scale FFT on this kind of architecture behaves as data transfer bound which is quite different from other architectures. A significant contribution of this paper is that for each major component of our algorithm, we explore many possible design options and present quantitative performance comparisons. This provides value beyond specific architecture, as it illustrates the fundamental features associated with different communication paradigms and mechanisms for the heterogeneous multi-core system. Today's computer systems are increasingly being designed to include general purpose accelerators. The techniques in this paper can also be applied to these architectures especially when they have limited local storage or steep cache hierarchies. We also provide insights on applying techniques in this paper to similar architectures. © The Author(s) 2012.