Optimization of all-to-all communication on the Blue Gene/L supercomputer
Abstract
All-to-all communication is a well-known performance bottleneck for many applications, such as those that use the fast Fourier transform (FFT) algorithm. We analyze the performance of all-to-all communication on the Blue Gene/L torus interconnect, which experiences link contention even for all-to-all operations with short messages. We observed that all-to-all performance depends on the shape of the processor partition, and we present a performance analysis of all-to-all on partitions of various shapes. We then present optimization schemes that substantially improve the performance of all-to-all for both short and large messages. In particular, throughput improved from 64% to over 99% of peak on the 65,536-node (64 × 32 × 32) Blue Gene/L machine at Lawrence Livermore National Laboratory. We show the impact of the all-to-all performance optimizations on 1-D and 3-D FFT benchmarks. With our optimized all-to-all, we achieved a performance of over 2.8 TF on the HPC Challenge 1-D FFT benchmark. © 2008 IEEE.