Join us in the ML Main Seminar Room Wednesday, October 20, 2010, from 11 am to 12 pm.
Title: Revisiting Large-scale Convolutions and FFTs on Multi-core CPUs, GPUs, and FPGAs
FFTs are commonly used to reduce the operation count of convolution with large stencils, enabling a faster implementation. On parallel computation platforms, however, performance depends not only on the operation count but also on the parallelism, memory access pattern, and data dependencies of the algorithm. In our work, we explore convolution and FFT designs on multi-core CPUs, Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs), and investigate their performance across different problem and stencil sizes. The design process shows that, while some generalities hold across the three platforms (data reuse is key on all three architectures), many design choices are tied directly to the characteristics of the underlying architecture. Results demonstrate that, for many stencil sizes used in seismic processing, direct convolution outperforms FFT-based convolution. The parallel performance of the 1D FFT is limited by its data dependencies, and the 3D FFT incurs significant memory access penalties on the transform along the third dimension; only the 2D FFT scales well with the parallel computation capacity of modern architectures. Technological trends indicate that these findings will continue to hold.
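The operation-count trade-off behind the abstract can be illustrated with a minimal sketch (not from the talk itself): a direct 1D convolution costs O(N·K) multiplies for a length-N signal and length-K stencil, while the FFT route costs O(N log N) regardless of stencil size, which is why the crossover depends on the stencil size.

```python
import numpy as np

def direct_convolve(signal, stencil):
    # Direct (time-domain) convolution: O(len(signal) * len(stencil)) multiplies.
    return np.convolve(signal, stencil, mode="full")

def fft_convolve(signal, stencil):
    # FFT-based convolution: zero-pad both operands to the full output
    # length, multiply pointwise in the frequency domain, transform back.
    n = len(signal) + len(stencil) - 1
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(stencil, n)
    return np.fft.irfft(spectrum, n)

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)
stencil = rng.standard_normal(31)   # a "small" stencil: direct often wins here

# Both methods compute the same convolution, differing only in cost.
print(np.allclose(direct_convolve(signal, stencil),
                  fft_convolve(signal, stencil)))
```

On real hardware the choice is not settled by operation counts alone; as the abstract argues, parallelism, memory access patterns, and data dependencies of each method on the target architecture decide which is faster.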