
A Generic Implementation of Fast Fourier Transforms for the BPAS Library
Abstract
In this thesis we seek to realize an efficient implementation of a generic parallel fast Fourier transform (FFT). The FFT will be used in support of fast multiplication of polynomials with coefficients in a finite field. Our goal is to obtain a relatively high performing parallel implementation that will run over a variety of finite fields with different sized characteristic primes. To this end, we implement and compare two Cooley-Tukey Six-Step fast Fourier transforms and a Cooley-Tukey Four-Step variant against a high performing specialized FFT already implemented in the Basic Polynomial Algebra Subprograms (BPAS) library. We use optimization techniques found in modern day high performance FFT implementations. We start with a Six Step parallel algorithm suggested by Franz Franchetti and Markus Puschel in the Encyclopedia of Parallel Computing and derive two FFT variants, a Six-Step loop-merged variant and a Four-Step loop-merged variant. We implement and compare the FFTs in both C and C++ programming languages and we compare our BPAS finite field C++ implementation against a GNU multiple precision (GMP) implementation. We compare both serial and parallel versions over finite fields with characteristic primes of size 32 bits and larger. In addition to providing a fair comparison between FFT implementations with a varying degree of specialization, we optimistically hope to achieve a relative degree of performance with C++ as compared to C, and we hope to show how well the different FFTs parallelize and if that relates to the degree of specialization.