Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science


Moreno Maza, Marc


In this thesis we seek to realize an efficient implementation of a generic parallel fast Fourier transform (FFT). The FFT will be used in support of fast multiplication of polynomials with coefficients in a finite field. Our goal is to obtain a relatively high performing parallel implementation that will run over a variety of finite fields with different sized characteristic primes. To this end, we implement and compare two Cooley-Tukey Six-Step fast Fourier transforms and a Cooley-Tukey Four-Step variant against a high performing specialized FFT already implemented in the Basic Polynomial Algebra Subprograms (BPAS) library. We use optimization techniques found in modern day high performance FFT implementations. We start with a Six Step parallel algorithm suggested by Franz Franchetti and Markus Puschel in the Encyclopedia of Parallel Computing and derive two FFT variants, a Six-Step loop-merged variant and a Four-Step loop-merged variant. We implement and compare the FFTs in both C and C++ programming languages and we compare our BPAS finite field C++ implementation against a GNU multiple precision (GMP) implementation. We compare both serial and parallel versions over finite fields with characteristic primes of size 32 bits and larger. In addition to providing a fair comparison between FFT implementations with a varying degree of specialization, we optimistically hope to achieve a relative degree of performance with C++ as compared to C, and we hope to show how well the different FFTs parallelize and if that relates to the degree of specialization.

Summary for Lay Audience

The fast Fourier transform (FFT) is used by most scientific disciplines. High performance implementations of the fast Fourier transform play a crucial role in research areas like cryptography, signal processing, and polynomial system solving. On contemporary parallel computing platforms it is difficult to obtain high-performance FFT implementations. In this paper we derive and implement parallel Cooley-Tukey general radix decimation-in-time six step FFTs that we can match to various target platforms. Our goal is a competitive parallel FFT over Finite Fields. We detail and discuss our implementation and optimization effort. Then we analyze the performance of our implementation and present our findings.

Included in

Algebra Commons