Universal FFT core generator by Shenoy, Pranab
Universal FFT Core Generator
A Thesis
Submitted to the Faculty
of
Drexel University
by
Pranab Shenoy
in partial fulfillment of the
requirements for the degree
of
MS in Computer Science
December 2007
c© Copyright December 2007
Pranab Shenoy
Copying and use (herein “Redistribution”) of this document is permitted provided that the following
conditions are met:
• This document may not be modified in any respect without specific prior written permission
from the Contributors to this document (herein “Contributors”).
• Redistributions must retain the above copyright notice, incorporate all of the conditions set
forth in this license, as well as incorporate the disclaimer(s) herein.
• Neither the name of Drexel University nor the names of the Contributors may be used to
endorse or promote any artifacts or other products derived from this document without specific
prior written permission from the Contributors.
• This document may not be redistributed for profit without specific prior written permission
from the Contributors.
• This document is provided by the Regents and Contributors “as is” and any express or implied
warranties, including, but not limited to, the implied warranties of merchantability and fitness
for a particular purpose are disclaimed. In no event shall the Regents and Contributors
be liable for any direct, indirect, incidental, special, exemplary, or consequential damages
(including, but not limited to, procurement of substitute goods or services; loss of profits;
or business interruption) however caused and on any theory of liability, whether in contract,
strict liability, or tort (including negligence or otherwise) arising in any way out of the use of
this document, even if advised of the possibility of such damage.
• Nothing in this license document shall preclude Contributor, Pranab Shenoy, from distributing
this document as deemed fit or deriving a “profit” or any form of money as the result of the
preparation or distribution of this document.
ii
Acknowledgements
I would like to express my sincere gratitude to the following people for their contributions to the
completion of this thesis. Firstly, I would like to thank my advisor Dr. Jeremy Johnson who has
been a constant source of motivation and support. The many challenges and opportunities that he
has provided will always be greatly appreciated.
I would also like to thank Dr. Prawat Nagvajara for introducing me to the world of FPGAs,
and his guidance in the various aspects of this work. I am also grateful to Drs. James C. Hoe and
Prawat Nagavajara for consenting to be on my committee, and for their extremely valuable inputs
towards this thesis.
I am extremely grateful to Dr. T. S. Venkatraman for his constant guidance and support.
I would like to thank Peter Milder at Carnegie Mellon University for being patient with my
questions, his quick and informative replies helped me a considerably, to reach the completion of
this thesis. I would also like to thank my colleagues Michael Andrews and Timothy Chagnon. Their
valuable input helped me overcome many problems during the course of my work.
I would like to thank all my friends, who have made my stay here at Drexel University enjoyable.
Finally, I am eternally indebted to my parents for their unconditional love and support in all my
ventures.
iii
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. Dimensionless FFT .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Tensor Product Identities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 The Discrete Fourier Transform .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 One-Dimensional DFT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Multi-Dimensional DFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 The Cooley-Tukey Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Iterative Cooley-Tukey Factorization .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Dimensionless FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3. Universal FFT Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1 Pease Factorization .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Dimensionless Pease Algorithm .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Universal FFT Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4. 1D DFT Core Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Hardware Platform Overview.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 FPGA Application Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Algorithm to Core .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.1 Horizontal Data-path Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.2 Vertical Data-path Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Hardware Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Stride Permutation through Scalable Interconnection Network .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.1 The J Permutation .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.2 Delay Switch Delay Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 DFT Core Internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6.1 Stride Permutation Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6.2 Top Level Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
iv
4.6.3 Kernel Module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6.4 ROM Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 1D DFT Core: Execution Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8 1D DFT Core Generator: From Parameters to Verilog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5. Universal FFT Core Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 Universal FFT Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Software Framework for the Universal FFT Core Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6. Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1 1D DFT Core Generator Performance .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2 Dimensionless FFT Performance .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.1 1D DFT Core Performance.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.2 Multi-Dimensional DFT Core Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Advantages of Dimensionless FFT over Row-Column Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.4 Performance Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
vList of Tables
4.1 DFT IP generator parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
vi
List of Figures
2.1 One Dimensional Bit Reversal Permutation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Generalized Bit Reversal Permutation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Recursive Cooley-Tukey FFT Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Iterative Cooley-Tukey FFT Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Dimensionless Iterative Cooley-Tukey FFT Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 1D Pease FFT Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Pease Dimensionless Twiddle Generator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Dimensionless Pease FFT Algorithm.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 16 point Pease 1D FFT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 4x4 point Pease 2D FFT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 2x2x4 point Pease 3D FFT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Universal FFT Processor. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8 Universal FFT Processor Folded Horizontally. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Application Development on a Xilinx FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Horizontal Data-path Folding.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Vertical Data-path Folding.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Delay Switch Delay Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Delay Switch Delay Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6 Delay Switch Delay Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.7 Stride Permutation Hardware... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.8 Permutation Module.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.9 DSD Cascade component. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.10 Delay Switch Delay Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.11 DSD Control Component. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.12 Top Level Module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
vii
4.13 Kernel Module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.14 One-Dimensional ROM Module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1 Multi-Dimensional DFT ROM Module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Finite State Machine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Finite State Machine in a Dimensionless DFT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4 Digit Reversal Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1 Slice Usage of 1D DFT Core Generator vs Xilinx LogiCore Library. . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 BRAM Usage of 1D DFT Core Generator vs Xilinx LogiCore Library. . . . . . . . . . . . . . . . . . . . . 73
6.3 Minimal Slice Usage - Twiddles and FIFOs in BRAM (Slices vs p). . . . . . . . . . . . . . . . . . . . . . . . . 74
6.4 Minimal Slice Usage - Twiddles and FIFOs in BRAM (BRAMs vs p). . . . . . . . . . . . . . . . . . . . . . 75
6.5 Maximum Slice Usage - Twiddles and FIFOs in distributed RAM. . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6 Balanced Slice Usage - Twiddles in BRAM and FIFO depth > 64 in BRAM (Slices vs p). 77
6.7 Balanced Slice Usage - Twiddles in BRAM and FIFO depth > 64 in BRAM (BRAMs vs
p). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.8 Different configurations of a 1024-point DFT. Twiddles and FIFOs in BRAM (Slices vs
t). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.9 Different configurations of a 1024-point DFT. Twiddles and FIFOs in BRAM (BRAMs
vs t). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.10 Different configurations of a 4096-point DFT. Twiddles and FIFOs in BRAM (Slices vs
t). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.11 Different configurations of a 4096-point DFT. Twiddles and FIFOs in BRAM (BRAMs
vs t). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
viii
Abstract
Universal FFT Core Generator
Pranab Shenoy
Advisor: Dr. Jeremy Johnson, PhD
This thesis presents a special purpose processor for multi-dimensional discrete Fourier transform
(DFT) computation. The processor, called a Universal fast Fourier transform (FFT) processor, is
capable of computing an arbitrary multi-dimensional DFT with a fixed number of input points.
The processor can be configured to compute different dimensional DFTs simply by rearranging the
input data and using a different set of constants, called generalized twiddle factors, throughout the
computation. The basic computational process remains the same independent of dimension. The
processor does not rely on one-dimensional FFT computation as does the standard approach using
the row-column algorithm, but instead utilizes an algorithm called the Dimensionless FFT, whose
computation data flow is the same as the FFT algorithm due to Pease.
Since the computation performed by the Universal FFT processor is the same as the Pease FFT,
an implementation can be obtained by modifying a one-dimensional FFT processor based on the
Pease algorithm. The implementation presented in this thesis is obtained in this way from the core
generator by Nordin, Milder, Hoe, and Pu¨schel. The core generator from Nordin, Milder, Hoe,
and Pu¨schel generates a variety of cores, with various space and time tradeoffs for implementation
on field programmable gate arrays (FPGA). The performance obtained is comparable to vendor
supplied cores but offers much greater flexibility of design choices with different area and throughput
tradeoffs. The resulting Universal FFT core generator can be used to instantiate multi-dimensional
FPGA cores with approximately the same performance and area obtained in the one-dimensional
case by Nordin, Milder, Hoe, and Pu¨schel.
11. Introduction
This thesis presents a special purpose processor for multi-dimensional discrete Fourier transform
(DFT) computation using a variant of the fast Fourier transform (FFT) [3] due to Pease [13].
The processor, called a Universal fast Fourier transform (FFT) processor, is capable of computing
an arbitrary multi-dimensional DFT with a fixed number of input points. The processor can be
configured to compute different dimensional DFTs simply by rearranging the input data and using a
different set of constants, called generalized twiddle factors, throughout the computation. The basic
computational process remains the same independent of dimension.
The processor does not rely on one-dimensional FFT computation as does the standard approach
using the row-column algorithm [2], but instead utilizes an algorithm called the Dimensionless FFT
[1]. The Dimensionless FFT as presented in [1] was based on the standard Cooley-Tukey algorithm
[3]; however the algorithm can be adapted to different variants of the FFT, and in this thesis we use
the variant due to Pease [13].
Since the computation performed by the Universal FFT processor is the same as the Pease FFT,
an implementation can be obtained by modifying a one-dimensional FFT processor based on the
Pease algorithm. The implementation presented in this thesis is obtained in this way from the core
generator by Nordin, Milder, Hoe, and Pu¨schel [11, 12] which uses a design from [15, 16]. The core
generator from Nordin, Milder, Hoe, and Pu¨schel generates a variety of cores, with various space
and time tradeoffs for implementation on field programmable gate arrays (FPGA) [21]. The core
generator produces a description of the hardware design using the hardware description language
(HDL) Verilog [17] which is then synthesized to obtain the FPGA core. The performance obtained
is comparable to vendor supplied cores [20] but offers much greater flexibility of design choices with
different area and throughput tradeoffs. The resulting Universal FFT core generator can be used
to instantiate multi-dimensional FPGA cores with approximately the same performance and area
obtained in the one-dimensional case by Nordin, Milder, Hoe, and Pu¨schel.
The concept of a Universal FFT processor was first presented in Dimensionless Fast Fourier
Transform Method and Apparatus, (Patent no. US6003056, issued Dec. 14, 1999) and a version
of it was implemented using FPGA by Kumhom [8, 9]. Kumhom’s implementation was based on a
distributed memory processor with a special purpose interconnection network, capable of computing
transforms of different sizes. Instead this thesis presents a core generator, based on a different
2hardware model, which can be used to instantiate cores for different sized transforms with varying
amounts of parallelism which allows for different space/time tradeoffs. The generated cores are shown
to have smaller latency than using fixed one-dimensional cores with the row-column algorithm.
The derivation of the Universal FFT processor is obtained from a mathematical description of
the FFT using the tensor product and matrix factorizations [10, 18, 19]. The presentation follows
the approach in the paper by Johnson, Johnson, Rodriguez, and Tolimieri [5]. The derivation of
the Dimensionless FFT follows the approach in [5], where we modify Theorem 8 to conform to the
Pease algorithm [7] [13](see Theorem 11). Theorem 12 shows how to compute the necessary twiddle
factors, which is the basis for our modification of the 1D DFT core generator.
Chapter 2 of the thesis reviews the necessary mathematics and Chapter 3 derives the Pease
variant of the Dimensionless FFT and the resulting Universal FFT processor. Chapter 4 reviews the
1D DFT core generator from Midler et al and the ideas from Takala et al which is the basis for their
hardware design. Chapter 5 discusses the modifications to the 1D DFT core generator to support
the Dimensionless FFT and Chapter 6 summarizes the resulting performance confirming that similar
performance to the 1D DFT core is obtained and compares it to the row-column algorithm.
32. Dimensionless FFT
This Chapter derives the Dimensionless FFT algorithm from [1], The discrete Fourier transform
(DFT) is reviewed and fast Fourier transform (FFT) algorithms are presented using matrix notation.
The necessary multi-dimensional tools are presented following [5].
2.1 Tensor Product Identities
Definition 1 (Basis vector) The collection of vectors eni |0 ≤ i < n with a one in the i-th position
and zeros elsewhere is denoted by eni and denotes the standard basis for the vectors of n-tuples of
complex numbers.
Definition 2 (Tensor Product)
(emi ⊗ e
n
j ) = e
mn
in+j .
Assuming bilinearity of the tensor product we can define the tensor product of arbitrary vectors. Let
xm =


x0
...
xm−1

 =
∑
0≤i<m
xie
m
i ,
and
yn =


y0
...
yn−1

 =
∑
0≤j<n
yje
n
j ,
then
xm ⊗ yn =
∑
0≤i<m0≤j<n
xiyj(e
m
i ⊗ e
n
j ) =
∑
xiyje
mn
in+j =


x0y
n
...
xm−1y
n

 ;
the vector of size mn obtained by replacing the ith component xi of x by the vector xiy. The tensor
product of two linear transformations is defined by
(A⊗B)(x⊗ y) = Ax⊗By.
4Then (A⊗B)(emi ⊗ e
n
j ) = Ae
m
i ⊗Be
n
j implies
A⊗B =


a0,0B · · · a0,n−1B
...
. . .
...
am−1,0B · · · am−1,n−1B

 ;
the matrix obtained by replacing the ij element of A aij by the matrix aijB, which is the Kronecker
product.
Definition 3 (Identity Matrix) The identity matrix denoted by In is the n× n matrix with ones
along the diagonal and zeros elsewhere.
The following are properties exhibited by the tensor product:
1. If A, B, and C are matrices with compatible dimensions, then
(A⊗B)⊗ C = A⊗ (B ⊗ C). (2.1)
2. If A, B, C and D are matrices with compatible dimensions, then
(A⊗B)(C ⊗D) = (AC) ⊗ (BD). (2.2)
This is also known as the multiplicative rule of tensor products.
3. If P and Q are permutation matrices, then so is P ⊗Q.
4. If Im and In are identity matrices then
Im ⊗ In = Imn. (2.3)
5. If I, B, and C are matrices with compatible dimensions, then
(I ⊗BC) = (I ⊗B)(I ⊗ C) (2.4)
Proof of these identities is given in [1].
Using the tensor product identities introduced above we obtain the following theorem.
5Theorem 1 Let A be an m× n matrix and let B be a p× q matrix. Then
A⊗B = (A⊗ Ip)(In ⊗B) = (Im ⊗B)(A⊗ Iq). (2.5)
Proof
Applying multiplicative rule of tensor products,
(A⊗B) = (AIm ⊗ InB)
= (A⊗ In)(Im ⊗B).
(A⊗B) = (ImA⊗BIn)
= (Im ⊗B)(A ⊗ In).
The tensor decomposition given in Equation 2.5 can be generalized to t factors,
Theorem 2 (Fundamental Tensor Product Factorization) Let Ani be an ni×ni matrix. Let
N(i) = n1 · · ·ni with N = N(t) and let N¯(i) = N/N(i).
An1 ⊗ · · · ⊗An1 =
t∏
i=1
(IN(i−1) ⊗Ani ⊗ IN¯(i)). (2.6)
Furthermore, the factorization is true for any permutation of the factors (IN(i−1) ⊗Ani ⊗ IN¯(i))
Proof
The proof is by induction on t. For t = 2 the theorem is just the factorization An1 ⊗ An2 =
(An1 ⊗ In2)(In1 ⊗An2) given in Equation 2.5. For the general case we have:
An1 ⊗An2 ⊗ · · · ⊗Ant = (An1 ⊗ I N
n1
)(In1 ⊗An2 ⊗ · · · ⊗Ant),
which by induction is equal to
(An1 ⊗ I N
n1
)
(
In1 ⊗
t∏
i=2
(IN(i−1)
n1
⊗Ani ⊗ I N
ni
)
.
6Using the tensor product identities, and the definition, we get
(An1 ⊗ I N
n1
)
t∏
i=2
(IN(i−1) ⊗Ani ⊗ I N
ni
)
which gives us the desired result.
Definition 4 (Stride Permutation) The stride permutation Lmnn is defined by
Lmnn (e
m
i ⊗ e
n
j ) = (e
n
j ⊗ e
m
i ).
Since (emi ⊗ e
n
j ) = e
mn
in+j and (e
n
j ⊗ e
m
i ) = e
mn
jm+i, the stride permutation L
mn
n reorders the elements
of an arbitrary vector x by moving the elements in the in+ j location to the jm+ i location. For
example, let x =
∑7
i=0 xie
8
i then,
L82x = L
8
2(
7∑
k=0
xke
8
k),
= L82

 3∑
i=0
1∑
j=0
x2i+je
8
2i+j

 ,
= L82

∑
i
∑
j
x2i+j(e
4
i ⊗ e
2
j)

 ,
=
∑
i
∑
j
x2i+jL
8
2(e
4
i ⊗ e
2
j),
=
∑
i
∑
j
x2i+j(e
2
j ⊗ e
4
i ),
=
∑
i
∑
j
x2i+je
8
4j+i.
7In matrix notation this becomes


x0
x2
x4
x6
x1
x3
x5
x7


= L82x =


1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0
0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1




x0
x1
x2
x3
x4
x5
x6
x7


.
Theorem 3 (Product Rule for Stride Permutations) If N = rst then
Lrstst = L
rst
s L
rst
t . (2.7)
Proof
First observe that
LNst(x
r ⊗ xs ⊗ xt) = xs ⊗ xt ⊗ xr .
Since also
LNs L
N
t (x
r ⊗ xs ⊗ xt) = LNs (x
t ⊗ xr ⊗ xs) = (xs ⊗ xt ⊗ xr).
The following are a few other properties of the stride permutation construct [5].
LNN = IN ,
LNn L
N
N/n = IN ,
(LNn )
−1 = LNN/n,
LNN/n = (L
N
n )
t
8For example,
(L82)
−1 = (L82)
t = L84 =


1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1


In general, the tensor product is not commutative. However, A⊗B and B ⊗ A only differ by a
permutation. The following permutation can be used to commute the tensor product.
Theorem 4 (Commutation Theorem)
Lmnn (A⊗B) = (B ⊗A)L
mn
n (2.8)
where A is an m×m matrix and B is an n× n matrix. In other words
B ⊗A = Lmnn (B ⊗A)(L
mn
n )
−1. (2.9)
Proof
The proof is based on the definition.
Lmnn (A⊗B)(e
m
i ⊗ e
n
j ) = L
mn
n (Ae
m
i ⊗Be
n
j ),
= (Benj ⊗Ae
m
i ).
Similarly
(B ⊗A)Lmnn (e
m
i ⊗ e
n
j ) = (B ⊗A)(e
n
j ⊗ e
m
i ),
= (Benj ⊗Ae
m
i ).
9Applying Theorem 4 to A⊗ In we get, (L
mn
n )
−1(In ⊗A)L
mn
n . Thus we can now write A⊗B as,
(A⊗ In)(L
mn
n )
−1(B ⊗ Im)L
mn
n (2.10)
(Lmnn )
−1(In ⊗A)L
mn
n (Im ⊗B). (2.11)
Using (Lmnn )
−1 = Lmnm we can rewrite these factorizations as
(A⊗ In)L
mn
m (B ⊗ Im)L
mn
n (2.12)
Lmnm (In ⊗A)L
mn
n (Im ⊗B). (2.13)
An important permutation occurring in FFT algorithms is the bit-reversal permutation [6].
Definition 5 (Bit-Reversal Permutation)
R2
t
(e2it−1 ⊗ · · · ⊗ e
2
i1 ⊗ e
2
i0) = (e
2
i0 ⊗ e
2
i1 ⊗ · · · ⊗ e
2
it−1).
Lemma 1 (Bit-Reversal Factorization)
R2
t
= (R2
j
⊗R2
t−j
)L2
t
2j (2.14)
R2
t
= (I2 ⊗ R
2t−1)L2
t
2 (2.15)
R2
t
=
t∏
i=1
(I2t−i ⊗ L
2i
2 ). (2.16)
A straight forward MATLAB implementation is given in Figure 2.1.
Digit-reversal is a generalization of the bit-reversal permutation (Definition 5), which occurs in
the Dimensionless FFT.
Definition 6 (Digit Reversal) The t-digit digit-reversal permutation R(n1,...,nt) is a permutation
matrix of size N = n1 · · ·nt defined by
R(n1,...,nt)(en1i1 ⊗ · · · ⊗ e
nt
it
) = entit ⊗ · · · ⊗ e
n1
i1
. (2.17)
The special case when n1 = · · · = nt = n is denoted by R
nt [1]. An alternative method to compute
the digit-reversal permutation matrix is as follows,
Let N = (n1 × · · · × nt)
10
function y = oneDBitRev(x)
% One Dimensional Bit Reversal Permutation.
% Input: x a vector of length n. n = 2^t, t an integer, t >= 0.
% Output: y = R^{2^t} x
% Algorithm:
n = length(x);
t = floor(log2(n));
% allocate space for y
y = zeros(n,1);
for i=0:n-1
j = bitrevind(i,t);
y(j+1) = x(i+1);
end
end
function revindex = bitRevind(inNum,size)
revindex = 0;
tempindex = inNum;
for i=0:size-1
bit = mod(tempindex,2);
tempindex = floor(tempindex/2);
revindex = revindex*2 + bit;
end
end
Figure 2.1: One Dimensional Bit Reversal Permutation.
Then,
R(n1,...,nt) = (Rn1 ⊗ I N
n1
)(In1 ⊗R
n2 ⊗ · · · ⊗Rnt) (2.18)
The following example illustrates Equation 2.18
R4,4 = (R4 ⊗ I4)(I4 ⊗R
4)
R2,2,4 = (R2 ⊗ I8)(I2 ⊗R
2 ⊗R4)
= (R2 ⊗ I8)(I2 ⊗
{
(R2 ⊗ I4)(I2 ⊗R
4)
}
)
A MATLAB implementation of the digit-reversal permutation is given in Figure 2.2.
11
2.2 The Discrete Fourier Transform
This section introduces the one-dimensional and multi-dimensional discrete Fourier transform
[5, 1].
2.2.1 One-Dimensional DFT
Definition 7 Fne
n
i = f
n
i , where f
n
i =
∑n−1
j=0 ω
ijenj , and ω = e
2pii
n .
The matrix representation is given by Fn = (ω
ij) 0 ≤ i, j < n. For example,
F2 =

 1 1
1 −1

 , F4 =


1 1 1 1
1 i −1 −i
1 −1 1 −1
1 −i −1 i


, F8 =


1 1 1 1 1 1 1 1
1 ω8 i ω
3
8 −1 ω
5
8 −i ω
7
8
1 i −1 −i 1 i −1 −i
1 ω38 −i ω8 −1 ω
7
8 i ω
5
8
1 −1 1 −1 1 −1 1 −1
1 ω58 i ω
7
8 −1 ω8 −i ω
3
8
1 −i −1 i 1 −i −1 i
1 ω78 −i ω
5
8 −1 ω
3
8 i ω8


,
Fn =


1 1 · · · 1
1 ω · · · ωn−1
...
... · · ·
...
1 ωn−1 · · · ωn−1
2


.
2.2.2 Multi-Dimensional DFT
Definition 8 Let X(a1, a2, · · · , at) be a function of t variables, where 0 ≤ aj < nj. The t dimen-
sional n1 × · · · × nt DFT of X is
Y (b1, · · · , bt) =
∑
0≤aj<nj
e
2pii
n1
a1b1 · · · e
2pii
nt
atbtX(a1, · · · , at). (2.19)
Since the summation indices are independent, Equation 2.19 can be rewritten as a nested sum.
Y (b1, · · · , bt) =
∑
0≤a1<n1
e
2pii
n1
a1b1

· · ·

 ∑
0≤at<nt
e
2pii
nt
atbtX(a1, · · · , at)

 · · ·

 .
12
function revArray = generalBitReversal(numdims, dims, x)
global XARRAY; global YARRAY;
XARRAY = x;
gBitRev(dims, 1, numdims, prod(dims), 0);
revArray = YARRAY’;
end
function gBitRev(dims, startIdxDims, numdims, N, startIdxX)
global XARRAY; global YARRAY;
n1 = dims(startIdxDims);
nb = N/n1;
%COMPUTE R_{N_1} \otimes I_{N/N_1}
bitRevTensor(n1, nb, startIdxX);
XARRAY = YARRAY;
%COMPUTE I_{N_1} OTIMES (R^{N_2} OTIMES ... OTIMES R^{N_T})
if(numdims > 1)
for i = 0 : n1-1
gBitRev(dims, startIdxDims+1, numdims-1, nb, (i*nb)+startIdxX);
end
end
end
function bitRevTensor(n1, nb, startIdxX)
global XARRAY; global YARRAY;
if nb > 1
arrayStart = startIdxX;
arrayEnd = startIdxX + nb;
numBits = floor(log2(n1));
for i = 0:n1-1
newBase = bitRevind(i, numBits);
dataManip((newBase*nb)+startIdxX,arrayStart,arrayEnd);
arrayStart = arrayEnd;
arrayEnd = arrayEnd + nb;
end
else
numBits = floor(log2(n1));
for i = startIdxX+1:startIdxX+n1
YARRAY(startIdxX+bitRevind(i-1, numBits)+1) = XARRAY(i);
end
end
end
function dataManip(base, arrayStart, arrayEnd)
global XARRAY; global YARRAY;
j=1;
for i = arrayStart+1:arrayEnd
YARRAY(base+j) = XARRAY(i);
j=j+1;
end
end
Figure 2.2: Generalized Bit Reversal Permutation.
13
The multi-dimensional DFT as represented by the nested sum can be computed by applying one-
dimensional DFTs along each dimension. A sequence of nt-point DFTs are applied to the functions
obtained by fixing the first t-1 inputs to X . Then a sequence of nt−1-point DFTs are applied to
the resulting function with all but (t − 1)-st variables fixed. This process continues until finally a
sequence of n1-point DFTs are applied. The one-dimensional DFTs are computed using the FFT
(explained in Section 2.3)
When computing a two-dimensional DFT if the input function X(a1, a2) is stored lexicographi-
cally as an n1 × n2 matrix, the computation proceeds by applying n2-point DFT to the rows of X
followed by n1-point DFT applied to the columns of the partially transformed matrix. This approach
is called the row-column algorithm.
The multi-dimensional DFT from Equation 2.19 can also represented using the tensor product.
If the input function X(a1, a2, · · · , at) and the output Y (a1, a2, · · · , at) are stored lexicographically
in vectors x and y then,
y = (Fn1 ⊗ Fn2 ⊗ · · · ⊗ Fn1)x. (2.20)
The following example illustrates the row-column approach to computing a two dimensional DFT
based on Equation 2.20. Let n1 = 4 and n2 = 4. We have,
x = (X(0, 0), X(0, 1), X(0, 2), X(0, 3), X(1, 0), X(1, 1), X(1, 2), X(1, 3),
X(2, 0), X(2, 1), X(2, 2), X(2, 3), X(3, 0), X(3, 1), X(3, 2), X(3, 3))T ,
y = (Y (0, 0), Y (0, 1), Y (0, 2), Y (0, 3), Y (1, 0), Y (1, 1), Y (1, 2), Y (1, 3),
Y (2, 0), Y (2, 1), Y (2, 2), Y (2, 3), Y (3, 0), Y (3, 1), Y (3, 2), Y (3, 3))T ,
and,
y = (F4 ⊗ F4)x,
Using Equation 2.5 we can decompose the tensor product to,
y = (F4 ⊗ I4)(I4 ⊗ F4)x
which implies
y =


I4 I4 I4 I4
I4 iI4 −I4 −iI4
I4 −I4 I4 −I4
I4 −iI4 −I4 iI4




F4
F4
F4
F4


x
14
From the example it is clear, (I4 ⊗ F4)x applies the 4-point DFT to the rows of X and (F4 ⊗
I4) applies the 4-point DFT to the columns of the partially transformed matrix. In general the
row-column algorithm for computing the two-dimensional DFT (Fn1 ⊗ Fn2) corresponds to the
factorization
y = (Fn1 ⊗ In2)(In1 ⊗ Fn2)x. (2.21)
Using Theorem 2 we can generalize the row-column algorithm applied to a t-dimensional DFT
FN = Fn1 ⊗ · · · ⊗ Fnt of size N = n1 · · ·nt, through the following factorization
FN =
t∏
i=1
(IN(i−1) ⊗ Fni ⊗ IN¯(i)).
2.3 Fast Fourier Transform
The Fast Fourier Transform is a method to compute the DFT of size N in O(N logN) time.
This section reviews matrix factorizations corresponding to the FFT [5, 6].
2.3.1 The Cooley-Tukey Factorization
The divide and conquer step in the FFT corresponds to the factorization Frs = (Fr⊗Is)T
rs
s (Ir⊗
Fs)L
rs
r , where T
rs
s is a diagonal matrix called the twiddle factor matrix.
Definition 9 (Twiddle Factor Matrix)
Tmnn (e
m
i ⊗ e
n
j ) = ω
ij
mn(e
m
i ⊗ e
n
j ), (2.22)
where ωmn is a primitive mn-th root of unity [6].
Definition 10 (Direct Sum) Let A0, · · · , An−1 be matrices. The direct sum of A0, · · · , An−1 is
the block diagonal matrix
n−1⊕
i=0
Ai = diag(A0, A1, · · · , An−1)
The twiddle matrix can be written as a direct sum.
Theorem 5
Tmnn =
m−1⊕
i=0
Wn(ωmn)
i,
15
where
Wn(α) =


1
α
...
αn−1


.
Proof
It follows from the definition that
Tmnn =
m−1⊕
i=0
n−1⊕
j=0
ωijmn,
=
m−1⊕
i=0
Wn(ω
i
mn),
=
m−1⊕
i=0
Wn(ωmn)
i.
For example,
T 21 = diag(1, 1),
T 42 = diag(1, 1, 1, ω4),
T 84 = diag(1, 1, 1, 1, 1, ω8, ω
2
8 , ω
3
8),
T 168 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, ω16, ω
2
16, ω
3
16, ω
4
16, ω
5
16, ω
6
16, ω
7
16).
Proposition 1 T srr = L
rs
s T
rs
s (L
rss)
−1 = Lrss T
rs
s L
sr
r .
Theorem 6 (Cooley-Tukey) Assume N = rs, then
Frs = (Fr ⊗ Is)T
rs
s (Ir ⊗ Fs)L
rs
r , (2.23)
Proof
The proof involves the defining properties of tensor products and the associated matrices that have
been presented. To see this, complete both sides on the basis elements esi ⊗ e
r
j .
(Fr ⊗ Is)T
rs
s (Ir ⊗ Fs)L
rs
r (e
s
i ⊗ e
r
j) = (Fr ⊗ Is)T
rs
s (Ir ⊗ Fs)(e
r
j ⊗ e
s
i )
16
= (Fr ⊗ Is)T
rs
s (e
r
j ⊗ f
s
i )
= (Fr ⊗ Is)T
rs
s
(
s−1∑
k=0
ωrik(erj ⊗ e
s
k)
)
= (Fr ⊗ Is)
(
s−1∑
k=0
ωrik+jk(erj ⊗ e
s
k)
)
=
(
s−1∑
k=0
ωrik+jk(f rj ⊗ e
s
k)
)
=
s−1∑
k=0
r−1∑
l=0
ωrik+sjl+jk(erl ⊗ e
s
k).
Since ω is an rs-th root of unit, ωrik+sjl+jk = ωrik+sjl+jk+rils = ωri+j)(sl+k). Using this observation
along with the map erl ⊗ e
s
k → e
rs
ls+k, shows that the last sum in the derivation is f
n
ri+j .
For example,
F16 = (F2 ⊗ I8)T
16
8 (I2 ⊗ F8)L
16
2
=

 I8 I8
I8 −I8

T 168

 F8 0
0 F8

L162
where
T 168 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, ω16, ω
2
16, ω
3
16, ω
4
16, ω
5
16, ω
6
16, ω
7
16)
and L162 is the stride permutation matrix. Figure 2.3 shows a MATLAB implementation of a recursive
two-power FFT on the factorization FN = (F2 ⊗ IN/2)T
N
N/2(I2 ⊗ FN/2)L
N
2 .
2.3.2 Iterative Cooley-Tukey Factorization
Using properties of the tensor product the recursive formulation can be converted to a formula
corresponding to an iterative computation.
Theorem 7 (Iterative Cooley-Tukey Factorization)
F2t =
{
t∏
c=1
(I2c−1 ⊗ F2 ⊗ I2t−c)
(
I2c−1 ⊗ T
2t−c+1
2t−c
)}
R2
t
, (2.24)
where R2
t
is the bit-reversal permutation.
Proof
The proof is by induction using Theorem 6 (Cooley-Tukey factorization), the multiplicative property
17
% Recursive Fast Fourier Transform.
% Input: x a vector of length n. n = 2^k, k an integer, k >= 0.
% Output: y = F_N x. F_N = (omega_N^{ij})_{0 <= i,j < N}
%
% Algorithm. Let m = N/2. T^N_m = I_m oplus W_m(omega_N)
% y = (F_2 otimes I_m)T^N_m(I_2 otimes F_m)L^N_2
%
function y = recurCTFFT(x)
N = length(x)
if N == 1
y = x
else
% [x0 x1] = L^N_2 x
x0 = x(1:2:N-1);
x1 = x(2:2:N);
% [t0 t1] = (I_2 otimes F_m)[x0 x1]
t0 = recurCTFFT(x0);
t1 = recurCTFFT(x1);
% w = W_m(omega_N)
w = exp((2*pi*i/N)*(0:n/2-1));
% y = [y0 y1] = (F_2 otimes I_m) T^N_m [t0 t1]
y0 = t0 + w.*t1;
y1 = t0 - w.*t1;
y = [y0 y1]
end
end
Figure 2.3: Recursive Cooley-Tukey FFT Algorithm.
18
of the tensor product (i.e. (I⊗AB) = (I⊗A)(I⊗B)),and the factorization of bit-reversal in Lemma 1
[5].
The following example illustrates the Iterative Cooley-Tukey factorization.
F16 =
{
4∏
c=1
(I2c−1 ⊗ F2 ⊗ I24−c)(I2c−1 ⊗ T
24−c+1
24−c )
}
R2
4
= (I21−1 ⊗ F2 ⊗ I24−1)(I21−1 ⊗ T
24−1+1
24−1 )
(I22−1 ⊗ F2 ⊗ I24−2)(I22−1 ⊗ T
24−2+1
24−2 )
(I23−1 ⊗ F2 ⊗ I24−3)(I23−1 ⊗ T
24−3+1
24−3 )
(I24−1 ⊗ F2 ⊗ I24−4)(I24−1 ⊗ T
24−4+1
24−4 )R
16
= (F2 ⊗ I8)(T
16
8 )(I2 ⊗ F2 ⊗ I4)(I2 ⊗ T
8
4 )
(I4 ⊗ F2 ⊗ I2)(I4 ⊗ T
4
2 )(I8 ⊗ F2)(I8 ⊗ T
2
1 )R
16
where
T 168 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, ω16, ω
2
16, ω
3
16, ω
4
16, ω
5
16, ω
6
16, ω
7
16)
(I2 ⊗ T
8
4 ) = diag(1, 1, 1, 1, 1, ω8, ω
2
8 , ω
3
8 , 1, 1, 1, 1, 1, ω8, ω
2
8 , ω
3
8),
(I4 ⊗ T
4
2 ) = diag(1, 1, 1, ω4, 1, 1, 1, ω4, 1, 1, 1, ω4, 1, 1, 1, ω4),
(I8 ⊗ T
2
1 ) = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1).
and R16 is the bit-reversal permutation as per Definition 5 and is given by the following matrix.
R16 =

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15

 .
The elements in the top row are indices of the input and they are mapped to the corresponding
elements in the bottom row, which are indices of the output. Figure 2.4 shows a MATLAB imple-
mentation of the iterative FFT.
2.4 Dimensionless FFT
This section introduces the Dimensionless FFT algorithm. The following proof [1] is key to
understanding the nature of the Dimensionless algorithm, and is the basis for this thesis. This
19
function y = iterCTFFT(x)
% Input: x a vector of length n. n = 2^t, t an integer, t >= 0.
% Output: y = F_{2^t} x
% Algorithm: Iterative.
% F_{2^t} = { Prod_{c=1}^t (I_{2^{c-1}} @ F_2 @ I_{2^{t-c}})
% (I_{2^{c-1}} @ T^{2^{t-c+1}}_{2^{t-c}}) } R^{2^t}
%
n = length(x);
t = ceil(log2(n));
xt = bitreversal(x);
yt = zeros(n,1);
for c=t:-1:1
m = 2^(c-1);
p = 2^(t-c);
% W = W_p(omega_{2p})
W = exp((2*pi*i)/(2*p)*-(0:p-1)’);
% yt = (I_m @ F_2 @ I_p)xt
for j=0:m-1
% y^{2p}_{j*2p+1} = (F_2 @ I_p)T^{2p}_p x^{2p}_{j*2p+1}
% = (F_2 @ I_p)(I_p $ W) x^{2p}_{j*2p+1}
xt((j*2+1)*p+1:(j+1)*2*p) = W .* xt((j*2+1)*p+1:(j+1)*2*p);
yt(j*2*p+1:(j*2+1)*p) = xt(j*2*p+1:(j*2+1)*p) + xt((j*2+1)*p+1:(j+1)*2*p);
yt((j*2+1)*p+1:(j+1)*2*p) = xt(j*2*p+1:(j*2+1)*p) - xt((j*2+1)*p+1:(j+1)*2*p);
end
xt = yt;
end
y = yt;
end
Figure 2.4: Iterative Cooley-Tukey FFT Algorithm.
20
algorithm will be later used in Chapter 3 to convert a standard one-dimensional FFT algorithm to
a multi-dimensional FFT algorithm.
Theorem 8 (Radix 2 Dimensionless FFT) Let N = 2K , and assume that N = n1 × · · · × nt,
where ni = 2
ki . Let K(i) = k1 + · · · + ki, with the boundary conditions K(t) = K and K(0) = 0
denote the i-th prefix sum. Let FN denote an arbitrary multi-dimensional DFT of size N = 2
K . Let
FN = Fn1 ⊗ · · · ⊗ Fnt denote an arbitrary multi-dimensional DFT of size N = 2
K . Then
FN =
{
K∏
i=1
(I2i−1 ⊗ F2 ⊗ I2K−K(i) )Ti
}
R, (2.25)
where
R = Rn1 ⊗ · · · ⊗ Rnt
and
Ti = I2K(j−1)+l−1 ⊗ T
2kj−l+1
2kj−l
⊗ I2K−K(j)
with i = K(j − 1) + l and 0 ≤ l < kj.
Proof
The proof follows from the one-dimensional FFT factorization (Theorem 7) and the fundamental
factorization of tensor products (Theorem 2).
Using the radix 2 FFT we can write
Fnj = BnjRnj ,
and from the Iterative Cooley-Tukey algorithm Theorem 7 we get,
Bnj =
kj∏
l=1
(I2l−1 ⊗ F2 ⊗ I2kj−l)(I2l−1 ⊗ T
2kj−l+1
2kj−l
).
Then from the properties of the tensor product and by definition of the multi-dimensional DFT
(Definition 8) and Equation 2.20,
FN = (Bn1 ⊗ · · · ⊗Bnt)(Rn1 ⊗ · · · ⊗Rnt)
21
Furthermore applying fundamental tensor product factorization (Theorem 2),
(Bn1 ⊗ · · · ⊗Bnt) =
t∏
j=1
(I2K(j−1) ⊗Bnj ⊗ I2K−K(j) )
If we expand each Bnj
FN =


t∏
j=1
kj∏
l=1
(I2K(j−1) ⊗ I2l−1 ⊗ F2 ⊗ I2kj−l ⊗ I2K−K(j) )(I2K(j−1) ⊗ I2l−1 ⊗ T
2kj−l+1
2kj−l
⊗ I2K−K(j) )

R.
Combining adjacent identity matrices
FN =


t∏
j=1
kj∏
l=1
(I2K(j−1)+l−1 ⊗ F2 ⊗ I2kj−l+K−K(j))(I2K(j−1)+l−1 ⊗ T
2kj−l+1
2kj−l
⊗ I2K−K(j) )

R. (2.26)
Setting i = k(j − 1) + l into this equation completes the proof.
The following examples will show how Equation 2.26 computes Dimensionless FFT on 16 points.
A Dimensionless FFT of 16 points is capable of computing a one-dimensional DFT on 16 points, a
two-dimensional DFT on 2×8, 4×4, or 8×2 points, a three-dimensional DFT on 4×2×2, 2×4×2,
or 2 × 2 × 4 points, and a four-dimensional DFT on 2 × 2 × 2 × 2 points. The one-dimensional,
two-dimensional and three-dimensional cases are illustrated.
• One-dimensional DFT on 16 points.
Since it is a 1D DFT on 16 points we have, t = 1, j = 1, K = 4, k1 = 4, K(1) = 4, K(0) = 0
F16 =
{
4∏
l=1
(I2K(1−1)+l−1 ⊗ F2 ⊗ I2k1−l+4−K(1))(I2K(1−1)+l−1 ⊗ T
2k1−l+1
2k1−l ⊗ I24−K(1) )
}
R16.
F16 =
{
4∏
l=1
(I2l−1 ⊗ F2 ⊗ I24−l)(I2l−1 ⊗ T
24−l+1
24−l )
}
R16.
F16 = (I21−1 ⊗ F2 ⊗ I24−1 )(I21−1 ⊗ T
24−1+1
24−1 )(I22−1 ⊗ F2 ⊗ I24−2)(I22−1 ⊗ T
24−2+1
24−2 )
(I23−1 ⊗ F2 ⊗ I24−3 )(I23−1 ⊗ T
24−3+1
24−3 )(I24−1 ⊗ F2 ⊗ I24−4)(I24−1 ⊗ T
24−4+1
24−4 )R
16.
F16 = (F2 ⊗ I8)(T
16
8 )(I2 ⊗ F2 ⊗ I4)(I2 ⊗ T
8
4 )(I4 ⊗ F2 ⊗ I2)(I4 ⊗ T
4
2 )(I8 ⊗ F2)(I8 ⊗ T
2
1 )R
16.
22
where
T 168 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, ω16, ω
2
16, ω
3
16, ω
4
16, ω
5
16, ω
6
16, ω
7
16)
(I2 ⊗ T
8
4 ) = diag(1, 1, 1, 1, 1, ω8, ω
2
8, ω
3
8 , 1, 1, 1, 1, 1, ω8, ω
2
8 , ω
3
8),
(I4 ⊗ T
4
2 ) = diag(1, 1, 1, ω4, 1, 1, 1, ω4, 1, 1, 1, ω4, 1, 1, 1, ω4),
(I8 ⊗ T
2
1 ) = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
and R16 is the bit-reversal permutation as per Definition 5. The R16 permutation is as follows,
R16 =

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15

 .
It is clear that the Dimensionless FFT computes the one-dimensional DFT in the same fashion
as the Iterative Cooley-Tukey factorization.
• Two-dimensional DFT on 16 points (F4 ⊗ F4).
We have,t = 2, K = 4, k1 = 2, k2 = 2, K(0) = 0, K(1) = 2, K(2) = 4.
(F4 ⊗ F4) =
2∏
j=1
kj∏
l=1
(I2K(j−1)+l−1 ⊗ F2 ⊗ I2kj−l+4−K(j))(I2K(j−1)+l−1 ⊗ T
2kj−l+1
2kj−l
⊗ I24−K(j) )R
(F4 ⊗ F4) =
k1∏
l=1
(I2K(1−1)+l−1 ⊗ F2 ⊗ I2k1−l+4−K(1))(I2K(1−1)+l−1 ⊗ T
2k1−l+1
2k1−l ⊗ I24−K(1))
k2∏
l=1
(I2K(2−1)+l−1 ⊗ F2 ⊗ I2k2−l+4−K(2))(I2K(2−1)+l−1 ⊗ T
2k2−l+1
2k2−l ⊗ I24−K(2))R
(F4 ⊗ F4) =
2∏
l=1
(I2l−1 ⊗ F2 ⊗ I22−l+4−2)(I2l−1 ⊗ T
22−l+1
22−l ⊗ I24−2)
2∏
l=1
(I2K(1)+l−1 ⊗ F2 ⊗ I22−l+4−4)(I2K(1)+l−1 ⊗ T
22−l+1
22−l ⊗ I24−4)R
(F4 ⊗ F4) = (I21−1 ⊗ F2 ⊗ I22−1+4−2 )(I21−1 ⊗ T
22−1+1
22−1 ⊗ I24−2 )
(I22−1 ⊗ F2 ⊗ I22−2+4−2 )(I22−1 ⊗ T
22−2+1
22−2 ⊗ I24−2 )
(I22+1−1 ⊗ F2 ⊗ I22−1)(I22+1−1 ⊗ T
22−1+1
22−1 )
(I22+2−1 ⊗ F2 ⊗ I22−2)(I22+2−1 ⊗ T
22−2+1
22−2 )R
(F4 ⊗ F4) = (F2 ⊗ I8)(T
4
2 ⊗ I4)(I2 ⊗ F2 ⊗ I4)(I2 ⊗ T
2
1 ⊗ I4)
23
(I4 ⊗ F2 ⊗ I2)(I4 ⊗ T
4
2 )(I8 ⊗ F2)(I8 ⊗ T
2
1 )R
(4,4).
where
(T 42 ⊗ I4) = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, i, i, i, i),
(I2 ⊗ T
2
1 ⊗ I4) = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
(I4 ⊗ T
4
2 ) = diag(1, 1, 1, ω4, 1, 1, 1, ω4, 1, 1, 1, ω4, 1, 1, 1, ω4),
(I8 ⊗ T
2
1 ) = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1).
R(4,4) is the digit-reversal permutation as per Definition 6. The R(4,4) permutation is as follows,
R(4,4) =

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 2 1 3 8 10 9 11 4 6 5 7 12 14 13 15

 .
• Three-dimensional DFT on 16 points (F2 ⊗ F2 ⊗ F4).
We have,t = 3, K = 4, k1 = 1, k2 = 1, k3 = 2, K(0) = 0, K(1) = 1, K(2) = 2, K(3) = 4.
(F2 ⊗ F2 ⊗ F4) =


∏3
j=1
∏kj
l=1(I2K(j−1)+l−1 ⊗ F2 ⊗ I2kj−l+4−K(j))
(I2K(j−1)+l−1 ⊗ T
2kj−l+1
2kj−l
⊗ I24−K(j) )R


(F2 ⊗ F2 ⊗ F4) =
k1∏
l=1
(I2K(1−1)+l−1 ⊗ F2 ⊗ I2k1−l+4−K(1))(I2K(1−1)+l−1 ⊗ T
2k1−l+1
2k1−l ⊗ I24−K(1))
k2∏
l=1
(I2K(2−1)+l−1 ⊗ F2 ⊗ I2k2−l+4−K(2))(I2K(2−1)+l−1 ⊗ T
2k2−l+1
2k2−l ⊗ I24−K(2))
k3∏
l=1
(I2K(3−1)+l−1 ⊗ F2 ⊗ I2k3−l+4−K(3))(I2K(3−1)+l−1 ⊗ T
2k3−l+1
2k3−l ⊗ I24−K(3))
(F2 ⊗ F2 ⊗ F4) =
1∏
l=1
(I2l−1 ⊗ F2 ⊗ I21−l+4−1)(I2l−1 ⊗ T
21−l+1
21−l ⊗ I24−1 )
1∏
l=1
(I2K(1)+l−1 ⊗ F2 ⊗ I21−l+4−2)(I2K(1)+l−1 ⊗ T
21−l+1
21−l ⊗ I24−2 )
2∏
l=1
(I2K(2)+l−1 ⊗ F2 ⊗ I22−l+4−4)(I2K(2)+l−1 ⊗ T
22−l+1
22−l ⊗ I24−4 )
(F2 ⊗ F2 ⊗ F4) = (I21−1 ⊗ F2 ⊗ I21−1+4−1 )(I21−1 ⊗ T
21−1+1
21−1 ⊗ I24−1 )
(I21+1−1 ⊗ F2 ⊗ I21−1+4−2 )(I21+1−1 ⊗ T
21−1+1
21−1 ⊗ I22)
24
(I22+1−1 ⊗ F2 ⊗ I22−1)(I22+1−1 ⊗ T
22−1+1
22−1 )
(I22+2−1 ⊗ F2 ⊗ I22−2+4−4 )(I22+2−1 ⊗ T
22−2+1
22−2 )
(F2 ⊗ F2 ⊗ F4) = (F2 ⊗ I8)(T
2
1 ⊗ I8)(I2 ⊗ F2 ⊗ I4)(I2 ⊗ T
2
1 ⊗ I4)
(I4 ⊗ F2 ⊗ I2)(I4 ⊗ T
4
2 )(I8 ⊗ F2)(I8 ⊗ T
2
1 )R
(2,2,4)
where
(T 21 ⊗ I8) = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
(I2 ⊗ T
2
1 ⊗ I4) = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
(I4 ⊗ T
4
2 ) = diag(1, 1, 1, ω4, 1, 1, 1, ω4, 1, 1, 1, ω4, 1, 1, 1, ω4),
(I8 ⊗ T
2
1 ) = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1).
and R(2,2,4) is the digit-reversal permutation as per Definition 6. The R(2,2,4) permutation is
as follows,
R(2,2,4) =

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 2 1 3 4 6 5 7 8 10 9 11 12 14 13 15

 .
The factorizations in each of these examples are the same except for the initial permutation and
the values of the twiddle factors in the diagonal matrix. Figure 2.5 shows a MATLAB implementation
of the Dimensionless FFT.
25
function y = dimlessICTFFT(t,dims,x)
% Computes Dimensionless Iterative Cooley-Tukey FFT
% t is the number of dimensions
% dims is the array of dimensions
k = log2(dims); % This is k_j | k(1)= log(n_1)....k(t)= log(n_t)
for i = 0 : t
K(i+1) = sum(k(1:i)); % This is K(j) | K(j)= k(1)+...+k(j), K(0)= 0
% and K(t)= N
end
N = prod(dims); % N is the total size of the DFT (N = n_1*n_2*....n_t)
bigK = log2(N);
xt = generalBitReversal(t, dims, x);
yt = zeros(N,1);
for c=1:t
for l = k(c):-1:1
m = 2^(K(c-1+1)+l-1);
p = 2^(k(c)-l+bigK-K(c+1));
a = m;
rs = 2^(k(c)-l+1);
s = 2^(k(c)-l);
b = 2^(bigK-K(c+1));
W = dimlessICT_Twid(a,rs,s,b);
for j=0:m-1
xt((j*2+1)*p+1:(j+1)*2*p) = W((j*p)+1:(j+1)*p).*xt((j*2+1)*p+1:(j+1)*2*p);
yt(j*2*p+1:(j*2+1)*p) = xt(j*2*p+1:(j*2+1)*p) + xt((j*2+1)*p+1:(j+1)*2*p);
yt((j*2+1)*p+1:(j+1)*2*p) = xt(j*2*p+1:(j*2+1)*p) - xt((j*2+1)*p+1:(j+1)*2*p);
end
xt = yt;
end
end
y = yt;
end
function w = dimlessICT_Twid(a,rs,s,b)
I = sqrt(-1);
for i=1:a
w_n = exp(-2*pi*I/rs*(0:s-1))’;
w_n = kron(w_n, ones(b,1));
if i == 1
w = w_n;
else
w = [w;w_n];
end
end
end
Figure 2.5: Dimensionless Iterative Cooley-Tukey FFT Algorithm.
26
3. Universal FFT Processor
The Universal FFT processor is a machine that computes an arbitrary dimension DFT, based
on a variant of the Dimensionless FFT (Theorem 11) similar to the Pease FFT [7, 13]. The Pease
algorithm has the benefit that the data flow in each stage is the same (constant geometry). This
allows the computation to be vertically folded using the same permutation hardware at each stage.
This Chapter introduces the underlying math of the Universal FFT processor, and subsequent
Chapters discuss the implementation details.
3.1 Pease Factorization
The Pease factorization is a variant of the Iterative Cooley-Tukey factorization. Using Theorem 4
the Iterative Cooley- Tukey algorithm can be converted to the Pease algorithm. Both factorizations
compute the DFT in O(N logN) time. However the data access pattern of the Pease factorization
is the same at every stage.
Theorem 9 (Pease Factorization)
F2t =
{
t∏
c=1
L2
t
2 (I2t−1 ⊗ F2)Tc
}
R2
t
, (3.1)
Tc = L
2t
2t−c(I2c−1 ⊗ T
2t−c+1
2t−c )L
2t
2c (3.2)
Proof
Using Theorem 4,
((I2c−1 ⊗ F2)⊗ I2t−c) = L
2t
2 (I2t−1 ⊗ F2)L
2t
2t−c .
Inserting this into the factorization in Theorem 7, we obtain
F2t =
{
t∏
c=1
(
L2
t
2c(I2t−1 ⊗ F2)L
2t
2t−c
)(
I2c−1 ⊗ T
2t−c+1
2t−c
)}
R2
t
.
Combining adjacent factors and using the observation that
L2
t
2c+1 = L
2t
2cL
2t
2 ,
27
which follows from the product rule (Theorem 3) for stride permutations, we obtain
F2t =
{
t∏
c=1
(
L2
t
2 (I2t−1 ⊗ F2)Tc
)}
R2
t
,
where
Tc = L
2t
2t−c(I2c−1 ⊗ T
2t−c+1
2t−c )L
2t
2c .
Theorem 10 (Pease Twiddle Factors)
Tc(e
2t−c
i ⊗ e
2c−1
j ⊗ e
2
k) = ω
ik
2t−c+1(e
2t−c
i ⊗ e
2c−1
j ⊗ e
2
k) (3.3)
Proof
Tc(e
2t−c
i ⊗ e
2c−1
j ⊗ e
2
k) = L
2t
2t−c(I2c−1 ⊗ T
2t−c+1
2t−c )L
2t
2c(e
2t−c
i ⊗ e
2c−1
j ⊗ e
2
k)
= ωik2t−c+1(e
2t−c
i ⊗ e
2c−1
j ⊗ e
2
k)
Corollary 1
Tc =
2t−c−1⊕
i=0
(
I2c−1 ⊗W
i2c−1
2
)
, (3.4)
and
W2 =

 1 0
0 ω

 ,
where ω is a primitive 2t-th root of unity.
Proof
By Theorem 10,
Tc =
2t−c−1⊕
i=0
2c−1−1⊕
j=0
1⊕
k=0
ωik2t−c+1
=
2t−c−1⊕
i=0
2c−1−1⊕
j=0

 1 0
0 ωik2t−c+1


=
2t−c−1⊕
i=0
2c−1−1⊕
j=0
W2(ω2t−c+1)
i
28
=
2t−c−1⊕
i=0
(I2c−1 ⊗W2(ω2t−c+1)
i)
=
2t−c−1⊕
i=0
(I2c−1 ⊗W2(ω2t)
i2c−1 )
=
2t−c−1⊕
i=0
(I2c−1 ⊗W
i2c−1
2 )
The following example shows how the Pease algorithm given by Equation 3.1 is used to compute
a 1D DFT on 16 points.
F16 =
{
4∏
c=1
L2
4
2 (I24−1 ⊗ F2)Tc
}
R2
4
,
F16 = L
16
2 (I8 ⊗ F2)T1L
16
2 (I8 ⊗ F2)T2L
16
2 (I8 ⊗ F2)T3L
16
2 (I8 ⊗ F2)T4R
16
where
T1 = diag(1, ω16, ω
2
16, ω
3
16, ω
4
16, ω
5
16, ω
6
16, ω
7
16),
T2 = diag(1, 1, ω8, ω8, ω
2
8 , ω
2
8 , ω
3
8, ω
3
8),
T3 = diag(1, 1, 1, 1, ω4, ω4, ω4, ω4),
T4 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1).
and R16 is the bit-reversal permutation as per Definition 5. The R16 permutation is as follows,
R16 =

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15

 .
Figure 3.1 gives a MATLAB implementation of the Pease FFT.
3.2 Dimensionless Pease Algorithm
The Dimensionless version of the Pease algorithm is easily derived from Theorem 8 in exactly
the same way as the Pease algorithm is derived from the Iterative Cooley-Tukey algorithm as shown
in Theorem 9.
Theorem 11 (Pease Dimensionless FFT) Let FN denote an arbitrary multi-dimensional DFT
29
function y = peaseFFT(x)
n = length(x);
t = floor(log2(n));
xt = oneDBitRev(x’);
yt = zeros(n, 1);
W = peaseTwids(t);
rowSz = 2^(t-1);
for c = t:-1:1
for r=0:2^(t-1)-1
xt(r*2+1+1) = W(((c-1)*rowSz)+r+1) * xt(2*r+1+1);
yt(r+1) = xt(2*r+1) + xt(2*r+1+1);
yt(r+n/2+1) = xt(2*r+1)-xt(2*r+1+1);
end
xt = yt;
end
y = yt;
end
function W = peaseTwids(t)
rowSz = 2^(t-1);
W = ones(t*rowSz,1);
for c = t:-1:1
w = exp(2*pi*j/2^(t-c+1));
for r=0:2^(t-1)-1
r1 = floor(r/2^(c-1));
W(((c-1)*rowSz)+r+1) = w^r1;
end
end
end
Figure 3.1: 1D Pease FFT Algorithm.
30
of size N = 2K , and assume that N = n1 × · · · × nt, where ni = 2
ki . Also let K(i) = k1 + · · ·+ ki,
with the boundary conditions K(t) = K and K(0) = 0. Then
FN =
{
K∏
i=1
L2
K
2 (I2K−1 ⊗ F2)T
′
i
}
P, (3.5)
where
P = Rn1 ⊗ · · · ⊗Rnt
and
T ′i = L
2K
2K−iTiL
2K
2i
with i = K(j − 1) + l and 0 ≤ l < kj.
Proof
This theorem is obtained from Theorem 8 by applying the commutation to each factor and then
combining adjacent factors. Using Theorem 4,
(I2i−1 ⊗ F2 ⊗ I2K−i ) = L
2K
2i (I2K−1 ⊗ F2)L
2K
2K−i . (3.6)
Inserting this into the factorization in Theorem 8, we obtain
FN =
{
K∏
i=1
(
L2
K
2i (I2i−1 ⊗ F2)L
2K
2K−i
)
Ti
}
P.
Combining adjacent factors and using the observation that
L2
K
2i+1 = L
2K
2i L
2K
2 ,
which follows from the product rule (Theorem 3) for stride permutations, we obtain
FN =
{
K∏
i=1
(
L2
K
2 (I2K−1 ⊗ F2)T
′
i
)}
P,
where
T ′i = L
2K
2K−iTiL
2K
2i .
31
Theorem 12 (Dimensionless Pease Twiddle Factors)
T ′i (e
2kj−l
a ⊗ e
2K−K(j)
b ⊗ e
2K(j−1)+l−1
c ⊗ e
2
d) = ω
ad
2kj−l+1
(e2
kj−l
a ⊗ e
2K−K(j)
b ⊗ e
2K(j−1)+l−1
c ⊗ e
2
d) (3.7)
Alternatively,
ωad
2kj−l+1
= ωad2
i−12K−K(j)
N
Proof
T ′i (e
2kj−l
a ⊗ e
2K−K(j)
b ⊗ e
2K(j−1)+l−1
c ⊗ e
2
d) =
L2
K
2K−i(I2K(j−1)+l−1 ⊗ T
2kj−l+1
2kj−l
⊗ I2K−K(j) )L
2K
2i (e
2kj−l
a ⊗ e
2K−K(j)
b ⊗ e
2K(j−1)+l−1
c ⊗ e
2
d)
= ωad
2kj−l+1
(e2
kj−l
a ⊗ e
2K−K(j)
b ⊗ e
2K(j−1)+l−1
c ⊗ e
2
d)
The second statement follows from
(kj − l + 1) + (i− 1) +K −K(j) = (kj − l + 1) +K(j − 1) + l − 1 +K −K(j) = K,
since this implies that N/2kj−l+1 = 2(i−1)+K−K(j) and consequently ω2kj−l+1 = ω
2(i−1)+K−K(j)
N .
It is possible to compute the twiddles for an arbitrary multidimensional DFT from the twiddles
of the corresponding one-dimensional DFT. In the one-dimensional case, the previous theorem is
simplified since j = 1, i = l, and kj = K(j) = K. Therefore the twiddle computation involves only
three factors:
T ′i (e
2K−i
a1 ⊗ e
2i−1
c1 ⊗ e
2
d1) = ω
a1d1
2K−i+1
(e2
K−i
a1 ⊗ e
2i−1
c1 ⊗ e
2
d1).
Relating this to the general case implies that a1 = a2
K−K(j) + b. Therefore, a = ⌊a1/2
K−K(j)⌋ and
the twiddle
ωad2
K−K(j)2i−1
N = ω
⌊a1/2
K−K(j)⌋2K−K(j)d2i−1
N (3.8)
32
The following examples are similar to those shown in Section 2.4. Using equations from Theo-
rem 11 we proceed to show how one-dimensional, two-dimensional and three-dimensional DFTs can
be computed.
• One-dimensional DFT on 16 points.
Since it is a 1D DFT on 16 points we have,t = 1, j = 1, K = 4, k1 = 4, K(1) = 4, K(0) = 0
F16 =
{
4∏
c=1
L2
4
2 (I24−1 ⊗ F2)Tc
}
R2
4
,
F16 = L
16
2 (I8 ⊗ F2)T1L
16
2 (I8 ⊗ F2)T2L
16
2 (I8 ⊗ F2)T3L
16
2 (I8 ⊗ F2)T4R
16
where
T1 = diag(1, ω16, ω
2
16, ω
3
16, ω
4
16, ω
5
16, ω
6
16, ω
7
16),
T2 = diag(1, 1, ω8, ω8, ω
2
8 , ω
2
8, ω
3
8 , ω
3
8),
T3 = diag(1, 1, 1, 1, ω4, ω4, ω4, ω4),
T4 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1).
and R16 is the bit-reversal permutation as per Definition 5. The R16 permutation is as follows,
R16 =

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15

 .
It is clear that the Dimensionless FFT computes the one-dimensional DFT in the same fashion
as the Iterative Cooley-Tukey factorization. A comparison of the examples shown for each is
proof of this statement.
• Two-dimensional DFT on 16 points (F4 ⊗ F4).
We have,t = 2, K = 4, k1 = 2, k2 = 2, K(0) = 0, K(1) = 2, K(2) = 4.
(F4 ⊗ F4) =
{
4∏
c=1
L2
4
2 (I24−1 ⊗ F2)Tc
}
R2
4
,
(F4 ⊗ F4) = L
16
2 (I8 ⊗ F2)T1L
16
2 (I8 ⊗ F2)T2L
16
2 (I8 ⊗ F2)T3L
16
2 (I8 ⊗ F2)T4R
16
33
where
T1 = diag(1, 1, 1, 1, ω4, ω4, ω4, ω4),
T2 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
T3 = diag(1, 1, 1, 1, ω4, ω4, ω4, ω4),
T4 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1).
and R(4,4) is the digit-reversal permutation as per Definition 6. The R(4,4) permutation is as
follows,
R(4,4) =

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 2 1 3 8 10 9 11 4 6 5 7 12 14 13 15

 .
• Three-dimensional DFT on 16 points (F2 ⊗ F2 ⊗ F4).
We have,t = 3, K = 4, k1 = 1, k2 = 1, k3 = 2, K(0) = 0, K(1) = 1, K(2) = 2, K(3) = 4.
(F2 ⊗ F2 ⊗ F4) =
{
4∏
c=1
L2
4
2 (I24−1 ⊗ F2)Tc
}
R2
4
,
(F2 ⊗ F2 ⊗ F4) = L
16
2 (I8 ⊗ F2)T1L
16
2 (I8 ⊗ F2)T2L
16
2 (I8 ⊗ F2)T3L
16
2 (I8 ⊗ F2)T4R
16
where
T1 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
T2 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
T3 = diag(1, 1, 1, 1, ω4, ω4, ω4, ω4),
T4 = diag(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1).
and R(2,2,4) is the digit-reversal permutation as per Definition 6. The R(2,2,4) permutation is
as follows,
R(2,2,4) =

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 2 1 3 4 6 5 7 8 10 9 11 12 14 13 15

 .
34
The examples clearly show that the Dimensionless FFT approach holds true, when applied to
the Pease FFT algorithm. The computation flow remains unchanged. Only the twiddle factors
and initial permutation change as the number of dimensions and dimension sizes change. Figure 3.2
shows a MATLAB implementation of the twiddle value generation for the Dimensionless Pease FFT.
Figure 3.3 shows a MATLAB implementation of the Dimensionless Pease FFT.
3.3 Universal FFT Processor
General purpose machines have been the traditional tools for DFT computations, and there are
many fast libraries that can carry out this operation efficiently [14, 4]. Recent studies have shown
that there is some advantage in using a co-processor that exclusively computes the DFT operation,
as compared to the more traditional approach. The Universal FFT processor provides one such co-
processor. In theory it can compute a DFT having arbitrary number of dimensions with individual
dimension sizes of any length.
To arrive at a hardware design that computes a multi-dimensional DFT, we need to visually
represent the examples presented in the previous section. This is done using a data-flow graph. For
each example computation, namely F16, F4 ⊗ F4 and F2 ⊗ F2 ⊗ F4, data-flow graphs are shown in
Figure 3.4, Figure 3.5 and Figure 3.6 respectively. In these graphs data flows from left to right. The
first rewiring in space is the Digit Reversal permutation. After the initial permutation, data passes
through K vertical stages of processing (Equation 3.5). Between each vertical stage is the Stride
permutation rewiring given by LN2 (Equation 3.5). Meanwhile in each vertical stage there are N/2
number of F2 computations. As stated previously, a F2 unit computes the sum and difference of the
inputs to obtain the output. Each input of the F2 unit is also multiplied with a twiddle factor. (the
diagram shows only one multiplication because the other twiddle factor is always 1).
From the data-flow graph it is evident that the computation has a regular and repetitive structure
with a straightforward data access pattern. The overall structure of the computation remains the
same, irrespective of individual dimension sizes and number of dimensions. Thus, it is possible to
define a system with an array of processors, such that it can compute any multi-dimensional DFT.
Figure 3.7 shows the Universal FFT processor system diagram. The processors are laid out in the
same format as the data-flow diagram.
The Universal FFT processor is made up of a number of Processing Elements (PE). Each PE has
three sub-components. A Computation Unit (CU), Memory Unit (MU) and the Address Generator
35
function Twid = mdpTwidGen(t,dims)
% This program calculates the Twiddle factors for a multi-dimensional
% DFT. The Twiddle factors will be in format such that, they can be
% used in the Pease FFT algorithm modified to Dimensionless FFT.
% t is the number of dimensions of the multi-dim DFT
% dims is an array of dimension sizes. [n_1 n_2 n_3 .... n_t]
k = log2(dims); % This is k_j | k(1)= log(n_1)....k(t)= log(n_t)
for i = 0 : t
K(i+1) = sum(k(1:i));
% K(j)= k(1)+...+k(j), K(0)= 0 and K(t)= N
end
% N is the total size of the DFT (N = n_1*n_2*....n_t)
N = prod(dims);
bigK = log2(N);
Twid = [];
for j = 1 : t
for l = 1 : k(j)
m1 = 2^(k(j) - l);
m2 = 2^(bigK - K(j+1));
m3 = 2^(K(j-1+1)+l-1);
T = TwidGen(m1,m2,m3,N);
Twid = [Twid;T];
end
end
end
function twid = TwidGen(m1,m2,m3,N)
twid = ones(N/2, 1);
w_n = exp(2*pi*j/(2*m1));
for a = 0 : m1-1
for b = 0 : m2-1
for c = 0 : m3-1
twid(a*m2*m3+b*m3+c+1, 1) = w_n^a;
end
end
end
end
Figure 3.2: Pease Dimensionless Twiddle Generator.
36
function y = dimlessPeaseFFT(t,dims,x)
N = prod(dims); % N is the total size of the DFT (N = n_1*n_2*....n_t)
bigK = log2(N);
xt = generalBitReversal(t, dims, x); %Given in Chapter 2.
yt = zeros(N, 1);
W = mdpTwidGen(t,dims);
rowSz = 2^(bigK-1);
for c = bigK:-1:1
for r=0:2^(bigK-1)-1
xt(r*2+1+1) = W(((c-1)*rowSz)+r+1) * xt(2*r+1+1);
yt(r+1) = xt(2*r+1) + xt(2*r+1+1);
yt(r+N/2+1) = xt(2*r+1)-xt(2*r+1+1);
end
xt = yt;
end
y = yt;
end
Figure 3.3: Dimensionless Pease FFT Algorithm.
(AG) [8]. The CU computes the sum and difference of its inputs and applies the appropriate
twiddle factors to its output. The MU stores the twiddle factors that correspond to the particular
Computation Unit. The AG helps the Computation Unit in selecting the appropriate twiddles from
its Memory Unit. In Figure 3.7 the Address Generator is trivial since each PE use only one twiddle
factor; however later we will reuse some of the PE, in which case it will be necessary to select
twiddles. The Processing Elements are arranged in vertical stages, and between each stage is the
Permutation Block. This component permutes the input by a re-wiring in space to produce an
output. A digit-reversal Permutation Block is placed at both the input and output. Through the
Control Block, digit-reversal occurs either at the input or output. The Universal FFT processor is
setup such that depending on the DFT to be computed, the processor can be reconfigured at run
time. Reconfiguration involves the following steps; inserting the correct twiddles in each MU, reseting
the Address Generator, rewiring the Permutation Block and setting the location of digit-reversal.
In general the Universal FFT processor allows different permutations at each stage, not just
the stride permutation used in the Pease algorithm. However, when the Pease algorithm is used,
or any other constant geometry algorithm, the stages in the processor can be reduced. In this
arrangement the original processor structure can be horizontally folded, as shown in Figure 3.8.
Though this arrangement introduces a few more components, overall this arrangement uses fewer
resources. This arrangement has a single vertical stage and a single Permutation Block. The output
37
Figure 3.4: 16 point Pease 1D FFT.
38
Figure 3.5: 4x4 point Pease 2D FFT.
39
Figure 3.6: 2x2x4 point Pease 3D FFT.
40
Figure 3.7: Universal FFT Processor.
41
Figure 3.8: Universal FFT Processor Folded Horizontally.
42
of the Permutation Block is registered and fed back to the input. At the input is a multiplexer
controlled by the Control Block. This multiplexer chooses between a fresh input and the feedback.
In this arrangement the Address Generator is non-trivial. The Memory Unit stores all the twiddles
required by a particular CU for all vertical stages of computation. Thus, the AG has to provide the
correct indexing mechanism, that allows the CU to access the appropriate twiddle depending on the
vertical stage of computation.
The Universal FFT processor is a general approach of achieving DFT computations. The follow-
ing Chapters discuss a concrete version of this machine, and the methods used to realize a working
model of the Universal FFT processor.
43
4. 1D DFT Core Generator
This Chapter provides an overview of the 1D DFT Core Generator from Nordin, Milder, Hoe,
and Pu¨schel [11, 12]. The software is accessible online through www.spiral.net and provides an
intuitive interface to specify and control the core generation process. The term core is used as a
representation of the source file generated by the program. The source file is written in Verilog and
contains Register Transfer Level structural descriptions that can be synthesized on FPGAs.
The Chapter describes the core generator and the necessary background information on FPGAs
and the design process and tools used in the development of FPGA design. Section 4.1 provides
an overview on FPGAs, Section 4.2 covers the basics of application development on FPGAs and
the need for IP cores, and Section 4.3 describes the process of extracting the design details of
the core from the algorithm. Section 4.4 presents the hardware parameters that control the core
generation process, Section 4.5 discusses an implementation approach to realize stride permutation
in hardware, and Section 4.6 describes the DFT core internals in detail. Section 4.7 and Section 4.8
cover additional implementation details.
4.1 Hardware Platform Overview
FPGAs [21] are reconfigurable hardware systems that have gained popularity in recent times
as an inexpensive solution for diverse engineering problems. Designs rendered on an FPGA can
compute certain problems faster than software implementations (disregarding data transfer costs to
and from the FPGA) because FPGAs can fully utilize micro-architectural parallelism in the design.
Furthermore, the device is reconfigurable; thus as and when problem specifications change, different
designs can be deployed on the same hardware.
Currently the major FPGA vendors are Xilinx (www.xilinx.com) and Altera (www.altera.com).
FPGAs from these vendors are shipped with massive amounts of reconfigurable fabric. The recon-
figurable fabric is made up of elements called slices1. Each slice has two 4-to-1 lookup tables and
two 1-bit registers. Slices are connected via interconnection networks that are a part of the reconfig-
urable fabric, and designs are rendered by routing logic through them. In addition to this, FPGAs
typically have hardwired logic such as multipliers and memory, and pins that can be connected to
external peripherals. On Xilinx FPGAs the memory elements are called Block RAMs and have
1Slice is the term used to describe the basic logic building block in Xilinx FPGAs.
44
a 2-KByte capacity. Thus, FPGAs provide sufficient resources to efficiently implement large and
complex hardware designs.
4.2 FPGA Application Development
Application development on FPGAs involve writing code in a hardware description language
(HDL) such as Verilog, that describes the hardware design. A standard practice is to employ
reusable IP cores for components that are commonly used, like Dual ported RAM modules, FIFOs
and DSP elements. The complete source base is then processed by the development tool chain.
Xilinx ISE2 synthesizes, maps, places and routes the Verilog source to generate a binary file. Finally
the binary file is programmed onto the FPGA over a cable called JTAG. Figure 4.1 shows the
application development process on a Xilinx FPGA. Providing a rich set of configurable IP cores
dramatically speeds up the design process.
Figure 4.1: Application Development on a Xilinx FPGA.
The DFT computation is one such IP that is repeatedly used in almost all DSP applications.
Similar to hand tuned libraries distributed by processor manufacturers [4, 14], FPGA vendors provide
IPs that improve application performance [20]. However, a shortcoming of reusable IPs is that the
end-designers cannot dictate application-specific customizations. These customizations generally
revolve around satisfying the following three constraints: area (amount of resources used/available),
2Xilinx ISE is the development tool chain in a Xilinx environment.
45
performance and numerical accuracy. The designers of the 1D DFT core generator argue that
parameterized IP generation is particularly well suited for DSP kernels such as transforms because
of their well-understood structure and regularity [12].
To support end-designer customization, the 1D DFT core generator provides parameters that
can be grouped into two categories: those that control the functionality of the DFT core and those
that control hardware implementation choices. The following two sections explain the process of
formulating these parameters and the role they play in the core generation process.
4.3 Algorithm to Core
The 1D DFT core generator [11, 12] uses the Pease FFT algorithm (Equation 3.1)to compute
the DFT. Recall the Pease factorization from Theorem 9
F2t =
{
t∏
c=1
L2
t
2 (I2t−1 ⊗ F2)Tc
}
R2
t
.
Since (F2t)
T = F2t an alternative (dual) Pease factorization is possible
F2t = R
2t
{
t∏
c=1
Tc (I2t−1 ⊗ F2)L
2t
2t−1
}
and the core generator supports both formulae. The description in [11, 12] use the dual design. For
the purpose of this discussion we will disregard the formula that specifies how the twiddle values are
calculated (that process will be discussed in Section 4.6). Chapter 3 also has examples illustrating
how the formula is used to calculate the one-dimensional DFT on 16 points. Figure 3.4 represents
the data flow graph of the computation.
From the data flow graph, it is clear that the Pease algorithm has a repetitive and regular
structure. The regular structure of the Pease algorithm makes it easy to reuse hardware to compute
each factor L2
t
2 (I2t−1 ⊗ F2)Tc in the computation through a process called Horizontal Data-path
Folding. The reuse saves on the amount of FPGA resources needed for the core without an increase
in latency.
The FPGA resources needed for the computation of each factor can be further reduced by reusing
computational units for the butterfly computation (F2 along with multiplication by the twiddle
factor). The number of butterfly units can be selected, thereby controls the amount of concurrency
and allows the designer to experiment with space time tradeoffs. This structural transformation is
46
called Vertical Data-path Folding. It should be noted here that folding does not reduce the required
number of computations; rather it reduces the area on the FPGA needed at the expense of increased
latency. The following subsections discuss the two structural transformations greater detail.
4.3.1 Horizontal Data-path Folding
In accordance with the Pease formula (Equation 3.1), a DFT computation of size N = 2t has t
stages. Each stage performs the the (I2t−1 ⊗ F2)T computation. F2 takes two inputs and computes
the sum and difference of the inputs. I2t−1 specifies the number of such F2 computations in a given
stage, and T which contains the twiddle factors is multiplied to the input of each F2 unit. Thus
the computation of an N = 2t point DFT requires t = log2(N) stages with each stage needing N/2
butterfly computations. A direct implementation would require N/2 log2(N) butterfly units. For
example, a 16-point DFT requires 4 stages with 8 butterfly units per stage for a total of total of 32
butterfly units. Thus, a direct implementation will result in a shortage of resources on the FPGA;
as the size of the input increases there will not be enough resources on the FPGA to synthesize the
hardware. This can be remedied in two steps.
The first step is Horizontal Data-path Folding. Instead of log2(N) number of stages, we use
only one stage with data being fed back log2(N) number of times, reducing resources by a factor
of log2(N). The modified data flow diagram is shown in Figure 4.2 [12]. The figure shows a single
vertical stage with extra hardware for the feedback process. The output of the single vertical stage
is registered and fed back to the input. At the input, there is a multiplexer that chooses between
the feed back output and fresh input. There is also some extra logic that the figure does not show.
This extra logic controls the selection of the appropriate twiddles for each stage of the computation.
Because of horizontal folding, the twiddle values to the N/2 multipliers must be changed at every
stage.
Horizontal folding enables resource savings without performance loss. Transforming the structure
as described, introduces no latency in the overall system; which means that there is no need to
partially fold in order to optimize the computation [12].
4.3.2 Vertical Data-path Folding
The second step to optimizing the hardware design is to start from a fully horizontally folded
data-path, and carry out Vertical Data-path Folding. Vertical Data-path Folding is similar to the
previous folding mechanism, except that the number of F2 units in the single vertical stage is reduced
47
Figure 4.2: Horizontal Data-path Folding.
to varying degrees. The degree of folding varies from no vertical folding to full vertical folding. In
full vertical folding there is only one F2 unit and data is fed back to it
N log2(N)
2 times. Vertical
folding can be increased or decreased as powers of 2, thus giving various levels of concurrency, and
therefore different area/performance tradeoffs.
Figure 4.3: Vertical Data-path Folding.
Figure 4.3 [12] shows the modified data flow diagram after full vertical folding has been carried
out. Vertical folding introduces some additional complexity to the hardware design. When no
vertical folding is done, the L2
t
2 stride permutation is a simple permutation in space implemented
with wires. However, due to vertical folding the input to the F2 unit needs to buffered and sequenced
48
in order to execute correctly. The additional buffer unit labeled v-fold(L162 ) in Figure 4.3 represents
this modified stride permutation component. Section 4.5 explains in detail how the above is achieved.
4.4 Hardware Parameters
To accomplish greater application specific customizations, it is possible to formulate a set of
parameters from the information presented in the previous section. These parameters offer different
options to the end-designer, with which the core can be optimized to meet all design constraints. As
stated before, the parameters fall in two categories: those that control functionality of the hardware
and those that control implementation choices. The parameters are described below and have been
summarized for convenience in Table 4.1. This information was obtained from [11, 12] .
The following parameters control the functionality of the generated IP core.
• N : specifies the size of the DFT to be computed. The input and output will be a N -element
complex vector. The value of N has to be a power of 2 and N ≥ 4.
• w: specifies the precision of the fixed-point representation of the input, the output, and the
intermediate data values. A total of 2w bits will be required to represent both the real and
imaginary components.
• t: specifies the precision of the fixed-point representation for the precomputed twiddle factors.
The real and imaginary components are each represented using 1-bit sign, 1-bit integer, and
t− 2-bit fraction. We require 2 < t ≤ w.
• dir: specifies whether the core will accept bit-reversed input vectors and outputs natural-
ordered output vectors (dir = 0), or vice versa (dir = 1).
• scale: This boolean parameter determines whether the fixed-point data representation of in-
termediate values is scaled by a factor of 2 after each butterfly block to avoid overflow.
The following parameters control the implementation choices.
• p: This parameter specifies the number of butterfly units instantiated in the DFT implemen-
tation. In other words p specifies the degree of vertical folding. The input and output data
vectors are streamed 2p elements per cycle over N/(2p) cycles. Of all the parameters p is
the primary micro-architectural parameter that allows end-designers to customize the tradeoff
between performance and resource usage. Parameter p must be a power of 2 and 1 ≤ p ≤ N/2.
49
• twid: parameter controls whether the twiddle factor tables use the 16-Kbit Block RAM
(BRAM) macros in the Xilinx FPGA architecture (twid = 0), or distributed among the slices
in the FPGA (twid = 1). If the twiddle factor tables are greater than 16-Kbit in size, BRAM
mode is selected automatically.
• thr: If p < n/4, the stride permutation module requires FIFO queues ranging in depth from 2
to size n/(2p). FIFOs deeper than thr use BRAMs for storage. FIFO queues requiring more
than 16-Kbit of storage are automatically set to use BRAM.
meaning range
N DFT size 22 to 214
w data-path width 6 to 32
t twiddle bit-width 6 to w
dir location of bit-reversal input (0) or output (1)
scale arithmetic mode scaled or unscaled
p degree of parallelism 1 to n/2
twid twiddle storage two twiddle storage options
thr FIFO of size ≥ thr 2 to n/(2p)
is implemented in BRAM
Table 4.1: DFT IP generator parameters.
4.5 Stride Permutation through Scalable Interconnection Network
The Pease algorithm requires the LNN/2 or L
N
2 stride permutation (Chapter 2 Section 4) before
or after every stage depending on the value of the parameter dir. From the data-flow graphs
shown in the previous Chapter, implementing this permutation appears trivial. However, with the
introduction of vertical folding, stride permutation is no longer a straightforward reordering of wires
in space; a more complex system of buffering and sequencing of data is required. This is accomplished
by using a system of Delay Switch Delay (DSD) units. This arrangement was first presented by
Takala, et al. [15, 16] and the following is a discussion of this technique.
4.5.1 The J Permutation
In order to achieve folded stride permutation in hardware, a new permutation construct is intro-
duced. This permutation denoted by J is defined as follows.
50
Definition 11 JN is the permutation of an N-point input vector defined by
JN (e
2
i ⊗ e
N/4
j ⊗ e
2
k) = e
2
k ⊗ e
N/4
j ⊗ e
2
i . (4.1)
The above definition is illustrated using the following example. Let N = 8 then
J8(e
2
i ⊗ e
2
j ⊗ e
2
k) = e
2
k ⊗ e
2
j ⊗ e
2
i ,

x0
x1
x2
x3
−
x4
x5
x6
x7


=


x0
x4
x2
x6
−
x1
x5
x3
x7


which is the permutation 
 0 1 2 3 4 5 6 7
0 4 2 6 1 5 3 7


and when applied to an input vector x swaps all the odd elements in the top half of the input with
all the even elements of the bottom half.
This behavior is true in general, since
e2i ⊗ e
N/4
j ⊗ e
2
k → e
2
k ⊗ e
N/4
j ⊗ e
2
i
implies the elements located at iN/2 + 2j + k and kN/2 + 2j + i are swapped. If i = k the input
remains fixed. If i 6= k then the element in the bottom half in location 2j + 1 is swapped with the
element at location N/2 + 2j in the top half.
The following lemma gives the relation between the stride permutation and the J permutation.
Lemma 2
LNN/2 = JN (I2 ⊗ L
N/2
N/4) (4.2)
51
alternatively
LN2 = (I2 ⊗ L
N/2
2 )JN (4.3)
Proof
JN (I2 ⊗ L
N/2
N/4)(e
2
i ⊗ e
2
j ⊗ e
N/4
k ).
= JN (e
2
i ⊗ e
N/4
k ⊗ e
2
j).
= (e2j ⊗ e
N/4
k ⊗ e
2
i ).
= LNN/2(e
2
i ⊗ e
2
j ⊗ e
N/4
k ).
Equation 4.3 is obtained by applying the transpose operation to Equation 4.6.
Recursively applying Lemma 2 we get the following theorem.
Theorem 13 For an N -point input where N = 2k. The stride permutation is given by,
LNN/2 =
k−1∏
i=0
(I2i ⊗ JN/2i) (4.4)
The following example illustrates the theorem.
L168 = J16(I2 ⊗ J8)(I4 ⊗ J4)(I8 ⊗ J2),
= J16(I2 ⊗ J8)(I4 ⊗ J4).
When partial folding is used the recursion in Theorem 13 should be stopped when the number
of physical inputs are reached. If p is the number of butterfly units,
LNN/2 =
{
k−p−1∏
i=0
(I2i ⊗ JN/2i)
}
(I2k−p ⊗ L
2p
p /2) (4.5)
Applying the new equation to the previous example with p = 2 we get,
L168 = J16(I2 ⊗ J8)(I4 ⊗ L
4
2)
The following lemma is needed to implement JN when p > 1.
Lemma 3
J2Np = (L
Np
N ⊗ I2)(Ip ⊗ J2N )(L
Np
p ⊗ I2) (4.6)
52
Proof
(LNpN ⊗ I2)(Ip ⊗ J2N )(L
Np
p ⊗ I2)(e
2
i ⊗ e
N/2
j ⊗ e
p
k ⊗ e
2
l )
= (LNpN ⊗ I2)(Ip ⊗ J2N )(e
p
k ⊗ e
2
i ⊗ e
N/2
j ⊗ e
2
l )
= (LNpN ⊗ I2)(e
p
k ⊗ e
2
l ⊗ e
N/2
j ⊗ e
2
i )
= (e2l ⊗ e
N/2
j ⊗ e
p
k ⊗ e
2
i )
= J2Np(e
2
i ⊗ e
N/2
j ⊗ e
p
k ⊗ e
2
l ).
4.5.2 Delay Switch Delay Unit
The J permutation can be implemented with a delay switch delay unit (DSD) which lends itself
to streaming computation. Let DN where N = 4m be a two input switch with FIFO of length m in
front and in back of the switch as shown in Figure 4.4.
Figure 4.4: Delay Switch Delay Unit.
StreamingN elements through the delay switch delay unitDN in a round robin schedule produces
an output stream which is reordered by the JN permutation if the switch is set to 0 for the first
N/4 cycles and 1 for the second N/4 cycles. To see this refer Figure 4.5 where El and Ol are the
first N/4 even and odd indexed input elements while Eh and Oh are the second N/4 even and odd
indexed input elements. For example, Figure 4.6 shows the DSD unit in operation when N = 16.
53
Figure 4.5: Delay Switch Delay Operation.
4.6 DFT Core Internals
The DFT core consists of four main modules. They are the Stride Permutation module, Top
Level module, Butterfly Block or also known as the Kernel module and finally the ROMmodule. The
Top Level module wraps the entire hardware design into one single entity. It also has a Finite State
Machine (FSM) that controls the execution of the hardware. The Butterfly Block carries out the
F2 unit operation, calculating the sum and difference of its inputs and multiplying the appropriate
twiddle factor values. The Stride Permutation module accesses the input at a particular stride to
produce output that is fed to other modules connected to it. Finally, the ROM module stores the
twiddle factors, and provides a mechanism that allows these values to be indexed into and retrieved
from the ROM. The following subsections describe these modules in greater detail.
4.6.1 Stride Permutation Module
The stride permutation LNN/2 can be implemented as a streaming computation using a cascade of
delay switch delay units. Assume there are p butterfly units, then the cascade shown in Figure 4.7
implements LNN/2, where the N input elements are streamed in N/2p chunks of size 2p, scheduled in
round robin fashion. The correctness of this implementation follows from Theorem 13 and Lemma 3
along with the observation that DN implements JN .
The remainder of this section, delineates the implementation of the subcomponents in the stride
permutation hardware.
Permutation Module: On each cycle, a 2p-element sub-vector enters the perm module and first
undergoes an L2pp permutation by wire reordering. For the next N/(2p)− 1 cycles, the 2p-element
vector passes through the DSD Cascade component. Figure 4.8 provides source code in Verilog that
54
Figure 4.6: Delay Switch Delay Example.
55
Figure 4.7: Stride Permutation Hardware.
defines the Permutation module.
DSD Cascade Component: The DSD Cascade component consists of log2(N/2p) successive
stages of DSD units. The DSD units are numbered DSD4, · · · ,DSDN/2p,DSDN/p. For a given p,
the perm module contains p number of DSD Cascade components. Figure 4.9 provides source code
in Verilog that defines the DSD Cascade component.
DSD Unit: Each DSDm unit requires two m/4-entry synchronous FIFOs and a programmable
switch that allows the two data values either to pass-through for m/4 cycles or to criss-cross for m/4
cycles. Figure 4.4 shows the block diagram. The DSD Control component provides the necessary
control signals to operate the programmable switch. Figure 4.10 provides source code in Verilog
that defines the DSD unit.
56
// Following is the pseudo-code in Verilog.
// p=1, N=64, dir=1, thr=2, bitwidth, twidwidth=16
module lMod(I0,O0,I1,O1,CPHASE, C, reset);
/*CPHASE: is the control signals from the
* DSDControl component
*C : is the clock
*/
parameter width = 32;
input [4:0] CPHASE;
input C, reset;
input [width-1:0] I0;
output [width-1:0] O0;
wire [width-1:0] w0;
input [width-1:0] I1;
output [width-1:0] O1;
wire [width-1:0] w1;
L_$N$ #(width) L0(
.I1(w0), .I2(w1),
.O1(O0), .O2(O1),
.CPHASE(CPHASE), .C(C), .reset(reset));
assign w0 = I0;
assign w1 = I1;
endmodule
Figure 4.8: Permutation Module.
DSD Control Component: The switching state alternates every m/4 cycles repeatedly to per-
mute a continuous stream of input vectors. There is one DSD Control unit for each of the log2(N/2p)
successive DSD stages in a cascade. Since there are p such cascades, there will be p copies of DSDm
units all sharing the same DSD Control unit as shown in Figure 4.7. Figure 4.11 provides source
code in Verilog that defines the DSD Control component.
4.6.2 Top Level Module
To compute the DFT using the Pease algorithm three components are required. They are the F2
unit, the Stride Permutation component and a storage and retrieval mechanism for twiddle factors.
In addition a Top Level module is necessary to control the data flow through the hardware, sequence
the various hardware operations and to buffer all necessary intermediate results.
On an FPGA, a computation event occurs every clock tick, much like a processor. The Top Level
module provides the clock signal to all the computation components in the hardware. It connects
57
// Following is the pseudo-code in Verilog.
// p=1, N=64, dir=1, thr=2, bitwidth, twidwidth=16
module L_64(I1, I2, O1, O2, CPHASE, C, reset);
parameter width = 32;
input [width-1:0] I1, I2;
input [4:0] CPHASE;
input C, reset;
output [width-1:0] O1, O2;
wire [width-1:0] w21, w22, w31, w32, w41, w42;
wire [width-1:0] w51, w52, w61, w62, w71, w72;
//addressing mechanism for DSD units
//that use FIFOs synthesized using BRAMs.
reg [3:0] addr;
wire [3:0] nextAddr;
assign nextAddr = addr + 1;
always @(posedge C or negedge reset) begin
if (reset == 0)
addr <= 0;
else
addr <= nextAddr;
end
//DSD unit using FIFOs synthesized on slices.
DSD #(4, width) dsd4(.I1(w21), .I2(w22), .O1(w31), .O2(w32),
.PHASE(CPHASE[0]), .C(C));
DSD8 #(width) dsd8(.addr(addr[0:0]), .nextAddr(nextAddr[0:0]),
.I1(w31), .I2(w32), .O1(w41), .O2(w42),
.PHASE(CPHASE[1]), .C(C));
DSD16 #(width) dsd16(.addr(addr[1:0]), .nextAddr(nextAddr[1:0]),
.I1(w41), .I2(w42), .O1(w51), .O2(w52),
.PHASE(CPHASE[2]), .C(C));
DSD32 #(width) dsd32(.addr(addr[2:0]), .nextAddr(nextAddr[2:0]),
.I1(w51), .I2(w52), .O1(w61), .O2(w62),
.PHASE(CPHASE[3]), .C(C));
DSD64 #(width) dsd64(.addr(addr[3:0]), .nextAddr(nextAddr[3:0]),
.I1(w61), .I2(w62), .O1(w71), .O2(w72),
.PHASE(CPHASE[4]), .C(C));
//rewiring in space for a L^(2p)_p permutation
//in this case p=1.
assign w21 = I1;
assign w22 = I2;
assign O1 = w71;
assign O2 = w72;
endmodule
Figure 4.9: DSD Cascade component.
58
// Following is the pseudo-code in Verilog.
// p=1, N=64, dir=1, thr=2, bitwidth, twidwidth=16
module J16(I1, I2, O1, O2, PHASE, C, addr, nextAddr);
parameter width = 32;
output [width-1:0] O1, O2;
input [width-1:0] I1, I2;
input PHASE;
input C;
input [1:0] addr, nextAddr;
wire [width-1:0] FO1;
wire [width-1:0] FI2;
FIFO4 F1 (.in(I2), .raddr(nextAddr), .waddr(addr), .o(FO1), .clk(C));
FIFO4 F2 (.in(FI2), .raddr(nextAddr), .waddr(addr), .o(O1), .clk(C));
CROSS #(width) X (.I1(I1), .I2(FO1), .O1(FI2), .O2(O2), .PHASE(PHASE) );
endmodule
Figure 4.10: Delay Switch Delay Unit.
each component via wires and provides registers to store intermediate data wherever necessary. The
Top Level module also contains logic that chooses between feedback output and fresh input to the
Kernel modules. A Finite State Machine (FSM) in the Top Level module, controls the data flow.
Figure 4.12 provides pseudo-code in Verilog that defines the Top Level module.
It should be noted that the pseudo-code is for a DFT with bit-reversed output. For the bit-
reversed input case, the computation flow would be in the reverse direction. In other words, the
Kernel modules would be given the input and its output would be fed to the Stride Permutation
module. Even the FSM would operate in the reverse fashion (Subsection 4.6.2).
Finite State Machine
The FSM controls the assignment of values to the row and col variables. The variables row and
col specify the butterfly computation being performed with 0 ≤ row < N/2 and 0 ≤ col < log2(N)
corresponding to the coordinates of the butterfly computation in the data flow diagram.
As stated before, there are two different FSMs for dir = 0 and dir = 1. When the hardware
accepts bit reversed input, there is no need for an extra wait state since the input data can be fed
directly to the Kernel modules. However, when the core accepts natural ordered input, the input
has to be fed to the Stride Permutation module first. Hence an extra state has to be introduced into
59
// Following is the pseudo-code in Verilog.
// p=1, N=64, dir=1, thr=2, bitwidth, twidwidth=16
module DSDControl(clk, reset, startVec, phase, outStart);
parameter size = 1;
parameter delay = 2;
input clk, reset, startVec;
output phase, outStart;
reg [size:0] counter;
reg outStart, startFlag;
assign phase = counter[size];
always @(posedge clk or negedge reset) begin
if (reset == 0) begin
counter <= 0;
startFlag <= 0;
end
else begin
startFlag <= startFlag;
if (startVec == 1) begin
counter <= 0;
startFlag <= 1;
end
else
counter <= counter+1;
if ((counter == delay-2) && (startFlag == 1)) begin
outStart <= 1;
startFlag <= 0;
end
else begin
outStart <= 0;
end
end
end
endmodule
Figure 4.11: DSD Control Component.
the FSM. This state waits for a period that is equal to the total depth of the FIFOs in a single DSD
cascade. For example, when N = 64, p = 1 there will be 5 DSD units. They are DSD4, DSD8,
DSD16, DSD32 and DSD64 respectively, the FIFO length in each will be 1, 2, 4, 8 and 16. Thus
the wait state will stall for 30 cycles. Also, both FSMs wait an extra pipe number of cycles when
an entire vertical stage has finished computing. The value of pipe is the number of pipeline stages
present in the Kernel module.
A pseudo-code in Verilog is not presented here because the 1D DFT core uses a modified version
60
of the Pease twiddle value generation algorithm (Subsection 4.6.4) and the resulting FSM is more
easily explained and represented in it’s modified form (Chapter 5 Section 5.1).
4.6.3 Kernel Module
The Kernel module in the hardware computes F2. Given two real inputs, the Kernel computes the
sum and difference of the inputs. Since the hardware follows the Pease algorithm, the intermediate
result after the sum and difference computation must be multiplied by twiddle factors. The Pease
twiddle factors have the property that one twiddle value is always 1, and hence in each Kernel
computation, only the difference has to be multiplied with a twiddle factor.
The twiddle factors are retrieved from the ROM and fed to the Kernel by the Top Level module.
The Kernel is also pipelined to increase efficiency and throughput. The amount of pipelining depends
upon the bitwidth of twiddle factors and input. The Xilinx series of FPGAs have 18-bit multipliers,
hence for the following three scenarios (i) w, t ≤ 16 (ii) 16 < w ≤ 32, t ≤ 16 and (iii) 16 < w, t ≤ 32
different depths of pipelining is constructed. Figure 4.13 provides pseudo-code in Verilog that defines
the Kernel module.
4.6.4 ROM Module
The twiddle factors necessary for the one-dimensional DFT computation are produced by the
1D DFT core generator using a modified version of the Pease twiddle generation formula. This
means that only a subset of the actual twiddles are produced. This is possible since certain twiddle
factors are repeated, and can hence be eliminated by utilizing a single copy of the twiddle factor.
This single copy is then reused across computations. The twiddles are converted to t-bit fixed-point
hexadecimal values. Bit t− 1 represents the sign, bit t− 2 takes the value 0 or 1 which is the integer
part and bit t − 3 to 0 represents the fraction. This format is sufficient since the twiddle values
range from -1 to 1. The hexadecimal values are then placed in a ROM construct that provides a
mechanism to index the data and retrieve the twiddle values. Figure 4.14 provides source code in
Verilog that defines the ROM module.
4.7 1D DFT Core: Execution Details
So far the function of each module and construction of the core was described. This section
discusses the execution of the core on an FPGA. The DFT computation with p-degree of vertical
61
folding begins with loading the input vector into the data-path 2p elements per cycle. After n/2p
cycles, the vector is fully loaded and buffered in the feedback register (for p = n/2) or in the FIFOs
of the folded LNN/2 or L
N
2 permutation depending on the value of dir (for p < n/2). Next the loaded
vector cycles through the pipeline log2(N) times. In the last iteration, the transformed vector is read
out instead of being fed back into the data-path. Processing a stream of vectors involves overlapping
the last iteration of one vector with the loading of the next.
The Kernel module computes one fixed-point multiplication and two fixed- point additions. In
order to increase the clock rate, the Kernel module is pipelined. The depth of the pipeline depends
on the bitwidth of the twiddle factors and input data. The vertically folded stride permutation
module introduces a latency of n/2p to buffer and permute the data. Thus the latency per vertical
stage is n2p + pipe, and the overall latency of the DFT hardware is log2(N)(
n
2p + pipe)t, where t is
the cycle time of the placed and routed hardware. The steady state throughput is 1/latency [12].
The generated DFT core does not compute the bit-reversal permutation, either at the beginning
(dir = 0) or at the end (dir = 1) of log2(N) iterations. In most applications that use the DFT
core, the bit-reversal permutation is absorbed by another permutation in a subsequent computation.
For bit-reversed input, the end-designer is responsible for providing the hardware and is outside the
scope of this generator.
4.8 1D DFT Core Generator: From Parameters to Verilog
This section describes the framework used for generating Verilog code from the parameters
defined in Table 4.1. The framework is built using the object oriented programming paradigm.
Each hardware component is represented by an object that contains methods which write Verilog
code to the output file. This output files can then be synthesized onto an FPGA.
The approach is simple. Each object that represents a hardware construct is a wrapper for
pre-designed Verilog code. This Verilog source, however is incomplete. It has no constants defined.
For example, width of registers and wires, number of temporaries, twiddle values within the ROM
etc. are all represented by variables within the framework. Values are assigned to these variables at
generation time, which are determined from the input parameters to the 1D DFT core generator.
At generation time, some objects are invoked multiple times to get the desired concurrency in the
Verilog core.
The generated Verilog code is obtained from a pre-designed template. By substituting the param-
62
eters in appropriate locations, the complete Verilog representation of the hardware is produced for
different combinations of input values. In some situations, there exist two versions of the same hard-
ware component, one for each direction of computation flow, and depending on the dir parameter
the appropriate version is written to the core file.
63
// Following is the pseudo-code in Verilog.
// p=1, N=64, dir=1, thr=2, bitwidth, twidwidth=16
module dft(enable, c, reset_n,inputNext,
I0_r, I0_i, O0_r, O0_i,I1_r, I1_i, O1_r, O1_i);
........
output inputNext; //Signals next I/O burst.
reg [5:0] small_counter;
reg [7:0] counter; // Used to signal next I/O burst.
reg [7:0] counter_pipe; // Used in selecting between feed back and fresh input.
wire [width-1:0] x0_r[1:0], x0_i[1:0], x1_r[1:0], x1_i[1:0],
x2_r[1:0], x2_i[1:0], x3_r[1:0], x3_i[1:0];
wire [15:0] T2_r[0:0]; // Stores real part of twiddle factors.
wire [15:0] T2_i[0:0]; // Stores imaginary part of twiddle factors.
wire [4:0] CPHASE;
wire phase0, phase1, phase2, phase3, phase4;
wire start0, start1, start2, start3, start4, start5;
assign CPHASE = {phase4, phase3, phase2, phase1, phase0};
assign start0 = (small_counter == 0) ? 1’b1 : 1’b0;
// Instantiate the DSD Control units. These set the individual values of
// CPHASE. Each start wire is triggered by the previous DSD control unit
// and starts the next DSD Control unit in the cascade.
reg [2:0] col; //Specifies which vertical stage is under execution.
reg [4:0] row; //Specifies which Kernel unit is under execution.
reg [1:0] rc_state; // Stores the current state of the FSM.
// Instantiate the ROM module. Using row and col the address for the
// appropriate twiddle factor is selected and fed to wire T2.
assign x0_r[0] = I0_r; assign x0_i[0] = I0_i;
assign x0_r[1] = I1_r; assign x0_i[1] = I1_i;
// Following chooses between feed back and fresh input.
assign x1_r[0] = (counter_pipe < 32) ? x0_r[0] : O0_r;
assign x1_i[0] = (counter_pipe < 32) ? x0_i[0] : O0_i;
assign x1_r[1] = (counter_pipe < 32) ? x0_r[1] : O1_r;
assign x1_i[1] = (counter_pipe < 32) ? x0_i[1] : O1_i;
// Perm module instantiation. Feed the wires x1 as input.
// CPHASE is fed as input. Output is fed to the wires x2.
// Butterfly unit instantiation. Feed the x2 wires as input. Feed the T2
// wires that contain the twiddle factors. Output is fed to x3 wires.
always @(posedge clk or negedge reset_n) begin
// FSM !here! - sets row and col so that appropriate twiddle
// values are retrieved from the ROM module.
// All counter logic !here!.
O0_r <= x3_r[0]; O0_i <= x3_i[0];
O1_r <= x3_r[1]; O1_i <= x3_i[1];
end
endmodule
Figure 4.12: Top Level Module.
64
// Following is the pseudo-code in Verilog.
// p=1, N=64, dir=1, thr=2, bitwidth, twidwidth=16
module cblockrev_pipe(I0_r, I0_i, I1_r, I1_i, T_r, T_i,
O0_r, O0_i, O1_r, O1_i, clk, reset);
.........................
.........................
assign int_r = (n0 - n1);
assign int_i = (n2 + n3);
assign O0_r = c0_r;
assign O0_i = c0_i;
assign O1_r = int_r;
assign O1_i = int_i;
assign n0 = {m0[2*width-1], m0[fpoint+width-2 : fpoint]};
assign n1 = {m1[2*width-1], m1[fpoint+width-2 : fpoint]};
assign n2 = {m2[2*width-1], m2[fpoint+width-2 : fpoint]};
assign n3 = {m3[2*width-1], m3[fpoint+width-2 : fpoint]};
assign tm0 = {{scale{b1_r[width-1+scale]}}, b1_r[width-1:scale]};
assign tm1 = {{scale{b1_i[width-1+scale]}}, b1_i[width-1:scale]};
always @(posedge clk, negedge reset) begin
a0_r <= I0_r;
a0_i <= I0_i;
a1_r <= I1_r;
a1_i <= I1_i;
b0_r <= {{scale{a0_r[width-1]}}, a0_r} + {{scale{a1_r[width-1]}}, a1_r};
b0_i <= {{scale{a0_i[width-1]}}, a0_i} + {{scale{a1_i[width-1]}}, a1_i};
b1_r <= {{scale{a0_r[width-1]}}, a0_r} - {{scale{a1_r[width-1]}}, a1_r};
b1_i <= {{scale{a0_i[width-1]}}, a0_i} - {{scale{a1_i[width-1]}}, a1_i};
t2_r <= {{(width-t_width){T_r[t_width-1]}}, T_r[t_width-1:0]};
t2_i <= {{(width-t_width){T_i[t_width-1]}}, T_i[t_width-1:0]};
t3_r <= t2_r;
t3_i <= t2_i;
c0_r <= {{scale{b0_r[width-1+scale]}}, b0_r[width-1:scale]};
c0_i <= {{scale{b0_i[width-1+scale]}}, b0_i[width-1:scale]};
m0 <= tm0 * t3_r;
m1 <= tm1 * t3_i;
m2 <= tm0 * t3_i;
m3 <= tm1 * t3_r;
c1_r <= int_r;
c1_i <= int_i;
end
endmodule
Figure 4.13: Kernel Module.
65
// Following is the pseudo-code in Verilog.
// p=1, N=64, dir=1, thr=2, bitwidth, twidwidth=16
module rom0(addr, out, clk);
input clk;
output [31:0] out;
reg [31:0] out;
input [4:0] addr;
always @(posedge clk) begin
case(addr)
8’d0: out <= {16’h4000, 16’h0};
8’d1: out <= {16’h3fb1, 16’h646};
8’d2: out <= {16’h3ec5, 16’hc7c};
8’d3: out <= {16’h3d3f, 16’h1294};
8’d4: out <= {16’h3b21, 16’h187e};
8’d5: out <= {16’h3871, 16’h1e2b};
8’d6: out <= {16’h3537, 16’h238e};
8’d7: out <= {16’h3179, 16’h289a};
8’d8: out <= {16’h2d41, 16’h2d41};
8’d9: out <= {16’h289a, 16’h3179};
8’d10: out <= {16’h238e, 16’h3537};
8’d11: out <= {16’h1e2b, 16’h3871};
8’d12: out <= {16’h187e, 16’h3b21};
8’d13: out <= {16’h1294, 16’h3d3f};
8’d14: out <= {16’hc7c, 16’h3ec5};
8’d15: out <= {16’h646, 16’h3fb1};
8’d16: out <= {16’h0, 16’h4000};
8’d17: out <= {16’hf9bb, 16’h3fb1};
8’d18: out <= {16’hf384, 16’h3ec5};
8’d19: out <= {16’hed6c, 16’h3d3f};
8’d20: out <= {16’he783, 16’h3b21};
8’d21: out <= {16’he1d5, 16’h3871};
8’d22: out <= {16’hdc72, 16’h3537};
8’d23: out <= {16’hd767, 16’h3179};
8’d24: out <= {16’hd2bf, 16’h2d41};
8’d25: out <= {16’hce87, 16’h289a};
8’d26: out <= {16’hcaca, 16’h238e};
8’d27: out <= {16’hc78f, 16’h1e2b};
8’d28: out <= {16’hc4e0, 16’h187e};
8’d29: out <= {16’hc2c2, 16’h1294};
8’d30: out <= {16’hc13b, 16’hc7c};
8’d31: out <= {16’hc04f, 16’h646};
default: out <= {16’hc04f, 16’h646};
endcase
end
endmodule
Figure 4.14: One-Dimensional ROM Module.
66
5. Universal FFT Core Generator
This Chapter explains the process of converting the 1D DFT core generator outlined in Chapter 4
to the Universal FFT core generator. The modifications made, directly follow from the mathematical
formulae presented in Chapter 3.
5.1 Universal FFT Core
As shown in Chapter 3, the DFT computation flow is fixed and independent of dimension for
a fixed number of input points. The Universal FFT core re-uses Verilog source from the 1D core.
However, modifications are required in the twiddle generation process, the initial permutation and
to the software framework that generates the core.
Changing the twiddle factors requires, replacing the code that generates the 1D twiddles with the
Dimensionless twiddle value generator. However, this change has some side effects. Due to design
decisions made by the original authors of the software, the modification process affects the data
access pattern between the ROM module and Top Level module. Consequently the Finite State
Machine is also transformed. These changes only alter the mechanics of achieving the computation
and not the computation itself.
The code presented in Figure 3.2 generates the Dimensionless twiddles using Equation 3.7. These
twiddles are then converted to t-bit wide fixed-point hexadecimal values, that are then placed within
the ROM module of the generated core. The t-bit wide twiddle is composed the same way as in the
1D case. Figure 5.1 provides source code in Verilog that defines the modified ROM module.
The 1D DFT core generator modified the Pease twiddle generation such that the ROM module
held a minimal set of twiddle values. This is an optimization, and is obtained by deriving a gener-
alized access pattern of the twiddle factors, allowing repeating twiddle values to be removed from
the generation process. This optimization was not performed in the Dimensionless case.
Hence, the FSM needs to be modified for the multi-dimensional case. The modification ensures
that the Top Level module indexes and retrieves the correct twiddle value for the corresponding
Kernel module computation. Figure 5.2 shows a graphical representation of the state machine and
Figure 5.3 provides source code in Verilog that defines the FSM.
Since we are now working with multi-dimensional input, parameters to the Universal FFT core
67
// Following is the pseudo-code in Verilog.
// p=1, N=8, dir=1, thr=2, bitwidth, twidwidth=16
module rom0(addr, out, clk);
input clk;
output [31:0] out;
reg [31:0] out;
input [3:0] addr;
always @(posedge clk) begin
case(addr)
0: out <= {16’h4000, 16’h0};
1: out <= {16’h4000, 16’h0};
2: out <= {16’h4000, 16’h0};
3: out <= {16’h4000, 16’h0};
4: out <= {16’h4000, 16’h0};
5: out <= {16’h4000, 16’h0};
6: out <= {16’h0, 16’h4000};
7: out <= {16’h0, 16’h4000};
8: out <= {16’h4000, 16’h0};
9: out <= {16’h2d41, 16’h2d41};
10: out <= {16’h0, 16’h4000};
11: out <= {16’hd2bf, 16’h2d41};
default: out <= {16’hd2bf, 16’h2d41};
endcase
end
endmodule
Figure 5.1: Multi-Dimensional DFT ROM Module.
generator also change. The end-designer now has to specify the number of dimensions and the size
of each dimension. Section 5.2 discusses the effect of these new parameters in detail.
The Dimensionless FFT requires the generalized bit-reversal permutation called digit-reversal
(Definition 6). Equation 2.18 provides a recursive approach to compute this permutation, which was
shown in Figure 2.2. Figure 5.4 gives a pictorial example of the digit-reversal permutation in action.
5.2 Software Framework for the Universal FFT Core Generator
The final aspect to implementing the Universal FFT core generator is the software framework.
The software framework is a refractored version of the original 1D DFT core generator software.
The new framework extends the old set of parameters by adding two new parameters integral to the
multi-dimensional core generation process.
The two new parameters introduced are t and dims, where t represents the number of dimensions
and dims is the array of dimension sizes. All dimension sizes have to be powers of 2. These two
68
Figure 5.2: Finite State Machine.
parameters replace the old parameter N at the user interface level. All other parameters remain the
same and affect the code generation process in the same way as the original 1D DFT core generator.
Given the number of dimensions and respective dimension sizes, the framework proceeds to
compute the total number of input points. The internal variable size maintains this value. For
example, when t = 3 and dims = (4, 4, 8) the total number of input points is size = 128. Once
the total number of input points is determined the framework proceeds to generate the core file.
The unaltered Stride Permutation and Kernel modules are generated by using the parameter size
instead of N . The Verilog code in these modules remain the same; just the wrapper code for the
modules change. The modified Verilog source code is used for those modules which have been altered,
namely the Top Level module and the ROM module. The new framework wraps these components
appropriately to achieve the desired effect.
69
// Following is the pseudo-code in Verilog.
// p=1, N=64, dir=1, thr=2, bitwidth, twidwidth=16
case(rc_state)
0: begin
col <= 5;
row <= 0;
dead <= 0;
if (small_counter == 30)
rc_state <= 2;
else
rc_state <= 0;
end
1: begin
col <= col;
if (dead == 3) begin
row <= 0;
dead <= 0;
rc_state <= 2;
end
else begin
row <= 0;
dead <= dead+1;
rc_state <= 1;
end
end
2: begin
if (row == 31) begin
row <= 0;
dead <= 1;
rc_state <= 1;
if (col == 0)
col <= 5;
else
col <= col-1;
end
else begin
col <= col;
row <= row+1;
dead <= 0;
rc_state <= 2;
end
end
endcase
Figure 5.3: Finite State Machine in a Dimensionless DFT.
70
Figure 5.4: Digit Reversal Example.
71
6. Performance Results
This section summarizes performance results of the Universal FFT core generator. FPGA re-
source utilization of the modified core generator is compared to results for the original 1D core
generator from [12]. The comparison confirms that essentially the same utilization was obtained,
independent of dimension. Comparisons in [12] with industrial grade cores from Xilinx LogiCore
[20] enable us to conclude that the Universal FFT core generator is of similar quality with the same
benefits and flexibility obtained by the 1D core generator. Theoretical comparison shows the bene-
fits, in terms of input latency, of the Universal core compared to the use of fixed 1D cores with the
row-column algorithm. Finally, we observe and analyze a decrease in memory required for twiddle
factor storage as the dimension increases.
All experiments are based on designs synthesized for a Xilinx Virtex2-Pro XC2VP100 with a
speed grade of -6. Xilinx ISE version 8.1i was used for synthesis, map, place and route. Resource
estimates were collected after synthesis and map operations. This methodology of conducting ex-
periments is consistent with [12], ensuring accurate comparison of all results.
6.1 1D DFT Core Generator Performance
Performance of the 1D DFT core generator was presented in [12]. However, [12] used Xilinx ISE
version 6.1 during their performance evaluation, and the results presented in this Chapter are all
based on a newer version. In order to maintain accuracy, experiments from [12] were re-run. We
confirm that the performance of the 1D DFT core generator in our environment is similar to that
reported in [12]. These results provide a basis to compare the Universal FFT core generator to an
industrial grade library from Xilinx as was done in [12].
The experiments in [12] used a 1024-point DFT for comparison. All cores had a fixed-point
(16-bit) data format with a burst I/O interface , and natural-in bit-reversed out data ordering. The
Xilinx cores used a radix-4 DFT, and were generated from the LogiCore library [20]. The 1D DFT
from [12] cores were generated using three resource settings: 1) minimum slice usage (all FIFOs
and Twiddles in BRAMs) 2) maximum slice usage (all FIFOs and Twiddles in Distributed RAM)
3) balanced slice usage (Twiddles in BRAM and FIFOs deeper than 64 in BRAM).
Figure 6.1 compares the slice usage of the cores, for varying values of parallelism (p), and Fig-
72
ure 6.2 compares BRAM usage. These results are consistent with [12].
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 20000
 22000
 1  2  4  8  16  32
Sl
ice
s
Parallelism [p]
Minimal Slices
Maximum Slices
Balanced Slices
Xilinx
Figure 6.1: Slice Usage of 1D DFT Core Generator vs Xilinx LogiCore Library.
6.2 Dimensionless FFT Performance
This section compares the performance of the Universal FFT core generator and the 1D DFT
core generator. The results obtained are broken down into two subsections. The first sub-section
compares the performance of 1D DFT cores produced by the original 1D DFT core generator and
our modified core generator. This allows us to study the effects of our modification to the 1D DFT
core generator. The second subsection compares a subset of possible multi-dimensional cores for a
fixed number of input points. Some interesting results were obtained and these are evaluated in
Section 6.4.
6.2.1 1D DFT Core Performance
To maintain consistency and compare results accurately, the experiments were conducted in the
same fashion as presented in [12]. Cores generated by three software frameworks were compared.
As stated in Chapter 5 the 1D DFT core generator was optimized to reduce the number of twiddles.
73
 0
 50
 100
 150
 200
 250
 1  2  4  8  16  32
BR
AM
s
Parallelism [p]
Minimal Slices
Maximum Slices
Balanced Slices
Xilinx
Figure 6.2: BRAM Usage of 1D DFT Core Generator vs Xilinx LogiCore Library.
Thus, we compare the performance of 1D DFT cores generated by the unoptimized 1D DFT core
generator framework, the optimized 1D DFT core generator and the Universal FFT core generator.
Performance in the case of minimal slice usage is shown in Figure 6.3 and Figure 6.4. The former
compares the slice usage for varying values of p, while the latter compares BRAM usage. From the
figures it is clear that all frameworks perform the same when considering slice usage. However, the
Universal FFT core generator and the unoptimized 1D DFT core generator use about twice as many
BRAMs (when p = 1) when compared to the optimized 1D DFT core generator. As p increases the
difference reduces by half every time, until when p ≥ 16 the BRAM usage is same for all frameworks.
The maximum slice usage resource setting causes the entire core to be realized using slices and
no BRAMs. Figure 6.5 shows performance of the three frameworks as p varies. It is evident that
the three cores perform roughly the same. At low values of p, the optimized 1D DFT core generator
uses lesser number of slices than the other two, but as p increases, a similar pattern as seen in the
previous graphs was observed.
Performance in the case of balanced slice usage is shown in Figure 6.6 and Figure 6.7. The
first graphs shows slice utilization as p varies and the second represents BRAM utilization. As
seen previously, the slice utilization for all three core generation frameworks is the same. In the
74
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 1  2  4  8  16  32
Sl
ice
s
Parallelism [p]
MDFT
1D-DFT-Opt
1D-DFT-UnOpt
Figure 6.3: Minimal Slice Usage - Twiddles and FIFOs in BRAM (Slices vs p).
case of BRAM usage, the optimized 1D DFT core generator has better performance; since it uses a
lesser number of twiddles for low values of p. However, when p ≥ 16 performance of all three core
generators is the same.
It should be mentioned here that empirical data for performance in speedup was not collected.
This is because both the Universal FFT core generator and the 1D DFT core generator have the
same latency equal to log2(N)(N/2p+pipe)t. In the equation N is the total number of input points,
p is the degree of parallelism, pipe is the depth of the pipeline in the Butterfly unit and t is the cycle
time. The cycle time t depends on N , p and timing constraints specified during synthesis. Thus,
only cycle time will vary, and it can be manipulated by specifying timing constraints to achieve
desired execution speed.
6.2.2 Multi-Dimensional DFT Core Performance
In this section we compare performance of multi-dimensional DFT cores generated by the Uni-
versal FFT core generator. Cores are generated in three dimensions for both 1024-point DFT and
4096-point DFT. All cores have twiddles and FIFOs realized in BRAMs, and p set to 4. Performance
of 1D DFT cores, generated by both the optimized and unoptimized version of the 1D DFT core
generator, has been included in the graphs. For the 1024-point DFT, Figure 6.8 represents the slice
75
 0
 50
 100
 150
 200
 250
 1  2  4  8  16  32
BR
AM
s
Parallelism [p]
MDFT
1D-DFT-Opt
1D-DFT-UnOpt
Figure 6.4: Minimal Slice Usage - Twiddles and FIFOs in BRAM (BRAMs vs p).
usage and Figure 6.9 represents the BRAM usage. Similarly for the 4096-point DFT, Figure 6.8
represents the slice usage and Figure 6.9 represents the BRAM usage.
An interesting observation was made here. As the number of dimensions increases, the resource
usage of the core decreases, and hence performance increases. While the slice usage for all the cores
remain the same, it is the BRAM usage that drops when the dimension increases. This led us to
conjecture that the number of twiddles plays an important role in this situation. In Section 6.4 we
present the reason for this decrease in resource usage and hence increase in performance.
6.3 Advantages of Dimensionless FFT over Row-Column Method
The row-column method to compute multi-dimensional DFTs was briefly introduced in Chap-
ter 2. The process as explained previously computes the DFT of the input data in one dimension,
transpose the intermediate representation and then computes the DFT in the other dimension. The
method while simple and straightforward has certain drawbacks when compared to the Dimensionless
FFT.
The row-column method requires successive 1D DFT computations with a data transposition in
between while the Dimensionless FFT algorithm is one large DFT computation with no intermediate
data transposition, which can reduce latency. The following latency calculations show that the
76
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 20000
 22000
 1  2  4  8  16  32
Sl
ice
s
Parallelism [p]
MDFT
1D-DFT-Opt
1D-DFT-UnOpt
Figure 6.5: Maximum Slice Usage - Twiddles and FIFOs in distributed RAM.
row-column method indeed does poorly when compared to the Dimensionless approach. We have,
log2(N)(N/2p + pipe)t as the latency of the 1D DFT core. The latency of the 1D and Universal
FFT cores is log2(N)(N/2p+ pipe)t. The following calculations show the latency of the row-column
approach (using 1D DFT cores) to the Dimensionless FFT approach.
• Two dimensional case (n× n)
– Row-Column approach will require 2n number of Fn FFT computations with latency
equal to
∗ 2n log2(n) (n/2p+ pipe) = (n
2 log2(n))/p+ (pipe× 2n× log2(n))
– Dimensionless FFT approach will require 1 Fn2 FFT computation with latency equal to
∗ log2(n
2) (n2/2p+ pipe) = (n2 log2(n))/p+ (pipe× 2× log2(n))
• Three dimensional case (n× n× n)
– Row-Column approach will require 3n2 number of Fn FFT computations with latency
equal to
∗ 3n2 log2(n) (n/2p+ pipe) = (3n
3 log2(n))/2p+ (pipe× 3n2 ∗ log2(n))
– Dimensionless FFT approach will require 1 Fn3 FFT computation with latency equal to
77
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
 18000
 1  2  4  8  16  32
Sl
ice
s
Parallelism [p]
MDFT
1D-DFT-Opt
1D-DFT-UnOpt
Figure 6.6: Balanced Slice Usage - Twiddles in BRAM and FIFO depth > 64 in BRAM (Slices vs
p).
∗ log2(n
3) (n3/2p+ pipe) = (3n3 log2(n)/2p) + (pipe× 3× log2(n))
This shows the latency is reduced by n log2(n) pipe in two dimensions and n
2 log2(n) pipe in three
dimensions.
The Universal FFT core has the added benefit of not requiring redesign and re-synthesis when
changing the number of dimensions.
6.4 Performance Evaluation
This section substantiates the results obtained in Section 6.2. It was observed that as the number
of dimensions increased, the amount of storage used by the twiddle factors decreased. The following
theorems show that as the number of dimensions increase, the total number of twiddle factors not
equal to one decrease, substantiating the observation.
Lemma 4 The number of twiddle values not equal to one in Tmnn is
C(Tmnn ) = mn− (m+ n) + 1 (6.1)
78
 0
 5
 10
 15
 20
 25
 30
 35
 1  2  4  8  16  32
BR
AM
s
Parallelism [p]
MDFT
1D-DFT-Opt
1D-DFT-UnOpt
Figure 6.7: Balanced Slice Usage - Twiddles in BRAM and FIFO depth > 64 in BRAM (BRAMs
vs p).
Proof
Tmnn =
m−1⊕
i=0
W i
where
W = diag(1, ω, · · · , ωn−1)
Therefore the number of twiddles not equal to one is (m− 1)(n− 1) = mn− (m+ n) + 1.
Theorem 14 For an N-point one-dimensional DFT where N = 2k, the number of twiddle values
not equal to one in the iterative or Pease algorithm is equal to
C(N) = 2k−1(k − 2) + 1. (6.2)
Proof
The twiddles occur in the factors
I2i−1 ⊗ T
2k−i+1
2k−i
79
 0
 250
 500
 750
 1000
 1250
 1500
 1750
 2000
 0  1  2  3  4
Sl
ice
s
Dimensions[t]
1024-point-MDFT
1024-point-1D-Opt
1024-point-1D-UnOpt
32x32-point-MDFT
8x16x8-point-MDFT
Figure 6.8: Different configurations of a 1024-point DFT. Twiddles and FIFOs in BRAM (Slices vs
t).
for i = 1, · · · , k. Since the number of non-trivial twiddles in
I2i−1 ⊗ T
2k−i+1
2k−i = 2
i−1C(T 2
k−i+1
2k−i ),
the total number of non-trivial twiddles is
k∑
i=1
2i−1C(T 2
k−i+1
2k−i ),
which by Lemma 4 is equal to
k∑
i=1
2i−1(2k−i − 1)
=
k∑
i=1
2k−1 − 2i−1
= k2k−1 −
k∑
i=1
2i−1
= k2k−1 − (2k − 1)
= (k − 2)2k−1 + 1.
80
 0
 10
 20
 30
 40
 50
 60
 70
 0  1  2  3  4
BR
AM
ss
Dimensions[t]
1024-point-MDFT
1024-point-1D-Opt
1024-point-1D-UnOpt
32x32-point-MDFT
8x16x8-point-MDFT
Figure 6.9: Different configurations of a 1024-point DFT. Twiddles and FIFOs in BRAM (BRAMs
vs t).
Since the Pease algorithm simply permutes the twiddle factors, the number of non-trivial twiddles
is the same for both algorithms.
For example, when N = 16, C(16) = 17.
The above theorem can be generalized to count the number of non-trivial twiddles in the multi-
dimensional DFT. From Theorem 8 we have,
Theorem 15 For a multi-dimensional DFT given by FN1 ⊗ · · · ⊗FNt , where the number of dimen-
sions is t, Ni = 2
ki , K = k1 + · · ·+ kt, K(i) = k1 + · · · + ki, and N = N1 × · · · ×Nt, the number
twiddle values not equal to one in the Dimensionless FFT or its Pease variant is
C(N1, · · · , Nt) = K2
K−1 − t2K +
t∑
j=1
2K−kj (6.3)
Proof
By the fundamental tensor product factorization used in Theorem 8 the number of non-trivial twiddle
factors is equal to the number of non-trivial factors in
t∏
j=1
(I2K(j−1) ⊗ FNj ⊗ I2K−K(j))
81
 0
 250
 500
 750
 1000
 1250
 1500
 1750
 2000
 2250
 0  1  2  3  4
Sl
ice
s
Dimensions[t]
4096-point-MDFT
4096-point-1D-Opt
4096-point-1D-UnOpt
64x64-point-MDFT
16x16x16-point-MDFT
Figure 6.10: Different configurations of a 4096-point DFT. Twiddles and FIFOs in BRAM (Slices
vs t).
which by Theorem 14 is equal to
t∑
j=1
2K(j−1)2K−K(j)C(2kj )
=
t∑
j=1
2K−kjC(2kj )
=
t∑
j=1
2K−kj (2kj−1(kj − 2) + 1)
=
t∑
j=1
2K(kj − 2) + 2
K−kj
= K2K−1 − t2K +
t∑
j=1
2K−kj .
Substituting t = 1 in Equation 6.3, we get K2K−1 − 2K+1, which is the same as Equation 6.2
(number of non-trivial twiddles in a one-dimensional DFT). Substituting t = K in Equation 6.3 we
82
 0
 10
 20
 30
 40
 50
 60
 70
 80
 90
 100
 110
 120
 130
 0  1  2  3  4
BR
AM
ss
Dimensions[t]
4096-point-MDFT
4096-point-1D-Opt
4096-point-1D-UnOpt
64x64-point-MDFT
16x16x16-point-MDFT
Figure 6.11: Different configurations of a 4096-point DFT. Twiddles and FIFOs in BRAM (BRAMs
vs t).
get
K2K−1 −K2K +
K∑
j=1
2K−1
= K2K−1 −K2K +K2K−1
= 0.
Equation 6.3 is maximized when t = 1 and minimized when t = K, and generally decreases as the
dimensions increase.
Applying Equation 6.3 on the two-dimensional DFT F4⊗F4, where t = 2, n1 = 4, n2 = 4, K =
4, k1 = 2, k2 = 2, we get
C(4, 4) = 4.24−1 − 2.24 +
2∑
j=1
24−kj ,
= 32− 32 + 24−2 + 24−2,
= 8.
83
For the three-dimensional DFT F2 ⊗ F2 ⊗ F4, where t = 3, n1 = 2, n2 = 2, n3 = 4, K = 4, k1 =
1, k2 = 1, k3 = 2, we get
C(2, 2, 4) = 4.24−1 − 3.24 +
3∑
j=1
24−kj ,
= 32− 48 + 24−1 + 24−1 + 24−2,
= −16 + 8 + 8 + 4,
= 4.
If trivial twiddles are stored in a compressed form by Xilinx ISE, then Theorem 15 explains the
decrease in the amount of storage used for the twiddle factors.
84
7. Conclusion
This thesis presented a Universal FFT core generator that produces FPGA cores for arbitrary
multi-dimensional DFTs. The core generator is based on a variant of the Dimensionless FFT that
has data flow similar to the Pease FFT algorithm. The Universal FFT core generator was obtained
by simple modification of the 1D DFT core generator by Nordin, Milder, Hoe and Pu¨schel which
was also based on the Pease algorithm. Their core generator had comparable performance to vendor
supplied cores and provided greater flexibility allowing the user to explore a wide range of design
choices and space time tradeoffs. Our modified core generator achieves similar performance and
provides the same benefits. In addition our approach, based on the Dimensionless FFT, can provide
reduced latency compared to the standard row-column approach.
85
Bibliography
[1] L. Auslander, J. R. Johnson, and R. W. Johnson, “Dimensionless Fast Fourier Transforms”,
Tech.Rep. DU-MSC-97-01, Drexel University, 1997, http://www.cs.drexel.edu
[2] E. Chu, and A. George, “Inside the FFT Black Box Serial and Parallel Fast Fourier Transform
Algorithms”, ISBN: 0-8493-0270-6, CRC Press LLC, 2000.
[3] James W. Cooley and J. W. Tukey. “An algorithm for the machine calculation of complex Fourier
series”, Math. Comput., 19(90):297–301, April 1965.
[4] Matteo Frigo, and Steven G. Johnson, “FFTW: An Adaptive Software Architecture for the
FFT”, ICASSP Conference Proceedings, 1998, vol.3, pp.1381-1384
[5] J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri, “A methodology for designing,
modifying, and implementing Fourier transform algorithms on various architectures”, Circuits,
Systems, and Signal Processing, 9(4):449–500, 1990.
[6] J. R. Johnson, Tech.Rep. DU-MSC-98-01, Drexel University, 1998, http://www.cs.drexel.edu
[7] J. R. Johnson, and R. W. Johnson, Tech.Rep. DU-MSC-01-07, Drexel University, 2001,
http://www.cs.drexel.edu
[8] P. Kumhom, J. Johnson, and P. Nagvajara, “Design, optimization, and implementation of a
universal FFT processor”, Proc. 13th IEEE ASIC/SOC Conference, 2000.
[9] Pinit Kumhom, “Design, Optimization, and Implementation of a Universal FFT Processor”,
PhD. thesis, Electrical and Computer Engineering, Drexel University, 2001.
[10] C. Van Loan, “Computational Frameworks for the Fast Fourier Transform, volume 10 of Fron-
tiers in Applied Mathematics”, Society for Industrial and Applied Mathematics, Philadelphia,
USA, 1992.
[11] Peter A. Milder, Mohammad Ahmad, James C. Hoe and Markus Pu¨schel, “Fast and Accurate
Resource Estimation of Automatically Generated Custom DFT IP Cores”, FPGA’06, Monterey,
USA, 2006.
[12] Grace Nordin, Peter A. Milder, James C. Hoe and Markus Pu¨schel, “Automatic Generation
of Customized Discrete Fourier Transform IPs”, Proceedings of the 42nd Annual Conference on
Design Automation, 2005.
[13] M. C. Pease. An adaptation of the fast Fourier transform for parallel processing. J. ACM,
15(2):252–264, April 1968.
[14] Markus Pu¨schel, Bryan Singer, Jianxin Xiong, Jose M. F. Moura, Jeremy Johnson, David
Padua, Manuela Veloso, Robert W. Johnson, “SPIRAL: A Generator for Platform-Adapted
Libraries of Signal Processing Algorithms”, Journal of High Performance Computing and Ap-
plications, special issue on Automatic Performance Tuning, 18(1), 2004, pp.21-45
[15] J. Takala, D. Akopian, J. Astola, and J. Saarinen, “Scalable interconnection networks for partial
column array processor architectures”, Proc. IEEE ISCAS, Geneva, Switzerland, 2000.
[16] J. Takala, T. Ja¨rvinen, P. Salmela, and D. Akopian, “Multi-port interconnection networks for
radix-r algorithms”, Proc. IEEE Intl. Conf. Acoustics, Speech, Signal Processing, 2001.
[17] Donald E. Thomas, and Philip R. Moorby, “The Verilog hardware description language”,
Kluwer Academic Publishers, 5th edition, 2002.
86
[18] R. Tolimieri, M. An, and C. Lu, “Algorithms for discrete Fourier transforms and convolution”,
Springer, 2nd edition, 1997.
[19] R. Tolimieri, M. An, C. Lu, and C. S. Burrus, “Mathematics of Multidimensional Fourier
Transform Algorithms”, Springer, 2nd edition, 1997.
[20] Xilinx, Inc. “Xilinx LogiCore: Fast Fourier Transform v3.1”, November 2004.
[21] Xilinx, Inc. “Virtex-II Pro and Virtex-II Pro X FPGA User Guide”, November 2004.
