Use of the Reduced Precision Redundancy (RPR) method in a radix4 FFT implementation by Gavros, Athanasios
Calhoun: The NPS Institutional Archive
Theses and Dissertations Thesis Collection
2010-09
Use of the Reduced Precision Redundancy (RPR)
method in a radix4 FFT implementation
Gavros, Athanasios














Approved for public release; distribution is unlimited 
USE OF THE REDUCED PRECISION REDUNDANCY 









 Thesis Co-Advisors:   Herschel Loomis 
  Alan Ross 
THIS PAGE INTENTIONALLY LEFT BLANK 
 i
REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, 
searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send 
comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to 
Washington headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 
22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503. 
1. AGENCY USE ONLY (Leave blank) 
 
2. REPORT DATE   
September 2010 
3. REPORT TYPE AND DATES COVERED 
Master’s Thesis 
4. TITLE AND SUBTITLE   Use of the Reduced Precision Redundancy (RPR) 
Method in a Radix-4 FFT Implementation 
 
6. AUTHOR(S)  Athanasios Gavros 
5. FUNDING NUMBERS 
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 
Naval Postgraduate School 
Monterey, CA  93943-5000 
8. PERFORMING ORGANIZATION 
REPORT NUMBER     
9. SPONSORING /MONITORING AGENCY NAME(S) AND ADDRESS(ES) 
N/A 
10. SPONSORING/MONITORING 
    AGENCY REPORT NUMBER 
11. SUPPLEMENTARY NOTES  The views expressed in this thesis are those of the author and do not reflect the official policy 
or position of the Department of Defense or the U.S. Government. IRB Protocol Number: ___N/A___. 
12a. DISTRIBUTION / AVAILABILITY STATEMENT   
Approved for public release; distribution is unlimited 
12b. DISTRIBUTION CODE 
A 
13. ABSTRACT (maximum 200 words)  
 
Reduced precision redundancy (RPR), as a new method for improving fault tolerance in FPGAs, appears promising 
in replacing triple modular redundancy (TMR) to prevent the single event effects due to radiation in arithmetic 
processes.  As a test of this approach, the RPR technique was used to implement a Radix-4 fast Fourier transform 
(FFT).  This design was implemented in a Xilinx Virtex 2 FPGA in order to find the possible gain in speed and power 
as compared to the TMR method. 
This thesis deals with a 64-point Radix-4 in-place FFT, based on an improved FFT algorithm.  The whole 
FFT structure was implemented based on self-designed modules and by manipulating the embedded Virtex II FPGA’s 
modules.  The point was to create a fast and small FFT module that could be altered according to specific application 
requirements.  The implementation of the FFT was successful, managing to handle data in real time at a speed of 
134MHz. 
Based on this FFT design, the next challenge was the implementation of TMR and RPR modules. The first 
attempt was the TMR structure, implemented by creating three identical replicas of the FFT and installing a voter per 
FFT stage.  This implementation was unsuccessful due to space limitations.  The next step was the alteration of the 
existing FFT and the creation of a smaller 8 x 8 bit butterfly module for the RPR structure.  After the successful 
completion of this step, implementation of a RPR module with an 8/32 degree was commenced. Ambiguities and 
inefficient radiation protection were identified in this implementation.  Finally, adopting a new RPR approach and a 
higher degree of 14/32, a smooth and correct RPR module was created that could work in real time, and handle data 
at a speed of 163MHz.  Both TMR and RPR with a degree of 14/32 methods were compared, confirming the RPR’s 
advantage in power consumption and in occupied FPGA’s resources.  
 
 
15. NUMBER OF 
PAGES  
135 
14. SUBJECT TERMS  
Reduced Precision Redundancy (RPR), Triple Modular Redundancy (TMR), Field Programmable 
Gate Array (FPGA), Fast Fourier Transform (FFT) 

















NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89)  
 Prescribed by ANSI Std. 239-18 
 ii
THIS PAGE INTENTIONALLY LEFT BLANK 
 iii
Approved for public release; distribution is unlimited 
 
 
USE OF THE REDUCED PRECISION REDUNDANCY (RPR) METHOD IN A 




Lieutenant Junior Grade, Hellenic Navy 
Bachelor of Naval Science, Hellenic Naval Academy, 2003 
 
 
Submitted in partial fulfillment of the 
requirements for the degree of 
 
 
























R. Clark Robertson 
Chairman, Department of Electrical and Computer Engineering 
 iv
THIS PAGE INTENTIONALLY LEFT BLANK 
 v
ABSTRACT 
Reduced precision redundancy (RPR), as a new method for improving fault tolerance in 
FPGAs, appears promising in replacing triple modular redundancy (TMR) to prevent the 
single event effects due to radiation in arithmetic processes.  As a test of this approach, 
the RPR technique was used to implement a Radix-4 fast Fourier transform (FFT).  This 
design was implemented in a Xilinx Virtex 2 FPGA in order to find the possible gain in 
speed and power as compared to the TMR method. 
This thesis deals with a 64-point Radix-4 in-place FFT, based on an improved 
FFT algorithm.  The whole FFT structure was implemented based on self-designed 
modules and by manipulating the embedded Virtex II FPGA’s modules.  The point was 
to create a fast and small FFT module that could be altered according to specific 
application requirements.  The implementation of the FFT was successful, managing to 
handle data in real time at a speed of 134MHz. 
Based on this FFT design, the next challenge was the implementation of TMR and 
RPR modules. The first attempt was the TMR structure, implemented by creating three 
identical replicas of the FFT and installing a voter per FFT stage.  This implementation 
was unsuccessful due to space limitations.  The next step was the alteration of the 
existing FFT and the creation of a smaller 8 x 8 bit butterfly module for the RPR 
structure.  After the successful completion of this step, implementation of a RPR module 
with an 8/32 degree was commenced. Ambiguities and inefficient radiation protection 
were identified in this implementation.  Finally, adopting a new RPR approach and a 
higher degree of 14/32, a smooth and correct RPR module was created that could work in 
real time, and handle data at a speed of 163MHz.  Both TMR and RPR with a degree of 
14/32 methods were compared, confirming the RPR’s advantage in power consumption 
and in occupied FPGA’s resources.  
 vi
THIS PAGE INTENTIONALLY LEFT BLANK 
 vii
TABLE OF CONTENTS 
I. INTRODUCTION........................................................................................................1 
A. OBJECTIVE ....................................................................................................2 
B. FFT DESIGN OVERVIEW............................................................................2 
C. BACKGROUND ..............................................................................................2 
1. Space Environment and FPGA...........................................................2 
2. Fault Tolerance Methods—Redundancy...........................................3 
D. ORGANIZATION OF THIS THESIS...........................................................4 
II. PREVIOUS WORK.....................................................................................................5 
A. PREVIOUS THESIS .......................................................................................5 
1. Gkikas Thesis .......................................................................................5 
2. Sullivan Thesis......................................................................................6 
B. PREVIOUS PAPERS ......................................................................................8 
1. Application Reports and Notes ...........................................................8 
2. Papers....................................................................................................8 
III. THEORETICAL APPROACH ................................................................................11 
A. PROBLEM DISCUSSION............................................................................11 
1. Discrete Fourier Transform..............................................................11 
2. Fast Fourier Transform ....................................................................12 
a. Radix-2 Versus Radix-4 Versus Split-Radix ..........................13 
b. In-Place Versus Constant Geometry Structure......................15 
B. CONCEPTUAL DESIGN MODEL .............................................................16 
1. 64-Point Radix-4 in-Place DIF..........................................................16 
2. An Improved Radix-4 DIF FFT Algorithm.....................................19 
IV. DESIGN IMPLEMENTATION DETAILS OF FFT .............................................21 
A. DESCRIPTION..............................................................................................21 
1. Stage 1-2-3...........................................................................................22 
a. Stage Controller ......................................................................24 
b. RAM.........................................................................................24 
c. Compute Factor.......................................................................24 
d. BF Multiplier ..........................................................................24 
e. Multiplexer ..............................................................................31 
2. Main Controller .................................................................................31 
3. Reversing Last Stage..........................................................................31 
B. IMPLEMENTATION EFFORTS AND RESULTS...................................31 
1. Implementing a TMR ........................................................................31 
2. Implementing a RPR—First Attempt—RPR Degree 8/32.............32 
a. RPR Upper and Lower Modules.............................................33 
b. Overflow Approach .................................................................34 
c. Ambiguity Phenomenon .........................................................35 
d. RPR Survey Problems.............................................................36 
3. Implementing a RPR—Second Attempt—RPR Degree 14/32.......38 
 viii
a. RPR Bound Modules ..............................................................38 
b. Verifying Results .....................................................................39 
V. RESULTS ...................................................................................................................43 
VI. CONCLUSIONS AND RECOMMENDATIONS...................................................47 
A. SUMMARY ....................................................................................................47 
B. RECOMMENDATIONS FOR FUTURE STUDY .....................................48 
1. Overflow Manipulator.......................................................................48 
2. Implementing a 4 Recursive Butterfly FFT Instead of the 12 
Butterfly FFT .....................................................................................48 
APPENDIX A. FFT IMPLEMENTATION...............................................................49 
A1. FFT MODULE–XILINX SCHEMATIC DESIGN.....................................50 
A2. FFT MODULE—XILINX BEHAVIORAL AND STRUCTURAL 
DESIGN ..........................................................................................................52 
A2.1.  Main Controller of TMR..................................................................52 
A2.II. Reversing Bit of Last Stage Module .................................................53 
A2.III.  Controller of First Stage Timescale 1ns / 1ps.................................54 
A2.IV. RAM....................................................................................................57 
A2.V. Multiplexer and Voter of Each Stage...............................................60 
A2.VI.  Compute Factor ................................................................................60 
A2.VII. BF’s Multiplier .......................................................................65 
A2.VIII. Multiplier 32x32 Module.......................................................67 
A2.IX. BF Multiplier—CSA module ............................................................69 
A2.X. BF Multiplier—CLAH module.........................................................70 
APPENDIX B. TMR IMPLEMENTATION.............................................................75 
B1. TMR MODULE—XILINX SCHEMATIC DESIGN.................................76 
B2. TMR MODULE—XILINX BEHAVIORAL AND STRUCTURAL 
DESIGN ..........................................................................................................78 
B2.I.  Multiplexer and Voter of Each Stage..............................................78 
APPENDIX C. RPR IMPLEMENTATION ..............................................................81 
C1. RPR MODULE—XILINX SCHEMATIC DESIGN..................................82 
C2. RPR MODULE—XILINX STRUCTURAL AND BEHAVIORAL 
DESIGN ..........................................................................................................85 
C2.I. Radiation Module ..............................................................................85 
C2.II. Multiplexer and Voter of Each Stage...............................................89 
C2.III.  BF—Delayer ......................................................................................92 
C2.IV.  BF——Multiplier..............................................................................93 
APPENDIX D. MATLAB FILE..................................................................................95 
D1. MAIN FILE ....................................................................................................95 
LIST OF REFERENCES....................................................................................................115 
INITIAL DISTRIBUTION LIST .......................................................................................117 
 
 ix
LIST OF FIGURES 
Figure 1.  Radix-2 DIF Butterfly (From [15]) .......................................................................13 
Figure 2.  Radix-4 DIF Butterfly (From [16]) .......................................................................14 
Figure 3.  Split Radix DIF (From [16]) .................................................................................14 
Figure 4.  Radix-2 DIF FFT with in-place input and output (From [15])..............................15 
Figure 5.  Radix-2 DIF FFT with constant geometry structure (From [15]) .........................16 
Figure 6.  16-Point Radix-4 DIF FFT Butterfly (From [8])...................................................17 
Figure 7.  64-Point Radix-4 DIF FFT ....................................................................................18 
Figure 8.  64-Point Radix-4 in Shape DIF Signed Fixed Point 32Bit FFT—The Upper 
Level ................................................................................................................22 
Figure 9.  64-Point Radix-4 in Place DIF Signed Fixed Point 32bit FFT—Stages 1&2&3..23 
Figure 10.  BF Multiplier—Use of Embedded Multipliers—Part1 .......................................26 
Figure 11.  BF Multiplier—CSA module—Part2..................................................................28 
Figure 12.  BF Multiplier—CLAH Module—Part3 (From [19]) ..........................................29 
Figure 13.  Multiplier—Overview—Part1 & Part2 & Part3 .................................................30 
Figure 14.  Reduced Precision Redundancy Stage Portrayal.................................................34 
Figure 15.  RPR Shield and 3-Bit Shifting Obstacle .............................................................37 
Figure 16.  FFT and Radiation Module .................................................................................41 
Figure 17.  TMR Vs RPR Synthesis Results .........................................................................44 
Figure 18.  TMR Vs RPR—XPower Analyzer Comparison .................................................45 
Figure 19.  FFT Design—Entirely Layout.............................................................................50 
Figure 20.  FFT’s First Stage Layout.....................................................................................51 
Figure 21.  RAM Structure ....................................................................................................52 
Figure 22.  TMR’s Entirely Concept Layout. ........................................................................76 
Figure 23.  TMR’s First Stage Layout. ..................................................................................77 
Figure 24.  TMR’s BF Module ..............................................................................................78 
Figure 25.  RPR- Entirely Module Layout.............................................................................82 
Figure 26.  RPR Module—Three Stages Layout ...................................................................83 
Figure 27.  RPR—First Stage Layout ....................................................................................84 
Figure 28.  RPR—Butterfly ...................................................................................................85 
 
 x
THIS PAGE INTENTIONALLY LEFT BLANK 
 xi
LIST OF TABLES 
Table 1. Characteristics of 64-point Radix-4 FFT .........................................................20 
Table 2. Expected Errors of Precision Calculation ........................................................40 
Table 3. Errors Expected Due to the Import of Radiation in the Precise Module .........42 
Table 4. Synthesis Results—Comparison between RPR and TMR in a Virtex II 
XC2V8000-5FF1152 FPGA ............................................................................43 
Table 5. TMR Versus RPR—XPower Analyzer Results...............................................44 
 
 xii
THIS PAGE INTENTIONALLY LEFT BLANK 
 xiii
EXECUTIVE SUMMARY 
Field Programmable Gate Arrays (FPGA) are integrated circuits containing 
programmable logic components and interconnectors that are able to be used for the 
creation of complex logic functions.  They are characterized by the unique ability to be 
updated or changed depending on current requirements.  One of the main applications for 
FPGAs is in the space industry where arrays must be updated or reconfigured without 
being physically accessed.  Difficulties arise in the use of such devices in the harsh space 
environment where guard methods against single event effects (SEE) from radiation are 
required. 
A simple solution to this problem is the use of the Triple Modular Redundancy 
(TMR) method.  This is a process that protects the whole FPGA configuration against 
SEE, with the compromise of demanding a significant amount of resources.  Snodgrass 
recognized the problem and introduced, in his PhD dissertation in 2006 [1], a new 
method of fault tolerance.  This method, referred to as the Reduced Precision 
Redundancy (RPR) method, is an alternative way of implementing and protecting the 
required design.  It can be used only for arithmetic processes and requires fewer 
resources than TMR.  The use of RPR requires a compromise between capacity demand 
and output’s precision, depending on the “degree”—the measure of reduction of precision 
—of RPR. 
The objective of this thesis was the creation of a fast Fourier transform (FFT) 
structure that could be implemented in a FPGA of the Virtex II family, adopting both 
methods of redundancy, TMR and RPR, in order to investigate the performance of RPR.  
First, a simple 64-point Radix-4 in-place FFT was implemented that could handle fixed-
point two’s complement numbers.  This structure was tested with accurate results.  Next, 
a TMR structure was designed by replicating three identical FFT structures and by 
importing a voter into the end of each stage.  The design was successful, but the 
implementation failed to fit within due to size constraints of the Virtex II FPGA, 
revealing the significant demand for resources of the TMR method.  The next step was 
the design of a RPR structure with a degree of 8/32 – 8 bits reduced precision and 32 bits 
 xiv
precise result.  The design was successfully implemented, but the protection against 
radiation failed due to ambiguities and errors that were not considered at that time.  
Taking into consideration the problems from the previous unsuccessful design, a new 
RPR structure with a degree of 14/32 was designed and implemented.  This 
implementation worked correctly and protected the FFT structure efficiently. 
Based on the research conducted in this thesis, an alternative RPR method is 
suggested, where there is no actual need for generating upper and lower bounds.  Instead, 
the truncation and duplication of the precise number, in combination with theoretical 
boundary calculations, is sufficient. This alteration assists in simplifying the logic and 
decreasing the size of the voter. 
Finally, both TMR and RPR methods were implemented successfully in a slightly 
larger Virtex II demonstrating the advantage of RPR over TMR in resource requirements 
and power consumption. 
 xv
ACKNOWLEDGMENTS 
I would like to express my gratitude and sincere appreciation to Professors 
Herschel H. Loomis, Jr. and Alan A. Ross for their guidance, encouragement and support 
in the completion of this work. 
 xvi
THIS PAGE INTENTIONALLY LEFT BLANK 
 
 1
I. INTRODUCTION  
Modern satellites are capable of handling controls, communications, observation 
systems, and on-board payload data processing tasks.  The inaccessibility of a satellite 
after launch and the need for periodic updates of the satellite’s data handling procedures 
create a strong argument for using field programmable gate array (FPGA) instead of 
application-specific integrated circuit (ASIC) technologies.  The fact that a FPGA can be 
reconfigured any time it needs to be is a major advantage and the main reason that they 
are preferred in a spacecraft circuit design. 
Spacecraft computer systems must be able to operate reliably, despite the harsh 
radiation environment.  High energy protons from the Van Allen radiation belt, cosmic 
rays from outer space, and highly charged ions from solar flares are just a few of the 
threats that need to be taken into consideration [2] in spacecraft circuit design.  In order 
to prevent radiation effects, hardening of the device is a significant priority.  But, 
continuously decreasing transistor sizes and the simultaneous increase in operating 
frequencies are obstacles to our hardening efforts. 
Radiation can cause unwanted effects, such as the flipping of a memory cell’s 
state in semiconductor devices, better known as soft errors or single event upsets (SEU).  
FPGAs are susceptible to errors in both data and architecture configuration caused by 
SEU. In order to prevent this faulty behavior, triple modular redundancy (TMR) is 
commonly used to ensure reliable operation.  However, TMR is very costly in terms of 
chip area and power consumption.  Current research is focused on introducing new 
methods of fault tolerance for space-borne reprogrammable computers.  At the Naval 
Postgraduate School (NPS) in 2006, a new method of fault tolerance was introduced [1], 
referred to as reduced precision redundancy (RPR).  RPR applies redundancy only to the 
most significant numerical bits of a circuit and in this way significantly decreases the 
needed chip area and power consumption over that required for TMR. 
 
 2
In this thesis, a Radix-4 64-point FFT is implemented in a Virtex II XC2V6000 
FPGA chip using a Xilinx interface in order to further examine the effectiveness of the 
RPR method.  The reason for the choice of this chip was its existence in the space-flight 
prototype processor, CFTP-2 [3]. 
A. OBJECTIVE 
Reduced precision redundancy (RPR), a new method of fault tolerance in digital 
arithmetic processors, appears promising as a technique for replacing triple modular 
redundancy (TMR) against single event effects due to radiation.  In order to demonstrate 
this promise, the RPR technique is used to implement a Radix-4 fast Fourier transform 
(FFT).  This design is implemented with different degrees of RPR in a Xilinx Virtex II 
FPGA in order to find possible improvements in chip area, speed, and power 
consumption compared with the TMR method. 
B. FFT DESIGN OVERVIEW 
The basic design goal is to implement a FFT design based on the RPR concept in 
a Virtex2 XC2v6000.  This efforts focuses on the creation of one FFT Radix-4 N = 64 
point that can handle one 32-bit input signal in every clock period (in real time) and at the 
maximum possible frequency, combined with two RPR modules of the same philosophy 
with a degree of 8/32 = 0.25. 
C. BACKGROUND 
1. Space Environment and FPGA 
The space environment is harsh and has serious impact on any spacecraft that 
orbits, even for a short period of time.  Thermal imbalances, erosion or surface damage 
are just a few of the threats.  In this thesis, efforts to prevent or decrease the impact of a 
different threat, space radiation, are examined. 
The space radiation environment is encountered by the majority of space missions 
and is the result of galactic cosmic rays (GCRs) of particles emitted by solar events and 
of particles trapped in the Earth’s radiation belts [2].  GCRs are highly energetic, heavy 
 3
protons and ions, reaching energies in excess of 10 GeV/nucleon.  The GCR environment 
in interplanetary space changes with the phase of the solar cycle.  Particles emitted by 
solar events are an important contribution to radiation and are correlated with the eleven-
year solar cycle.  This is due to the fact that large solar events are more frequent during 
solar maxima than during solar minima.  Particles trapped in the Earth’s radiation belts, 
sometimes called the Van Allen belts, are most significant between the altitudes of 
approximately 1,000 km to 32,000 km.  These particles consist of electrons, protons and 
heavy ions that are trapped in the Earth’s magnetic field.  Spacecraft shielding is capable 
of protecting against only some of these particles [2]. 
This radiation and the inefficiency of our shielding can cause unwanted effects on 
space-borne circuits, better known as single event effects (SEE).  SEE can take many 
forms and some of them can be more destructive than others.  In this thesis, the non-
destructive SEE, single event upset, known as a soft error is considered.  A soft error is 
the transient corruption of a single bit of data.  Unfortunately, FPGAs are susceptible to 
soft errors, in both data and the architecture configuration, both of which are stored in 
memory. 
2. Fault Tolerance Methods—Redundancy 
One way to maximize a system’s fault tolerance is to use triple modular 
redundancy (TMR).  The basic design of TMR consists of three identical copies of the 
operation that is required to be “secure.”  The three identical system results are processed 
by a voting system to produce a single output.  If any one of the three systems fails, the 
other two systems can correct and mask the fault.  The disadvantage of using TMR is the 
high chip area occupation and the increased power consumption [4].  Snodgrass [1] 
suggested the concept of a new fault tolerance method, the reduced precision redundancy 
(RPR) method that allows a sacrifice in level of precision in an arithmetic calculation, in 
return for decreased power and capacity demand.  Instead of implementing three identical 
copies of a circuit and voting the result, one fully functional copy of the circuit and two 
reduced precision copies, that will define the upper and lower limit of the function’s 
output, are created.  The voter then compares the three values and checks to determine if 
 4
the function’s result is within the limits of the reduced precision values or if an error has 
occurred.  In either case a result can be generated.  Depending on where the exact point 
of the error is, the result will be a precise or less-precise output calculation. 
D. ORGANIZATION OF THIS THESIS 
Chapter II, Previous Work, describes all thesis and various previous work on 
which this thesis is based.  
Chapter III, Theoretical Approach, describes the theoretical background and the 
algorithm choices that rising for the design of a FFT. 
Chapter IV, Design Implementation Details and Description of the 
Implementation Effort, provides a description of the modules and data management 
strategies used to develop the FFT design and explains the implementation efforts for 
TMR and RPR modules. 
Chapter V, Results, contains an analysis of the occupied resources and power 
consumption for both TMR and RPR modules.  
Chapter VI, Conclusion and Recommendations, contains a summary of the total 




II. PREVIOUS WORK  
A. PREVIOUS THESIS 
This thesis is based on the dissertation of Joshua D. Snodgrass [1] and on two 
theses, from Nikolaos Gkikas [5] and Margaret A. Sullivan [4], respectively.  These 
works reveal valuable information and results that are combined in the design efforts 
examined in this thesis. 
1. Gkikas Thesis 
Gkikas describes the design of a Radix-4 FFT for wireless communications 
(wireless local area networks (LANs)) [5].  He compares different algorithm structures, 
the complexity, memory needs, data flow and controller complexity of each of them, 
searching for the most suitable algorithm for his intended application. 
First, he analyzes the difference between the decimation Fourier transform (DFT) 
and FFT, Radix-2, Radix-4 and Split-Radix algorithms.  Based on theoretical approaches, 
he reports that the FFT provides a faster solution than the DFT.  Comparing Radix-2 to 
Radix-4, he concludes that Radix-4 requires fewer multiplications, so it is the preferred 
solution.  Investigating the Split-Radix algorithm, he calculates that it requires even fewer 
multiplications than the Radix-4, but it has the disadvantage of greater complexity.  
Continuing his investigation, he states some useful information about the decimation-in-
time-frequency algorithm (DITF), the fast Harley transform and the quick Fourier 
transform.  His final conclusion for his application is the adoption of the Radix-4 FFT. 
Secondly, he analyzes the structure of a Radix-4, 64-point, in-place decimation in-
frequency (DIF) FFT, identifying its major modules.  He explains the basic principles of 
each module and writes a brief note about the interconnections and the possible 
controlling algorithms that he uses in his C++ implementation.  His design contains four 
major components.  The first component is a computer factor component that generates 
the needed factors.  The second component is comprised of two 64-sample ROMs that 
store the computed phase factors and the twiddle factors.  The third component is a 
 6
multiply accumulator that consists of one butterfly (BF) machine, one adder and two 
registers, to compute the real or imagery part of the stage’s output.  Finally, the fourth 
component is a controller that synchronizes all of the components. 
In the third part of his thesis, Gkikas examines the structure of the butterfly 
machine, and based on the theory of the improved Radix-4 algorithm, derived from the 
Cooley and Tukey algorithm [6], he suggests the use of the factor terms.  This structure is 
helpful because a minor modification to the factor’s equations, while leaving the rest of 
the design intact, provides an inverse FFT (iFFT) algorithm module.  Moreover, the use 
of factors improves data reusability and decreases the number of required computations.  
He also studies the phase factors and the correct use of them concerning this structure. 
Finally, he examines three different structures of FFT, based on the number of 
butterfly machines that each of them includes.  He first considers a 48-butterfly FFT, 
where each of the three stages has 16 butterflies and that each butterfly is used only once 
in every 64-point block of data.  After that he examines the use of a 16-butterfly FFT, 
where all three stages use the same 16 butterflies, so that each butterfly is used three 
times, once per stage.  Then, he explores the use of a 4-butterfly FFT, where each 
butterfly is used 12 times, four times per stage, for three stages. 
2. Sullivan Thesis 
Based on Snodgrass’ dissertation [1], Sullivan illustrates the application of RPR 
as a new method of fault tolerance in FPGAs against single event effects [4].  She 
examines the use of different degrees of RPR depending on the arithmetic operation and 
compares the impact of implementation of those degrees to area consumption.  
Specifically, she categorizes the problems that are suitable for RPR implementation into 
two major divisions.  These categories depend on the required computation, and are 
addition/subtraction or multiplication/division.  Later she investigates the possibilities of 
implementing RPR in a FFT algorithm. 
For each category, she describes the mathematical relationships, and depending 
on the two operands, she defines each case.  She determines the lower and upper bounds, 
and at the same time, points to special cases that demand unique handling in order to 
 7
avoid possible errors or overflows.  In a similar manner, she designs a RPR voter for each 
task, acknowledging the differences between addition voter and multiplication voter, and 
provides useful tips about the necessary signals, functions behavior and checking 
comparators of the module. 
In the final portion of her research on the arithmetic operations, she comes to the 
conclusion that the use of RPR instead of TMR in the addition/subtraction operation does 
not guarantee less area occupancy in every case.  After checking the results of FPGA’s 
area comparison, she confirms that “… in order for RPR to be a more desirable fault-
tolerance approach than TMR for a simple operation like addition or subtraction, the 
degree of RPR must be significantly less than 0.5 - and that for both the adder and the 
voter in a RPR addition process to be smaller than the analogous TMR modules the 
degree of RPR must be less than 0.25 [4].” 
Moreover, she concludes that for the multiplication operation “… a RPR 
multiplication module requires 1/3 to 1/2 the FPGA slices of a TMR multiplication 
module depending on the degree of RPR [4].”  However, she notes that the size of the 
multiplication RPR voter is extremely large when compared to the TMR voter module, a 
drawback that should not be underestimated.  Another crucial question that she tries to 
answer is the dilemma of whether to test intermediate results or just the final result for 
error.  She points out, “Any benefit of testing intermediate results in RPR processes must 
be considered against the additional space it requires [4].” 
Later, she explains the basics of a FFT algorithm and reveals thoughts about 
different ways of implementing a RPR voter in a FFT.  According to those thoughts, 
“…we may include one or more voters on the final or intermediate results [4].”  
Therefore, there are many choices; for example, we can implement a voter after each 
multiplication, or after each complex product or maybe at the end of each stage.  She 
predicts that using an 8/32 degree of RPR in a FFT butterfly operation instead of a TMR 
will save 66 percent of occupied FPGA slices.  Finally, she concludes that using a RPR 
voter for a few major points within the system will keep the area cost of the FPGA low 
compared to the cost of TMR. 
 8
B. PREVIOUS PAPERS 
Various papers, application reports and notes were also considered in this thesis.  
The following notes are not an abstract from these reports.  Only the ideas and thoughts 
that were considered useful for the design are discussed. 
1. Application Reports and Notes 
In [7], Wu describes the implementation of a Radix-4 DIF FFT using the Texas 
instrument TMS320C80 digital signal processor (DSP).  Although the report is dated, it 
reveals basic FFT implementation principles in a parallel processor and points out 
possible errors and/or hazards due to data overflow.  In [8], Delphin of ST Industries 
describes the implementation of the Radix-4 FFT Algorithm using the ST120 DSP.  
There she explains the basic structure of a FFT Radix-4 and a new algorithm that is 
derived from the Cooley and Tukey algorithm [6].  She discusses the correct order and 
the characteristics of each stage BF’s factors and twiddle numbers.  In order to save 
memory resources, she chooses the “in-place” FFT structure, but she notices the need for 
each stage’s output digit reversing procedure.  Finally, she provides a C++ example 
program that reveals the structure of specific parts of her design. 
2. Papers 
There are three relevant papers that discuss the design and implementation 
improvements of a Radix-4 FFT algorithm in a FPGA.  Each of them provides important 
advice concerning the structure of the design considered in this thesis.  In the first paper 
[9], Sun, Liu and Ji give a detailed description of the FFT’s memory structure, RAM 
storage capabilities and restrictions of RAM used in a FPGA.  They consider a new way 
of handling data addresses to allow manipulation of four input and output data streams, 
while at the same time, bypassing the limitations of modern two-port RAM hardware.  In 
the second paper [10], Bouguezel discusses the profits of using the improved Radix-4 
FFT algorithm in comparison to the Cooley-Tukey FFT algorithm [6].  In the third paper 
[11], Chao, Qin, Yingke and Chengde work on the design of a high performance FFT 
processor, based on a FPGA. They introduce two important ideas for the optimization of 
the FPGA, ideas that could be implemented in future iterations of the design presented in 
 9
this thesis.  The first idea is the lifting scheme, which is a transform that reduces the 
number of real multiplications in a BF from four to three.  While of benefit, note that it 
also increases the number of real additions from two to three.  The second idea is the use 
of an adaptive overflow calculation.  This novel approach ensures that no overflow takes 
place over the entire calculation.  This is performed without limiting the data path width 
or decreasing the efficiency of the BF. 
With important information from previous theses, application notes and papers 
considered, the theoretical background required for the creation of the FFT design is 
discussed in Chapter III. 
 10




III. THEORETICAL APPROACH 
A. PROBLEM DISCUSSION 
In order to successfully implement a FFT in a FPGA, many aspects must be 
considered and wise decisions must be made concerning different algorithms, geometries 
and decimations.  Each approach has advantages and disadvantages but the most suitable 
one must be selected.  The goal is the implementation of a 64-point FFT that can handle 
data in real time, while occupying as few resources as possible. Before discussion of 
design optimization, where the three basic categories of decimations, geometries and 
algorithms must be considered, it is useful to describe the discrete Fourier transform. 
1. Discrete Fourier Transform 
The discrete Fourier transform (DFT) plays an important role in many 
applications of digital signal processing.  The reason for its importance is the presence of 
efficient algorithms for computing the DFT. 
The DFT sequence   X k of N complex-valued numbers, given a sequence of 
data {x (n )}of length N, is computed as: 
1
0





X k x n W


  0 1k N   ,  
















   0 1n N   . 
For each value of k, N complex multiplications (4N real multiplications) and N-1 
complex additions (4N-2 real additions) are required.  This indicates that in order to 
compute all N values of the DFT, 2N  complex multiplications and 2N N complex 
additions are required. 
 12
Unfortunately, direct computation of the DFT is inefficient due to the fact that 
“…it does not exploit the symmetry and periodicity properties of the phase factor NW  
[12].”  These properties are summarized as: 
  Symmetry property:  /2k N kN NW W
    
  Periodicity property:  k N kN NW W
   
The solution to this problem is the decomposition of the N-point DFT into 
successively smaller DFTs. This new approach, named divide-and-conquer, made the 
computation of the DFT more efficient, leading to fast Fourier transforms (FFT) 
algorithms. 
2. Fast Fourier Transform 
The fast Fourier transform (FFT) is an efficient algorithm that uses a reduced 
number of arithmetic operations as compared to DFT, “… eliminating redundancies that 
result from adding certain data sequence values after they have been multiplied by the 
same factors of fixed complex constants during the evaluation of different DFT transform 
coefficients. The efficiency is achieved at the expense of reordering the data sequence, 
but the additional expense is generally small compared to the reduction in multiplications 
and additions.” [13] 
In order to use an efficient FFT algorithm based on the divide-and-conquer 
approach, the number of data points, N, is highly composite, meaning N can be described 
as 1 2 3* * *...* uN r r r r , where { jr } are prime. 
In the case where 1 2 3 ... ur r r r r     , then uN r  and the number r is called 
the radix of the FFT algorithm.  Each FFT radix-r algorithm can be categorized into two 
groups, depending on the decimation chosen. The two groups are the decimation in-time 
(DIT) algorithms, where time samples are computed in alternating groups, and the 
decimation in-frequency (DIF) algorithms, where frequency samples are computed, 
separately, in alternating groups. 
 13
a. Radix-2 Versus Radix-4 Versus Split-Radix 
When comparing the three major FFT algorithms, Radix-2 DIF, Radix-4 
DIF and Split-Radix DIF, the following conclusions can be made [5], [14]: 
The Radix-2 DIF algorithm (Figure 1) significantly decreases the number 
of complex multiplies compared to the DFT from 2N  to 2( / 2) logN N  and complex 
additions from ( 1)N N   to 2logN N .  It is a very popular algorithm due to the symmetry 
and periodicity properties of its twiddle factors. 
 
Figure 1.   Radix-2 DIF Butterfly (From [15]) 
 
The Radix-4 DIF algorithm (Figure 2) yields an even greater decrease in 
the number of complex multiplies as compared to the Radix-2 DIF.  The reduction is 
from 2( / 2) logN N  to 2(3 / 8) logN N .  Therefore, the Radix-4 DIF requires 75 percent 
as many multiplies as Radix-2 DIF and the same number of additions.  It also 
demonstrates the same symmetry and periodicity properties of its twiddle factors. 
 14
 
Figure 2.   Radix-4 DIF Butterfly (From [16]) 
The Split-Radix DIF (Figure 3) is computationally superior to both the 
Radix-2 and Radix-4 algorithms when considering the number of required multiplications 
and additions. However, this algorithm has the disadvantage of structure irregularity, a 
disadvantage that prohibits its use in an implementation design effort [12].  So, after 
comparing these three popular algorithms, the Radix-4 DIF algorithm is selected for the 
design considered in this thesis. 
 
 
Figure 3.   Split Radix DIF (From [16]) 
 15
b. In-Place Versus Constant Geometry Structure 
The term “in-place” (Figure 4) refers to the fact that each time a butterfly 
is computed, the correct set of data is read from memory and the products from the 
butterfly computation are written back into the same set of places in memory [17].  The 
advantage of the “in place” is the fact that the set of data that are needed each time is the 
same set of data that are produced by the butterfly.  This means that data can be 
overwritten, in order to minimize memory requirements.  This is very useful in 
implementing a FFT in an old FPGA, where the goal is using the least amount of required 
memory, vice achieving real time signal processing. 
The term “constant geometry” (Figure 5) indicates that the connections 
between memory slots are the same in each stage.  The advantage of this structure is the 
fact that it is less complicated than the “in-place” structure, reducing hardware size.  
Obviously, “constant geometry” seems more attractive for a highly parallel hardware 
implementation [18], but the potential of the “in-place” in an old FPGA results in its 
preference for the thesis design. 
 
Figure 4.   Radix-2 DIF FFT with in-place input and output (From [15]) 
 16
 
Figure 5.   Radix-2 DIF FFT with constant geometry structure (From [15]) 
B. CONCEPTUAL DESIGN MODEL 
In order to evaluate RPR concepts in a FFT, many aspects are considered in 
choosing an implementation.  The final choice for this thesis is a 64-point Radix-4 in-
place DIF FFT implementation in the Virtex-2 XC2V6000 FPGA.  The following section 
provides a brief overview of the theoretical approach to this thesis design. 
1. 64-Point Radix-4 in-Place DIF 
The Radix-4 DIF FFT divides the 64-point DFT into four 16-point DFTs, then 
into 16 four-point DFTs [7-8].  The butterfly of a Radix-4 consists of four inputs and four 
outputs, as shown below: 
/4 1 /2 1 3 /4 1 1
0 /4 /2 3 /4
( ) ( ) ( ) ( ) ( )
N N N N
kn kn kn kn
N N N N
n N N N
X k x n W x n W x n W x n W
   






[ ( ) ( ) ( ) ( 3 ) ]
4 2 4
N N NN k k k kn
N N N N
n
N N Nx n x n W x n W x n W W


      
,
 
where:  4 ( )
N k k
NW j   
 17
  2 ( 1)
N k k





NW j . 
Combining terms yields the following equation: 
/4 1
0
( ) [ ( ) ( ) ( ) ( 1) ( ) ( ) ( 3 )]
4 2 4
N
k k k kn
N
n
N N NX k x n j x n x n j x n W


        
. 












( ) [ ( ) ( ) ( ) ( 3 )]
4 2 4
( 1) [ ( ) ( ) ( ) ( 3 )]
4 2 4
( 2) [ ( ) ( ) ( ) ( 3 )]
4 2 4
















N N NX k x n x n x n x n W
N N NX k x n jx n x n jx n W
N N NX k x n x n x n x n W









      
       
       




 ( 3)( 3 )]4 k nNNjx n W  .
 
This is called the butterfly and is repeated for all four-point bundles as shown in 
Figures 6 and 7. 
 
Figure 6.   16-Point Radix-4 DIF FFT Butterfly (From [8]) 
 18
 
Figure 7.   64-Point Radix-4 DIF FFT 
 19
2. An Improved Radix-4 DIF FFT Algorithm 
The improved Radix-4 DIF FFT algorithm is derived from the following 
equations. 















A a b c d W
B a jb c jd W
C a b c d W
D a jb c jd W
   
   
   
















and  _ _
k k k
N N r N imW W W  . 





[( ) ( ) ( ) ( )]
[( ) ( ) ( ) ( )]
[( ) ( ) ( ) ( )]
[( ) ( ) ( ) ( )]
r im r im r im r im N
n
r im r im r im r im N
n
r im r im r im r im N
n
r im r im r im r im N
A a ja b jb c jc d jd W
B a ja j b jb c jc j d jd W
C a ja b jb c jc d jd W
D a ja j b jb c jc j d jd W
       
       
       
        .
 
For the computation of the real and imaginary part of the first output of the 
butterfly, only the butterfly’s inputs are added, without multiplying any phase factors. 
  r r r r r
im im im im im
A a b c d
A a b c d
   
        (III.1) 








r im r im
r r r r
r im r im
im r im r
im im im im
im r im r
factor a b c d
factor a b c d
factor a b c d
factor a b c d
factor a b c d
factor a b c d
   
   
   
   
   
   
      (III.2) 
Expressing the B, C, D outputs of the butterfly in factor terms: 
1 _ 4 _
4 _ 1 _
2 2
2 _ 5 _
2 2
5 _ 2 _
3 3
3 _ 6 _
3 3








r N r N im
n n
im N r N im
n n
r N r N im
n n
im N r N im
n n
r N r N im
n n
im N r N im
B factor W factor W
B factor W factor W
C factor W factor W
C factor W factor W
D factor W factor W





















     
based on the reference of Table 1. 
Table 1.   Characteristics of 64-point Radix-4 FFT 
Stage 1 2 3 
Butterflies per group 1 4 16 
Butterfly group 16 4 1 























Based on the theoretical background discussed above, it is clear that the best 
design implementation for this application is a 64-point Radix-4 in place FFT, which can 
handle 32-bit fixed point signals in real time. 
 21
IV. DESIGN IMPLEMENTATION DETAILS OF FFT  
This chapter focuses on the design implementation of the FFT.  The objective is 
to create a FFT that contains only modules, based on this research, designed in a Xilinx 
environment. This implementation is for a Virtex II FPGA XC2V-6000-6bf957. 
A. DESCRIPTION 
The FFT design contains three identical stage modules, one main controller and a 
last-stage reversing module, as shown in Figure 8.  It receives as inputs, a 32-bit real 
number and a 32-bit imaginary number, both of which are fixed-point, normalized, two’s 
complement numbers.  It produces as outputs, a 32-bit real number and a 32-bit 
imaginary number.  The detailed FFT design is included in Appendix A.  According to 
analysis conducted in this thesis and based on an application report from Cheng [18], 
there is the possibility of a 3-bit overflow per stage, meaning a 9-bit overflow for the 
entire FFT.  In order to avoid overflow issues, a simple solution was chosen.  Input 




Figure 8.   64-Point Radix-4 in Shape DIF Signed Fixed Point 32Bit FFT—The Upper 
Level 
1. Stage 1-2-3 
All three stages are constructed identically, are pipelined, and their latency is 85 
clock cycles per stage.  These three stages are depicted in Figure 9.  They include a stage 
controller, a RAM, a compute factor, a BF multiplier and a multiplexer module.  The 
 23
stage controller generates addresses, the RAM receives those addresses and the input 
signals and sends the output signals to the compute factor.  The compute factor computes 
the factors and sends them to the BF multiplier or to the multiplexer.  The BF multiplier 
performs the necessary multiplications and sends the data to the multiplexer.  The 




Figure 9.   64-Point Radix-4 in Place DIF Signed Fixed Point 32bit FFT—Stages 1&2&3 
 24
a. Stage Controller 
The stage controller receives three and exports four signals.  The stage 
controller of each stage waits for the write enable (WE) signal of the main controller.  
From this time on, it enters into a continuous process of creating input and output 
addresses for RAM, based on the algorithm depicted in Figure 7.  It also generates the 
addresses for the ROM’s twiddle factors and the switch command for the multiplexer. 
b. RAM 
The RAM module contains a ROM memory with the needed twiddle 
factors and two 128-point, 32-bit RAM from Xilinx CORE Generator tool.  The RAM is 
the only part of the design that is made with the CORE Generator.  The RAM receives 3 
addresses, one for the incoming input signals, one for the required output signal and one 
for the required twiddle factors.  Each of the real and imaginary input signals are stored 
in one half of the RAM and are exported only when the complete 64-point signals are 
collected.  The RAM module exports four signals, the real and imaginary signal and the 
real and imaginary twiddle factor.  Note that the ROM’s stored twiddle factors are not 
rounded. 
c. Compute Factor 
The compute factor has four inputs and five outputs.  It needs four pairs of 
signals in order to compute the proper factors and starts the procedure after commanded 
by the main controller.  It computes three pairs of factors that are sent to the butterfly as 
demonstrated in Equation (III.2) and a pair of signals that are ready for the next stage as 
demonstrated in Equation (III.1).  This pair of signals bypasses the BF multiplier and 
goes directly to the multiplexer. 
d. BF Multiplier 
The BF multiplier is the most complex module of the design.  It receives 7 
inputs, among them a pair of input factors from the compute factor and a pair of twiddle 
factors from RAM.  It outputs the real and imaginary part of the computed signal. 
 25
The aim is to compute the parts of Equation III.3. This is demonstrated by 
the following computation: 
1 _ 4 _




r N r N im
n n
im N r N im
B factor W factor W
B factor W factor W
 
      (IV.1) 
Thus, four multiplies and two additions are required.  For each multiply, 
the Virtex II embedded pipelined multiplier, MULT18x18S, is used.  This choice was 
made to ensure maximum possible speed for our pipelined FFT.  This was based on the 
fact that the embedded multipliers are probably faster than any behavioral multiplier that 
could be designed for this thesis work.  The disadvantage of this choice is the limit bit 
number of each multiplier.  The Virtex II embedded multiplier can handle 18-bit signed 
two’s complement numbers and outputs a 36-bit result.  However, the input signals in this 
design are 32-bit long.  This requires the use of four embedded multipliers for each 
multiplication.  To illustrate, computation of the first product, 1 _*
n













Figure 10.   BF Multiplier—Use of Embedded Multipliers—Part1 
Figure 10 clarifies the situation.  It can be observed that after the import of 
the two 32-bit long signals, a factor and a twiddle factor, four different outputs are 
received from the embedded multipliers.  Next, the signal is subject to the concatenation 
 27
or addition of bits in order to secure the correct multiplication of the signed fixed point 
numbers.  Finally, in the third stage three 64-bit long signals are received. 
At the fourth stage, those three signals are sent to the carry save adders 
module where the number of the desired signals is decreased from three to two.  This is 
depicted in Figure 11.  At the end of the fourth stage, two 64-bit long signals are 
available for import to the carry look ahead adder module.  This module has a structure 
similar to that depicted in Figure 12. The carry look ahead adder adds the two 64-bit 
numbers and outputs the product that is being concatenated.  The concatenation is based 
on the fact that although two 32-bit fixed point signals (with 30-bit fractional number) are 





Figure 11.   BF Multiplier—CSA module—Part2 
 29
 
Figure 12.   BF Multiplier—CLAH Module—Part3 (From [19]) 
So far, only one product of Equation IV.1 has been computed.  However, 
four identical designs of the structure, which is shown in Figure 13, are used to compute 















The multiplexer receives a pair of input signals from the BF multiplier and 
a pair of signals directly from the compute factor module.  The stage controller 
manipulates the multiplexer and determines which of the two pairs are going to be sent to 
the next stage. 
2. Main Controller 
The main controller is the module that manipulates the entire design by managing 
the controllers of each stage.  It contains a reset command that restarts the entire system 
and two pairs of signals that awake on time the stage controllers and the compute factors 
of each stage. 
3. Reversing Last Stage 
The reversing last stage module generates the output address of the output signals.  
In order to avoid using another RAM for storing the signals in random order and 
outputting then in an arithmetic order, the preference is to export them in the in-place 
random order and indicate the correct order by using an address monitor. 
B. IMPLEMENTATION EFFORTS AND RESULTS  
In order to investigate the use of the RPR module it must first be compared to a 
TMR module that works under the same initial conditions.  Therefore, a TMR 64-point 
Radix-4 in place DIF was designed and then implemented in the Virtex-2 XC2v6000 
FPGA.  Note that the voter in both cases (TMR and RPR) must be in the same position.  
In any other configuration a comparison would be ineffective. 
1. Implementing a TMR 
The design of a TMR is a simple process when compared to the design of a RPR.  
It is necessary to choose the frequency of the check points in the FFT and evaluate the 
values of the three identical modules through a small voter.  Then the results are verified, 
with the expectation that at least two of the three values will be identical.  In this manner, 
a FPGA is protected from unwanted space radiation. 
 32
As the frequency of the check points increases, the protection effectiveness 
increases.  But, the increased use of voters raises the capacity demand of the design.  A 
good compromise position for voters is to use one voter per FFT stage.  The practical 
implication is that only three voters are required for a 64-point Radix-4 FFT. 
At the conclusion of the implementation effort, a successful 64-point Radix-4 in-
place DIF in a Virtex-2 XC2v6000 FPGA was designed and implemented.  The entire 
design is included in Appendix B.  The resources that were used at this time were much 
less that 12 percent of the slice resources of the FPGA, except from one, the embedded 
18 x 18 pipelined multipliers.  Exactly 3 x 4 x 4 = 48 embedded multipliers were required 
from the total of 144 embedded multipliers available in the XC2v6000 FPGA.  Based on 
calculation, if the goal is to create a TMR, three times the number of embedded 
multipliers is required in the primary design.  Therefore, less than 36 percent of the 
FPGA’s slices and 3 x 48 = 144 embedded multipliers should be required.  Thus, the 
entire amount of available embedded multipliers was required for this design. 
At this point all indications were positive and the design of the TMR was 
designed.  After successfully synthesizing and simulating the design, an attempt was 
made to implement it in the FPGA.  However, it was discovered that the FFT could not 
be implemented in the FPGA.  The reason was based on the demand of the 6-block RAM 
of the module for six empty adjacent embedded multipliers.  The embedded RAMs and 
multipliers were sharing the same routing resources in the Virtex II FPGA.  Thus, the 
program required 144 + 6 = 150 multipliers or 104 percent of the available resources and 
the implementation effort failed.  Nevertheless, in order to compare the TMR and RPR 
source and power requirements, an implementation of both modules was attempted on the 
larger Virtex II XC2V8000 FPGA and is discussed in the next Chapter V. 
2. Implementing a RPR—First Attempt—RPR Degree 8/32 
Based on the previous concepts, a RPR module was created using, as its core, the 
primary design, with RPR voters embedded in the end of each stage.  This design 
consisted of one precise module unit and two smaller average precision module units. 
The precise module unit was identical to the primary design module and handled 32-bit 
 33
real and imaginary inputs, while outputting 32-bit real and imaginary results.  The 
average precise module units were distinct in the upper and lower bound units, in that 
each of them defined the upper and lower limits of the precise module. 
a. RPR Upper and Lower Modules 
The upper and lower average precision modules were almost identical. 
They truncated the 32-bit real and imaginary input into 8-bit numbers.  In the upper 
module, the truncated number was increased by one bit on the least significant bit (LSB). 
The 8-bit number passed through the modified compute factor module and entered the 
modified BF multiplier.  Inside the BF multiplier, due to the need for an 8 x 8 bit 
multiplier instead of the 32 x 32 bit precise module’s multiplier, a single 18 x 18S 
embedded multiplier was used instead of four.  There was no need for CSA or CLAH and 
the result of the multiplication occurred sooner than expected from the precise module 
unit.  In order to keep the program synchronized, a delay was inserted after the 8 x 8 BF 




Figure 14.   Reduced Precision Redundancy Stage Portrayal 
b. Overflow Approach 
In the primary design of the FFT, the overflow problem was recognized 
and the simplest solution was implemented.  Therefore, the input was restricted to 
 35
normalized signals with values less than 92 .  This secured the design from possible 
overflow events.  In the first attempt at implementing RPR, a different approach was 
elected.  Based on Cheng’s work, “Autoscaling Radix-4 FFT for TMS320C6000” [20], a 
3-bit shifting per stage approach was adopted in order to avoid overflow issues.  The 
specific implementation was the adoption of 3-bit shifting in the input of each RAM’s 
stage. 
c. Ambiguity Phenomenon 
In the design of the RPR modules, the upper bound is discriminated from 
the lower bound modules by adding one bit in the LSB of the truncated 8-bit number.  
















    
 
Therefore, using the following equations: 
Examplefactor A B C D     
1 2* *Example N Example NReal factor W real factor W image  , 
yields the following results for the upper, lower, and precise modules: 




Example Lower Lower Lower Lower
Example ecise ecise ecise ecise
Example Upper Upper Upper Upper
factor A B C D
factor A B C D
factor A B C D
   
   
   
 
But, these outcomes lead to the following ambiguity: 
Pr
? ?
Lower ecise UpperExample Example Example
factor factor factor
 
   
 
 36
A solution to this problem is the creation of a unified upper and lower 
bound module where each equation loads upper or lower values depending on the sign. 




Example Lower Lower Upper Upper
Example ecise ecise ecise ecise
Example Upper Upper Lower Lower
factor A B C D
factor A B C D
factor A B C D
   
   
   
 
And so: 
PrLower ecise UpperExample Example Example
factor factor factor   
d. RPR Survey Problems 
Incorporating 3-bit shifting in each stage of the FFT introduces a total of 
9-bit shifting at the final output.  If 8-bit RPR shield (8/32 degree) is obtained, it is 
concluded that in a non-overflow case, that RPR is incapable of securing the precise 
module as illustrated in the following equations.  Figure 15 makes clear the fact that in a 
non-overflow case, the RPR cannot sufficiently protect the third stage FFT. 
9
6
3 _ _ Pr _ ( _ ) 2
3 _ _ _ 2
rd Stage ecise Output not overflow







Figure 15.   RPR Shield and 3-Bit Shifting Obstacle 
 38
There are two possible solutions to this problem. Either a program must be 
created that recognizes and manipulates the possible overflow cases by shifting the input 
data by the appropriate amount of bits only when it is desired, or the degree of the RPR 
must be increased.  In the first situation, the concept of an intelligent overflow 
manipulator is most desirable in order to preserve the 8/32 degree of the RPR.  However, 
this manipulator will increase the complexity of the design due to the need for marking 
and shielding of the shifted bit amount throughout the entire FFT structure.  In the second 
case, an increase of the RPR degree is not desirable, but seems rather attractive at this 
point because it minimizes alteration of the primary module. 
3. Implementing a RPR—Second Attempt—RPR Degree 14/32 
Taking into consideration the ambiguities and errors from the previous 
implementation, a RPR module was created, using the primary design as its core, with 
RPR voters embedded at the end of each stage.  This design is included in Appendix C 
and consists of one precise module unit and two smaller low precision module units.  The 
precise module unit is essentially identical to the previous design module (IV.B.2) and 
handles 32-bit real and imaginary inputs, while outputting 32-bit real and imaginary 
results.  In this case, the only difference is that the bound module units are identical and 
not distinct into upper and lower bound units.  
a. RPR Bound Modules 
The RPR bound module truncates the 32-bit real and imaginary input into 
a 14-bit number.  The 14-bit number passes through the modified compute factor module 
and enters the alternated BF multiplier.  Inside the BF multiplier, due to the need for a 14 
x 14 bit multiplier instead of the 32 x 32 bit precise module’s multiplier, only one 18 x 
18S embedded multiplier was used instead of four.  There was no need for CSA or 
CLAH and the result of the multiplication occurred sooner than expected from the precise 
module unit.  In order to keep the program synchronized, a delayer was used after the 14 
x 14 BF multiplier.  The major difference between the first and the second RPR 
implementation was the altered approach of the RPR function.  Instead of using an upper 
and lower bound, two identical bounds were used that handle the truncated 14-bit value 
 39
of the 32-bit input.  At the end of each stage, the voter inspects the duplicate truncated 
values and proceeds to a simple bit by bit comparison.  If the values are identical, then 
the voter computes the expected upper and lower limits based on the truncated 14-bit 
output and on the theoretical expected errors as shown in Equation IV.3. 
Pr Pr Pr Pr Precise ecise ecise ecise ecisefactor A B C D     
Trunc Trunc Trunc Trunc Truncfactor A B C D       
_ ( 1' 1) ( 1' 1) 2' 10Trunc Up Trunc Trunc Trunc Truncfactor A b B b C D b        
_ ( 1' 1) ( 1' 1) 2' 10Trunc Lo Trunc Trunc Trunc Truncfactor A B C b D b b        
In the worst possible case, 
Pr2 '10 2 '10eciseb factor b    (IV.2) 
Using the same method, 
Pr Pr _Pr Pr _Pr* *ecise ecise N ecise ecise N eciseReal factor W real factor W image
   
_ _* *Trunc Trunc N Trunc Trunc N TruncReal factor W real factor W image
   
Trying to compute the worst possible case, (IV.2)*1 + (IV.2)*1: 
Pr3' 100 Re 3' 100eciseb al b     (IV.3)  
The voter compares the precise value to the expected upper and lower bounds and 
decides to choose either the precise value or the average (truncated) value. 
b. Verifying Results 
In order to verify the truth of the outputs of the implemented FFT 
structure, a MATLAB FFT simulation file was created.  This file was used to compare 
the output forms of the three different sub-programs.  The first subprogram was actually 
the built-in MATLAB FFT, the second was a clone of the Verilog design in the 
MATLAB environment, and the third was a MATLAB translator for the implemented 
 40
FFT results.  Since MATLAB usually handles 64-bit numbers and due to the decision for 
3-bit shifting of the input in every stage of the FFT, a possible error was identified in the 
precise calculation of the FFT between MATLAB’s calculated outputs and Verilog’s 
translated outputs.  These expected errors are tabulated in Table 2. 
Table 2.   Expected Errors of Precision Calculation 
 
The purpose of the entire design was the testing of the implemented FFT 
in a radiation environment.  Prior to this, testing was conducted via simulation of possible 
radiation effects.  For this reason, a radiation module (Figure 16) was imported into the 
FFT design that was capable of introducing error bits at the beginning of each stage, prior 
to the butterfly calculations. 
FFTs Outputs Expected Error  
Output of the 1st Stage 252  
Output of the 2nd Stage 222  
Output of the 3rd Stage 192
 41
 
Figure 16.   FFT and Radiation Module 
The radiation module allows great flexibility in the introduction of error 
bits, giving the ability to decide in which stage, in which module (one of the truncated 
RPR or the precise value) and for how long this “form” of radiation lasts.  Errors 
 42
introduced in the first stage have less impact on the precision of the average RPR in 
contrast to errors introduced in the last stage as shown in Table 3, again due to the three-
bit shifting decision. 
Table 3.   Errors Expected Due to the Import of Radiation in the Precise Module 
 
In this section, the implementation efforts of three different modules, 
TMR, RPR with 8/32 degree, and RPR with 14/32 degree were discussed.  In the next 
chapter, the results analysis will be covered based on these designs. 
Radiation introduced in the precise module Expected Error  
Radiation introduced in the 1st stage 72  
Radiation introduced in the 2nd stage 42  
Radiation introduced in the 3rd stage 12
 43
V. RESULTS 
The efforts for implementing a TMR version of the FFT in the Virtex II 
XC2V6000 were fruitless, forcing a revision to the plan.  In order to compare TMR and 
RPR modules, the same FPGA chip had to be used, so it was essential to use a larger 
FPGA of the same family. The next larger, available Virtex II FPGA, the XC2V8000, 
was used. 
First, the synthesis reports for the RPR and TMR modules were compared, as 
indicated in Table 4, where a significant difference in occupied resources is noted.  RPR 
needs 74 percent of the slices that TMR uses, 76 percent of the slice flip flops, 68 percent 
of the four-input LUTs and 50 percent of embedded multipliers that the TMR uses.  In 
Figure 17, the differences identified in the synthesis reports are presented. 




TMR module TMR’s 
occupancy % 
RPR module RPR’s 
occupancy % 
# Slices 11028 23% 8236 17% 
# Slice Flip 
Flops 
18525 19% 14067 15% 
# 4 input LUTs 17985 19% 12255 13% 
# BRAMs 6 3% 6 3% 
# MULT18x18S 144 85% 72 42% 
 44
 
Figure 17.   TMR Vs RPR Synthesis Results 
Secondly, the Xilinx’s XPower Analyzer was used to investigate the power that 
each of the modules required for operation.  When the outcomes from both cases were 
compared, an interesting conclusion resulted.  Although both systems needed the same 
amount of quiescent power, RPR required 19 percent less dynamic power than TMR.  
These results are tabulated in Table 5 and depicted in Figure 18. 
Table 5.   TMR Versus RPR—XPower Analyzer Results 
Virtex II 
XC2V8000 
TMR module RPR module 
Total Quiescent 
Power 
0.13785 W 0.13785 W 
Total Dynamic 
Power 
0.17408 W 0.14079 W 
Total Power 0.31193 W 0.27864 W 
Junction 
Temperature 
28.3 degrees C 27.9 degrees C 
 45
 
Figure 18.   TMR Vs RPR—XPower Analyzer Comparison 
After inspecting the synthesis and power reports from Xilinx for the two 
implemented modules, RPR with a degree of 14/32 and TMR, a more detailed conclusion 
can be made concerning the RPR method and is discussed in greater detail in Chapter VI. 
 
 46
THIS PAGE INTENTIONALLY LEFT BLANK 
 47
VI. CONCLUSIONS AND RECOMMENDATIONS 
A. SUMMARY 
The objective of this thesis was the creation of a FFT structure that would be 
implemented in a FPGA, adopting two different methods of redundancy, TMR and RPR.  
The purpose was to investigate the capabilities of RPR.  First a simple 64-point Radix-4 
in-place FFT was implemented that could handle fixed point, 2’s complement numbers 
and was tested with accurate results.  Next, a TMR structure was designed by replicating 
three identical FFT structures and by importing a voter at the end of each stage.  The next 
stage was the design of a RPR structure with a degree of 8/32.  The design was 
successfully implemented, but it failed to protect the system against radiation.  Taking 
into consideration the problems from the previous unsuccessful design, the construction 
of a new RPR structure with a degree of 14/32 was conducted and resulted in a design 
that worked correctly and managed to protect the FFT structure efficiently. 
One of the major concerns about the RPR method is the size of the voter. This 
thesis suggests a simple alteration from the method suggested from Snodgrass [1], where 
there is no need for generating upper and lower bounds.  Instead, the truncated value of 
the precise number is formed and duplicated.  This alteration helps simplifying things and 
decreases the size of the voter, since now the voter can easily, with a bit-to-bit 
comparison, verify the correctness of the truncated values and with a simple addition or 
subtraction can output the predicted theoretical boundaries of the precise value. 
Although several interesting results were obtained, during this research, relative to 
the specific structure chosen (the 64-point Radix-4 in-place FFT), some conclusions with 
broad theoretical impact should also be mentioned.  These findings can be summarized 
by the following statements: the RPR method is sufficient, it requires fewer resources, 
and is more power efficient than TMR when considering arithmetic operations.  
Additionally, with RPR, there are reduced resource requirements and power consumption 
over TMR, but there is a sacrifice in precision. 
 
 48
B. RECOMMENDATIONS FOR FUTURE STUDY 
1. Overflow Manipulator  
The current structure of this design did not permit a further decrease in RPR 
degree, but the addition of an intelligent overflow manipulator in each stage would allow 
abandonment of the high RPR degree and permit protection of each stage with the same 
small amount of bits that are desired.  This would enable further investigation of the 
impact of RPR degree, and assist in understanding the trade-off of RPR degree, on 
precision, capacity size, and power consumption. 
2. Implementing a 4 Recursive Butterfly FFT Instead of the 12 Butterfly 
FFT 
The basic FFT module used in this thesis was required to handle data in real time.  
That forced use of a four-BF structure per stage for the Radix-4 FFT, creating a rather 
large FFT processor, requiring a significant number of embedded multipliers (33 percent) 
and a trivial amount of remaining resources (less than 13 percent in all cases) for a Virtex 
II XC2V6000 FPGA implementation.  This revealed that this design had a significant 
weakness, the embedded multipliers.  It is interesting to note that in considering the FFT 
design, a recursive four-BF multiple-stage expansion effort was evaluated.  This would 
have enabled the ability to keep the majority of the FFT structure intact, altering only the 
controller.  Of course, a replacement of the embedded multipliers would permit a 
deviation from the Virtex II family, but even so, the decrease in the number of required 
BFs, in addition to, the decrease in the demanded RAM resources would permit the 
development of a less resource demanding FFT, with RPR protection, at the price of 
sacrificing the operational sample rate. 
 
 49
APPENDIX A. FFT IMPLEMENTATION 
This Appendix includes the 64-point, in-place, Radix-4, DIF FFT design as it is 
described in Chapter IV.A.  The module receives two fixed-point 32-bit inputs (real and 
imaginary) and outputs two fixed-point 32-bit results. The design was implemented in a 
Virtex II XC2V6000 BF957 using Xilinx tools.  The synthesis report confirmed that the 
design worked at a clock speed of 225MHz. 
 
 50
A1. FFT MODULE–XILINX SCHEMATIC DESIGN 
 








Figure 21.   RAM Structure 
A2. FFT MODULE—XILINX BEHAVIORAL AND STRUCTURAL DESIGN 
A2.1.  Main Controller of TMR  








output reg WE_stage1,WE_stage2,WE_stage3; 
output reg WE_Compute_Factor_Stage1=0; 
output reg WE_Compute_Factor_Stage2=0; 
output reg WE_Compute_Factor_Stage3=0; 
output reg WE_Bit_Reversing_Last_Stage=0; 
reg [8:0] counter_FFT=0; 
 













 if (counter_FFT==85) WE_stage2=1; 
 if (counter_FFT==170) WE_stage3=1; 
 if (counter_FFT==51) WE_Compute_Factor_Stage1=1; 
 if (counter_FFT==112) WE_Compute_Factor_Stage2=1; 
 if (counter_FFT==117) WE_Compute_Factor_Stage3=1; 
 if (counter_FFT==256) WE_Bit_Reversing_Last_Stage=1; 
 counter_FFT<=counter_FFT+1; //< !!!! 
 end 
//keep in  mind that the controller of each stage needs 2 clocks from the time it is 
triggered 




A2.II. Reversing Bit of Last Stage Module 




reg [7:0] order_Inp_Memory=0; 
output reg [6:0] Address_of_Output_Result=0; 
always @(posedge CLK) 
 54
begin 
//I have to proceed in bit reversing in every stage. It is convenient to change the Input  
//Address(reversing the bit) before the data are trasnfered into RAM of the next stage. 
// I have to manipulate the  6bit . I have to make a bit  
// reversal based on Radix4(four decimal system) => for example :300 -> 003 ,  
// 012->210 etc. 
if (WE_Bit_Reversing) 
begin 
 if (order_Inp_Memory==64)  
 begin 
 order_Inp_Memory=0; 








A2.III.  Controller of First Stage Timescale 1ns / 1ps 
module Controller_of_BF_stage1(CLK,RESET,WE,WE_BF_for_multiplexer, 
order_Memory1, order_Twiddle9, order_Inp_Memory1); 
            
//based on - Fast Fourier Transform [64FFT] Univ. of the Ryukyus, Okinawa, Japan 
//from the moment the WE_basic_stage1 is on (1), the Sampled_Input_signal must be 
//imported on the second following CLK, meaning: 
// WE_basic_stage1 is on at CLK0 
// Ram_input_sampled_signal will be active at CLK2 
input CLK,RESET,WE; 
output reg [6:0] order_Memory1;//7bit each piece(due to 64 present) 
output reg [5:0] order_Twiddle9;//6bit each piece(due to 64 present) 
output reg [6:0] order_Inp_Memory1;//7bit each piece 
output reg WE_BF_for_multiplexer; 




reg [6:0] order_Inp_Memory; 
parameter stage=1; 
parameter N=64; 
reg [5:0] counter=5'b00000; 
reg [7:0] count_for_each_BF=8'b0; 
reg [6:0] inpA=7'b0; 
 55
reg [5:0] counter2=6'b010010;//if I need 18 clks delay ???//mallon den 8a xreiastw to 
order_stage 2 Memory 
reg [7:0] changer1=64 
//in stage 1, Controller needs 1 clock in order to give the right output 
always @(posedge CLK) 
begin 
 case (stage) 
 1: 
 begin 
//order_Memory:generates the output address for stage1 Memory: 0,16,32,48 - 1,17,33,49  
//keep in mind that order_Memory has to take into consideration, in which place 
//is your data in RAM. First 64 slots or next 64 slots ((changer1: task) 
//order_Twiddle: generates the output address for stage1 Twiddle: 0,0,0 - 0,1,2 - 0,2,4 - ... 
//order_Inp_Memory:generates the input address for stage1 Memory: 0,1,2,3,4,....63 - 
64,65..127 
 if (WE) 
 begin 
   
  if (inpA==127) inpA<=0; 
  inpA<=inpA+1; 
  order_Memory<=count_for_each_BF+counter+changer1; 
  order_Inp_Memory<=inpA; 
  if (RESET) 
   begin 
   counter<=5'b00000; 
   count_for_each_BF<=0; 
   counter2<=13; 
   inpA<=0; 
   end 
   
  if (((inpA)>=63)&(inpA<127)) changer1<=0; // changer1 is responsible 
for be the pointer in RAM for the two  
  else changer1<=64; 
  if (count_for_each_BF==0) 
  begin 
  order_Twiddle<=32'b01000000000000000000000000000000; 
  WE_BF_for_multiplexer <=1; 
  end 
  if (count_for_each_BF==16) 
  begin 
  order_Twiddle<=counter; 
  WE_BF_for_multiplexer <=0; 
  end 
  if (count_for_each_BF==32) 
  begin 
 56
  order_Twiddle<=counter<<1;//b2<=counter*2; 
  WE_BF_for_multiplexer <=1; 
  end 
  if (count_for_each_BF==48) 
  begin 
  order_Twiddle<=(counter<<1)+(counter);//b3<=counter*3; 
  WE_BF_for_multiplexer <=1; 
  end 
  count_for_each_BF<=count_for_each_BF+16; 
  if (count_for_each_BF==48) 
  begin 
   count_for_each_BF<=0; 
   counter<=counter+1; 
   counter2<=counter2+1; 
   if (counter==15) 
    begin 
    counter<=0; 
    end 
   // the following 2 if statements are for the order_Stage2_Memory 
   if (counter2==15) 
    begin 
    counter2<=0; 
    end 
  end 





//One stage registers for the commands 






// The Wn_real and Wn_im are by-passing the "Compute Factor". So I have to "delay" 
the Wn the same 
// amount of clocks as if they were imported to "Compute Factor". In order to avoid  
confusing 
// parameter in Controler and in order to avoid using pipelined registers for the Wn in 
Ram, 
 
//Register for Order_Twiddle 


































A2.IV. RAM  
`timescale 1ns / 1ps 
module Ram_edition1_stage1( 
  order_Memory, 
  order_Twiddle, 
  Ain_real,Ain_im, 
  Win_real1,Win_im1, 
  Ram_input_sampled_signal_real0, 
  Ram_input_sampled_signal_im0, 
  addr_Inp, 
  WE, 
  CLK 
 58
  ); 
   
// Includes both Twiddle Rom and Memory RAM for each stage   
// This Edition has 1 Real and 1 Image Part, does not include the "twiddle ROM" 
// and actually has 2 128N RAMs constructed by Coregen (based exactly on the Xilinx 
Library 
// RAMB16_S36_S36) 
input [31:0] Ram_input_sampled_signal_real0,Ram_input_sampled_signal_im0; 
input [6:0] order_Memory;// gives the address of the output that we need 
input [5:0] order_Twiddle;//gives the address of the twiddle that we need 
input [6:0] addr_Inp;// gives the Input address that we had to store the values 
input CLK; 
output [31:0] Ain_real,Ain_im; 
output  [31:0] Win_real1,Win_im1; 
wire  [31:0]twiddle_real [63:0]; 
wire  [31:0]twiddle_im [63:0]; 










































































  genvar e; 
 60
  for (e=1;e<65;e=e+1) 
  begin : L1 
   assign twiddle_real[64-e][31:0]=twiddle_real_A[(e*32-1):((e-
1)*32)]; 
   assign twiddle_im[64-e][31:0]=twiddle_im_A[(e*32-1):(e-1)*32]; 
  end 
 endgenerate 
    
assign    Win_real1=twiddle_real[{1'b0,order_Twiddle[5:0]}];//here 
you have to put the twiddle data 
assign    Win_im1=  twiddle_im[{1'b0,order_Twiddle[5:0]}]; 
endmodule 
A2.V. Multiplexer and Voter of Each Stage   
`timescale 1ns / 1ps 
module Multiplexer_A(clk,Input_realA,Input_imageA,Input_realB,Input_imageB, 
WE_BF_for_multiplexer,Out_real,Out_im); 
input [31:0] Input_realA,Input_imageA,Input_realB,Input_imageB; 
input clk,WE_BF_for_multiplexer; 
output reg [31:0] Out_real; 
output reg [31:0] Out_im; 
always @(posedge clk) 
begin 
 if (WE_BF_for_multiplexer) 
  begin 
  Out_real <=Input_realA; 
  Out_im <=Input_imageA; 
  end 
 else 
  begin 
  Out_real <=Input_realB; 
  Out_im <=Input_imageB; 
  end 
end 
endmodule 
A2.VI.  Compute Factor 
`timescale 1ns/1ps 
 




input [31:0] Input_real,Input_image; 
 61
output reg  [31:0] Aout_real,Aout_im ; 
output reg  [31:0] factorA,factorB; 
output reg WE_BF;//gives the order to BF_multiplier to start the procedure.   
reg [31:0] factor1_a,factor2_a,factor3_a,factor4_a,factor5_a,factor6_a ; 
reg [31:0] factor1_b,factor2_b,factor3_b,factor4_b,factor5_b,factor6_b ; 
 
reg [1:0] counter=0; 
reg [1:0] uncounter=0; 
reg  [31:0] Ain_real,Ain_im,Bin_real,Bin_im,Cin_real,Cin_im,Din_real,Din_im ; 
reg stage1=0; 
reg stage2=0; 
reg [31:0] Areal_a,Aim_a; 
reg [31:0] Areal_b,Aim_b; 














reg [31:0] Real [3:0]; 
reg [31:0] Im [3:0]; 
always @(posedge clk) 
begin 
//The 'counter' pairs the 4 input and gives the order to the 'stage1' to start  
//the calculation of the needed factors 
if (WE_compute_factor) 
begin 



























always @(posedge clk) // We compute all the factors here and also the Input and output  
//        that doesn't need multiplier. (Areal,Aim) 
// In every 4 pairs due to Radix4, the one pair doesn't need multiplier 
 begin 
  if (stage1) 
  begin 
  factor1_a  <= Ain_real+Bin_im; 
  factor1_b  <= Cin_real+Din_im; 
  factor2_a  <= Ain_real+Cin_real; 
  factor2_b  <= Bin_real+Din_real; 
  factor3_a  <= Ain_real+Din_im; 
  factor3_b  <= Cin_real+Bin_im; 
  factor4_a  <= Ain_im+Din_real; 
  factor4_b  <= Cin_im+Bin_real; 
  factor5_a  <= Ain_im+Cin_im; 
  factor5_b  <= Bin_im+Din_im; 
  factor6_a  <= Ain_im+Bin_real; 
  factor6_b  <= Cin_im+Din_real; 
   
  Areal_a   <=Ain_real+Bin_real; 
  Areal_b   <=Cin_real+Din_real; 
  Aim_a    <=Ain_im+Bin_im; 
  Aim_b    <=Cin_im+Din_im; 
   
  stage2<=1; 
  end 
 end 
 










 t8<=factor6_a-factor6_b;  
end 
 
always @(posedge clk) 
begin 
WE_BF<=0; 
 if (stage2) 
 begin 
  if  (uncounter==2'b01)   
  begin 
  Aoutreal_reg_0 <=t1; 
  Aoutim_reg_0 <=t2; 
  WE_BF<=1; 
  end 
 else if  (uncounter==2'b10)  
  begin 
  factorA <=t3; 
  factorB<=t4; 
  WE_BF<=1; 
  end 
 else if   (uncounter==2'b11)  
  begin 
  factorA <=t5; 
  factorB<=t6; 
  WE_BF<=1;  
 end 
 else if (uncounter==2'b00)  
  begin 
  factorA <=t7; 
  factorB<=t8; 
  WE_BF<=1; 
  end 




//my BF_multiplier1 needs 10 cycles in order to send the input to output 
 64
// that's why I need to keep my Aim, Areal for 11 cycles. #11|| 
//register1 for Areal,Aim 
always @(posedge clk) 
begin 
 Aoutreal_reg_1  <= Aoutreal_reg_0; 
 Aoutim_reg_1   <= Aoutim_reg_0; 
  
end 
always @(posedge clk) 
begin 
 Aoutreal_reg_2  <= Aoutreal_reg_1; 




always @(posedge clk) 
begin 
 Aoutreal_reg_3  <= Aoutreal_reg_2; 
 Aoutim_reg_3   <= Aoutim_reg_2; 
 
end 
always @(posedge clk) 
begin 
 Aoutreal_reg_4  <= Aoutreal_reg_3; 
 Aoutim_reg_4   <= Aoutim_reg_3; 
 
end 
always @(posedge clk) 
begin 
 Aoutreal_reg_5  <= Aoutreal_reg_4; 
 Aoutim_reg_5   <= Aoutim_reg_4; 
 
end 
always @(posedge clk) 
begin 
 Aoutreal_reg_6  <= Aoutreal_reg_5; 
 Aoutim_reg_6   <= Aoutim_reg_5; 
 
end 
always @(posedge clk) 
begin 
 Aoutreal_reg_7  <= Aoutreal_reg_6; 




always @(posedge clk) 
begin 
 Aoutreal_reg_8  <= Aoutreal_reg_7; 
 Aoutim_reg_8   <= Aoutim_reg_7; 
 
end 
always @(posedge clk) 
begin 
 Aoutreal_reg_9  <= Aoutreal_reg_8; 
 Aoutim_reg_9   <= Aoutim_reg_8; 
  
end 
always @(posedge clk) 
begin 
 Aoutreal_reg_10  <= Aoutreal_reg_9; 
 Aoutim_reg_10  <= Aoutim_reg_9; 
  
end 
always @(posedge clk) 
begin 
 Aoutreal_reg_11  <= Aoutreal_reg_10; 
 Aoutim_reg_11  <= Aoutim_reg_10; 
  
end 
always @(posedge clk) 
begin 
 Aout_real   <= Aoutreal_reg_11; 




A2.VII. BF’s Multiplier 
`timescale 1ns / 1ps 
//02-01-10 32bit input, N=64, Radix4 
module BF_multiplier1 
(clk,reset,clk_enable,Wn_real1,Wn_im1,factorA,factorB,out1_real,out1_im); 
input    [31:0] Wn_real1,Wn_im1 ; 
input    [31:0] factorA,factorB ; 
input clk,clk_enable,reset; 
// clk_enable = WE_BF from Compute Factor. 
// only if it is enabled, the multiplier functions are going to worked. 
// otherwise, every 4 signals, one is going to run through the multiple pipeline  
// registered insted of the multipliers 
output  reg  [31:0] out1_real; 
 66
output  reg  [31:0] out1_im; 
wire  [31:0] out1_real0; 
wire  [31:0] out1_im0; 
reg  [31:0] Wn_real0; 
reg  [31:0] Wn_im0; 
reg  [31:0] factorA1; 
reg  [31:0] factorB1;  
wire [31:0] Sum1a ; 
wire [31:0] Sum1b ; 
wire [31:0] Sum2a ; 
wire [31:0] Sum2b ; 
 
//using the traditional way as described in N.Gkikas thesis 
// each time computes only one pair of output. 
// So every time uses the 4 multipliers of virtex2  
// in every clk accepts new input for computing the 1 pair of FFT stage. 
//11-12-09 32bit input, N=64, Radix4 
//register1 
always @(posedge clk) 
begin 
 if (clk_enable) 
  begin 
   Wn_real0    <=Wn_real1; 
   Wn_im0    <=Wn_im1; 
   factorA1    <=factorA; 
   factorB1    <=factorB; 




// Based on the following procedure 
// for (n=0;n<3;n=n+1) 
//  begin 
//msb_keeper_real[63:0]= factor[1+n][31:0]*Wn_real[n][31:0] - 
factor[4+n][31:0]*Wn_im[n][31:0]; 
//msb_keeper_im[63:0]= factor[4+n][31:0]*Wn_real[n][31:0] + 
factor[1+n][31:0]*Wn_im[n][31:0]; 
//   out1_real=msb_keeper_real [63:32]; 
//   out1_im =msb_keeper_im  [63:32]; 
//  end 
//mult_32x32_edit1(clk,reset,clk_enable,Ainput,Binput,Sumfinal)    
    
mult_32x32_edit2 i1(clk,reset,clk_enable,factorA1,Wn_real0,Sum1a); 
mult_32x32_edit2 i2(clk,reset,clk_enable,factorB1,Wn_im0,Sum1b); 
    
 67
mult_32x32_edit2 i3(clk,reset,clk_enable,factorB1,Wn_real0,Sum2a); 
mult_32x32_edit2 i4(clk,reset,clk_enable,factorA1,Wn_im0,Sum2b);   
assign out1_real0=Sum1a+Sum1b; 
assign out1_im0  =Sum2a-Sum2b; 
//register 3 
always @(posedge clk) 
begin 
   if (clk_enable) 
   begin 
    out1_real<=out1_real0; 
    out1_im  <=out1_im0; 
   end 
end 
endmodule   
A2.VIII. Multiplier 32x32 Module 
`timescale 1ns/100ps 
// This program is a 32x32 bit multiplier that uses the MULT18x18S multipliers of  
// Virtex2 and send the 3 [..]bit results to the CSA. The multiple CSA give us 2 outputs 
// that we send afterwards to the modified CLA in order to take the final result.  
// we concatenate the final result in order to have an 32bit output the results were checked 
// and are correct 
 
// multiplier Edition 3 
module mult_32x32_edit2(clk,reset,clk_enable,Ainput,Binput,Sumfinal); 
input   [31:0] Ainput,Binput; 
input clk,clk_enable,reset; 
wire [63:0] Sumfinal_not_ready; 
output reg [31:0] Sumfinal;//reg 
wire Areg_sign,Breg_sign; 
reg [35:0] Ain,Bin; 
wire [35:0] Ain0,Bin0; 
reg [17:0] Ain1=18'b0; 
reg [17:0] Ain2=18'b0; 
reg [17:0] Bin1=18'b0; 
reg [17:0] Bin2=18'b0; 
reg [17:0] Ain1_part2,Ain2_part2,Bin1_part2,Bin2_part2; 
wire [16:0] C1,C1a; 
wire [10:0] C0; 
wire [63:0] Addition1, Addition2, Addition3; 
wire [63:0] Sum1,Sum2,Sum3; 
reg [63:0] Addition1a, Addition2a, Addition3a; 
wire [35:0] telos1,telos2,telos3,telos4; 
reg  [35:0] a_telos3,a_telos4; 















MULT18X18S A0B1 (.A(Ain1),  
                      .B(Bin2),  
                      .C(clk),  
                      .CE(clk_enable),  
                      .R(reset),  
                      .P(telos3));//anti 71 
MULT18X18S A1B0 (.A(Ain2),  
                      .B(Bin1),  
                      .C(clk),  
                      .CE(clk_enable),  
                      .R(reset),  
                      .P(telos4));  
    
always @(posedge clk) 








MULT18X18S A0B0 (.A(Ain1_part2),  
                      .B(Bin1_part2),  
                      .C(clk),  
                      .CE(clk_enable),  
                      .R(reset),  
                      .P(telos1)); 
MULT18X18S A1B1 (.A(Ain2_part2),  
                      .B(Bin2_part2),  
                      .C(clk),  
                      .CE(clk_enable),  
 69
                      .R(reset),  
                      .P(telos2));//anti 71 







always @(posedge clk) 
begin             
 Addition1a<=Addition1; 






//there is a register in csa , thats why I dont use one here. 
 
multiplier_behav_pip_a aaw (Sum1,Sum2,Sum3,clk); 









 A2.IX. BF Multiplier—CSA module 
`timescale 1ns / 1ps 
//CSA . We create a 64 CSA that have 3 inputs and 2 outputs. 
//We collect the inputs from the 4 multipliers of virtex2 MULT18x18S 
//We send the two outputs to the Carry Lookahead adder of the 
multplier_behav_pip_32x23 
   
module csa_for_multiplier_beh_pip32x32_a(Ain,Bin,Cin,Output1,Output2,CLK2); 
input [63:0] Ain,Bin,Cin; 
output reg [63:0] Output1,Output2; 
input CLK2; 
reg [63:0] Cout,Sum; 
integer i; 





 for (i=0;i<64;i=i+1) 
 begin :L1 
 Cout[i]<=(Ain[i]&Bin[i])|(Ain[i]&Cin[i])|(Bin[i]&Cin[i]);  
 Sum[i]<= Ain[i]^Bin[i]^Cin[i];  









A2.X. BF Multiplier—CLAH module 
`timescale 1ns / 1ps 
module multiplier_behav_pip_a(Ain,Bin,Summary,CLK1); 
input [63:0] Ain,Bin; 
output [63:0] Summary; 
input CLK1; 
wire C0; 
wire [15:0] Gener,Propagate  ; 
reg [3:0] Propagate3stage1,Gener3stage1 ; 
wire [3:0] Propagate3stage,Gener3stage; 
wire  [11:0] doesntmatter ; 
wire useless,useless1; 
wire [7:0] doesntmatter1; 
reg [63:0] Ain1,Ain2,Ain3,Ain4,Bin1,Bin2,Bin3,Bin4; 
reg [15:0] Gener1,Gener2,Gener3,Propagate1,Propagate2,Propagate3  ; 
reg [3:0] Cin1; 
reg [15:0] Cjfinal; 
wire [11:0] Cjstage5; 
wire [2:0] Cin4stage; 




  genvar e; 
  for (e=0;e<16;e=e+1) 
  begin : L1 
   
   //assign A= 
 71
   //assign B=Bin[3+e*4:0+e*4]; 
    
 alu1 u1 (Ain[3+e*4:0+e*4],Bin[3+e*4:0+e*4],C0,Gener[e], Propagate[e]); 




always @(posedge CLK1) 









  genvar k; 
  for (k=0;k<4;k=k+1) 
  begin :L2 
CLAH name1(Gener1[3+k*4:0+k*4],Propagate1[3+k*4:0+k*4],C0, Gener3stage[k], 
Propagate3stage[k], doesntmatter[2+k*3:0+k*3]);  
     




always @(posedge CLK1) 




Gener2[15:0]<=Gener1[15:0]; //we keep them in order to use them on stage 4 
Propagate2[15:0]<=Propagate1[15:0];//we keep them in order to use them on stage 4 









always @(posedge CLK1) 
  begin 
 72
 
Cin1[3:0]<={Cin4stage[2:0],C0};//input as the Cin to the CLAH of stage 4-Cin for 
CLAH 1 is 0 (we dont have Cin in start) 
Gener3[15:0]<=Gener2[15:0]; //we keep them in order to use them on stage 4 







  genvar i; 
  for (i=0;i<4;i=i+1) 
  begin :L3 
   
CLAH name3 (Gener3[3+i*4:0+i*4],Propagate3[3+i*4:0+i*4], Cin1[i],doesntmatter1[i], 
doesntmatter1[i+4],Cjstage5[2+i*3:0+i*3]);  
     
  end 
 endgenerate 
  
// Register 4  
always @(posedge CLK1) 










  genvar p; 
  for (p=0;p<16;p=p+1) 
  begin :L4 
   
alu2 u2 (Ain4[3+p*4:0+p*4],Bin4[3+p*4:0+p*4],Cjfinal[p],Summary[3+p*4:0+p*4]); 





    




input [3:0] A,B;//n-1 
output  reg Gener; 
output reg Propagate; 
wire   [4:1] g; //n 
wire [4:1] prop; //n 
and and0 (g[1],A[0],B[0]); 
and and1 (g[2],A[1],B[1]); 
and and2 (g[3],A[2],B[2]); 






    begin 
  Propagate=prop[1]&prop[2]&prop[3]&prop[4];//i didnt put the C0 
  Gener= 
g[1]&prop[2]&prop[3]&prop[4]|g[2]&prop[3]&prop[4]|g[3]&prop[4]|g[4]; 





module alu2  (A,B,C0,Sum); 
input C0; 
input  [3:0] A,B ;//n-1 
output reg  [3:0] Sum ;//n-1 
reg Cout; 
always@(A,B,C0) 
   
    begin 
    {Cout,Sum}= A+B+C0; 







input [4:1] G,P; 
output  [3:1] Cj; 
output Generatea,Propagate; 











APPENDIX B. TMR IMPLEMENTATION 
This appendix includes the TMR version of the 64-point, in-place, Radix-4, DIF 
FFT design as it is described in Chapter IV.B.1.  All arithmetic calculations are protected 
by triple redundancy.  The module receives two fixed-point 32-bit inputs (real and 
imaginary) and outputs two fixed-point 32-bit results.  In the last step of each FFT’s 
stage, the three computed values are inspected by a voter, based on the principle of triple 
redundancy.  The design failed to be implemented in a Virtex II XC2V6000 BF957 and 
was finally implemented in a Virtex II XC2V8000 5FF1152 using Xilinx tools.  The 
synthesis report from the XC2V6000 confirmed that the design needed 85 cycles per 
stage, a total of 255 cycles of latency at a clock speed of 134MHz. 
In this appendix, only the schematic and the structural or behavioral modules that 
were changed or added after the creation of the basic FFT design (Appendix A) are 
illustrated. 
 76
B1. TMR MODULE—XILINX SCHEMATIC DESIGN 
 














Figure 24.   TMR’s BF Module 
B2. TMR MODULE—XILINX BEHAVIORAL AND STRUCTURAL DESIGN 
B2.I.  Multiplexer and Voter of Each Stage   
`timescale 1ns / 1ps 
Module MultiplexerA_TMR(clk,Input1_realA,Input1_imageA,Input1_realB, 





input [31:0] Input1_realA,Input1_imageA,Input1_realB,Input1_imageB; 
input [31:0] Input2_realA,Input2_imageA,Input2_realB,Input2_imageB; 
input [31:0] Input3_realA,Input3_imageA,Input3_realB,Input3_imageB; 
 
input clk,WE_BF_for_multiplexer; 
output [31:0] Out_real; 
output [31:0] Out_im; 
reg [31:0] Before_Voter_real [2:0]; 
reg  [31:0] Before_Voter_im [2:0]; 
wire [31:0] Out0_real,Out0_im; 
always @(posedge clk) 
begin 
 if (WE_BF_for_multiplexer) 
  begin 
  Before_Voter_real[0]  <=Input1_realA; 
  Before_Voter_im[0]  <=Input1_imageA; 
  Before_Voter_real[1]  <=Input2_realA; 
  Before_Voter_im[1]  <=Input2_imageA; 
  Before_Voter_real[2]  <=Input3_realA; 
  Before_Voter_im[2]  <=Input3_imageA; 
  end 
 else 
  begin 
   
  Before_Voter_real[0]  <=Input1_realB; 
  Before_Voter_im[0]  <=Input1_imageB; 
  Before_Voter_real[1]  <=Input2_realB; 
  Before_Voter_im[1]  <=Input2_imageB; 
  Before_Voter_real[2]  <=Input3_realB; 
  Before_Voter_im[2]  <=Input3_imageB; 


































THIS PAGE INTENTIONALLY LEFT BLANK 
 81
APPENDIX C. RPR IMPLEMENTATION 
This appendix includes the RPR version of the 64-point, in place, Radix-4, DIF 
FFT design as it is described in Chapter IV.B.3.  All arithmetic calculations of the FFT 
are protected by the reduced precision redundancy method.  The module receives two 
fixed-point 32-bit inputs (real and imaginary) and outputs two fixed-point 32-bit results.  
In the last step of each FFT’s stage, the precise value and the truncated values are 
inspected by a voter, based on the principle of reduced precision redundancy.  The design 
was first implemented in a Virtex II XC2V6000 BF957 and then in a Virtex II 
XC2V8000 5FF1152 using Xilinx tools.  The synthesis report from the XC2V6000 
confirmed that the design needed 85 cycles per stage, a total of 255 cycles of latency at a 
clock speed of 163Mhz. 
In this chapter, only the schematic and structural or behavioral modules that were 
changed or added after the creation of the TMR design (Appendix B) are illustrated. 
 82
C1. RPR MODULE—XILINX SCHEMATIC DESIGN 
 
















Figure 28.   RPR—Butterfly 
C2. RPR MODULE—XILINX STRUCTURAL AND BEHAVIORAL DESIGN 
C2.I. Radiation Module 
`timescale 1ns / 1ps 
module Radiation(clk,Radiation); 
input clk; 
output  [359:0] Radiation; 
//Radiation Guide 
 86
// Radiation [359:0] 
// RadiationA[119:0] = Radiation [359:240] radiation of first stage     
// RadiationB[119:0] = Radiation [239:120] radiation of second stage     
// RadiationC[119:0] = Radiation [119:0] radiation of third stage     
reg [13:0] RadiationAr_upper=14'b0; 
reg [13:0] RadiationBr_upper=14'b0; 
reg [13:0] RadiationCr_upper=14'b0; 
reg [13:0] RadiationAr_lower=14'b0; 
reg [13:0] RadiationBr_lower=14'b0; 
reg [13:0] RadiationCr_lower=14'b0; 
reg [31:0] RadiationAr_TMR=32'b0; 
reg [31:0] RadiationBr_TMR=32'b0; 
reg [31:0] RadiationCr_TMR=32'b0; 
reg [13:0] RadiationAi_upper=14'b0; 
reg [13:0] RadiationBi_upper=14'b0; 
reg [13:0] RadiationCi_upper=14'b0; 
reg [13:0] RadiationAi_lower=14'b0; 
reg [13:0] RadiationBi_lower=14'b0; 
reg [13:0] RadiationCi_lower=14'b0; 
reg [31:0] RadiationAi_TMR=32'b0; 
reg [31:0] RadiationBi_TMR=32'b0; 
reg [31:0] RadiationCi_TMR=32'b0; 
wire [119:0] RadiationA,RadiationB,RadiationC; 
reg  [1:0] counter=2'b00; 
reg [2:0] random=3'b001; 
parameter High=14'b00111000000000; 
parameter Low= 14'b00000000000001; 
reg [6:0] mm=0; 
wire [13:0] Value1; 
assign Value1=High; 



















































 if ((mm==4))//||(mm==6)||(mm==7)) 
  begin 
  RadiationBr_TMR<={(Value1[13:11]+random),Value1[10:0],18'b0}; 
  end 
  else 
  begin 
  RadiationCr_TMR<=0;//{(Value1+random),18'b0}; 
  end 
  RadiationCr_TMR<=0;//{(Value1+random),18'b0}; 
  RadiationAr_TMR<=0;//{(Value1+random),18'b0}; 
 88
  RadiationAi_TMR<=0;//{(Value1+random),18'b0}; 
  RadiationBi_TMR<=0;//{(Value1+random),18'b0}; 
  RadiationCi_TMR<=0;//{(Value1+random),18'b0}; 
  RadiationAr_upper<=0; 
  RadiationBr_upper<=0; 
  RadiationCr_upper<=0; 
  RadiationAi_upper<=0; 
  RadiationBi_upper<=0; 
  RadiationCi_upper<=0; 
  RadiationAr_lower<=0; 
  RadiationBr_lower<=0; 
  RadiationCr_lower<=0; 
  RadiationAi_lower<=0; 
  RadiationBi_lower<=0; 







































C2.II. Multiplexer and Voter of Each Stage   






input [13:0] Input1_realA,Input1_imageA,Input1_realB,Input1_imageB; 
input [31:0] Input2_realA,Input2_imageA,Input2_realB,Input2_imageB; 
input [13:0] Input3_realA,Input3_imageA,Input3_realB,Input3_imageB; 
input clk,WE_BF_for_multiplexer; 
output  [31:0] Out_real; 
output [31:0] Out_im; 
reg signed [31:0] Before_Voter_real; 
reg signed [31:0] Before_Voter_im; 
reg signed [13:0] Upper_bond_real,Upper_bond_im,Lower_bond_real,Lower_bond_im; 
reg signed [31:0] Median_real,Median_im; 
reg signed [13:0] Bottom_real; 
reg signed [13:0] Top_real; 
reg signed [13:0] Bottom_im; 
reg signed [13:0] Top_im; 
reg signed [13:0] Precise_real; 
reg signed [13:0] Precise_image; 
 
 







always @(posedge clk) 
begin 
 if (WE_BF_for_multiplexer) 
  begin 
 90
  Upper_bond_real   =Input1_realA; 
  Upper_bond_im    =Input1_imageA; 
  Before_Voter_real   =Input2_realA; 
  Before_Voter_im   =Input2_imageA; 
  Lower_bond_real   =Input3_realA; 
  Lower_bond_im    =Input3_imageA;  
  end 
 else 
  begin 
   
  Upper_bond_real   =Input1_realB; 
  Upper_bond_im    =Input1_imageB; 
  Before_Voter_real   =Input2_realB; 
  Before_Voter_im   =Input2_imageB; 
  Lower_bond_real   =Input3_realB; 
  Lower_bond_im    =Input3_imageB; 
  end 
 
Median_real={({Upper_bond_real[13],Upper_bond_real}+{Lower_bond_real[13],Lower















and P1 (K4_r,K1_r,K2_r); 
and P2 (K5_r,K1_r,K3_r); 





and P11 (K4_i,K1_i,K2_i); 
and P21 (K5_i,K1_i,K3_i); 





// The following example is a behavioral voter of RPR 
// But due to decrease in the speed I prefered 




// if (Before_Voter_real>Top_real) 
//  begin 
//  Out_real={Median_real,24'b0}; 
//  end 
// else if (Before_Voter_real<Bottom_real) 
//  begin 
//  Out_real={Median_real,24'b0}; 
//  end 
// else 
//  begin 
//  Out_real=Before_Voter_real; 










// if (Before_Voter_im>Top_im) 
//  begin 
//  Out_im={Median_im,24'b0}; 
//  end 
// else if (Before_Voter_im<Bottom_im) 
//  begin 
//  Out_im={Median_im,24'b0}; 
//  end 
// else 
//  begin 
//  Out_im=Before_Voter_im; 


















C2.III.  BF—Delayer   
`timescale 1ns / 1ps 
module Delayer_RPR(In_real,In_im,Out_real,Out_im,clk); 
input clk; 
input    [13:0] In_real,In_im; 
output reg  [13:0] Out_real,Out_im; 
 
reg [13:0] Real_reg_0,Im_reg_0; 
reg [13:0] Real_reg_1,Im_reg_1; 
reg [13:0] Real_reg_2,Im_reg_2; 
reg [13:0] Real_reg_3,Im_reg_3; 
reg [13:0] Real_reg_4,Im_reg_4; 
reg [13:0] Real_reg_5,Im_reg_5; 
reg [13:0] Real_reg_6,Im_reg_6; 
 
// This delayer is for the BF outputs due to the fact that the use of 8/32 or 14/32 degree of 
RPR result a decrease in use of multipliers, CSA and CLAH and in the decrease of the 
needed pipelines. So I have to replace the loss of the 6 "cutted" pipelines with 6 registers 
in order my FFT to be synchronous. 
always @(posedge clk) 
begin 
 Real_reg_0  <= In_real; 




always @(posedge clk) 
begin 
 Real_reg_1  <= Real_reg_0; 





always @(posedge clk) 
begin 
 Real_reg_2  <= Real_reg_1; 




always @(posedge clk) 
begin 
 Real_reg_3  <= Real_reg_2; 




always @(posedge clk) 
begin 
 Real_reg_4  <= Real_reg_3; 




always @(posedge clk) 
begin 
 Real_reg_5  <= Real_reg_4; 
 Im_reg_5  <= Im_reg_4; 
  
end 
always @(posedge clk) 
begin 
 Out_real  <= Real_reg_5; 
 Out_im  <= Im_reg_5; 
end 
endmodule 
C2.IV.  BF——Multiplier 
`timescale 1ns/100ps 
// multiplier RPR Edition 1 
module mult_14x14_edit1_RPR(clk,reset,clk_enable,Ainput,Binput,Sumfinal); 
 
input   [13:0] Ainput,Binput; 
input clk,clk_enable,reset; 
output reg [13:0] Sumfinal; 
reg  [17:0] Ain,Bin; 
reg  [17:0] Ain1,Bin1; 
 94




wire [35:0] telos; 
 





   
MULT18X18S A0B0 (.A(Ain1),  
                      .B(Bin1),  
                      .C(clk),  
                      .CE(clk_enable),  
                      .R(reset),  
                      .P(telos)); 
 












APPENDIX D. MATLAB FILE 
This appendix contains the MATLAB file created to verify the results from 
Xilinx’s RPR module. This file has multiple functions such as: 
 Generates input signals, converts them into two’s complement fixed-point binary 
numbers and exports them into files for use in Modelsim. 
 Imports the generated output files from Modelsim and converts them into 
decimal. 
 Generates two different algorithms for calculating the FFT. It uses the 
MATLAB’s build-in FFT function and a behavioral simulation in MATLAB, 
based on the implemented Xilinx design.  Simultaneously, it loads the data from 
Xilinx that is generated by the Modelsim simulator program. 
 Generates three different graphical figures based on the three set of data. 
 Calculates the bounds for the expected precise calculations error and for the 
radiation imported errors. 
 Shows the output results for every stage of the FFT and compares the three sets of 
data, enabling the user to inspect the impact of radiation in the whole process of 
the FFT algorithm. 
D1. MAIN FILE 
% It works perfectly for |Inputs|<2^(-9). 
% It works with the later RPR designs due to the 3bit scale down of 
% the inputs at the beginning of each stage. 
  
% Radix 4 fft N=64 point 
% programma creates twiddle for N=64 bit 
% fixed point two's complement 
% 30 LSB are the floating. 





% this program computes the FFT Radix4 N=64point in shape geometry, 
% reversing bit in every stage,compares the result of the calculations 
to  
% between my Matlab FFT design and Verilog Design. 
% Athanasios Gavros 15/07/2010 
% edition 1.3a 
% As an input signal : real part :'equationA', imaginary part 
:'equationB'. 
% In Phase Compare between my FFT and Verilog, you are going to observe 
 96
% sometimes huge difference. But that is not entirely truth. Due to 
small 
% numbers (10^(-13) etc ..) that should be accepted equal to zero, any  
% flunctuation can cause wrong results. That why we compare the image 
and  
% real part of the two outputs and if the difference is smaller than 
% 9.3*(10^-7) , Matlab is printing the message  
Phase_Checking='Correct'. " 
  
%Beware that the results of Verilog are not Bit Reversed on the final 
%stage. We are reversing the Address of the Output , but here in 
Matlab, we 
%are doing this final bit reversal in order to compare our final 
results 










































    equationA(n+1)=0.5;%end 
end     
for counter=1:1:64 
    sample_signal_real(counter)=  double( equationA(counter));%0.0001; 
    sample_signal_image(counter)= double( equationB(counter));%0.0001; 




for mm = 0:1:63 
   %radix-4 
    twiddl(mm+1)=double(exp(-2*pi*j*mm*1/64)); 
    Wn_real(mm+1)=(real(twiddl(mm+1)));%cos(2*pi*mm/64); 
    Wn_im(mm+1)=(-imag(twiddl(mm+1)));%sin(2*pi*mm/64); 





    for i=0:16:48 
        if (i==0) 
        Ain_real=sample_signal_real(i+p); 
        Ain_im= sample_signal_image(i+p); 
        end 
        if (i==16) 
        Bin_real=sample_signal_real(i+p); 
        Bin_im= sample_signal_image(i+p); 
        end 
        if (i==32) 
        Cin_real=sample_signal_real(i+p); 
        Cin_im= sample_signal_image(i+p); 
        end 
        if (i==48) 
        Din_real=sample_signal_real(i+p); 
        Din_im= sample_signal_image(i+p); 
        end 
     ert1=ert1+1; 
     check1(ert1)=i+p; 
    end 
        factor1     = Ain_real+Bin_im-Cin_real-Din_im; 
        factor2     = Ain_real-Bin_real+Cin_real-Din_real; 
        factor3     = Ain_real-Bin_im-Cin_real+Din_im; 
        factor4     = Ain_im-Bin_real-Cin_im+Din_real; 
        factor5     = Ain_im-Bin_im+Cin_im-Din_im; 
        factor6     = Ain_im+Bin_real-Cin_im-Din_real; 
  
        Areal_stage1(0+p)       =Ain_real+Bin_real+Cin_real+Din_real; 
        Aim_stage1(0+p)         =Ain_im+Bin_im+Cin_im+Din_im; 
  
            Areal_stage1(16+p)  = double(factor1*Wn_real(p) + 
factor4*Wn_im(p));% ena error sto Wn (p+1)!!! pantoy xaos 
            Aim_stage1(16+p)   = double(factor4*Wn_real(p) - 
factor1*Wn_im(p)); 
 98
            Areal_stage1(32+p)  = double(factor2*Wn_real(2*(p-1)+1) + 
factor5*Wn_im(2*(p-1)+1)); 
            Aim_stage1(32+p)    = double(factor5*Wn_real(2*(p-1)+1) - 
factor2*Wn_im(2*(p-1)+1)); 
            Areal_stage1(48+p)  = double(factor3*Wn_real(3*(p-1)+1) + 
factor6*Wn_im(3*(p-1)+1)); 
            Aim_stage1(48+p)    = double(factor6*Wn_real(3*(p-1)+1) - 
factor3*Wn_im(3*(p-1)+1));   
         
             
             
             
end 
%-------------------------- 
% bit reversal of the Radix4 






    for i2=0:16:48 
        ttt=ttt+1; 
         
    k22(ttt)=v2+p2+i2; 













    for i2=0:4:12 
        ttt=ttt+1; 
         
    k22a(ttt)=v2+p2+i2; 













%% Stage 2 




    for i2=0:4:12 
        if (i2==0) 
        Ain_real2=Areal_stage1_r(i2+p2+v2); 
        Ain_im2= Aim_stage1_r(i2+p2+v2); 
        end 
        if (i2==4) 
        Bin_real2=Areal_stage1_r(i2+p2+v2); 
        Bin_im2= Aim_stage1_r(i2+p2+v2); 
        end 
        if (i2==8) 
        Cin_real2=Areal_stage1_r(i2+p2+v2); 
        Cin_im2= Aim_stage1_r(i2+p2+v2); 
        end 
        if (i2==12) 
        Din_real2=Areal_stage1_r(i2+p2+v2); 
        Din_im2= Aim_stage1_r(i2+p2+v2); 
        end 
      v2+p2+i2; % test milestone 
    end 
   
        factor1_stage2      = Ain_real2+Bin_im2-Cin_real2-Din_im2; 
        factor2_stage2      = Ain_real2-Bin_real2+Cin_real2-Din_real2; 
        factor3_stage2      = Ain_real2-Bin_im2-Cin_real2+Din_im2; 
        factor4_stage2      = Ain_im2-Bin_real2-Cin_im2+Din_real2; 
        factor5_stage2      = Ain_im2-Bin_im2+Cin_im2-Din_im2; 
        factor6_stage2      = Ain_im2+Bin_real2-Cin_im2-Din_real2; 
         
        Areal_stage2(v2+p2)     
=Ain_real2+Bin_real2+Cin_real2+Din_real2; 
        Aim_stage2(v2+p2)       =Ain_im2+Bin_im2+Cin_im2+Din_im2; 
  
            Areal_stage2(v2+p2+4)   = 
double(factor1_stage2*Wn_real(1+e2*4) + factor4_stage2*Wn_im(1+e2*4)); 
            Aim_stage2(v2+p2+4)     = 
double(factor4_stage2*Wn_real(1+e2*4) - factor1_stage2*Wn_im(1+e2*4)); 
            Areal_stage2(v2+p2+8)   = 
double(factor2_stage2*Wn_real(1+e2*8) + factor5_stage2*Wn_im(1+e2*8)); 
            Aim_stage2(v2+p2+8)     = 
double(factor5_stage2*Wn_real(1+e2*8) - factor2_stage2*Wn_im(1+e2*8)); 
            Areal_stage2(v2+p2+12)  = 
double(factor3_stage2*Wn_real(1+e2*12) + 
factor6_stage2*Wn_im(1+e2*12)); 










% bit reversal of the Radix4 






    for i2=0:16:48 
        ttt=ttt+1; 
         
    k22(ttt)=v2+p2+i2; 








%% Stage 3 
%  Input signal 
  
for p3=0:4:60 
    for i3=1:1:4 
        if (i3==1) 
        Ain_real3=Areal_stage2_r(i3+p3); 
        Ain_im3= Aim_stage2_r(i3+p3); 
        end 
        if (i3==2) 
        Bin_real3=Areal_stage2_r(i3+p3); 
        Bin_im3= Aim_stage2_r(i3+p3); 
        end 
        if (i3==3) 
        Cin_real3=Areal_stage2_r(i3+p3); 
        Cin_im3= Aim_stage2_r(i3+p3); 
        end 
        if (i3==4) 
        Din_real3=Areal_stage2_r(i3+p3); 
        Din_im3= Aim_stage2_r(i3+p3); 
        end 
      oo1=p3+i3; % test milestone 
    end 
       
        factor1_stage3      = Ain_real3+Bin_im3-Cin_real3-Din_im3; 
        factor2_stage3      = Ain_real3-Bin_real3+Cin_real3-Din_real3; 
        factor3_stage3      = Ain_real3-Bin_im3-Cin_real3+Din_im3; 
        factor4_stage3      = Ain_im3-Bin_real3-Cin_im3+Din_real3; 
        factor5_stage3      = Ain_im3-Bin_im3+Cin_im3-Din_im3; 
        factor6_stage3      = Ain_im3+Bin_real3-Cin_im3-Din_real3;  
        Areal_stage3(p3+1)      
=Ain_real3+Bin_real3+Cin_real3+Din_real3; 
        Aim_stage3(p3+1)        =Ain_im3+Bin_im3+Cin_im3+Din_im3; 
 101
  
            Areal_stage3(p3+2)   = double(factor1_stage3*Wn_real(1) + 
factor4_stage3*Wn_im(1)); 
            Aim_stage3(p3+2)     = double(factor4_stage3*Wn_real(1) - 
factor1_stage3*Wn_im(1)); 
            Areal_stage3(p3+3)   = double(factor2_stage3*Wn_real(1) + 
factor5_stage3*Wn_im(1)); 
            Aim_stage3(p3+3)     = double(factor5_stage3*Wn_real(1) - 
factor2_stage3*Wn_im(1)); 
            Areal_stage3(p3+4)  = double(factor3_stage3*Wn_real(1) + 
factor6_stage3*Wn_im(1)); 
            Aim_stage3(p3+4)    = double(factor6_stage3*Wn_real(1) - 
factor3_stage3*Wn_im(1)); 
            
end 
%-------------------------- 
% bit reversal of the Radix4 






    for i2=0:16:48 
    ttt=ttt+1; 
         
    k22(ttt)=v2+p2+i2; 





%if (Aim_stage1(i)<10^(-10))  





























title ('My Matlab FFT design - Radix 4, N=64, in shape geometry, 
reversing bit in each stage -    The following plot is Stage1 real'); 




title ('Stage1 im'); 
axis auto ; 
  
axis auto ; 
subplot (8,1,3) 
plot ([0:63],Areal_stage2,'-o'); 
title ('Stage2 real'); 




title ('Stage2 im'); 




title ('Stage3 real'); 




title ('Stage3 im'); 









title ('The magnitude result of each stage of my FFT desing in Matlab- 



















%% saving the choosen equation into binary file named  
Input_from_Matlab_real.in & Input_from_Matlab_image.in 
  
%-------------------------- 
% decimal to binary 
% Accepts decimal numbers between -1 < Number < 1 
%and transform them into (2bit integer and 30 bit floating) 
% equationA is the input decimal number variable. 
% pinakas is the output binary number 




    if (number3==1)  
        Realoo(1:64)=sample_signal_image(1:64); 
    else 
        Realoo(1:64)=sample_signal_real(1:64); 




    pinakas{ax,1}=0; 
    pinakas{ax,2}=1; 
    for tr=3:1:64 
        pinakas{ax,tr}=0; 
    end 
end 
if (Realoo(ax)==0) 
    for i=1:1:64 
        pinakas{ax,i}=0; 
        telos=1; 
    end 
end 
if (Realoo(ax)==-1) 
    pinakas{ax,1}=1; 
    pinakas{ax,2}=1; 
    for tr=3:1:64 
        pinakas{ax,tr}=0; 
    end 
end 
if (Realoo(ax)<1)&(Realoo(ax)>0) 
    pinakas{ax,1}=0; 
    pinakas{ax,2}=0; 
    zhtoymeno5=Realoo(ax); 
 104
    for tr=3:1:64 
        zhtoymeno5=zhtoymeno5*2; 
        if (zhtoymeno5>=1) 
            zhtoymeno5=zhtoymeno5-1; 
            pinakas{ax,tr}=1; 
        else 
            pinakas{ax,tr}=0; 
        end 




    pinakas{ax,1}=1; 
    pinakas{ax,2}=1; 
    zhtoymeno5=1+Realoo(ax); 
    for tr=3:1:64 
        zhtoymeno5=zhtoymeno5*2; 
        if (zhtoymeno5>=1) 
            zhtoymeno5=zhtoymeno5-1; 
            pinakas{ax,tr}=1; 
        else 
            pinakas{ax,tr}=0; 
        end 







   Decimal_to_binary=[Decimal_to_binary pinakas{i,e}]; %concatenate 
end 
  
 if (number3==1) binary_result_to_Verilog_image = Decimal_to_binary   ; 
% contains the decimal result of our Verilog Design (image)   
 else  binary_result_to_Verilog_real = Decimal_to_binary   ; % contains 
the decimal result of our Verilog Design (real) 
 end 
  
    end 
end 
% it stores the 32bit number in line without any ; or any other 
distinction 
fid22 = fopen('Input_from_Matlab_image.in','w'); 
fprintf(fid22,'%u',binary_result_to_Verilog_image); 
fclose(fid22) 
fid122 = fopen('Input_from_Matlab_real.in','w'); 
fprintf(fid122,'%u',binary_result_to_Verilog_real); 
fclose(fid122) 

















title ('Matlabs Input Function Real'); 





title ('Matlabs Input Function Image'); 




title ('FFT N=64 - Matlabs Ready Function'); 










title ('Radix 4 FFT N=64 VERILOG output before the proper reverse of 
the final stage - As it comes from the file'); 
axis auto; 
%--------------------------------------------------------------------- 
%% Helpfull reversing of verilog files stages 
  
%-------------------------- 
% bit reversal of the Radix4 






    for i2=0:16:48 
    ttt=ttt+1; 
         
    k22(ttt)=v2+p2+i2; 





%if (Aim_stage1(i)<10^(-10))  













title ('My Verilog FFT Stage Final Output'); 
ylabel('Magnitude'); 
axis auto ; 






  for v21=1:1:4 
    for i21=0:4:12 
    ttt1=ttt1+1; 
         
    k221(ttt1)=v21+p21+i21; 





%if (Aim_stage1(i)<10^(-10))  





%% The magnitude result of each stage of my FFT desing in Matlab 
%compared to the magnitude result of each stage from Verilog(after 
being 


























title ('The magnitude result of each stage of my FFT desing in Matlab 
compared to the magnitude result of each stage from Verilog(after being 
radiated)(not reversed)- This plot is for Magnitude of stage1'); 




























ylabel('Difference in stage1 values of figure 12'); 
grid 



























ylabel('Difference in stage1 values of figure 12'); 
grid 
















ylabel('Difference in stage3 values of figure 12'); 
%-------------- 




















title ('Real results Difference'); 
figure (10) 
plot (imag(ee1(:,1)),[1:64],'xr',imag(ee1(:,3)),[1:64],'ob'); 








title ('Radix 4 FFT N=64 - My Verilog FFT'); 
ylabel('Magnitude'); 
















title ('FFT N=64 - Matlabs Build In Function'); 
ylabel('Magnitude'); 
axis auto ; 
subplot (6,1,6) 
plot (F1,Angle_of_build_in_fft,'-x'); 
title ('Angle of the build in fft '); 
ylabel('Phase') 
%---------------------------------------------------------------------- 
%comparing  My Matlab Function FFT with FFT Verilog (comparing  
%the magnitude and the phase) 
figure (4) 
for i=1:1:64 
    if (my_Matlab_fft<10^12) 






    Phase_and_Magnitude_Checking='Correct !!' 
else 




title ('Difference in Magnitude or in Phase between build in FFT and my 
Matlab FFT '); 
axis auto ; 
%axis ([0 1 0 (10^(-8))]);%metalaksh 
xlabel('N=64') 
ylabel('Absolute value') 
legend('Magnitude Difference','Phase Difference'); 
figure (13) 
subplot(3,2,1),plot (F1,Xin_real,'-o') 
title ('Matlabs Input Function Real'); 
axis auto ; 
subplot (3,2,2),plot (F1,Xin_image,'-o') 
title ('Matlabs Input Function Image'); 
axis auto ; 
subplot (3,2,[5 6]) 
difference_in_output_a=fft_build_in_matlab-FFT_Verilog_Final_Magn; 
plot (F1,difference_in_output_a,'xr'); 
ylabel('Detailed Difference in Magnitude of the Output') 
  
subplot (3,2,[3 4]) 
plot (F1,fft_build_in_matlab,'--gs',F1,FFT_Verilog_Final_Magn,'--+'); 
ylabel('Magnitude of the Output') 
xlabel('N=64 samples') 




    if  (abs(difference_in_output_a(i))>3.8*10^(-6)) 
        plot ((i-1)/64,FFT_Verilog_Final_Magn(i),'ro') 
        legend ('fft build in matlab','FFT Verilog Final Magn (Post-
Route Simulation)','Identified Errors'); 








    if (Real_diff<Check_phase1(i)) 
        Real_diff=Check_phase1(i); 
    end 
    if (Image_diff<Check_phase2(i)) 
        Image_diff=Check_phase2(i); 




    Major_diff=Real_diff; 
else 




if (single(Major_diff)<=(2^(-1)))  
    title ('FFT Comparing Check =Correct! Your Verilog FFT works at 167 
MHz'); 
else 








% binary to decimal 
% Accepts binary numbers between 01.000..00 - 11.111...11 
%(2bit integer and 30 bit floating) 
% ap is the input binary number variable.(be careful, there is no 
decimal 
% point in the input binary numbers! 
% result1 is the output decimal number 
  
  
% the following part loads the 32bit number from Verilog that are in 
line without any ; 
% or any other distinction (only a space between the numbers) 
% "transformed_bin_to_dec" is a 32bit 64 cell matrix that contains the 
binary 





btd = fopen(TB_data_file_Output_image,'r'); 
Binary_to_decimal_image=fscanf(btd,'%c'); 




%for eqq=0:1:31   
   [token, Binary_to_decimal_image] = strtok(Binary_to_decimal_image); 
    %transformed_bin_to_dec{i,1:32}=sscanf (token,'%c'); 
    transformed_bin_to_dec_image{i}=token; 
    %Binary_to_decimal{1,1:32}; 









%for eqq=0:1:31   
   [token, Binary_to_decimal_real] = strtok(Binary_to_decimal_real); 
    %transformed_bin_to_dec{i,1:32}=sscanf (token,'%c'); 
 112
    transformed_bin_to_dec_real{i}=token; 
    %Binary_to_decimal{1,1:32}; 







    if (number2==1)  
        ap=transformed_bin_to_dec_image{oeo}; 
    else 
        ap=transformed_bin_to_dec_real{oeo}; 












    result1=0; 
    for  e1=3:1:32 
        if (ap(e1))=='1' 
            num1=1; 
        else 
            num1=0; 
        end 
         
        zhtoymeno1=(zhtoymeno1)/2; 
        result1=num1*zhtoymeno1+result1; 
    end   
end 
     
if (ap(1:2)=='11') 
    result1=-1; 
    for  e1=3:1:32 
        if (ap(e1))=='1' 
            num1=1; 
        else 
            num1=0; 
        end 
        zhtoymeno1=(zhtoymeno1)/2; 
        result1=result1+num1*zhtoymeno1; 
    end   
end 
% in the following calculation I multiply the final result by 2^9 due 
to 
% 3bit step down on each stage of the FFT =>(2^3)*(2^3)*(2^3)=2^9 
 113
 if (number2==1) decimal_result_of_Verilog_image (oeo)= result1*multipl   
; % contains the decimal result of our Verilog Design (image)   
 else  decimal_result_of_Verilog_real (oeo)= result1*multipl   ; % 
contains the decimal result of our Verilog Design (real) 
 end 
  


























THIS PAGE INTENTIONALLY LEFT BLANK 
 115
LIST OF REFERENCES 
[1] Joshua D. Snodgrass, “Low-Power Tolerance for Spacecraft FPGA-based 
Numerical Computing”, Dissertation,  Department of Electrical and Computer 
Engineering, Naval Postgraduate School, Monterey, California, September 2006. 
 
[2] L. D. Edmonds, “An Introduction to Space Radiation Effects on 
Microelectronics” JPL publication 00-06, Jet Propulsion Laboratory, California 
Institute of Technology Pasadena, California, May 2000. 
 
[3] Herschel Loomis and Alan Ross, “Configurable Fault-Tolerant Architectures for 
Reliable Space-Based Computing (CFTP),” Naval Postgraduate School Research 
Proposal to the Secretary of the Air Force, March 2010. 
 
[4] Margaret A. Sullivan, “Reduced Precision Redundancy applied to arithmetic 
operations in field Programmable Gate Arrays for satellite control and sensor 
systems”, Master’s Thesis, Department of Electrical and Computer Engineering, 
Department of Mechanical and Astronautical Engineering, Naval Postgraduate 
School, Monterey, California, December 2008. 
 
[5] Nikolaos Gkikas, “Architecture for the Creation in Software, FFT /IFFT for 
Wireless Networks”, Master’s Thesis, Department of Physics, University of 
Ioannina, Patra, Greece, December 2005. 
 
[6] J. W. Cooley and J.W. Tukey, “An Algorithm for the Machine Calculation of 
Complex Fourier Series”, Math of Comp., Vol.19, No.90, pp 297-301, April 
1965. 
 
[7] Charles Wu, “Implementing the Radix-4 Decimation in Frequency (DIF) Fast 
Fourier Transform (FFT) Algorithm Using a TMS320C80 DSP”, Application 
report: SPRA152, SC Sales & Marketing – TI Taiwan, Digital signal Processing 
Solutions, January 1998, Texas Instruments. 
 
[8] Marianne Delphin, “Implementing the Radix-4 FFT Algorithm Using the ST120 
DSP”, application note AN1381, October 2001, STMicroelectronics. 
 
[9] Zhijan Sun, Xuemei Liu and Zhongxing Ji, “The Design of Radix-4 FFT by 
FPGA”, College of Science, Qingdao Technological University , Qingdao, 
Shandong, 266033, China. 
 
[10] Saad Bouguezel,M. Omair Ahmad, “Improved Radix-4 and Radix-8 FFT 





[11] Chu Chao, Zhang Qin, Xie Yingke and Han Chengde, “Design of a High 
Performance FFT Processor Based on FPGA.” 
 
[12] John G. Proakis, Dimitris G. Manolakis, “Digital Signal Processing Principles, 
Algorithms and Applications”, Third Edition, Prentice-Hall International, INC, 
1996. 
 
[13] Douglas F. Elliot, “Handbook of Digital Signal Processing Engineering 
Applications”, Rockwell International Corporation, Anaheim, California, 
Academic Press, INC. 
 
[14]  Siva Kumar Palaniappan & Tun Zainal Azni Zulkifli, “Design of 16-point Radix-
4 Fast Fourier Transform jn 0.18μm CMOS Technology”, American Journal of 
Applied Science 4(8), pp. 570-575, ISSN 1546–9239, 2007.  
 
[15] Douglas Lee Jones, “Decimation-in-Frequency (DIF) Radix-2 FFT”, 
http://cnx.org/content/m12018/latest/. (Accessed May 20, 2010). 
 
[16] CMLab, DSP Research Group, Taiwan, “Fast Fourier Transform,” 
http://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/transform/fft.html. 
(Accessed May 20, 2010). 
 
[17] Vijay Madisetti, Douglas Bennett Williams, “The Digital Signal Processing 
Handbook”, IEEE Press, CRC Press LLC, 1997. 
 
[18] Raymond F. Bernstein, Jr. “A Pipelined Vector Processor and Memory 
Architecture for Cyclostationary Processing.”, Dissertation, Department of 
Electrical and Computer Engineering, Naval Postgraduate School, Monterey, 
California, December 1995. 
 
[19] Herschel H. Loomis, Jr., Notes from EC4830, Naval Postgraduate School, 
Monterey, CA, Week1, March 2, 2009,  (unpublished). 
 
[20] Yao-Ting Cheng, “Autoscaling Radix-4 FFT for TMS320C6000”, application 




INITIAL DISTRIBUTION LIST 
1. Defense Technical Information Center 
Ft. Belvoir, Virginia  
 
2. Dudley Knox Library 
Naval Postgraduate School 
Monterey, California  
 
3. Herschel H. Loomis, Jr. 
Naval Postgraduate School 
Monterey, California 
 
4. Alan A. Ross 
Naval Postgraduate School 
Monterey, California 
 
