Study of hardware and software optimizations of SPEA2 on hybrid FPGAs by Theophila, Brad
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
7-1-2008
Recommended Citation
Theophila, Brad, "Study of hardware and software optimizations of SPEA2 on hybrid FPGAs" (2008). Thesis. Rochester Institute of
Technology. Accessed from
Study of Hardware and Software Optimizations of SPEA2
on Hybrid FPGAs
by
Brad Theophila
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Engineering
Supervised by
Dr. Marcin Lukowiak
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
July 2008
Approved By:
Dr. Marcin Lukowiak
RIT Department of Computer Engineering
Primary Advisor
Dr. Vincent J. Amuso
RIT Department of Electrical Engineering
Secondary Advisor
Dr. Muhammad Shaaban
RIT Department of Computer Engineering
Secondary Advisor
Thesis Author Permission Statement
Title: Study of Hardware and Software Optimizations of SPEA2 on Hybrid FPGAs
Author: Brad Theophila
Degree: Master of Science
Program: EECB
College: Kate Gleason College of Engineering
I understand that I must submit a print copy of my thesis or dissertation to the RIT
Archives, per current RIT guidelines for the completion of my degree. I hereby grant to
the Rochester Institute of Technology and its agents the non-exclusive license to archive
and make accessible my thesis or dissertation in whole or in part in all forms of media in
perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation.
I also retain the right to use in future works (such as articles or books) all or part of this
thesis or dissertation.
Print Reproduction Permission Granted:
I, Brad Theophila, hereby grant permission to the Rochester Institute of Technology to
reproduce my print thesis in whole or in part. Any reproduction will not be for commercial
use or profit.
Brad Theophila Date
Inclusion in the RIT Digital Media Library
Electronic Thesis & Dissertation (ETD) Archive
I, Brad Theophila, additionally grant to the Rochester Institute of Technology Digital
Media Library (RIT DML) the non-exclusive license to archive and provide electronic ac-
cess to my thesis or dissertation in whole or in part in all forms of media in perpetuity. I
understand that my work, in addition to its bibliographic record and abstract, will be avail-
able to the world-wide community of scholars and researchers through the RIT DML. I
retain all other ownership rights to the copyright of the thesis or dissertation. I also retain
the right to use in future works (such as articles or books) all or part of this thesis or disser-
tation. I am aware that the Rochester Institute of Technology does not require registration
of copyright for ETDs. I hereby certify that, if appropriate, I have obtained and attached
written permission statements from the owners of each third party copyrighted matter to be
included in my thesis or dissertation. I certify that the version I submitted is the same as
that approved by my committee.
Brad Theophila Date
Acknowledgments
I would like to thank Dr. Lukowiak, Dr. Amuso, and Dr. Shaaban for their guidance in
the process of this thesis. I’d also like to thank some of my peers including Jason Enslin,
Glenn Ramsey, and Dmitriy Bekker for their continued support and lending of a helping
hand.
Abstract
Traditional radar technology consists of multiple platforms, each designed to process only
a single mission objective, such as Ground Moving Target Indication (GMTI), Airborne
Moving Target Indication (AMTI) or Synthetic Aperture Radar (SAR). This is no longer
considered a cost effective solution, thus leading to the increased need for a single radar
platform which can perform multiple radar missions. Many algorithms have been de-
veloped to specifically address multi-objective design problems. One such approach, the
Strength Pareto Evolutionary Algorithm 2 (SPEA2), applies the concept of evolution through
a Genetic Algorithm (GA) to the design of simultaneous orthogonal waveforms.
The objectives of the various radar missions are often conflicting. The goal of SPEA2
is to find the best waveform suite in the Pareto sense. Preliminary results of this algorithm
applied to a scaled-down multi-objective mission scenario have been promising. One drawback
of this algorithm is its high computational complexity. Even in a scaled-down
simulation, performance does not meet expectations.
This thesis investigated a hardware and software optimization of SPEA2 applied to si-
multaneous multi-mission waveform design, using hybrid FPGAs. Hybrid FPGAs contain
a combination of a single or multiple embedded processors and reconfigurable hardware.
The algorithm was first implemented in C on a PC, then profiled and analyzed. The C
code was ported to run on an embedded PowerPC 405 processing core on a Virtex-4 FX
(V4FX). The hardware fabric of the V4FX was utilized to offload the main bottleneck of
the algorithm from the PowerPC 405 core to hardware for speedup, while various software
optimizations were also implemented, in an effort to improve performance. Performance
results from the V4FX implementation were not ideal. Thus, many suggestions for future
work that may achieve the desired performance are posed in this thesis.
Contents
Acknowledgments
Abstract
1 Introduction
1.1 Multi-Mission Radar Optimization
1.2 Thesis Description
1.3 Motivation for Hardware/Software Optimization on Hybrid FPGA
1.4 Overview
2 Background
2.1 Waveform Conflict Analysis
2.2 Multi-Objective Optimization Problem (MOP)
2.3 Evolutionary Computation
2.4 Pareto Optimality
2.5 Previous Work
2.5.1 SPEA
2.5.2 NSGA-II
2.5.3 PESA
2.5.4 SPEA2
2.5.5 Comparisons
3 The Strength Pareto Evolutionary Algorithm 2 (SPEA2)
3.1 Overview
3.2 Fitness Evaluation
3.3 Environmental Selection
4 SPEA2 Applied to Multi-Mission Radar Waveform Design
4.1 Waveform Suite
4.2 Objective Functions
4.2.1 Revisit Time MTI Objective Function
4.2.2 Pulse Integration MTI Objective Function
4.2.3 Peak Sidelobe Level SAR Objective Function
4.2.4 Integrated Sidelobe Level SAR Objective Function
4.3 Scaled Simulation Scenario: SAR Mission
4.4 Full Bandwidth, Single Aperture Experiment
4.5 Software Implementation
4.6 Results
5 Xilinx Virtex-4 FX Hybrid FPGA
5.1 Virtex-4 FX Hybrid-FPGA Overview
5.2 PowerPC 405 Embedded Processor
5.3 Auxiliary Processor Unit Controller
5.4 ML410 Development Board
5.5 SPEA2 Base System
6 Performance Analysis and Optimizations
6.1 Evaluation Function
6.2 Profiling
6.3 Software Optimizations
6.3.1 Fixed-Point Conversion
6.3.2 In-line Function
6.3.3 Data Flow
6.3.4 Compiler Optimizations
6.3.5 Sparse 2D IFFT
6.3.6 VPH Transposition
6.4 Hardware Optimizations
6.5 Results
6.6 Result Analysis
6.6.1 Comparison of Results
6.6.2 Software Bottlenecks
7 Future Work
7.1 Increasing Clock Speed
7.2 Utilization of Second PowerPC
7.3 2D IFFT in Parallel
7.4 Migrate Power Calculations to FPGA Fabric
7.5 Migrate Entire Evaluation Function to FPGA Fabric
7.6 Hybrid FPGA Alternative Solutions
7.6.1 Cluster Computing
7.6.2 Cell Processor
8 Conclusions
Bibliography
List of Figures
2.1 Mapping from decision space to objective space [2]
2.2 Simple genetic algorithm flowchart [9]
2.3 Example binary encoded individual [9]
3.1 SPEA2 Flow [9]
3.2 Illustration of SPEA2 archive truncation scheme [9]. N = 4 and numbers indicate elimination order.
4.1 Binary Encoded Waveform Suite [9]
4.2 MTI Revisit Time objective function [9]
4.3 MTI Pulse Integration objective function [9]
4.4 Peak Sidelobe Level (PSL) SAR objective function [9]
4.5 Integrated Sidelobe Level (ISL) SAR objective function [9]
4.6 SAR Mission [9]
4.7 Waveform Suite for Specific Mission [9]
4.8 Simulation Parameters for Full Bandwidth, Single Aperture Experiment [9]. Note: Simulation was only run with population size 50 and archive size 10
4.9 Initial population and final archive population for software implementation on PC
4.10 Initial population and final archive population for MATLAB implementation in [9]
5.1 PPC405 CPU core block diagram [11]
5.2 APU controller processing operative block diagram [18]
5.3 ML410 Development Board [16]
5.4 Base SPEA2 system diagram
5.5 Device utilization summary of SPEA2 base system
6.1 Evaluation function execution time pie chart
6.2 SPEA2 system with FFT co-processor
6.3 Pipelined, Streaming IO FFT Architecture [21]
6.4 CORE Generator screenshot of XFFT parameters and resources
6.5 APU Load Instruction and FFT State Machine
6.6 APU Store Instruction State Machine
6.7 Device utilization summary of SPEA2 system with FFT co-processor
6.8 Initial population and final archive population for hardware/software implementation
6.9 Percentage of run times for non-optimized software on PC and software only optimizations and software/hardware optimizations on the Virtex4FX
7.1 Dual PowerPC SPEA2 system with FFT co-processors and shared BRAM
7.2 Dual PowerPC memory map [19]
7.3 XFFT Resource Usage vs. Throughput [21]
7.4 Overview of the Cell processor [15]
7.5 Performance of 1D and 2D FFT in DP (top) and SP (bottom) [15]
List of Tables
6.1 Evaluation function descriptions and number of operations per range cell (150 cells total)
6.2 SPEA2 PC Implementation Performance
6.3 SPEA2 Hybrid FPGA Implementation Performance
Chapter 1
Introduction
1.1 Multi-Mission Radar Optimization
Conventional surveillance systems are no longer cost-effective solutions because they lack
the capability to perform more than one function. For instance, if both ground and airborne
moving target indication missions were required, two independent radar platforms would be
needed to fulfill both objectives. Radar technology is being developed to address multiple
missions within a single platform. Performance of this type of multi-mission platform
will be hindered by the sequential operation of each mission. This presents the need for
waveform suites capable of representing multiple missions [1].
A detailed analysis of the performance of a waveform suite must be carried out for all
missions. Earlier approaches did this with closed-form analytical methods, using one
performance evaluation for each mission. This was feasible because waveform generation
capabilities were limited and only one Coherent Processing Interval (CPI) was taken into
consideration. Digital generation allows many waveforms to be created, all addressing
multiple missions over many CPIs. Direct analytical approaches cannot handle this
expansion of possibilities, so other methods, such as evolutionary computing, have been
adopted.
1.2 Thesis Description
Work done prior to this thesis involved developing a multi-mission radar waveform suite
concept [1] and selecting an evolutionary algorithm for the purpose of solving the Multi-
Objective Optimization Problem (MOP) as it applies to the radar scenario at hand [2, 3,
9]. A simulation applying the Strength Pareto Evolutionary Algorithm 2 (SPEA2) to the
concept of multi-mission radar waveform optimization was coded and tested in Matlab in
[2, 9].
This simulation was used as the basis for work in this thesis. The Matlab code was
used to develop a C implementation of the scenario, which was tested on a PC using
Microsoft Visual Studio 2005. The results of the C implementation were analyzed and
verified for correctness against the Matlab simulation results. The C implementation on
the PC was then analyzed for run-time performance. The code was profiled, and an extensive
investigation of the bottleneck and its computational complexity was performed.
At this point, a base hardware/software system for the SPEA2 radar scenario was cre-
ated and customized on the Virtex-4 FX (V4FX). The C code was then modified to run
on one of the embedded PowerPC 405 CPU cores in the V4FX. Upon verification of cor-
rectness of the modified C code on the PowerPC 405 core, software optimizations were
performed, using performance analysis data gathered previously. The code-base was then
tested again to ensure that the code was still behaving correctly, after employing software
optimizations.
A Fabric Co-Processor Module (FCM) designed to handle the biggest bottleneck of
the algorithm, a 2-Dimensional Inverse Fast Fourier Transform (2D IFFT), was then cre-
ated and tested using Xilinx ISE tools. This FCM was interfaced to a PPC405 core using
the Auxiliary Processor Unit (APU). The code-base was modified to integrate the FCM
and then tested again to verify correctness of the results. At this point, the code was pro-
filed again, and further software optimizations were performed. After adding more soft-
ware optimizations and fine-tuning the FCM, the final system was profiled and tested for
correctness of results. The performance results were then analyzed and used to develop
suggestions for future work, which are provided in this document.
1.3 Motivation for Hardware/Software Optimization on
Hybrid FPGA
The intent is to someday achieve real-time performance of this algorithm, on a mobile
platform which could reside on a plane or vehicle. The first step in achieving this is to
identify and overcome the bottlenecks. Before starting work on this thesis it was identified
that a repetitious 2-Dimensional Inverse Fast Fourier Transform (IFFT) was consuming
much of the execution time in a previous scenario.
With the concept that IFFT operations are good candidates for hardware speedup,
a hardware/software optimization was chosen as the implementation goal, on a Hybrid
FPGA. Specifically, a Virtex-4 FX (V4FX) was chosen as the implementation platform. A
hybrid FPGA contains at least one embedded processor, as well as reconfigurable hardware
fabric. The V4FX has two embedded PowerPC cores, in addition to the hardware fabric.
This platform allows the 2D IFFT to be offloaded into hardware for speedup, while other
aspects of the algorithm are managed and optimized in C code on the embedded PowerPC
cores. Thus, it offers the potential to speed up the algorithm on a single platform.
1.4 Overview
This thesis will begin by giving background information in Chapter 2. This chapter will
discuss waveform conflict analysis and how it is an instance of the Multi-Objective Opti-
mization Problem (MOP). This chapter will also discuss evolutionary algorithms, Pareto
optimality, and evolutionary algorithms to solve the MOP that use Pareto optimality.
The Strength Pareto Evolutionary Algorithm 2 (SPEA2) will be discussed in Chapter 3. This
chapter will highlight the specific features of SPEA2 and give details of the fitness
evaluation and environmental selection schemes employed.
SPEA2 applied to multi-mission radar waveform design will be discussed in Chapter 4.
This chapter will explain the scaled-down physical Synthetic Aperture Radar (SAR)
mission, and how it applies to multi-mission radar waveform optimization and SPEA2,
including the waveform suite approach and objective functions used for determining the
fitness of waveforms. This chapter will also discuss a C implementation on a PC that was
carried out for the scenario, as well as results from this implementation.
Performance analysis and optimizations of SPEA2 applied to multi-mission radar waveform
design will be discussed in Chapter 6. This will include profiling of the algorithm,
software optimizations, hardware optimizations, and results after performing the optimiza-
tions. Future work will be discussed in Chapter 7. Suggestions for methods of improving
performance using the same implementation platform, as well as other implementation
platform options, will be provided and discussed. Finally, a conclusion that sums up the
entire thesis will be presented in Chapter 8.
Chapter 2
Background
This chapter will discuss some background information needed to understand the concepts
of this thesis. Waveform conflict will be discussed in Section 2.1. This will lead to the
introduction of the Multi-Objective Optimization Problem (MOP), which will be discussed in
Section 2.2. An overview of evolutionary algorithms, a method for solving the MOP, will
be provided in Section 2.3. The concept of Pareto optimality will first be discussed in
Section 2.4, before introducing some specific evolutionary algorithms for solving the MOP
which use Pareto optimality, in Section 2.5. Among these is the Strength Pareto
Evolutionary Algorithm 2 (SPEA2), which is the algorithm focused on in this thesis. A
comparison of the algorithms will be provided in Section 2.5.5, as well as reasons why
SPEA2 was chosen for this thesis.
2.1 Waveform Conflict Analysis
The progression of radar technology has led to different descriptions of a radar waveform.
Originally the parameters of a single pulse or continuous tone defined the waveform. The
introduction of the coherent processing interval (CPI) in conjunction with ground-based
radars made it a necessity to define the parameters of a pulse burst. The development of
airborne radar systems brought about the need to process several CPIs. Waveforms became
more complex, consisting of parameters of the pulse such as amplitude, shape, bandwidth,
pulse width and compression ratio, as well as parameters of the pulse burst for a single CPI
such as Pulse Repetition Frequency (PRF) and duty cycle, and parameters for the overall
dwell such as PRF for range resolution and Doppler ambiguities.
Many conflicts can arise when synthetic aperture radar (SAR) imaging and air and
ground moving target indication (AMTI and GMTI) missions are addressed in a multi-
mission operation. The complexity of these conflicts adds to the difficulty associated with
finding an optimal waveform. These conflicts include [1]:
• Dwell Time, which is usually in tens of milliseconds (about 100 ms) for AMTI and
in tens of minutes (about 20 minutes) for SAR
• Pulse Repetition Frequency (PRF), which is required to be a medium PRF for AMTI
missions, while a low PRF is needed for GMTI and SAR
• Bandwidth, which is typically required to be in the range of several megahertz for
AMTI and GMTI, while SAR requires anywhere from 30 MHz for stripmap to 400 MHz for
spotlight
• Beam Scan, which is required to be a scanning beam for GMTI, AMTI and spotlight
SAR missions, while stripmap SAR missions need a side-looking beam.
2.2 Multi-Objective Optimization Problem (MOP)
The Multi-Objective Optimization Problem (MOP) consists of finding a vector of decision
variables that satisfies constraints and optimizes a vector function whose elements
represent the objective functions [12]. The mathematical description formed by these
functions provides performance criteria that most often conflict. For a solution to fit
the problem well, all objective functions must take acceptable values.
The diagram in Figure 2.1 illustrates the mapping from decision variable space to multi-
objective function space. The problem can be formulated in mathematical terms as finding
the vector
\[ x = [x_1^{*}, x_2^{*}, \ldots, x_n^{*}]^{T} \tag{2.1} \]
[Figure 2.1 placeholder: decision variable space (axes x1, x2; feasible region X) mapped to objective function space (axes F1, F2, F3; region Λ)]
Figure 2.1: Mapping from decision space to objective space [2]
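that satisfies the m inequality constraints
\[ g_i(x) \geq 0 \quad \text{for } i = 1, 2, \ldots, m \tag{2.2} \]
as well as the p equality constraints
\[ h_i(x) = 0 \quad \text{for } i = 1, 2, \ldots, p < n \tag{2.3} \]
in order to optimize the vector function
\[ \vec{f}(\vec{x}) = [f_1(\vec{x}), \ldots, f_i(\vec{x}), \ldots, f_k(\vec{x})]. \tag{2.4} \]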
Any point which satisfies the conditions stated in (2.2) and (2.3) is a member of set $X$. Any
value of $x$ from equation (2.1) in $X$ is considered a feasible solution. The vector
$\vec{f}(\vec{x})$ in (2.4) serves to map values of $X$ to a set of vectors that contains all
possible values of the objective functions. In order to obtain an acceptable solution to the
problem we must find separable minima for each objective function. Assuming the existence of
such minima, let
\[ \vec{x}^{o(i)} = [x_1^{o(i)}, x_2^{o(i)}, \ldots, x_n^{o(i)}]^{T} \tag{2.5} \]
be a vector which optimizes the ith objective function and lies in the set of feasible
solutions, such that
\[ \vec{x}^{o(i)} \in X \tag{2.6} \]
and
\[ f_i(\vec{x}^{o(i)}) = \min_{x \in X} f_i(\vec{x}). \tag{2.7} \]
The minimum value for the ith objective function is denoted as $f_i^{o}$, and the vector
\[ \vec{f}^{o} = [f_1^{o}, f_2^{o}, \ldots, f_k^{o}]^{T} \tag{2.8} \]
is ideal for this multi-criterion optimization problem. The ideal solution is a point in
$R^n$ that determines (2.8).
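As a small illustration of (2.5) through (2.8), the following sketch computes the per-objective minimizers and the ideal vector over a finite set of feasible candidates. The two objective functions, the single inequality constraint, and the candidate grid are hypothetical and were chosen only for illustration; they are not the radar objective functions used later in this thesis.

```python
# Hypothetical two-objective problem over a small, finite candidate set.
def f1(x):
    return x[0] ** 2 + x[1] ** 2

def f2(x):
    return (x[0] - 2) ** 2 + x[1] ** 2

def g(x):
    # Inequality constraint in the form g(x) >= 0, as in (2.2).
    return 3 - x[0] - x[1]

objectives = [f1, f2]

# Candidate decision vectors on an integer grid; X is the feasible subset.
candidates = [(a, b) for a in range(4) for b in range(4)]
feasible = [x for x in candidates if g(x) >= 0]

# One minimizer per objective, as in (2.5)-(2.7).
minimizers = [min(feasible, key=f) for f in objectives]

# Ideal vector of per-objective minima, as in (2.8).
ideal = [f(x) for f, x in zip(objectives, minimizers)]

print(minimizers)  # [(0, 0), (2, 0)]
print(ideal)       # [0, 0]
```

In this example the two minimizers differ, so no single feasible point attains the ideal vector. This is the situation in which a compromise between the objectives has to be sought.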
2.3 Evolutionary Computation
Evolutionary computation applies the biological evolutionary process to computer-based
problem solving and optimization, such as the MOP discussed in Section 2.2. Genetic
algorithms (GAs), a specific type of evolutionary algorithm (EA), maintain a population
of solutions to a given problem. These solutions survive from generation to generation
based on a 'survival of the fittest' scheme. They provide a means for concurrently
searching a vast solution space for the best solution. Closed-form analysis used in
traditional optimization methods (gradients, linearity, continuity) requires the function
being evaluated to have certain properties. These techniques are sufficient for single
missions, but fail to address the optimization of multi-mission criteria, which is required
in the radar domain [1]. The application of genetic algorithms to radar concepts allows the
requirements of each mission to be used for evaluation of the waveform suite. A visual
depiction of the flow of a general EA can be seen in Figure 2.2. The following pseudo-code
describes the flow of a basic EA [2]:
1. Initialize population
2. Create offspring through random variation
3. Evaluate fitness of each candidate solution
4. Apply selection
5. Check for termination; if not satisfied, loop back to step 2
Figure 2.2: Simple genetic algorithm flowchart [9]
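The five steps listed above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the algorithm used in this thesis: the function name `evolve`, its parameters, the bit-flip variation, the truncation selection, and the single-objective fitness (the number of 1 bits) are all hypothetical choices made to keep the example small.

```python
import random

def evolve(fitness, target, n_bits=16, pop_size=20, max_gens=200, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    # 1. Initialize population with random bit strings
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    generations = 0
    while generations < max_gens:
        generations += 1
        # 2. Create offspring through random variation (independent bit flips)
        offspring = [[bit ^ 1 if rng.random() < p_mut else bit for bit in parent]
                     for parent in population]
        # 3. Evaluate fitness of each candidate solution (parents and offspring)
        ranked = sorted(population + offspring, key=fitness, reverse=True)
        # 4. Apply selection: keep the fittest pop_size candidates
        population = ranked[:pop_size]
        # 5. Check for termination; otherwise loop back to step 2
        if fitness(population[0]) >= target:
            break
    return population[0], generations

# Hypothetical fitness: the number of 1 bits in the chromosome.
best, generations_used = evolve(fitness=sum, target=16)
print(sum(best), generations_used)
```

Because parents compete with their offspring in step 4, the best fitness in the population never decreases from one generation to the next in this sketch.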
Population members, or individuals, each represent a candidate solution to the problem.
An individual is comprised of a collection of genes, which together form a chromosome.
These chromosomes contain information which can be decoded into a set of parameters
and input into a function. Evolutionary Operators (EVOPs) are used to perform operations
on the populations in an effort to generate chromosomes with greater fitness [12]. Some
common EVOPs include mutation, recombination, and selection.
Chromosomes can be encoded in binary strings or real numbers. An example of a
binary encoded individual can be seen in Fig. 2.3. Bit-wise mutation can be performed on
binary encoded strings, where a bit is changed to its opposite value. Crossover is a form
of recombination which takes a segment from each of two parent chromosomes and
concatenates the segments.
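The two operators described above can be illustrated on short binary strings. The function names, the eight-bit parents, and the chosen crossover point and mutation position are hypothetical and serve only as an example; in practice these positions are typically drawn at random.

```python
def bit_flip(chromosome, position):
    """Return a copy of the chromosome with the bit at `position` inverted."""
    mutated = list(chromosome)
    mutated[position] ^= 1
    return mutated

def one_point_crossover(parent_a, parent_b, point):
    """Concatenate the head of parent_a (up to `point`) with the tail of parent_b."""
    return parent_a[:point] + parent_b[point:]

parent_a = [1, 1, 1, 1, 0, 0, 0, 0]
parent_b = [0, 1, 0, 1, 0, 1, 0, 1]

child = one_point_crossover(parent_a, parent_b, 3)
mutant = bit_flip(child, 0)

print(child)   # [1, 1, 1, 1, 0, 1, 0, 1]
print(mutant)  # [0, 1, 1, 1, 0, 1, 0, 1]
```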
Figure 2.3: Example binary encoded individual [9]
2.4 Pareto Optimality
When dealing with MOPs, the concept of finding a solution differs from that of
traditional optimization problems. With the introduction of multiple objective functions,
the intent shifts from finding one solution to finding an acceptable compromise, with
associated trade-offs. This form of optimality was first proposed by Francis Ysidro
Edgeworth in 1881 and was later adopted and generalized by Vilfredo Pareto in 1896. It is
most often referred to as the Pareto optimum.
Mathematically, a point $\vec{x}^{*} \in \Omega$ is considered Pareto optimal if, for every
$\vec{x} \in \Omega$ and $I = \{1, 2, \ldots, k\}$, either
\[ \forall_{i \in I} \, (f_i(\vec{x}) = f_i(\vec{x}^{*})) \tag{2.9} \]
or there is at least one $i \in I$ such that
\[ f_i(\vec{x}) > f_i(\vec{x}^{*}). \tag{2.10} \]
This definition states that the solution $\vec{x}^{*}$ is Pareto optimal if there exists no
feasible vector $\vec{x}$ which would decrease some criterion without causing a simultaneous
increase in at least one other criterion [6]. The entire decision space is considered in this
definition.
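Over a finite set of objective vectors, the Pareto optimality condition of this section can be checked with a pairwise dominance test. The sketch below assumes that every objective is minimized, consistent with (2.7); the function names and the sample points are hypothetical.

```python
def dominates(u, v):
    """True if u is no worse than v in every objective and strictly better in one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def non_dominated(points):
    """Keep the points that no other point in the set dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical objective vectors (f1, f2), both to be minimized.
points = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = non_dominated(points)

print(front)  # [(1, 5), (2, 3), (4, 1)]
```

Here (3, 4) is removed because (2, 3) is better in both objectives, and (5, 5) is removed because (1, 5) is better in the first objective and equal in the second.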
2.5 Previous Work
Many evolutionary algorithms have been developed to address the MOP, discussed in Sec-
tion 2.2. Some of these evolutionary algorithms will be discussed in the following sections.
These algorithms all use the concept of Pareto optimality to find the best solution to the
MOP. The main points of each algorithm will be outlined in Sections 2.5.1, 2.5.2, 2.5.3, and
2.5.4, and a performance comparison will be provided in Section 2.5.5.
2.5.1 SPEA
The Strength Pareto Evolutionary Algorithm (SPEA) was developed by Zitzler and Thiele [27]
as an evolutionary approach to the multi-objective optimization problem. It is based on a
combination of techniques, with the intention of finding multiple Pareto-optimal solutions
in parallel. Because the multi-mission radar waveform optimization problem requires such
solutions, this algorithm is well suited to it [3]. In SPEA, the non-dominated solutions
found in each successive generation are stored in an external, or archival, set of
population members. Scalar fitness is assigned to individuals based on Pareto optimality.
Clustering is used to reduce the number of non-dominated solutions while preserving the
characteristics of the trade-off front.
2.5.2 NSGA-II
The NSGA-II (Non-dominated Sorting Genetic Algorithm) [8] was developed by Deb,
Agrawal, Pratap, and Meyarivan in 2000 as another approach to evolutionary multi-objective
optimization, and is an improved version of the previously proposed NSGA. NSGA-II uses
a fast non-dominated sorting procedure, an elitist-preserving approach, and a parameterless
niching operator. The algorithm separates the solutions into fronts based on Pareto
dominance. The first front contains the non-dominated solutions, and each successive front
is increasingly dominated. Density estimation and a crowded-comparison operator
are used to preserve diversity and define an order among the individuals. This ranking is
the basis for environmental and mating selection. The resulting offspring are combined
with the parent population, and the combined pool is then truncated by removing the
weakest half of the individuals.
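The separation into fronts can be sketched by repeatedly peeling off the non-dominated points of the remaining set. This is a simple illustration of the idea, assuming minimization; it is not the fast bookkeeping procedure of NSGA-II [8], and the function names and sample points are hypothetical.

```python
def dominates(u, v):
    """True if u is no worse than v in every objective and strictly better in one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def sort_into_fronts(points):
    """Split points into successive fronts; front 0 is the non-dominated set."""
    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining if not any(dominates(q, p) for q in remaining)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

# Hypothetical objective vectors (f1, f2), both to be minimized.
points = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
fronts = sort_into_fronts(points)

print(fronts)  # [[(1, 5), (2, 3), (4, 1)], [(3, 4)], [(5, 5)]]
```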
2.5.3 PESA
The PESA (Pareto Envelope-based Selection Algorithm) [7] is another multi-objective
evolutionary algorithm, proposed by Corne, Knowles, and Oates in 2000. This approach uses
two populations: a small "internal population" and a larger "external population", or
archive. The archive contains a set of non-dominated solutions, while the internal
population stores new candidate solutions for inclusion in the archive. Crowding is
monitored through a hyper-grid division of phenotype space and is used as a means of
selection. Solutions not selected for the archive are purged at the end of each generation.
2.5.4 SPEA2
The Strength Pareto Evolutionary Algorithm 2 (SPEA2) was developed by Zitzler, Laumanns, and
Thiele [26] as an improved version of the SPEA algorithm. Some of the weaknesses of
SPEA are addressed in this new algorithm. A more fine-grained fitness approach is em-
ployed, which considers a single individual and each individual it dominates or is domi-
nated by. A density estimation technique based on a principle of nearest neighbor is used
to guide the search process. Also, an augmented archive truncation method is incorporated
to preserve the boundary solutions. This approach was applied to the multi-mission radar
waveform optimization problem by Amuso and Enslin in [2, 9].
2.5.5 Comparisons
The behavior of SPEA2 was compared to that of SPEA, NSGA-II and PESA for a number
of test functions in [26]. SPEA2 exhibited a significant improvement over its previous
version, SPEA. SPEA2 and NSGA-II both showed good performance results, but SPEA2
showed superior results as the dimension of the objective space increased. Here,
performance refers to the quality of the solutions found for the MOP. PESA converged quickly on
the test problems but could not preserve the boundary solutions as well as NSGA-II and
SPEA2, whose diversity is maintained throughout the later generations.
In terms of bloat, SPEA2 was compared in [5] to four other techniques for reducing bloat:
Standard GP with tree depth limitation, Constant Parsimony Pressure,
Adaptive Parsimony Pressure, and a ranking method (Two Stage) where functionality is
optimized first and program size afterwards. SPEA2 was found to keep the tree size lower
than any of the other methods, and to find more compact solutions faster. Due
to its reduction of bloat and favorable performance in higher dimension problems, SPEA2
is a good choice for the multi-mission radar waveform optimization problem, which is
characterized by the need to optimize many multi-mission waveform objective functions.
Chapter 3
The Strength Pareto Evolutionary Algorithm 2 (SPEA2)
As discussed in Section 2.5.5, SPEA2 is a good choice for an evolutionary computation
approach to solving the MOP. In [2, 9], Amuso and Enslin applied SPEA2 to multi-mission radar
waveform optimization, which is an instance of the MOP. The following sections
will describe in more detail the inner workings of SPEA2, including an overview and flow
description in Section 3.1, as well as a description of the fitness evaluation in Section 3.2
and the environmental selection in Section 3.3.
3.1 Overview
Some of the highlights of SPEA2 were presented in Section 2.5.4. SPEA2 employs a more
fine-grained fitness approach than its predecessor SPEA, thus resulting in better solutions.
The intent of SPEA2 is to find the best solution in a Pareto sense. Density calculations are
used to preserve diversity and the Pareto front. A flow chart for SPEA2 can be seen in Figure
3.1. The flow of the algorithm is as follows:
Step 1) Generate an initial population $P_0$ of size $N$. An empty archive set $\bar{P}_0$ of size $\bar{N}$ is
also created. This set serves as the external set of non-dominated solutions, as well as
some dominated solutions, in the event that the number of non-dominated solutions
is less than the specified $\bar{N}$.
Step 2) Fitness values are assigned to each individual in the population and the archive. This fitness
evaluation takes into account all individuals and the individuals they dominate or are
dominated by.
Step 3) Next is the environmental selection stage, which consists of copying all non-dominated
individuals in $P_t$ and $\bar{P}_t$ to $\bar{P}_{t+1}$, where t represents the current generation. If the size
of the updated archive set $\bar{P}_{t+1}$ exceeds the chosen constant size $\bar{N}$, then a truncation
operator is used to decrease the size of $\bar{P}_{t+1}$. If it is smaller than $\bar{N}$, the archive is
filled with dominated individuals, as described in Section 3.3. If the size of $\bar{P}_{t+1}$ is equal
to $\bar{N}$, then the operation is complete.
Step 4) If the termination condition $t \ge T$, or some other chosen condition, is met,
set A to the set of non-dominated individuals in $\bar{P}_{t+1}$ and stop execution of the
algorithm.
Step 5) If the termination condition is not met, a mating selection procedure is
carried out. The mating pool is filled by means of a binary tournament with replacement
on $\bar{P}_{t+1}$.
Step 6) Recombination and mutation operators are applied to the mating pool. The resulting
population becomes the population set $P_{t+1}$. The generation counter t is
incremented and the algorithm returns to Step 2.
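The mating selection of Step 5 can be sketched in C as follows. This is a minimal illustration and not the implementation used in [2, 9]: the function names tournament_winner and fill_mating_pool are hypothetical, and the sketch assumes that a lower fitness value is better, as described in Section 3.2.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns the index of the better of two archive members.
   Assumption: a lower SPEA2 fitness F(i) is better. */
static int tournament_winner(const double fitness[], int a, int b)
{
    return (fitness[a] <= fitness[b]) ? a : b;
}

/* Fills pool[] with archive indexes chosen by binary tournament.
   Selection is with replacement: the same member may be drawn
   more than once, and both candidates may even be the same member. */
static void fill_mating_pool(const double fitness[], int archive_size,
                             int pool[], int pool_size)
{
    assert(fitness != NULL && pool != NULL && archive_size > 0);
    for (int p = 0; p < pool_size; p++) {
        int a = rand() % archive_size;   /* first random candidate  */
        int b = rand() % archive_size;   /* second random candidate */
        pool[p] = tournament_winner(fitness, a, b);
    }
}
```

Each entry of pool is an index into the archive array, so recombination in Step 6 can read the parents directly from the archive.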
Amuso and Enslin [2, 9] implemented this algorithm in Matlab and verified its performance
using a three-knapsack, one-hundred-item multidimensional multi-objective knapsack
problem. The algorithm was also applied to a dual objective multi-mission radar waveform
design problem. Results from this application were promising, demonstrating improvement
in fitness for both objective functions used. Computing constraints limited the population
size and maximum generations utilized in this simulation.
Figure 3.1: SPEA2 Flow [9]
3.2 Fitness Evaluation
Fitness values are assigned to each individual in the population. Contrary to the fitness
scheme in SPEA, fitness assignment in SPEA2 is more fine-grained, taking into account
all individuals and the individuals they dominate or are dominated by. Density is also
incorporated into the fitness calculation in SPEA2. Clustering is no longer used to prune
the archive set. Instead, a truncation method is used to preserve the boundary solutions.
Each individual i in the population $P_t$ and the archive $\bar{P}_t$ is assigned a strength value,
$$ S(i) = |\{\, j \mid j \in P_t + \bar{P}_t \wedge i \succ j \,\}| \tag{3.1} $$
where + represents a multi-set union and $\succ$ indicates the Pareto dominance
relation. The strength values are used to calculate the raw fitness,
$$ R(i) = \sum_{j \in P_t + \bar{P}_t,\; j \succ i} S(j) \tag{3.2} $$
This raw fitness is determined by the dominators in both the population and archive
sets. This is a notable difference from SPEA which only considered the archive set. A
low raw fitness is desirable, such that R(i) = 0 corresponds to the individual i being non-
dominated. A high value for raw fitness indicates that the individual is dominated by many
others.
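Equations 3.1 and 3.2 can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the code of [2, 9]: the names dominates and strength_and_raw_fitness are hypothetical, two objectives are assumed, and every objective is assumed to be maximized.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OBJ 2   /* assumed number of objectives */

/* Returns 1 if a Pareto-dominates b, assuming all objectives are
   maximized: a is no worse in every objective and strictly better
   in at least one. */
static int dominates(const double a[NUM_OBJ], const double b[NUM_OBJ])
{
    int strictly_better = 0;
    for (int k = 0; k < NUM_OBJ; k++) {
        if (a[k] < b[k]) return 0;
        if (a[k] > b[k]) strictly_better = 1;
    }
    return strictly_better;
}

/* obj holds the m individuals of the combined population and archive.
   strength[i] receives S(i) (Eq. 3.1), raw[i] receives R(i) (Eq. 3.2). */
static void strength_and_raw_fitness(const double obj[][NUM_OBJ], size_t m,
                                     int strength[], int raw[])
{
    assert(obj != NULL && strength != NULL && raw != NULL);
    /* S(i): number of individuals that i dominates */
    for (size_t i = 0; i < m; i++) {
        strength[i] = 0;
        for (size_t j = 0; j < m; j++)
            if (dominates(obj[i], obj[j])) strength[i]++;
    }
    /* R(i): sum of the strengths of all dominators of i */
    for (size_t i = 0; i < m; i++) {
        raw[i] = 0;
        for (size_t j = 0; j < m; j++)
            if (dominates(obj[j], obj[i])) raw[i] += strength[j];
    }
}
```

Both loops visit every pair of individuals once, which matches the $O(M^2)$ cost quoted for S(i) and R(i) later in this section.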
A density estimation technique is used to distinguish between individuals of the same
raw fitness value. The method used is a derivative of the kth nearest neighbor technique, where
the density estimate is an inverse function of the distance to the kth nearest neighbor. For each
individual i, the distances to all other individuals in the population and archive are stored
in a list and sorted in increasing order.
The kth element in the list gives the distance sought, $\sigma_i^k$. Here k is the square root
of the total number of individuals, $k = \sqrt{N + \bar{N}}$. The density D(i) is then
calculated for each individual i using the equation
$$ D(i) = \frac{1}{\sigma_i^k + 2}. \tag{3.3} $$
Finally, the fitness F(i) is calculated by adding the density estimation to the raw fitness
such that
$$ F(i) = R(i) + D(i). \tag{3.4} $$
In terms of run-time computation, density estimation dominates the fitness assignment routine
with a complexity of $O(M^2 \log M)$, while calculations of S(i) and R(i) are of order
$O(M^2)$, where $M = N + \bar{N}$.
3.3 Environmental Selection
The environmental selection stage consists of copying all non-dominated individuals in $P_t$
and $\bar{P}_t$ to $\bar{P}_{t+1}$, where t represents the current generation. If the size of the updated archive
set $\bar{P}_{t+1}$ exceeds the chosen constant size $\bar{N}$, then a truncation operator is used to decrease
the size of $\bar{P}_{t+1}$. If the size of $\bar{P}_{t+1}$ is equal to $\bar{N}$, then the operation is complete. Otherwise,
the size of $\bar{P}_{t+1}$ is either larger or smaller than $\bar{N}$.
If the set is smaller than $\bar{N}$, then the best $\bar{N} - |\bar{P}_{t+1}|$ dominated individuals in the
previous archive and population are copied to the new archive. If the set exceeds $\bar{N}$, then an archive
truncation method is used to remove individuals from $\bar{P}_{t+1}$ until $|\bar{P}_{t+1}| = \bar{N}$. At each
iteration, the individual i is chosen for removal for which $i \le_d j$ for all $j \in \bar{P}_{t+1}$, with
$$ i \le_d j \;:\Leftrightarrow\; \forall\, 0 < k < |\bar{P}_{t+1}| : \sigma_i^k = \sigma_j^k \;\vee\; \exists\, 0 < k < |\bar{P}_{t+1}| : \left[ \left( \forall\, 0 < l < k : \sigma_i^l = \sigma_j^l \right) \wedge \sigma_i^k < \sigma_j^k \right] \tag{3.5} $$
where $\sigma_i^k$ is the distance of individual i to its kth nearest neighbor in $\bar{P}_{t+1}$. At each stage, the
individual with the smallest distance to its nearest neighbor is removed; ties are broken by
comparing the distances to the second nearest neighbor, and so on. An illustration of the truncation
scheme can be seen in Figure 3.2.
The worst case run-time complexity for this truncation method is $O(M^3)$. Most often
the complexity will be $O(M^2 \log M)$. This depends on the distances of the second and
third neighbors, as well as how they are sorted in the list.
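The truncation rule of Equation 3.5 can be sketched in C as follows. This is a simplified illustration and not the implementation of [2, 9]: the names sorted_sq_distances and truncate_archive and the bound MAX_ARCHIVE are hypothetical. Squared Euclidean distances are compared, which gives the same ordering as the distances themselves, and the sorted lists are recomputed before every removal for clarity rather than speed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_OBJ     2    /* assumed number of objectives */
#define MAX_ARCHIVE 64   /* assumed upper bound for this sketch */

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Writes the n-1 squared distances from member i to all other
   members into out[], sorted in increasing order. */
static void sorted_sq_distances(double pts[][NUM_OBJ], int n, int i,
                                double out[])
{
    int c = 0;
    for (int j = 0; j < n; j++) {
        if (j == i) continue;
        double s = 0.0;
        for (int k = 0; k < NUM_OBJ; k++) {
            double d = pts[i][k] - pts[j][k];
            s += d * d;
        }
        out[c++] = s;
    }
    qsort(out, (size_t)c, sizeof out[0], cmp_double);
}

/* Removes members from pts in place until only target remain.
   Returns the new archive size. */
static int truncate_archive(double pts[][NUM_OBJ], int n, int target)
{
    assert(pts != NULL && target >= 1 && n <= MAX_ARCHIVE);
    double best[MAX_ARCHIVE], cur[MAX_ARCHIVE];
    while (n > target) {
        int victim = 0;
        sorted_sq_distances(pts, n, 0, best);
        for (int i = 1; i < n; i++) {
            sorted_sq_distances(pts, n, i, cur);
            /* lexicographic comparison of the sorted distance lists */
            for (int k = 0; k < n - 1; k++) {
                if (cur[k] < best[k]) {
                    victim = i;
                    memcpy(best, cur, (size_t)(n - 1) * sizeof best[0]);
                    break;
                }
                if (cur[k] > best[k]) break;
            }
        }
        /* delete the victim by shifting the remaining members down */
        for (int j = victim; j < n - 1; j++)
            memcpy(pts[j], pts[j + 1], sizeof pts[j]);
        n--;
    }
    return n;
}
```

For five points on a line at 0, 1, 1.25, 5 and 10, truncating to three members removes the two crowded interior points and keeps both boundary points, which is the behavior the truncation method is intended to provide.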
Figure 3.2: Illustration of SPEA2 archive truncation scheme [9]. N = 4 and numbers indicate
elimination order.
A description of the physical radar mission is provided in Chapter 4, along with an
explanation of how SPEA2 is applied to optimize the radar mission's objectives.
Chapter 4
SPEA2 Applied to Multi-Mission Radar
Waveform Design
An in-depth description of the algorithmic flow and evolutionary concepts of SPEA2 was
provided in Chapter 3. The following sections will describe the radar scenario developed by
Amuso and Enslin [2, 9], as well as give insight into how SPEA2 is applied to the scenario.
The waveform suite and objective functions will be discussed. A software implementation
of SPEA2 applied to the radar scenario will also be presented.
4.1 Waveform Suite
Existing multi-mode radar systems lack the ability to process multiple missions simultaneously
with the use of a single waveform, as discussed in Section 1.1. The various radar parameters
crucial to MTI and SAR missions were mentioned in Section 2.1. A single waveform, referred
to as a waveform suite, is defined to combine different radar parameters to accomplish multiple
missions simultaneously [9].
The various radar parameters which could be added to a waveform are:
• Number of Pulses per CPI
• Pulse Repetition Frequency (PRF)
• Center Frequency (fc)
• Bandwidth
• Azimuth Beam Steering Angle
• Elevation Beam Steering Angle
The waveform suite can consist of all or a subset of these parameters, for each CPI. The
parameters are binary encoded for the purpose of the genetic algorithm used to evolve them.
Each CPI has an associated binary string which represents the parameter value for that point
in time. This binary string is referred to as a sub-waveform. These sub-waveforms (one for
each corresponding CPI) are concatenated to make an entire waveform suite. An example
binary encoded waveform is shown in Figure 4.1. The waveform suite depicted in this example is
comprised of five-bit binary sub-waveforms. The bits in each sub-waveform can represent
one or more of the encoded radar parameters, concatenated together.
Figure 4.1: Binary Encoded Waveform Suite [9]
4.2 Objective Functions
Four objective functions were developed and combined to form a multi-objective function
to optimize the multi-mission simultaneous orthogonal waveform suite [3]. These objec-
tive functions include revisit time and pulse integration for MTI missions, as well as peak
sidelobe level (PSL) and integrated sidelobe level (ISL) for SAR missions.
4.2.1 Revisit Time MTI Objective Function
One fitness measure for MTI missions, with its associated objective function, is revisit time
for a two-dimensional range cell. Mission requirements determine a maximum revisit time
of Y in seconds. A revisit time X is chosen as a compromise between MTI and strip-map or
spotlight SAR missions. Revisit times less than X will benefit MTI missions which require
shorter times, while revisit times greater than X can be detrimental to target tracking and
discrimination requirements. In addition, revisit times less than the maximum Y seconds
will reduce the amount of time distributed to stripmap and SAR missions. The average
revisit time for the area of interest is used to formulate the Revisit Time MTI Objective
Function. The function is displayed in Figure 4.2. In this figure, revisit time X is 10
seconds and revisit time Y is 30 seconds.
[Figure 4.2 plot: Fitness versus Revisit Time (s)]
Figure 4.2: MTI Revisit Time objective function. [9]
4.2.2 Pulse Integration MTI Objective Function
Another MTI objective function involves pulse integration. The coherent pulses of a two-
dimensional range cell over a surveillance region are totaled and integrated over a coherent
time interval. This value is useful in determining the probability of detection. These val-
ues are averaged and weighted by the variance of each cross range bin. These weighted
averages are then applied to the objective function in Figure 4.3.
[Figure 4.3 plot: Fitness versus Number of Pulses]
Figure 4.3: MTI Pulse Integration objective function [9]
4.2.3 Peak Sidelobe Level SAR Objective Function
The SAR waveform's cross range resolution is a function of the sidelobe levels of the antenna
pattern. The cross range resolution fitness is quantified by applying the Peak Sidelobe
Level (PSL) to the PSL objective function in Figure 4.4. The first step in
calculating the PSL is to determine the antenna pattern of the effective aperture. This is
accomplished using the equation:
$$ F(\theta) = \sum_{n=1}^{N_2} \sum_{m=0}^{N_{1n}-1} w_{nm}\, e^{\, j \frac{4\pi (x_n + m V_R T)}{\lambda} \sin\theta}. \tag{4.1} $$
The next step is to calculate the PSL using the equation:
$$ \mathrm{PSL} = \max\left[ F(\theta)_{\mathrm{dB}} > 3\ \mathrm{dB\ beamwidth\ of\ AF} \right] - \max\left[ F(\theta)_{\mathrm{dB}} \right] \tag{4.2} $$
The resulting PSL fitness value is found by applying the calculated PSL to the objective
function in Figure 4.4. An average of PSL fitness values across a sliding window of
effective SAR aperture lengths is taken to determine the total PSL fitness value.
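A literal reading of Equation 4.2 can be sketched in C as follows. This is only a sketch under assumptions: the function name peak_sidelobe_level_db is hypothetical, the antenna pattern is assumed to be available as uniformly sampled values in dB, and the 3 dB main lobe is located by walking outward from the peak sample.

```c
#include <assert.h>
#include <stddef.h>

/* pattern_db holds n samples of the antenna pattern in dB.
   Returns the highest sample outside the 3 dB main lobe minus the
   peak value, so the result is zero or negative. Returns 0.0 if no
   sample lies outside the main lobe. */
static double peak_sidelobe_level_db(const double pattern_db[], int n)
{
    assert(pattern_db != NULL && n > 0);

    /* locate the main lobe peak */
    int peak = 0;
    for (int i = 1; i < n; i++)
        if (pattern_db[i] > pattern_db[peak]) peak = i;

    /* walk outward to the edges of the 3 dB beamwidth */
    int lo = peak, hi = peak;
    while (lo > 0 && pattern_db[lo - 1] >= pattern_db[peak] - 3.0) lo--;
    while (hi < n - 1 && pattern_db[hi + 1] >= pattern_db[peak] - 3.0) hi++;

    /* highest sample outside the 3 dB beamwidth */
    int found = 0;
    double side = 0.0;
    for (int i = 0; i < n; i++) {
        if (i >= lo && i <= hi) continue;
        if (!found || pattern_db[i] > side) {
            side = pattern_db[i];
            found = 1;
        }
    }
    return found ? side - pattern_db[peak] : 0.0;
}
```

The absolute value of the returned number is what would be looked up on the horizontal axis of the PSL objective function.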
[Figure 4.4 plot: Fitness versus Absolute Value of PSL (dB)]
Figure 4.4: Peak Sidelobe Level (PSL) SAR objective function [9]
4.2.4 Integrated Sidelobe Level SAR Objective Function
The Integrated Sidelobe Level (ISL) is another fitness measure which can be calculated by
using the antenna pattern found using Equation 4.1. The energy in the sidelobes is calculated by
removing the energy from the main lobe, and summing the resulting energy. The ISL for
an effective aperture length is then applied to the function in Figure 4.5 to determine the
ISL fitness measure. The average of a sliding window of aperture lengths results in the total
ISL fitness for a single waveform.
An image space representation of a point source target, as well as the antenna pattern
of the synthetic aperture, are required to determine the PSL and ISL. This adds some extra
steps in the process of calculating these fitness values. Likewise, these steps also add
significant computational complexity. The steps in this process are as follows:
[Figure 4.5 plot: Fitness versus Absolute Value of ISL (dB)]
Figure 4.5: Integrated Sidelobe Level (ISL) SAR objective function [9]
Step 1) Begin by placing a point source target in the center of a two-dimensional image.
A video phase history (VPH) is created by taking a two-dimensional Fast Fourier
Transform (FFT) of this image.
Step 2) A VPH mask is found for each waveform suite. Using the pulse position and
corresponding frequency, a 1 is inserted in the proper position in the VPH mask
to indicate that a target in that position is illuminated. A zero is placed in pulse
positions with no illumination. The VPH mask and 2D VPH found in Step 1 are
complex multiplied together, point by point.
Step 3) A 2D inverse FFT is performed on the masked VPH resulting from Step 2. The result
is once again in the image space domain.
Step 4) The results from Step 3 are used to calculate the PSL and ISL, and their associated
objective functions, as described previously.
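The masking operation of Step 2 can be sketched in C as follows. The sketch makes several assumptions: the names build_full_band_mask and apply_vph_mask are hypothetical, the VPH is stored with frequency along rows and pulse position along columns, every illuminated pulse covers the full bandwidth, and the array sizes are kept very small for illustration. The FFT and inverse FFT of Steps 1 and 3 are not shown.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

#define N_FREQ  4   /* frequency bins (rows), small assumed value     */
#define N_PULSE 4   /* pulse positions (columns), small assumed value */

/* Builds a 0/1 mask: an illuminated pulse position gets a whole
   column of ones, all other columns are set to zero. */
static void build_full_band_mask(unsigned char mask[N_FREQ][N_PULSE],
                                 const int illuminated[N_PULSE])
{
    assert(mask != NULL && illuminated != NULL);
    for (int f = 0; f < N_FREQ; f++)
        for (int p = 0; p < N_PULSE; p++)
            mask[f][p] = illuminated[p] ? 1 : 0;
}

/* Multiplies the complex VPH by the mask, point by point. */
static void apply_vph_mask(double complex vph[N_FREQ][N_PULSE],
                           unsigned char mask[N_FREQ][N_PULSE])
{
    assert(vph != NULL && mask != NULL);
    for (int f = 0; f < N_FREQ; f++)
        for (int p = 0; p < N_PULSE; p++)
            vph[f][p] *= (double)mask[f][p];
}
```

After masking, only the columns of the VPH that correspond to illuminated pulse positions remain non-zero, and the inverse FFT of Step 3 would be applied to this result.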
4.3 Scaled Simulation Scenario: SAR Mission
The scenario chosen as a focus for this thesis was a scaled down SAR scenario. The sim-
ulation time is 0.346 seconds and the region of interest (ROI) consists of 10 x 15 (150
total) range cells. A true simultaneous multi-mission simulation would address MTI and
SAR scenarios in the same simulation. Due to computational constraints, a SAR scenario
was chosen as the starting point for software/hardware implementation and optimization.
Future work would integrate the MTI mission, after a SAR scenario hardware/software
optimization was verified.
The SAR simulation scenario was generated to accomplish a SAR mission using peak
sidelobe level (PSL) and integrated side lobe level (ISL) as objectives [9]. PSL and ISL
provide a means to measure the quality of a SAR image. The scaled scenario is illustrated
in Figure 4.6.
Figure 4.6: SAR Mission [9]
The scenario discussed here was first created and simulated in [9]. A beamwidth of
$\theta = 5^\circ$ was chosen. This dictated the length of the ROI (l) by the equation
$$ l = 2d \tan\left(\frac{\theta}{2}\right) \tag{4.3} $$
where d is the distance to the ROI. A distance of 500 meters was chosen for d in an effort
to keep the ROI size to a minimum, due to the requirement for a 2D IFFT operation to be
performed on each range cell, and its associated computational complexities. This results
in a value of roughly 45 meters for l. A down-range dimension of 30 meters was chosen
for the rectangular ROI.
The dimensions of the range cells are directly related to the SAR system’s resolution. In
order to produce an integer number of computationally manageable range cells, a resolution
of 3 meters (down-range and cross-range) was chosen. Cross-range resolution is a function
of radar aperture size according to the equation
$$ \delta_{cr} = \frac{D}{2} \tag{4.4} $$
where D is the size of the antenna. The down-range resolution can be approximated by the
equation [14]
$$ \delta_{dr} = \frac{c}{2B} \tag{4.5} $$
where c is the wave propagation speed and B is the bandwidth of the radar signal itself.
Thus, a bandwidth of 50 MHz was chosen to satisfy a down-range resolution of 3 m.
In this particular scenario, the airplane was flown at a speed of 130 m/s for 45 m, which
corresponds to a run time of 0.346 s. A maximum IFFT
size of 256 x 256 was declared, due to the immense computational time needed to perform
such an operation. A PRF of 369 Hz was chosen in order to produce 128 total pulses for
the simulation, which is half the IFFT size. Keeping the number of pulses at half the size
of the IFFT improves the IFFT resolution and, in turn, yields more accurate PSL and ISL
fitness values.
4.4 Full Bandwidth, Single Aperture Experiment
The actual experiment investigated and performed in this thesis was a full bandwidth, single
aperture experiment [9], using the scaled SAR scenario described in Section 4.3. The radar
parameter being evolved in this experiment was the azimuth beam steering angle of a single radar
aperture. This angle was varied from -31 to 31 degrees, in increments of 2 degrees. These
angles were encoded using 5 bits per CPI, with only one pulse per CPI. The resulting waveform
suite for each individual was 640 bits in length (128 CPIs x 5 bits). The waveform
suite is visually represented in Figure 4.7.
Figure 4.7: Waveform Suite for Specific Mission [9]
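The decoding of one sub-waveform into a steering angle can be sketched in C as follows. The mapping is an assumption made for illustration: the function name decode_azimuth_deg is hypothetical, the five bits are read most significant bit first, and code 0 is taken to mean -31 degrees, so that the 32 codes cover -31 to 31 degrees in steps of 2.

```c
#include <assert.h>
#include <stddef.h>

#define BITS_PER_CPI 5     /* bits in one sub-waveform         */
#define NUM_CPI      128   /* sub-waveforms in one suite       */

/* Returns the azimuth steering angle in degrees for the given CPI.
   chromosome stores one bit per char. Only the lowest bit of each
   char is used, so numeric 0/1 and ASCII '0'/'1' both work. */
static int decode_azimuth_deg(const char *chromosome, int cpi)
{
    assert(chromosome != NULL && cpi >= 0 && cpi < NUM_CPI);
    int code = 0;
    for (int b = 0; b < BITS_PER_CPI; b++)
        code = (code << 1) | (chromosome[cpi * BITS_PER_CPI + b] & 1);
    return -31 + 2 * code;   /* assumed mapping: 0 -> -31, 31 -> +31 */
}
```

Under this assumed mapping, the all-zero sub-waveform steers to -31 degrees and the all-one sub-waveform steers to 31 degrees.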
As the radar traversed, the VPH for each range cell in the ROI was filled. If a particular
cell was spotted or illuminated by the radar during a CPI, its VPH would be filled with a
whole column of ones for the corresponding pulse position. If this were a multiple band-
width experiment, only a sub-band of ones would be filled for that pulse position in the
VPH. A two-dimensional IFFT was then performed on each range cell’s VPH. The result
of this operation was used to determine the PSL and ISL values for each cell. These val-
ues were then applied to the objective functions displayed in Figures 4.4 and 4.5, which
provided a fitness value for the PSL and ISL objectives. The chart in Figure 4.8 shows the
simulation parameters used in this experiment.
Figure 4.8: Simulation Parameters for Full Bandwidth, Single Aperture Experiment [9]
Note: Simulation was only run with population size 50 and archive size 10
4.5 Software Implementation
The full bandwidth, single aperture SAR simulation scenario discussed in Section 4.4 was applied
to SPEA2 and coded in C using Microsoft Visual Studio 2005. This scenario
was first simulated in Matlab by Amuso and Enslin [2, 9]. The flow of the algorithm was
discussed in Section 3.1. A single member or individual is represented programmatically using a C
struct, as shown in Listing 4.1.
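Listing 4.1: Member struct
    //Structure to hold a member
    struct member
    {
        char chromosome[waveformLen];  //Waveform bits, one char per bit
        float fitness[numFitVals];     //Objective fitness values (PSL, ISL)
        int rank;                      //Dominance-based rank
        double fitnessRank;            //Dominators' ranks plus density
    };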
The chromosome field is the waveform for the current member. In the case of this
experiment, 128 CPIs x 5 bits = 640 waveform bits were needed. One char was used
to represent each bit in the waveform, therefore 640 chars, or 640 bytes, were needed.
The second field is fitness, an array of floats holding the objective fitness values
in the scenario; in this case there are 2 (PSL and ISL). The next field is rank, which is
the total number of members that the current member dominates, i.e. the strength S(i)
of Equation 3.1. An integer was used to
represent rank. The last field is fitnessRank, which is the sum of the ranks of all of the
current member's dominators, plus a density calculation. A double was chosen to represent
this value because it could potentially be very large and fractional. The C code for the main
loop of the algorithm can be seen in Listing 4.2.
Listing 4.2: Main loop
    int main( int argc, char *argv[] ){
        int i, gen;
        init();                         //Init population, read in values
        for(gen = 0; gen < maxGenerations; gen++){
            ranking_SPEA2(0);           //Rank for arch and pop
            genFitnessRank(0);          //Fitness rank for arch and pop
            placeNonDomSolutions();     //Fill archive
            ranking_SPEA2(1);           //Rank new archive
            genFitnessRank(1);          //Fitness rank for new archive
            binaryTournament();         //Choose mating pairs
            for(i=0; i < numPairs; i++){ //numPairs = 25
                uniformCrossover(i);    //Mating function
                binaryMutation(i);      //Mutate the offspring
            }
            for(i=0; i < popSize; i++){
                evaluation(i);          //Evaluate fitness
            }
        }
        cleanUp();
        return 0;
    }
The application first initializes various variables and then reads the initial population's
chromosome data and fitness values from a file. It then enters the main loop of the algorithm.
It first assigns a rank to each member in the entire general population and archive
population. This involves analyzing each member and determining how many other members
it dominates, based on fitness. The next step involves assigning a fitness rank value
to each member in the general and archive population. The fitness rank of each member
is based upon the ranks of all of the current member's dominators
as well as a density calculation.
Next, the parent archive is filled with non-dominated members in placeNonDomSolutions().
This is referred to as environmental selection. In this experiment the archive size is 10. If
there are exactly 10 non-dominated members in the general population and archive, those
members are added to the archive and the function exits. If there are more than 10 non-
dominated members, all non-dominated members are added to the archive, and a truncation
method removes members until the archive is reduced to 10. This truncation method is
based on nearest neighbor and was discussed in Section 3.3.
The next step in the algorithm is to assign a rank and fitness rank to the newly chosen
parent archive, in ranking_SPEA2(1) and genFitnessRank(1), respectively. The
arguments to these functions indicate whether to analyze both the general and archive
populations or the archive population only. A binary tournament is then executed using the
parent archive, in binaryTournament(). This function randomly generates mating pairs,
which correspond to pairs of indexes into the parent archive array. The algorithm then loops
through each mating pair and performs a uniform crossover (uniformCrossover(i)) and
a binary mutation (binaryMutation(i)) on them. These operations create the offspring
members which are copied into the child population.
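The two variation operators can be sketched in C as follows. This is a generic illustration and not the code behind uniformCrossover(i) and binaryMutation(i): the function names, the explicit parent and child arguments, and the mutation rate parameter are all assumptions, and each bit is assumed to be stored in one char.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define WAVEFORM_LEN 640   /* 128 CPIs x 5 bits, one char per bit */

/* Uniform crossover: for every bit position, one child receives the
   bit of the first parent and the other child the bit of the second
   parent, with the direction chosen at random. */
static void uniform_crossover(const char *p1, const char *p2,
                              char *c1, char *c2, int len)
{
    assert(p1 != NULL && p2 != NULL && c1 != NULL && c2 != NULL);
    for (int i = 0; i < len; i++) {
        if (rand() & 1) { c1[i] = p1[i]; c2[i] = p2[i]; }
        else            { c1[i] = p2[i]; c2[i] = p1[i]; }
    }
}

/* Binary mutation: flips each bit independently with probability
   rate, where rate lies between 0 and 1. */
static void binary_mutation(char *c, int len, double rate)
{
    assert(c != NULL && rate >= 0.0 && rate <= 1.0);
    for (int i = 0; i < len; i++) {
        double u = (double)rand() / ((double)RAND_MAX + 1.0); /* in [0,1) */
        if (u < rate)
            c[i] ^= 1;   /* toggle the lowest bit */
    }
}
```

Each mating pair yields two children in this sketch, so 25 pairs would fill a child population of 50 members.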
The next and most computationally complex step is the fitness evaluation of the newly
generated child population. Each member in the child population is evaluated, which in this
case corresponds to 50 evaluations. Each evaluation must determine the fitness of 150 range
cells per member, which actually corresponds to 50 x 150 = 7500 fitness evaluations per
generation. The evaluation function will be described in more detail in Section 6.1. After
performing a fitness evaluation of each member, the algorithm loops back to the beginning
of the for loop and continues, until 500 generations have been executed.
4.6 Results
The scatter plot in Figure 4.9 shows the results from the C code software implementation of
the 50 population member and 10 archive member scenario on a PC. The initial and final
archive population after 500 generations are depicted, exhibiting a substantial increase in
the fitness of both objective functions.
Figure 4.9: Initial population and final archive population for software implementation on
PC
The results of the simulation performed in [9] for the same scenario are shown in Figure 4.10.
The results from the C code software implementation in this thesis are similar to those
from the previous simulation. In both implementations the resulting fitness values for the
evolved archive members are in the upper right hand quadrant of the fitness graph, signifying
notable evolution of fitness values. The values are not exactly the same because
different random number generators were used in each implementation, so the resulting
fitness values are not expected to match exactly.
The initial population and corresponding fitness values from the Matlab simulation
[2, 9] were used to verify the fitness evaluation from this C code implementation. An
evaluation was performed on the initial population in this C implementation, and the re-
sulting fitness values were compared with those from the Matlab simulation. The fitness
values were not exactly the same, due to slightly different evaluation implementations and
floating point accuracy, but they were within a reasonable margin of difference. The C
implementation calculation of the initial population's fitness was then used as the initial fitness
for the initial population. Thus, fitness values became relative to the initial population's
fitness in the C implementation, and proper evolution of the fitness values was realized.
[Figure 4.10 plot: Integrated Side Lobe Level Fitness versus Peak Side Lobe Level Fitness; legend: Initial Population, Evolved Archive Population]
Figure 4.10: Initial population and final archive population for MATLAB implementation
in [9]
Chapter 5
Xilinx Virtex-4 FX Hybrid FPGA
A hybrid-FPGA is a device that contains one or more processor cores inside a sea of re-configurable
logic resources [4]. This type of platform offers the flexibility of running
software applications on embedded processors, as well as the computing power of re-configurable
hardware fabric, tightly coupled with the embedded processors. Coarse-grained
data parallelism can be handled by the embedded processors, while re-configurable hardware
can be used to exploit more fine-grained data parallelism.
The Xilinx Virtex-4 FX (V4FX) FPGA qualifies as a hybrid-FPGA, as it contains re-
configurable hardware fabric as well as up to two embedded PowerPC 405 cores. An
Auxiliary Processor Unit (APU) controller inside each of the PowerPC 405 cores allows
for a high speed, low latency interface between the processor and custom hardware co-
processors. The following sections will give an overview of the V4FX hybrid-FPGA in
Section 5.1 as well as provide details about the embedded PowerPC 405 core in Section
5.2, the APU in Section 5.3, and the Xilinx ML410 development board in Section 5.4,
which is the target platform for this thesis. The base SPEA2 system on the V4FX will be
discussed in Section 5.5.
5.1 Virtex-4 FX Hybrid-FPGA Overview
The Virtex-4 Family from Xilinx greatly enhances programmable logic design capabilities,
making it a powerful alternative to ASIC technology in some applications [22]. Xilinx
offers three platforms derived from the Virtex-4 Family: LX, FX, and SX. The target platform
in this thesis is the FX platform. Each platform’s features may differ, but a common set of
features between all the Virtex-4 FPGA platforms are as follows:
• 500 MHz system clocking
• 500 MHz XtremeDSP slices
• 500 MHz integrated block memory
• 1 Gb/s LVDS with SelectIO technology
Those features unique to the FX platform are:
• Embedded PowerPC 405 (PPC405) core with up to 450 MHz operation
• Auxiliary Processor Unit (APU) Interface for direct connection from PPC405 to co-
processors in fabric
• RocketIO Multi-Gigabit Transceiver (MGT) capable of 622 Mb/s to 6.5 Gb/s baud
rates
Xilinx also offers various IP cores for commonly used complex functions including
DSP, bus interfaces, processors, and processor peripherals. The Xilinx CORE Generator
application allows the creation of customized IP cores for integration with the FPGA.
5.2 PowerPC 405 Embedded Processor
The V4FX supports up to 2 IBM PowerPC 405 (PPC405) cores. The PPC405 CPU core is
a 32-bit RISC processor. A block diagram of the PPC405 CPU core is presented in Figure
5.1. The PPC405’s features include [17]:
• Up to 450 MHz clock speed
• 32-bit architecture (fixed-point execution unit fully compliant with the PowerPC
UISA)
• Thirty-two 32-bit general purpose registers (GPRs)
• Five-stage pipeline with single-cycle execution of most instructions, including loads
and stores
• Multiply-accumulate instructions
• 16 KB, 2-way set associative instruction and data caches
• Memory Management Unit (MMU) supports multiple page sizes
• On-Chip Memory (OCM) controller serves as a dedicated interface between the
FPGA block RAMs and the PPC405 core
• Auxiliary Processor Unit (APU) controller allowing extension of the PPC405 instruc-
tion set with custom instructions executed by an FPGA Fabric Co-Processor Module
(FCM)
The PPC405 implements a 5-stage instruction pipeline consisting of fetch, decode, execute,
write-back, and load write-back stages [17]. A constant flow of instructions is sent
to the execute unit, after being fetched and decoded. If execution stalls, instructions are
queued in the fetch queue, and processed when appropriate. The fetch and decode logic
can simultaneously process up to two branches. Branch prediction is performed by the
fetch and decode logic if a branch cannot be resolved pre-execution, causing the processor
to fetch instructions based on the predicted branch.
The PPC405 supports up to 450 MHz clock frequency with the highest speed grade
(-12). The PPC405s used in this thesis were of speed grade -11, allowing for up to 400
MHz maximum clock frequency.
Figure 5.1: PPC405 CPU core block diagram. [11]
5.3 Auxiliary Processor Unit Controller
The Auxiliary Processor Unit (APU) controller provides a flexible high-bandwidth inter-
face between the reconfigurable logic in the FPGA fabric and the pipeline of the integrated
IBM PowerPC 405 CPU [18]. The APU allows an extension of the PPC405 instruction set
with custom instructions executed by an FPGA Fabric Co-Processor Module (FCM). This
enables user-defined and configured hardware accelerators, which can serve as a means to
offload demanding computational tasks from the CPU. The instructions supported by the
APU are classified into three main categories [18]:
• User-defined instructions (UDI)
• PowerPC floating-point instructions
• APU load/store instructions
The maximum PPC405 clock frequency allowable when using the APU is 275 MHz,
when using a PPC405 speed grade of -11 [24] (as used in this thesis). A block diagram of
the APU is shown in Figure 5.2.
Figure 5.2: APU controller processing operative block diagram [18]
5.4 ML410 Development Board
The target platform for this thesis is the Xilinx ML410 Development Board. The ML410
features a Xilinx Virtex-4 FX FPGA, with a speed grade of -11. This speed grade allows
the embedded PPC405s to run at a maximum clock speed of 400 MHz, or 275 MHz when
connected to the APU [24]. In addition to the more than 30,000 logic cells, over 2,400 kb
of block RAM, dual IBM PPC405 processors, and RocketIO transceivers available in the
FPGA, the ML410 provides an onboard Ethernet MAC PHY, DDR memory, multiple PCI
bus slots, and standard front panel interface ports within an ATX form factor motherboard
[16]. Some of the main features offered by the ML410 include:
• 32-bit component DDR memory and 64-bit DDR2 DIMM
• 512 MB CompactFlash (CF) card and System ACE CF controller for configuration
• Two onboard 10/100/1000 Ethernet PHYs with RJ-45 connectors
• PCI Express interface and MIC2592B PCI Express power controller
• Two UARTs with RS-232 connectors
• I/O capabilities such as PS/2, USB and GPIO
An image of the board as well as a comprehensive listing of the board’s features is
presented in Figure 5.3. The SPEA2 base system on this platform will be discussed in the
following section.
5.5 SPEA2 Base System
The base SPEA2 system was built using the Xilinx Platform Studio (XPS) and the Base
System Builder (BSB) tool in the Xilinx Embedded Development Kit (EDK). The SPEA2
base system diagram is shown in Figure 5.4. The main components in this system include:
Figure 5.3: ML410 Development Board [16]
• PowerPC 405 core at 200 MHz clock frequency
• 256 KB block ram
• 256 MB DDR2 ram
• RS232 UARTLITE peripheral
• 512 MB SysAce Compact Flash peripheral
It was determined that 256 Kbytes of block ram were needed to accommodate the in-
struction code or text portion of the application. Up to 256 Mbytes of DDR2 ram is avail-
able, so the maximum quantity was used for stack and heap data, allowing growth for future
systems which may require more storage as the current scenario is scaled up. An RS232
UARTLITE peripheral was required for debugging purposes by means of serial commu-
nication. A SysAce Compact Flash peripheral was also added to the system, to allow for
storage of the initial population and corresponding fitness values. These peripherals were
connected to the PPC405 core through the PLB bus.
The device utilization of the base SPEA2 system is shown in Figure 5.5. The number
of occupied slices in this base system is 14%, leaving a majority of hardware resources for
co-processor and custom logic usage.
[Block diagram: ppc405_virtex4 (PPC405 CPU, 200 MHz); mpmc (DDR2 SDRAM, 256 MB);
xps_uartlite (RS232); xps_sysace (SysAce Compact Flash); xps_bram_if_cntrl (BRAM
Controller) with bram_block (BRAM BLOCK, 256kb); PLB buses ppc405_0_iplb and ppc405_0_dplb]
Figure 5.4: Base SPEA2 system diagram
Figure 5.5: Device utilization summary of SPEA2 base system
Chapter 6
Performance Analysis and Optimizations
Amuso and Enslin [2, 9] identified the computational complexity of this algorithm, stating
that one round of the algorithm with 50 members took approximately 5 minutes to complete
(with 500 rounds needed for completion). Thus, a detailed account of the algorithm
will be provided in the following sections. The evaluation function is discussed
in Section 6.1, followed by execution times and profiling details in Section 6.2. Software
and hardware optimizations aimed at reducing run-time are presented in Sections
6.3 and 6.4, respectively. An analysis of the evolutionary and performance results after
these optimizations will be given in Section 6.5.
6.1 Evaluation Function
In one round of SPEA2, the fitness evaluation function must be performed once for each
member of the general population, which in this scenario has 50 members. This fitness
evaluation function is the most computationally expensive function in the algorithm. Much of this
complexity is associated with processing the SAR image VPH. Pseudocode for the fitness
evaluation function can be seen in Listing 6.1.
Function | Description | Operations
EvaluationDecode() | Decode the radar parameters | -
LightUp() | Build VPH based on parameters | -
2D_IFFT() | 2D inverse FFT of current cell | 65536 IFFTs (256x256)
IFFT_shift() | Shift 2D IFFT result to align data correctly; swaps quadrants 1 <-> 3 and 2 <-> 4 | 131072 shifts (256x256x2)
CalcPower() | Loop through 2D IFFT results and calculate power and maximum power (power[i][j] = Re[i][j]^2 + Im[i][j]^2) | 65536 power calculations (256x256)
ScalePower() | Scale power using max power (power[i][j] = power[i][j]/maxpower) | 65536 scalings (256x256)
CalcSideLobes() | Calculates the PSL and ISL value for the current cell | -
Table 6.1: Evaluation function descriptions and number of operations per range cell (150
cells total)
Listing 6.1: Evaluation function pseudo-code
    EvaluationDecode();
    LightUp();

    // Evaluate each range cell
    for (i = 0; i < numCellRows; i++) {        // numCellRows = 10
        for (j = 0; j < numCellCols; j++) {    // numCellCols = 15
            2D_IFFT();
            IFFT_shift();
            CalcPower();
            ScalePower();
            CalcSideLobes();
        }
    }
A brief description of each operation in the evaluation function is provided in Table 6.1.
The first step in the evaluation function is the parameter decode. The parameters encoded
in the current member's waveform are decoded into azimuth pointing angle values (in the
scenario explored in this thesis) for each CPI. The next function called is the LightUp()
function. This function is responsible for "lighting up" the VPH: based on
the azimuth angles, the radar's traversal is traced and analyzed at each CPI, and the area of
the region of interest (ROI) that is spotted by the radar is flagged.
In the case of the VPH, each range cell (10x15 cells) has its own 2D array of integer
values, with row and column dimensions equivalent to the number of CPIs in the scenario,
which is 128 for this thesis. If a range cell is spotted by the radar for a particular CPI, the
column corresponding to that CPI in the VPH is filled with 1's. It should be noted that a
range cell in the ROI is considered spotted only if the entire cell is inside the radar's
beam; no partial cell spotting is allowed.
The next phase of the evaluation function is to loop through each range cell and determine
its fitness. This initially involves performing a 256-point, two-dimensional inverse
Fast Fourier Transform (2D IFFT) on the VPH for the current cell. This 2D IFFT is accomplished
using a series of 1D IFFTs. A 2D IFFT is essentially the result of performing
1D IFFTs on all of the rows and then performing 1D IFFTs on all of the columns (or vice
versa). In this case, the 1D row IFFTs are calculated first, then the 1D column IFFTs. A
freely available FFT library found on the web was utilized for the IFFT operation itself.
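The row-then-column decomposition described above can be illustrated with a minimal sketch. It is not the FFT library used in this thesis: a naive 4-point inverse DFT with exact twiddle factors stands in for the 256-point library call, and the names N, idft_1d, and idft_2d are hypothetical.

```c
#include <stddef.h>

#define N 4  /* illustrative size only; the thesis uses 256-point transforms */

/* Exact values of cos(2*pi*m/4) and sin(2*pi*m/4) for m = 0..3. */
static const double COS4[N] = { 1.0, 0.0, -1.0,  0.0 };
static const double SIN4[N] = { 0.0, 1.0,  0.0, -1.0 };

/* Naive 1D inverse DFT over N elements spaced `stride` apart.
 * Stand-in for the FFT library call; includes the 1/N scale factor. */
static void idft_1d(double *re, double *im, size_t stride)
{
    double out_re[N], out_im[N];
    for (size_t n = 0; n < N; n++) {
        double acc_re = 0.0, acc_im = 0.0;
        for (size_t k = 0; k < N; k++) {
            size_t m = (k * n) % N;                 /* twiddle index */
            double xr = re[k * stride], xi = im[k * stride];
            /* (xr + j*xi) * (cos + j*sin) */
            acc_re += xr * COS4[m] - xi * SIN4[m];
            acc_im += xr * SIN4[m] + xi * COS4[m];
        }
        out_re[n] = acc_re / N;
        out_im[n] = acc_im / N;
    }
    for (size_t n = 0; n < N; n++) {
        re[n * stride] = out_re[n];
        im[n * stride] = out_im[n];
    }
}

/* 2D inverse transform: 1D transforms on all rows, then on all columns. */
void idft_2d(double re[N][N], double im[N][N])
{
    for (size_t r = 0; r < N; r++)
        idft_1d(&re[r][0], &im[r][0], 1);   /* rows are contiguous */
    for (size_t c = 0; c < N; c++)
        idft_1d(&re[0][c], &im[0][c], N);   /* columns have stride N */
}
```

With a single lit column as input, all energy ends up in the first row of the result, since every column is constant after the row pass.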
Next, the 2D IFFT results are shifted, in order to properly place the mainbeam power in
the center of the image. This involves swapping the real and imaginary 2D IFFT resulting
values of quadrants 1 with 3, and quadrants 2 with 4. The power of each real and imaginary
value pair is then calculated by the following equation:
$$P_{ij} = \mathrm{Re}_{ij}^{2} + \mathrm{Im}_{ij}^{2} \qquad (6.1)$$
where P is power, and Re and Im represent the real and imaginary values, respectively.
Next is the power scaling phase. The maximum power value is determined, then used to
scale each power value by $\frac{1}{\text{max power}}$.
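The quadrant swap used by the IFFT shift can be sketched as follows. This is an illustrative version only: the array size DIM and the function name ifft_shift are assumptions, and in the baseline flow it would be called once for the real values and once for the imaginary values.

```c
#include <stddef.h>

#define DIM 4  /* illustrative; the thesis works on 256x256 results */

/* Swap diagonally opposite quadrants in place (1 <-> 3 and 2 <-> 4),
 * which moves the element at [0][0] to [DIM/2][DIM/2]. */
void ifft_shift(double a[DIM][DIM])
{
    const size_t h = DIM / 2;
    for (size_t i = 0; i < h; i++) {            /* top half only: each pair once */
        for (size_t j = 0; j < DIM; j++) {
            size_t i2 = i + h;
            size_t j2 = (j + h) % DIM;
            double tmp = a[i][j];
            a[i][j] = a[i2][j2];
            a[i2][j2] = tmp;
        }
    }
}
```

Iterating over only the top half of the rows ensures that each pair of elements is exchanged exactly once.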
After the power values have been normalized, the next step involves calculating the side
lobes. The ISL is calculated by adding up all the power in the sidelobes and subtracting
from it all the power in the main beam. The PSL is calculated by looping through all of the
peak side lobe values and determining the maximum value. The ISL and PSL values are
passed to a function that uses a corresponding objective function to map them to a fitness
value. The ISL and PSL values found for each cell are summed and averaged, to determine
the overall fitness for the current member being evaluated.
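A rough sketch of the side-lobe step is given below under explicit assumptions, since the extent of the main beam is not stated here. The main beam is assumed to be a square window of half-width half_width centered on the shifted image, and the PSL is approximated by the largest sample outside that window; the names sidelobes and calc_sidelobes are hypothetical.

```c
#include <stddef.h>

typedef struct {
    double isl;  /* sidelobe power minus main-beam power, as described in the text */
    double psl;  /* largest power sample outside the main beam (assumed PSL)      */
} sidelobes;

/* power: dim x dim normalized power image, row-major, main beam at the center.
 * half_width: assumed half-width of the square main-beam window. */
sidelobes calc_sidelobes(const double *power, size_t dim, size_t half_width)
{
    const size_t c = dim / 2;
    double main_sum = 0.0, side_sum = 0.0, psl = 0.0;

    for (size_t i = 0; i < dim; i++) {
        for (size_t j = 0; j < dim; j++) {
            size_t di = (i > c) ? i - c : c - i;
            size_t dj = (j > c) ? j - c : c - j;
            double p = power[i * dim + j];
            if (di <= half_width && dj <= half_width) {
                main_sum += p;                  /* inside the main beam */
            } else {
                side_sum += p;                  /* sidelobe region */
                if (p > psl)
                    psl = p;
            }
        }
    }
    sidelobes result = { side_sum - main_sum, psl };
    return result;
}
```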
6.2 Profiling
Due to the frequency of fitness evaluation function calls and their associated execution
time, the evaluation function was targeted for performance profiling. In comparison to the
magnitude of the fitness evaluation run-time, the run-time of the rest of the SPEA2 code is
insignificant. The C code implementation on a PC was profiled in Microsoft Visual Studio
using the <time.h> C library for timing calculations. The pie chart in Figure 6.1 shows
the distribution of execution times of the various sub-functions of the evaluation function,
represented by percentages.
An obvious conclusion after observing the run-time percentages of the evaluation function
is that the 2D IFFT computation takes up the majority of the run-time, consuming
91.83% of the time. The side-lobe calculations take up 4.28% of the evaluation run-time,
while the next most time-consuming calculation is power, which takes up 3.44% of the time. The
rest of the functions take up less than 1% of the time. From this data, it can be concluded
that the 2D IFFT is the bottleneck of the evaluation function, and hence, the bottleneck of
the entire program. Software and hardware optimizations for reducing the run-time of the
bottleneck will be discussed in the following sections, as well as performance results from
these optimizations.
6.3 Software Optimizations
Many software optimizations were applied to the SPEA2 base system software. Most of
these optimizations are purely software based, while the VPH transposition (to be discussed
in Section 6.3.6) is a software optimization that coincides with a hardware optimization, to be
discussed in Section 6.4. All software optimizations will be discussed in the following
sub-sections.
Figure 6.1: Evaluation function execution time pie chart
6.3.1 Fixed-Point Conversion
The PowerPC405 has no dedicated floating point unit, only floating point emulation. Double
precision floating point numbers and operations are used heavily in the SPEA2 PC
software implementation. Floating point emulation is very inefficient, so fixed point arithmetic
was employed in place of floating point arithmetic. The fitness evaluation function
in particular had a great number of repeated double precision operations. Those operations
which were repeated most often were converted to 32-bit fixed point calculations.
Floating point to fixed point conversion was verified by comparing fitness evaluations
of the initial population with the initial fitness values calculated in the double precision
floating point implementation on a PC. The error margin for the fitness values was +/- 0.002,
which was enough accuracy for this application. These new initial fitness values were then
used as the baseline initial fitness values for the simulations.
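A minimal sketch of 32-bit fixed point arithmetic is shown below. The Q16.16 split (16 integer bits, 16 fractional bits) is an assumption made for illustration, as the format actually used is not stated; the names fix32, fix_mul, and within_tolerance are hypothetical.

```c
#include <stdint.h>

#define FRAC_BITS 16  /* assumed Q16.16 split; the thesis does not state the format */

typedef int32_t fix32;

static inline fix32 fix_from_double(double x)
{
    return (fix32)(x * (double)(1 << FRAC_BITS));
}

static inline double fix_to_double(fix32 x)
{
    return (double)x / (double)(1 << FRAC_BITS);
}

/* Multiply through a 64-bit intermediate so the product cannot overflow
 * before it is rescaled. Right-shifting a negative value is
 * implementation-defined in C; common compilers shift arithmetically. */
static inline fix32 fix_mul(fix32 a, fix32 b)
{
    return (fix32)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* Check a fixed point result against a double precision reference,
 * e.g. with tol = 0.002 as used for the fitness values. */
int within_tolerance(double fixed_result, double reference, double tol)
{
    double diff = fixed_result - reference;
    if (diff < 0.0)
        diff = -diff;
    return diff <= tol;
}
```

For example, converting 0.3 to this format and back gives a value within about 0.00002 of the original, well inside the +/- 0.002 margin.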
6.3.2 In-line Function
A simple optimization implemented was changing the power function into an inline func-
tion. This function gets called many times in a nested for loop, for each cell being processed
in the VPH, for each of the members in the general population. The overhead associated
with calling such a function repeatedly was significant in terms of affecting run-time. The
function was inherently a one line function, so converting it to an inline function was ef-
fortless and beneficial.
6.3.3 Data Flow
Originally the power calculation was performed after the 2D IFFT shift. This meant that
the IFFT shift had to shift both the real and imaginary values for each index in the VPH. By
moving the power calculation before the 2D IFFT shift, only the power values need to be
shifted, so half of the shifts in this function are eliminated and only half of the memory
accesses are required.
The power scaling was initially implemented with two independent sets of nested for loops:
the first looped through the power values to find the maximum power value, and the second
normalized each power value by dividing it by the maximum value. Instead, the maximum
value check was moved into the power calculation loop, to cut down on redundant memory
accesses. Also, rather than dividing each power value by the maximum value, a scaling
variable was used to store the value $\frac{1}{\text{max power}}$, and each power value was
multiplied by this value. This reduced the number of clock cycles from 35 cycles for each
divide to 4 cycles for each 64-bit multiplication, thus saving 31 clock cycles for every
power scaling operation.
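The two data flow changes to the power scaling can be sketched as follows. This is an illustrative floating point version rather than the fixed point code of the thesis, and the names cell_power, calc_power, and scale_power are hypothetical.

```c
#include <stddef.h>

/* Power of one complex sample; written as an inline function (see Section 6.3.2). */
static inline double cell_power(double re, double im)
{
    return re * re + im * im;
}

/* Compute the power of every sample and track the maximum in the same loop,
 * so no separate pass over the data is needed. Returns the maximum power. */
double calc_power(const double *re, const double *im, double *power, size_t count)
{
    double max_power = 0.0;
    for (size_t i = 0; i < count; i++) {
        power[i] = cell_power(re[i], im[i]);
        if (power[i] > max_power)
            max_power = power[i];
    }
    return max_power;
}

/* Normalize with one divide in total, then one multiply per sample. */
void scale_power(double *power, size_t count, double max_power)
{
    if (max_power == 0.0)
        return;                      /* nothing lit: avoid dividing by zero */
    const double scale = 1.0 / max_power;
    for (size_t i = 0; i < count; i++)
        power[i] *= scale;
}
```

Taking the cycle counts quoted above at face value, replacing 65536 divides per range cell with multiplies saves roughly 31 x 65536 = 2,031,616 cycles per cell.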
6.3.4 Compiler Optimizations
All levels of GCC compiler optimizations (-O1, -O2, -O3, -Os) were incompatible with the
program. In each case the optimization caused the application to behave in an undesirable
manner, such that the resulting output was incorrect. This is detrimental to run-time
performance because these compiler optimizations offer a significant decrease in overall execution
time. Instead of using one of the levels of optimization, each of the GCC optimization
option flags was added to the compilation one by one, and functionality of the application was
verified with each addition. Most of the flags were added to the compilation successfully,
but the performance gain was not comparable to even a -O1 level optimization, although a slight
increase in performance was achieved.
6.3.5 Sparse 2D IFFT
Due to the arrangement of data in the VPH, redundant 1D IFFT calculations were identified
in the 2D IFFT operation. The VPH is a 2D array, such that a column that is "lit up"
contains all 1's, and all 0's otherwise. In terms of a 2D IFFT, which requires 1D IFFTs of all
rows and then columns (or vice versa), this means that if the 1D row IFFTs are calculated first,
each 1D row IFFT will yield the same values. These redundant IFFT calculations were
exploited by calculating only one 1D row IFFT, and copying its values to the remaining rows
in the VPH. Thus, the total number of 1D IFFTs performed was reduced from $N \times N = N^2$
(where N is the row and column dimension) to N + 1.
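The row replication can be sketched as below. The 1D transform is passed in as a function pointer with a hypothetical signature, and a 2-point inverse DFT (idft_2pt) is included only as a stand-in so that the sketch can be exercised without the FFT library; neither name comes from the thesis code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature for a 1D inverse FFT over n points spaced `stride` apart. */
typedef void (*ifft1d_fn)(double *re, double *im, size_t n, size_t stride);

/* Sparse 2D IFFT for an n x n row-major array whose rows are all identical.
 * Returns the number of 1D transforms performed (n + 1). */
size_t sparse_ifft_2d(double *re, double *im, size_t n, ifft1d_fn ifft1d)
{
    size_t calls = 0;

    /* Every row is identical, so transform row 0 once... */
    ifft1d(re, im, n, 1);
    calls++;

    /* ...and copy the result into rows 1..n-1. */
    for (size_t r = 1; r < n; r++) {
        memcpy(re + r * n, re, n * sizeof *re);
        memcpy(im + r * n, im, n * sizeof *im);
    }

    /* The n column transforms are still required. */
    for (size_t c = 0; c < n; c++) {
        ifft1d(re + c, im + c, n, n);
        calls++;
    }
    return calls;
}

/* Stand-in transform for n == 2 only: x0 = (X0 + X1)/2, x1 = (X0 - X1)/2. */
static void idft_2pt(double *re, double *im, size_t n, size_t stride)
{
    (void)n;  /* fixed at two points */
    double r0 = re[0], r1 = re[stride];
    double i0 = im[0], i1 = im[stride];
    re[0] = (r0 + r1) / 2.0;  re[stride] = (r0 - r1) / 2.0;
    im[0] = (i0 + i1) / 2.0;  im[stride] = (i0 - i1) / 2.0;
}
```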
6.3.6 VPH Transposition
As discussed in the previous section, the 1D row IFFT’s are calculated first, then the 1D
column IFFT’s. The VPH data is stored in a 2D array of integers. In this case, the row
data is contiguous in the VPH, but columns are not. This is problematic when transferring
data to and from the FPGA fabric from the PPC405 core, via the APU. Since only one
row IFFT is required, while many column IFFT’s are required, it would be more efficient
to have column data contiguous in the VPH to save overhead in the transfer, such that the
column values would not have to each be accessed individually and copied into a separate
array. In order to accomplish this, the VPH was transposed to switch the rows and columns,
allowing for contiguous VPH data in the columns of the 2D array.
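The effect of the transposition can be sketched with a simple in-place transpose of a square, row-major integer array. The function name transpose_square is hypothetical, and the thesis code may instead build the VPH directly in transposed form.

```c
#include <stddef.h>

/* In-place transpose of an n x n row-major array. After the call, what used
 * to be column c occupies the contiguous elements a[c*n] .. a[c*n + n - 1],
 * so it can be handed to a transfer routine without a gather/copy step. */
void transpose_square(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int tmp = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = tmp;
        }
    }
}
```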
6.4 Hardware Optimizations
By observing the pie chart of fitness evaluation sub-function execution times in Figure 6.1,
it was determined that the 2DIFFT() function was consuming a majority of the time. In
general, FFT operations are well suited for hardware speedup. Thus, an FFT co-processor
was added to the SPEA2 base system discussed in Section 5.5. The system diagram with
the FFT co-processor interfaced to the PPC405 via the APU is displayed in Figure 6.2.
[Block diagram: the base system of Figure 5.4 (ppc405_virtex4 PPC405 CPU at 200 MHz, mpmc
DDR2 SDRAM 256 MB, xps_uartlite RS232, xps_sysace SysAce Compact Flash, xps_bram_if_cntrl
BRAM Controller with bram_block 256kb) with an added fcm_custom block (FFT CORE, 100MHz)
connected between the MFCB and SFCB ports]
Figure 6.2: SPEA2 system with FFT co-processor
The FFT co-processor was first created and tested before integrating it with the entire
system. Xilinx CORE Generator was used to create and customize a Xilinx LogiCORE IP
Fast Fourier Transform (FFT), commonly referred to as an XFFT core. The XFFT core
provides four architecture options which offer a trade-off between core size and transform
time [21]:
1. Pipelined, Streaming I/O
2. Radix-4, Burst I/O
3. Radix-2, Burst I/O
4. Radix-2 Lite, Burst I/O
The Pipelined, Streaming I/O architecture was chosen because it offers the highest
performance of those available, at the cost of consuming the most logic resources.
This Pipelined, Streaming I/O architecture consists of multiple Radix-2 butterfly processing
elements, arranged in a manner that facilitates continuous data processing. Memory
banks for input and intermediate data are present in each Radix-2 element. This architecture
allows for continuous streaming data (input data is streamed while unloading results),
frame by frame continuous data, or frame data with time intervals in between. The
implementation in this thesis uses frame by frame data with intervals in between, such that 1D
IFFT data is sent to the FPGA fabric via the APU, processed, then sent back to the PPC405
core via the APU, at which point a frame has been processed and the next frame is loaded.
The parameters chosen for the XFFT core, as well as the corresponding resources
needed are displayed in the CORE Generator screenshot in Figure 6.4. An FFT length
of 256 was chosen, with an input data and phase factor width of 24 bits. The scaled option
was selected, allowing a scale factor of 1/N to be used for the IFFT. Truncation was chosen
for the rounding method, with the output ordering in natural order. An additional CE pin
was selected to enable resetting of the XFFT core when necessary. The core requires 70
XtremeDSP slices and 3 Block RAM’s.
Figure 6.3: Pipelined, Streaming IO FFT Architecture [21]
Figure 6.4: CORE Generator screenshot of XFFT parameters and resources
The state machine for the APU load instructions is shown in Figure 6.5. Quad-word
loads were used in this implementation, such that 4 words, 32 bits each, were transferred
with each APU load instruction. The initial implementation used 16 bits of input data for
the XFFT core. This was a convenient number of bits, because it allowed 1 FFT input
(real and imaginary value) to be packed into a single word. Thus, each quad-load would
transfer 4 FFT inputs (8 values total). Unfortunately, 16 bits was not enough resolution for
the given data set, and resulted in values which were incorrect, hindering the accuracy of
fitness values. Instead, bit depth was increased to the maximum 24 bits, providing enough
resolution to preserve fitness calculations. Consequently, only one value could be packed
into one word, meaning only 2 FFT inputs could be transferred with each quad-load (2 real
values and 2 imaginary values).
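The two packing schemes can be sketched as follows. The bit ordering (real value in the upper half-word) and all function names are assumptions made for illustration; the actual word layout expected by the FCM is not given here.

```c
#include <stdint.h>

/* 16-bit mode: one FFT input per 32-bit word.
 * Assumed layout: real part in the upper 16 bits, imaginary part in the lower 16. */
uint32_t pack16(int16_t re, int16_t im)
{
    return ((uint32_t)(uint16_t)re << 16) | (uint32_t)(uint16_t)im;
}

/* 24-bit mode: one value per 32-bit word, keeping the low 24 bits. */
uint32_t pack24(int32_t value)
{
    return (uint32_t)value & 0x00FFFFFFu;
}

/* Recover a signed 24-bit value by sign-extending bit 23. The conversion of
 * the extended pattern to int32_t relies on two's complement behavior. */
int32_t unpack24(uint32_t word)
{
    uint32_t v = word & 0x00FFFFFFu;
    return (v & 0x00800000u) ? (int32_t)(v | 0xFF000000u) : (int32_t)v;
}

/* Number of quad-word (4 x 32-bit) loads needed to transfer n_inputs complex
 * FFT inputs when each input occupies words_per_input words. */
unsigned quad_loads_needed(unsigned n_inputs, unsigned words_per_input)
{
    const unsigned words_per_quad = 4;
    return (n_inputs * words_per_input + words_per_quad - 1) / words_per_quad;
}
```

For a 256-point transform this gives 64 quad-loads in 16-bit mode (one word per input) and 128 quad-loads in 24-bit mode (two words per input), which is the cost of the added resolution.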
The APU load instruction FFT state machine is shown in Figure 6.5. The FFT values
are first loaded by quad-word loads until N inputs are received (in this case N = 256). This
prompts the state machine to transition to the FFT states, signaling the XFFT core to begin
loading values. When values are loaded from the APU, they are stored temporarily in a
FIFO, then unloaded from the FIFO into the XFFT core.
Quad-stores were used in this implementation to transfer data from the FPGA fabric
back to the PPC405 core, via the APU. After an IFFT has been completed, store instructions
are issued by the software. The state machine for the APU store instructions is shown
in Figure 6.6. This state machine is quite simple: it engages the store states after
checking that the data is valid and that IFFT execution or some other operation is
not currently active. Data from the IFFT is temporarily loaded into an output FIFO during
the FFT states, and unloaded from that FIFO during the store states.
The resource utilization of the SPEA2 system with the XFFT co-processor core is
shown in Figure 6.7. This system used a rather modest 37% of the available slices. This
bodes well for the implementation of a dual-PPC405 system with a second FFT co-processor
attached (to be discussed in Section 7.2), as there should be sufficient resources to
duplicate the current system.
[State diagram. States: IDLE, LOADQ0, LOADQ1, LOADQ2, LOADQ3, LOAD_DONE, FFT_START,
FFT_LOAD_START, FFT_LOAD, FFT_EXE, FFT_UNLOAD, FFT_DONE, FFT_IDLE1, FFT_IDLE2.
Transition conditions: RESET; no valid instruction; APUFCMINSTRVALID = '1' and
APUFCMDECODED = '1' with a LOAD type instruction; APUFCMLOADDVALID = '1'; load_count;
fft_load_count; edone; fft_unload_count]
Figure 6.5: APU Load Instruction and FFT State Machine
[State diagram. States: IDLE, STORE_PRE, STORE_WAIT, STOREQ0, STOREQ1, STOREQ2, STOREQ3,
STORE_DONE. Transition conditions: RESET; no valid instruction; APUFCMINSTRVALID = '1' and
APUFCMDECODED = '1' with a STORE type instruction; data_valid]
Figure 6.6: APU Store Instruction State Machine
Figure 6.7: Device utilization summary of SPEA2 system with FFT co-processor
6.5 Results
The scatter plot in Figure 6.8 shows the results from the software/hardware optimized system
with a scenario using 50 population members and 10 archive members. The initial population
and the final archive population after 500 generations are depicted, exhibiting an increase in the
fitness of both objective functions, similar to that of the purely software results on a PC,
discussed in Section 4.6.
Figure 6.8: Initial population and final archive population for hardware/software implementation
The execution times for the non-optimized evaluation function on a PC, as well as for
an optimized evaluation function on a PC are shown in Table 6.2. Also shown in the table
is percentage of execution time and number of function calls per single evaluation.
The execution times for the non-optimized evaluation function on the V4FX, as well
as for an optimized evaluation function on the V4FX are shown in Table 6.3. Evaluation
function run-time percentage as well as number of times called for each function are also
shown in the table.
Function | Times Called | Non-Optimized Evaluation Function (Floating Point), Execution Time (sec) | Optimized Evaluation Function (Fixed Point), Execution Time (sec)
EvalDecode() | 1 | 0.000000004 (<0.1%) | 0.0000000083 (<0.1%)
LightUp() | 1 | 0.000003296 (<0.1%) | 0.00000255 (<0.1%)
2DIFFT() | 150 | 0.0172 (91.83%) | 0.0223 (86.49%)
PowerCalc() | 150 | 0.000645 (3.44%) | 0.00267 (10.36%)
ScalePower() | 150 | 0.0000591 (0.32%) | 0.0000144 (<0.1%)
IFFTShift() | 150 | 0.0000166 (<0.1%) | 0.0000076 (<0.1%)
CalcSideLobes() | 150 | 0.000801 (4.28%) | 0.00079 (3.06%)
Total Time | | 2.8096083 | 3.8673026
Table 6.2: SPEA2 PC Implementation Performance

Function | Times Called | Optimized Evaluation Function (No HW), Execution Time (sec) | Optimized Evaluation Function (w/ HW), Execution Time (sec)
EvalDecode() | 1 | 0.018435 (0.0002%) | 0.018386 (0.0008%)
LightUp() | 1 | 0.674594 (0.74%) | 0.62872 (2.69%)
2DIFFT() | 150 | 0.492765 (81.3%) | 0.043123 (27.7%)
PowerCalc() | 150 | 0.048805 (8.05%) | 0.048895 (31.39%)
ScalePower() | 150 | 0.020197 (3.33%) | 0.020196 (12.96%)
IFFTShift() | 150 | 0.013932 (2.3%) | 0.013844 (8.89%)
CalcSideLobes() | 150 | 0.025458 (4.2%) | 0.025417 (16.32%)
Total Time | | 90.865979 | 23.368356
Table 6.3: SPEA2 Hybrid FPGA Implementation Performance

The percentage of run times for software only optimizations and software/hardware
optimizations on the V4FX, as well as for the initial non-optimized PC implementation,
are shown in the form of a bar graph in Figure 6.9. These results will be discussed in more
detail in Section 6.6.
6.6 Result Analysis
6.6.1 Comparison of Results
The results in Table 6.3 show a clear speedup of a hardware/software optimized SPEA2
system compared to a software optimized only system, running on a V4FX. The overall
speedup of the hardware/software optimized system over the software only optimized system
is 3.89. The main bottleneck of the evaluation function (the 2D IFFT operation) was
reduced from 81.3% of evaluation function run time to 27.7%, corresponding to a speedup of
11.43 for the 2D IFFT operation itself. Consequently, the power function becomes the new
bottleneck, now consuming 31.39% of evaluation function run time, compared to the previous
value of 8.05%. The run time percentages of the other software only functions
increased as well, as represented in Figure 6.9.
Figure 6.9: Percentage of run times for non-optimized software on PC and software only
optimizations and software/hardware optimizations on the Virtex4FX
The hardware/software optimized system on the V4FX is 8.32 times slower than a
non-optimized floating point implementation running on a PC. This result is not ideal,
as speedup over the PC implementation was not achieved. The PC implementation was
running on a processor with a clock speed of 2.16 GHz, compared to the V4FX
hardware/software optimized system running on a PPC405 core at 200 MHz, and an FPGA fabric
clocked at 100 MHz.
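The ratios quoted in this section follow directly from the times in Tables 6.2 and 6.3. A small helper, given here only as a worked check with a hypothetical function name, reproduces them:

```c
/* Ratio of two execution times; values greater than 1 mean the second is faster. */
double speedup(double t_before, double t_after)
{
    return t_before / t_after;
}

/* Using the times from Tables 6.2 and 6.3:
 *   speedup(90.865979, 23.368356)  is about 3.89  (software only vs. hardware/software on the V4FX)
 *   speedup(0.492765, 0.043123)    is about 11.43 (2D IFFT, software vs. hardware)
 *   speedup(23.368356, 2.8096083)  is about 8.32  (factor by which the PC remains faster)
 */
```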
6.6.2 Software Bottlenecks
The abundance of software-only operations on such a low-power processor as the PPC405
has proved to be a large burden in achieving speedup over a high-performance CPU, even
when successfully offloading a massive bottleneck to hardware for speedup. This indicates
the need to use a system with a higher processor clock speed, in order to overcome the
software bottlenecks which otherwise would be very hard to eliminate. If the new software
bottlenecks could be reduced, further speedup of the operations offloaded into hardware
would become worthwhile. Future work, including methods for improving the SPEA2 system
discussed in this thesis, as well as solutions utilizing alternative target platforms,
will be discussed in Chapter 7.
Chapter 7
Future Work
The performance of this algorithm, as discussed in Section 6.6.1, was not ideal. Expectations
of a performance increase over a PC implementation were not met using an optimized
V4FX implementation. This chapter will propose methods for future work, with the
goal of achieving the desired performance. These methods will be discussed in the following
sections, as well as alternative Hybrid FPGA solutions.
7.1 Increasing Clock Speed
The first step in reducing the software bottlenecks discussed in Section 6.6.2 would be to increase
the clock speed of the processor executing the code. The PowerPC in this experiment was
clocked at only 200 MHz. On the PowerPC405 the APU will operate at a maximum clock speed
of 275 MHz. This would not provide the type of performance increase necessary to significantly
decrease run time. In order to operate at a more acceptable clock speed, the entire system
may have to be swapped for a higher-performance Xilinx system.
The Xilinx Virtex5 FXT (V5FXT) may offer the computing power needed to tackle
the software bottlenecks realized in this system. The V5FXT has one or more embedded
PowerPC440 (PPC440) RISC CPUs. The PPC440 CPU is capable of running at up to 550
MHz [25]. The PPC440 CPU has an APU interface that supports hardware acceleration,
like the PPC405 in the V4FX. An FCM in the V5FXT can run at integer multiples of the
processor clock period, up to the actual processor clock speed (1:1), up to a maximum of
275 MHz. The APU controller uses the processor clock for the APU controller/processor
interface as well as its internal logic [23]. This essentially means that the FCM (which
in the case of this thesis is the FFT co-processor) can run at half the maximum speed of the
processor (275 MHz), while the APU and processor can both run at the same maximum
clock speed of 550 MHz. This could offer vast performance improvement over the V4FX
implementation in this thesis, which used a PPC405 clock speed of 200 MHz and an FCM
clock speed of 100 MHz. The APU on the PPC440 is also capable of executing quad-word
(128 bits) APU load/store instructions in one FCM clock cycle, as compared to the 4 cycles
it takes in the PPC405. Use of the V5FXT could potentially provide over a 2x increase in
overall performance over the V4FX implementation.
Assuming a clock speed increase, it would be expected that the 2DIFFT() function would
once again dominate the rest of the functions in run time, as the run-time
of the other evaluation sub-functions would have been improved. This would be similar to
the initial execution time distribution of the system as discussed previously in Chapter 6 and shown
in Figure 6.1. In this case, the next step would be to once again speed up the 2DIFFT()
operation, which will be discussed in Section 7.3, as well as to employ various other potential
optimization methods which will be discussed in the following sections.
7.2 Utilization of Second PowerPC
The utilization of the second PowerPC405 could potentially provide a means for vast per-
formance improvement. The fitness evaluation function is completely separable per in-
dividual, allowing two fitness evaluations to occur in parallel, with no data dependencies
between the evaluations.
The diagram in Fig. 7.1 depicts a dual-processor system with an FFT co-processor
attached to each processor core. Each processor has its own local BRAM, as well as one
shared BRAM block which could be used to transfer data between the two. The processors
would also share the DDR2 memory, but would have their own local stack and heap address
space allocated to them. An example memory map for this system, omitting the shared
BRAM, is shown in Figure 7.2.
In this system, CPU 0 would serve as the master processor, and CPU 1 would be the
slave processor. The master would execute the entire algorithm, with the exception of the
partitioned evaluation function, and would transfer necessary data to the slave processor,
through the shared BRAM, such that the slave processor would execute only the fitness
evaluation function, using this data. The slave would then write back to shared BRAM the
corresponding fitness values it determined in its evaluation. There is a Processor Version
Register (PVR) on each PowerPC which allows the software to identify which processor is
currently executing. Synchronization between the two processors can be achieved simply
by using a status flag for each processor and storing it in the BRAM. These flags can be
polled when appropriate.
Sharing on-chip BRAM can provide an extremely fast way to pass kilobyte sized data
between the processors [19]. Chromosome data to be passed to the slave consists of 640
chars, or 640 bytes, which is quite small and easily manageable. After performing the
evaluation, the slave would then write back a float representation of the fitness values,
which in this case would be 2 floats, or 8 bytes. Due to low communication latency through
the use of shared BRAM, and the addition of an FFT co-processor connected to the slave
processor, a speedup of 2x over the original speedup is expected by utilizing a system with
dual-processors.
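The flag-based handshake through shared BRAM described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the structure layout, the flag names, and the function names are hypothetical, the address at which the structure would be mapped is platform specific, and cache management and memory ordering between the two cores are not addressed.

```c
#include <stdint.h>
#include <string.h>

#define CHROMOSOME_BYTES 640  /* chromosome size quoted in the text */

/* Hypothetical layout of the shared BRAM region. */
typedef struct {
    volatile uint32_t work_ready;    /* set by master, cleared by slave */
    volatile uint32_t result_ready;  /* set by slave, cleared by master */
    uint8_t chromosome[CHROMOSOME_BYTES];
    float   fitness[2];              /* two objective values, 8 bytes */
} shared_mailbox;

/* Master: hand one chromosome to the slave. */
void master_post(shared_mailbox *mb, const uint8_t *chromosome)
{
    memcpy(mb->chromosome, chromosome, CHROMOSOME_BYTES);
    mb->result_ready = 0;
    mb->work_ready = 1;              /* publish the flag last */
}

/* Slave: poll for work; returns 1 and copies the chromosome if work is pending. */
int slave_try_take(shared_mailbox *mb, uint8_t *chromosome_out)
{
    if (!mb->work_ready)
        return 0;
    memcpy(chromosome_out, mb->chromosome, CHROMOSOME_BYTES);
    mb->work_ready = 0;
    return 1;
}

/* Slave: write back the two fitness values. */
void slave_post_result(shared_mailbox *mb, float f0, float f1)
{
    mb->fitness[0] = f0;
    mb->fitness[1] = f1;
    mb->result_ready = 1;            /* publish the flag last */
}

/* Master: poll for a result; returns 1 and copies the fitness values if ready. */
int master_try_collect(shared_mailbox *mb, float fitness_out[2])
{
    if (!mb->result_ready)
        return 0;
    fitness_out[0] = mb->fitness[0];
    fitness_out[1] = mb->fitness[1];
    mb->result_ready = 0;
    return 1;
}
```

The polling functions return immediately, so the master can evaluate another individual on its own FFT co-processor between polls instead of waiting on the slave.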
Xilinx provides an example dual-PowerPC system, which uses shared memory, an
XPS Mutex core, and an XPS Mailbox core [20]. This system is very useful as a
template for a custom dual-processor system. The example system from Xilinx also includes
example C code for using each of the features of the system.
[Block diagram: two PPC405 CPUs (200 MHz), each with a local 256kb BRAM block and an FFT
core (100 MHz) attached over the FCB; a shared 8kb BRAM block; 256 MB DDR2 SDRAM through
the MPMC; RS232 UART Lite; SysACE CompactFlash; PLBv46-to-PLBv46 bridge.]
Figure 7.1: Dual PowerPC SPEA2 system with FFT co-processors and shared BRAM
Figure 7.2: Dual PowerPC memory map [19]
7.3 2D IFFT in Parallel
A two-dimensional IFFT operation must be performed on each range cell for each
individual. Due to the way the VPH is organized, only one 1-D row IFFT must be calculated,
but 256 1-D column IFFTs must be performed. These 1-D column IFFTs are inherently
separable. This can be exploited in hardware by using multiple instances of the Xilinx
XFFT core and processing multiple column IFFTs in parallel.
As a starting point, the pipelined, streaming I/O architecture could be instantiated
multiple times, as it is the highest-performance FFT architecture available from the
Xilinx CoreGen library. Given that 37% of total slices were occupied in the SPEA2
hardware optimized system, as shown previously in Figure 6.7, it may be possible to add
multiple instances of this core in hardware. This could provide a significant performance
improvement over a single-core implementation.
In the current implementation, the 2D IFFT function takes 0.43123 seconds to complete. A
single IFFT in hardware, including APU loads and stores, takes 0.000050 seconds. This
means that a total of 0.01285 seconds is spent processing all 257 of the 1D IFFTs (1 row
and 256 columns) in hardware, while 0.41838 seconds, or 97% of the 2D IFFT function
run-time, is spent in software. The only software operations involve copying the first
row IFFT to all of the other rows, as described in Section 6.3.5. These operations can be
removed, as each row of data is the same. Instead, when processing each column IFFT, a
single value could be transferred to hardware and loaded into the XFFT core repeatedly,
since every value in a column is identical. This would reduce the software overhead by
roughly 128x, from 0.41838 seconds to approximately 0.0032 seconds. It would also reduce
the number of APU loads, as only one value would have to be loaded for each column IFFT.
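The timing breakdown above can be reproduced with a few lines of C. This is only a sketch
of the arithmetic: the function names are hypothetical, and the count of 257 1D IFFTs
(1 row plus 256 columns) is inferred from the quoted figures rather than measured
separately. The 128x reduction factor is taken from the text as given.

```c
#include <assert.h>

/* Figures quoted in the text, in seconds. */
#define T_2D_IFFT_TOTAL 0.43123   /* measured 2D IFFT function time */
#define T_SINGLE_IFFT   0.000050  /* one hardware 1D IFFT incl. APU transfers */
#define NUM_1D_IFFTS    257       /* assumed: 1 row IFFT + 256 column IFFTs */

/* Time spent in hardware across all 1D IFFTs. */
double hw_time(void)
{
    return NUM_1D_IFFTS * T_SINGLE_IFFT;
}

/* Remaining time, attributed to software (row copying). */
double sw_time(void)
{
    return T_2D_IFFT_TOTAL - hw_time();
}

/* Software share of the 2D IFFT run-time, as a fraction of the total. */
double sw_fraction(void)
{
    return sw_time() / T_2D_IFFT_TOTAL;
}

/* Software time after dividing the overhead by a reduction factor. */
double sw_time_reduced(double factor)
{
    assert(factor > 0.0);
    return sw_time() / factor;
}
```

With these inputs, hw_time() gives 0.01285 seconds and sw_time() gives 0.41838 seconds,
about 97% of the total. The quotient sw_time_reduced(128.0) evaluates to roughly 0.00327
seconds, which is the figure quoted above as 0.0032.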
With this software optimization in place, the overall function could be accelerated
further. The pipelined, streaming I/O XFFT core used in this implementation requires 4
BRAMs and 48 XtremeDSP slices. The V4FX can support a total of 232 BRAMs and 128
XtremeDSP slices. The current system with one XFFT core uses 144 BRAMs and 48 XtremeDSP
slices. This leaves enough resources for one more XFFT core to be added to the system,
with the XtremeDSP slices being the limiting factor. The number of XtremeDSP slices is
directly related to the input data width and phase factor width, which are both 24 bits
in this case. It may be possible to reduce the number of bits and still maintain the
resolution needed for correct results, although 16 bits has been tested and is not
enough. Reducing the bit width would decrease the number of XtremeDSP slices needed, and
may allow another pipelined, streaming I/O XFFT core to be added as well.
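The resource check in this paragraph amounts to an integer division per resource type,
with the smaller result deciding how many cores fit. A minimal sketch is given below; the
struct and function names are hypothetical and only BRAMs and XtremeDSP slices are
modeled, so slice and routing limits are ignored.

```c
#include <assert.h>

/* Counts of the two resource types considered in the text. */
typedef struct {
    int bram;  /* block RAMs */
    int dsp;   /* XtremeDSP slices */
} resources_t;

/* Number of additional cores that fit, given the device totals, the
 * resources already in use, and the cost of one more core. */
int extra_cores_that_fit(resources_t total, resources_t used,
                         resources_t per_core)
{
    assert(per_core.bram > 0 && per_core.dsp > 0);
    int by_bram = (total.bram - used.bram) / per_core.bram;
    int by_dsp  = (total.dsp  - used.dsp)  / per_core.dsp;
    int n = (by_bram < by_dsp) ? by_bram : by_dsp;
    return (n < 0) ? 0 : n;
}
```

For the V4FX figures above (232 BRAMs and 128 XtremeDSP slices in total, 144 and 48 in
use, 4 and 48 per core), the BRAMs would allow 22 more cores but the XtremeDSP slices
only 1, so the function returns 1.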
However, the most powerful version of the V5FXT offers 384 XtremeDSP slices. This would
allow for 8 pipelined, streaming I/O XFFT cores in the system. If the second processor
discussed in Section 7.2 is utilized, this would allow 4 XFFT cores per PowerPC core.
This could reduce the effective time per IFFT by a further factor of 4, which would bring
the total time for the 2D IFFT operation down to 0.0064125 seconds from the original
0.43123 seconds, if the software overhead is reduced as mentioned. This would result in a
2.68x speedup over the PC implementation of the 2D IFFT operation, without factoring in a
second PowerPC with an FFT co-processor executing the evaluation function in parallel.
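The projected figure of 0.0064125 seconds follows from dividing the hardware IFFT time
among 4 cores and adding the residual software time. The sketch below shows this
arithmetic; the function name is hypothetical, and it assumes ideal scaling across cores
with no added transfer or arbitration cost, which a real system may not achieve.

```c
#include <assert.h>

/* Projected 2D IFFT time when the 1D IFFT work is spread evenly over
 * several XFFT cores and the remaining software overhead is a fixed cost.
 * Assumes ideal (linear) scaling with the number of cores. */
double projected_2d_ifft_time(double hw_time_one_core, int cores,
                              double sw_residual)
{
    assert(cores > 0);
    assert(hw_time_one_core >= 0.0 && sw_residual >= 0.0);
    return hw_time_one_core / cores + sw_residual;
}
```

Calling projected_2d_ifft_time(0.01285, 4, 0.0032) gives 0.0032125 + 0.0032 = 0.0064125
seconds, matching the estimate in the text.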
A radix-based architecture could be used as an alternative. The other architecture
options include Radix-4 Burst I/O, Radix-2 Burst I/O, and Radix-2-Lite Burst I/O. The
diagram in Figure 7.3 shows a comparison of the resource usage versus throughput for all
of the available architectures. Radix-4 Burst I/O is the next obvious choice, as it
requires roughly half the resources of the streaming architecture. While the throughput
of Radix-4 is nearly half that of the streaming architecture, the Radix-4, like all of
the radix-based architectures, offers up to 12 channels of simultaneous FFTs in one core.
Determining the best architecture option would require adding each core as many times
as possible and measuring the throughput for each. If this optimization method is to be
integrated into a dual-processor system, resource usage should be taken into account such
that the entire FFT co-processor core could still fit into the fabric a second time and
be interfaced with the second processor.
Figure 7.3: XFFT Resource Usage vs. Throughput [21]
7.4 Migrate Power Calculations to FPGA Fabric
Migrating the power calculations to the FPGA fabric would provide a significant speedup,
if the software bottlenecks were addressed with an increased processor clock speed. This
method was not employed in the optimized hardware system because its benefit would not
be realized due to the overwhelming magnitude of the evaluation sub-function execution
times, as discussed in Section 6.6.2.
This method would involve adding additional multipliers to the FFT co-processor core.
These multipliers would be used to calculate the power of the resulting IFFT operation,
and could be arranged such that very little latency would be added. The only delay which
would be realized would be that of the very last power calculation, which is
insignificant in comparison to the potential speedup. The data being unloaded from the
FFT core at each clock cycle in the FFT UNLOAD state would be routed through the
multipliers, and the squared real and imaginary parts would be summed and loaded into the
output FIFO. This would also cut the number of stores in half, meaning only half of the
store transfers through the APU to the processor core would be required. This would
essentially reduce the execution time of the power calculation to a negligible amount.
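A software model of the proposed datapath is sketched below to show why the store count
is halved: each complex sample (two words) becomes one power word. The function name and
the fixed-point integer representation are assumptions made for illustration; the samples
are assumed to fit within a 24-bit signed range, so the 64-bit sum cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the power stage: for each unloaded complex sample, square the
 * real and imaginary parts and sum them. One output word is produced per
 * sample instead of two, which halves the number of stores.
 * Assumes |re[i]| and |im[i]| fit in 24-bit signed range.
 * Returns the number of output words written. */
size_t power_from_iq(const int32_t *re, const int32_t *im,
                     int64_t *power, size_t n)
{
    assert(re != NULL && im != NULL && power != NULL);
    for (size_t i = 0; i < n; i++) {
        int64_t r = re[i];
        int64_t q = im[i];
        power[i] = r * r + q * q;
    }
    return n;
}
```

For n complex samples, the processor would receive n words rather than 2n. In hardware,
the two multiplications and the addition would be pipelined behind the unload stage
rather than executed in a loop.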
7.5 Migrate Entire Evaluation Function to FPGA Fabric
It has yet to be determined whether the entire fitness evaluation function could be
successfully migrated to the FPGA fabric for speedup. If deemed a viable method, this
could provide speedup and also a way to perform the fitness evaluation in parallel, if
there were enough slices for more than one fitness evaluation core to fit in the fabric.
There are sections of the fitness evaluation function which would not be ideal for
hardware speedup. However, the benefits of having the entire function in hardware may
outweigh the cost of those operations which would see no benefit or degraded performance.
There is also a trade-off between flexibility and performance if this method were
employed. With only pieces of the fitness evaluation function offloaded to hardware, the
surrounding C code and fundamental operations can be modified quickly if need be. If the
function is migrated entirely to hardware, such changes will be much more time consuming.
In any case, the APU would be used for transferring data to the fabric, and user defined
instructions (UDIs) would be defined for each type of fitness objective being determined.
This would provide a logical organization of the operations in hardware, such that fitness
objectives could be selected dynamically in the software without changing the hardware.
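The idea of selecting a fitness objective at run time, with one UDI per objective, can be
illustrated in software with a dispatch table. Everything in this sketch is hypothetical:
the objective identifiers, the two stand-in functions (a sum and a maximum), and the table
itself are placeholders, since the actual objectives and UDI encodings are not defined
here. In a real system each table entry would wrap the corresponding APU instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical objective identifiers; placeholders only. */
typedef enum { OBJECTIVE_A = 0, OBJECTIVE_B = 1, OBJECTIVE_COUNT } objective_t;

typedef float (*objective_fn)(const float *data, size_t n);

/* Software stand-in for the first UDI: sum of the data. */
static float udi_objective_a(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Software stand-in for the second UDI: maximum of the data (0 if empty). */
static float udi_objective_b(const float *data, size_t n)
{
    if (n == 0)
        return 0.0f;
    float best = data[0];
    for (size_t i = 1; i < n; i++)
        if (data[i] > best)
            best = data[i];
    return best;
}

/* One entry per objective; software picks the entry at run time. */
static const objective_fn udi_table[OBJECTIVE_COUNT] = {
    udi_objective_a,
    udi_objective_b
};

/* Evaluate the selected objective without changing the "hardware" table. */
float evaluate_objective(objective_t which, const float *data, size_t n)
{
    assert((int)which >= 0 && (int)which < OBJECTIVE_COUNT);
    return udi_table[which](data, n);
}
```

Adding a new objective in this model means adding one table entry, which mirrors adding
one UDI to the fabric while leaving the calling software unchanged.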
7.6 Hybrid FPGA Alternative Solutions
This thesis focused on a Hybrid FPGA implementation platform. Performance results did
not meet expectations using this platform, and the previous sections discussed possible
methods of improving performance using a hardware/software approach on a Hybrid FPGA.
The following sections discuss alternative implementation platforms to the Hybrid FPGA.
Cluster computing is proposed in Section 7.6.1 and the Cell processor is proposed in
Section 7.6.2.
7.6.1 Cluster Computing
This algorithm is well suited to cluster computing. The fitness evaluation could be
distributed amongst the CPUs in a cluster. Communication costs would be low because
evaluating the fitness of a single member requires only a small amount of data, as
mentioned in Section 7.2. The performance would scale up with the number of processors,
but returns may diminish if the communication latency overshadows the speed benefits in
such a fine-grained approach.
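One simple way to distribute the fitness evaluation is a block partition of the
population, where each processor evaluates a contiguous range of members. The sketch
below computes that range; the type and function names are hypothetical, and the
population size and processor count are left as parameters since no cluster
configuration is specified here.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous range of population members assigned to one processor. */
typedef struct {
    size_t first;  /* index of the first member */
    size_t count;  /* number of members to evaluate */
} work_range_t;

/* Block partition: split `population` members over `workers` processors so
 * that the counts differ by at most one. `rank` is the processor index,
 * 0 <= rank < workers. */
work_range_t partition_population(size_t population, size_t workers,
                                  size_t rank)
{
    assert(workers > 0 && rank < workers);
    size_t base  = population / workers;
    size_t extra = population % workers;  /* first `extra` ranks get one more */
    work_range_t r;
    r.count = base + ((rank < extra) ? 1 : 0);
    r.first = rank * base + ((rank < extra) ? rank : extra);
    return r;
}
```

For example, 10 members over 3 processors gives ranges of 4, 3, and 3 members starting at
indices 0, 4, and 7. Each processor would receive only the chromosomes in its range and
return the corresponding fitness values.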
7.6.2 Cell Processor
The Cell processor offers a compromise between cluster computing and a Hybrid FPGA
implementation. Due to the architecture of the Cell, 2D IFFT operations can still be
broken down into 1D IFFTs and distributed amongst multiple processing elements. Thus, the
Cell could potentially provide a run-time performance increase on a single platform,
rather than on a cluster of processors as discussed in Section 7.6.1.
The Cell processor supports multiple operating systems, including Linux. It consists of
a power processor element (PPE) and its L2 cache, multiple synergistic processor elements
(SPEs) [10], each of which has its own local memory (LS), a high-bandwidth internal
element interconnect bus (EIB), two configurable non-coherent I/O interfaces, and a
memory interface controller (MIC) [13]. An overview of the architecture of the Cell is
shown in Figure 7.4.
The Cell was designed to be the core of the Sony PlayStation 3 gaming console. A high
performance PowerPC core serves as the PPE and controls 8 SPEs. Various programming
models could be applied to such a system. For the purpose of improving the performance
of the SPEA2 software in this thesis, the most obvious programming model would be data
parallelism, where identical operations (such as 1D IFFTs or power calculations) are
performed on distinct data.
Figure 7.4: Overview of the Cell processor [15]
An evaluation of the implementation of a standard FFT algorithm on the Cell processor
was done in [15]. This paper examined the approach of cooperatively executing 1D FFTs
across the SPEs, as well as running the 1D FFTs of a 2D FFT on a single SPE, in parallel.
The performance results of 1D and 2D FFTs on a Cell are shown in Figure 7.5. The Cell FFT
(naive implementation) performance is compared with that of tuned FFT routines running on
superscalar (Opteron), VLIW (Itanium2), and vector (X1E) architectures. Performance is
also compared with a Cell+ platform, which is a variant of the Cell architecture meant to
yield better double precision performance.
Figure 7.5: Performance of 1D and 2D FFT in DP (top) and SP (bottom) [15]
Double precision FFT performance of the Cell is at least 12x that of the Itanium2 for 1D
FFTs and nearly 30x for a large 2D FFT. Double precision performance more than doubles
for the Cell+ architecture over the Cell in all cases. FFT performance on a Cell improves
as the number of points increases [15]. This is valuable because as the radar simulation
scenario is scaled up, and increased image resolution is required, the IFFT size will
need to be increased. Thus, the Cell exhibits characteristics which may make it a good
choice of implementation platform for the scenario discussed in this thesis.
Chapter 8
Conclusions
This thesis began with MATLAB code from a previous simulation of SPEA2 applied to a
radar scenario, which was then translated to C code and tested on a PC, adapted to run on an
embedded PowerPC 405 core on a V4FX60 board, and then optimized in both hardware and
software. Much effort went into studying the computational complexities of the algorithm,
and their corresponding execution times. The main bottleneck of the system was addressed
using hardware, and new software bottlenecks then arose.
Ideally the hardware/software implementation of this algorithm on a V4FX60 would
outperform that of a PC, but this was not the case. Software bottlenecks, discovered very
late in the process of this thesis, caused a degradation of overall performance. Insufficient
processor clock speeds were identified as the culprit.
Assuming a system with a higher clock rate, many suggestions for future work to enhance
the current implementation were discussed. These include utilizing the second PowerPC 405
to distribute the evaluation function in software and hardware, executing multiple 1D
IFFTs in parallel in hardware, as well as migrating power calculations and possibly the
entire evaluation function to hardware. These methods may offer the performance needed
to achieve speedup over a PC implementation, but the current processor clock speed is too
low to realize significant benefit from attempting them.
Alternative methods were also offered, including cluster computing and a Cell
implementation. It would be most beneficial to speed up this algorithm using a single
system, such as the V5FXT or a Cell, because this would allow for a more mobile and
possibly lower power solution than a computing cluster. Ultimately the goal is to place
this system inside of an airplane and achieve real-time performance. This thesis was the
first investigation of the feasibility of that goal.
Bibliography
[1] V. Amuso, P. Antonik, R. Schneible, and Y. Zhang. Evolutionary Computation Ap-
proach to Multi Mission Waveform Design. RADAR 2002, pages 454–458, October
2002.
[2] V. Amuso and J. Enslin. The Strength Pareto Evolutionary Algorithm 2 (SPEA2)
Applied to Simultaneous Multimission Waveform Design. Waveform Diversity and
Design Conference, pages 407–417, June 2007.
[3] V. Amuso, R. Schneible, Y. Zhang, and P. Antonik. A Strength Pareto Evolution-
ary Algorithm (SPEA) for Multi-mission Radar Waveform Optimization. Waveform
Diversity and Design Conference, November 2004.
[4] D. L. Bekker. Hardware and software optimization of fourier transform infrared spec-
trometry on hybrid-fpgas. Master’s thesis, Rochester Institute of Technology, 2007.
[5] S. Bleuler, M. Brack, L. Thiele, and E. Zitzler. Multiobjective Genetic Programming:
Reducing Bloat Using SPEA2. Congress on Evolutionary Computation 2001, 1:536–
543, May 2001.
[6] C. A. Coello Coello, D. A. Van Veldhuizen, and G. B. Lamont. Evolutionary Algorithms
for Solving Multi-Objective Problems. Kluwer Academic Publishers, 2002.
[7] D. Corne, J. D. Knowles, and M. J. Oates. The Pareto Envelope-Based Selection
Algorithm for Multi-objective Optimisation. In PPSN VI: Proceedings of the 6th
International Conference on Parallel Problem Solving from Nature, pages 839–848,
London, UK, 2000. Springer-Verlag.
[8] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjec-
tive genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation,
6(2):182–197, April 2002.
[9] J. W. Enslin. An Evolutionary Algorithm Approach to Simultaneous Multi-Mission
Radar Waveform Design. Master’s thesis, Rochester Institute of Technology, 2007.
[10] B. Flachs, S. Asano, and S. Dhong. A Streaming Processing Unit for a CELL
Processor. In ISSCC Dig. Tech. Papers, pages 135–145, February 2005.
[11] IBM, Hopewell Junction, NY. Product Overview PowerPC 405 CPU Core, September
2 2006.
[12] A. Osyczka and J. S. Gero (Ed.). Design Optimization. Pearson Education, 1985.
Chapter 7: Multicriterion Optimization for Engineering Design.
[13] D. Pham, S. Asano, and M. Bollier. The Design and Implementation of a
First-Generation CELL Processor. In ISSCC Dig. Tech. Papers, pages 184–185, February
2005.
[14] M. Soumekh. Synthetic Aperture Radar Signal Processing. John Wiley and Sons,
New York, 1999.
[15] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The Potential
of the Cell Processor for Scientific Computing. Technical report, Lawrence Berkeley
National Laboratory, November 6 2006.
[16] Xilinx, Inc., San Jose, CA. ML410 Embedded Development Platform, First Quarter
2005.
[17] Xilinx, Inc., San Jose, CA. PowerPC 405 Processor Block Reference Guide, July 20
2005.
[18] Xilinx, Inc., San Jose, CA. Accelerated System Performance with APU-Enhanced
Processing, September 28 2007.
[19] Xilinx, Inc., San Jose, CA. Designing Multiprocessor Systems in Platform Studio,
November 21 2007.
[20] Xilinx, Inc., San Jose, CA. Dual Processor Reference Design Suite, November 20
2007.
[21] Xilinx, Inc., San Jose, CA. Fast Fourier Transform v5.0, October 10 2007.
[22] Xilinx, Inc., San Jose, CA. Virtex-4 Family Overview, September 28 2007.
[23] Xilinx, Inc., San Jose, CA. Embedded Processor Block in Virtex-5 FPGAs, May 13
2008.
[24] Xilinx, Inc., San Jose, CA. Virtex-4 FPGA Data Sheet: DC and Switching Charac-
teristics, April 10 2008.
[25] Xilinx, Inc., San Jose, CA. Virtex-5 Family Overview, June 18 2008.
[26] E. Zitzler, M. Laumanns, and L. Thiele. SPEA2: Improving the Strength Pareto Evo-
lutionary Algorithm for Multiobjective Optimization. In Proc. EUROGEN 2001 Evo-
lutionary Methods for Design, volume Optimization and Control With Applications
to Industrial Problems, Barcelona, Spain, September 2002. CIMNE.
[27] E. Zitzler and L. Thiele. Multi-objective Evolutionary Algorithms: A Comparative
Case Study and the Strength Pareto Approach. IEEE Transactions on Evolutionary
Computation, 3(4):257–271, November 1999.
