University of Alabama in Huntsville

LOUIS
Theses

UAH Electronic Theses and Dissertations

2013

A hardware implementation of a traveling salesman problem
using genetic algorithm with migration
Jessica Mintz

Recommended Citation
Mintz, Jessica, "A hardware implementation of a traveling salesman problem using genetic algorithm with
migration" (2013). Theses. 34.
https://louis.uah.edu/uah-theses/34


ACKNOWLEDGMENTS
Throughout this work, multiple people have offered assistance that, without such,
this thesis would not have been possible. First, I would like to thank Dr. B. Earl Wells for
his continuous guidance throughout my entire graduate school career. Additionally the
other members of my committee have been very gracious with their time and assistance.
Lastly, the faculty and staff in the Electrical and Computer Engineering Department and
Graduate School have been invaluable.
I would also like to thank my husband, family, and friends who have supported
and encouraged me throughout this degree.

TABLE OF CONTENTS

List of Figures
List of Tables

Chapter

I. INTRODUCTION
   1.1 Optimization and Decision Problems
   1.2 The Non-Deterministic Polynomial-Time Hard Set of Problems
       1.2.1 The Traveling Salesman Problem
   1.3 The Genetic Algorithm
       1.3.1 Steady-State and Generational Genetic Algorithms
       1.3.2 Differences in Genetic Algorithms versus Conventional Problem Solving
       1.3.3 Advantages and Disadvantages of Genetic Algorithms
       1.3.4 The Parallel Genetic Algorithm
       1.3.5 The Migration Operator
   1.4 Project Overview

II. RELATED WORK
   2.1 Introduction of a Hardware Genetic Algorithm
   2.2 Parallel Modifications to the Hardware Genetic Algorithm
   2.3 Other Algorithm Modifications
   2.4 Summary

III. A HARDWARE-BASED GENETIC ALGORITHM WITH MIGRATION
   3.1 Solving the Traveling Salesman Problem with a Genetic Algorithm in Hardware
       3.1.1 Initialization Phase
       3.1.2 Genetic Algorithm Phase
   3.2 User Interface
   3.3 Single FPGA Implementations
       3.3.1 Serial Implementation
       3.3.2 Parallel Implementation
   3.4 Multiple FPGA Implementations with Migration

V. EXPERIMENTAL RESULTS
   5.1 Hardware Setup
       5.1.2 Hardware Development Environment
   5.2 Resource Usage
   5.3 Best Tour Evaluations
       5.3.1 Preliminary Evaluations
       5.3.2 Hardware Evaluations
   5.4 Timing Estimations and Evaluations
       5.4.1 Timing Estimations
       5.4.2 Timing Evaluations
   5.5 Project Comparisons

VI. CONCLUSIONS
   6.1 Project Review
   6.2 Future Research
       6.2.1 Further Parallelization
       6.2.2 Expandability
       6.2.3 Scalability

REFERENCES

LIST OF FIGURES

Figure
3.1 Systolic Sorting String
3.2 2-Point Crossover Example
3.3 Mutation Example
3.4 User Interface
3.5 Single FPGA Block Diagram
3.6 Inside GA Core Block Diagram
3.7 Implementation with 4-Sorts Block Diagram
3.8 Implementation with Migration Block Diagram
4.1 Logic Element Breakdown for All Designs
4.2 Increasing Population Size Effect on Best Tour
4.3 Increasing Number of Generations Effect on Best Tour
4.4 Best Tour over Generations in Single Serial Implementation
4.5 Best Tour over Generations in Single Parallel with 4 Sorts Implementation
4.6 Best Tour over Generations in Single Parallel with 8 Sorts Implementation
4.7 Best Tour over Generations in Multiple Serial Implementation
4.8 Best Tour over Generations in Multiple Parallel Implementation
4.9 Genetic Algorithm Core and Population Fitness Timing for all Implementations
4.10 Sorting Components Timing for all Implementations
4.11 Total Measured Execution Time for all Implementations
4.12 Timing Estimation Percent Overage all Implementations
4.13 Execution time Comparisons with Previous Work
4.14 Resource Usage Comparison with Previous Work

LIST OF TABLES

Table
4.1 Clock Cycles Constant in All Design Implementations
4.2 Clock Cycles for Varying Sorting Components in All Design Implementations per
    Population Member

CHAPTER I

INTRODUCTION

1.1 Optimization and Decision Problems
In an optimization problem, it is necessary to find either the largest or smallest
value a function may take. The difficulty of these problems stems from the fact that there
is only one best solution. The time and methods required to find this best solution vary
from problem to problem and can range from trivial to highly complex. Optimization
problems can be divided into two categories, continuous or combinatorial, based on
whether the variables are continuous or discrete. Discrete variables fall into the
combinatorial category, where the solution "takes the form of an integer, permutation, or
graph from a finite set" [1]. An example of an optimization problem is modeling heat
distribution over a 2-D or 3-D surface [2].
In general, these problems may also be posed as decision problems, although this is not
always readily apparent. In a decision problem, an algorithm must be able to answer "yes" or
"no" in a given case, for example, "is this a lower number?" The path followed
differs based on a "yes" or "no" answer. Being faced with two options and selecting
one constitutes the "decision" in a "decision problem" [3].

1.2 The Non-Deterministic Polynomial-Time (NP) Hard Set of Problems
Even with the computing power available today, solutions to some optimization
and decision problems would take years to find by exhaustively
searching and comparing every possible solution. For example, one such problem, the
Traveling Salesman Problem, would take upwards of 252,333,390,232,297 years to find
the best solution for a 30-city tour if each possible solution were evaluated with today's
computing power [4]. The time required to find a solution increases exponentially as the
problem's size increases, sometimes reaching millions or trillions of years.
A candidate solution may be verified quickly; however, no efficient method for locating
the solution in the first place is known. Optimization problems of this type are
classified as NP-Hard. To understand what it means to be NP-Hard,
polynomial-time (P) and non-deterministic polynomial-time (NP)
optimization problems must first be understood. To be classified as a P-type optimization
problem, the solution must be found by a known polynomial-time algorithm on a
deterministic machine. NP-type optimization problems are problems
whose solution can be found in polynomial time on a nondeterministic machine,
that is, a machine with unlimited parallelism. Every P-type optimization problem is
trivially also of NP type. To be classified as NP-Hard, then, the optimization problem exists in
the NP-type realm, but it is unknown whether it exists in the P-type realm. In other words,
the solution may be found by a known polynomial-time algorithm on a nondeterministic
machine, but there is no known polynomial-time solution on a deterministic machine [5].
Examples of NP-Hard optimization problems are the Traveling Salesman Problem and the heat
distribution model.

1.2.1 The Traveling Salesman Problem
The Traveling Salesman Problem was first formulated in 1930 and is one of
the more studied optimization problems [6]. It is also used as a benchmark for various
optimization methods. The problem has many practical applications in the fields of
modeling distribution systems, job scheduling, circuit board drilling, and DNA mapping,
while its purest form lies in planning and logistics. In other applications, for example
DNA mapping, a city may represent a DNA fragment and
distance the measurement between DNA fragments. Additionally, other constraints
such as resources and/or time can increase the complexity of the given problem [7].
In this classical problem, a salesman travels from city to city, visiting each only
once, and ends back at the city where he started. The goal of the problem is to find the shortest
overall route. Formally, the problem can be defined on a graph G = (V, E) where vertices
correspond to cities and edges correspond to connections between cities [6]. Each edge is
assigned a weight representing the distance between two cities. For any given
graph with N cities, this yields (N-1)! possible routes. This problem
may be constructed as a decision problem by asking, "is there a tour through the graph G
starting at this vertex whose length is at most k?" Determining the shortest route for
a low number of cities is trivial; however, as the number of cities increases,
the time required to exhaustively determine the shortest overall route increases exponentially.
As calculating the distance for each possible route is infeasible for larger numbers of
cities, solving the Traveling Salesman Problem, like other NP-Hard problems, must be
done through approximation algorithms that find a near-optimal, or "good," solution in a
reasonable amount of time [8-11].
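To make the (N-1)! growth concrete, the following minimal Python sketch exhaustively evaluates every tour of a small instance and also answers the decision form of the problem. It is an illustration only and not part of the hardware design; the function names and the 4-city distance matrix DIST are hypothetical values chosen for the example.

```python
from itertools import permutations

def tour_length(tour, dist):
    """Length of the closed tour: visit the cities in order, then return to the start."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def brute_force_tsp(dist):
    """Evaluate every tour that starts and ends at city 0."""
    n = len(dist)
    best_tour, best_len, evaluated = None, float("inf"), 0
    for rest in permutations(range(1, n)):      # (n-1)! orderings of the other cities
        tour = (0,) + rest
        length = tour_length(tour, dist)
        evaluated += 1
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len, evaluated

def decision_version(dist, k):
    """Decision form: is there a tour whose length is at most k?"""
    return brute_force_tsp(dist)[1] <= k

# Hypothetical symmetric distance matrix for 4 cities (illustrative values only).
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

if __name__ == "__main__":
    print(brute_force_tsp(DIST))
```

With four cities only 3! = 6 tours are evaluated; each additional city multiplies that count by the new value of N-1, which is why the exhaustive approach is abandoned for larger instances.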

1.3 The Genetic Algorithm
Adaptive heuristic search algorithms have been utilized to solve NP-Hard
problems. A heuristic search algorithm accomplishes optimization by improving a
possible solution through multiple iterations. Such algorithms do not guarantee finding the overall
best solution, only a good solution in a shorter amount of time than
exhaustively comparing all possible solutions. One such technique is the Genetic
Algorithm developed by Dr. John Holland and described in his book, "Adaptation in
Natural and Artificial Systems" [12]. Additional techniques utilized to solve these
optimization problems include Particle Swarm Optimization and hill-climbing techniques such as
Simulated Annealing [13-16].
The overall goal of a genetic algorithm is to start with an initial group of random
solutions which are then merged and blended to produce, hopefully, better solutions that
will eventually converge to the best solution. The basic structure of a genetic algorithm is
the same for all problems. This structure may be broken down into four major stages:
creating the initial group of possible solutions, evaluating the suitability of each solution,
creating new solutions from existing solutions, and replacing old solutions with new ones
while keeping the total number of solutions fixed. These steps are repeated until a
termination point is reached and a sub-optimal solution has been found.
A search space holds all possible solutions to a given problem. Rather than
including the complete search space, which goes against the genetic algorithm goal, only
a subset of the search space is actually evaluated. This is called the population. These
solutions are randomly generated at the beginning of run-time. Traditionally, solutions are
represented as strings of 0s and 1s, but other representations are possible. The solutions
are referred to as chromosomes, and each chromosome is made up of individual
components, or genes. Typically, each chromosome is of the same size with the same number of
genes. Other variations are possible, but they make new member creation more
complex. Each chromosome must be ranked through an objective function that is specific
to the problem being solved. This ranking determines the overall suitability of a given
chromosome when compared to the other chromosomes in the population.
After the initial population is created and evaluated, the algorithm transitions into
creating the next generation. First, two or more chromosomes from
the population, referred to as parents, are selected either randomly or based on their rankings. Their genes
are then blended to create a new chromosome referred to as the child chromosome. If
parents are selected based on their ranking, typically chromosomes with high rankings
are selected. This is analogous to the fittest members of a population surviving and
passing their traits to offspring. Ideally, in this case, the child has a
ranking that is better than, or at least as good as, that of the parents. In random selection, everything is
left to chance and the child's ranking may be higher or lower than either parent's. This step
of creating the child chromosome is called crossover. In addition to crossover, a child
chromosome may or may not undergo mutation. During mutation, genes in the child
chromosome are swapped. Mutation is beneficial in that it introduces new gene orderings
into the population and helps prevent the search from stopping at a local optimum.
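The crossover and mutation steps can be sketched for chromosomes that encode tours as permutations of city indices. The sketch below is illustrative only. It assumes an order-preserving crossover, which is one common choice for permutations and is not necessarily the operator used in this work, together with the swap mutation described above. The function names and the default mutation rate are hypothetical.

```python
import random

def order_crossover(p1, p2, rng):
    """Copy a random slice from p1, then fill the remaining positions with the
    missing genes in the order they appear in p2, so the child stays a permutation."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n + 1), 2))   # slice p1[a:b] with a < b
    child = [None] * n
    child[a:b] = p1[a:b]
    kept = set(p1[a:b])
    fill = (gene for gene in p2 if gene not in kept)
    for i in range(n):
        if child[i] is None:
            child[i] = next(fill)
    return child

def swap_mutation(chromosome, rng, rate=0.1):
    """With probability `rate` (an assumed value), swap two randomly chosen genes."""
    chromosome = list(chromosome)
    if rng.random() < rate:
        i, j = rng.sample(range(len(chromosome)), 2)
        chromosome[i], chromosome[j] = chromosome[j], chromosome[i]
    return chromosome

if __name__ == "__main__":
    rng = random.Random(0)
    parent1 = list(range(8))
    parent2 = list(reversed(parent1))
    child = order_crossover(parent1, parent2, rng)
    print(child, swap_mutation(child, rng, rate=1.0))
```

Both operators return a valid permutation, so every child still visits each city exactly once.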
The last step before the end of a generation is replacement. As each new child
chromosome is created, an older chromosome may be replaced. This step is necessary
since the population size remains fixed. Choosing a chromosome to replace may be random or
based on ranking. If based on chance, there is a possibility a weaker child chromosome
will replace a stronger older chromosome. Using a replacement method based on ranking
guarantees this will not happen as lower ranking older chromosomes are replaced by
higher ranking children. This is analogous to the weaker members of a population dying
out.
The above steps, crossover through replacement, repeat until a termination
point is reached. Termination may be as simple as stopping after a given number
of cycles, called generations, when a resource budget such as computation time has been
exhausted, or when a solution meeting certain requirements has been found. Other termination
conditions may compare the best solution obtained from the new generation with the
current best solution. If additional generations do not produce better results, the algorithm
may terminate. Upon termination, the solution to the problem is the best ranked
member of the population. If solutions were not evaluated as part of the loop, they are
evaluated at this time to determine the best member. The termination point is
crucial; premature termination could result in a less desirable solution [5-17].
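The four stages and the termination test described in this section can be put together in a short software sketch. This is a hedged illustration of the general flow and not the hardware implementation presented later. The tournament-style parent selection, the parameter defaults, and all function names are assumptions made for the example.

```python
import random

def tour_length(tour, dist):
    """Length of the closed tour over the distance matrix `dist`."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def make_child(p1, p2, rng, mutation_rate):
    """Crossover (slice of p1, remaining genes in p2's order), then optional swap mutation."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n + 1), 2))
    middle = p1[a:b]
    rest = [gene for gene in p2 if gene not in middle]
    child = rest[:a] + middle + rest[a:]
    if rng.random() < mutation_rate:
        i, j = rng.sample(range(n), 2)
        child[i], child[j] = child[j], child[i]
    return child

def run_ga(dist, pop_size=30, max_generations=500, patience=100,
           mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    # Stage 1: create the initial group of random solutions.
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    # Stage 2: evaluate each solution (a shorter tour ranks higher).
    fitness = [tour_length(t, dist) for t in population]

    def pick():
        # Ranking-biased selection (assumed): the better of two random members.
        i, j = rng.sample(range(pop_size), 2)
        return population[i] if fitness[i] <= fitness[j] else population[j]

    best, stale = min(fitness), 0
    for _ in range(max_generations):            # termination: generation limit
        # Stage 3: create a new solution from existing ones.
        child = make_child(pick(), pick(), rng, mutation_rate)
        child_fit = tour_length(child, dist)
        # Stage 4: ranking-based replacement keeps the population size fixed.
        worst = max(range(pop_size), key=fitness.__getitem__)
        if child_fit < fitness[worst]:
            population[worst], fitness[worst] = child, child_fit
        # Termination: stop when no improvement is seen for `patience` generations.
        if min(fitness) < best:
            best, stale = min(fitness), 0
        else:
            stale += 1
            if stale >= patience:
                break
    winner = min(range(pop_size), key=fitness.__getitem__)
    return population[winner], fitness[winner]
```

Calling run_ga with a square distance matrix returns the best tour found and its length. Because the search is heuristic, the result is a good solution but is not guaranteed to be the shortest tour.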
1.3.1 Steady-State and Generational Genetic Algorithms
Two major strategies have emerged for reproducing population members in
genetic algorithms. One method, the generational genetic algorithm, replaces the entire
population each generation. In the other method a few population members are
reproduced at each generation. This is called the steady-state genetic algorithm. Although
the generational genetic algorithm results in no offspring being worse than the members
being replaced, it does suffer from a large generation gap. Generation gap refers to the
fraction of the population that is replaced during each generation. In the generational
strategy, this value is at 100% and causes a longer wait period before population
members can be used for mating and the advancement of the algorithm. Replacing fewer
population members, obviously, results in a lower generation gap and is a benefit of the
steady-state method. An additional benefit of the steady-state method is the size of
memory necessary to hold the population. The steady-state method holds all population
members in a single memory space that is referenced by both selection and removal
components of the genetic algorithm. In this method, there is no need to transfer the
entire population to another memory space; it is simply pulled from and written to as
necessary during each generation's selection and removal components and is valid for
successive generations. The generational method, because it must reproduce the entire
population each generation, must use double the memory space to hold both the current
and previous generation's population members. At the end of each generation, the current
generation memory becomes the new previous generation memory and the previous
generation memory is overwritten to hold the new current generation memory. While the
new current generation memory is being produced, all population members must also
have their fitness evaluated before they are written to memory. Therefore, the fitness
function must be evaluated for each population member for each generation. In contrast,
the steady-state method calls the fitness function for the entire population only once after
the initial population is created. Afterwards the fitness function is only needed to evaluate
new children during each generation. A last advantage of the steady-state method
compared to the generational is that the steady-state method will not require the pipeline
to be flushed at the end of each generation and may offer a better ability to pipeline
components [18]. Because of these advantages, the steady-state genetic algorithm
method is the one chosen for this work. A generational genetic algorithm was studied in
[9] and will be used for comparison.
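The trade-offs described above can be expressed as simple counts. The sketch below is a simplified model under stated assumptions: one fitness evaluation per initial population member plus one per new child, and one memory slot per stored chromosome. The function names and the example figures (a population of 100, 50 generations, 2 children per generation) are hypothetical and are not measurements from this work.

```python
def fitness_evaluations(pop_size, generations, children_per_generation=None):
    """Total objective-function calls.

    children_per_generation=None models the generational strategy, in which the
    whole population is reproduced (and re-evaluated) every generation.
    """
    if children_per_generation is None:
        children_per_generation = pop_size
    return pop_size + generations * children_per_generation

def generation_gap(pop_size, children_per_generation):
    """Fraction of the population replaced during each generation."""
    return children_per_generation / pop_size

def memory_slots(pop_size, generational):
    """Chromosome slots needed: the generational strategy holds two populations."""
    return 2 * pop_size if generational else pop_size

if __name__ == "__main__":
    print("generational:", fitness_evaluations(100, 50), memory_slots(100, True))
    print("steady-state:", fitness_evaluations(100, 50, 2), memory_slots(100, False))
    print("generation gap:", generation_gap(100, 100), "vs", generation_gap(100, 2))
```

Under these assumptions the steady-state strategy needs far fewer fitness evaluations and half the population memory, at the cost of introducing fewer new members per generation.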
1.3.2 Differences in Genetic Algorithms versus Conventional Problem Solving
Genetic algorithms are a robust, powerful, and data-independent problem-solving
method. Compared to conventional problem-solving methods, there are four major
distinctions. First, genetic algorithms do not code specific parameters but rather a
parameter set. Second, the genetic algorithm searches for optimal solutions from
a population rather than from a single point. Third, the objective function
utilized to calculate how good a solution is does not use any auxiliary knowledge such as
the derivative information used in calculus-based methods. Lastly, the genetic algorithm is not
deterministic; it uses probabilistic transition rules instead. For certain problems, these
differences provide a better approach to problem solving than conventional
methods.
1.3.3 Advantages and Disadvantages of Genetic Algorithms
When compared to traditional optimization methods, the genetic algorithm brings
quite a few advantages to the field, such as being able to optimize with continuous or
discrete variables, working with numerically generated data, experimental data, or
analytical functions, and not requiring derivative information. Additionally, it can work
with a large number of variables and can return not only a single solution but a set of
good candidate solutions. A last advantage is that the genetic
algorithm is able to escape a local minimum, thereby expanding
solution possibilities [17].
Even with these advantages, however, the genetic algorithm may not be the best
way to solve every problem. An example where traditional methods are better suited is
finding the solution of a well-behaved, convex analytical function with few variables. In
this case, calculus-based methods find the minimum more quickly than the
genetic algorithm, but many realistic problems do not fall into this category.
Additionally, simpler problems may be better solved using faster methods [17].
1.3.4 The Parallel Genetic Algorithm
While a larger population size means more diversity and possibly better solutions
for creating future generations, it can also lead to a bottleneck when each individual
solution must be examined and its ranking calculated. Because every solution must be
sent through an objective function in order to be ranked, this becomes a time-consuming
process as population sizes increase and can become the most time-consuming process of
the whole algorithm when run completely in serial. The independence of each solution,
however, hints at the possibility of computing multiple solutions' fitness at the same time
or, in other words, in parallel.
Two different types of parallelism are data and task parallelism. In data
parallelism, data is distributed among computing nodes that run the same calculation,
while in task parallelism each processor executes a different thread or process on the
same or a different set of data. Because each solution will be evaluated using the same
objective function, data parallelism best describes the type of parallelism being exploited in
the genetic algorithm [19]. By determining the rankings of multiple solutions at the same
time, the bottleneck incurred in a serial run can be reduced. The genetic algorithm is
well suited for exploiting these parallel computations; they allow for a greater population
size without greatly hurting execution time.
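The data-parallel structure of fitness evaluation can be sketched in software as the same objective function mapped over independent population members. The example below is a structural analogy only and does not represent the hardware design. The function names and worker count are assumptions, and in standard CPython builds threads do not speed up CPU-bound work, so a process pool would be the usual substitute when real speedup is needed.

```python
from concurrent.futures import ThreadPoolExecutor

def tour_length(tour, dist):
    """Objective function: length of the closed tour."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def evaluate_serial(population, dist):
    """Rank every member one after another."""
    return [tour_length(tour, dist) for tour in population]

def evaluate_parallel(population, dist, workers=4):
    """Apply the same objective function to different members concurrently.
    `map` returns results in the same order as the input population."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda tour: tour_length(tour, dist), population))

if __name__ == "__main__":
    dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
    population = [[0, 1, 2, 3], [0, 1, 3, 2], [0, 2, 1, 3]]
    print(evaluate_parallel(population, dist))
```

Because no member's fitness depends on any other member, the parallel version returns exactly the same rankings as the serial one.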
1.3.5 The Migration Operator
Traditional genetic algorithms work independently on their own isolated
subpopulations until they reach their final solution. Adding a migration operator to the
algorithm allows new solutions to be generated not only through the crossover and
mutation operations but also from the introduction of new members from other
simultaneously running genetic algorithms. This migration operator must perform several
tasks such as selecting the population members to be sent, sending them, receiving them,
and then incorporating them into the receiving population. Convergence is fastest when
the best known solutions are sent and the worst known solutions are replaced. In addition
to the resources added to perform the necessary steps, communication overhead is also
introduced into the algorithm. The improvement to the final result determines whether
the added cost of migration is worthwhile.
There are two migration models. In the first, the island model, solutions
may be sent to any other genetic algorithm. This provides more freedom for the solutions'
movement, but at a higher communication cost. The other, the stepping stone
model, only allows solutions to be sent to certain neighboring genetic algorithms. Here
the solutions experience less freedom, but there is less communication overhead [20].
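The difference between the two models can be sketched as a choice of destinations for each sender. The example below is a minimal illustration under stated assumptions: the stepping stone neighbors are taken to be the two adjacent positions on a ring, each population sends a copy of its best member, and a receiver replaces its worst member only when the migrant is better. The function names, the ring topology, and the sample data are hypothetical.

```python
def destinations(i, n, model):
    """Indices of the populations that population i may send to."""
    if model == "island":
        return [j for j in range(n) if j != i]                 # any other population
    if model == "stepping_stone":
        return sorted({(i - 1) % n, (i + 1) % n} - {i})        # ring neighbors (assumed)
    raise ValueError(f"unknown model: {model}")

def migrate(islands, model):
    """One migration round over `islands`, a list of populations.

    Each population is a list of (tour_length, tour) pairs; shorter is better.
    Returns the number of messages sent, a stand-in for communication cost.
    """
    n = len(islands)
    best = [min(pop) for pop in islands]      # snapshot taken before any replacement
    messages = 0
    for src in range(n):
        for dst in destinations(src, n, model):
            messages += 1
            pop = islands[dst]
            worst = max(range(len(pop)), key=lambda k: pop[k][0])
            # Keep the migrant only if it beats the receiver's worst member.
            if best[src][0] < pop[worst][0] and best[src] not in pop:
                pop[worst] = best[src]
    return messages

if __name__ == "__main__":
    def sample():
        return [[(10, "A"), (30, "a")], [(20, "B"), (40, "b")],
                [(25, "C"), (50, "c")], [(15, "D"), (60, "d")]]
    for model in ("island", "stepping_stone"):
        islands = sample()
        print(model, migrate(islands, model), [min(p)[0] for p in islands])
```

In this four-population example the island model sends 12 messages per round and every population receives the overall best member immediately, while the stepping stone ring sends 8 messages and the best member needs a further round to reach the population opposite its origin.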

1.4 Project Overview
In this work a genetic algorithm is applied to solve a non-trivial optimization
problem, specifically the Traveling Salesman Problem. This work follows previous
research as well as introduces new concepts for comparison. The foci of this work are the
comparisons between serial and parallel implementations including their execution times,
final results, and resource usage. An additional focus is the impact of adding a migration
state in the genetic algorithm flow. A last focus is the comparison of this project to
previous projects and their results.
The genetic algorithms, both serial and parallel, are implemented on Altera
Cyclone II FPGAs incorporated onto either the Altera DE2 or Altera DE70 development
boards. The growth in the number of configurable bits on FPGAs over the last twenty
years, the ability to quickly change circuitry, and their increasing clock speeds have
made FPGAs increasingly viable for more complex problems such as genetic algorithms.
Additionally, the regular structure of an FPGA's internal combinational logic conforms
well to the regular structure of a genetic algorithm and allows the parallelism present
in genetic algorithms to be exploited more readily than on a traditional multiple instruction set
parallel processor [10].
For communicating initial setup parameters and displaying final results, a user
interface on a host PC is established. Communication between the user interface and the
FPGA is established through a serial link between the host PC and the development
boards. An IP core on the FPGA side was created for sending and receiving information.

In addition to communication between the FPGA and host PC, communication
also occurs between multiple FPGAs through the migration operator, which runs after
a set number of generations of the genetic algorithm. During migration, the best solutions
from each FPGA are communicated to the others. A greedy algorithm determines whether these
solutions are saved or discarded. The results obtained from this cooperative approach are
compared to the results from FPGAs computing independently. It is expected that this
cooperative approach will produce better results for the same amount of runtime.

CHAPTER II

RELATED WORK

2.1 Introduction of a Hardware Genetic Algorithm
As part of his Master's thesis at the University of Nebraska-Lincoln,
Stephen Scott explored using a field programmable gate array (FPGA) to compute a
simple genetic algorithm. He was motivated by the increasing complexity of genetic
algorithms; this increased complexity had begun to overwhelm software implementations,
causing unacceptable delays during the optimization process. Using hardware for solving
these problems was not feasible until the introduction of the FPGA, as it is necessary to be
able to easily change parts of the design. In his simple example he created a population of
four strings composed of five bits. His goal was to optimize the objective function
f(x) = 2x over the domain 0 ≤ x ≤ 31. For selection, his algorithm performed
roulette wheel selection. In this type of selection a solution's chance of being selected
is proportional to its fitness; the better a solution's fitness, the higher the chance of it
being selected. The crossover operation utilized was a one-point crossover
wherein a random crossover point is selected. Data before the crossover point is taken
from one parent while data beyond the crossover point is taken from the second parent.
This method created two children. In his mutation operation, a random bit was selected
and flipped; mutation had a 0.01% chance of occurring for any one child. After selection,
crossover, and mutation completed, the new children were placed into the population.
Replacement did not occur in this example as a completely new population was created
each generation. This method increased the maximum fitness by
approximately 25% each generation. Termination occurred after a set number of
generations had passed.
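The operators in this simple example can be sketched in software as follows. This is an illustrative reconstruction from the description above, not Scott's VHDL design. Chromosomes are held as 5-bit integers, the function names are hypothetical, and the way the crossover point and mutated bit are drawn is an assumption.

```python
import random

BITS = 5  # five-bit strings, so 0 <= x <= 31

def fitness(x):
    """Objective function f(x) = 2x."""
    return 2 * x

def roulette_select(population, rng):
    """Pick a member with probability proportional to its fitness."""
    total = sum(fitness(x) for x in population)
    if total == 0:
        return rng.choice(population)
    spin = rng.uniform(0, total)
    running = 0.0
    for x in population:
        running += fitness(x)
        if spin < running:
            return x
    return population[-1]

def one_point_crossover(p1, p2, point):
    """Bits before `point` come from one parent, the rest from the other.
    Returns the two complementary children."""
    low_mask = (1 << (BITS - point)) - 1
    high_mask = ((1 << BITS) - 1) ^ low_mask
    return (p1 & high_mask) | (p2 & low_mask), (p2 & high_mask) | (p1 & low_mask)

def flip_bit(x, bit):
    """Mutation: flip a single bit."""
    return x ^ (1 << bit)

def next_generation(population, rng, mutation_rate=0.0001):
    """Build a completely new population (no replacement step), as in the example.
    The default rate corresponds to the 0.01% chance quoted above."""
    new = []
    while len(new) < len(population):
        p1, p2 = roulette_select(population, rng), roulette_select(population, rng)
        point = rng.randint(1, BITS - 1)
        for child in one_point_crossover(p1, p2, point):
            if rng.random() < mutation_rate:
                child = flip_bit(child, rng.randrange(BITS))
            new.append(child)
    return new[:len(population)]

if __name__ == "__main__":
    rng = random.Random(0)
    population = [5, 12, 20, 9]          # four hypothetical 5-bit strings
    for _ in range(5):
        population = next_generation(population, rng)
    print(population, max(fitness(x) for x in population))
```

For instance, crossing 11111 with 00000 at point 2 yields the children 11000 and 00111.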
The hardware system employed by Scott consisted of a simple interface program
running on a personal computer (PC) at the front end and the hardware-implemented
genetic algorithm on the back end. The hardware consisted of a BORG board with five Xilinx
FPGAs, each with an 8 MHz clock oscillator. Two held user logic, another two held
user-specified interconnects, and the last handled interfacing to the PC. The front end would
send a "Go" signal to the back end once initial parameters were loaded into shared
memory. Upon completing computations, the hardware genetic algorithm would send a
"Done" signal back to the front end. The interface would then read the final population
values from shared memory and write them to a file for the user to view. His hardware
design used VHDL in order to allow the design to be specified behaviorally instead of
structurally. Modules were defined logically according to the simple genetic algorithm
described by John Holland. A main memory control unit interfaced to the outside world
and would receive the "Go" signal. Upon receiving this signal, it would notify the
additional modules so they could load the parameters necessary for their operations. The
pipeline would then start: the fitness of each member would be calculated before going to the
selection module. Selection would perform the roulette selection and pass members to
crossover and then mutation. Afterwards, newly created children were evaluated by the
fitness module and written to memory. The process would then repeat until termination
when the "Done" signal was sent to the front end.
The design described in his work was a form of coarse-grained pipelining, as each module
would compute then immediately wait for more input to begin repeating. Each operation
did not have to be suspended while others ran. A proposed improvement set forth by
Scott was to have multiple sets of each module that could all feed into the shared
memory. He determined that the highest amount of parallelism would come if there were
banks of selection, crossover, mutation, and fitness modules so that multiple population
members could be computed at the same time [21].
2.2 Parallel Modifications to the Hardware Genetic Algorithm
The success of work done by Scott demonstrated that FPGAs could be applied
to genetic algorithms and spring-boarded a new area of research and
development. The parallelization improvements suggested by Scott have been
explored in later years. One such work described in [9] incorporated different
parallelization techniques in order to optimize solving the Traveling Salesman Problem.
The hardware used in this project consisted of Xilinx Virtex-5 FPGAs, and the programming
language was Handel-C. Initially a software version was written in C++ to find optimal
parameters for random number seeds, crossover rates, mutation rates, and population
sizes; the goal of the project was to optimize a hardware version with internal arrays and
parallelism to reduce the execution time when compared to the software version.

There were five major improvements based on hardware parallelism. The first
dealt with individual creation which illustrated data parallelism. In the initial version, the
complete solution had to be parsed sequentially to verify a city was not duplicated before
a new one was added to the tour. The modified version allowed this check to be done in
parallel; the new city to be added could be checked against all existing tour cities at the
same time. The second modification is unique to the Handel-C language. By introducing
par statements, a code block could operate in parallel. This allowed for functional
parallelism as many statements could be performed at the same time. The third
modification dealt with breaking apart the city distance and population memory. Prior to
this modification, both the population of solutions as well as the distance matrix holding
distance values between cities were stored in the same memory. Separating the two into
separate memories allowed each to be accessed at the same time. This was a second
example of data parallelism. Data parallelism was again exploited in modifications four
and five which improved the crossover and mutation operations. Instead of representing
buffers in internal RAM, these buffers were changed to be internal arrays. This allowed
for multiple accesses to different array positions rather than one access at a time in RAM.
The modifications listed above all utilized either task or data parallelism. From
the results in [9] it is determined that data parallelism obtained greater speedups in
execution times. The high number of data accesses necessary for the Traveling Salesman
Problem greatly influenced the execution time of the program. Being able to access
multiple data locations at once greatly improved overall execution time. Additionally,
because mutation was used only 1% of the time, optimizing that module produced the
least improvement in execution time, showing that improvements made are proportional
to the amount of time spent in the improved module. This is expected from Amdahl's
Law.
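Amdahl's Law can be made concrete with a short calculation. The sketch below assumes the usual form of the law, in which a fraction of the run time is accelerated by some factor; the fractions and factors used are illustrative values only and are not taken from [9].

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time (0..1)
    is accelerated by `factor` and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

if __name__ == "__main__":
    # Illustrative numbers: a module taking 1% of the run time barely
    # matters, while one taking 50% of the run time matters a great deal.
    print(round(amdahl_speedup(0.01, 10.0), 3))
    print(round(amdahl_speedup(0.50, 10.0), 3))
```

Even an unlimited speedup of a module that accounts for 1% of the run time cannot improve the total by more than about 1%, which is consistent with the small gain observed for the mutation module.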
Another work that saw improvements in execution time when parallelism was
added is described in [10]. This work also used the Traveling Salesman Problem as a case
study and incorporated both task and data parallelism. The hardware used in this project
consisted of seven Xilinx Virtex II FPGAs, and the hardware language was a graphical hardware
description language called VIVA. This language supported a structural object-oriented
design philosophy that allowed for higher-level behavior attributes, like polymorphism or
recursion. Data flow and demand-driven paradigms drove synchronization. In data flow,
objects begin their operation only after being signaled; a signal to start the next
module can only be sent if the previous module is complete.
Upon analysis of the original serial design, it was found that the objective
function used to calculate overall distance of a tour led to a major bottleneck. By
replicating this module, data parallelism was achieved as multiple solutions could have
their fitnesses evaluated at the same time. This idea falls in line with Scott's original
thinking of replicating modules and the methods described in the previous work. By
replicating the module, speedup factors of over 10 were achieved when compared to a
serial C++ version.
In addition to replicating a time-consuming process, the work also included
pipelining similar to Scott's original simple implementation. To achieve temporal
parallelism, each of the genetic algorithm modules of selection, crossover, mutation, and
replacement added "busy" and "wait" signals. This allowed the modules to synchronize
and broadcast their status to neighboring modules. Knowing the status of other modules
allowed each module to send off its information when appropriate and receive
information when it was ready. This allowed each module to work continuously
unless waiting for another module to finish [10].
2.3 Other Algorithm Modifications
In addition to research on methods of parallelization, improvements to genetic
algorithm operations have also been a topic of discussion. Because the Traveling Salesman
Problem must maintain a unique ordering of cities and each city may only be visited
once, this can cause the crossover operation to be cumbersome. If a one-point crossover
operation is performed, the child solution will be the same as the first parent up to the
crossover point. This is just a simple copy from the parent to the child. After the
crossover point, however, the second parent cannot simply be copied over to the child as
in a more traditional genetic algorithm. If that occurred, it could be possible to have
duplicate cities in the child's tour. To avoid duplicate cities, the child must be parsed
sequentially each time a city is about to be added from the second parent; only cities not
yet present in the child may be used. For a very long tour this sequential parsing can
become a bottleneck.
One method proposed to help alleviate this problem is the partially-mapped
crossover operation described in [22]. In this method, memory maps are used to keep
track of which cities have been used in the child already. Initially, the memory maps all
have values of "0" indicating that the city has not yet been used in the child. Two
crossover points may then be randomly chosen. The portions of one parent before the
first crossover point and after the second crossover point are copied to the child. This is
the same as the previous method; it is just a simple copy. All cities that are written to the
child update their memory maps to reflect "1." The child's values between the crossover
points come from the second parent but instead of parsing the child sequentially
whenever an addition is attempted, the second parent can simply reference the memory
map. If the city has a value of "1," it is not added. If the city has a value of "0" it is added
and the value updated to reflect the city's use. This method eliminates the sequential
parsing necessary before any new additions by the second parent [22].
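The memory-map approach can be sketched in software as follows. This is a minimal illustration that follows the description above rather than any particular published variant; cities are assumed to be numbered from 0, and the function name and list-based memory map are hypothetical.

```python
def mapped_crossover(parent1, parent2, cut1, cut2):
    """Two-point crossover that keeps every city unique.

    Positions before cut1 and from cut2 onward are copied from parent1.
    Positions cut1..cut2-1 are filled with the unused cities in the
    order they appear in parent2, using a memory map instead of
    re-scanning the child for duplicates.
    """
    n = len(parent1)
    used = [0] * n            # memory map: 0 = city not yet in child
    child = [None] * n
    for pos in list(range(cut1)) + list(range(cut2, n)):
        child[pos] = parent1[pos]
        used[parent1[pos]] = 1
    pos = cut1
    for city in parent2:      # sequential pass over the second parent
        if used[city] == 0:   # one map lookup replaces a scan of the child
            child[pos] = city
            used[city] = 1
            pos += 1
    return child

if __name__ == "__main__":
    print(mapped_crossover([0, 1, 2, 3, 4, 5, 6, 7],
                           [7, 6, 5, 4, 3, 2, 1, 0], 2, 5))
```

Because exactly as many cities remain unused as there are open positions between the two crossover points, the fill never runs past the second point.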
Another such method for alleviating the crossover problem is described as using a
rank ordering and is the method utilized in [10]. In this method instead of ordering
solutions as a tour of cities, solutions represent the rank each city has in the tour. For
example, if City #4 was visited first, a value of "0" would be stored in the fourth slot of
the chromosome. Although this method may not be the most natural way to think of
ordering the solutions, it greatly improves the crossover operation. In this method, a
random crossover point is selected and the first parent is examined. If the first parent's rank of
a city is less than or equal to the crossover point, it is copied to the child as is. If the rank
is greater than the crossover point, the rank of the second parent is copied to the child
with an offset added. The resulting child from this method, as in the previous methods,
has the same city-visit order as the first parent up to the crossover point and the same
city-visit order as the second parent after the crossover point. The major benefit of this
method is that all comparisons of ranks to the crossover point can be done at the same
time, drastically cutting down the time required for the crossover operation. A
disadvantage of this method though is that the rank ordering must be somehow converted
to a city-visit ordering [10].
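The relationship between the two representations can be shown with a small sketch. It covers only the conversion between city-visit order and rank order, not the rank-based crossover of [10]; cities and ranks are assumed to be numbered from 0, and the function names are hypothetical.

```python
def tour_to_ranks(tour):
    """ranks[city] = position at which that city is visited."""
    ranks = [0] * len(tour)
    for position, city in enumerate(tour):
        ranks[city] = position
    return ranks

def ranks_to_tour(ranks):
    """Inverse conversion: tour[position] = city visited at that position."""
    tour = [0] * len(ranks)
    for city, position in enumerate(ranks):
        tour[position] = city
    return tour

if __name__ == "__main__":
    tour = [3, 0, 2, 1]           # city 3 is visited first
    ranks = tour_to_ranks(tour)   # so slot 3 holds rank 0
    print(ranks, ranks_to_tour(ranks))
```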
2.4 Summary
In this chapter a myriad of works relating to genetic algorithms in hardware were
described. The first work by Stephen Scott in 1995 provided the stepping stone into this
research area. Although his example was simple with a small population, it proved that
FPGAs can handle the increasing complexity of genetic algorithms as well as or better
than software implementations. His pipelining efforts and notes about replicating
modules to allow for faster execution times provided insight for future research.
Both data and temporal parallelism have now been invoked in projects on genetic
algorithms and the influences studied. Specifically with the Traveling Salesman Problem,
a large population means many solutions must have their fitness calculated. This
calculation requires many memory accesses and when only one solution's fitness is
calculated at one time this can cause a bottleneck. Because each solution is independent,
their fitness can be calculated in parallel and reduce this bottleneck. This data parallelism
technique was examined in both [9] and [10]. Pipelining a genetic algorithm solving the
Traveling Salesman Problem, as Scott did in his simple example, may also reduce the execution
time. This technique was also explored in [10].
In addition to parallel techniques, improvements to genetic algorithm operations
provided some reduced execution times. The crossover operation in particular can be a
time-consuming process. One method to reduce this time is to use a partially-mapped
crossover. In this method, when cities from the first parent are copied to the child, a
memory map records which cities are now a part of the child. When cities from the
second parent are copied, instead of traversing the whole child to check whether the city
has been added, only the memory map must be referenced. Another method is to use
rank-ordering instead of ordering the solutions in city-visit order. This method
compares the rank of each city to the crossover point and makes a decision based on
whether the rank is less than or greater than the crossover point. This is a very fast
method as all comparisons can be done at the same time.

CHAPTER III

A Hardware-Based Genetic Algorithm with Migration

3.1 Solving the Traveling Salesman Problem with a Genetic Algorithm in Hardware
In this section the specifics of the genetic algorithm implemented in this project
are discussed. It provides a general overview of the flow of the genetic algorithm
used in all implementations discussed in this work. The remaining sections of
this chapter provide more details on other areas of the design, such as the user interface,
as well as the differences between implementations with and without the migration
component. Hardware setup and experimental results are saved for the following chapter.
The algorithm can be divided into two major phases as outlined below.
3.1.1 Initialization Phase
The initialization phase includes the components necessary to receive initial
parameters from the user interface, create the initial population members, calculate their
fitness, and store the population members and their fitness in corresponding memories. It
is repeated for every member of the population and is completed once all population
members' fitnesses are stored in memory. This phase occurs once at the beginning of
runtime, and does not include any genetic algorithm operations such as selection,
crossover, or mutation. These operations occur during the genetic algorithm phase.
The first step of the initialization phase is to receive user input parameters. A user
interface running on a personal computer collects this information and sends it via a serial
UART connection to the FPGA. A detailed look at this interface is provided later in the
chapter. These parameters include the population size, probability of mutation, random
number seed, and number of generations. These different parameters are utilized by
various components throughout runtime, and are standard parameters used in many
genetic algorithms.
Upon receiving and storing initial parameters, the initial population is ready to be
created. In this work, a 64-city tour is always implemented in each program variation.
Each city is represented by 8 bits, resulting in each chromosome having a total of 512 bits.
Initially a 48-bit linear-feedback shift register generates 64 random numbers ranging
from hex values "00" to "FF" which are stored in the chromosome. The random number
seed input can take any value between decimal values "0000" and "9999"; these 14 bits are
configured into the lower 14 bits of the initial 48-bit seed stored in the hardware. The
chromosome of random numbers is referred to as a weighted chromosome because it does
not reflect the final city-visit order
of the tour. Duplicate values from the random number generator are accepted at this
stage. The weighted chromosome is then passed to a sorting component which generates
the city-visit order.
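A software sketch of the weighted chromosome generation is given below. Several details are assumptions made only for illustration and are not stated in this work: the feedback taps (48, 47, 21, 20), the fixed value placed in the upper 34 bits of the seed, and the choice to shift eight times and take the low byte for each gene.

```python
MASK48 = (1 << 48) - 1
TAPS = (48, 47, 21, 20)           # assumed tap positions (1 = lowest bit)
UPPER_BITS = 0x2AAAAAAAA << 14    # hypothetical fixed upper 34 seed bits

def lfsr_step(state):
    """Shift the 48-bit register left by one, feeding back the XOR of the taps."""
    bit = 0
    for tap in TAPS:
        bit ^= (state >> (tap - 1)) & 1
    return ((state << 1) | bit) & MASK48

def weighted_chromosome(seed, n_cities=64):
    """Return n_cities random 8-bit weights; duplicates are allowed."""
    if not 0 <= seed <= 9999:
        raise ValueError("seed must be between 0 and 9999")
    state = UPPER_BITS | seed     # user seed occupies the lower 14 bits
    genes = []
    for _ in range(n_cities):
        for _ in range(8):        # assumed: eight shifts per 8-bit gene
            state = lfsr_step(state)
        genes.append(state & 0xFF)
    return genes

if __name__ == "__main__":
    print([format(g, "02X") for g in weighted_chromosome(1234)[:8]])
```

The nonzero upper bits keep the register out of the all-zero state, which an XOR-feedback register can never leave, even when the user enters a seed of "0000".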
To generate the city-visit order from the weighted chromosome, a systolic sort is
performed on the weighted chromosome. This component contains 64 individual sorters
corresponding to the 64 cities in the tour. Figure 3.1 shows the block diagram of one of
the individual sorters.

Figure 3.1: Systolic Sorting String

Inputs to the sorters include a randomly generated number from the weighted
chromosome, DATAIN, and the index value, INDEXIN, of where the random number is
stored in the chromosome. For example, if the fifth number in the chromosome is F2,
then the inputs would be the random number F2 and the index value 5. Upon receiving
inputs, each sorter compares the random number to its internally stored comparison value
which is initialized to hex value "FF". If the random number is greater than the stored
comparison value, the random number and index are passed to the next sorter on the
INDEXOUT and DATAOUT databuses. If the random number is less than or equal to the
stored comparison value, the stored index value and stored comparison value are passed
to the next sorter while the incoming random number is stored as the comparison value
and the incoming index value is stored as the sorter's index value. The index values
correspond to cities because each index value is unique. This method results in cities with
higher weights in the weighted chromosome occurring later in the tour. Once all the
individual sorters are complete, the city-visit order is extracted from the stored index
values in the sorters, and the weighted chromosome is overwritten to reflect the
city-visit order. The chromosome remains 512 bits, but each city now needs only 6 bits to
be represented, so zeros are added at the beginning for padding.
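The behavior of the sorter chain can be modeled in software as a sketch. Two simplifications are assumed: sorters that have not yet stored a real value are modeled as empty rather than as holding the initial "FF" value, and index values are numbered from 0. The function name is hypothetical.

```python
def systolic_sort(weights):
    """Return the city-visit order produced by a chain of compare-and-pass cells.

    Each cell keeps the incoming (weight, index) pair when the incoming
    weight is less than or equal to its stored weight and passes its old
    contents onward; otherwise it passes the incoming pair onward.
    """
    cells = []                           # cells[k] = pair held by sorter k
    for index, weight in enumerate(weights):
        item = (weight, index)
        for k in range(len(cells)):
            if item[0] <= cells[k][0]:   # incoming <= stored: swap
                cells[k], item = item, cells[k]
        cells.append(item)               # first empty sorter absorbs the rest
    return [index for _, index in cells]

if __name__ == "__main__":
    # Weights F2, 10, 10, 05 at index values 0..3
    print(systolic_sort([0xF2, 0x10, 0x10, 0x05]))
```

Because a tie replaces the stored value, equal weights end up ordered with the later index first. Duplicate weights are therefore harmless: every index value still appears exactly once in the result.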
After the chromosome holds the city-visit order, the overall distance of the tour
can be calculated by the fitness function. To accomplish this, the fitness function
references a distance matrix stored in memory that holds all distances between
corresponding cities. One city represents the column location while the other represents
the row location; their intersection on the distance matrix holds the distance between the
two cities. Because the fitness function has to traverse each location in the
chromosome sequentially, this is a critical path in the current design. The overall fitness
of each chromosome is simply the sum of all distances.
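The fitness calculation can be sketched as a sequential walk over the chromosome. Whether the hardware adds the return leg from the last city back to the first is not stated here, so the sketch exposes it as a hypothetical `closed` parameter; the distance values in the usage example are made up.

```python
def tour_distance(tour, dist, closed=True):
    """Sum the distance-matrix entries for each consecutive pair of cities.

    dist[a][b] is the distance between cities a and b. When `closed` is
    True (an assumption), the leg from the last city back to the first
    is included.
    """
    total = 0
    for a, b in zip(tour, tour[1:]):   # one matrix read per leg, in order
        total += dist[a][b]
    if closed and tour:
        total += dist[tour[-1]][tour[0]]
    return total

if __name__ == "__main__":
    dist = [[0, 2, 4],
            [2, 0, 3],
            [4, 3, 0]]
    print(tour_distance([0, 1, 2], dist))
```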
Once each chromosome has its fitness evaluated, it is stored in the population
memory and the corresponding fitness stored in fitness memory. Although these are
two separate memories, the fitness of a particular chromosome is stored at the same
memory address in fitness memory as the chromosome in population memory. Having
separate memories allows reading and writing to both at the same time. The complete
initialization process then starts over at the linear-feedback shift register until all
population members have been created, evaluated, and stored [10][11].

3.1.2 Genetic Algorithm Phase
The genetic algorithm phase repeats each generation and each iteration is
independent. A master controller signals individual components when to proceed and the
components return signals once they are completed. The first component to perform is
selection. In this project, random selection is employed as preliminary testing found this
to be best when dealing with larger population sizes. Two random memory addresses are
generated and the chromosomes at these locations are selected as parents. Once two
parents are selected, crossover is signaled.
The crossover operation utilized is the partially-mapped crossover operation as
described in [5]. Two randomly generated crossover points are chosen and the child
obtains a copy of the first parent up to the first point and following the second point. An
example depicting an 8-city tour using this method is shown in Figure 3.2 below.

Figure 3.2: 2-point Crossover Example

When a city is added to the child, a value of "1" is written to the memory map signaling
the city has been placed in the child. The second parent is then sequentially traversed. If a
city is reached that is not yet in the child, it is copied to a spot between the two crossover
points and its mapping updated. If the city is already used, it is passed without copying.
This is done in order to preserve city uniqueness as required by the Traveling
Salesman Problem.
The mutation component follows crossover; first a random number is generated
and compared to the probability of mutation input parameter. If it is less than or equal to
the parameter, mutation occurs. Values greater than the parameter result in mutation not
occurring for that generation. When mutation occurs, two random numbers are generated
to reflect a removal and insertion point. At the removal point, a city is taken from the
child and placed at the insertion point. Figure 3.3 shows mutation with a removal point of
"1" and insertion point of "4." This is a simple and quick mutation operation [10][11].

Figure 3.3: Mutation Example
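The mutation step can be sketched in software as follows. The range of the random number (1 to 100, compared against the mutation percentage) and the zero-based positions are assumptions for illustration; the function names are hypothetical.

```python
import random

def remove_and_insert(child, removal, insertion):
    """Take the city at `removal` out of the tour and re-insert it at `insertion`."""
    mutated = list(child)
    city = mutated.pop(removal)
    mutated.insert(insertion, city)
    return mutated

def maybe_mutate(child, mutation_percent, rng):
    """Mutate only when a random draw is less than or equal to the parameter."""
    if rng.randint(1, 100) > mutation_percent:   # assumed draw range 1..100
        return list(child)
    removal = rng.randrange(len(child))
    insertion = rng.randrange(len(child))
    return remove_and_insert(child, removal, insertion)

if __name__ == "__main__":
    print(remove_and_insert([0, 1, 2, 3, 4, 5], 1, 4))
    print(maybe_mutate([0, 1, 2, 3, 4, 5], 10, random.Random(7)))
```

Because the operation only moves an existing city, the child always remains a valid tour with no duplicates.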

After visiting the mutation component, the child's fitness is evaluated before
moving to removal. The same fitness function used during the initialization stage is
repeated; however, it only operates on the one child. For this work, tournament removal
was found to work best with random selection for large population sizes. In this
tournament removal, three chromosomes' fitness values are randomly selected from the fitness
memory. Comparing all three, the worst fitness moves forward. Two more fitness values
are selected from memory and the comparison repeats. Finally after a third round of
comparisons, the worst fitness of the whole tournament is found. The child and its
fitness are then written to the memory locations in the population and fitness memory
housing the worst-fit chromosome from the tournament. It should be noted that even if
the child has a worse fitness than the one it replaces, the child is still placed in memory
following the guidance of a steady-state genetic algorithm [10].
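The three-round tournament can be sketched as follows. The sketch assumes that a worse fitness means a longer tour distance and that addresses are drawn with replacement; neither detail is stated above, and the function name is hypothetical.

```python
import random

def tournament_removal(fitness, rng):
    """Return the memory address of the worst chromosome found by the tournament.

    Round 1 compares three randomly addressed fitness values. Rounds 2
    and 3 each compare the current worst against two new random entries.
    """
    n = len(fitness)
    contenders = [rng.randrange(n) for _ in range(3)]
    worst = max(contenders, key=lambda addr: fitness[addr])
    for _ in range(2):
        contenders = [worst] + [rng.randrange(n) for _ in range(2)]
        worst = max(contenders, key=lambda addr: fitness[addr])
    return worst

if __name__ == "__main__":
    fitness = [10, 50, 20, 5, 60, 1, 2]
    victim = tournament_removal(fitness, random.Random(0))
    print(victim, fitness[victim])
```

The child and its fitness would then overwrite the returned address in both the population memory and the fitness memory.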
To determine if the complete program is done, a running generation counter is
compared to the number of generations input parameter. If the generation
counter is less than the input parameter the genetic algorithm phase repeats. If the counter
equals the input parameter the program begins termination. After the genetic algorithm
phase terminates, the last remaining steps include writing the best fitness, best city tour,
and total number of clock cycles to the user interface back through the serial UART. This
allows the user to review the final results of the program both as a best tour and overall
distance measure as well as a timing measurement.

3.2 User Interface
The user interface runs on a personal computer and allows the user to enter initial
runtime parameters and also to view the final results once computation is complete. The
only requirement of the personal computer is that it must be capable of running Microsoft
Visual Studio C# edition as this is the compiler for the user interface. The user interface
communicates with the FPGAs on the development boards over a USB to RS-232 serial
UART cable. The interface is written in C# and contains the protocol necessary for
encoding and decoding serial communications. The interface itself is shown in Figure 3.4
below.

Figure 3.4: User Interface

As seen in Figure 3.4, the input parameters are the random number
seed, population size, probability of mutation, and number of generations. The number of
generations and random number seed have the largest ranges for input values. Any value
between decimal values "0000" and "9999" is valid for these parameters. Due to memory
size constraints, population size is restricted to decimal values "001" through "512." A
value greater than 512 yields an out of hardware memory error. Lastly, probability of
mutation is entered as a percentage, meaning that "100" would equal 100% mutation
whereas a value of "10" represents a mutation value of 10%.
Once all input parameters are set, the "Start" button is clicked to transmit
information to the FPGA. In the "Device Communications" block messages such as
"Sending Data" and "Receiving Data" alert the user to the communication status. Once all
output data is received, the "Stop" button may be clicked to disconnect the serial
communication. New input parameters may be entered and the process restarted or the
window closed to end.
Output parameters include Tour Distance, Shortest Tour, and Clk Tcks. Tour
Distance displays the shortest fitness value found and Shortest Tour displays the
chromosome with that fitness. This corresponds to the best tour and its total distance. Clk
Tcks outputs the total number of clock cycles during runtime. Coupled with the clock
rate, the total runtime of the program can be calculated for timing evaluations.
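The timing calculation is a single division, sketched below. The clock rate in the usage example is an illustrative assumption and should be replaced with the rate actually driving the design; the function name is hypothetical.

```python
def runtime_seconds(clock_ticks, clock_hz):
    """Convert the reported clock-cycle count into seconds."""
    return clock_ticks / clock_hz

if __name__ == "__main__":
    # Illustrative values: 1,000,000 cycles at an assumed 50 MHz clock.
    print(runtime_seconds(1_000_000, 50_000_000))
```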
The user interface also provides communication to and from multiple FPGAs at
the same time. Currently, the setup supports one or two FPGAs. This is valuable if
different input parameters are to be tested and evaluated; one FPGA could run
one set and the other could run another. The final results could then be compared.
Additionally the FPGAs could be running different algorithm implementations such as a
serial and parallel version. They could both be loaded at the same time with the same or
different parameters and their final outputs compared side by side. When introducing the
migration operator it is also necessary to load multiple FPGAs at the same time for them
to work cooperatively. In this case, because the best fitness and tours are shared between
the FPGAs, the final results should be the same. By collecting information from both, the
user interface verifies the migration is being performed. The user interface is simply a
tool for relaying information to and from the FPGAs; it is independent of the hardware
setup or the algorithm running on the FPGAs.
3.3 Single FPGA Implementations
When the migration component is not added to a design, one FPGA will compute
its own best tour and fitness without interacting with others. This is the traditional
approach taken with a genetic algorithm: one machine will work on its own
computation until completion. In this section both a serial and parallel version of this
single FPGA implementation are discussed.
3.3.1 Serial Implementation
The serial implementation computes all information sequentially as opposed to
the same computation being performed on multiple sets of data at the same time or using
pipelining techniques. If fast computation is not a priority or space limits design sizes,
this may be the best approach.
As shown in Figure 3.5, the main pieces of the design are the send and receive
communication blocks, the genetic algorithm core, the sort component, memories, and a
component to measure clock cycles. There are very few inputs because the
genetic algorithm parameters are fed through the serial UART connection; the only other
two inputs are the internal FPGA clock and a "Go" signal that signifies the initial
parameters have been sent and the program is ready to compute. The only output for this
single FPGA implementation is the serial out.
Figure 3.5: Single FPGA Block Diagram

Inside the genetic algorithm core lies the linear-feedback shift register, a master
controller, selection, mutation, and crossover components. The master controller signals
which phase the program is running in and controls dataflow between components,
reading and writing to memories, and interfacing with the send and receive components.
During the initialization phase, the controller alerts the linear-feedback shift register
when to generate weighted chromosomes and sends them to the sorter to produce the
city-visit order. It then calculates the fitness of the chromosome and stores it into
memory. During the genetic algorithm phase, it signals when selection, crossover,
mutation, and replacement occur and handles checking whether the program is ready to
terminate. Figure 3.6 below shows the inside block diagram for the genetic algorithm
core.

Figure 3.6: Inside GA Core Block Diagram

3.3.2 Parallel Implementation
In the serial implementation, only one sort component is present as seen in the
bottom portion of Figure 3.5. Because sorting the weighted chromosome must be done
sequentially down the chromosome, it can be a very time-consuming process, taking time
approximately equal to double the length of the city tour. For this work, it was decided to replicate
the sort component in order to try to achieve speedups in overall execution time for the
program. Additionally, each weighted chromosome is independent and sorting
one does not depend on the outcome of another, which also suggested parallelizing this part
of the program. Sort components were added in sets of four; an example of a 4-sort
implementation of the genetic algorithm core and sorter is shown in Figure 3.7 below.

Figure 3.7: Implementation with 4-Sorts Block Diagram

Sorting more chromosomes at one time also allowed more chromosomes to be
available for calculating fitness afterwards. In the 4-sort example of Figure 3.7, four
chromosomes rather than one are ready to have their fitness calculated once the sort
component is finished. However, this is not a parallel calculation as only one read can
occur from the distance matrix at a given time. Although four chromosomes may be
available at one time, the fitness function still operates sequentially on one
chromosome at a time. Writing the chromosomes and their fitness to the respective
memories is also still done sequentially. The remaining areas of the parallel design mirror
those of the serial design.
3.4 Multiple FPGA Implementations with Migration
The single FPGA implementation discussed above is completely replicated for the
multiple FPGA implementation. The only differences between the two are the additions of the
migration component and communication interfaces for FPGA to FPGA communication.
Minus these additions, the two implementations are identical for both serial and parallel
implementations over multiple FPGAs.
The migration component is added to the genetic algorithm phase of the program.
It is controlled by the controller just like the other genetic operations and resides inside
the genetic algorithm core along with the selection, crossover, mutation, and removal
components. Instead of being performed each generation, migration only occurs after a
specified number of generations. This is because it halts the program in order to receive
and evaluate any incoming chromosome and fitness before moving to the next
generation. Although communication between boards may be slow, it is a rare
occurrence. When the program is halted, the component calculating clock cycles is also
halted in order for it to only count computation clock cycles and not wait time. If the
received chromosome is better than the current best it will be added to memory through
tournament removal and have a chance of selection at the beginning of the next
generation. If the received chromosome is not better than the current best it is discarded.
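The migration decision described above can be sketched as follows. The sketch assumes that a better chromosome is one with a shorter tour distance; `interval` stands for the specified number of generations, and `pick_victim` is a hypothetical stand-in for the tournament removal step. All names are illustrative.

```python
def migration_step(generation, interval, incoming, population, fitness, pick_victim):
    """Accept an incoming (tour, distance) pair only on migration generations
    and only when it beats the current best; return True if it was stored."""
    if generation == 0 or generation % interval != 0:
        return False                   # migration is not due this generation
    if incoming is None:
        return False                   # nothing was received
    tour, distance = incoming
    if distance >= min(fitness):
        return False                   # not better than current best: discard
    victim = pick_victim(fitness)      # address chosen by the removal step
    population[victim] = list(tour)
    fitness[victim] = distance
    return True

if __name__ == "__main__":
    population = [[0, 1, 2], [2, 1, 0]]
    fitness = [10, 20]
    worst = lambda f: f.index(max(f))  # simple stand-in for tournament removal
    print(migration_step(50, 50, ([1, 0, 2], 5), population, fitness, worst))
    print(population, fitness)
```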
Chromosomes are communicated between boards using the same protocol as the
serial UART, but instead of traveling over serial cable, signals simply travel over
electrical wire connected to input and output pins. Each FPGA has its own input and
output pin; data does not travel to and from one FPGA on the same wire. An overall
block diagram below in Figure 3.8 of a multiple FPGA implementation with 4-sorts
shows the additional communication interfaces for migration. Note the rest of the
program remains the same as the single FPGA 4-sort parallel implementation of Figure
3.7.

Figure 3.8: Implementation with Migration Block Diagram

CHAPTER V

EXPERIMENTAL RESULTS

5.1 Hardware Setup
There are only a handful of hardware components necessary in this work. First a
personal computer runs the user interface allowing for communication between the user
and all connected development boards. The personal computer may be any Windows-based
machine with multiple USB ports capable of running Microsoft Visual Studio C#.
The user may enter initial parameters and view final results through this interface as
discussed in the previous chapter. Communication between computer and development
boards is accomplished via a USB to RS-232 serial UART cable. A receive and send
interface is stored on both the personal computer and FPGA for decoding and encoding
message packets. If more than one development board is used, such as the case when
implementing the migration components, the development boards are also connected to
each other via electrical wiring and use the same send and receive protocol as the UART.
In this manner, migration communication between FPGAs is done strictly between
themselves and does not flow through the personal computer. When multiple
development boards are used, all will receive initial parameters and communicate final
results to the user interface on the personal computer the same as if only one is
connected. The user interface may be expanded as more development boards are added.
The development boards chosen for this project are the Altera DE2 and Altera
DE2-70 development boards. Both incorporate a Cyclone II FPGA onboard; the DE2 has a
Cyclone II 2C35 FPGA with 35000 logic elements while the DE2-70 has a Cyclone II
2C70 with 70000 logic elements. A logic element is Altera's resource measurement unit.
It corresponds to one look-up table, one flip flop, and logic. In contrast Xilinx's resource
measurement unit is a slice which corresponds to two look-up tables, two flip flops, and
logic. Due to the use of Altera throughout the project, the logic element was chosen to be
the measurement unit for this project and all resource measurements were converted to
this standard. Both boards have 50 MHz and 28 MHz clock oscillators and the RS-232
transceiver and 9-pin connector necessary for the USB to RS-232 serial UART cable. For
communicating between FPGAs, the 40-pin expansion headers are utilized wherein a
send signal from one FPGA is mapped to a receive signal on the other FPGA. In this
manner, the signals are one way; one pin on each FPGA is designated for sending and
another is designated for receiving [23][24].
5.1.2 Hardware Development Environment
Altera's Quartus II 12.0 Service Pack 2 Web Edition was the chosen development
environment for the Cyclone II FPGAs, with the hardware description language VHDL
chosen for programming over Verilog. This is free software provided by Altera and
contains the necessary interface for downloading and executing designs on their
development boards. Additionally, it contains tools such as the SignalTap II Logic
Analyzer that allows for monitoring of electrical signals throughout the program in real time. Single-port memories may be configured to work with the In-System Memory
Content Editor to allow real-time viewing of memory contents. These two tools are
beneficial in the debugging process as well as in verifying correct operation of the program.
Lastly, pre-built Altera megafunctions may be added to designs to facilitate
programming. These include phase-locked loops, arithmetic functions, and communication
interfaces, among others. The main megafunctions utilized in this project were phase-locked loops and adders. The majority of the programming was implemented by hand.
5.2 Resource Usage
In total for this project, six different variations of the main hardware genetic
algorithm were implemented. The variations were a serial implementation with one
sorting component and two parallel implementations with four and eight sorting
components, respectively. Additionally, for each of the implementations just described,
variations with and without migration were created. This yields the six different variations.
To adequately compare the resource usage for each of the six variations, the logic
elements used in total for each implementation were recorded as well as a breakdown of
the resources for each segment of the design such as the sorting component, UART
communications, and the genetic algorithm core.
This resource usage related to logic elements is shown graphically in Figure 4.1.
In each implementation the resources necessary for
communicating with the user interface are the same. This segment is identical in each
implementation, so identical resource measurements are expected. The only difference
in resource usage between a variation without migration and its migration counterpart is
the addition of the resources necessary for the secondary communication protocol used to
communicate between FPGAs.

Figure 4.1: Logic Element Breakdown for All Designs
[Plot omitted: "Logic Element Breakdown"; y-axis: Logic Elements Used (0 to 60000); x-axis categories: Single Serial, Mult. Serial, S4S Parallel, M4S Parallel, S8S Parallel, M8S Parallel; legend: Sort Component(s), GA Core, Migration Communications, User Interface Communications, Miscellaneous.]

In the remaining segments, the genetic algorithm (GA) core holds the logic necessary for the
linear feedback shift register, the master controller, and the selection, mutation, and crossover
components. This component measures the same resource usage when comparing a
nonmigration implementation to its migration counterpart and increases slightly as more sorting
components are added. This is due to the extra logic necessary to control multiple sorting
components. Lastly, the sort component is again the same for a nonmigration implementation and its
migration counterpart. The resource usage for four sorting components is approximately
four times the size of a single sort component and half that of the sort component with
eight sorters. This is also as expected, since the same component is simply replicated in
the parallel implementations.
5.3 Best Tour Evaluations
The overall goal of the genetic algorithm in this project, as stated previously, is to
find an optimal solution for the traveling salesman problem. Recording the optimal
solution, the best tour city by city as well as the overall distance of the tour, was made
possible by viewing results via the user interface. Before evaluating the design in
hardware, the genetic algorithm design chosen for this project was evaluated in software
to verify correct operation of the algorithm itself. Afterwards, evaluations were
performed on the hardware. Results from both sets of evaluations are presented in the
following subsections.
5.3.1 Preliminary Evaluations
In order to verify expected operation of the genetic algorithm utilized in this
project, preliminary evaluations confirmed the assumptions that larger populations, or
larger search spaces, produce better results and that allowing the genetic algorithm to
work longer by increasing the number of generations also produces better results. The test
project used in these evaluations was the same one used to determine the selection and
removal techniques, as described in the preceding chapter.
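A software check of this kind can be sketched in a few lines of Python. The sketch below is only an illustration of a steady-state genetic algorithm (one child per generation) and is not the thesis's software model: the binary tournament selection, order crossover, swap mutation, and replace-the-worst-if-better policy are all assumptions chosen for brevity, and the function and parameter names are hypothetical. In particular, this replacement rule never overwrites the best member, which differs from the replacement policy discussed later in this chapter.

```python
import math
import random


def tour_length(tour, cities):
    """Length of the closed tour that visits the cities in the given order."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def order_crossover(p1, p2, rng):
    """Copy a slice of p1, then fill the gaps in the order cities appear in p2."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]
    kept = set(p1[a:b])
    fill = [city for city in p2 if city not in kept]
    for i in range(n):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child


def run_ga(cities, pop_size, generations, mutation_prob=0.1, seed=1):
    """Steady-state GA sketch: each generation creates exactly one child."""
    rng = random.Random(seed)
    n = len(cities)
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    fitness = [tour_length(t, cities) for t in population]
    for _ in range(generations):
        # Selection (assumed): two binary tournaments pick the parents.
        parents = []
        for _ in range(2):
            i, j = rng.sample(range(pop_size), 2)
            parents.append(population[i] if fitness[i] < fitness[j] else population[j])
        child = order_crossover(parents[0], parents[1], rng)
        # Mutation (assumed): swap two cities with a fixed probability.
        if rng.random() < mutation_prob:
            i, j = rng.sample(range(n), 2)
            child[i], child[j] = child[j], child[i]
        # Replacement (assumed): overwrite the worst member only if the child is better.
        child_fitness = tour_length(child, cities)
        worst = max(range(pop_size), key=fitness.__getitem__)
        if child_fitness < fitness[worst]:
            population[worst] = child
            fitness[worst] = child_fitness
    best = min(range(pop_size), key=fitness.__getitem__)
    return population[best], fitness[best]


if __name__ == "__main__":
    rng = random.Random(0)
    cities = [(rng.random() * 100, rng.random() * 100) for _ in range(64)]
    for generations in (0, 500, 5000):
        _, best = run_ga(cities, pop_size=64, generations=generations)
        print(generations, round(best, 1))
```

Because the random seed fixes the initial population, runs that differ only in the number of generations start from the same members, which mirrors how the parameters were held constant in these evaluations.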
Increasing the population size is believed to produce better results because the
search space is larger; this trend is illustrated in Figure 4.2.

Figure 4.2: Increasing Population Size Effect on Best Tour
[Plot omitted: "Best Tour Trends compared to Population Size"; x-axis: Population Size (0 to 800); y-axis: Best Tour (15000 to 23000).]

Here the best tour comprised of 64 cities was found with varying population sizes; all
other parameters, such as the number of generations, mutation probability, and the random
number seed, were kept constant. Also, the same initial population was generated. As
expected, the genetic algorithm produces increasingly better results as the population size is
increased. Two anomalies appear on the graph: an increase in overall tour distance
from a population size of 100 to 200, and a slight increase in overall tour
distance between population sizes of 300 and 400. Both of these increases are followed
by further decreases, and an overall decreasing trend can be observed.
Increasing the number of generations also produces better results, as this allows
the genetic algorithm longer to refine population members. As before, certain parameters
such as mutation probability and the random number seed were kept constant and the best
tour for a 64-city tour was determined. Now, however, the number of generations was
varied and the population size was fixed at 512. These results are shown in Figure 4.3
[11].

Figure 4.3: Increasing Number of Generations Effect on Best Tour
[Plot omitted: "Best Tour Trends compared to Generations"; x-axis: Number of Generations (0 to 1000); y-axis: Best Tour (15000 to 27000).]

5.3.2 Hardware Evaluations
Confirmation of the algorithm by the preliminary software evaluations led to
evaluating the design in hardware. The five evaluations performed are outlined here.
Figures 4.4 through 4.6 cover the nonmigration implementations, starting
with the serial implementation and following with the parallel implementations with four
sorts and then eight sorts. Figures 4.7 and 4.8 depict two
serial implementations and two parallel implementations performing migration. In all
implementations, the population size was set to 512. This was the maximum population
size allowed by resource constraints; because a larger population size was shown to
produce better results, this maximum value was always used. Additionally,
the mutation probability and the two random number seeds were consistent among all
implementations, as was the initial population. The only varying parameter here was
the number of generations, which started at the minimum value of 1 and ended at the
maximum value of 9999.

Figure 4.4: Best Tour over Generations in Single Serial Implementation
[Plot omitted: x-axis: Number of Generations (0 to 10000); y-axis: Best Tour (80000 to 140000); series: Random Seed 1, Random Seed 2.]

Figure 4.5: Best Tour over Generations in Single Parallel with 4 Sorts Implementation
[Plot omitted: x-axis: Number of Generations (0 to 10000); y-axis: Best Distance (80000 to 140000); series: Random Seed 1, Random Seed 2.]

Figure 4.6: Best Tour over Generations in Single Parallel with 8 Sorts Implementation
[Plot omitted: x-axis: Number of Generations (0 to 10000); y-axis: Best Distance (80000 to 140000); series: Random Seed 1, Random Seed 2.]

In all the nonmigration implementations shown in Figures 4.4 through 4.6 above,
the trend of better solutions being found as the number of generations increases holds as
expected. Further exploring the cause of the steep drop-offs noted at the 8200-generation mark
for the first random seed and at 2200 and 3700 generations for the second random seed
revealed multiple successive generations where crossover and mutation produced, by
chance, the best solution seen so far. Additionally, for the first random number seed, the best
distance begins leveling out at approximately 90000 at or after 8000 generations, while
for the second random seed the best distance begins leveling out at approximately
108000 at 6000 generations. Each implementation has similar results because
each is running the same algorithm with the same parameters; the only
difference is in the sort component.
For the implementations with migration, Figure 4.7 depicts two serial
implementations communicating across the migration component. In the first run, each
has a different random number seed; in the remaining two runs they both have the same
random number seed.

Figure 4.7: Best Tour over Generations in Multiple Serial Implementation
[Plot omitted: x-axis: Number of Generations (0 to 10000); y-axis: Best Distance (80000 to 115000); series: Random Seeds 1 & 2, Random Seed 1, Random Seed 2.]

In Figure 4.8, the DE2 board ran the parallel implementation with four sorts while the
larger DE2-70 ran the parallel implementation with eight sorts. These two
implementations communicated through the migration component as well. In the first
run, the parallel implementation with four sorts was given the first random seed; this one
was given the second random seed in the second run. The parallel implementation with
eight sorts was given the opposite random seed in the first two runs. In the third and
fourth runs, both implementations were given the same random number seed.

Figure 4.8: Best Tour over Generations in Multiple Parallel Implementation
[Plot omitted: x-axis: Number of Generations (0 to 10000); y-axis: Best Distance (80000 to 115000); series: Random Seed 1 & 2, Random Seed 2 & 1, Random Seed 1, Random Seed 2.]

It is interesting to note that in both implementations with migration there is not a
clear-cut better random number seed, unlike the nonmigration implementations, where
random seed number 1 was clearly better. However, the overall best tour found by the two migration
implementations is better than the overall best tour found by their nonmigration
counterparts. The overall best tour found by the migration implementations ranges from
approximately 81000 to 82000; this is 8000 to 9000 less than the nonmigration results.
This shows that communication and cooperative working is beneficial to the genetic
algorithm and that the migration component does add value to the design.
Additionally, due to the replacement policy, in some runs the best tour found at a given
number of generations is better than the tours found at the preceding and following
numbers of generations. Under the replacement policy used, the
best solution has a chance of being overwritten and is then unavailable to
successive generations. In these situations, the algorithm found a better solution,
but it was overwritten in later generations and lost.
5.4 Timing Estimations and Evaluations
In addition to monitoring the overall best tour, tour distance, and the effect
migration played in the genetic algorithm, estimating and measuring timing from both
serial and parallel implementations determined whether the extra logic and hardware
resources necessary to construct parallel operations were beneficial to the overall design.
Here timing equations estimating the total number of clock cycles for the serial and both
parallel implementations as well as the measured number of clock cycles are outlined.
When migration was utilized, counting of clock cycles was stopped so that one design
would not record clock cycles spent waiting on the other to send information. Because of
this, the timing estimation and measurements for both the migration and their
nonmigration counterparts were the same and only one result will be reported for a
migration and nonmigration pair.
5.4.1 Timing Estimations
By using temporary variables to hold clock cycle counts and probing these values
at specific points of the design with the SignalTap II Logic Analyzer available in
Quartus II, the number of clock cycles required for each segment in a design could be
found. Using this data, coupled with the expected number of clock cycles determined by
analyzing the logical code, equations representing the total number of clock cycles for
each implementation could be derived. These equations were further refined to take into
account any amount of parallelism through the use of multiple sorting components.
Eventually, one single equation was found for estimating the total number of clock cycles
for both serial and parallel implementations; dividing this value by the clock rate
yielded a final timing estimation for the implementation. The two clock rates standard on
both the DE2 and DE2-70 boards are 50MHz and 28MHz; however, the design
was tested with a clock rate of 100MHz derived by using a phase-locked loop.
Table 4.1 outlines the segments of all implementations that shared the same
number of clock cycles; this included all segments except the sorting components. The
genetic algorithm core is broken into stages and an overall total is given as well. As the
fitness calculation performed following sorting is done sequentially for each member of
the population, when deriving the timing estimation equation, the value 131 must be
multiplied by the number of population members. Additionally, the genetic algorithm
core repeats for however many generations are required, causing its total of 280 clock cycles
to be multiplied by the number of generations in the final timing estimation equation.

Table 4.1: Clock Cycles Constant in All Design Implementations

Segment                                          Clock Cycles
Fitness Calculation Per Population Member        131
Selection                                        2
Crossover                                        128 (max)
Mutation                                         7
Fitness Calculation For Child                    131
Replacement                                      12
Total Genetic Algorithm Core Per Generation      280

Table 4.2 lists the number of clock cycles per population member that each
sorting design required. These values were obtained in the same way as the previous clock cycle
values, by analyzing the logical code and verifying with the SignalTap II Logic Analyzer.

Table 4.2: Clock Cycles for Varying Sorting Components in All Design Implementations per Population Member

Sorting Design               Clock Cycles
Serial                       140
Parallel with Four Sorts     103
Parallel with Eight Sorts    86

The measurements outlined in Table 4.2 hint that with each addition of four sorting
components, the overall clock cycles for sorting decrease by approximately 25%. This
pattern is very beneficial in determining the final timing estimation equation recorded in
equation 5.1.

\[
\text{Total Clock Cycles} = \text{population} \times \bigl[(140 - 0.25 \cdot 140x) + 131\bigr] + \text{generations} \times 280 \tag{5.1}
\]

In equation 5.1, the genetic algorithm core's clock cycle value of 280 is simply
multiplied by the number of generations as described earlier. The number of clock cycles
required to compute the fitness of a population member after sorting, 131 clock cycles, is
also simply multiplied by the population size. The more interesting piece of the equation
comes from the sorting component factor. Because the serial implementation requires 140
clock cycles for sorting a member of the population, and a pattern of decreasing this
value by 25% for every four sorting components emerged, subtracting an appropriate
amount of clock cycles from the 140 total yields an approximation for the number of
clock cycles required for sorting. In equation 5.1, x corresponds to the number of sets of
four sorting components. In the serial implementation, this value drops out and only the
total of 140 remains. In the parallel implementation with four sorts, x takes a value of 1
and 35 clock cycles are subtracted from 140, yielding 105, which is close to the
measured 103 clock cycles. Replacing the number of generations and population size
with the appropriate values allows the equation to be used to estimate any of the
implementations discussed in this project. A comparison of the estimated values using the
equation and the actual measured values from runtime is outlined in the next subsection.
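Equation 5.1 can be written as a small function to make the substitution of values concrete. The following is a minimal sketch; the function and parameter names are hypothetical, and the 100MHz default clock is the rate at which the design was tested.

```python
def estimated_sort_cycles(sets_of_four):
    """Per-member sorting estimate from equation 5.1: 140 - 0.25 * 140 * x."""
    return 140 - 0.25 * 140 * sets_of_four


def estimated_total_cycles(population, generations, sets_of_four):
    """Total clock cycles from equation 5.1.

    sets_of_four is x in the equation: 0 for the serial implementation,
    1 for four sorts, 2 for eight sorts.
    """
    per_member = estimated_sort_cycles(sets_of_four) + 131
    return population * per_member + generations * 280


def cycles_to_microseconds(cycles, clock_hz=100e6):
    """Convert a clock cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6


if __name__ == "__main__":
    for x in (0, 1, 2):
        cycles = estimated_total_cycles(population=512, generations=100, sets_of_four=x)
        print(x, estimated_sort_cycles(x), cycles, cycles_to_microseconds(cycles))
```

For x = 1 the sorting term evaluates to 105 clock cycles, close to the measured 103. For x = 2 the same term evaluates to 70 clock cycles, against the 86 listed in Table 4.2, so the estimate appears looser for eight sorts.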
5.4.2 Timing Evaluations
Measuring the execution time of each implementation was accomplished by
recording the number of clock cycles presented on the user interface and then dividing by
the appropriate clock rate. For all implementations, a phase-locked loop generated a
100MHz clock signal from the on-board 50MHz signal. This was done to test the design at
a higher clock frequency than previous works.
As stated previously in Table 4.1, the genetic algorithm core and calculating the
fitness for each population member during the initialization phase both took the same
number of clock cycles when comparing the serial, parallel with four sorts, and parallel
with eight sorts implementations. For a fixed population size, calculating the population
fitness after sorting remained the same regardless of the number of generations. The
genetic algorithm core's number of clock cycles increased as the number of generations
was increased. These trends can be seen graphically in Figure 4.9. For all
implementations, calculating the population fitness took approximately 24 microseconds
while the genetic algorithm core started at 925 microseconds for 100 generations and
increased to 25632 microseconds for 9999 generations. These measured values can be
checked against the estimated values in Table 4.1. Multiplying the estimated 131 clock cycles
by the population size of 512 gives the total clock cycles necessary for calculating all
population members' fitness; adding the product of the estimated 280 clock
cycles for the genetic algorithm core and, for example, 100 generations yields a total
estimation of 95072 clock cycles. Converting this to microseconds based on a 100MHz
clock yields an estimated time of approximately 950 microseconds.
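Written out with the values from Table 4.1, a population size of 512, and 100 generations, the arithmetic behind this estimate is:

```latex
\[
131 \times 512 + 280 \times 100 = 67072 + 28000 = 95072 \ \text{clock cycles},
\qquad
\frac{95072\ \text{cycles}}{100 \times 10^{6}\ \text{cycles/s}} = 950.72\ \mu\text{s}
\]
```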

Figure 4.9: Genetic Algorithm Core and Population Fitness Timing for all Implementations
[Plot omitted: x-axis: Generations (0 to 10000); y-axis: Microseconds (0 to 25000); series: GA Core, Population Member's Fitness.]

The sorting components, on the other hand, differed from implementation to
implementation, as expected and as estimated in Table 4.2. The value of 710 microseconds
measured for the serial implementation can be checked against the values given in
Table 4.2. Multiplying 140 clock cycles per population member by a population size of
512 and converting into microseconds with a 100MHz clock yields an estimation of approximately 717
microseconds. The measured values for the parallel implementations can be verified
following the same method.
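The same check can be repeated for all three sorting designs. The snippet below is a small sketch that uses the per-member clock cycle counts from Table 4.2, the population size of 512, and the 100MHz clock; the names are hypothetical.

```python
POPULATION = 512
CLOCK_HZ = 100e6  # 100MHz clock generated by the phase-locked loop

# Clock cycles per population member, taken from Table 4.2.
SORT_CYCLES = {"Serial": 140, "4 Sorts": 103, "8 Sorts": 86}


def sort_time_us(cycles_per_member, population=POPULATION, clock_hz=CLOCK_HZ):
    """Estimated time to sort the whole population, in microseconds."""
    return cycles_per_member * population / clock_hz * 1e6


for name, cycles in SORT_CYCLES.items():
    print(f"{name}: {sort_time_us(cycles):.1f} us")
```

This gives roughly 717, 527, and 440 microseconds for the serial, four-sort, and eight-sort designs, respectively.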

Figure 4.10: Sorting Component Timing for all Implementations
[Plot omitted: "Sorting Component(s) Timing"; y-axis: Microseconds (0 to 800); x-axis categories: Serial, 4 Sorts, 8 Sorts.]

Altogether, the total execution time for the serial implementation, the parallel implementation with
four sorts, and the parallel implementation with eight sorts is given in Figure 4.11. Note how
there is little difference in the overall timing of each implementation. This, as well as the
design segment measurements, hints that the majority of time was taken up by the genetic
algorithm core and not the sorting component, or that multiple sorting components did not
produce quality speedups. Refining the areas of parallelization is a focus for further
study.

Figure 4.11: Total Measured Execution Times for all Implementations
[Plot omitted: x-axis: Number of Generations (0 to 10000); y-axis: Microseconds (0 to 30000); series: Serial, Parallel with 4 Sorts, Parallel with 8 Sorts.]

Lastly, how well the timing estimation equation predicted the measured
times is shown in Figure 4.12. The equation estimated between 2%
and 10% over the actual measured value for all implementations. This can be attributed
to the equation using the maximum number of clock cycles possible for design
components, such as 128 for crossover, when a more accurate
average observed during runtime was 75-80 clock cycles. The fact that the overage becomes larger
as the number of generations increases indicates that the overestimation comes
mainly from the genetic algorithm core part of the equation, as this is the part that becomes
dominant with an ever-increasing number of generations.
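The percent overage can be illustrated with the genetic algorithm core figures quoted earlier in this subsection. This is a minimal sketch with hypothetical names; it assumes the overage is computed relative to the measured value, and it uses the 280-cycle core total from Table 4.1 and the 25632 microseconds measured for 9999 generations.

```python
CORE_CYCLES_PER_GENERATION = 280  # Table 4.1 total, using the 128-cycle crossover maximum
CLOCK_HZ = 100e6                  # 100MHz clock generated by the phase-locked loop


def estimated_core_time_us(generations):
    """Estimated genetic algorithm core time in microseconds."""
    return generations * CORE_CYCLES_PER_GENERATION / CLOCK_HZ * 1e6


def percent_overage(estimated, measured):
    """Percent by which the estimate exceeds the measurement (assumed definition)."""
    return (estimated - measured) / measured * 100.0


estimate = estimated_core_time_us(9999)
print(f"estimated {estimate:.1f} us, overage {percent_overage(estimate, 25632):.1f}%")
```

For 9999 generations the core estimate is about 27997 microseconds, roughly 9% over the measured value, which falls inside the 2% to 10% range reported above.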

Figure 4.12: Timing Estimation Percent Overage for all Implementations
[Plot omitted: x-axis: Number of Generations (0 to 10000); y-axis: Percentage (-6 to 14); series: Serial, Parallel with 4 Sorts, Parallel with 8 Sorts.]

5.5 Project Comparisons
This project was compared to another hardware genetic algorithm used to solve
the Traveling Salesman Problem. In the comparison project, a generational genetic
algorithm was employed, where the entire population was replaced each generation, as
opposed to removing just one chromosome and replacing it with the child as in this project.
A drawback of the generational genetic algorithm method is that it suffers from a large
generational gap, meaning that there is a longer wait period before population members
can be used for mating and the advancement of the algorithm. Another
drawback is that double the memory is necessary to hold the population in a generational
genetic algorithm, as one memory must hold the current working population while another
holds the previous generation; this second memory will be filled with the next
working population before the start of a new generation. Like the project
discussed here, it calculated tour distance for a city tour comprised of 64 cities and
implemented parallelization techniques in both the sorting and fitness calculation
components.
In addition to the algorithm differences, the hardware environment in the
comparison project also differed. There, four Xilinx Virtex II 6000 FPGAs were utilized
with the VIVA graphical hardware programming language as the development
environment. This development environment combines a traditional hardware description
language with a schematic capture package to yield a graphical programming
environment. The clock frequency in the comparison project was 66MHz, compared to 100MHz
in this project [9].

To adequately compare execution times between the two different
implementations, the number of generations executed in this project needed to be set
equal to the population size times the number of generations used in the comparison
project. This was due to the fact that the comparison project created a new population
from the old population each generation. Speedups over various generations are given in
Figure 4.13.
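The generation matching and the speedup factor described above amount to two short calculations. The sketch below uses hypothetical names, and the numbers in the usage lines are purely illustrative, not measurements from either project.

```python
def equivalent_generations(population_size, generational_generations):
    """Generations to run here (one child per generation) so that as many
    children are created as in the given number of full-population generations."""
    return population_size * generational_generations


def speedup(comparison_time, this_time):
    """Speedup factor of this project over the comparison project."""
    return comparison_time / this_time


# Illustrative values only.
print(equivalent_generations(population_size=512, generational_generations=10))
print(speedup(comparison_time=10.0, this_time=2.0))
```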

Figure 4.13: Execution Time Comparisons with Previous Work
[Plot omitted: "Timing Speedup: VIVA vs Altera"; x-axis: Generations (0 to 12000); y-axis: Speedup Factor (0 to 60); series: Serial, Parallel 4 Sorts, Parallel 8 Sorts.]

As seen above, the greatest speedup came from the serial implementation, with factors
reaching over 50 times faster than the previous implementation. While the parallel
implementations did not have as high a speedup, ranging from 1 to 7 times faster for
the parallel implementation with four sorts and 1 to 4 times faster for the parallel
implementation with eight sorts, the positive increase is promising. The faster clock rate
shows that a decrease in execution time may be observed with faster hardware.
Additionally, the speedup could also be attributed to the algorithm differences; not
reproducing the entire population, sorting it, and calculating fitness each generation allows
for a lower generational gap, which helps increase the speed of the overall algorithm.
In addition to timing considerations, resource usage was also compared between
the two designs. In order to compare properly, the resource usage reported in the Xilinx
environment was converted to Altera's logic element standard. While Altera's logic
elements consist of one look-up table, one flip flop, and logic, Xilinx's resource
measurement unit is a slice which corresponds to two look-up tables, two flip flops, and
logic. After this conversion, the total percentage of resources used if the designs were
placed on an Altera DE2-70 was determined and outlined in Figure 4.14. Here it can be
seen that, for the serial implementation and the parallel implementation with four sorts, the resource
usage of the designs in this project is approximately 10% less than that of the comparison
project. However, as additional sorting units are added, the resource usage becomes more
equal.
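The conversion can be sketched as follows. The factor of two logic elements per slice is an assumption drawn from the look-up table and flip-flop counts given above; the text does not state the exact factor used. The names are hypothetical and the slice count in the usage line is illustrative only.

```python
LES_PER_SLICE = 2              # assumption: one slice (2 LUTs, 2 flip-flops) counts as two LEs
DE2_70_LOGIC_ELEMENTS = 70000  # logic element count quoted for the Cyclone II 2C70


def slices_to_logic_elements(slices):
    """Convert a Xilinx slice count to an approximate Altera logic element count."""
    return slices * LES_PER_SLICE


def percent_of_de2_70(logic_elements):
    """Share of the DE2-70's logic elements that a design would occupy."""
    return logic_elements / DE2_70_LOGIC_ELEMENTS * 100.0


# Illustrative slice count, not a figure from the comparison project.
les = slices_to_logic_elements(17500)
print(les, percent_of_de2_70(les))
```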

Figure 4.14: Resource Usage Comparison with Previous Work
[Plot omitted: "Percentage Resource Usage Comparison"; y-axis: Percentage of LEs on DE2-70 (0 to 80); x-axis categories: Single Serial, Nonpiped OBJ, Parallel 4 Sort, Piped 4 OBJ, Parallel 8 Sort, Piped 8 OBJ.]

CHAPTER VI

CONCLUSIONS

6.1 Project Review
In this thesis a genetic algorithm was applied to solve a non-trivial optimization
problem, known as the Traveling Salesman Problem, in reconfigurable hardware.
Previous works with reconfigurable hardware were expanded and scaled to today's
hardware and development environments. Comparison of this thesis with these previous
works, as well as software implementations, shows the advantages hardware brings to
genetic algorithms over their software implementations. This is due to the increasing
complexity and size of genetic algorithms potentially overwhelming software
implementations. Parallelization of areas in the genetic algorithm showed positive
speedups, but at a higher resource cost. Higher resource costs were also encountered
when the migration component was added, but more accurate results were obtained.
All implementations were executed on Altera Cyclone II FPGAs incorporated
onto either an Altera DE2 or Altera DE2-70 development board. The number of logic
elements used for all designs ranged from just under 20000 for the nonmigration serial
implementation to just under 60000 for the parallel migration implementation with eight
sorts. When comparing these measurements to the number of logic elements available on
the larger DE2-70, these implementations used between 21% and 75% of the total logic
elements; the parallel implementations both with and without migration were too large to
fit on the smaller DE2 development board.
When comparing the best distances calculated between the migration implementations and their
nonmigration counterparts, the migration component did prove beneficial, as it decreased
the overall best tour distance by 8000 to 9000 relative to the nonmigration
versions when executed over the same number of generations. The migration component
itself only accounted for approximately 4000 logic elements, making it the third smallest
component in all implementations. Overall, the small resource footprint and better results
show that migration can be a valuable component to add to a genetic algorithm.
Lastly, the timing evaluations did show improvement when parallelization was
applied to the sorting component. For all implementations, the
genetic algorithm core and the calculation of population members' fitness ran for the same amount
of time for the same population size. The change in execution times among the
implementations came from the parallelization of the sorting components. In the analysis
it was determined that for every four sort components added, the sorting time decreased
by 25% from the previous version. Although this was a positive speedup, it hints that
other parallelization techniques in other areas of the designs are necessary to achieve
exceptional overall speedups among implementations.

6.2 Future Research
The research done from Scott's early experimentation with hardware genetic
algorithms to now still presents challenges for further research. Most notable are
further parallelization techniques to decrease runtime for large problem sets.
Additionally, as hardware improves and expands, larger problems may be migrated to
hardware environments to overcome the shortcomings of software discussed earlier. Lastly,
scaling the number of communicating FPGAs would provide a better comparison of the
benefits and costs of adding the migration operator.
6.2.1 Further Parallelization
One area where parallelization may be expanded is in calculating the fitness of
each population member. As calculating the fitness of one population member may be
done independently of another, this hints at running multiple calculations in parallel.
However, because there is only one memory that holds the distances between cities,
hardware such as a priority encoder would need to be added in order to facilitate reading
from this memory and avoid multiple chromosomes trying to send their address to
memory and reading incorrectly paired data. Reading from the distance memory with the
added hardware to facilitate turns could create a bottleneck effect as chromosomes could
not read from memory until it was their turn, but once chromosomes received their data
from memory they could complete their additions in parallel before requesting another
read from memory. Once a set was complete, the chromosomes would then have to
request a write to the population and fitness memory similar to their distance memory
read request. This could be another point of potential bottleneck.
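The turn-taking described above can be modeled in software to see where the bottleneck arises. The sketch below is a simplified illustration, not a hardware design: it assumes a fixed-priority encoder in which the lowest-numbered requesting unit always wins, and it grants one memory read per cycle. All names are hypothetical.

```python
def priority_encode(requests):
    """Return the index of the lowest-numbered active request, or None if idle."""
    for index, active in enumerate(requests):
        if active:
            return index
    return None


def serve_all(pending_reads):
    """Grant one distance-memory read per cycle until every unit is finished.

    pending_reads[i] is the number of reads that fitness unit i still needs.
    Returns the list of granted unit indices, one entry per cycle.
    """
    pending = list(pending_reads)
    grants = []
    while any(pending):
        winner = priority_encode([count > 0 for count in pending])
        grants.append(winner)
        pending[winner] -= 1
    return grants


# Three fitness units needing 2, 1, and 2 reads respectively.
print(serve_all([2, 1, 2]))
```

With fixed priority, higher-numbered units wait until lower-numbered units have no outstanding requests, which is one way the bottleneck effect could appear. A rotating (round-robin) grant would spread the waiting more evenly at the cost of extra logic.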

Another possible method for further parallelization involves the genetic algorithm
core. After the initialization phase and the population and fitness memories are filled,
multiple instances of the genetic algorithm core could possibly execute in parallel. In this
way, multiple generations would be executing simultaneously. Like with the previous
case of calculating fitness, additional hardware would be necessary to facilitate reading
parent chromosomes and their fitness from memories as well as writing child
chromosomes and their fitness back to the population memory. Because during the timing
evaluations the genetic algorithm core became the longest part of execution with
increasing generations, this method could help decrease the total execution time.
However, having multiple generations executing simultaneously may stray from the
traditional genetic algorithm model where children created in a previous generation are
available to the next. In this case, depending on the number of instances of the core,
multiple generations may pass before a new child is available for selection and crossover.
6.2.2 Expandability
The Cyclone II 2C70 on the DE2-70 development board holds a total of approximately
70000 logic elements, while the Cyclone II 2C35 on the DE2 development board holds only
approximately 35000 logic elements. Altera's densest and most powerful FPGA currently
available on the market is the Stratix V FPGA. This chip has a total of 622000 logic
elements along with 50 MHz, 100 MHz, 125 MHz, and 148.5 MHz clock oscillators [X].
If parallelizing the sorting component continued to double its resource footprint for every
set of four sorts, it would be possible to fit 128 sorts at 480000 logic elements, plus the
30000 logic elements for the remaining components of the design, on one Stratix V for a
total of 510000 logic elements utilized. If multiple instances of the genetic algorithm core
were also added, five instances at 20000 logic elements each could be added to this
hypothetical design, bringing the total to 610000 of the 622000 available logic elements.
Although having 128 sorting components and five genetic algorithm cores might be a
stretch, it is interesting to consider the possibility.
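
The resource estimate above can be checked with a short calculation. The sketch below uses only the figures quoted in this section; the constant names are hypothetical, and the figures themselves are the rounded estimates from the text rather than synthesis results.

```python
# All figures are the rounded estimates quoted in this section.
STRATIX_V_LOGIC_ELEMENTS = 622000
SORTING_LOGIC_ELEMENTS = 480000   # 128 parallel sorts
OTHER_LOGIC_ELEMENTS = 30000      # remaining components of the design
GA_CORE_LOGIC_ELEMENTS = 20000    # one genetic algorithm core instance

# Logic elements used before any extra genetic algorithm cores are added.
base_total = SORTING_LOGIC_ELEMENTS + OTHER_LOGIC_ELEMENTS

# How many genetic algorithm core instances fit in what is left.
spare = STRATIX_V_LOGIC_ELEMENTS - base_total
max_ga_cores = spare // GA_CORE_LOGIC_ELEMENTS

full_total = base_total + max_ga_cores * GA_CORE_LOGIC_ELEMENTS
unused = STRATIX_V_LOGIC_ELEMENTS - full_total

print(base_total)    # 510000
print(max_ga_cores)  # 5
print(full_total)    # 610000
print(unused)        # 12000
```

With these estimates, five genetic algorithm cores is the largest number that fits, and the hypothetical design would leave only 12000 logic elements unused.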
6.2.3 Scalability
In addition to providing larger genetic algorithms on single FPGAs, further
research may focus on increasing the number of spatially connected devices utilizing the
migration component. This thesis was limited to only two connected FPGAs and
therefore could not conform specifically to either the island or the stepping-stone model.
Because this thesis did show more accurate results when populations were non-isolated, it
would be interesting to evaluate not only how results vary with an increasing number of
FPGAs, but also how results and timing evaluations compare between the two migration
models.
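
A small sketch can show why two devices cannot distinguish the two models. It assumes the usual definitions, in which the island model allows migration between any pair of subpopulations and the stepping-stone model restricts migration to adjacent subpopulations. Arranging the devices in a ring is an additional assumption made here for illustration, and the function names are hypothetical.

```python
def island_neighbors(num_devices):
    """Island model: each device may send migrants to every other device."""
    return {
        device: [other for other in range(num_devices) if other != device]
        for device in range(num_devices)
    }


def stepping_stone_neighbors(num_devices):
    """Stepping-stone model on an assumed ring: only adjacent devices exchange."""
    neighbors = {}
    for device in range(num_devices):
        left = (device - 1) % num_devices
        right = (device + 1) % num_devices
        # Use a set so that left == right (two devices) is counted once,
        # and drop the device itself (single-device case).
        neighbors[device] = sorted({left, right} - {device})
    return neighbors


# With two devices the two topologies are identical.
print(island_neighbors(2))          # {0: [1], 1: [0]}
print(stepping_stone_neighbors(2))  # {0: [1], 1: [0]}

# With four devices they differ: device 0 can no longer reach device 2 directly.
print(island_neighbors(4)[0])          # [1, 2, 3]
print(stepping_stone_neighbors(4)[0])  # [1, 3]
```

Under the ring assumption the two topologies only begin to differ at four devices, so a comparison of the two migration models would need at least that many connected FPGAs.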

REFERENCES
[1] S. Boyd and L. Vandenberghe, "Optimization Problems," in Convex Optimization,
United Kingdom: Cambridge University Press, pp. 127-130, 2004.
[2] C. L. Karr, I. Yakushin, and K. Nicolosi, "Solving inverse initial-value, boundary-value
problems via genetic algorithm," in Engineering Applications of Artificial Intelligence,
Vol. 13, pp. 625-633, 2000.
[3] M. Bellare and S. Goldwasser, "The complexity of decision versus search," SIAM J.
on Computing, Vol. 23, No. 1, February, 1994.
[4] M. LaLena, "Traveling Salesman Problem Using Genetic Algorithms," [Online].
Available: http://www.lalena.com/ai/tsp/.
[5] P. Cheeseman, B. Kanefsky, and W. M. Taylor, "Where the Really Hard Problems
Are," Proceedings of the 12th IJCAI, 1991.
[6] F. Greco, The Traveling Salesman Problem, Vienna, Austria: I-Tech Education and
Publishing, September, 2008.
[7] S. Katkoori, P. Fernando, H. Sankaran, A. Stoica, D. Keymeulen, and R. Zebulum,
"Programmable Genetic Algorithm IP Core for Sensing and Surveillance Applications,"
SPIE, Vol. 7347.
[8] Z. Yan-cong, GU Jun-hau, D. Yong-feng, and H. Huan-ping, "Implementation of
Genetic Algorithm for TSP Based on FPGA," Chinese Control and Decision Conference,
pp. 2226-2231, 2011.
[9] M. Vega-Rodriguez, R. Gutierrez-Gil, J. Avila-Roman, J. Sanchez-Perez, and J.
Gomez-Pulido, "Genetic Algorithms Using Parallelism and FPGAs: The TSP as Case
Study," International Conference on Parallel Processing Workshops, 2005.
[10] B. E. Wells, C. Patrick, L. Trevino, J. Weir, and J. Steincamp, "Applying a Genetic
Algorithm to Reconfigurable Hardware - a Case Study," 2004.
[11] J. Mintz and B. E. Wells, "Applying a Genetic Algorithm to Reconfigurable Hardware
using a Traditional HDL Approach," University of Alabama in Huntsville, 36th
Annual SIAM Southeastern Atlantic Section Conference, March 24-25, 2012.
[12] J. H. Holland, Adaptation in Natural and Artificial Systems, MA: MIT Press, 1975.

[13] B. E. Wells, "A Reconfigurable Computing Framework for the Effective Application
of the Life Cycle Model to Continuous and/or Discrete Stochastic Optimization
Problems."
[14] T. Krink and M. Lovbjerg, "The Life Cycle Model: Combining Particle Swarm
Optimization, Genetic Algorithms, and Hill Climbers," Lecture Notes in Computer
Science (LNCS), pp. 621-630, No. 2436: Proceedings of Parallel Problem Solving from
Nature VII, 2002.
[15] J. Kennedy, R. Eberhart, and Y. Shi, Swarm Intelligence, Morgan Kaufmann
Academic Press, 2001.
[16] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by Simulated Annealing,"
in Science, vol. 220, no. 4598, pp. 671-680, 1983.
[17] R. Haupt and S. E. Haupt, Practical Genetic Algorithms, 2nd ed., John Wiley and
Sons, Inc., 2004.
[18] A. Rogers and A. Prugel-Bennett, "Modeling the Dynamics of a Steady State Genetic
Algorithm," in Foundations of Genetic Algorithms, San Francisco, CA: Morgan
Kaufmann Publishers, Inc., pp. 57-63, 1991.
[19] Z. Konfršt, "Parallel Genetic Algorithms," Gerstner Laboratory Report 82/99,
CTU FEE, Prague, p. 18, 1999.
[20] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications
Using Networked Workstations and Parallel Computers, Upper Saddle River, NJ: Pearson
Prentice Hall, pp. 419-420, 2005.
[21] S. D. Scott, A. Samal, and S. Seth, "HGA: A Hardware-based Genetic Algorithm,"
New York, NY: ACM Press, 1995.
[22] I. Skliarova and A. Ferrari, "FPGA-based Implementation of Genetic Algorithm for
the Traveling Salesman Problem and its Industrial Application."
[23] Altera, "DE2 Development and Education Board User Manual," version 1.6, 2012.
[24] Altera, "DE2-70 Development and Education Board User Manual," version 1.08,
2009.
