High-Performance Parallel Implementation of Genetic Algorithm on FPGA by Torquato, Matheus F. & Fernandes, Marcelo A. C.
arXiv:1806.11555v1 [cs.DC] 20 Jun 2018
High-Performance Parallel Implementation of Genetic
Algorithm on FPGA
Matheus F. Torquato and Marcelo A. C. Fernandes
Department of Computer Engineering and Automation
Research group on Embedded System and Reconfigurable Computing (RESRC)
Federal University of Rio Grande do Norte (UFRN)
Natal, Brazil
Abstract
Genetic Algorithms (GAs) are used to solve search and optimization problems
in which an optimal solution can be found using an iterative process with proba-
bilistic and non-deterministic transitions. However, depending on the problem’s
nature, the time required to find a solution can be high in sequential machines
due to the computational complexity of genetic algorithms. This work proposes
a parallel implementation of a genetic algorithm on field-programmable gate
array (FPGA). Optimization of the system’s processing time is the main goal
of this project. Results associated with the processing time and area occupancy
(on FPGA) for various population sizes are analyzed. The accuracy of the GA
response for the optimization of two-variable functions was also evaluated for
the hardware implementation. Although these studies are limited to two
variables, the high-performance implementation proposed in this paper can work
with more variables after some adjustments to the hardware architecture.
Keywords: Parallel implementation, FPGA, Genetic algorithms,
Reconfigurable computing.
1. Introduction
In recent years, the increasing number of critical applications involving
real-time systems, together with the growth of integrated circuit density and
the continuous reduction of power supply voltages, has made the development of
suitable new computational solutions an even harder task. Due to the intense
demand in the electronics market for high processing speeds within shorter time
frames, without neglecting energy savings, the technology industry has faced an
extremely competitive and challenging scenario in terms of designing hardware
solutions to meet this constantly growing demand.
Email address: mfernandes@dca.ufrn.br (Matheus F. Torquato and Marcelo A. C.
Fernandes)
One way found by researchers and developers to address such demands is by
using algorithm parallelization techniques. Parallel processing is used to manip-
ulate data concurrently, so that while computing one section of the algorithm,
other stations perform similar operations on another set of data [1]. Combining
the hardware implementation with the parallelization of algorithms is often a
satisfactory solution for high performance and higher speed applications when
compared to sequential solutions.
Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware devices
suited to this scenario due to the nature of their architecture. Given that
FPGAs are large arrays of configurable gates, they can be programmed to operate
as multiple parallel paths in hardware. In this way, there is real
parallelization, in which the running operations do not need to compete for the
same resources since each one is executed by different gates [2]. The increasing
density and falling price of FPGAs expand the opportunities for developers and
researchers to use higher-density FPGA devices for hardware implementations [3].
The use of such devices is advantageous since development time and costs are
significantly reduced [4].
The convergence of genetic algorithms, parallelization techniques and
reconfigurable hardware motivates this work, which presents a proposal for a
parallel implementation of a genetic algorithm on FPGA. This paper focuses on
high-performance and critical applications that require time constraints on the
order of nanoseconds to be satisfied. On the other hand, in applications where
processing speed is not the critical factor, or is less limiting than the need
for low power consumption, it is possible to decrease energy use by reducing
the clock rate, since dynamic power consumption diminishes when an operating
frequency lower than the theoretical maximum is used [5]. Applications that
process a large flow of data can benefit from and be accelerated by the
implementation developed here. Some example applications are data mining,
tactile internet, massive data processing and bioinformatics.
1.1. Related Work
Genetic algorithms and Artificial Intelligence (AI) have long been used in
applications of the most diverse areas to optimize and find satisfactory solutions
in computing, engineering and other fields. More recently, a wider range of
applications and variations of genetic algorithms such as parallel and distributed
applications, hardware implementations, new proposals for genetic operators,
and hybrid (software and hardware) implementations of genetic algorithms have
been observed within the research scenario.
The work in [6] proposes a customizable Intellectual Property Core (IP Core)
for FPGA that implements a general-purpose genetic algorithm. The authors
focused on the programmability of the genetic algorithm implemented in the IP
core. Customization covers population size, number of generations, crossover
and mutation rates, random number generator seeds and the fitness function. One
of the work's highlights is the support for multiple fitness functions. The
proposed IP can be programmed with up to eight fitness functions, which can be
synthesized in conjunction with the GA and implemented in the same FPGA device.
The proposed core also has additional input/output ports that allow the user to
add further fitness functions implemented on a second FPGA device or some other
external device. The implementation used 13% of the available logic cells of a
Xilinx Virtex II Pro (xc2vp30-7ff896). However, since a trade-off between
performance and flexibility exists, and since the authors favored flexibility
over performance, the speedup over an analogous software implementation was
only ×5.16.
Hardware implementations of genetic algorithms can also be found in [7], [8],
[9] and [10]. The work detailed in [10] presented the OIMGA, whose strategy was
to retain only the ideal individual from the population, drastically reducing
the memory requirements. The paper [7] presented a compact implementation of a
genetic algorithm on FPGA that represented the population of chromosomes as a
vector of probabilities. The work focused on lower consumption of memory, power
and space resources in hardware, but it was not fully implemented on FPGA, as
it used software written in C++ to compute the fitness function values. The
work [9] proposed a high-speed GA implementation on FPGA. The implementation
was based on the HGA proposed by [11], the first known GA implementation on
FPGA, and the authors claimed that the developed system surpassed any existing
or proposed solution according to their experiments. The P-HGAv1, the version
of the HGA developed by [9], was claimed to be parametric, to have low silicon
requirements and to support multiple fitness functions. Although the authors
focused on the speed of the algorithm and reached a time of 0.021 milliseconds
per GA generation, this speed may not be compatible with real-time applications
that require low latency.
The works presented by [12], [13] and [14] showed applications of GAs to ground
mobile robots, the first two of which were embedded implementations on FPGA.
[12] developed, according to the authors, the first GA-based hardware
implementation of a simultaneous localization and mapping (SLAM) system for
ground robots. The authors achieved significant hardware acceleration compared
to a software implementation by exploiting the pipelining and parallelization
capabilities of reconfigurable hardware. In this project, the GA's genes that
made up the population represented possible robot movements based on the
previous position. Later, in the work developed by [13], the goal was to
determine the optimal movements considering aspects such as route tracking, low
energy consumption and obstacle avoidance. The authors pointed out that the
implementation was suitable for real-time use and stated that all GA stages
were implemented in hardware modules. The solution presented in that work
offered a convergence time of less than 2 milliseconds and used 17124 of the
17600 (97%) lookup tables (LUTs) available in the FPGA, but the frequency
obtained after the synthesis process was not reported. [14]
developed a genetic algorithm with a coevolutionary strategy for global trajec-
tory planning of several mobile robots. According to [15], co-evolution is the
process of mutual adaptation of two or more populations simultaneously, and
it was used to reflect the fact that all species are simultaneously co-evolving
in a given physical environment. The implementation of [14] promised an im-
provement in the genetic operators of conventional GAs and proposed a new
operator of genetic modification, but these developments were not implemented
in hardware.
The implementations seen in [16], [17] and [18] were GA applications in dig-
ital signal processing and control systems embedded on FPGA. [16] presented
a real-time GA for adaptive filtering application with all modules implemented
in hardware such as fitness function, selection, crossover, mutation and random
number generator functions. The implementation was designed in hardware and,
after synthesis, achieved a rate of 320 thousand generations per second.
Meanwhile, [17] proposed a GA for multi-carrier dynamic systems based on fil-
ter banks. The authors of [18] proposed a design and an implementation of a
PID controller based on GA and FPGA. The researchers stated that the design
method of the intelligent PID controller based on FPGA and GA was success-
fully verified and had some advantages such as flexible design, automatic online
tuning, high reliability, short technical development cycle and high execution
speed. For this case, each GA chromosome was coded with the controller set of
gains Kp, Ki and Kd. Details of FPGA area occupancy and obtained through-
put were not reported.
Lastly, [19] and [20] presented parallel and distributed implementation of GA
using FPGAs. [19] proposed a solution for parallel genetic algorithms in multi-
ple FPGAs. Using multiple populations in parallel GAs was based on the idea
that population isolation can maintain greater genetic diversity, while commu-
nication between them can cause GAs to work together to find good solutions.
The implementation of [19] was applied to three different benchmarks, including
the traveling salesman problem, and the authors stated from the experimental
results that in a configuration of 4 FPGAs an average acceleration of 30 times
over a multi-core processor GA was achieved. [20] introduced GRATER, an au-
tomated design workflow for FPGA accelerators that leveraged imprecise com-
putation to increase data-level parallelism and achieved higher computational
throughput. The main idea of this work was to establish a trade-off in which
circuit area, energy and performance are improved in exchange for reduced
precision. This was achieved through an imprecise implementation of specific
hardware blocks such as adders and multipliers, since the hardware area
reduction resulted in better use of data parallelism and, therefore, increased
the yield. Also in [20], genetic programming was used to evolve variants of the
input kernel until one was found with ideal assignments, which reduced the
synthesized kernel area while still stochastically satisfying the desired
output quality. The synthesis results on an Altera Stratix V FPGA showed that
the reduced area of the approximate kernels produced a gain of ×1.4 to ×3.0
with less than 1% of quality loss compared to benchmarks.
It is essential to observe that the works presented in the literature propose
solutions based on software and hardware on FPGA. This kind of solution
increases flexibility but decreases throughput. Thus, unlike the papers in the
literature, this work proposes a high-performance parallel implementation of
GA. The implementation uses a fully parallel strategy in all GA operations in
order to maximize the throughput.
1.2. Paper organization
Section 2 lays the theoretical foundation of the meta-heuristic used here, the
genetic algorithm, including its main characteristics, advantages and
applications. Section 3 presents a detailed description of the architecture
development and implementation, describing the hardware modules used to
construct the parallel genetic algorithm. Section 4 analyzes the results
obtained from the implementation described in the previous section, presenting
the simulation results, the synthesis on the FPGA and the validation of the
proposed architecture. The analysis covers parameters such as occupation area
and sampling frequency, taking into account different configurations of the
proposed architecture embedded in the reconfigurable hardware. Section 5 then
compares the results obtained with other similar works found in the state of
the art. Finally, Section 6 presents final considerations and conclusions on
the results obtained.
2. Genetic Algorithms
The GAs are used to solve search and optimization problems where an op-
timal solution can be found through an iterative process in which the search
starts from an initial population and then, combining the best representatives
of it, obtains a new population that replaces the previous one [21].
A GA is an iterative algorithm that starts from a population of N randomly
created chromosomes. In this work, N is even in order to simplify the
implementation. In every k-th iteration, also called generation or epoch, the N
chromosomes are evaluated, selected, recombined and mutated to form a new
population, also of N chromosomes; that is, the entire population of parents is
replaced by the new offspring. The new population is then used as input to the
algorithm's next iteration (generation), and this population update is repeated
K times, where K is the number of GA generations.
Algorithm 1 shows the pseudocode of a GA. It details all the variables and
procedures used in the implementation presented in the following sections. The
variable xj[m](k) represents the j-th chromosome of m bits in the k-th
generation and X[m](k) is a vector that stores all the N chromosomes, that is,
$$
\mathbf{X}[m](k) =
\begin{bmatrix}
x_1[m](k) \\
\vdots \\
x_N[m](k)
\end{bmatrix}. \tag{1}
$$
Algorithm 1 Genetic Algorithm
1: Initialise X[m](k) with random values
2: for k ← 1 to K do
3:   for j ← 1 to N do
4:     yj[a](k) ← FF(xj[m](k))
5:   end for
6:   for j ← 1 to N do
7:     wj[m](k) ← SF(Y[a](k), X[m](k))
8:   end for
9:   for i ← 1 to N/2 do
10:    [z2i−1[m](k), z2i[m](k)] ← CF([w2i−1[m](k), w2i[m](k)])
11:  end for
12:  for v ← 1 to P do
13:    xv[m](k) ← MF(zv[m](k))
14:  end for
15: end for
After the initialization process, the fitness function, called FF (Line 4 of
Algorithm 1), calculates the fitness value of the N chromosomes xj[m](k) of
the population. This operation is applied to all chromosomes and results in a
respective value yj[a](k) for each j-th chromosome, where a is the number of
bits representing the fitness value. The better the value yj[a](k) of the
chromosome xj[m](k), the more likely it is to continue in the new generations.
The fitness values of all N individuals are stored in
$$
\mathbf{Y}[a](k) =
\begin{bmatrix}
y_1[a](k) \\
\vdots \\
y_N[a](k)
\end{bmatrix}. \tag{2}
$$
After calculating the fitness value of each j-th chromosome of the k-th
generation, the selection operation is performed. In GAs, the purpose of
selection is to favor the chromosomes xj[m](k) with the best fitness values,
yj[a](k), in order to produce better future populations. There is a great
variety of selection methods in the literature, among them selection by
ranking, by tournament, by roulette and by elitism.
The tournament selection method used in this implementation is one of the most
used [22]. It holds a competition between two or more chromosomes randomly
chosen from the population stored in X[m](k). This competition consists of
comparing the strength (fitness), yj[a](k), of all participating chromosomes;
the one that holds the best value in Y[a](k) proceeds in the algorithm to pass
its genes forward. The selection function, called here SF (Line 7 of
Algorithm 1), takes the vectors Y[a](k) and X[m](k) of the k-th generation as
its inputs and, for each j, outputs the variable wj[m](k), which can assume the
value of any of the N chromosomes stored in X[m](k). All wj[m](k) values are
grouped in
$$
\mathbf{W}[m](k) =
\begin{bmatrix}
w_1[m](k) \\
\vdots \\
w_N[m](k)
\end{bmatrix} \tag{3}
$$
in order to be used in the crossover stage.
The crossover stage in the k-th generation occurs after the selection function
has chosen the fittest chromosomes in the population (stored in W[m](k)). It
aims to originate new chromosomes which will, after the mutation stage, compose
the next GA generation by updating the vector X[m](k). There are several
crossover techniques in the literature, and the strategy adopted in this
implementation is single-point crossover. The crossover operation, called here
CF (Line 10 of Algorithm 1), takes as input pairs of elements from the vector
W[m](k) of the k-th generation and produces as output pairs of elements of
$$
\mathbf{Z}[m](k) =
\begin{bmatrix}
z_1[m](k) \\
\vdots \\
z_N[m](k)
\end{bmatrix} \tag{4}
$$
which stores the chromosomes after crossover, that is, the new k-th offspring.
The last step of the GA is the mutation operation, which changes the value of a
group of P chromosomes in order to provide greater diversity to the population
and prevent the solution from stabilising in local minima. The mutation rate,
MR, is the parameter that controls the number of mutated chromosomes. Normally,
MR ranges from 0.1% to 2%. The value of P can be calculated by the expression
$$
P = \lceil N \times MR \rceil. \tag{5}
$$
The mutation operation, referred to here as MF, appears in Line 13 of
Algorithm 1 and is detailed in Equation 6,
$$
x = (\neg z \wedge Rand) \vee (z \wedge \neg Rand), \tag{6}
$$
where z is the chromosome to be mutated and Rand is a random variable.
Equation 6 is the exclusive OR of z and Rand (z ⊕ Rand), whose result is x.
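As a minimal software sketch of Algorithm 1 (not the hardware design), the loop can be written in Python as follows. It assumes a two-chromosome tournament, single-point crossover and XOR mutation of the first P chromosomes. The names run_ga, tournament and crossover, the default parameter values and the use of Python's random module in place of hardware random generators are illustrative assumptions.

```python
import math
import random

def tournament(pop, fit, rng, maximize=True):
    """SF: draw two random indices and return the fitter chromosome."""
    i, j = rng.randrange(len(pop)), rng.randrange(len(pop))
    first_wins = fit[i] > fit[j] if maximize else fit[i] < fit[j]
    return pop[i] if first_wins else pop[j]

def crossover(w1, w2, m, rng):
    """CF: single-point crossover of two m-bit integers."""
    cut = rng.randrange(m + 1)      # number of low (tail) bits exchanged
    low = (1 << cut) - 1            # tail mask
    high = ((1 << m) - 1) ^ low     # head mask
    return (w1 & high) | (w2 & low), (w2 & high) | (w1 & low)

def run_ga(fitness, n=8, m=16, generations=20, mr=0.02, seed=1, maximize=True):
    """Run K generations; return the final population and P."""
    assert n % 2 == 0, "N is assumed even, as in the text"
    rng = random.Random(seed)
    p = math.ceil(n * mr)                                   # Equation 5
    x = [rng.getrandbits(m) for _ in range(n)]              # Line 1
    for _ in range(generations):                            # Line 2
        y = [fitness(c) for c in x]                         # Lines 3-5
        w = [tournament(x, y, rng, maximize) for _ in range(n)]  # Lines 6-8
        z = []
        for i in range(0, n, 2):                            # Lines 9-11
            z.extend(crossover(w[i], w[i + 1], m, rng))
        for v in range(p):                                  # Lines 12-14
            z[v] ^= rng.getrandbits(m)                      # Equation 6 (XOR)
        x = z
    return x, p
```

For example, `run_ga(lambda c: bin(c).count("1"))` evolves 16-bit chromosomes toward a high count of set bits; with N = 8 and MR = 2%, Equation 5 gives P = 1.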
3. Hardware Proposed
Figure 1 presents the general architecture of the proposed GA hardware im-
plementation. The entire algorithm was developed using a parallel architecture
focusing on accelerating the processing speed, taking advantage of the available
hardware resources, similarly to [23]. The Figure details in block diagram the
main subsystems of the proposed implementation, which in turn were encapsu-
lated in order to make the general visualization of the architecture less complex.
Figure 1: General architecture of the proposed parallel genetic algorithm implementation.
(Block diagram: registers RX1 to RXn, modules FFM1 to FFMn, SM1 to SMn, CM1 to
CMn/2, MM1 and SyncM, with signals xj[m](k), yj[a](k), wj[m](k) and zj[m](k).)
The population consists of N chromosomes of m bits, in which xj[m](k)
represents the j-th chromosome of the population in the k-th generation,
according to Algorithm 1. Each j-th chromosome xj[m](k) is stored in an m-bit
register, called here RXj, whose value is updated with the new population of N
chromosomes produced after the selection, crossover and mutation processes.
This update occurs every time the synchronization module, called here SyncM,
enables the registers to store new values.
Given that the implementation optimizes two-variable functions, each register
RXj stores the values of both binary inputs of the fitness function using bit
concatenation. The first m/2 bits represent the first input of the fitness
function, pxj[m/2](k), while the second block of m/2 bits stores the second
input of the fitness function, qxj[m/2](k). Thus,
$$
x_j[m](k) = px_j\left[\tfrac{m}{2}\right](k) \,\|\, qx_j\left[\tfrac{m}{2}\right](k) \tag{7}
$$
where ‖ is the concatenation operator.
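The concatenation in Equation 7 can be illustrated with a short Python sketch. The helper names pack and unpack are hypothetical, and placing px in the most significant half is an assumption based on the description of the "first" m/2 bits.

```python
def pack(p, q, m):
    """Concatenate two (m/2)-bit values into one m-bit chromosome: x = p || q."""
    half = m // 2
    assert 0 <= p < (1 << half) and 0 <= q < (1 << half)
    return (p << half) | q   # p in the upper half (assumed), q in the lower half

def unpack(x, m):
    """Split an m-bit chromosome back into its two (m/2)-bit variables."""
    half = m // 2
    return x >> half, x & ((1 << half) - 1)
```

With m = 8, `pack(0b1010, 0b0011, 8)` yields `0b10100011`, and `unpack` recovers the pair.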
The initial population of the algorithm is randomly chosen. All random values
in the present implementation are generated by pseudo-random number generators
based on Linear Feedback Shift Registers (LFSRs) [24], [25]. Independent 32-bit
LFSRs based on the polynomial r^32 + r^22 + r^2 + 1 [25] were used. Each
generator is identified as CCLFSRlj, where CC, l and j are labels for its
position in the circuit. In every k-th generation, each LFSR produces a 32-bit
random variable, called here CCrlj[32](k). To avoid identical sequences of
values, each generator CCLFSRlj has a different 32-bit initial value, called
CCseedlj[32].
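A 32-bit LFSR of this kind can be sketched in Python as below. This is only an illustration: the Fibonacci (external XOR) structure, the shift direction and the tap set (32, 22, 2, 1) are assumptions, since the text gives the polynomial but not the exact circuit. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def lfsr32_step(state, taps=(32, 22, 2, 1)):
    """Advance a 32-bit Fibonacci LFSR by one step (assumed tap positions)."""
    fb = 0
    for t in taps:
        fb ^= (state >> (t - 1)) & 1      # XOR of the tapped bits
    return ((state << 1) | fb) & MASK32   # shift left, feed back into bit 0

def lfsr32_sequence(seed, count):
    """Return `count` successive 32-bit states starting from a non-zero seed."""
    state = seed & MASK32
    if state == 0:
        raise ValueError("an all-zero state never leaves zero")
    out = []
    for _ in range(count):
        state = lfsr32_step(state)
        out.append(state)
    return out
```

Giving each generator a different seed, as the text describes, makes each instance start from a different state and therefore produce a different sequence.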
The notation used in the following diagrams will be in the x[m](c) form,
where x is the variable, m is the bit word width and c represents the generation
of the genetic algorithm ranging from 1 to K. In some cases only the bracketed
notation, [m], will be shown to represent the amount of bits transferred on the
bus.
The implementation consists of five main modules called: Fitness Function
Module (FFM), the Selection Module (SM), Crossover Module (CM), Muta-
tion Module (MM) and Synchronization Module (SyncM). Each module has its
specific implementations that will be detailed in the following sections.
3.1. Fitness Function Module - FFM
The Fitness Function Module (Figure 2) calculates the fitness value of each
j-th chromosome from a fitness function f(·). The proposed structure has N FF
modules. Each j-th module, called here FFMj, is associated with an individual
xj[m](k) and generates as output, in every k-th generation, a fixed-point
fitness value expressed by
$$
y_j[a](k) = FFM_j(x_j[m](k)), \tag{8}
$$
where a represents the bit width (equivalent to Line 4 of Algorithm 1).
Not only for the Fitness Function Module, but for all other stages, the pro-
posed architecture is capable of solving one or two variable problems. Regardless
of the case, the user will not need to make any adjustments to the input data.
The difference between these two options reflects only on how the data is ma-
nipulated by the subsequent modules, but this does not change the performance
of the system and is done invisibly to the operator.
Figure 2 details the operation of the j-th FFM. The FFM input value, xj[m](k),
stored in the RXj register, is divided into two halves of m/2 bits,
pxj[m/2](k) and qxj[m/2](k), by the bit splitters FFMDIV1j and FFMDIV2j, so
that each variable can be operated on independently in the case of two-variable
problems.
After the split, the variable pxj[m/2](k) is directed to the ROM memory
FFMROM1j, which implements the α function through a Look-Up Table (LUT), and
the variable qxj[m/2](k) is directed to FFMROM2j, which implements the β
function in the same fashion.
Both values are then added by the FFMADDj adder, resulting in the variable
δj[d](k), where
$$
\delta_j[d](k) = \alpha(px_j)[c](k) + \beta(qx_j)[c](k). \tag{9}
$$
Figure 2: Fitness Function Module - FFM
(Block diagram: xj[m](k) is split into pxj[m/2](k) and qxj[m/2](k); the values
α(pxj)[c](k) and β(qxj)[c](k) are added by FFMADDj to give δj[d](k), from which
yj[a](k) is obtained.)
The variable δj[d](k) is then directed to the LUT FFMROM3j, which implements
the γ function; hence
$$
y_j[a](k) = \gamma(\delta_j[d](k)). \tag{10}
$$
In general, the FFM shown in Figure 2 is able to solve any one- or two-variable
problem of the form
$$
y_j[a](k) = \gamma\big(\alpha(px_j)[c](k) + \beta(qx_j)[c](k)\big). \tag{11}
$$
Expressions containing a product of the two variables are not possible in the
current approach, but could be supported through a change in the structure of
the FFM.
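The LUT-based structure of Equations 9 to 11 can be sketched in Python with lists standing in for the ROMs. The chosen functions (α(p) = p², β(q) = q², γ as the identity), the 8-bit chromosome width and all names are hypothetical; the source does not fix them at this point.

```python
def build_rom(func, width):
    """Tabulate func for every input of the given bit width (a LUT/ROM)."""
    return [func(v) for v in range(1 << width)]

M = 8                                          # chromosome width (illustrative)
HALF = M // 2
ROM_ALPHA = build_rom(lambda p: p * p, HALF)   # hypothetical alpha, as in FFMROM1
ROM_BETA = build_rom(lambda q: q * q, HALF)    # hypothetical beta, as in FFMROM2
# gamma is the identity here, tabulated over every reachable delta (FFMROM3)
ROM_GAMMA = list(range(max(ROM_ALPHA) + max(ROM_BETA) + 1))

def ffm(x):
    """Fitness of an M-bit chromosome: gamma(alpha(px) + beta(qx))."""
    p = x >> HALF                        # upper half, assumed to hold px
    q = x & ((1 << HALF) - 1)            # lower half, qx
    delta = ROM_ALPHA[p] + ROM_BETA[q]   # Equation 9 (adder output)
    return ROM_GAMMA[delta]              # Equation 10
```

Because each ROM depends on a single variable, a separable function such as p² + q² fits this structure, whereas a product p·q does not, which matches the limitation stated above.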
3.2. Selection Module - SM
The selection module (SM) implements the tournament selection method mentioned
in Section 2 by holding a competition between two chromosomes. As with the FFM,
there are N SMs for a group of N chromosomes. As detailed in Figure 3, each
j-th SM, here called SMj, has as input the N fitness values, yj[a](k), and the
N chromosomes, xj[m](k), of the k-th generation (equivalent to Line 7 of
Algorithm 1).
Figure 3: Selection Module - SM
(Block diagram: generators SMLFSR1j and SMLFSR2j, each truncated to [Log2(N)]
bits, drive the N-input multiplexers SMMUX1j, SMMUX2j and SMMUX3j over the
inputs y1[a](k) ... yN[a](k) and x1[m](k) ... xN[m](k); the comparator SMCOMPj
(A > B), the two-input multiplexers SMMUX4j, SMMUX5j and SMMUX6j and the signal
SMMAXMINj [1] produce the output wj[m](k).)
Each j-th SM has two random generators called SMLFSR1j and SMLFSR2j.
In addition to the random generators, this module is formed by three N input
multiplexers called here SMMUX1j, SMMUX2j and SMMUX3j, a m bits com-
parator, called SMCOMPj and three two-input multiplexers, called SMMUX4j,
SMMUX5j and SMMUX6j.
The SMMUX1j and SMMUX2j multiplexers are driven by the output signals of the
SMLFSR1j and SMLFSR2j generators, SMr1j[32](k) and SMr2j[32](k), respectively.
As shown in Figure 3, the output signal of each generator is truncated to its
⌈log2(N)⌉ most significant bits in order to match the population size. The
SMMUX1j and SMMUX2j multiplexers select one fitness value each, which is
related to its corresponding chromosome by its index value.
Finally, SMMUX3j selects the chromosome associated with the best fitness value
according to the output of SMMUX6j, which in turn depends on the SMMAXMINj
variable that sets whether the goal is to maximize or minimize the evaluation
function.
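The behaviour of one selection module can be sketched in Python as follows. The sketch assumes that N is a power of two, so that the truncated index always addresses a valid chromosome, and it models only the input/output behaviour, not the individual multiplexers. The names top_bits and selection_module are hypothetical.

```python
def top_bits(r32, n):
    """Keep the ceil(log2(n)) most significant bits of a 32-bit random word."""
    bits = (n - 1).bit_length()       # equals ceil(log2(n)) for n >= 1
    return r32 >> (32 - bits)

def selection_module(x, y, r1, r2, maximize):
    """Two-chromosome tournament: return the winner among x[i] and x[j]."""
    n = len(x)
    assert n & (n - 1) == 0, "sketch assumes N is a power of two"
    i = top_bits(r1, n)               # index chosen by the first generator
    j = top_bits(r2, n)               # index chosen by the second generator
    a_gt_b = y[i] > y[j]              # comparator output (A > B)
    pick_first = a_gt_b if maximize else not a_gt_b   # max/min control
    return x[i] if pick_first else x[j]
```

With four chromosomes, a random word whose two top bits are `10` selects index 2, and the max/min flag then decides which of the two candidates is forwarded as wj.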
3.3. Crossover Module - CM
The crossover module detailed in this section implements single-point
crossover. The proposed architecture contains N/2 crossover modules, each
consisting of four bit splitters, two identical crossover submodules and two
concatenators. As with the FFM described in Section 3.1, the CM has chromosome
splitters in order to manipulate the two variables stored in w[m](k)
independently.
As seen in Figure 4, the two input chromosomes, wj−1[m](k) and wj[m](k), are
each split into two halves. The first half of wj−1[m](k) is extracted by the
splitter CMDIV1j and named pwj−1[m/2](k), and the second half of the same
variable is extracted by the splitter CMDIV2j and becomes qwj−1[m/2](k). The
same happens with the chromosome wj[m](k), which is split into pwj[m/2](k) and
qwj[m/2](k) by the splitters CMDIV3j and CMDIV4j, respectively.
Once the variables of each chromosome are separated, they are forwarded to the
CM submodules CMPQ1j and CMPQ2j, where the crossover is performed. The
crossover is performed between similar variables; that is, the first variable
pwj−1[m/2](k) from the chromosome wj−1[m](k) is crossed with the first variable
pwj[m/2](k) from the chromosome wj[m](k).
Figure 4: j-th Crossover Module (CMj)
(Block diagram: wj−1[m](k) and wj[m](k) are split into pwj−1[m/2](k),
qwj−1[m/2](k), pwj[m/2](k) and qwj[m/2](k); the submodule outputs
pzj−1[m/2](k), qzj−1[m/2](k), pzj[m/2](k) and qzj[m/2](k) are concatenated into
zj−1[m](k) and zj[m](k).)
Figure 5: j-th Crossover Submodule (CMPQ1j)
(Block diagram: CMPQLFSR1j, truncated to ⌈log2(m/2 + 1)⌉ bits, drives
CMPQMUXj, which selects among right-shifted versions of the constant
2^(m/2) − 1 to produce s[m/2](k); the inputs pwj−1[m/2](k) and pwj[m/2](k) are
masked into hpwj−1[m/2](k), tpwj−1[m/2](k), hpwj[m/2](k) and tpwj[m/2](k),
which are combined into pzj−1[m/2](k) and pzj[m/2](k).)
In the case of single variable problems the system works in an equivalent
way. Only the least significant half of the variables wj−1[m](k) and wj [m](k)
will contain useful data and only block CMPQ2j will handle nonzero data.
Figure 5 presents in detail the circuit of the j-th crossover submodule, named
CMPQ1j. It is composed of an m/2-input MUX, called here CMPQMUXj, whose purpose
is to randomly select one of the m/2 possible cutting points. The selection of
each CMPQMUXj is controlled by the pseudo-random number generator CMPQLFSR1j,
whose output signal, CMPQr1j[32](k), is truncated to its ⌈log2(m/2 + 1)⌉ most
significant bits before entering the MUX selector.
The selection of the CMPQ1j cut-off point relies on a mask originated from the
constant 2^(m/2) − 1. This constant creates a vector of 1s of the size of the
chromosome half to be crossed, in this case m/2. Then, a random zero-padding
right shift is performed according to the CMPQLFSR1j value. This shift
transforms the vector of 1s into a vector of 0s followed by 1s, still of size
m/2. This mask and its inverse are responsible for carrying out the crossover
operation, aided by the AND and OR logic gates also shown in Figure 5.
Equations 12 to 14 exemplify a case where m = 20 and CMPQMUXj shifts the value
of 2^(m/2) − 1 three times:
$$
2^{\frac{m}{2}} - 1 = 1111111111; \tag{12}
$$
$$
s_j\left[\tfrac{m}{2}\right](k) = 0001111111; \tag{13}
$$
$$
\neg s_j\left[\tfrac{m}{2}\right](k) = 1110000000. \tag{14}
$$
In each k-th generation, the two inputs of the module CMPQ1j, the variables
pwj−1[m/2](k) and pwj[m/2](k), are divided into heads
$$
hpw_{j-1}\left[\tfrac{m}{2}\right](k) = \neg s_j\left[\tfrac{m}{2}\right](k) \wedge pw_{j-1}\left[\tfrac{m}{2}\right](k), \tag{15}
$$
$$
hpw_j\left[\tfrac{m}{2}\right](k) = \neg s_j\left[\tfrac{m}{2}\right](k) \wedge pw_j\left[\tfrac{m}{2}\right](k), \tag{16}
$$
and tails
$$
tpw_{j-1}\left[\tfrac{m}{2}\right](k) = s_j\left[\tfrac{m}{2}\right](k) \wedge pw_{j-1}\left[\tfrac{m}{2}\right](k), \tag{17}
$$
$$
tpw_j\left[\tfrac{m}{2}\right](k) = s_j\left[\tfrac{m}{2}\right](k) \wedge pw_j\left[\tfrac{m}{2}\right](k), \tag{18}
$$
where sj[m/2](k) is the CMPQMUXj output. After this step, the crossover is
performed by concatenating the head of parent 1, hpwj−1[m/2](k), with the tail
of parent 2, tpwj[m/2](k), and the head of parent 2, hpwj[m/2](k), with the
tail of parent 1, tpwj−1[m/2](k), thus giving rise to two new chromosomes of
the new population,
$$
pz_{j-1}\left[\tfrac{m}{2}\right](k) = hpw_{j-1}\left[\tfrac{m}{2}\right](k) \vee tpw_j\left[\tfrac{m}{2}\right](k) \tag{19}
$$
and
$$
pz_j\left[\tfrac{m}{2}\right](k) = hpw_j\left[\tfrac{m}{2}\right](k) \vee tpw_{j-1}\left[\tfrac{m}{2}\right](k). \tag{20}
$$
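The mask-based crossover of Equations 12 to 20 can be checked with a short Python sketch. Here the shift amount is passed in directly rather than drawn from an LFSR, and the names cut_mask and cmpq are hypothetical.

```python
def cut_mask(half, shift):
    """Right-shift the all-ones constant 2^half - 1 with zero padding."""
    ones = (1 << half) - 1
    return ones >> shift

def cmpq(pw_a, pw_b, half, shift):
    """Single-point crossover of two half-width variables using s and not s."""
    ones = (1 << half) - 1
    s = cut_mask(half, shift)                      # tail mask, Equation 13
    not_s = ~s & ones                              # head mask, Equation 14
    head_a, head_b = not_s & pw_a, not_s & pw_b    # Equations 15 and 16
    tail_a, tail_b = s & pw_a, s & pw_b            # Equations 17 and 18
    return head_a | tail_b, head_b | tail_a        # Equations 19 and 20
```

For m = 20 (half = 10) and a shift of three, `cut_mask(10, 3)` gives `0b0001111111`, matching Equation 13. Crossing an all-ones parent with an all-zeros parent then returns `0b1110000000` and `0b0001111111`.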
The CMPQ2j submodule works in the same way. In this case, the input values are
qwj−1[m/2](k) and qwj[m/2](k) and the outputs are qzj−1[m/2](k) and
qzj[m/2](k).
After the similar variables have been crossed within each submodule, they are
directed to the outputs of the respective CMPQs, where the concatenators
CMCCAT1j and CMCCAT2j give rise to new individuals (chromosomes) of the
population by concatenating the two parts that form them, pz[m/2](k) and
qz[m/2](k) (Figure 4).
It is important to emphasize that after the N/2 CMs have performed their
operations, N new chromosomes that will form a new population have been
created. Some of these individuals pass through the MM (described in
Section 3.4) before the start of the next generation, but at the end of each
iteration N new individuals have always been created, so the GA population
always remains with N chromosomes.
3.4. Mutation Module - MM
As in Line 13 of Algorithm 1, the mutation operation is performed on a group of
P individuals; that is, there are P mutation modules. Each j-th module, MMj,
changes the value of the chromosome to be mutated through an XOR operation with
a number created randomly by an associated generator called MMLFSRj (Figure 6).
The P MMs modify the first P individuals of the population, as shown in
Figure 1.
Figure 6: Mutation Module - MM
(Block diagram: zj[m](k) is combined with MMrj[m](k), produced by MMLFSRj, to
give xj[m](k+1).)
The output of the j-th module, MMj, in every k-th generation can be expressed
by
$$
x_j[m](k+1) = \big(\neg z_j[m](k) \wedge MMr_j[m](k)\big) \vee \big(z_j[m](k) \wedge \neg MMr_j[m](k)\big), \tag{21}
$$
where MMrj[m](k) represents the pseudo-random number generated by the j-th
MMLFSRj.
In the case of single-variable problem optimization, this mutation operation
may assign non-zero values to the m/2 unused bits of the mutated chromosome.
However, this is not a problem, since these m/2 bits are zeroed when passing
through the FFM in the following generation.
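As a quick check that the AND/OR form of Equation 21 is a bitwise XOR, the mutation of one chromosome can be sketched in Python. The function name mutate is hypothetical, and the random word is passed in as an argument instead of coming from an LFSR.

```python
def mutate(z, r, m):
    """Equation 21 on m-bit words: (not z and r) or (z and not r)."""
    mask = (1 << m) - 1
    r &= mask                          # the random word is m bits wide
    return ((~z & r) | (z & ~r)) & mask
```

For every pair of m-bit inputs the result equals `z ^ r`; for instance, `mutate(0b1100, 0b1010, 4)` gives `0b0110`.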
3.5. Synchronization Module - SyncM
Finally, the last module is the synchronization module. It enables the
registers that store the population chromosomes of the genetic algorithm to
receive new values. These new values result from the mutation and crossover
processes of the previous generation and are stored in the RX registers to
initiate a new iteration of the algorithm.
This module contains a counter, a constant value and a comparator, as shown
in Figure 7. The signal enable is asserted when the comparison returns true,
that is, when the counter value matches the value stored in the constant.
The SyncVal[2] value depends on the implemented design and is adjusted
according to the delay that the system needs to perform all its operations
and provide a new set of chromosomes. The output of this module is a boolean
value, and the counter and constant outputs are 2-bit values. This number
was chosen because it corresponds to the maximum delay found in the
implementation for an entire generation: one delay for each ROM of the FFM.
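The behaviour of the counter, constant and comparator can be sketched in a
few lines of Python. The starting value of the counter, the value
`sync_val = 2` and the reset of the counter after a match are assumptions
made for illustration; the text only states that enable is asserted when
the counter equals the constant.

```python
def sync_enable(n_clocks: int, sync_val: int = 2) -> list[bool]:
    """Return the enable signal over n_clocks consecutive clock cycles.

    Assumptions: the 2-bit counter starts at 0 and returns to 0 on the
    cycle after it matches sync_val.
    """
    enables = []
    count = 0
    for _ in range(n_clocks):
        enable = count == sync_val                    # comparator
        enables.append(enable)
        count = 0 if enable else (count + 1) & 0b11   # 2-bit counter
    return enables


pattern = sync_enable(6)
```

Under these assumptions enable is asserted once every three clock cycles,
which matches the three-clock generation period described for the
architecture.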
Figure 7: Synchronization Module - SyncM (the SyncMCount output CountVal[2]
is compared with the SyncMConst output SyncVal[2] to produce enable[1])
In all the tests performed in this work, the GA operations were performed at
a sampling rate
$$R_g = \frac{3}{T_g} \qquad (22)$$
where $T_g$ is the time required for each k-th generation to finish.
Although $R_g$ is the maximum possible sample rate at which the system can
operate and $T_g$ is the corresponding minimum time, Equation 22 includes a
factor of 3 because a new population is produced only once every three
clock cycles, since there are two delays in the architecture between the
beginning and the end of the k-th generation. These two delays are caused by
the LUTs contained in the FFM (Section 3.1), so the frequency $3/T_g$ is the
one at which population k + 1 is processed from the earlier population k.
Generally, if the architecture contained η components that caused system
delays, the sample rate $R_{g\eta}$ of this system would be
$$R_{g\eta} = \frac{\eta}{T_{g\eta}}. \qquad (23)$$
4. Results
Aiming to validate the proposed implementation of the GA on FPGA,
simulations, analyses and syntheses were performed for the optimisation of
different functions and various population sizes. The first function used to
validate the proposal, called here F1, was a one-variable function
expressed as
$$f(x) = x^3 - 15x^2 + 500. \qquad (24)$$
The second function, called here F2, was
$$f(x, y) = 8x - 4y + 1020, \qquad (25)$$
and the last function, called here F3, was
$$f(x, y) = \sqrt{x^2 + y^2}. \qquad (26)$$
This work has implemented and synthesised on FPGA the three functions
previously mentioned for populations of size N = 4, N = 8, N = 16, N = 32
and N = 64 and for chromosomes with size m = 20, m = 22, m = 24, m = 26,
and m = 28 bits.
It is important to emphasise that these functions were chosen for comparison
purposes, since they have already been used in the state-of-the-art works
discussed next. However, the proposed implementation is capable of
implementing any function in the format shown by Equation 11, requiring only
the modification of the values stored in the memories.
All results were obtained using the development platform and a Virtex 7
xc7vx550t-1ffg1158 FPGA. This FPGA has 86,600 slices that group 692,800
flip-flops, 554,240 logic cells that can be used to implement logic
functions or memories, and 2,880 DSP cells with multipliers and
accumulators.
As previously mentioned, three different functions were minimised for the
validation of the implementation. The first one was the function F1 presented
in Equation 24 and shown in Figure 8.
Figure 8: Fitness Function F1: $f(x) = x^3 - 15x^2 + 500$.
This function was chosen because it was previously used by [9] to validate a
high-speed Genetic Algorithm on FPGA. Regarding the implementation of this
function as described in Equation 11, the following associations can be
made:
$$\alpha(px) = 0, \qquad (27)$$
$$\beta(qx) = (qx)^3 - 15(qx)^2 + 500, \qquad (28)$$
$$\gamma(\delta) = \delta. \qquad (29)$$
Therefore, F1 can be represented by
$$y(px, qx) = 1 \cdot ((qx)^3 - 15(qx)^2 + 500 + 0). \qquad (30)$$
The second function is presented in Figure 9 and has been previously used in
[6] to validate the implementation of a customisable GA IP core for general
purposes.
Figure 9: Fitness Function F2: $f(x, y) = 8x - 4y + 1020$
Regarding the implementation of F2 as described in Equation 11, the
following associations can be made:
$$\alpha(px) = 8(px), \qquad (31)$$
$$\beta(qx) = -4(qx) + 1020, \qquad (32)$$
$$\gamma(\delta) = \delta. \qquad (33)$$
Therefore, F2 can be represented by
$$y(px, qx) = 1 \cdot (8(px) - 4(qx) + 1020). \qquad (34)$$
Finally, Figure 10 shows the last function used to validate the proposal
presented here. This function was previously used for the same purpose in
[19] and [14]. Both works use GAs, but only [19] implements the algorithm on
FPGA. Similarly, F3 can be expressed in terms of the parameters of Equation
11 as follows:
$$\alpha(px) = (px)^2, \qquad (35)$$
$$\beta(qx) = (qx)^2, \qquad (36)$$
$$\gamma(\delta) = \sqrt{\delta}. \qquad (37)$$
Therefore, F3 can be represented by
$$y(px, qx) = \sqrt{(px)^2 + (qx)^2}. \qquad (38)$$
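The three decompositions can be checked with a small Python sketch. It
assumes that Equation 11 has the form y = γ(α(px) + β(qx)), which is what
Equations 30, 34 and 38 suggest. The function names are hypothetical, and
ordinary Python functions stand in for the ROM contents of the FFM. The
constant of F1 is taken from Equation 24.

```python
import math


def fitness(px, qx, alpha, beta, gamma):
    """Evaluate gamma(alpha(px) + beta(qx)), the assumed form of Equation 11."""
    return gamma(alpha(px) + beta(qx))


def identity(delta):
    return delta


def f1(px, qx):
    # alpha = 0, so px is ignored (single-variable problem).
    return fitness(px, qx, lambda p: 0,
                   lambda q: q**3 - 15 * q**2 + 500, identity)


def f2(px, qx):
    return fitness(px, qx, lambda p: 8 * p,
                   lambda q: -4 * q + 1020, identity)


def f3(px, qx):
    return fitness(px, qx, lambda p: p**2, lambda q: q**2, math.sqrt)
```

Changing the function to be optimised only changes the three callables,
which mirrors the statement that only the values stored in the memories
need to be modified.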
The parameters used in the experiments were based on configurations of
previous experiments found in the literature together with some empirically
obtained configurations. For the number of generations k, it was
experimentally observed that, for all evaluated functions, the minimum value
sought was obtained before 100 GA generations were reached. This number is
in agreement with what was found in the literature, as can be seen in [9],
for example. Thus, k = 100 was adopted as the default value for the
optimization experiments performed here.
Figure 10: Fitness Function F3: $f(x, y) = \sqrt{x^2 + y^2}$
Similarly, the GA population sizes to be implemented and synthesised in the
FPGA were determined from the literature. In [23] the population size was
N = 32, in [11] it was N = 16, and in [24] the GA was implemented with
populations of sizes N = 64 and N = 128. Thus, the architecture proposed
here was implemented for the five population sizes mentioned before. The aim
behind these different sizes of N was to compare how much this parameter
influences the convergence, speed and area occupation in the FPGA.
Finally, with the same purpose of assessing how much a parameter influences
convergence and synthesis characteristics, the bit length m was varied for
all population sizes, as also mentioned previously. Figures 11 and 12
illustrate the operation of the proposed GA for the fitness functions F1 and
F3, respectively.
The fitness function F1 (Equation 24) was minimized using the GA with a
population size of N = 32 and m = 26. Thus, $m/2$ bits were available for
each variable of the fitness function. Given that this is a single-variable
problem, the function made use of only $m/2$ bits. For the minimization
shown in Figure 11, the range of F1 was from $f(-2^{12})$ to
$f(2^{12} - 1)$. Thus, the minimum possible value in the range is
$f(-2^{12}) = -6.8971 \times 10^{10}$. As depicted, the global minimum was
reached approximately halfway through the 100 generations, which
demonstrates the functionality of the proposed system.
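The quoted range and minimum can be reproduced with a short calculation.
The sketch below assumes that the m/2 = 13 bits encode a signed integer in
two's complement, which is consistent with the stated range but is not
spelled out in the text; the variable names are illustrative.

```python
def f1(x: int) -> int:
    return x**3 - 15 * x**2 + 500      # Equation 24


bits = 26 // 2                          # m/2 bits for the single variable
lo = -(2 ** (bits - 1))                 # assumes two's complement encoding
hi = 2 ** (bits - 1) - 1
# Exhaustive search over the 2**13 representable values.
minimum = min(f1(x) for x in range(lo, hi + 1))
```

The exhaustive search is only feasible here because the domain has 8192
points; it serves as a reference against which the GA result in Figure 11
can be compared.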
Similarly, the fitness function 3 (Equation 26) was also minimised, as shown
in Figure 12 but with a population size of N = 64 and m = 20. In this case,
f(x, y) only allows results greater than or equal to zero when working in the
domain of real numbers, so the smallest possible value is zero. The parallel
Genetic Algorithm implemented on FPGA proposed herein has managed to
minimize F3 in a little over 20 iterations of the algorithm. This is not a fixed
value, since the GA is a stochastic algorithm, but with this value it is possible
to have an idea of the number of generations required for convergence.
Figure 11: Optimising F1 (function value, in units of $10^{10}$, versus the
k-th generation).
Both results were obtained by averaging multiple runs. It is also important
to emphasise that the range of values to be calculated, the bit width (m),
the decimal precision and the possibility of exploring negative numbers are
all parameters of the LUT (Section 3.1) and configurable by the user. As
already mentioned, the choice between maximising and minimising the function
to be optimised is also configurable.
Table 1 presents the synthesis results in the target FPGA for various
population sizes and m = 20. It is clear in all scenarios that the occupied
area, the clock and, consequently, the number of generations per second,
$R_g$, are considerably sensitive to the population size, N. Here, $R_g$
represents the number of generations performed in the GA per second, which
can also be interpreted as the number of possible solutions that the system
provides in that interval. Equation 22 states that this number is equal to
$3/T_g$, that is, the inverse of the time of each GA generation divided by
three. The factor arises because the two ROM memories placed in series in
the FFM (Section 3.1) introduce two delays. Consequently, a new GA
population is generated only every three system clocks.
The clock is the maximum frequency at which the system operates when
implementing this architecture, and it does not take into account the delays
required to generate a new population. The clock represents only the
hardware speed for that specific implementation, so it is 3× higher than the
number of generations performed in the GA per second.
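The relation between the clock and the generation rate can be made concrete
with the clock values reported in Table 1. The Python sketch below simply
divides each clock by the three cycles needed per generation; the names
`CLOCK_MHZ` and `generations_per_second` are illustrative.

```python
# Clock frequencies in MHz for each population size N (Table 1, m = 20).
CLOCK_MHZ = {4: 50.28, 8: 49.32, 16: 49.32, 32: 48.51, 64: 34.56}
CLOCKS_PER_GENERATION = 3    # a new population every three system clocks


def generations_per_second(clock_mhz: float) -> float:
    return clock_mhz * 1e6 / CLOCKS_PER_GENERATION


rates = {n: generations_per_second(f) for n, f in CLOCK_MHZ.items()}
# Time per generation in nanoseconds.
t_g_ns = {n: 1e9 / r for n, r in rates.items()}
```

For N = 64 this gives roughly 11.5 million generations per second, or about
87 ns per generation, in line with the values discussed for Table 1.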
The area occupied by registers (Figure 13), presented in the second column
of Table 1, is mainly due to the storage of population values in RX (Figure
1) and to the pseudo-random number generators. This occupation increases
linearly with N, since the larger the population, the greater the number of
RX registers required, as well as the number of operations that require
pseudo-random number generators. Figure 13 shows this growth graphically
with a linear interpolation.
Figure 12: Optimising F3 (function value versus the k-th generation).
Table 1: GA synthesis on FPGA for m = 20.
N    Registers       Logic cells      Clock    Generations per
     (flip-flops)    (LUTs)           (MHz)    second (×10^6)
4    457             592 (1%)         50.28    16.76
8    839             1,558 (1%)       49.32    16.44
16   1,616           4,400 (1%)       49.32    16.44
32   3,225           15,908 (4%)      48.51    16.17
64   6,598           58,875 (16%)     34.56    11.52
The occupation of logic cells (LUTs), presented in the third column of Table
1, grows nonlinearly with N, as can be seen in Figure 14. This nonlinear
growth is caused by the selection module (Section 3.2), in which each j-th
module, SMj, contains three N-input multiplexers (SMMUX1j, SMMUX2j and
SMMUX3j).
According to [26], each Virtex 7 logic cell can implement one 4-input MUX;
thus, to build an N-input multiplexer, approximately $N/4$ logic cells are
required, totalling approximately $3N/4$ cells for each SMj (SMMUX4j,
SMMUX5j and SMMUX6j have not been considered). Since there are N SM modules,
there are approximately $3N^2/4$ logic cells for each bit of the input bus
of the MUX. This explains the quadratic growth in the use of logic cells.
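The estimate above can be evaluated for the tested population sizes. The
Python sketch below multiplies the per-bit figure of $3N^2/4$ by the bus
width m, which is an assumption about how the per-bit estimate scales to the
whole bus; it covers only the three N-input multiplexers of each selection
module, so it is expected to differ from the synthesis totals.

```python
def mux_lut_estimate(n: int, m: int) -> int:
    """Approximate LUTs used by SMMUX1..3 over all N selection modules.

    3 multiplexers * N/4 LUTs each * N modules, repeated for each of the
    m bits of the bus (assumed scaling).
    """
    return 3 * n * n // 4 * m


estimates = {n: mux_lut_estimate(n, m=20) for n in (4, 8, 16, 32, 64)}
```

For N = 64 and m = 20 the estimate is 61,440 LUTs, the same order of
magnitude as the 58,875 LUTs reported in Table 1.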
In this context, it is important to note that the implementation with N = 64
individuals does not reach even one fifth of the FPGA cells (around 16% of
the Virtex 7). This is a positive indicator for implementations with larger
populations.
Figure 13: Registers' occupation in the FPGA varying with N.
Finally, regarding the table, the last two columns show the clock and the
number of generations performed in the GA per second for each value of N;
the speed decreases as the population grows. Theoretically, if all the
modules were independent (specific to each individual), this reduction
should not happen. However, in the selection modules, SMj (Section 3.2),
there is a dependency between the N chromosomes (due to information
sharing), which joins paths in the circuit and thus increases the processing
time. On the other hand, the reduction rate is not linear, which favours the
implementation. It is also worth noting that, even with the reduction, each
GA generation of 64 chromosomes is produced in $T_g \approx 87$ ns, in other
words about 11.5 million generations per second. This result has a very
significant impact and makes the use of GAs possible in several real-time
embedded applications such as robotics, telecommunications and others.
Figure 15 shows the influence of the bit width m on the clock for a GA with
N = 32. The processing speed decreases as the number of bits increases;
however, this drop is not significant. The clock varies by only slightly
more than 1 MHz between the implementation using m = 20 and the one using
m = 28. The interpolation shown in the figure suggests a linear decrease.
Figure 16 illustrates the relationship between the number of LUTs used in
the FPGA and the bit width m for three different population sizes. The
largest difference between the quantities of LUTs is observed for m = 28,
mainly due to the nonlinear growth of these components with N. Moreover, as
already seen in Figure 15, increasing m also slows the processing speed,
$R_g$.
Analyzing the synthesis results, it was noticed that different fitness
functions such as F1, F2 and F3 did not result in significant differences in
the consumption of LUTs and registers in the FPGA, and no significant
differences were observed in $R_g$ either. This result was expected, since
the only thing that changes with the fitness function is the content of the
FFM LUTs. Thus, it is possible to extend this reasoning and assert that the
values in Table 1 hold for any other function in the form of Equation 11
using m = 20 bits.
Figure 14: FPGA LUTs occupancy varying with N
5. Comparisons with state of the art works
In what follows, the results obtained by the proposed implementation are
compared with equivalent results found in state-of-the-art works. The
comparisons, summarised in Table 2, were made with parameters as similar as
possible. The first column of the table lists the references compared, the
next two columns show the parameters of the GAs, then the times reported by
the state-of-the-art works are shown and, finally, the times obtained by the
implementation presented here and the respective speedups are displayed.
The system presented by [9], a high-speed implementation of a GA on FPGA,
reported a runtime of 0.21 milliseconds for k = 100 generations and a
population of size N = 32. For the same settings, the system proposed here
achieved a time of ≈ 6.18 microseconds, which is ≈ 34× faster.
Similarly, [24] also presented a GA on FPGA with population size N = 32,
chromosome size m = 16 and k = 60 generations. That implementation was
validated with the traveling salesman problem and resulted in a running time
of 1.702 ms. Although a test with the same parameters as [24] has not been
performed here, a comparison can still be made due to the versatility of
problems that can be solved by different LUTs in the FFM. A GA with k = 60,
N = 32 and m = 20 can be solved in ≈ 3.71 microseconds in the work presented
here, that is, ≈ 459× faster than in [24].
Figure 15: Decrease of $R_g$ (MHz) when varying the word width m.
In similar fashion, the work of [6] presented a highly programmable GA IP
core on FPGA. For a setting of k = 32 and N = 32, the authors reported a
speedup of 5.16× over an equivalent software implementation, which had a
running time of 37.615 milliseconds. The implementation presented here
handled the equivalent situation in ≈ 1.98 microseconds, which represents a
speedup of ≈ 19007× over the serial implementation shown in [6] and of
≈ 3683× over the GA IP core proposed by [6].
Finally, the implementation proposed here can also be compared to the work
published in [10]. As already mentioned in Section 1.1, this article
presents the OIMGA, an FPGA implementation of a monogenetic algorithm that
retains only the best chromosome of the generation. In one of the tests
performed to validate the implementation, the authors optimised a
one-variable function with a population of N = 64 in ≈ 0.8 seconds. If the
proposed parallel GA took this time to solve the same function, it would
process k ≈ 9.2 million generations. Of course, this value is unreasonable
for that function. As shown previously in the results, k = 100 generations
was the default value to optimise functions of one or two variables. Thus,
even if the number of generations needed to optimise the same function was
k = 500 (a generous estimate), the time required by the implementation
proposed by [10] would still be ≈ 18432× higher.
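The obtained times and speedups quoted in this section follow from dividing
the number of generations k by the generation rate of Table 1 (m = 20). The
Python sketch below reproduces that arithmetic; the function name `compare`
and the data layout are illustrative, and the reference time for the IP core
of [6] is derived from the 37.615 ms software time and the 5.16× speedup
reported by its authors.

```python
# Generations per second for the population sizes used in the comparisons.
RATE = {32: 16.17e6, 64: 11.52e6}


def compare(k: int, n: int, t_ref: float) -> tuple[float, float]:
    """Return (time of the proposed GA in seconds, speedup over t_ref)."""
    t_ours = k / RATE[n]
    return t_ours, t_ref / t_ours


cases = {
    "[9]": compare(100, 32, 0.21e-3),
    "[24]": compare(60, 32, 1.702e-3),
    "[6]": compare(32, 32, 37.615e-3 / 5.16),
    "[10]": compare(500, 64, 0.8),
}
# Generations the proposed GA (N = 64) would complete in 0.8 s.
generations_in_0_8_s = 0.8 * RATE[64]
```

Small differences in the last digit with respect to the quoted speedups can
arise from rounding of the intermediate times.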
Figure 16: Relation between the LUTs usage with the increase of m (curves
for population sizes 16, 32 and 64).
Table 2: Comparative table with state of the art works
Reference   N    k     Reference time   Obtained time   Speedup
[9]         32   100   0.21 ms          6.18 µs         34
[24]        32   60    1.702 ms         3.71 µs         459
[6]         32   32    7.29 ms          1.98 µs         3683
[10]        64   500   0.8 s            43.40 µs        18432
6. Conclusion
After the presentation of the results in Section 4, it can be affirmed that
the proposed implementation was validated and fulfilled its objective of
being a high-performance parallel implementation of a GA. The synthesis
results confirmed that the proposed parallel implementation of the GA on
FPGA is able to optimize a wide range of functions in a viable time for
critical applications that have tight time constraints or require a large
amount of data to be processed in a short interval.
Comparisons with other implementations found in the literature in Section 5
reinforce the high speed achieved by the implementation developed. This
enables the use of this system in a commercial context for applications such
as the Tactile Internet, robotics, real-time applications and medical
applications. In addition, this system can serve as an acceleration tool for
hardware systems that make use of genetic algorithms.
In addition to the high performance achieved, the small area consumption of
the implementation developed here is a notable feature. This makes it
possible for other systems to also be embedded in the FPGA, since the
on-board GA occupies less than $1/5$ of the logic cells of the Virtex 7 used
as a test. This low consumption of logic cells is essential for applications
where area is the biggest constraint, such as space applications.
The experiments carried out indicated that the tested sizes of N are
sufficient to solve most practical problems, in agreement with the
literature. It was also found that the duration of the GA in iterations (k)
does not need to be greater than a few hundred, and that even k = 100 is a
reasonable number of generations for a GA. The parameter m proved to be of
great importance, since it directly affects the GA convergence speed, the
area occupied on the FPGA, the response precision, as well as the achieved
$R_g$.
References
[1] A. Rodriguez, F. Moreno, Evolutionary computing and particle filtering:
A hardware-based motion estimation system, IEEE Transactions on Com-
puters 64 (2015) 3140–3152.
[2] N. Instruments, 2011, Understanding parallel hardware: Multiprocessors,
hyperthreading, dual-core, multicore and fpgas, URL:
http://www.ni.com/tutorial/6097/en/.
[3] Y. Jewajinda, P. Chongstitvatana, Hardware architecture and fpga
implementation of a parallel elitism-based compact genetic algorithm, in:
TENCON 2009-2009 IEEE Region 10 Conference, IEEE, 2009, pp. 1–6.
[4] V. Tirumalai, K. G. Ricks, K. A. Woodbury, Using parallelization and
hardware concurrency to improve the performance of a genetic algorithm,
Concurrency and Computation: Practice and Experience 19 (2007) 443–
462.
[5] J. P. Uyemura, Introduction to VLSI circuits and systems, Wiley India,
2002.
[6] P. Fernando, H. Sankaran, S. Katkoori, D. Keymeulen, A. Stoica, R. Zebu-
lum, R. Rajeshuni, A customizable fpga ip core implementation of a general
purpose genetic algorithm engine, in: Parallel and Distributed Processing,
2008. IPDPS 2008. IEEE International Symposium on, IEEE, 2008, pp.
1–8.
[7] T. C. Oliveira, V. P. Júnior, An implementation of compact genetic algo-
rithm on fpga for extrinsic evolvable hardware, in: Programmable Logic,
2008 4th Southern Conference on, IEEE, 2008, pp. 187–190.
[8] F. Mengxu, T. Bin, Fpga implementation of an adaptive genetic algorithm,
in: 2015 12th International Conference on Service Systems and Service
Management (ICSSSM), IEEE, 2015, pp. 1–5.
[9] M. Vavouras, K. Papadimitriou, I. Papaefstathiou, High-speed fpga-based
implementations of a genetic algorithm, in: Systems, Architectures, Model-
ing, and Simulation, 2009. SAMOS’09. International Symposium on, IEEE,
2009, pp. 9–16.
[10] Z. Zhu, D. J. Mulvaney, V. A. Chouliaras, Hardware implementation of a
novel genetic algorithm, Neurocomputing 71 (2007) 95–106.
[11] S. D. Scott, A. Samal, S. Seth, Hga: A hardware-based genetic algorithm,
in: Third International ACM Symposium on Field-Programmable Gate
Arrays, 1995, pp. 53–59. doi:10.1109/FPGA.1995.241945.
[12] G. Mingas, E. Tsardoulias, L. Petrou, An fpga implementation of the smg-
slam algorithm, Microprocessors and Microsystems 36 (2012) 190–204.
[13] L.-M. Ionescu, A. Mazare, A.-I. Lita, G. Serban, Fully integrated artificial
intelligence solution for real time route tracking, in: 2015 38th International
Spring Seminar on Electronics Technology (ISSE), IEEE, 2015, pp. 536–
540.
[14] H. Qu, K. Xing, T. Alexander, An improved genetic algorithm with co-
evolutionary strategy for global path planning of multiple mobile robots,
Neurocomputing 120 (2013) 509–517.
[15] J. R. Koza, Genetic evolution and co-evolution of computer programs,
Artificial life II 10 (1991) 603–629.
[16] H. Merabti, D. Massicotte, Hardware implementation of a real-time genetic
algorithm for adaptive filtering applications, in: Electrical and Computer
Engineering (CCECE), 2014 IEEE 27th Canadian Conference on, IEEE,
2014, pp. 1–5.
[17] N. Sehatbakhsh, M. Aliasgari, S. M. Fakhraie, Fpga implementation of
genetic algorithm for dynamic filter-bank-based multicarrier systems, in:
Design & Technology of Integrated Systems in Nanoscale Era (dtis), 2013
8th International Conference on, IEEE, 2013, pp. 72–77.
[18] Y. Chen, Q. Wu, Design and implementation of pid controller based on
fpga and genetic algorithm, in: Electronics and Optoelectronics (ICEOE),
2011 International Conference on, volume 4, IEEE, 2011, pp. V4–308.
[19] L. Guo, A. I. Funie, D. B. Thomas, H. Fu, W. Luk, Parallel genetic algo-
rithms on multiple fpgas, ACM SIGARCH Computer Architecture News
43 (2016) 86–93.
[20] A. Lotfi, A. Rahimi, A. Yazdanbakhsh, H. Esmaeilzadeh, R. K. Gupta,
Grater: An approximation workflow for exploiting data-level parallelism in
fpga acceleration, in: 2016 Design, Automation & Test in Europe Confer-
ence & Exhibition (DATE), IEEE, 2016, pp. 1279–1284.
[21] J. H. Holland, Adaptation in natural and artificial systems: an introductory
analysis with applications to biology, control, and artificial intelligence., U
Michigan Press, 1975.
[22] M. R. Noraini, J. Geraghty, Genetic algorithm performance with different
selection strategies in solving tsp, World Congress on Engineering 2011 Vol
II (2011).
[23] N. Nedjah, L. de Macedo Mourelle, An efficient problem-independent hard-
ware implementation of genetic algorithms, Neurocomputing 71 (2007)
88–94.
[24] K. Deliparaschos, G. Doyamis, S. Tzafestas, A parameterised genetic algo-
rithm ip core: Fpga design, implementation and performance evaluation,
International Journal of Electronics 95 (2008) 1149–1166.
[25] M. Goresky, A. Klapper, Pseudonoise sequences based on algebraic feed-
back shift registers, IEEE Transactions on Information Theory 52 (2006)
1649–1662.
[26] K. Chapman, Multiplexer design techniques for datapath performance with
minimized routing resources, Application Note: Spartan-6 Family, Virtex-6
Family, 7 Series FPGAs, 2014.
