University of Rhode Island

DigitalCommons@URI
Open Access Master's Theses
2022

Parallelized implementation of a Genetic Algorithm on a Field
Programmable Gate Array (FPGA) to provide heuristic solutions
for the Capacitated Vehicle Routing Problem (CVRP)
Maximilian Jakob Heer
University of Rhode Island, maximilian_heer@uri.edu

Follow this and additional works at: https://digitalcommons.uri.edu/theses

Terms of Use
All rights reserved under copyright.
Recommended Citation
Heer, Maximilian Jakob, "Parallelized implementation of a Genetic Algorithm on a Field Programmable
Gate Array (FPGA) to provide heuristic solutions for the Capacitated Vehicle Routing Problem (CVRP)"
(2022). Open Access Master's Theses. Paper 2240.
https://digitalcommons.uri.edu/theses/2240

This Thesis is brought to you for free and open access by DigitalCommons@URI. It has been accepted for inclusion
in Open Access Master's Theses by an authorized administrator of DigitalCommons@URI. For more information,
please contact digitalcommons@etal.uri.edu.

PARALLELIZED IMPLEMENTATION OF A GENETIC ALGORITHM ON A
FIELD PROGRAMMABLE GATE ARRAY (FPGA) TO PROVIDE
HEURISTIC SOLUTIONS FOR THE CAPACITATED VEHICLE ROUTING
PROBLEM (CVRP)
BY
MAXIMILIAN J. HEER

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
ELECTRICAL ENGINEERING

UNIVERSITY OF RHODE ISLAND
2022

MASTER OF SCIENCE THESIS
OF
MAXIMILIAN J. HEER

APPROVED:
Thesis Committee:
Major Professor

Resit Sendag
Manbir Sodhi
Yan Sun
Brenton DeBoef
DEAN OF THE GRADUATE SCHOOL

UNIVERSITY OF RHODE ISLAND
2022

ABSTRACT
This thesis reports on the implementation of a Genetic Algorithm for finding
heuristic solutions to the Capacitated Vehicle Routing Problem (CVRP) on High-Performance Field Programmable Gate Arrays (FPGAs). The heuristic approach
to the class of NP-hard problems related to the Travelling-Salesman-Problem
(TSP) has a long tradition in computer science, with this thesis being directly
based on an implementation of a specialized Genetic Algorithm for a Graphic
Processing Unit (GPU). In my work, I describe how the algorithmic structure
can be directly transformed into an adapted computing architecture with special
focus on the highest possible degree of parallelism. For that, the proposed design
makes use of the flexibility of FPGAs and self-defined hardware in terms of data
representation, parallelized computation and manipulation as well as memory
management. The resulting architecture is tested in behavioral simulation for
verification purposes, and deployed as a bitstream to an actual high-performance FPGA to
resemble a potential Application Specific Integrated Circuit (ASIC) for logistic
planning. In this configuration, state-of-the-art benchmark problems are run on
the device to compare the hardware implementation with the already existing
approaches on GPU and CPU (Central Processing Unit) in terms of speed, quality
of solutions and energy efficiency.

ACKNOWLEDGMENTS
First of all, I would like to express my deepest gratitude to my main
thesis advisor Resit Sendag and to Manbir Sodhi, without whose help, inspiration
and experience this work would not have been possible. I would also
like to thank my academic co-workers Marwan Abdelatti, whose ideas on
the GPU-implementation of the problems were the starting point for this research,
and Jose Quevedo, who always was there to validate my results and exchange new
approaches on our joint efforts.
Furthermore, I would like to express my gratitude to Yan Sun and Koray Özpolat,
who were willing to participate in the thesis defence on short notice.
Special thanks must go to Willard Simoneau, whose incredible skills helped me
with more than one FPGA problem.
My time at the University of Rhode Island would not have been the same great
experience without the lively community of friends and housemates in the TI
and IEP Houses. Whenever I came back from hours of programming and testing,
there was someone to talk to, play cards with or have fun with.
Overall, this unique experience of my final year in the United States would not
have been possible without the support of both my German home university
TU Darmstadt and the University of Rhode Island, for which I would like to
thank those responsible in both institutions. Special thanks must also go to the
Cusanuswerk, which generously supported me with a scholarship for my studies
abroad.
Finally, I would like to thank my parents, grandparents and siblings, who have
always supported me with good advice and company since I first told them about
my plans to come to Rhode Island for this final year of my studies.


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER
1 Introduction
2 Literature Review
    2.1 Significance of the Capacitated Vehicle Routing Problem today
        2.1.1 Logistics and its demand for better algorithms
    2.2 Finding solutions for Vehicle Routing Problems
        2.2.1 Genetic Algorithms in CVRP - Related Work
        2.2.2 FPGAs in GA and CVRP - Related Work
    2.3 FPGA Architecture
3 Problem Setting
    3.1 General description of the problem - real world example
    3.2 Mathematical Notation and Problem Definition
4 Genetic Algorithms for the CVRP and their implementation on GPU
    4.1 Description of a Genetic Algorithm for the CVRP on GPU and CPU
        4.1.1 General Idea of Genetic Algorithms
        4.1.2 Adaption for the CVRP
    4.2 GPU-based algorithm
5 Necessary Adaptions of the Algorithm for a FPGA-based implementation
    5.1 FPGA Architecture and Properties
    5.2 Data Representation and Distance Calculation
    5.3 Algorithm implementation
        5.3.1 Macro-Algorithm: Handling of the population
        5.3.2 Micro-Algorithm: Processing of two individuals
6 Architectural Design
    6.1 Overall structure - overview of all used modules
    6.2 Central Controller
        6.2.1 Main FSM - the Algorithmic Controller
        6.2.2 Procedural control and memory communication - the Processing Node-Controller
        6.2.3 Linear Feedback Shift Register
        6.2.4 UART-Controller
    6.3 Population Memory
    6.4 Processing Nodes
        6.4.1 Communication Interface
        6.4.2 Data I/O in the Parent-FSM
        6.4.3 Algorithmic Processing in the Children-FSM
        6.4.4 Square-root calculator
        6.4.5 Evolution of the ProcessingNode-Design during development
    6.5 Design for Experimentation - the CVRPWrapper
7 Workflow of Design and Testing
    7.1 Selection of a FPGA device
    7.2 Hardware development toolchain
    7.3 Benchmark problems and software tools for adaptation
        7.3.1 Composition of a benchmark problem
        7.3.2 Software-toolchain for problem adaptation
8 Hardware Utilization
    8.1 Development stages of the Processing Node
    8.2 Hardware utilization of the overall design for different sizes
    8.3 Impact of elitism in the main controller
    8.4 Problem-dependent memory utilization
9 Comparison of the algorithmic results of a GA for CVRP on GPU, CPU and FPGA
    9.1 Design-decisions for an appropriate comparison of GPU, CPU and FPGA
        9.1.1 Test design
        9.1.2 Selection of benchmark problems
    9.2 Comparison results
        9.2.1 X-n101-k25 - without elitism and with non-euclidean distance calculation for 2-opt
        9.2.2 X-n101-k25 - with elitism and all-euclidean distance calculations for 2-opt and fitness
        9.2.3 P-n50-k7 - with elitism and all-euclidean distance calculations for 2-opt and fitness
        9.2.4 P-n70-k10 - with elitism and all-euclidean distance calculations for 2-opt and fitness
        9.2.5 X-n125-k30 - with elitism and all-euclidean distance calculations for 2-opt and fitness
        9.2.6 X-n139-k10 - with elitism and all-euclidean distance calculations for 2-opt and fitness
        9.2.7 X-n148-k46 - with elitism and all-euclidean distance calculations for 2-opt and fitness
        9.2.8 X-n167-k10 - with elitism and all-euclidean distance calculations for 2-opt and fitness
10 Conclusion and Future Work
LIST OF REFERENCES
BIBLIOGRAPHY

LIST OF FIGURES
Figure
1 Example for a map with 10 different cities and depot node placed in the middle
2 Composition of the identifier for a single city in the FPGA-implementation
3 Multiple cities form a whole Individual. The set depot-bits at several cities mark the end of subtours, where the truck has to return and start again at the depot.
4 Flowchart of the algorithm. The macro-algorithm deals with the handling of the population and the selection of individuals for processing as specified by the micro-algorithm. It involves steps such as crossover, mutation and 2-opt.
5 Comparison of GPU/CPU- and FPGA-implementation: The upper process resembles the generation-based population handling as implemented in the GPU, while the lower flow shows the idea of pooling, which is tailor-designed for the properties of a FPGA-platform.
6 Detailed processing flow for a pooling concept including an elitist list: The memory spaces of elitist lists are blocked and can't get overwritten, while the elitist list is updated after each crossover.
7 Visualization of the composition of the whole algorithm flow, consisting of initialization crossovers and standard crossovers. To get an idea of the distribution of both groups, one can have a look at a realistic example with a population of 2000 individuals: In this case, it turned out to be a good set-up to have 5000 initialization crossovers and 95000 standard crossovers.
8 Example of a crossover process as performed in the FPGA by copying a chunk of cities from one parent and then filling up the empty spots with missing nodes from the second parent.
9 Example of a mutation process as performed in the FPGA by reversing the order of the cities within a randomly defined chunk of nodes within the child-individual.
10 Example of a depot insertion process as performed in the FPGA by iterating through the child and adding the node demands to compare them to the delivery capacity of the truck in the problem.
11 Example of the application of the 2-opt-heuristic on a subtour, in this case with the squared distance metric.
12 Example of distance calculation as performed in the FPGA by iterating through the child and calculating all Euclidean distances between subsequent nodes and depot stops.
13 Overview of the whole implemented computational design, including a central controller that transmits data from the population memory to the Processing Nodes, which are in charge of the micro algorithm as discussed before. Data exchange with a connected PC is ensured via a UART-communicator.
14 Logic design of the run counter used to measure the number of executed crossover operations.
15 Finite State Machine of the main controller used for implementation of the macro-algorithm and control of the workflow of the whole design.
16 Data flow for the selection and approval of possible parents for a next crossover run in one of the Processing Nodes if no elitism is applied.
17 Selection of prospective parent individuals from the general population as well as specifically from the elitist list.
18 Calculation of the approval signal for the prospective parent individuals as well as generating an indication for their status as elitist individuals.
19 Data and control flow for the calculation of the access priority weights and the determination of memory access rights for the different Processing Node FSMs.
20 Main part of the Finite State Machine of the main controller used for controlling the communication with the Processing Nodes, in which the micro-algorithm is implemented. Only brief descriptions of the several states are shown, and the states for elitism are also included.
21 Main part of the Finite State Machine of the main controller used for controlling the communication with the Processing Nodes. Full details regarding all used variables are shown. The states related to the Elitist list are not depicted.
22 Second part of the Finite State Machine of the main controller in charge of controlling the Processing Nodes, which specifically handles the data exchange with the UART-modules.
23 Basic structure of the Linear Feedback Shift Registers with an XNOR-gate as feedback implementation.
24 Example of the storage organization of a single individual with 7 nodes and 20 bits per node in the 72-bit memory lines of the BRAM used as population memory.
25 Structural overview of the Processing Node, including a common main FSM and two separate FSMs for parallel processing of two children. Notable is the representation of parents and children in register files, manipulated by those FSMs.
26 Data path for the comparison of the fitness values to determine the dominance of individual solutions in comparison to the three other solutions present in the Processing Node
27 Main Finite State Machine of the Processing Node, which is in charge of handling the data I/O with the main controller and storing the incoming values in the parent-register files
28 Simplified depiction of the FSM for the microalgorithmic processing in the Processing Node. Includes all waiting states for the calculation of euclidean distance values.
29 Data flow for the comparison of a node from parent 2 to be inserted into the next empty spot of child 1.
30 First part of the Finite State Machine of the Processing Node that is in charge of processing the children, including crossover, mutation and depot insertion. Complete overview of all used variables and registers.
31 Calculation network implemented to check for the necessity of swaps within the 2-opt algorithm. All calculation-related circuits, especially the adders and multipliers, are later mapped onto the dedicated DSP-slices of the FPGA.
32 Second part of the Finite State Machine of the Processing Node that is in charge of the 2-opt-heuristic and distance calculation for the children individuals. The waiting-states for the square-root calculation are not included.
33 Visualization of the critical points of the different design iterations for the Processing Nodes.
34 Design diagram for the wrapper module and its connections to the main controller of the hardware architecture
35 Processing flow in the “main float.py”-python script: Based on the problem-file and the optimal tour, the basic parameters and the memory configuration file for the hardware adaption as well as a coordinate list for later visualization are created.
36 Example of the coordinate-adaption mechanism in the “main float.py”-python script: Coordinates are shifted and scaled to integers. Then, the final data representation can be calculated.
37 Processing work flow in the UART-reader software: With information taken from the “problem name processed.txt” file, the received bitstream is split up and interpreted, so that the feasibility of the transmitted route can be checked.
38 Diagram of the hardware utilization of the Processing Node in the various design stages. For Design #6, no data was recorded for the clock cycles.
39 Diagram of the population memory utilization for a 20n, a 10n, a 3n and a 1n population for the relevant Uchoa-test problems. Obviously, the usage of the larger population sizes is limited to the small problems of that set.
40 Plotting of the problem sets of M-n101-k10 and X-n101-k25 (Uchoa-benchmark set). Obviously, the nodes are grouped in the old M-n101-k10 test map, while the more modern X-n101-k25 map shows a clear distribution of the nodes.
41 Performance of GPU, CPU and FPGA for the X-n101-k25 problem. Displayed is the quality gap to the best known solution over time.
42 Routing for X-n101-k25 after the first optimization run with 100,000 crossovers on the FPGA and after the last of such optimization runs.
43 Comparison of the crossover processes calculated on FPGA, GPU and CPU in one second for the current X-n101-k25 problem.
44 Comparison of the power consumed by FPGA, GPU and CPU for the current X-n101-k25 problem. Data for the FPGA is taken as an estimation from the VIVADO-CAD-tool.
45 Performance of GPU, CPU and FPGA for the X-n101-k25 problem. This time, the FPGA was set up with an Elitist list and Euclidean distance metrics for both fitness- and 2-opt calculation. Displayed is the quality gap to the best known solution over time.
46 Performance of GPU, CPU and FPGA for the P-n50-k7 problem. The FPGA was set up with an Elitist list and Euclidean distance metrics for both fitness- and 2-opt calculation. Displayed is the quality gap to the best known solution over time.
47 Performance of GPU, CPU and FPGA for the P-n70-k10 problem. The FPGA was set up with an Elitist list and Euclidean distance metrics for both fitness- and 2-opt calculation. Displayed is the quality gap to the best known solution over time.
48 Performance of GPU, CPU and FPGA for the X-n125-k30 problem. The FPGA was set up with an Elitist list and Euclidean distance metrics for both fitness- and 2-opt calculation. Displayed is the quality gap to the best known solution over time.
49 Performance of GPU, CPU and FPGA for the X-n139-k10 problem. The FPGA was set up with an Elitist list and Euclidean distance metrics for both fitness- and 2-opt calculation. Displayed is the quality gap to the best known solution over time.
50 Performance of GPU, CPU and FPGA for the X-n148-k46 problem. The FPGA was set up with an Elitist list and Euclidean distance metrics for both fitness- and 2-opt calculation. Displayed is the quality gap to the best known solution over time.
51 Performance of GPU, CPU and FPGA for the X-n167-k10 problem. The FPGA was set up with an Elitist list and Euclidean distance metrics for both fitness- and 2-opt calculation. Displayed is the quality gap to the best known solution over time.
52 Design study for a hybrid computer system consisting of CPU for the macro-algorithm and one or multiple FPGAs for the micro-processing steps, connected via a high-bandwidth interconnect.

LIST OF TABLES
Table
1 Assignment of the individuals to the re-transmission slots due to their dominance regarding the fitness values
2 Comparison of the technical specifications of the high-performance FPGAs by Intel-Altera and Xilinx
3 Hardware footprint of a single processing node in the various Design steps, configured for the problem set n16 k8. For Look-Up-Tables (LUTs), Flip Flops (FFs) and the Digital Signal Processing slices (DSPs), the percentage usage of all available hardware resources of that kind is specified.
4 Percentage hardware utilization and maximum reachable clock speed for the full architecture with two Processing Nodes and BRAM-based population memory, configured for two of the Uchoa test problems. It is important to point to the different population sizes for the two architectures as well as the fact that the X-n376-k94 problem was synthesized with a slightly different architectural implementation from an older point of development. The latest version of the hardware would cause a slightly higher hardware usage, especially for the LUTs.
5 Percentage hardware utilization and maximum reachable clock speed for the full architecture with two Processing Nodes and BRAM-based population memory, configured for the X-n101-k25 Uchoa-problem. Shown is the difference between a design with and without Elitist list, where the Elitist list holds references to 100 nodes and therefore 5% of the population.
6 Various aspects of the memory usage of the GA-hardware configured for the relevant problems of the Uchoa-test set.

CHAPTER 1
Introduction
One of the defining trends of the 21st century is without doubt the
ongoing globalization, which not only requires, but also supports the further
development of worldwide logistics. Since there is not only economic pressure,
but also a drive to minimize fuel consumption and the emission of carbon
dioxide in the field of logistics today, companies work on the optimization of
their supply chains and the delivery of products [1, 2]. However, they very
often face problems of complexity and efficiency in computational approaches.
One of the best-known challenges in this field of logistics-related issues is the
Capacitated Vehicle Routing Problem [3]. It seeks the optimal route of a
truck with limited delivery capacity to serve multiple customers at different
positions with different delivery demands, each of which must be served in a
single visit and not in multiple runs. Therefore, the main difference to the even
older “Travelling Salesman Problem” is the necessity to also include stops at a
depot, whose position is known in advance and which can be visited to reload the
truck for further deliveries. Theoretical research has proven that this problem is
NP-hard, so no algorithm is known that solves it exactly in polynomial time, and
exact solutions become impractical for large numbers of customers. For
this reason, research has mostly concentrated on approaching the CVRP with
heuristic algorithms, which use a kind of directed “trial and error” to
optimize the overall distance of the delivery route in acceptable time. Only
such a time-efficient implementation for large real-world problems would allow
real-time planning of - for example - the best driving route of a mail truck or
delivery truck.
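As a rough illustration of the problem described above, the following Python sketch computes the length of a delivery tour for a given visiting order, sending the truck back to the depot whenever the next customer's demand would exceed the remaining capacity. The function name `route_length`, the coordinate and demand lists and the greedy reload rule are assumptions made for this example; they are not taken from the implementation discussed in this thesis.

```python
import math


def route_length(depot, customers, demands, capacity, order):
    """Length of a tour that visits customers in `order`.

    Hypothetical helper for illustration: the truck returns to the depot
    whenever serving the next customer would exceed its capacity, and it
    returns to the depot once more at the very end.
    """
    total = 0.0
    load = 0
    position = depot
    for idx in order:
        if demands[idx] > capacity:
            raise ValueError("a single demand exceeds the truck capacity")
        if load + demands[idx] > capacity:
            # Close the current subtour and reload at the depot.
            total += math.dist(position, depot)
            position = depot
            load = 0
        total += math.dist(position, customers[idx])
        position = customers[idx]
        load += demands[idx]
    # Final return to the depot.
    total += math.dist(position, depot)
    return total


if __name__ == "__main__":
    depot = (0, 0)
    customers = [(3, 4), (6, 8), (0, 5)]
    demands = [2, 2, 2]
    print(route_length(depot, customers, demands, capacity=4, order=[0, 1, 2]))
```

With a capacity of 4, the truck in this small example has to return to the depot after the second customer, which makes the tour longer than it would be with a capacity of 6. A heuristic for the CVRP searches for the visiting order that minimizes this length.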
In recent years, research has mainly progressed along two approaches to this problem:
On the one hand, bionic algorithms such as the Ant-colony-algorithm or the
large group of Genetic Algorithms have been examined, while on the other hand,
especially the latest publications have also focused on the question of how Machine
Learning algorithms can be applied to the CVRP. From the perspective of
computer architecture and hardware engineering, it is remarkable that FPGAs
have only very rarely been chosen as a computing platform, while GPUs in particular
have seen a dramatic rise in published works as the devices of choice. Such a
GPU-based approach is also under development at the University of Rhode Island,
which led to the initial idea of a comparative implementation on FPGA and also
CPU to examine the strengths of those three different architectures for the given
problem [4].

In this thesis, special emphasis is placed on the transfer of the core ideas of
this specific algorithm into a specialized hardware architecture, targeting a
high-performance FPGA chip. This involves crucial changes in memory management,
calculation steps and even algorithmic procedures to make best use of the
capabilities of the target technology. Based on the description of this approach, the
results of this FPGA-based implementation are presented for large benchmark
problems with up to a few thousand customers, and those outcomes are compared to
what can be achieved with GPU and CPU. Furthermore, the hardware utilization
and energy efficiency of the proposed design are also analyzed. Finally, this thesis
concludes with a review and discussion of open questions and possible directions
for further research in the field.


CHAPTER 2
Literature Review
2.1 Significance of the Capacitated Vehicle Routing Problem today
To begin with, I would like to point out why it is of utmost importance for the
connected world to have efficient ways to solve Routing Problems, and especially
the Capacitated Vehicle Routing Problem. Furthermore, it seems important to
give an overview of the ongoing research activities in this field at the University of
Rhode Island, which initiated the idea to implement a specialized hardware solver
on FPGA:
• The Vehicle Routing Problem has a special place in the field of
logistics, which is ever gaining importance in our fully connected world of the
21st century. Especially under the pressure to reduce not only costs, but also
emission of carbon dioxide, companies try to optimize the delivery of their
goods, but often run into problems due to the complexity and computational
challenges in terms of efficiency, run time and feasibility in a real-world
environment. The problem has been known in the research community since 1959,
but only the progress in computer hardware of the last decades made it
possible to really step forward in practical use of the algorithms [3]. One
can think of the optimization of the last-mile delivery in trucks on streets
as well as the route optimization of robotic autonomous vehicles in highly
automated warehouses - in any case, it makes sense to find the shortest and
therefore most economical route.
• Right now, a research team at the University of Rhode Island is developing
a software implementation of Genetic Algorithms for CVRP that uses
the possibilities of parallel execution on Graphic Processing Units (GPUs)
for efficient and fast optimization of routes [4]. At the same time, an
implementation of the parallelized GA-approach for modern multi-core CPUs
was initiated. Therefore, implementing the same mechanisms in hardware
gives the chance to compare the three different computational platforms
and study the feasibility of adaptive hardware in real-world applications.
This might answer questions about energy efficiency, the realizable degree of
parallelism on design-level and the specific problems and advantages of the
different technologies.
The class of Vehicle Routing Problems (VRP) consists of a large number of different
variants of the problem, like the already mentioned Capacitated Vehicle
Routing Problem (CVRP), but also a periodic VRP (PVRP), a dynamic VRP
(DVRP) or a time-critical problem called VRPTW (where TW is for Time
Windows) [5]. The CVRP, as defined by a strict limit of truck capacity and
one-time visits to all city-nodes, is not only the most realistic one for a large number
of logistic applications, but also provides some benefits for computation and is
therefore chosen for this project: While being more realistic than the traditional
Travelling-Salesman-Problem and adding an additional layer of complexity in
comparison to this oldest known problem in the field, it still restricts the possible
degrees of freedom and therefore enables a reasonable playing field for algorithmic
optimization.
2.1.1 Logistics and its demand for better algorithms
One of the most important driving forces of globalization has been an ever
evolving field of logistics, which allows not only worldwide distributed production
lines but also completely new forms of business: With the formation of global
players like Amazon.com or eBay in the 1990s, e-commerce and the consequently
growing need for home deliveries, logistics was finally no longer seen as a “necessary
evil” in business, but as a valuable asset and field for research and development.
Statistics support that point: In the United States alone, logistics expenses have
increased from roughly 420 billion dollars in 1980 to almost 1400 billion dollars in
2017 [1]. The importance of home delivery is underlined by the increase in
parcels delivered worldwide from 43 billion in 2014 to 131 billion in 2020 [6].
Especially the last two years under the influence of the Covid-pandemic have
fueled this development. However, this increased amount of transportation also
has unwanted side effects - most importantly today fuel consumption and emission
of CO2 and other climate-relevant gases: The CO2 emissions of road freight
vehicles alone increased from 1.7 gigatons in 2000 to 2.4 gigatons in 2018 [2]. Long
transportation routes do not only lead to increased pollution, but also to increased
production costs - especially in an age of skyrocketing fuel prices. Therefore,
it is no wonder that especially the big players in the market invest heavily
in R&D to improve their logistics: Amazon partnered with the Massachusetts
Institute of Technology to set up a science contest for route optimization based
on real-world data from Amazon-trucks [7], and Nvidia just recently announced a
new cooperation with Domino's Pizza to demonstrate how their GPU-accelerators
can help to improve delivery routes [8].
2.2

Finding solutions for Vehicle Routing Problems
For all kinds of Vehicle Routing Problems, exact solutions can only be found if
the problems are very limited in size, since the VRP belongs to the class of NP-hard
problems. In that case, it requires linear programming, dynamic programming
or tree search - basically composing all possible solutions and comparing their
outcomes [9, 10]. With a growing number of nodes in the problem, this
leads to an exponential increase of possible solutions and therefore to an
impractically long calculation time. For this reason, Genetic Algorithms only
consider a small, initially random selection of those possible routes and try to
recombine them to keep their positive properties in terms of fitness. All in all,
this principle, best known as “survival of the fittest”, leads to an ever-improving
quality of solutions without an exceptional workload for calculations [11], [4].
To avoid a loss of diversity during execution, which would lead into a kind of
algorithmic dead end, randomization is inserted via mutation of the solutions
generated by recombining existing routes.
Operating on a set of legal solutions for the problem instead of complete
randomness is what gives this approach a benefit in comparison to other heuristic
algorithms.
2.2.1

Genetic Algorithms in CVRP - Related Work

With the emerging field of parallel computation, which first opened a
realistic outlook towards heuristic solutions of different kinds of routing problems,
the suitability of Genetic Algorithms for that use case was carefully examined
[12, 13, 14, 15, 16, 17, 18, 19, 20]. First attempts searched for the best way to ensure
that all chromosomes in the population of a Genetic Algorithm follow the rules set
for the specific problem [13, 12]. After this challenge was mastered with sufficient
efficiency in computational load, research mostly focused on improving the
results of the heuristic approaches. It turned out that especially the diversity of the
population and the prevention of duplicates among the chromosomes contribute
to better outcomes of the algorithm. Approaches like local search for such identical
solutions in the population [13], distance algorithms for measuring similarity
between chromosomes [14, 15] and other coefficients for likeness [18] showed some
success for the initial target of shorter routes, but faced problems with high
computational workload.
As part of this search for better solutions and efficient calculations, several
different operators for crossover and mutation in the actual algorithm were also
evaluated [21]. They differ not only in the amount of information that is
passed from parent chromosomes to child individuals, but also in the required
computational resources like storage space or calculations. Therefore, a trade-off
between low complexity and high diversity of generated solutions can be found.
Those findings on the high computational workload also led to
the examination of the possibilities of parallel computing for Genetic Algorithms
for routing problems, which directly paved the way for the main idea of this study:
[22] used GPUs for adding new stops to pre-planned routes, while [23] combined
the control flow of a CPU with the parallel calculation power of a GPU for search
algorithms and local optimization.
Local optimization itself became a large topic since it was discovered how
drastically it can improve the quality of solutions found by Genetic Algorithms
[24, 4]. A good trade-off between efficiency and computational workload can
be found in the 2-opt-algorithm [25], which uses the theoretical idea of triangle
inequality to shorten routes by just swapping single nodes within given routes.
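A minimal sketch of a 2-opt pass on a single subtour may help to illustrate the
idea. It assumes that cities are given as (x, y) tuples and that the depot closes the
subtour on both ends. The names tour_length and two_opt are illustrative and not
taken from [25] or [4], and the sketch uses the common segment-reversal formulation
of 2-opt, which may differ in detail from the variants used in the cited work.

```python
import math

def tour_length(depot, cities):
    """Length of depot -> cities... -> depot using Euclidean distances."""
    stops = [depot] + list(cities) + [depot]
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))

def two_opt(depot, cities):
    """Repeatedly reverse a segment of the subtour while that shortens it."""
    best = list(cities)
    best_len = tour_length(depot, best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                # Reversing best[i..j] exchanges two edges of the tour for two others.
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = tour_length(depot, candidate)
                if cand_len < best_len - 1e-12:
                    best, best_len = candidate, cand_len
                    improved = True
    return best, best_len
```

The loop stops as soon as no reversal shortens the subtour any further, so the result
is a local optimum and not necessarily the shortest possible subtour.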
2.2.2

FPGAs in GA and CVRP - Related Work

As part of the movement towards parallelization of computation for the
Capacitated Vehicle Routing Problem, not only GPUs, but also FPGAs came
under observation as devices of choice for specialized algorithmic calculations.
Some of the earliest attempts utilized hybrid workstation settings, where a
relatively small FPGA is installed together with a powerful CPU [26, 27]. This
also leads to a hybrid algorithmic approach, where most of the complex control
flow is run on the processor, while the calculation-intensive parts - especially the
crossover and mutation functions - are run on the FPGA itself. Later research
[28, 29, 30] then moved towards a full implementation of a GA on the FPGA, but all of
those studies struggled with the limitations of the hardware resources available
at their time. Especially the high-bandwidth access to fast memory turned out
to be a crucial point of failure, since FPGAs were not equipped with large memory
elements at the time.
2.3

FPGA Architecture
Field Programmable Gate Arrays (FPGAs) belong to the most versatile and
flexible products of the semiconductor business [31]. Based on the wish to provide
adaptable computer hardware - so-called “glue logic” - those chips were designed to
be completely re-configurable by the end-user. A logic set-up has to be defined and
uploaded to give the device any desired functionality. While early chips were only
able to fit small and simple digital circuits, modern FPGAs belong to the latest
class of high-end semiconductor products and therefore allow the implementation
of multiple CPU cores, so-called soft cores. With this expansion of capability, the
role of FPGAs shifted from simple “on-the-fly” circuits towards a potent solution
for heterogeneous and adaptive computing, where specially designed hardware
may help to increase the speed and efficiency of calculation.
The architecture of an FPGA is often described as “island style” [32], since it consists
of multiple functional logic cells that are placed like islands in a surrounding
of interconnects. The typical FPGA used for parallel computation is based on
SRAM configuration, and therefore the logic is based on so-called Look Up Tables
(LUTs). Each LUT can hold a function of up to five variables, which is possible
through its inner structure: It is built as a cascade of multiplexers of different
sizes, and the configuration of the FPGA is set via the values fed from the SRAM
banks to the inputs of those switching elements. Furthermore, FPGAs contain additional
functional elements like Digital Signal Processing slices (DSP-slices), which provide
additional hardware for calculation, or Block-RAM memory cells.


CHAPTER 3
Problem Setting
3.1

General description of the problem - real world example
As discussed before, the general idea of the Capacitated Vehicle Routing
Problem can be described very easily by transferring it to a real-world example:
One can imagine a certain area with n households that have ordered heating oil
for the winter season. The supply company therefore has a list with the exact
location of all of these households as well as the amount of oil they ordered
(Figure 1). For the convenience of its customers, the company promises to fulfill
each order with a single delivery - for which it is obviously necessary that even
the largest possible demand of a single household does not exceed the capacity of
the company’s tank truck. However, this truck is certainly not voluminous enough
to satisfy all demands on a single trip. Therefore, it must return to the single
central oil depot of the company from time to time to refill and prepare for further
deliveries. Now, it is important for the company to find the most economical and
therefore shortest route for this truck to deliver oil to all of its customers: Not
only does the distance between the individual customers count, but also the way in
which they are assigned to the different tours based on their delivery demands.
While the first partial problem - only finding the shortest way through a number
of given points - would be the classic Travelling Salesman Problem, the additional
complexity of the demand-based tour assignment is the decisive point which turns
it into a Capacitated Vehicle Routing Problem.
Figure 1: Example for a map with 10 different cities and depot node placed in the
middle (axes x and y; depot at X = 15, Y = 15; delivery capacity = 10; each city is
annotated with its coordinates X, Y and its demand d; four subtours r1 to r4)

3.2

Mathematical Notation and Problem Definition
With the standard definitions of R and R+ for real numbers and positive real
numbers as well as I and I+ for integers and positive integers, the problem of

Capacitated Vehicle Routing can be defined as described in [4]:
Definition 3.2.1 An initial data set for CVRP consists of a single depot node {0}
at position $(x_0, y_0)$ and a set of n consumer nodes $C = \{1, 2, 3, \dots, n\}$ at
their respective positions $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_n, y_n)\}$,
where $x_0, \dots, x_n, y_0, \dots, y_n \in R, I^+$. Between the depot and the customer
nodes, an undirected graph $G = (N, A)$ can be set up, where $N = \{0\} \cup C$ is the
set of all nodes (the depot and the customer nodes) and
$A = \{(i, j) : i, j \in N, i < j\}$ is the set of edges between them. Each consumer
node has a non-negative demand $d_i$ that must be served at once, so that each
customer can only be visited once. The quality of the selected graph is calculated
via the summed costs of all edges, where each edge is assigned its Euclidean
distance as cost:
$$c_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \tag{1}$$

Therefore, the optimal solution to a CVRP has the minimum summarized distance
cost:
$$l^{*} = \underset{l \in P}{\operatorname{argmin}} \sum_{i,j=0}^{n} c_{ij}, \quad \forall\, i \neq j \tag{2}$$

The actual composition of the graph is defined by the fulfillment of the capacity
demands of the individual customer nodes: There are K trucks available with the
identical capacity $Q > d_i, \forall i \in C$, where K is not pre-defined and may be
determined by the final graph. Each vehicle serves a subtour of customer nodes
between two depot stops, so that the sum of the capacity demands of the nodes in
such a subtour may not exceed the vehicle capacity.
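
As a small worked instance of Equation (1) and of the capacity rule, assume (purely
for illustration) a depot at (15, 15), a customer j at (8, 17), a subtour whose
customers have the demands 2, 3 and 4, and a truck capacity of Q = 10:

$$c_{0j} = \sqrt{(15 - 8)^2 + (15 - 17)^2} = \sqrt{49 + 4} = \sqrt{53} \approx 7.28$$

$$\sum_{i \in \text{subtour}} d_i = 2 + 3 + 4 = 9 \le Q = 10$$

Under these assumed values, the edge between depot and customer contributes about
7.28 distance units to the sum in Equation (2), and the subtour is feasible because
its summed demand stays within the vehicle capacity.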


CHAPTER 4
Genetic Algorithms for the CVRP and their implementation on GPU
4.1

Description of a Genetic Algorithm for the CVRP on GPU and
CPU
As described before in the review of the existing literature on the topic,
popular and therefore commonly used algorithms for finding approximate,
heuristic solutions to NP-hard problems such as the Capacitated Vehicle Routing
Problem often have a bionic background, that is, they try to mimic processes from
nature to implement a natural improvement of solutions over time. Besides obviously
animal-inspired solutions such as ant- or beehive-algorithms, especially the class
of Genetic Algorithms has to be named as an important attempt at translating
biological strategies to other fields of problems. For this reason, the original
studies at the University of Rhode Island [4], on which this thesis is based, also
chose a specialized GA for the Capacitated Vehicle Routing Problem:
4.1.1

General Idea of Genetic Algorithms

The basic idea of Genetic Algorithms can be traced back to Charles Darwin’s
famous book “On the Origin of Species by Means of Natural Selection, or the
Preservation of Favoured Races in the Struggle for Life” [33] and further to the
formula that in a certain way summarizes this work - “Survival of the fittest” [34]:
Individuals in a population produce offspring that - due to the mixture of genetic
information of their parents and natural influences like mutation - have their own,
very specific properties. These properties may or may not be advantageous for
the new offspring in their habitat, which then causes them to produce more or
fewer children of their own. This adaptation to the conditions of the natural living
environment is what is commonly referred to as “fitness” in a biological sense.
Thus, one can expect that the whole population will tend towards a higher fitness
over time and will increasingly consist of individuals with a high degree of
adaptation. This simple mechanism is called “evolution” and as such is the mainspring
of the genesis of all species known today, including us humans.
The bionic idea is now to formulate an external development goal - therefore
a measurement for fitness - and then implement the most important steps of
evolution - thus crossover of genetic information and mutation for the generation
of new offspring as well as some kind of “survival mechanism” for the selection of
those individuals with the best fitness metrics as parents of a new generation.
4.1.2

Adaption for the CVRP

The main ideas for the adaptation of Genetic Algorithms for the Capacitated
Vehicle Routing Problem can be found in the implementation for parallel execution
on GPU [4]. Several important terms have to be introduced to understand how
this specific algorithmic approach can be applied to a logistics problem that
at first glance does not seem to have much in common with the evolutionary
development of the genome and therefore the phenotype of plants and animals:
• An individual in the CVRP-algorithm is a feasible solution to the given
problem: It is therefore a tour through all cities, each of which is visited only
once. Each subtour within this individual is a sequence of cities that is visited
without a depot stop in between. The summed demand of those cities
may not exceed the maximum delivery capacity of the truck in the problem.
• Each individual has a given fitness, which is the sum of the Euclidean
distances between all stops, including cities as well as returns to the depot,
within that individual. The target of the algorithm is to produce shorter
routes, that is, to find individuals with a lower fitness value.
• The set of all individuals at any given point in time during the execution of the
algorithm is called the population. It is therefore available as a reservoir for
individuals that are chosen as parents for new offspring, and it is also under
ongoing change due to the replacement of parent individuals with worse
fitness by their newly generated offspring individuals with lower overall
distance and thus better fitness values. The size of a population, technically
described as populationSize, is of critical importance for the speed and quality
of the outcome of the algorithm, since it limits the variability of parents and
their genetic information. Diversity, as this property of a population is also
called, is crucial for the development towards better fitted individuals: Only
a diverse genome pool allows for new influences in solutions and prevents
the algorithm from getting stuck in a local minimum of fitness evolution, which
could effectively turn out to be a dead end in development.
• Specifically, the GPU-solution defines a generation as a time-related state of
the population: The whole process of selecting parents, producing children
and comparing their fitness-values is tied to a certain starting state of the
population. Parents are selected all at once from this generation state of the
population, and later the exchange of individuals based on fitness values
is also performed at once for the whole population to produce a new generation.
This strictly follows the idea of generations as we may encounter it in family
relationships, where one would clearly differentiate between grandparents,
parents, children and so forth. On the level of a given biological
population, however, the idea of a “generation” is more or less man-made, since the
production of children is not as strictly tied to age as it would be with this
algorithmic approach.


4.2

GPU-based algorithm
Based on these terms, one can now describe the several algorithmic steps that
are necessary to generate new and better, fitter individuals and thus increase the
quality of the overall population, specifically on the GPU [4]:
1. In the beginning, the problem data file is read into the program. It contains
the unambiguous nodeIDs to identify each city, its coordinates and demands
as well as the position of the depot. With this information, an initial random
population is created with duplicate-free individuals. Furthermore, all
possible distances between two cities in the problem-map are pre-calculated
and stored in a distance table, which can be accessed with the nodeIDs of the
cities. This table is later used for the distance calculation of the individuals.
2. Based on the specific populationSize, a certain number of individuals of the
current generation is chosen for the crossover-operation, which always only
involves two individuals. Therefore, multiple of those crossover-operations
are performed for each generation, which causes the highest potential
workload. Thus, parallelism is applied to the greatest possible extent for
this algorithmic step to increase the calculation speed. All following steps
only refer to one such crossover operation:
3. Four individuals are randomly chosen as possible parents for a crossover
operation. As a first selection step, only those two individuals with the
lowest overall distance and therefore best fitness value are chosen as actual
parents.
4. The actual crossover is performed on the selected parents with a k-point
crossover-algorithm. For this, a randomly selected half of the first parent
is taken together with a second half of all cities from the second parent, and
both chunks are then merged into a new child. The same process is repeated to
produce a second child. Naturally, the children may now contain duplicate
cities, while missing other cities and also depot stops, but this problem is only
treated in the next step.
5. With a certain probability, mutation is performed on the children. This
means, that within a randomly selected part of the offspring individuals the
order of cities is reversed. In any case, a refining of the offspring is performed
in this step, which includes clearing of duplicates, insertion of missing cities
and finally the insertion of depots wherever necessary.
6. For local optimization, the 2-opt-algorithm is performed on each subtour
between two depot-stops within the children. This algorithm looks for fitness
improvements by changing the order of two cities at a time. At the same
time, the fitness of the child individuals is calculated.
7. From the four individuals, which include two parents and two generated
children, only the two best are selected to join the population of the next
generation.
8. To preserve desired properties of the population and keep up the diversity at
the same time, additional algorithmic steps are performed to delete duplicate
individuals from the emerging new population, while an elitist preservation
is used to carry the top 5% of the parent generation in any case to the next
generation, even if they are beaten by their direct children in terms of fitness.
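
Steps 3 to 5 of the list above can be sketched as follows. This is a simplified
illustration and not the implementation from [4]: individuals are assumed to be
flat lists of city IDs without depot stops, all function names are hypothetical, the
crossover takes a contiguous random half of the first parent and the leading part of
the second parent, and the greedy depot insertion is only one possible way to
restore feasibility.

```python
import random

def select_parents(population, fitness, rng):
    """Step 3: draw four candidates at random, keep the two with the shortest routes."""
    candidates = rng.sample(population, 4)
    return sorted(candidates, key=fitness)[:2]

def crossover(parent_a, parent_b, rng):
    """Step 4 (simplified): a random half of parent_a followed by part of parent_b."""
    half = len(parent_a) // 2
    start = rng.randrange(len(parent_a) - half + 1)
    return parent_a[start:start + half] + parent_b[:len(parent_b) - half]

def mutate(child, rng, p_mutation):
    """Step 5: with probability p_mutation, reverse a randomly chosen slice."""
    child = list(child)
    if len(child) >= 2 and rng.random() < p_mutation:
        i, j = sorted(rng.sample(range(len(child)), 2))
        child[i:j + 1] = child[i:j + 1][::-1]
    return child

def repair(child, all_cities):
    """Step 5: drop duplicate cities, then append the cities that are missing."""
    seen, cleaned = set(), []
    for city in child:
        if city not in seen:
            seen.add(city)
            cleaned.append(city)
    cleaned.extend(c for c in all_cities if c not in seen)
    return cleaned

def insert_depots(child, demand, capacity):
    """Step 5: cut the city sequence into subtours that respect the truck capacity."""
    subtours, current, load = [], [], 0
    for city in child:
        if current and load + demand[city] > capacity:
            subtours.append(current)  # depot stop before this city
            current, load = [], 0
        current.append(city)
        load += demand[city]
    if current:
        subtours.append(current)
    return subtours
```

A child would be produced by chaining crossover, mutate, repair and insert_depots;
the 2-opt step and the final comparison of parents and children (steps 6 and 7) are
left out of this sketch.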


CHAPTER 5
Necessary Adaptions of the Algorithm for a FPGA-based
implementation
5.1

FPGA Architecture and Properties
The algorithm as described earlier was initially designed for execution on a
GPU, using the multi-core capabilities for parallel computing on those specialized
processors. However, this thesis aims to describe an approach to implement
a similar Genetic Algorithm on FPGAs, which requires
adaptations to the algorithm. To understand the necessary steps, it is of greatest
importance to first get an idea of the capabilities of this unique computational
platform [?]. Some specific properties are:
• It may be assumed as a general fact about FPGAs that computation is
practically free and unlimited, while memory access is a hard problem
and therefore considered a bottleneck in each FPGA design. While the
bandwidth and clock speed after routing for different memory technologies
such as BRAM, DDR or HBM are limited, the main advantage of an FPGA
is the high number of Look-Up-Tables and DSP-slices that can be linked
together for extremely fast and parallel calculations.
• Closely related to this first fact is the statement that FPGAs are far more
capable of processing data flows than control flows, while the opposite is
true especially for CPUs, but also to some extent for GPUs. This means that
an FPGA can play out its advantages on an algorithm with a comparatively
small number of control conditions, but a high number of parallelizable
data-processing streams.
Based on those findings on the general properties of FPGAs in comparison to
a GPU, some changes to the algorithmic structure are necessary to achieve better
results in terms of speed and hardware consumption. The following sections aim
to describe the FPGA-specific implementation of the already mentioned steps of
the algorithm.
A specific problem for this type of implementation is the close connection between
hardware architecture and algorithm, which is among the reasons why some design
decisions described here are better understood together with the architectural
descriptions in the following chapter of this thesis.
5.2

Data Representation and Distance Calculation
Keeping in mind the general statement about FPGAs that “computation beats
memory access”, one design decision of the GPU-implementation in particular sticks
out as unworkable for the adaptive hardware approach: Pre-calculating all possible
distances between cities in the given map and storing them in a distance table for
later accesses is a clear contradiction to the main design principle of FPGAs, since it
would use the superior calculation power of the hardware only once in the beginning
and then later limit itself by constant waiting times for the memory-stored distance
values.
Instead, the decision was made to calculate all distance values for the
2-opt-adaption and the fitness calculation on demand during each
children-generation-process.
This, however, requires a different data representation of the cities in the problem
in comparison to the GPU-implementation: In this original approach, each city
was marked by an individual ID as identifier, which then served as a lookup value
for the distance table. The great benefit of this solution is that the range of IDs
just has to cover the number of cities in the given problem, which then also limits
the maximum bit width of such an ID and therefore the storage space required by
a single city. For example, a problem with 128 cities requires log2(128) = 7 bits
for each ID value.
In the FPGA, on the other hand, it is beneficial to denote each city by its
position, which is by definition also unique (Figure 2). Thus, in the distance
calculation process the “name” of a city, consisting of its x- and y-coordinate, can
be used directly as input for the calculation. The only disadvantage of this
approach is the potentially greater storage demand for each city, depending on the
spatial distribution and coordinate scaling of the given problem map: If the 128
cities from the example above are distributed over a coordinate field of 64 × 64
spots, each coordinate has to be stored in log2(64) = 6 bits. Furthermore, for
algorithmic processing “on the go”, the delivery demand for each city also has to
be stored. The bit width of this part of the node identifier depends on the largest
single demand of a city in the given problem due to a required homogeneity of data
representation. If within the described 128-city-problem the maximum demand of
a city was for example 20, the required number of bits for the representation of
demands in all cities would be ⌈log2(20 + 1)⌉ = 5. Lastly, one has to keep in mind
that cities or nodes in this problem are always stored as part of a route, which
needs to show where depot stops are necessary after a sequence of such cities,
called subroutes. To be able to only work with subtours of identical length, it makes
sense to reserve one bit in each city identifier which reveals whether this city is
followed by a depot stop - in which case the Depot-Bit is set to 1 - or not, in which
case the Depot-Bit is set to 0 (Figure 3). Therefore, the bit length of the data
representation of each city sums up to 2 × coordinate bits + demand bits + depot
bit. In the discussed 128-city problem, this would result in 2 × 6 + 5 + 1 = 18 bits
per node.
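
The bit-width arithmetic above can be sketched in a few lines of Python. The field
order (x-coordinate, y-coordinate, demand, depot bit, from most to least significant
bit) and the names bits_for, pack_city and unpack_city are assumptions made for this
illustration; the actual hardware layout may differ.

```python
def bits_for(max_value):
    """Number of bits needed to store integers in the range 0..max_value."""
    return max(1, max_value.bit_length())

def pack_city(x, y, demand, depot_bit, coord_bits, demand_bits):
    """Concatenate x | y | demand | depot bit into one integer identifier."""
    if not (0 <= x < (1 << coord_bits) and 0 <= y < (1 << coord_bits)):
        raise ValueError("coordinate does not fit into coord_bits")
    if not (0 <= demand < (1 << demand_bits)) or depot_bit not in (0, 1):
        raise ValueError("demand or depot bit out of range")
    word = x
    word = (word << coord_bits) | y
    word = (word << demand_bits) | demand
    return (word << 1) | depot_bit

def unpack_city(word, coord_bits, demand_bits):
    """Inverse of pack_city: returns (x, y, demand, depot_bit)."""
    depot_bit = word & 1
    word >>= 1
    demand = word & ((1 << demand_bits) - 1)
    word >>= demand_bits
    y = word & ((1 << coord_bits) - 1)
    x = word >> coord_bits
    return x, y, demand, depot_bit

COORD_BITS = bits_for(63)    # 64 x 64 coordinate field: values 0..63
DEMAND_BITS = bits_for(20)   # largest single demand of 20
NODE_BITS = 2 * COORD_BITS + DEMAND_BITS + 1
```

With the values of the 128-city example, NODE_BITS evaluates to the 18 bits derived
in the text.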

nodeID: 01111‘10010‘0011‘0 (fields from left to right: xCoordinate, yCoordinate,
nodeCapacity, depotBit)

Figure 2: Composition of the identifier for a single city in the
FPGA-implementation

Individual (fitness: 2164), each city shown as x’y’demand followed by its depot bit:
r1: 8’17’2 | 0, 3’28’3 | 0, 12’15’4 | 1
r2: 19’20’2 | 0, 30’30’5 | 0, 26’13’1 | 1
r3: 29’2’10 | 1
r4: 15’4’2 | 0, 7’1’4 | 0, 1’8’3 | 1

Figure 3: Multiple cities form a whole Individual. The set depot-bits at several
cities mark the end of subtours, where the truck has to return and start again at
the depot.
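
How an individual in this representation could be evaluated on demand, without a
distance table, is sketched below. The tuple layout (x, y, demand, depot_bit) and the
function names are assumptions for illustration, and the sketch uses the plain
Euclidean distance of Equation (1) in floating point, which is not necessarily the
arithmetic used in the hardware design.

```python
import math

def fitness(individual, depot):
    """Total route length of an individual given as (x, y, demand, depot_bit) tuples."""
    total, position = 0.0, depot
    for x, y, _demand, depot_bit in individual:
        total += math.dist(position, (x, y))
        position = (x, y)
        if depot_bit:  # the truck returns to the depot after this city
            total += math.dist(position, depot)
            position = depot
    if position != depot:  # close the last subtour if its depot bit was not set
        total += math.dist(position, depot)
    return total

def subtour_loads(individual):
    """Summed demand of every subtour, to be checked against the truck capacity."""
    loads, load = [], 0
    for _x, _y, demand, depot_bit in individual:
        load += demand
        if depot_bit:
            loads.append(load)
            load = 0
    if load:
        loads.append(load)
    return loads
```

Because every city carries its own coordinates and depot bit, a single pass over the
individual is enough to obtain both the route length and the load of each subtour.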


5.3

Algorithm implementation
Obviously, the specific algorithmic steps are also impacted by the different
computational characteristics of an FPGA in comparison to a standard device such
as a GPU or CPU. To analyze those changes in detail, it seems appropriate to
differentiate between a macro- and micro-algorithm (Figure 4):
5.3.1

Macro-Algorithm: Handling of the population

The term ‘macro-algorithm’ includes all processes related to the handling of
the population, the selection of individuals as parents for further processing in
the micro-algorithm as well as controlling certain defining parameters of these
calculations during the runtime.
Pooling-concept in contrast to the classic idea of generations in GA
Due to the need to preserve and access all individuals within the population,
the memory properties of the FPGA in particular have a major impact on this part
of the algorithm. The original “generation-based” concept as presented for the
GPU would require three different memories: The main memory stores the
set of individuals that belongs to the current generation, while two sub-memories
buffer the best parents of this generation as part of the elitism concept as well
as all newly generated children. Further memory reads, with a comparison of
the fitness values of all those individuals, are then required to transfer the right
elements to the main memory and form a new generation this way. This data
flow has two main drawbacks with respect to the properties of an FPGA: First
of all, memory space is scarce, which makes it desirable to keep the required
storage space as small as possible. Furthermore, memory accesses always form
bottlenecks for calculation on those devices, so a good implementation
should aim to form an uninterrupted memory cycle, in which reading and writing
frame the calculation pipeline in hardware without any further memory accesses
in between.

Figure 4: Flowchart of the algorithm. The macro-algorithm deals with the handling
of the population and the selection of individuals for processing as specified by the
micro-algorithm. It involves steps such as crossover, mutation and 2-opt.
Macro-Algorithm panel: population of multiple individuals; selection of elite parents
for crossover; Elitist List (references to the best known individuals in the
population to give elite parents the chance to survive their children); protection of
elite memory addresses; BestKnownSolutions (keeping track of the individuals with best
fitness as possible final outcome of the algorithm); controlling parameters and
generating random numbers.
Micro-Algorithm panel: (1) crossover operation, generating two children; (2) mutation,
introducing randomness to the children, applied with pMutationProbability; (3) depot
insertion, ensuring feasibility of the children solutions; (4) 2-opt heuristic,
optimizing the subtours within the children; (5) distance calculation, calculating the
fitness of the children. With probability 1 - pWildchildProbability, children and
parents are compared and the two best individuals are sent back; with
pWildchildProbability, the two children are sent back anyway.
These considerations lead to the idea of a population-pool instead of the classic
generation-based concept of processing (Figure 5): The algorithm on the FPGA
now has only one main memory that holds the entire population. At each iteration,
pairs of individuals are randomly selected from this pool for crossover-operations.
While the micro-algorithm processes those parent-individuals, their memory
addresses are blocked. It is now part of the micro-algorithm to find the two best
individuals from the group of two parents and two children, which are then written
back to the two address locations blocked before. This implementation now follows
all guidelines for FPGA-based memory architecture as defined before:
• Per crossover, only two reading and two writing processes are required at
maximum. Furthermore, they frame the whole processing pipeline, which
works completely without any further memory accesses that would lead to
additional waiting times.
• By getting rid of the original elitism-concept, only one memory is required
in comparison to three memories in the GPU-solution. Due to the
internal fitness-comparison in the micro-algorithm performed on the selected
individuals, it is also ensured that the overall quality of the population does not
decrease over time.
• Finally, this idea of a pool of individuals is also closer to the reality in nature
and therefore follows the bionic concept of mimicking natural processes: In
comparison to a generation-based population, there is a higher chance that
individuals of different age perform a crossover and therefore generate
new children with combined strengths. This allows the pool-based population
to carry the best properties of multiple “generations” of individuals over a
longer period, providing them protection over short-time improvements that
may turn out as dead-ends in evolutionary development.

Figure 5: Comparison of GPU/CPU- and FPGA-implementation: The upper
process resembles the generation-based population handling as implemented in
the GPU, while the lower flow shows the idea of pooling, which is tailor-designed
for the properties of a FPGA-platform. (Upper panel: stepwise processing with
parent-selection and elitist-selection between Generation 1, Generation 2 and
Generation 3. Lower panel: pooling with continuous processing over time, in which
individuals of different generations coexist in the pool.)
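
One iteration of the pooling concept described above can be sketched in software as
follows. The sketch is sequential, so the address blocking is only modelled with a
set; pooling_step, locked and micro_algorithm are hypothetical names, and the
micro-algorithm is passed in as a function that returns two children.

```python
import random

def pooling_step(pool, locked, micro_algorithm, fitness, rng):
    """One crossover on the pool: read two parents, write back the best two of four."""
    free = [addr for addr in range(len(pool)) if addr not in locked]
    addr_a, addr_b = rng.sample(free, 2)
    locked.update((addr_a, addr_b))           # block both addresses while processing
    parents = [pool[addr_a], pool[addr_b]]    # the only two reads
    children = micro_algorithm(parents[0], parents[1], rng)
    best_two = sorted(parents + list(children), key=fitness)[:2]
    pool[addr_a], pool[addr_b] = best_two     # the only two writes
    locked.difference_update((addr_a, addr_b))
    return addr_a, addr_b
```

Since the two written individuals are the best of the two parents and their two
children, neither the best nor the summed fitness value of the pool can get worse in
such a step, which mirrors the argument made for the second item of the list.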

Extension of the pooling-concept with an elitist-protection
However, testing the proposed design showed that at least one design principle
of the generation-based processing could actually also improve the pooling-based
Genetic Algorithm on FPGA: A clear weakness of the way that parents are
overwritten by their improved children in the pooling-design is that very
good parents also get overwritten instead of staying alive in the population and
getting more chances to produce very good children in further crossover operations.
The generation-based approach prevents such a self-limitation by offering a bypass of
the best parents around the elitist-selection: While being used as actual parents in
various crossovers, the best individuals also directly join the next generation, where
they can again contribute to the overall fitness of the population by producing
more very good children. In the pooling-design, the protection of elitist parents
has to happen at the moment they would be erased by getting overwritten by their
own children (Figure 6). For that, an extended pooling concept maintains a
constantly updated list with memory-references to the best known individuals in
the population, so that it can keep track of whether such an “elite parent” is involved
in a crossover. Once it turns out that it has produced even better children and
would normally get overwritten, the write-back mechanism is changed by assigning
a randomly selected memory address to the new offspring. Therefore, the elitist
parent remains part of the population, while a different individual, with a very
high probability not an elitist itself, is lost in exchange for child individuals
which themselves will make it to the elitist list.
Another principle that can be implemented based on such an elitist list is the
quality-based selection of parents: As one can find in nature, it makes sense that
very fit parents are involved in more crossover processes than non-elitist members
of the population. By directly accessing the elitist list with some probability for
selecting parents, this effect can be reached very easily.
To allow for direct comparison of the “raw” and “extended” pooling concept, the
size of the elite list can be set to zero before instantiation of the design, which
effectively prevents any elitism-effects.
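The extended pooling concept can be illustrated with a minimal Python sketch. It is not the Verilog implementation of the thesis: the function names, the use of a Python set as elitist list, and the rule that the replacement address only has to differ from the protected parent are assumptions made for illustration. Fitness is treated as a tour length, so lower values are better.

```python
import random


def write_back_address(parent_addr, elite_addrs, pool_size, rng):
    """Return the pool address where an improved child is stored.

    A non-elite parent is overwritten in place. An elite parent keeps its
    slot and the child goes to a random other address (assumption: the only
    constraint applied here is "not the protected parent itself").
    """
    if parent_addr not in elite_addrs:
        return parent_addr
    offset = rng.randrange(1, pool_size)  # 1 .. pool_size - 1
    return (parent_addr + offset) % pool_size


def update_elite(elite_addrs, fitness, new_addr, elite_size):
    """Keep the addresses of the elite_size best individuals (lowest distance)."""
    candidates = set(elite_addrs) | {new_addr}
    return set(sorted(candidates, key=lambda addr: fitness[addr])[:elite_size])


def pick_parents(pool_size, elite_addrs, p_elite, rng):
    """Draw two distinct parents, from the elitist list with probability p_elite."""
    if len(elite_addrs) >= 2 and rng.random() < p_elite:
        source = sorted(elite_addrs)
    else:
        source = range(pool_size)
    return tuple(rng.sample(source, 2))
```

With `elite_size` set to zero the elitist list stays empty, every child is written back to its parent's address, and parents are always drawn from the whole pool, which corresponds to the “raw” pooling concept.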
Initiation crossovers as a way to generate diversity within the population
An important question for every Genetic Algorithm is how to ensure
sufficient diversity of the population at the beginning of the algorithm. On GPU
and CPU with their advanced technologies for memory accesses, it is no problem

[Figure 6 diagram: "Continuous processing" (pooling over time) as in Figure 5,
extended by an "Elitist list"; annotations: "Protecting memory spaces of elite
parents" and "Updating the Elitist list".]

Figure 6: Detailed processing flow for a pooling concept including an elitist list:
The memory spaces of the individuals on the elitist list are protected and cannot be
overwritten, while the elitist list is updated after each crossover.
to fill up the whole memory with randomly composed and therefore completely
different individuals, which then form the starting generation of the population.
The FPGA however has to be programmed via a single bit file that includes not
only the description for the hardware setup and routing, but also the initial values
of all storage elements. To keep this process as simple as possible, the idea was
to pre-initialize the whole population with copies of one identical individual.
To generate a diverse starting population from this extremely homogeneous group
of individuals, so-called “initiation crossovers” were introduced as a starting point
for each processing run (Figure 7): This means that for a certain number of the
first crossover processes the mutation rate is not only set to 100%, but mutation
is also performed multiple times on the same children within those crossover
processes. At the same time, the fitness value of the identical individuals in the
pre-initialized population is set to the absolute maximum that is possible within the
bit width of the fitness-value field in their route representations, that is, to the
worst representable tour length. This ensures that
their heavily mutated children will definitely perform better than their parents and

[Figure 7 diagram: eight identical pre-initialized individuals (Individual 1 to
Individual 8, each 1111111) pass through the initiation crossovers (mutation
probability 100 %, mutation repeated pInitMutationProcesses times) and become eight
distinct individuals (0010110, 0110111, 1011101, 1101110, 0001111, 0101111, 1011001,
1110011). The standard crossovers (one mutation process with probability
pMutationProbability, no mutation process with probability 1-pMutationProbability)
then lead to the final population and the best found solution.]

Figure 7: Visualization of the composition of the whole algorithm flow, consisting of
initiation crossovers and standard crossovers. To get an idea of the distribution
of both groups, one can look at a realistic example with a population of
2000 individuals: In this case, 5000 initiation crossovers and 95000 standard
crossovers turned out to be a good set-up.
replace them in the population. If the number of initiation crossovers is chosen in
a reasonable ratio to the number of individuals in the population, there is almost
no chance that any of the old individuals survives the initiation phase:
Their places in the population are all taken by their children, whose genes are
randomly generated.
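The effect of the initiation phase can be reproduced with a small Python sketch. It is strongly simplified and rests on several assumptions that are not taken from the thesis: the fitness field is assumed to be 16 bits wide, each run mutates a single pool slot instead of crossing two parents, and `toy_fitness` is a placeholder metric rather than the tour length.

```python
import random

FITNESS_BITS = 16                       # assumed width of the fitness field
MAX_FITNESS = (1 << FITNESS_BITS) - 1   # worst representable value (lower is better)


def reverse_segment(tour, rng):
    """Mutation: reverse a randomly chosen chunk of the tour."""
    i, j = sorted(rng.sample(range(len(tour) + 1), 2))
    return tour[:i] + tour[i:j][::-1] + tour[j:]


def toy_fitness(tour):
    """Placeholder metric standing in for the tour length."""
    return sum(abs(a - b) for a, b in zip(tour, tour[1:]))


def initiation_phase(pool_size, n_cities, n_init_runs, n_init_mutations, rng):
    """Start from identical individuals and diversify them by repeated mutation."""
    seed = list(range(n_cities))
    pool = [(MAX_FITNESS, list(seed)) for _ in range(pool_size)]
    for _ in range(n_init_runs):
        addr = rng.randrange(pool_size)
        parent_fitness, child = pool[addr]
        for _ in range(n_init_mutations):       # mutation rate 100 %, repeated
            child = reverse_segment(child, rng)
        child_fitness = toy_fitness(child)
        if child_fitness < parent_fitness:      # always true for a sentinel parent
            pool[addr] = (child_fitness, child)
    return pool
```

Because every pre-initialized individual carries the sentinel value `MAX_FITNESS`, the first child written to a slot always wins the comparison, so no separate "force overwrite" path is needed during the initiation phase.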
5.3.2 Micro-Algorithm: Processing of two individuals

The term ‘micro-algorithm’ was chosen to describe the processing steps that
are performed on two individuals once they are selected as parents from the
population. The main objective of those processing steps is to produce two new
individuals - called children or offspring - by crossover of the parent individuals.
While the crossover process is only one step in the whole micro-algorithmic tool
chain, it is of such significance that it is also often used as a term to
describe the whole processing flow as pars pro toto. Randomly selected
micro-algorithm processes also include a mutation operator for the children produced by
crossover to increase diversity and allow for random optimizations. In any case, the
feasibility of the child solutions is ensured by a subsequent depot insertion, while
the 2-opt heuristic provides local optimization of the generated solution. After a
final distance calculation, the fitness values of children and parents are normally
compared in order to send the two best individuals back to the population. However, to
increase diversity in the population, a mechanism called “Wildchild” randomly
selects some runs, all of which have also included the mutation step, and
transmits their children directly back to the population, even if their fitness is worse
than that of their parents. This procedure, which occurs only very rarely, is intended to
add completely new genes to a population which might otherwise be stuck in a dead-end
development.
Crossover Processing
The first and most important step of the micro-algorithm is the crossover
operation, which produces two children by mixing the two selected parents from
the population. This is analogous to fertilization in sexual reproduction, when
the two haploid gametes (sperm and egg) fuse into
one diploid cell that carries the whole DNA of the child. The idea of this process
is the transfer of valuable genetic information - in the case of CVRP the order of
visited nodes in the problem - from both parents to the children, which therefore
represent a unique mixture of characteristics that may lead to
better fitness.
Unlike in nature, the two parents contribute different amounts of their genes to the
children: For generating child 1, a sequence of cities of random length starting at
the beginning of the individual is copied from parent 1 to the first spots of the child
tour. After that, an iteration through parent 2 is started: Each node from parent
2 is compared to all those cities that have already been copied to the new child. If
it is not present there, it is copied to the next free spot of child 1.
Once the offspring is completely filled with cities, the process is terminated. The
idea of taking a clear cut of cities from one parent and copying it to the child is
called a 1-point crossover. Furthermore, the comparison mechanism for filling up
the rest of the spots of the child with nodes from the other parent prevents any
duplicate or missing cities with a minimum number of required comparisons.
Although GPU and FPGA both implement this procedure called 1-point crossover,
the specific mechanism underlying it is different: In the GPU solution, both
parents are cut at the same position, so that the two different
parts of each parent can be directly recombined into two new children. After
that, each newly generated child has to be compared with a list of all nodes to
flag all duplicates in the new individual and also create a list of all missing
nodes. After that, all flagged cities within the child are replaced with nodes
from the list of missing nodes. This algorithm therefore requires a comparison
of every node in the individual with every entry in the list of all nodes, which
results in
$$\text{Number of comparisons} = (\text{Number of cities in the map})^2,$$
a quadratic computational effort. In contrast, the mechanism implemented in the
FPGA relies on the fact that the part copied from one parent is already free
of duplicates. Considering that it might be necessary to check all nodes
in the second parent against the cities already present in the child, the maximum
number of comparisons is
$$\text{Number of comparisons} = \text{Length of copied chunk} \times \text{Number of cities in the map}.$$
Obviously, this depends on the length of
the copied chunk of cities, but one should keep in mind that with a larger number
of directly inserted cities the number of places still to be filled is small, which reduces
the probability that a check of all cities in the second parent is required. Thus,
it is a good estimate that the maximum number of comparisons is required for
$\text{Length of copied chunk} = \tfrac{1}{2} \times \text{Number of cities in the map}$,
which would still result in 50% fewer comparisons.

[Figure 8 diagram: example with delivery capacity 21 and depot at 15’15. Parent 1 and
Parent 2 are shown as sequences of the node identifiers 16’7’8, 1’26’18, 3’6’12, 4’15’3
and 11’18’6, each followed by its depot bit. Step 1: the crossover starting point is at
the beginning of the parent tour; the section of a randomized number of cities beginning
at the crossover starting point is copied to the child. Step 2: iterate through parent 2,
take each city and compare it to the cities which are copied from parent 1 to child 1; if
the city is not present in the copied part of child 1, add it to the next free spot in
child 1. Resulting Child 1: 16’7’8, 1’26’18, 11’18’6, 3’6’12, 4’15’3.]

Figure 8: Example of a crossover process as performed in the FPGA by copying a
chunk of cities from one parent and then filling up the empty spots with missing
nodes from the second parent.
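The FPGA-style crossover described above can be sketched in a few lines of Python. The function name, the integer city labels and the way comparisons are counted (one comparison per copied city for every examined city of the second parent) are illustrative assumptions; the hardware may perform these comparisons in parallel.

```python
def one_point_crossover(parent1, parent2, cut):
    """Copy parent1[:cut], then fill up with the missing cities of parent2.

    Returns the child and the number of city-to-city comparisons performed.
    """
    copied = list(parent1[:cut])        # duplicate-free by construction
    child = list(copied)
    comparisons = 0
    for city in parent2:
        if len(child) == len(parent1):  # offspring is completely filled
            break
        comparisons += len(copied)      # compare against the copied chunk only
        if city not in copied:
            child.append(city)
    return child, comparisons


parent1 = [1, 2, 3, 4, 5, 6]
parent2 = [4, 6, 2, 1, 5, 3]
child1, n1 = one_point_crossover(parent1, parent2, cut=2)
child2, n2 = one_point_crossover(parent2, parent1, cut=4)
```

Since parent 2 is itself free of duplicates, checking each of its cities against the copied chunk is sufficient, which keeps the number of comparisons at or below `cut * len(parent2)`.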
Mutation Processing
As explained before, mutation does not occur for all crossover-produced
children, but is applied on a randomized selection basis. It is implemented as a
two-point mutation, where a chunk of cities within the generated child is
selected by a randomized starting point and a randomized length
(Figure 9). Within this chunk of cities, all nodes symmetrically exchange their
positions, so that the order of the chunk is reversed. This has the effect that all
connections within the chunk of cities stay
the same, with the exception of those between the framing cities of the reversed
chunk and the neighbouring cities of the child that are not changed. This can
also change the composition of subtours within the child when the depots are
inserted later.

[Figure 9 diagram: Child 1 (16’7’8, 1’26’18, 11’18’6, 3’6’12, 4’15’3) with randomized
mutation starting point = 1. Starting at the randomized mutation starting point, the
position of all cities within a randomized mutation length is reversed, which gives
16’7’8, 3’6’12, 11’18’6, 1’26’18, 4’15’3.]


Figure 9: Example of a mutation process as performed in the FPGA by reversing
the order of the cities within a randomly defined chunk of nodes within the child
individual.
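A minimal Python sketch of this mutation is given here, using the example of Figure 9 with nodes written as `(x, y, demand)` tuples. The mutation length of 3 is inferred from the result shown in the figure, and clipping the chunk at the end of the tour is an assumption of this sketch.

```python
def reverse_chunk(tour, start, length):
    """Reverse the order of `length` cities beginning at index `start`."""
    end = min(start + length, len(tour))   # assumption: the chunk does not wrap around
    return tour[:start] + tour[start:end][::-1] + tour[end:]


def undirected_edges(tour):
    """Set of connections between subsequent cities, ignoring direction."""
    return {frozenset(pair) for pair in zip(tour, tour[1:])}


child = [(16, 7, 8), (1, 26, 18), (11, 18, 6), (3, 6, 12), (4, 15, 3)]
mutated = reverse_chunk(child, start=1, length=3)
unchanged = undirected_edges(child) & undirected_edges(mutated)
```

In this example the two connections inside the reversed chunk survive, while the two connections at its borders are replaced by new ones.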
Depot Insertion
To ensure the feasibility of the new solutions, it is of utmost importance
to insert depot stops wherever this is required by the demands of the cities in
the solution and the maximum delivery capacity of the truck in the problem.
The mechanism for this insertion is based on a simple iteration and summation
function (Figure 10): Starting with the first city in the child, the demands of all
following nodes are summed up. Once this sum exceeds the maximum delivery
capacity of the truck, the algorithm steps back to the previous city and inserts a
depot stop by setting the depot bit of the corresponding node identifier. At
the same time, the sum of demands is reset to zero to start the iteration process
again with the next node. For all nodes in between, the depot bit is set to zero,
since it may still be set from the parents from which the nodes were copied.

[Figure 10 diagram: depot insertion for the tour 16’7’8, 3’6’12, 11’18’6, 1’26’18, 4’15’3
with Delivery Capacity = 20. Iterate through the tour and sum up the single demands to
compare them to the truck capacity. If exceeded, go one step back and insert a depot
stop. Insert a depot stop at the last city. Annotated steps: "8 + 12 = 20 - Fits
capacity", "6 + 18 = 24 > 20 - Exceeds capacity", "Final city is reached". Resulting
depot bits: 16’7’8: 0, 3’6’12: 1, 11’18’6: 1, 1’26’18: 1, 4’15’3: 1.]


Figure 10: Example of a depot insertion process as performed in the FPGA by
iterating through the child and adding the node demands to compare them to the
delivery capacity of the truck in the problem.
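The iteration and summation can be sketched in Python as follows. The function name and the list-of-bits representation are assumptions for illustration; the example values are the demands and the capacity from Figure 10.

```python
def insert_depots(demands, capacity):
    """Return one depot bit per city; bit 1 means "return to the depot after this city"."""
    if any(demand > capacity for demand in demands):
        raise ValueError("a single demand exceeds the truck capacity")
    bits = [0] * len(demands)      # clears any depot bits inherited from the parents
    load = 0
    for i, demand in enumerate(demands):
        if load + demand > capacity:
            bits[i - 1] = 1        # step back: close the subtour before this city
            load = 0
        load += demand
    if bits:
        bits[-1] = 1               # the last city always returns to the depot
    return bits


depot_bits = insert_depots([8, 12, 6, 18, 3], capacity=20)
```

For the demands of Figure 10 this yields the depot bits 0, 1, 1, 1, 1, matching the final row of the figure.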
2-opt-algorithm
An essential part of the processing tool flow in the micro-algorithm is the 2-opt
heuristic, which is used to improve the fitness of the generated new individuals on
a subtour level. This means that the distances of the subtours are minimized
by changing the order of nodes between two depot stops in order to eliminate all
intersections of the route. By the triangle inequality, replacing two crossing
connections with the corresponding non-crossing ones always shortens the route.
The algorithm begins with the first node after the depot stop as “anchor” of
calculation and the next node as the partner for a potential swap of order
(Figure 11). Then the current distances - from the depot to the first node and
from the second node to the third node - are calculated as well as the potential
new distances after a swap. In this case, that would be the distances from the
first to the third node and from the second node to the depot. If the sum of the
old distances is longer than the sum of the new distances, a swap improves the
overall distance of the subtour and is therefore performed.
The position of the “anchor” stays the same for the next iteration of the algorithm,
while the position of the possible swap partner is increased by one. If
there is more than one node between anchor and swap partner and a
swap would improve the result, the order of all nodes between anchor and swap
partner is reversed. Once the position of the swap partner has reached the end of
the subtour - thus the next node with a set bit for a following depot stop - the
anchor position is also increased by one. The algorithmic process is
repeated until the anchor position has also reached the last node in the subtour.
Finally, the application of 2-opt guarantees an intersection-free subtour with the
locally optimal distance.
Regarding the distance calculation for the evaluation of possible swaps within
the 2-opt algorithm, two different metrics are possible: In most cases, a squared
distance metric with
$$d_s(A,B) = [d_e(A,B)]^2 = \left(\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}\right)^2 = (x_a - x_b)^2 + (y_a - y_b)^2$$
is sufficient to find an optimized routing, while avoiding the
time-consuming square-root calculation. However, some intersections might not
be detected this way, since comparing sums of squared distances is not equivalent
to comparing sums of the distances themselves.
Therefore, the Euclidean distance metric
$$d_e(A,B) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$$
can also be applied for evaluating possible swaps in 2-opt. This evaluates every
swap exactly, but requires several orders of magnitude more clock cycles due to the
time-consuming square-root calculation. Both metrics are implemented in hardware,
and the specific choice must be seen as part of the tuning of the algorithm for a
specific problem within the general trade-off between solution speed and solution
quality.
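The anchor and swap-partner scheme with a selectable metric can be sketched in Python as follows. The function names are assumptions, the sketch performs a single sweep over all anchor positions, and the example uses the coordinates of the subtour shown in Figure 11 (depot at 0’0).

```python
import math


def squared(a, b):
    """Squared distance metric d_s, avoids the square root."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def euclidean(a, b):
    """Euclidean distance metric d_e."""
    return math.sqrt(squared(a, b))


def two_opt_subtour(subtour, depot, dist=squared):
    """One 2-opt sweep over a subtour that starts and ends at the depot."""
    tour = list(subtour)
    n = len(tour)
    for anchor in range(n - 1):
        for partner in range(anchor + 1, n):
            before = tour[anchor - 1] if anchor > 0 else depot
            after = tour[partner + 1] if partner + 1 < n else depot
            old = dist(before, tour[anchor]) + dist(tour[partner], after)
            new = dist(before, tour[partner]) + dist(tour[anchor], after)
            if old > new:   # swap improves: reverse everything from anchor to partner
                tour[anchor:partner + 1] = tour[anchor:partner + 1][::-1]
    return tour


def rounded_length(subtour, depot):
    """Subtour length with every leg rounded to the nearest integer."""
    stops = [depot, *subtour, depot]
    return sum(round(euclidean(a, b)) for a, b in zip(stops, stops[1:]))


start = [(12, 2), (3, 9), (12, 13)]
improved = two_opt_subtour(start, depot=(0, 0))
```

With the squared metric the first comparison is 148 + 97 = 245 against 121 + 90 = 211, so the first two nodes are swapped; the second comparison is 403 against 403 and changes nothing, as in Figure 11.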
Fitness calculation
The fitness calculation again makes use of the fact that the node identifiers are
composed directly of the coordinates of the cities, but in this case, the Euclidean distances

[Figure 11 diagram: subtour through the nodes 12’2, 3’9 and 12’13 with the depot at 0’0
and Subtour Length = 12+11+10+18 = 51. First comparison: existing connections
148 + 97 = 245, possible new connections 121 + 90 = 211; 245 > 211, so the change of
order improves the fitness. Second comparison: existing connections 90 + 313 = 403,
possible new connections 90 + 313 = 403; 403 = 403, so the change of order does not
improve fitness. More comparisons necessary, but no improvements possible. Result:
Subtour Length = 9+10+11+12 = 42; 2-opt has reduced the length of this subtour from
51 to 42.]

Figure 11: Example of the application of the 2-opt-heuristic on a subtour, in this
case with the squared distance metric.
[Figure 12 diagram: distance formula $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
applied to the tour 16’7’8, 3’6’12, 11’18’6, 1’26’18, 4’15’3 with the depot at 15’15.
Rounded distances of the individual legs: 8, 13, 15, 5, 5, 18, 18, 11, 11. Overall
fitness: 104. Compare fitness values of parents and children and transmit the two best
individuals to add them to the population.]

Figure 12: Example of distance calculation as performed in the FPGA by iterating
through the child and calculating all Euclidean distances between subsequent nodes
and depot stops.
between the cities are calculated and summed up (Figure 12). Comparable to the
depot insertion, the algorithm iterates through the cities of a child and inserts the
coordinates of subsequent nodes into a distance calculation function. The resulting
distances, which are rounded to the nearest integer, are then summed
up. Additional calculations have to be inserted if the depot bit of a city is set,
which marks the end of a subtour within the individual. In such a case, the distance
from that node to the depot and then from the depot to the next following
node has to be calculated instead of the direct distance between the subsequent
nodes.
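A short Python sketch of this fitness calculation follows, using the tour, depot bits and depot position of Figure 12. The function names are assumptions, and Python's built-in `round` stands in for the hardware rounding (the two could differ for distances ending in exactly .5).

```python
import math


def leg(a, b):
    """Euclidean distance between two points, rounded to the nearest integer."""
    return round(math.hypot(a[0] - b[0], a[1] - b[1]))


def fitness(tour, depot_bits, depot):
    """Total tour length; a set depot bit inserts a detour via the depot."""
    total = 0
    position = depot                  # every tour starts at the depot
    for city, bit in zip(tour, depot_bits):
        total += leg(position, city)
        position = city
        if bit:                       # end of a subtour: drive back to the depot
            total += leg(city, depot)
            position = depot
    return total


tour = [(16, 7), (3, 6), (11, 18), (1, 26), (4, 15)]
overall = fitness(tour, depot_bits=[0, 1, 1, 1, 1], depot=(15, 15))
```

The individual legs are 8, 13, 15, 5, 5, 18, 18, 11 and 11, which sum to the overall fitness of 104 given in Figure 12.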

CHAPTER 6
Architectural Design
6.1 Overall structure - overview of all used modules
With the high degree of design freedom on an FPGA, the primary goal is
to implement a hardware architecture that allows the processing of the
algorithm described before as efficiently as possible. “Efficiency” in this case is
predominantly defined as calculation speed when it
comes to the final comparison of the different computational platforms. However,
energy efficiency is also an important quality measure in
high-performance computing.
Keeping in mind the properties of FPGAs in general, one can easily conclude that
the greatest speed advantage in algorithmic processing is obtained if the
calculation power of the device is used to parallelize the whole computation process
as much as possible. This premise also inspired the design of the
final computer architecture that evolved as the implementation of the Genetic
Algorithm (Figure 13). Again, one can clearly see the separation of the whole
problem into a macro- and a micro-algorithm, which both feature parallelism on
different levels:
• The main central controller is the hardware implementation of the
macro-algorithm, since it handles the accesses to the population memory, and thus
the selection of parents and their assignment to crossover processes.
Furthermore, this module keeps the record of the statistics of the
algorithmic progress, for example the best known solutions at any given
point in the calculation as well as the number of clock cycles already elapsed.
Finally, this module also manages the UART connection,
which is used for transmitting the best found solution to a host PC
for further exploration. This is necessary since the FPGA itself is a “closed”
device, which does not reveal the intermediate results of computation. UART
as a serial interface allows the final routing generated by the
algorithm to be transmitted.
• The two Processing Nodes are the direct realization of the micro-algorithm
in hardware. The fact that there are two such modules can be interpreted as
first level of parallelism: Limited by the number of possible memory accesses
to the used BRAM-based population memory, both modules can operate
completely independently from each other, which allows for the parallel
execution of two crossover-processes at any given point in time. A closer
look at the internal data flow of those Processing Nodes also reveals that
both children in each such process are generated independently at the same
time, which can be seen as second level of parallelism.
Both the main controller and the Processing Nodes interact with a variety of
different other modules, such as random number generators, square root calculators
or UART-communicators that provide required services. However, those two types
of Verilog-modules can be seen as main cores of the design. All parts of the
implementation shall be described in detail in the following sections:
6.2 Central Controller
As the main structure of the whole design (Figure 13) reveals, the central
controller itself, as the implementation of the macro-algorithm, has a diverse internal
structure with multiple different elements in charge of controlling different levels
of the algorithmic steps. First of all, it serves as the central connecting hub, building
the bridge between the population memory, the Processing Nodes and also the
[Figure 13 diagram: block diagram with the Population-Memory and its access ports; the
Central Controller containing Parent Selection (Linear Feedback Shift Registers), the
Algorithmic-Controller, the PN-Controllers, BestSolutionMem and the Clock-Counter; two
Processing Nodes, each receiving Parent 1 and Parent 2 and returning Offspring 1 and
Offspring 2, with attached Add-Mem and Square-Root modules; and the UART-Communicator
for output communication.]

Figure 13: Overview of the whole implemented computational design, including
a central controller that transmits data from the population memory to the
Processing Nodes, which are in charge of the micro algorithm as discussed before.
Data exchange with a connected PC is ensured via a UART-communicator.

“outside world” via the UART-module fed with data from the main controller. All
of these different functionalities are implemented in a central main Finite State
Machine (FSM) for all the very high-level algorithmic controls and two identical
FSMs for the micro-management of the Processing Nodes, especially for controlling
their memory accesses.
6.2.1 Main FSM - the Algorithmic Controller

The main FSM in the central controller implements the different main steps
of the Genetic Algorithm in a number of corresponding states (Figure 15):
• IDLE: One after the other, the two Processing Node FSMs are notified
to start the calculation process in the Processing Nodes via the
signals “start_processing[processing_node_counter]”. Once this initialization
of the process is finished, the FSM switches to the following state
WAIT_INIT_RUNS.
• WAIT_INIT_RUNS: This state is only used to ensure the initiation phase
with increased mutation rate as described before. Therefore, the run
counter, which is operated by specific signals from the Processing Node FSMs
(Figure 14), is compared to the number of initiation crossover runs as
pre-defined by a parameter of the Processing Node module. As long as this
initiation phase lasts, the FSM stays in this state and the signals in
“processing_init[processing_node_counter]” stay high to signal the increased
mutation rate to the Processing Node FSMs.
• WAIT_STANDARD_RUNS: Analogous to the state before, this state of the
FSM is used to wait for the remaining crossover runs of the Processing
Nodes until the overall number of specified runs is reached by the design.
During that time, the signals in “processing_init[processing_node_counter]”
remain low to get the desired normal mutation probability. Once the run
counter reaches the pre-defined number of overall runs, the FSM resets the
“start_processing[processing_node_counter]” signals, so that the Processing
Node FSMs do not start a new crossover run when the current process is
finished.
• WAIT_FOR_END: The main FSM spends the time until both Processing Nodes have
concluded their last crossover run in this state. Once this
is signalled via the “all_finished” signal, the FSM switches to the next state
described below.
• EVALUATE_BEST_SOLUTION: In this state, the FSM iterates through the
register file “best_known_solutions[]” to compare the best solutions found by
the various Processing Nodes and determines the overall best individual.
With this information, the memory address of this best solution is determined
so that the FSM of the first Processing Node can begin to read this individual
and transmit it over UART.
• FINISHED: After the evaluation of the best individual, the FSM switches
to this final state, where nothing else happens anymore.
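The sequence of states listed above can be modelled with a small Python sketch. This is a software illustration, not the Verilog FSM: the class and attribute names are assumptions, the comparisons use `>=` instead of an exact match, and the best solution is determined in a single step instead of iterating over the register file.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    WAIT_INIT_RUNS = auto()
    WAIT_STANDARD_RUNS = auto()
    WAIT_FOR_END = auto()
    EVALUATE_BEST_SOLUTION = auto()
    FINISHED = auto()


class MainFsm:
    def __init__(self, num_nodes, num_init_runs, num_runs):
        self.state = State.IDLE
        self.num_nodes = num_nodes
        self.num_init_runs = num_init_runs
        self.num_runs = num_runs
        self.start_processing = [0] * num_nodes
        self.processing_init = [0] * num_nodes
        self.node_counter = 0
        self.best = None          # best fitness over all Processing Nodes
        self.best_node = None     # index of the node that found it

    def step(self, run_counter, all_finished, best_known_solutions):
        """Advance the FSM by one clock cycle and return the new state."""
        if self.state is State.IDLE:
            # notify one Processing Node FSM per cycle
            self.start_processing[self.node_counter] = 1
            self.processing_init[self.node_counter] = 1
            self.node_counter += 1
            if self.node_counter == self.num_nodes:
                self.node_counter = 0
                self.state = State.WAIT_INIT_RUNS
        elif self.state is State.WAIT_INIT_RUNS:
            if run_counter >= self.num_init_runs:
                self.processing_init = [0] * self.num_nodes
                self.state = State.WAIT_STANDARD_RUNS
        elif self.state is State.WAIT_STANDARD_RUNS:
            if run_counter >= self.num_runs:
                self.start_processing = [0] * self.num_nodes
                self.state = State.WAIT_FOR_END
        elif self.state is State.WAIT_FOR_END:
            if all_finished:
                self.state = State.EVALUATE_BEST_SOLUTION
        elif self.state is State.EVALUATE_BEST_SOLUTION:
            self.best_node = min(range(self.num_nodes),
                                 key=lambda n: best_known_solutions[n])
            self.best = best_known_solutions[self.best_node]
            self.state = State.FINISHED
        return self.state
```

A test bench would call `step` once per simulated clock cycle with the current run counter, the `all_finished` flag and the per-node best fitness values.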

6.2.2 Procedural control and memory communication - the Processing Node-Controller

While the main FSM as described before manages the macro-scale phases
of the algorithm, such as initialization, normal operation and finish, the
“micro-management” is implemented in the Processing Node Controller. To allow for
parallel use of multiple Processing Nodes and therefore the desired parallelism in
overall operation, each Processing Node is controlled by its own exact copy of
this automaton in the main controller. In summary, its function is to manage the

[Figure 14 diagram: PNFSM 1 and PNFSM 2 each increment one entry of run_counter_arr
(run_counter_arr[0], run_counter_arr[1]) in their LOAD_PARENT_1 state; an adder sums
both entries into run_counter.]

Figure 14: Logic design of the run counter used to measure the number of executed
crossover operations.

[Figure 15 diagram: states, actions and transitions of the main FSM.
• IDLE: start_processing[processing_node_counter] <= 1;
processing_init[processing_node_counter] <= 1; processing_node_counter++; done_o <= 0;
if processing_node_counter == pNumProcessingNodes: processing_node_counter <= 0.
Transition to WAIT_INIT_RUNS when processing_node_counter == pNumProcessingNodes.
• WAIT_INIT_RUNS: if run_counter == pNumInitRuns: for nodeCounter < pNumProcessingNodes:
processing_init[nodeCounter] <= 0. Transition to WAIT_STANDARD_RUNS when
run_counter == pNumInitRuns.
• WAIT_STANDARD_RUNS: if run_counter == pNumRuns || (!pRunBeyondBestKnown &&
best_known_solution_met): for nodeCounter < pNumProcessingNodes:
start_processing[nodeCounter] <= 0. Transition to WAIT_FOR_END on the same condition.
• WAIT_FOR_END: if all_finished: processing_node_counter <= 1;
best_final_solution <= best_known_solutions[0];
best_final_solution_address <= best_known_solutions_addresses[0]. Transition to
EVALUATE_BEST_SOLUTION when all_finished.
• EVALUATE_BEST_SOLUTION: if best_known_solutions[processing_node_counter] <
best_final_solution: best_final_solution <= best_known_solutions[processing_node_counter];
best_final_solution_address <= best_known_solutions_addresses[processing_node_counter].
If processing_node_counter < pNumProcessingNodes - 1: processing_node_counter++;
else: load_best_solution_processing_node_1 <= 1. Transition to FINISHED when
processing_node_counter == (pNumProcessingNodes - 1).
• FINISHED: done_o <= 1.]

Figure 15: Finite State Machine of the main controller used for implementation of
the macro-algorithm and control of the workflow of the whole design.
communication between the population memory and the Processing Nodes, while
following the higher-level instruction of the main FSM. For that purpose, not only
the general communication standards for the memory access have to be observed,
but also algorithm-dependent signals coming from the top FSM.
Selection of parent individuals with and without elitism
One of the most important problems to be solved in the main controller is the
right assignment of individual parents to the Processing Nodes via the Processing
Node FSMs. Several rules apply for this process:
• First of all, the different Processing Nodes cannot use the same individuals at
the same time as parents for their crossover processes. This is due to the fact
that the memory spaces of the individuals are potentially re-used at the end
of each crossover process to store possibly better child individuals. Since
the various Processing Nodes are non-aligned time-wise in their calculation,
parallel usage of the same individuals could thus lead to concurrent memory
accesses to the same addresses and corrupted individuals in the population
memory.
• For the same reason, a crossover process can only be started with two different
individuals as parents. Not only would the double selection of the same
parent for one crossover process contradict the biological model of sexual
reproduction and threaten the diversity of the population, but it could also
potentially lead to corrupted memory accesses.
Two different mechanisms help to solve this problem. However, the complexity
differs greatly depending on whether the Elitist list is used or not. Without applied
elitism, the mechanism can be described as follows:
Generally speaking, the main controller generates two random numbers per clock cycle
with the Linear Feedback Shift Registers (LFSRs) included in the design; each number
indicates one individual in the population. At the same time, the control signal
“current_inds_approved” is calculated, which indicates whether those two random
numbers are valid candidates for the selection as parents. This requires a
multi-step combinatorial logic (Figure 16):
• First of all, the LFSRs generate numbers in the range of the specified bit
width, which does not necessarily match the range of individual numbers
in the population memory. Therefore, the excess address space has to be
mapped to the valid memory address space, for example by double-addressing
the last individual numbers. This is not part of the control logic calculation,
but essential for the generation of potential parent numbers itself.
• Secondly, the two proposed individual numbers have to be checked against
each other to determine that they are not identical.
• Furthermore, the proposed individuals are checked against a register-file-based
list of all individuals that are currently used as parents in the ongoing
crossover processes in the different Processing Nodes. Only if the proposed
individual is not in use by any Processing Node can it be accepted as
parent for the next crossover process.
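The three checks can be sketched in Python as follows. Several details are assumptions rather than facts from the thesis: the LFSRs are modelled as 16-bit Fibonacci LFSRs with taps 16, 14, 13 and 11 (a commonly cited maximal-length configuration), the lowest `width` bits are used as the raw individual number, and the excess address space is folded onto the last individuals of the population.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def map_to_population(raw, width, pop_size):
    """Map a width-bit random number to a valid individual number.

    Assumes 2**(width - 1) <= pop_size <= 2**width. Numbers beyond the
    population are folded onto the last individuals (double-addressing).
    """
    raw &= (1 << width) - 1
    excess = (1 << width) - pop_size
    return raw - excess if raw >= pop_size else raw


def propose_parents(state1, state2, width, pop_size, in_use):
    """One clock cycle: two proposals plus the approval signal."""
    state1, state2 = lfsr16_step(state1), lfsr16_step(state2)
    p1 = map_to_population(state1, width, pop_size)
    p2 = map_to_population(state2, width, pop_size)
    approved = p1 != p2 and p1 not in in_use and p2 not in in_use
    return state1, state2, p1, p2, approved
```

A caller would repeat `propose_parents` every cycle and hand the pair to a waiting Processing Node only when `approved` is true, adding both numbers to `in_use` until that node has written its results back.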
If an Elitist list is configured to be used for the respective problem, the mechanism
gets more complex, since two different sources of prospective parents have to
be used. Furthermore, the proposed individuals also need to be evaluated
regarding an additional property: it needs to be checked whether the individuals
are registered in the elitist list. Again, a two-step hardware structure is used to
implement those functions:
• First of all, prospective parents are selected from two different sources
[Figure 16 diagram: Linear Feedback Shift Register - 1 and Linear Feedback Shift
Register - 2 produce raw_individual_output_1 and raw_individual_output_2. Each raw
output is compared with lpSizeOfPopulation (a > b) and, depending on the result, a
multiplexer passes on either the raw value or the value reduced by subtraction, giving
individual_proposed_1 and individual_proposed_2. Both proposals are compared (=) with
each other and with the entries of ind_in_use (Individual 1 to Individual 4); the
results are combined (&) into current_inds_approved.]

Figure 16: Data flow for the selection and approval of possible parents for a next
crossover run in one of the Processing Nodes if no elitism is applied.
(Figure 17): On the one hand, two LFSRs are used to randomly pick
any two individuals from the population pool. Those two individuals
may or may not be elitist individuals, and they are regarded as the first
possible “couple” of parents in the algorithm. The second “couple”
of parent individuals is randomly selected by two
other LFSRs directly from the elitist list, so that one can be sure that these
individuals are elite. The decision which of those couples is passed on
to the next stage of calculation, and thus possibly proposed as parents
for one of the Processing Nodes, is again based on probability: By
comparing the output of a fifth Linear Feedback Shift Register to a
predefined value, a constant probability governs the decision which pair of
individuals to pass on. Based on both the ElitistSelectionProbability and the
SizeOfElitistList, the probability for an elite individual to become a parent
can be calculated:
$$\text{pElitistSelected} = \text{ElitistSelectionProbability} + (1 - \text{ElitistSelectionProbability}) \cdot \frac{\text{SizeOfElitistList}}{\text{pPopulationSize}}$$


• In the next step, the signal for approval of the proposed individuals has to be
calculated as it is done in the configuration without any elitism (Figure 18).
For that, the two proposed parents are compared to each other as well as
with the individuals already in use to check if the selected individuals are
free to be included in crossover operations. One additional property needs
to be checked to generate the “current_inds_approved” signal:
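Returning to the selection probability given in the first step, the formula can be evaluated numerically. The following lines are only a worked example; the function name and the parameter values (a selection probability of 0.25, 10 elitists, a population of 100) are arbitrary assumptions and not taken from the thesis.

```python
def p_elitist_selected(elitist_selection_probability, size_of_elitist_list,
                       population_size):
    """Probability that a proposed parent is an elite individual."""
    p = elitist_selection_probability
    # Either the elitist couple is chosen directly (probability p), or the
    # general couple is chosen and happens to contain an elite individual.
    return p + (1.0 - p) * size_of_elitist_list / population_size


# Example values (assumptions): 25 % elitist selection, 10 of 100 are elite.
example = p_elitist_selected(0.25, 10, 100)  # 0.25 + 0.75 * 0.1
```

Even without any explicit elitist selection (probability 0), an elite individual is still proposed with the share that the elitist list has in the whole population.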

Granting access rights for parent reading to the Processing Nodes
This hardware section produces two potential parents per clock cycle, which
is enough for only one Processing Node. Since the Processing Nodes operate
completely independently and especially without any predictable scheduling, it is

Figure 17: Selection of prospective parent individuals from the general population
as well as specifically from the elitist list.
Figure 18: Calculation of the approval signal for the prospective parent individuals
as well as generating an indication for their status as elitist individuals.

important to implement a special locking mechanism in the first states of the PNFSM
to ensure that only one node per clock cycle uses the proposed and approved
individuals as parents for the next crossover process (Figure 19): Potentially,
both Processing Node FSMs are in the same initial state IDLE_PN, as is
the case at the start of the algorithm. Once they receive the “start_processing[]”
signal from the main FSM, both FSMs switch to the actual START
state, having set up the so-called “processing_priority[]” value according to
the formula $\mathrm{processing\_priority} = 2^{\mathrm{processing\_node\_number}}$. This means that the
first Processing Node (node number 0) gets a priority value of 1 and the second
node (node number 1) a value of 2. The whole system is designed for use with more
than two nodes and is therefore prepared for a future switch to another memory
architecture that allows for more parallel Processing Nodes. All priority values
are then summed up in the variable “priority_sum”. In the following state START,
a node can only access the proposed parent individuals and start the further
crossover processing if the condition
$\mathrm{processing\_priority} > \mathrm{priority\_sum}/2$ is met. This is only ever true
for the one Processing Node FSM with the highest priority value, since each power
of two exceeds the sum of all smaller powers of two. If this is the case,
the approval signal for the proposed individuals is checked. If they are approved
as parents, the priority value of the respective FSM is reset to zero, and the reading
process from the population memory begins.
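The locking rule described above can be checked with a few lines of software. This is a behavioural sketch under assumptions, not the thesis' HDL: the function name and the list of waiting nodes are hypothetical, while the priority value 2**node and the comparison against half of the priority sum follow the text.

```python
def node_with_access(waiting_nodes, current_inds_approved):
    """Return the number of the node that may take the proposed parents.

    waiting_nodes: node numbers of all FSMs currently in the START state.
    Returns None if no node is waiting or the proposed parents are not approved.
    """
    processing_priority = {node: 2 ** node for node in waiting_nodes}
    priority_sum = sum(processing_priority.values())
    for node, priority in processing_priority.items():
        # 2 * priority > priority_sum is the same condition as
        # priority > priority_sum / 2, written without fractions.
        if 2 * priority > priority_sum and current_inds_approved:
            return node
    return None
```

For two waiting nodes the sum is 1 + 2 = 3, so only node 1 (priority 2) passes the check. Once its priority is reset to zero, the sum drops to 1 and node 0 passes in a later cycle.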
Memory access operations in the Processing Node FSMs
Given the size of the population, the memory addresses of the parent
individuals can easily be calculated from their index numbers, which were
delivered as proposed parents. From there, the states LOAD_PARENT_1,
START_LOADING_PARENT_2 and LOAD_PARENT_2 are used to read the
memory lines belonging to the parent individuals from the population memory and
send them to the respective Processing Node via the “pop_data_o” bus. Since each

Figure 19: Data and control flow for the calculation of the access priority weights
and the determination of memory access rights for the different Processing Node
FSMs.

individual consists of multiple memory lines, the counter “mem_access_counter”
is used to iterate through the population memory during a reading process and
transmit the subsequent memory lines one after another (Figure 21).
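The address calculation used while loading a parent can be written down compactly. The sketch below assumes, in line with the state descriptions in Figure 21, that an individual occupies a fixed number of consecutive memory lines (lpNumOfMemAccesses) starting at the product of this number and the individual's index. The function name is hypothetical.

```python
def memory_lines_of(individual, num_of_mem_accesses):
    """Addresses of all memory lines that belong to one individual."""
    base = num_of_mem_accesses * individual  # first line of the individual
    # mem_access_counter walks over the following lines one by one
    return [base + counter for counter in range(num_of_mem_accesses)]
```

With three memory lines per individual, individual number 111 would for example occupy the lines 333, 334 and 335.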
Once the initial transmission of parents is completed, the FSM stays in
WAIT_PROCESSING while the Processing Node proceeds with the
micro-algorithmic steps to generate child solutions. The end of this procedure is
communicated to the Processing Node FSM in the main controller via one of the
signals “transmission_ind_1_out” or “transmission_ind_2_out”. Whichever signal
is received determines whether the first writing process is related to the address
space of the first or the second parent individual. At the same time, the
FSM checks the “unchanged_transmission_out” signal. If this is set to high,
the parent individual for this spot has “survived” the crossover run, so that no
writing process for the specific memory address has to be initiated. In this case,
the FSM directly jumps to the state where it can begin the reading process for the
second child, if one is transmitted. For the specific addressing of the memory, the
FSM uses the parent addresses as stored in the register file “ind_in_use[ ]”. Apart
from that, the reading process - organized in the states START_WB_OFFSPRING_1,
WB_OFFSPRING_1, START_WB_OFFSPRING_2 and WB_OFFSPRING_2 - is
overall very similar to the transmission process of the parents before. Again,
the counter variable “mem_access_counter” is used to iterate through the
address space to store the transmitted individuals from the Processing Nodes.
One important additional step is the check for the best solution found during
the controlled transmission process: The Processing Node sends the fitness of the
best solution found together with the slot to which this solution belongs. The
Processing Node FSM then compares this fitness value with the best fitness
value known so far and, if the new value is better, replaces the stored distance
value with the new one. In parallel, the address of the solution with this superior
quality is stored by combining the information about the slot of the best solution
with the already known addresses of the parents which were originally used for
the crossover process. Thus, both FSMs always keep track of the best solution that
they have “seen” over time - a simple comparison of those “best known solutions”
of all FSMs yields the overall best individual at the end of the algorithmic
processing.
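The bookkeeping of the best known solution can be modelled as follows. This is a simplified sketch: fitness is treated as a tour distance, so a smaller value is better, and all function and parameter names are assumptions for illustration.

```python
def update_best(best_fitness, best_address, new_fitness, slot, parent_addresses):
    """Update one FSM's best known solution after a crossover run.

    slot selects which parent address (0 or 1) the reported solution
    will be stored at; parent_addresses holds both addresses.
    """
    if new_fitness < best_fitness:  # smaller distance means better fitness
        return new_fitness, parent_addresses[slot]
    return best_fitness, best_address


def overall_best(per_fsm_best):
    """Pick the best (fitness, address) pair among all Processing Node FSMs."""
    return min(per_fsm_best)
```

Because every FSM only stores one fitness value and one address, the final comparison at the end of the run needs as many comparisons as there are Processing Nodes, independent of the population size.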
In the end, each process terminates in the state FINAL. From there, the
FSMs switch back to IDLE_PN during the normal operation of the algorithm
to start the next crossover run. Only in one case is another algorithmic
step executed: The FSM belonging to Processing Node 0 switches
to TRANSMIT_CLOCK_CYCLES after its final crossover run to start the
transmission of the best known individual via the UART protocol to an externally
connected workstation for further evaluation of this algorithmic outcome.
Additional complexity is added if an Elitist list is used in the main controller
to keep track of the best known parents and protect them from instantly getting
erased by their even better children: Once the Processing Node begins to write back
the children at the end of the WAIT_PROCESSING state, the controller checks
whether the parents that are exceeded in fitness by their children were elitists. If so,
the children are not written to the addresses of their parents, but to the additionally
protected memory spaces that were already saved at the beginning of the whole
crossover process (Figure 20). Furthermore, after completing the write-back process,
the FSM in the FINAL state checks whether one or both of the transmitted children
belong in the elitist list themselves by comparing their transmitted fitness to the
fitness of the already known worst members of the elitist list. If necessary, those
“worst of the best” can then be replaced by the new children. After that, the FSM
uses the states ELITIST_CHECKER_WAIT_STATE and REPLACE_ELITIST to
iterate through the updated Elitist list and find the two new worst members of
that selection, so that they can potentially be replaced after the next crossover
process.
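The maintenance of the Elitist list can be sketched in software as shown below. This is one possible reading of the procedure and not the thesis' implementation: the list only holds fitness values (distances, smaller is better), the memory addresses are left out, and both function names are hypothetical.

```python
def worst_two(elitist_fitness):
    """Indices of the two worst (largest-distance) members of the elitist list."""
    order = sorted(range(len(elitist_fitness)),
                   key=lambda i: elitist_fitness[i], reverse=True)
    return order[0], order[1]


def update_elitist_list(elitist_fitness, children_fitness):
    """Replace the worst elitists by children that are better than them."""
    updated = list(elitist_fitness)
    for child in sorted(children_fitness):  # handle the better child first
        worst = max(range(len(updated)), key=lambda i: updated[i])
        if child < updated[worst]:
            updated[worst] = child
    return updated
```

Searching the two worst members once after each update, as the FSM does in its dedicated states, means that the comparison against incoming children in the next run only needs the two stored candidates instead of a scan of the whole list.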
UART-transmission in the Processing Node FSM
During the development phase of the FPGA-based implementation of the
Genetic Algorithm, the question of how to communicate the algorithmic results
had to be solved: Since the FPGA does not offer any interface for human-machine
communication, such as a display, it is necessary to transmit the best
individual found, i.e. the desired routing for the given problem set, to an external
computer, which is then able to further process the transmitted individual, for
example for visualization of the routing. For reasons of simplicity, the decision
was made to use the classic UART protocol for the communication with the
workstation: The chosen FPGA already provides an onboard UART-via-USB
interface, which allows for easy transmission of data via the USB connection to the
development PC, and multiple IP-core modules for UART are available. The design
chosen for the main controller provides the following communication interface for
the Processing Node FSM:
• i_Clock: The general clock signal is passed on to the UART module.
• i_Tx_DV: Whenever a byte is sent to the UART module to be transmitted
next, the Processing Node FSM sets this signal to high.
• i_Tx_Byte: Via this bus, the Processing Node FSM sends the next byte to
be transmitted via UART.
• o_Tx_Active: During an active UART transmission, this notifier signal is set
to high.

Figure 20: Main part of the Finite State Machine of the main controller used for
controlling the communication with the Processing Nodes, in which the
micro-algorithm is implemented. Only brief descriptions of the several states are
shown, and also the states for elitism are included. (States shown: IDLE_PN, START,
LOAD_PARENT_1, START_LOADING_PARENT_2, LOAD_PARENT_2, WAIT_PROCESSING,
START_WB_OFFSPRING_1, WB_OFFSPRING_1, START_WB_OFFSPRING_2, WB_OFFSPRING_2,
FINAL, ELITIST_CHECKER_WAIT_STATE, REPLACE_ELITIST.)
Figure 21: Main part of the Finite State Machine of the main controller used for
controlling the communication with the Processing Nodes. Full details regarding
all used variables are shown. The states related to the Elitist list are not depicted.
(States shown: IDLE_PN, START, LOAD_PARENT_1, START_LOADING_PARENT_2,
LOAD_PARENT_2, WAIT_PROCESSING, START_WB_OFFSPRING_1, WB_OFFSPRING_1,
START_WB_OFFSPRING_2, WB_OFFSPRING_2, FINAL, TRANSMIT_CLOCK_CYCLES.)

• o_Tx_Serial: This signal is the serial data output that follows the UART
protocol.
• o_Tx_Done: Once the transmission of a full byte via UART is finished, this
signal is set to high.
With this communication interface, the communication process is always
the same for all data chunks transmitted by the Processing Node FSM
(Figure 22): The states TRANSMIT_CLOCK_CYCLES_INTERMEDIATE
and TRANSMIT_CLOCK_CYCLES are used to send the 256-bit wide
clock counter as an indicator of the calculation time via UART. The
“ClockCounterTransmissionCounter” is used to iterate through this clock counter
and select single byte-wide sections of it for transmission. Those are sent via
“i_Tx_Byte” while “i_Tx_DV” is set. Once “o_Tx_Done” turns high, the next byte
can be transmitted.
After finishing the transmission of the clock counter, the FSM uses the
stored address of the best known solution to begin reading the first memory
line of this individual for further transmission. This is managed in the
states BEGIN_TRANSMITTING, TRANSMIT_SOLUTION_INTERMEDIATE
and TRANSMIT_SOLUTION. In this case, the “serial_transmission_counter” is
used to iterate through the memory line and transfer one byte at a time. Special
precautions must be taken whenever the next memory line has to be read for
transmission: Those accesses to the population memory are
managed in the states READ_WAIT_STATE and READ_WAIT_STATE_2.
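Both transmissions follow the same pattern of cutting a wide word into bytes, starting at the most significant byte, as indicated by the bit slices in Figure 22. A minimal software sketch of this slicing, with a hypothetical function name, could look as follows:

```python
def bytes_msb_first(value, bit_width):
    """Split a bit_width-wide word into bytes, most significant byte first."""
    assert bit_width % 8 == 0
    return [(value >> shift) & 0xFF for shift in range(bit_width - 8, -1, -8)]


# A 72-bit memory line yields 9 bytes, the 256-bit clock counter 32 bytes.
bytes_per_memory_line = len(bytes_msb_first(0, 72))
bytes_per_clock_counter = len(bytes_msb_first(0, 256))
```

On the receiving workstation, the same byte order has to be assumed to reassemble the clock counter and the memory lines of the transmitted individual.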
6.2.3 Linear Feedback Shift Register

Multiple different processes within the macro algorithm implemented in the
central controller require random numbers in different forms: Random parents

Figure 22: Second part of the Finite State Machine of the main controller in charge
of controlling the Processing Nodes, which specifically handles the data exchange
with the UART-modules. (States shown: TRANSMIT_CLOCK_CYCLES_INTERMEDIATE,
TRANSMIT_CLOCK_CYCLES, READ_WAIT_STATE, READ_WAIT_STATE_2,
BEGIN_TRANSMITTING, TRANSMIT_SOLUTION_INTERMEDIATE, TRANSMIT_SOLUTION, DONE.)

have to be chosen from the population memory, node sequences of random lengths
have to be transmitted during the crossover process in the Processing Nodes, and
the Processing Nodes also perform mutation by randomized chance and for a
mutation chunk of randomized length. However, the generation of truly random
numbers is a non-trivial problem on a completely deterministic device such as an
FPGA - and, in the scope of this thesis, a problem that could not be completely
solved. A very simple IP-core that was used to create some of the numbers needed
for the various processing steps is the so-called Linear Feedback Shift Register,
as taken from [35]. The hardware structure of such a module is very simple
(Figure 23): it just consists of a chain of registers, each of which holds one bit of
the internal counter. A feedback loop, channelled through an XNOR-gate, drives the
shifting process for that chain of registers. Thus, the LFSR does not produce truly
“random” numbers; it is actually a regular counter which just does not follow
the regular order of numbers. Moreover, the sequence of numbers is periodic - once
all numbers of its cycle have appeared (for a maximal-length LFSR, every value of
the bit range except one lock-up state), the exact same sequence of numbers is
generated by this module again. Thus, randomness can only be obtained by the
randomization of the access time: The various processes within the algorithm take
an almost unpredictable amount of time, after which they access the LFSR to get
new random numbers for the next crossover run etc. Due to the unpredictability of
this access time, the exact number that is taken from the LFSR cannot be
predicted either.
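The periodic behaviour can be reproduced with a small software model of a 5-bit register as drawn in Figure 23. The tap positions (bits 4 and 2) are an assumption chosen so that the register runs through a maximal-length cycle; the thesis text does not state which taps the IP-core uses, and the function names are hypothetical.

```python
def lfsr_step(state, bit_width=5, taps=(4, 2)):
    """Advance the register by one clock cycle and return the new state."""
    feedback = 1
    for tap in taps:
        feedback ^= (state >> tap) & 1  # starting from 1 turns XOR into XNOR
    mask = (1 << bit_width) - 1
    # Shift left by one position and insert the feedback bit at bit 0.
    return ((state << 1) | feedback) & mask


def lfsr_sequence(seed, length, **kwargs):
    """Return the next `length` states that follow the given seed."""
    states, state = [], seed
    for _ in range(length):
        state = lfsr_step(state, **kwargs)
        states.append(state)
    return states
```

Started at 0, this model visits 31 different values and then repeats exactly. The all-ones value 31 never appears, because with an XNOR feedback it is the state in which the register would lock up.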
Nevertheless, there is still one problem: Two identical instances of an LFSR, started
at the same time and run with the same clock signal, will produce the exact same
“random” numbers in every clock cycle - which is definitely not helpful if the task is
to select two different individuals or to choose the beginning and end of a chunk of
nodes to be reversed. Obviously, an LFSR can be started with a different seed value,

Figure 23: Basic structure of the Linear Feedback Shift Registers with an XNOR-gate
as feedback implementation.
which causes it to produce different numbers. However, the pairing of
the two LFSRs will then still be periodic and does not allow for all possible
combinations. Thus, additional sources of numbers had to be chosen, for example by
selecting some bits of the continuously updated clock counter.
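The limited coverage of such a pair of LFSRs can be made visible by counting the combinations that actually occur. The sketch below uses a 5-bit software model with XNOR feedback; the tap positions and all names are assumptions for illustration only.

```python
def lfsr_step(state):
    """One clock cycle of a 5-bit LFSR with XNOR feedback (taps 4 and 2 assumed)."""
    feedback = 1 ^ ((state >> 4) & 1) ^ ((state >> 2) & 1)
    return ((state << 1) | feedback) & 0b11111


def paired_outputs(seed_1, seed_2, cycles):
    """Collect all (value_1, value_2) pairs seen when both LFSRs run in lockstep."""
    pairs = set()
    a, b = seed_1, seed_2
    for _ in range(cycles):
        a, b = lfsr_step(a), lfsr_step(b)
        pairs.add((a, b))
    return pairs


same_seed = paired_outputs(0, 0, 1000)       # both values always identical
different_seed = paired_outputs(0, 1, 1000)  # fixed offset within the cycle
```

In this model, different seeds only shift one register against the other within the same cycle, so no more than 31 of the 31 x 31 = 961 conceivable pairs ever occur, no matter how long the registers run.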
6.2.4 UART-Controller

For the communication of results to an external development workstation,
the UART protocol was chosen, as explained before in the remarks about the
Processing Node FSM. This requires a specialized communication module,
which was again taken as an already existing IP-core from [36].
UART itself is one of the oldest and most widely used transmission protocols
wherever simplicity is more important than communication speed or bandwidth.
Since it is a serial communication protocol, only a single data line is necessary
for the communication: With a default-high signal, a single low start bit and a
single high stop bit are enough to frame the transmission of five to nine data
bits - normally, the sent data package is one byte.
Therefore, the internal FSM of this module is also rather simple: Given the
parameterized baud rate of the serial communication, a counter is used to sample
the incoming data bus “i_Tx_Byte” and transmit this data word bit by bit. This
transmission is then framed by the correctly set control signals “o_Tx_Active” and
“o_Tx_Done”.
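The framing of a single byte can be illustrated with a short software model. It assumes the common configuration of eight data bits sent least significant bit first, one start bit and one stop bit; the clock frequency and baud rate in the example are arbitrary values, not parameters taken from the thesis.

```python
def uart_frame(byte):
    """Bits on the line for one byte: start bit, 8 data bits (LSB first), stop bit."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]  # the line idles high, so the start bit is low


def clocks_per_bit(clock_frequency_hz, baud_rate):
    """Number of clock cycles the transmitter holds each bit on the line."""
    return round(clock_frequency_hz / baud_rate)


# Example values (assumptions): 100 MHz system clock, 115200 baud.
example_cycles = clocks_per_bit(100_000_000, 115_200)
```

Each transmitted byte therefore occupies ten bit periods on the line, which is one reason why UART is simple but comparatively slow.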
6.3 Population Memory
As can be seen from the description of the Genetic Algorithm used, the
design of the storage used as population memory is of high importance, since the
size and access bandwidth of this storage define the population size and limit the
processing speed, two crucial parameters for the algorithm and for the
processing efficiency as defined above, respectively. Therefore, the design phase
of the architecture included multiple different ideas for the memory technology to
be used at this decisive point of the implementation:
• Both of the FPGAs on the shortlist for the realization of the design offer High
Bandwidth Memory, or HBM for short, which is a next-generation storage
technology only included in high-performance FPGAs and graphics processors
today. Based on the vertical stacking of conventional DRAM structures,
those memories combine large storage space with a very high bandwidth
over multiple access ports. Specifically for the chosen chips by Xilinx and
Intel, this would have meant an 8 Gigabyte HBM that could be accessed over
16 independent AXI-ports with a data width of 256 bits each. While this
sounds like the ideal solution for the purpose of a population memory for a
Genetic Algorithm, this technology has some significant drawbacks:


– First of all, HBM is optimized for sustained read and write accesses
to consecutive memory addresses. This, however, is completely
contrary to the actual memory usage required by a Genetic
Algorithm: All memory accesses are at random addresses, and each
only covers a very small fraction of the whole memory. This would
result in long memory latencies and prevent the advertised enormous
data throughput.
– Furthermore, each of the 16 AXI-ports providing independent memory
access for the same number of Processing Nodes would require a 256-bit
wide data line as well as some additional bits for control purposes. Those
wide connections make the routing process on the FPGA extremely hard and
slow, and also limit the maximum possible clock frequency.
– Finally, it turned out that the documentation of the HBM technology
and the availability of IP-cores for the required hardware controllers are
worse than expected, which leads to an increased development effort.
• For testing purposes in the simulation phase, it was helpful to simulate
the population memory with distributed RAM as register files. While the
big advantage of this implementation is the extremely low read and write
latency, it is not feasible to use this LUT-based memory design for a real
population memory due to the very high hardware utilization.
• Finally, the classic BRAM technology turned out to be the most suitable
solution for the requirements of a population memory in a Genetic Algorithm
for multiple reasons:
– First of all, the chosen FPGA has a maximum capacity of 360 Megabytes
in BRAM-cells. While this is well below the advertised HBM capacity
as described before, it turned out that the maximum size of problems
that can be implemented on the FPGA is limited by the availability of LUTs,
not the memory. Therefore, this somewhat small capacity is suitable for
all use cases regarding the population memory in Genetic Algorithms.
– Furthermore, the BRAM is optimized for randomized memory accesses
and always provides the same low latency of only one clock cycle.
Therefore, the strategy of selecting random individuals from the
population is not penalized with long waiting times for the memory.
– Most importantly, the BRAM allows for two completely independent
memory accesses. This is the necessary basis for at least two-fold
parallelism in the processing of individuals within the Genetic Algorithm.
While the width of each memory line can be chosen freely in general, the technical
handbook of the chosen FPGA specifies that at least one of the memory accesses
has to use the maximum bit width of 72 bits. To enable the highest degree of
parallelism without adaptation for the different Processing Nodes, it was therefore
decided to generally use a memory layout with lines of 72 bit width.
However, this decision requires some planning regarding the organization and
representation of individuals in the memory.

Since the selection process in the main controller needs to calculate the storage
address of an individual directly from its number in the population, each
individual begins in its own new memory line, and each line contains nodes of only
one individual. Furthermore, a decision regarding the alignment of the stored data
within a memory line had to be made: As explained before, each individual consists
only of sub-elements of the same length. The first such sub-element holds the
fitness value of the individual, while all following sections are the actual nodes
of the tour. The BRAM could have been used to the highest possible degree if the
memory organization had not considered this content-related sub-organization of
the stored data. In that case, all 72 bits of each memory line could have been used
to store information of an individual. On the other hand, no alignment between the
representation of nodes and the length of a memory line could have been guaranteed,
which would have meant a significant overhead for the interpretation of the read
memory content in the Processing Nodes: After buffering a whole section from the
memory, the single bits would have to be sorted to the respective nodes that they
belong to. Instead, the decision was made to forgo the maximum utilization of the
available memory and to organize the storage space according to the internal
structure of the stored content: Nodes within an individual are always stored at
the same positions within a memory line, and no node can overlap from one line to
the next. This may leave some unused bits at the end of each 72-bit wide memory
line, but it drastically reduces the processing of read data: Since the position of
a node in a memory line is known in advance and always the same, the data can be
split up directly and inserted into the specialized hardware (Figure 24).
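The arithmetic behind this aligned layout can be illustrated with a short Python
sketch. It is not taken from the implementation; the function names are
hypothetical, and it assumes that the population memory starts at line 0 and that
every individual occupies the same number of lines.

```python
LINE_WIDTH = 72  # bits per BRAM line, as stated in the text


def layout(num_cities, bits_per_entry, line_width=LINE_WIDTH):
    """Return (entries per line, lines per individual,
    unused bits in a full line, unused bits in the last line)."""
    entries_per_line = line_width // bits_per_entry
    total_entries = num_cities + 1  # fitness value plus one entry per node
    lines = -(-total_entries // entries_per_line)  # ceiling division
    unused_full = line_width - entries_per_line * bits_per_entry
    last_entries = total_entries - (lines - 1) * entries_per_line
    unused_last = line_width - last_entries * bits_per_entry
    return entries_per_line, lines, unused_full, unused_last


def base_address(individual_number, lines_per_individual):
    """First memory line of an individual (assumes the population starts at line 0)."""
    return individual_number * lines_per_individual
```

For the example of Figure 24 (7 cities, 20 bits per entry), `layout(7, 20)` yields
3 entries per line, 3 lines per individual, 12 unused bits in a full line and 32
unused bits in the last line.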
6.4 Processing Nodes
As explained before, the two Processing Nodes can be seen as a direct
implementation of the micro-algorithm and are therefore responsible for the
processing of two parent individuals into two children, which are then compared
to each other to evaluate possible progress towards higher fitness in the overall
evolution of the Genetic Algorithm. The architecture of those modules (Figure 25)
again strictly follows the principle of the highest possible parallelism while
sharing common calculations to save hardware resources: While the communication
with the main controller is managed by a common main FSM, the two children are
created independently of each other and therefore in parallel in two separate but
identical children FSMs. This saves time while keeping the hardware overhead to a
minimum.

Example: Individual with 7 cities and 20 bits per city

Memory line   Entry 1         Entry 2        Entry 3        Remainder
...
333           Fitness-Value   City-Node 1    City-Node 2    12 unused bits
334           City-Node 3     City-Node 4    City-Node 5    12 unused bits
335           City-Node 6     City-Node 7    (empty)        32 unused bits in total
...

Figure 24: Example of the storage organization of a single individual with 7 nodes
and 20 bits per node in the 72-bit memory lines of the BRAM used as population
memory.
6.4.1 Communication Interface

As an interface between macro- and micro-algorithm, the communication
connections between the main controller and the Processing Node were designed
to avoid bottlenecks caused by too few communication lines while still minimizing
the number of bits transmitted in parallel, in order to simplify routing in the
later hardware implementation on an FPGA. In line with the described role of the
main FSM as the one in charge of I/O handling, all actively driven connections are
handled by this automaton, while various data lines are also read by the children
FSMs if necessary. The resulting ports are the following:
• clock_i: Transmits the design-wide clock signal to the Processing Node
(labelled clk_i in Figure 25).
• rst_i: Transmits the global reset signal to the Processing Node.

Processing Node (block diagram)
Ports: clk_i, rst_i, init_mode_i, start_parent_1/2_i, finished_parent_1/2_i,
pop_data_i, pop_data_o, input_reading_done, transmission_ind_1/2_o,
unchanged_ind_o, random_city_number_mutation_1/2_i, better_fitness_slot_o,
better_fitness_value_o, do_mutation_i, do_wild_child_i, crossover_1/2_length_i,
mutation_length_1/2_i
Blocks: Main-FSM; register files for Parent 1 and Parent 2 (fitness value and tour
each); Child-1-FSM and Child-2-FSM; register files for Child 1 and Child 2
(fitness value and tour each); one SQR-Calculator per child

Figure 25: Structural overview of the Processing Node, including a common main
FSM and two separate FSMs for parallel processing of two children. Remarkable
is the representation of parents and children in register files, manipulated by those
FSMs.

• init_mode_i: Regarding the number of mutation processes, two phases of the
algorithm can be distinguished, as explained before. The initiation phase,
which is indicated by this communication line, involves multiple mutation
processes for each produced child, while in the later regular working phase
mutation is applied only randomly and at most once per child. This signal
therefore determines the number of mutation steps processed in the child FSMs
of the Processing Node.
• start_parent_1/2_i: The two signals “start_parent_1_i” and “start_parent_2_i”
indicate the transmission of parent 1 and parent 2, respectively, from the
population memory via the main controller to the Processing Node. As long as
one of these signals is set, new memory lines of 72 bits each belonging to the
respective parent are sent and have to be stored in the Processing Node.
• finished_parent_1/2_i: “finished_parent_1_i” and “finished_parent_2_i” are the
counterparts of the two signals mentioned before: When one of them is set, the
transmission of the respective parent is finished. Either the second parent
has to be stored next or, if “finished_parent_2_i” is set and thus both parents
are present in the Processing Node, the independent and parallel computation
of the children can be started.
• pop_data_i: This bus with a width of 72 bits is used to transmit a full memory
line directly from the population memory to the Processing Node. Since the
transmitted memory line is split up, interpreted and stored as defined nodes
rather than as raw bits and bytes, the value of “pop_data_i” has to remain
stable over multiple clock cycles until it is completely processed in the
Processing Node.
• pop_data_o: Like “pop_data_i”, this signal is a 72-bit wide bus and is the
counterpart of the aforementioned one. It is used to transmit the children
solutions from the Processing Node to the main controller and from there to
the population memory, if those children show better fitness values than
their parents.
• input_reading_done: This signal indicates that the Processing Node is done
with reading and storing the content of one memory line transmitted via
“pop_data_i”. It thus notifies the main controller when it is time to read
the next line from the population memory and send it to the Processing Node
via the bus.
• transmission_ind_1/2_o: The two signals “transmission_ind_1_o” and
“transmission_ind_2_o” are used when writing back solutions from the Processing
Node to the main controller and then to the population memory. Analogous to
the two spots of the two loaded parents, two spots for the re-transmission of
solutions exist. A spot either contains no new solution to be sent back, if
the parent at this spot has a fitness value that surpasses its children, or an
actual child individual is transmitted via the outgoing data bus. In any case,
these signals notify the main controller which of the two possible
transmission spots is in use at any given point in time, which for example
determines the memory address if actual data is sent and then written to the
population memory.
• unchanged_ind_o: As described for the previous signals, each write-back spot
of a Processing Node can either transmit an actual child, which is better than
its parents, or stand for the parent itself, if its fitness is still superior.
In the latter case, only the “unchanged_ind_o” signal is set to indicate that
no data is transmitted, since the individual in the population memory remains
untouched.
• random_city_number_mutation_1/2_i: These two buses transmit the starting
points of the chunks of cities within the two children tours that are to be
reversed as part of the mutation processes.
• better_fitness_slot_o: Independently of the solutions sent back to the main
controller, the Processing Node also transmits the best fitness value found.
To make clear to which of the two re-transmitted slots this value belongs,
this bit is set to either 0 or 1 to mark the respective spot.
• better_fitness_value_o: The superior fitness value itself is transmitted over
this bus.
• do_mutation_i: If this bit is set, the Processing Node has to perform mutation
for both children in their respective generation processes.
• do_wild_child_i: According to the “WildChild” concept, which ensures a
satisfactory population diversity, the Processing Node sends back the two new
children regardless of their fitness if this bit is set by the main controller.
• crossover_1/2_length_i: The two buses “crossover_1_length_i” and
“crossover_2_length_i” transmit the length of the chunk to be transferred
from one parent to the respective child in the crossover process. Thus, these
numbers control the percentage of genetic material that is passed directly
from one parent to one child, for each of the two FSMs independently.
• mutation_length_1/2_i: The two buses “mutation_length_1_i” and
“mutation_length_2_i” transmit the length of the chunk to be reversed in the
mutation processes of each of the two children. The generation of these
numbers in the main controller ensures that the whole chunk, starting at
“random_city_number_mutation_1/2_i” and ending at
“random_city_number_mutation_1/2_i + mutation_length_1/2_i”, lies completely
within the range of an individual.
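How the main controller generates these two numbers is not detailed here. Purely
as an illustration of the stated range condition, the following Python sketch
draws a starting point first and then limits the length accordingly. The function
name, the drawing order and the minimum length of 1 are assumptions.

```python
import random


def draw_mutation_chunk(num_cities, rng):
    """Draw (start, length) such that start + length <= num_cities.

    Hypothetical generation scheme: the start index is drawn first,
    then the length is limited to the nodes remaining after it.
    """
    start = rng.randrange(num_cities)
    length = rng.randint(1, num_cities - start)
    return start, length
```

With `rng = random.Random(0)`, every call to `draw_mutation_chunk(7, rng)` returns
a chunk that lies within the 7 nodes of the individual.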
6.4.2 Data I/O in the Parent-FSM

The common so-called Parent Finite State Machine in the Processing Node
has the task of handling the incoming data from the main controller and preparing
the parent individuals for further processing in the children FSMs. Once those two
individual automatons have reached their final states and have thus produced two
feasible children, the parent FSM regains control of the process to compare the
fitness values of the four individuals that are present in the Processing Node
(parents and children) and to select the two best, which are then re-transmitted
to the main controller and the population memory, again under the control of the
parent FSM.
Storage technology for the individual representation
A defining characteristic of the architectural idea of Processing Nodes is that
they require local copies of the parent individuals as well as physical realizations
of the emerging children individuals during the processing, since the various
algorithmic steps such as crossover and mutation potentially involve all nodes
of such an individual, which makes it virtually impossible to operate within the
population memory.

Two technologies were conceivable at an early point of development:
• Similar to the general population memory in the memory controller, the four
individuals in the Processing Node could be stored in four separate BRAM
modules with two access ports each. The clear advantage of this design is the
relatively small footprint of the implementation in terms of hardware usage,
while the major drawback of such a memory architecture is equally clear: For
copying nodes from the parents to the children as well as for the heavily
parallelized processing of the children individuals in the children FSMs, the
availability of only two independent access ports would be a serious
limitation. Furthermore, algorithmic steps that rely on the comparison of
nodes as a key part of the decision flow would become more complex and slower
due to the waiting clock cycle of BRAM read accesses.
• The second implementation variant uses tailored register files for the four
individuals, in which each node is stored in its own partitioned register that
allows direct access to the node coordinates, the demand section and the final
depot bit. Compared to a BRAM-based implementation, this storage decision
requires more resources, especially the considerably more valuable Look-Up
Tables (LUTs) of the FPGA hardware, and large register files made of LUTs
additionally tend to limit the highest achievable clock speed of a design. On
the other hand, such a register-file-based implementation provides the highest
possible degree of parallelism for the algorithmic processing of the children
individuals in the respective FSMs.
When the decision was made to go for a BRAM-based population memory, it was
also easy to decide on a suitable individual storage solution in the Processing
Node: Since the BRAM population memory limits the number of Processing Nodes
to two, a slightly larger resource utilization per Processing Node does not impact
the overall design too much. Thus, it makes sense to accept the larger footprint of
the register-file individual storage in order to gain the flexibility and easy
usability of such an implementation.

Storing incoming parents
The process of storing incoming parents from the main controller is handled
in a number of states of the parent FSM (Figure 27) and is relatively simple
compared to the communication of the outgoing individuals, since no exceptional
cases have to be included in the control flow. The incoming communication follows
a strict pattern in which the two parents are sent one after the other: The
beginning of a new transmission is indicated in the “IDLE_PARENT” state by the
“start_parent_1_i” signal. At this point, the first memory line of 72 bits from
the population memory is already present on the “pop_data_i” bus, so that the
first section of this line, which holds the fitness value of the incoming first
parent, can be read and stored in a dedicated register. After that, the state
“READING_INPUT_VALUES_PARENT_1” is entered to dissect the rest of the transmitted
memory line into single nodes and store them in the parent 1 register file. For
this process, two hardware counters run in parallel: While the
“city_1_load_counter” addresses the right sub-field in the register file for the
current node, the “incoming_data_counter” points to the next node section within
the 72-bit wide “pop_data_i” bus. The correct addressing has to rely on the
specifically chosen bit width of each node in the memory line. Given this
information, the latter counter can be used to detect whether the last node of a
72-bit memory line has been read. In this case, the main controller is notified
that the next memory line needs to be read from the population memory and sent
via the bus to the Processing Node. The waiting cycle for this new content is
bridged in “WAIT_STATE_1” of the Processing Node, before the reading of the next
transmitted memory line begins in “START_NEW_INPUT_LINE_PARENT_1” and is
completed again in “READING_INPUT_VALUES_PARENT_1”.
Once the main controller signals the transmission of the final memory line of one
parent individual, the FSM switches to the state “BEGIN_PARENT_2” to first
read the fitness value of the second parent individual and then begin the same
storing process for this second individual transmitted from the population memory.
Once this process is finished, the FSM returns to “IDLE_PARENT”, while the
children FSMs begin to create offspring individuals. The next step of the parent
FSM is activated when those two independent children processes are finished.
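The interplay of the two counters can be modelled in software. The following
Python sketch is a simplified behavioural model, not the hardware description:
it treats each 72-bit line as an integer, assumes that entries are stored starting
at the most significant bits (as the part-selects beginning at bit 71 in Figure 27
suggest), and uses the hypothetical function names `unpack_line` and `load_parent`.

```python
LINE_WIDTH = 72  # width of pop_data_i in bits


def unpack_line(line, entry_width, line_width=LINE_WIDTH):
    """Split one memory line into its entries, most significant entry first."""
    entries_per_line = line_width // entry_width
    mask = (1 << entry_width) - 1
    return [(line >> (line_width - (k + 1) * entry_width)) & mask
            for k in range(entries_per_line)]


def load_parent(lines, num_cities, entry_width):
    """Return (fitness, tour) of one parent from its sequence of memory lines."""
    fitness = None
    tour = []
    city_load_counter = 0  # position in the parent register file
    for line_index, line in enumerate(lines):
        entries = unpack_line(line, entry_width)
        incoming_data_counter = 0  # position within the current line
        if line_index == 0:
            fitness = entries[0]       # first entry of the first line
            incoming_data_counter = 1  # nodes start at the second entry
        while (incoming_data_counter < len(entries)
               and city_load_counter < num_cities):
            tour.append(entries[incoming_data_counter])
            city_load_counter += 1
            incoming_data_counter += 1
        # Here the hardware would request the next line from the main controller.
    return fitness, tour
```

In this model, the end of the inner loop corresponds to the point at which the
main controller is notified that the next memory line is needed.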
Comparison of fitness values and writing-process
Once both children FSMs have reached their terminal state and are therefore
done with the creation of the offspring individuals, the signal “children_finished”
is set to notify the parent FSM, which then initiates the process of writing
back solutions to the main controller and the population memory.
First, it is necessary to find the two best individuals according to their
fitness values, or in other words, the two individuals that each dominate at least
two other individuals out of the group of four. This task is solved by
combinational logic (Figure 26), which generates the four signals
“parent_1_dominates_2”, “parent_2_dominates_2”, “child_1_dominates_2” and
“child_2_dominates_2”. With this information, it is then necessary to decide how
to assign the two winning individuals to the possible output slots. Since those
slots are tied to the hardware storage addresses of the original parent
individuals, the assignment gives highest priority to sorting the old parents, if
they are among the two winning individuals, into their old slots. Therefore, the
slot assignment has to be hard-coded and cannot be solved by a simple priority
scheme (Table 1). In the end, this reduces the re-transmission time, since parents
in their old slots do not need to be physically written back bit by bit to the
population memory: their representation is still stored there.
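The selection and slot assignment can be summarized in a small Python model. This
is a sketch under several assumptions rather than the implemented logic: a smaller
fitness value is taken to be better (as the comparisons in Figure 27 suggest),
ties are resolved in favour of the parents, both children are returned when
“do_wild_child_i” is set, and two winning children are placed in slot 0 and slot 1
in that order. The function name `select_slots` is hypothetical.

```python
def select_slots(f_p1, f_p2, f_c1, f_c2, do_wild_child=False):
    """Return the names of the individuals assigned to (slot 0, slot 1)."""
    if do_wild_child:
        # Assumption: the WildChild mechanism bypasses the fitness comparison.
        return ("child_1", "child_2")
    fitness = {"parent_1": f_p1, "parent_2": f_p2,
               "child_1": f_c1, "child_2": f_c2}
    # sorted() is stable, so on equal fitness the earlier name (a parent) wins.
    ranked = sorted(fitness, key=lambda name: fitness[name])
    winners = set(ranked[:2])  # each winner dominates at least two others
    children = [n for n in ("child_1", "child_2") if n in winners]
    # Winning parents keep their old slots; children fill the remaining ones.
    slot_0 = "parent_1" if "parent_1" in winners else children.pop(0)
    slot_1 = "parent_2" if "parent_2" in winners else children.pop(0)
    return slot_0, slot_1
```

For example, `select_slots(30, 10, 15, 40)` keeps parent 2 in slot 1 and places
child 1 in slot 0, so that only one individual has to be written back.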

Data path (schematic)
Inputs: fitness_value_parent_1, fitness_value_parent_2, fitness_value_child_1,
fitness_value_child_2, do_wild_child_i
Logic: pairwise comparators (a < b) whose results are combined by AND (&) and
OR (≥1) gates
Outputs: parent_1_dominates_2, parent_2_dominates_2, child_1_dominates_2,
child_2_dominates_2

Figure 26: Data path for the comparison of the fitness values to determine the
dominance of individual solutions in comparison to the three other solutions present
in the Processing Node

For the write-back of the individual output slots, the read-in hardware can
practically be reused in reverse order: The signals “transmission_ind_1_o” and
“transmission_ind_2_o” are set to announce whether the transmission of slot 0 or
slot 1 follows. Then, 72 bits at a time are read from the register file
representation of the respective individual and sent via the data bus
“pop_data_o” to the main controller. Again, two counters are used to track the
state of this process: The “city_counter” addresses and selects the bits within
the register file to be transmitted, while the counter “cityCounterParents”
tracks the overall number of writing steps that have already occurred for the
current individual. Once the whole register file has been transmitted via the bus,
the write-back process either switches to the next individual to be sent or is
terminated, as determined by the internal signal “second_child_to_be_transmitted”.
In the latter case, the parent FSM returns to the idle state to wait for the next
incoming parents.
If the slot assignment process assigns one of the parent individuals to its old
slot, no writing has to be initiated. Instead, the transmission-indication signal
is set together with the “unchanged_ind_o” signal.

p1d2   p2d2   c1d2   c2d2   slot 0     slot 1
X      X                    parent 1   parent 2
X             X             parent 1   child 1
X                    X      parent 1   child 2
       X      X             child 1    parent 2
       X             X      child 2    parent 2
              X      X      child 1    child 2

Table 1: Assignment of the individuals to the re-transmission slots due to their
dominance regarding the fitness values
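The write-back is the mirror image of the read-in and can be modelled in the same
way. The following Python sketch packs a fitness value and a tour into 72-bit
lines, again treating a line as an integer and assuming that entries are placed
starting at the most significant bits. The function name `pack_individual` is
hypothetical.

```python
def pack_individual(fitness, tour, entry_width, line_width=72):
    """Pack the fitness value and all nodes of a tour into memory lines."""
    entries_per_line = line_width // entry_width
    mask = (1 << entry_width) - 1
    entries = [fitness] + list(tour)  # fitness value occupies the first entry
    lines = []
    for base in range(0, len(entries), entries_per_line):
        line = 0
        for k, entry in enumerate(entries[base:base + entries_per_line]):
            # Entry k of a line is placed k * entry_width bits below bit 71.
            line |= (entry & mask) << (line_width - (k + 1) * entry_width)
        lines.append(line)  # unused bits at the end of a line stay zero
    return lines
```

Each element of the returned list corresponds to one value driven on “pop_data_o”.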
6.4.3 Algorithmic Processing in the Children-FSM

The two children FSMs operate independently and implement all the
important processing steps of the micro-algorithm as explained in the previous
chapter. In terms of timing, their operation takes place after the successful
loading of the parent solutions by the parent FSM, which is in turn triggered to
run the final re-transmission once the children processing is finished. The
different algorithmic steps are implemented as follows (Figure 28):
Crossover processing - generating two “raw” children
The crossover processing, i.e. the initial creation of “raw” offspring
solutions, can be divided into two sub-processes: First, an unchanged chunk
is loaded from one parent into a child. After that, the rest of the child tour is
filled up with nodes from the second parent, which requires a comparison between
the nodes to be loaded and the nodes already present in the child tour.

States and actions of the main FSM:

IDLE_PARENT
  • transmission_ind_1_o <= 0; transmission_ind_2_o <= 0; unchanged_ind_o <= 0;
    better_fitness_slot_o <= 0; better_fitness_value_o <= 0
  • start_parent_1_i: fitness_value_parent_1_i <=
    pop_data_i[71:71-lpCityEntryBitWidth]/2; city_1_load_counter <= 0;
    city_2_load_counter <= 0
  • children_finished: branch on the pairs of dominance signals
    (parent_1_dominates_2 && parent_2_dominates_2,
    parent_1_dominates_2 && child_1_dominates_2,
    parent_1_dominates_2 && child_2_dominates_2,
    parent_2_dominates_2 && child_1_dominates_2,
    parent_2_dominates_2 && child_2_dominates_2,
    child_1_dominates_2 && child_2_dominates_2, else); the branches set
    transmission_ind_1_o, transmission_ind_2_o, unchanged_ind_o and
    second_child_to_be_transmitted and drive better_fitness_slot_o (b_f_s_o) and
    better_fitness_value_o (b_f_v_o) from a <= comparison of the two selected
    fitness values, e.g. f_v_p_1 <= f_v_p_2: b_f_s_o <= 0, b_f_v_o <= f_v_p_1;
    else: b_f_s_o <= 1, b_f_v_o <= f_v_p_2
READING_INPUT_VALUES_PARENT_1/2
  • incoming_data_counter < lpNumOfEntriesPerMemLine:
    parent_1/2_tour[city_1/2_load_counter] <=
    pop_data_i[71 - incoming_data_counter*lpCityEntryBitWidth -: lpCityEntryBitWidth];
    city_1/2_load_counter++; incoming_data_counter++
WAIT_STATE_1/2
  • input_reading_done <= 0
START_NEW_INPUT_LINE_PARENT_1
  • input_reading_done <= 0
START_NEW_INPUT_LINE_PARENT_2
  • input_reading_done <= 0
BEGIN_PARENT_2
  • start_parent_2_i: fitness_value_parent_2 <=
    pop_data_i[71:71-lpCityEntryBitWidth]/2; incoming_data_counter <= 1;
    city_2_load_counter <= 0
PARENT_2_UNCHANGED
  • transmission_ind_1_o <= 0; transmission_ind_2_o <= 1; unchanged_ind_o <= 1
TRANSMIT_CHILD_1/2_BEGIN
  • transmission_ind_1_o: transmission_ind_1_o <= 0; transmission_ind_2_o <= 1;
    else: transmission_ind_1_o <= 1; transmission_ind_2_o <= 0
  • unchanged_ind_o <= 0
  • pop_data_o[71 -: lpCityEntryBitWidth] <= fitness_value_child_1
  • for(cityCounter < lpNumOfEntriesPerMemLine):
    pop_data_o[71 - lpCityEntryBitWidth*cityCounter -: lpCityEntryBitWidth] <=
    child_1_tour[cityCounter-1]
  • cityCounterParents <= lpNumOfEntriesPerMemLine - 1
TRANSMIT_CHILD_1/2_REST
  • for(cityCounter < lpNumOfEntriesPerMemLine):
    pop_data_o[71 - cityCounter*lpCityEntryBitWidth -: lpCityEntryBitWidth] <=
    child_1_tour[cityCounterParents + cityCounter]
  • cityCounterParents < pNumCities - lpNumOfEntriesPerMemLine:
    cityCounterParents <= cityCounterParents + lpNumOfEntriesPerMemLine;
    else: cityCounterParents <= 0
  • Child 2: second_child_to_be_transmitted <= 0

Figure 27: Main Finite State Machine of the Processing Node, which is in charge
of handling the data I/O with the main controller and storing the incoming values
in the parent-register files

States of the children FSM:

IDLE
  • Wait until parents are completely loaded, reset all register-stored variables
LOAD_CHILD
  • Copy the randomly selected chunk of nodes from the first parent to the child
INSERT_REST_NODES
  • Insert nodes in the remaining empty spots of the child from the second parent
PERFORM_MUTATION
  • Reverse the order of nodes within a randomly selected chunk of nodes within
    the child
DEPOT_INSERTION
  • Iterate through the child and set the depot bits as required by the capacity
    limit
  • Calculate first pair of distances for the 2-opt algorithm
TWO_OPT_WAIT_STATE
  • Wait until the square-root modules have finished the calculation of the
    distance values for the 2-opt algorithm
TWO_OPT_CALCULATION
  • Start a new distance calculation for a new pair of nodes
  • Decide whether the 2-opt algorithm should be terminated
  • Start general fitness calculation
TWO_OPT_EVALUATION
  • Compare a set of distances to evaluate whether a swap of nodes is required
  • If so, swap the first two framing nodes
TWO_OPT_SWAP
  • Swap the remaining nodes within the selected chunk as determined by the
    2-opt algorithm
DISTANCE_CALCULATION_WAIT_STATE
  • Wait until the square-root modules have finished the calculation of the
    distance values for the 2-opt algorithm
DISTANCE_CALCULATION
  • Start a new distance calculation between two non-depot nodes as part of the
    fitness calculation for the individual fitness
DISTANCE_CALCULATION_DEPOT_STOP_WAIT_STATE
  • Wait until the square-root modules have finished the calculation of the
    distance values
DISTANCE_CALCULATION_DEPOT_STOP
  • Start a new distance calculation between a depot stop and the following node
DISTANCE_CALCULATION_INTERMEDIATE_FINAL_WAIT_STATE
  • Wait until the square-root modules have finished the calculation of the
    distance values
DISTANCE_CALCULATION_INTERMEDIATE_FINAL
  • Add the distance from the last node to the last following depot stop
DISTANCE_CALCULATION_FINAL
  • Final step of distance calculation, return to idle state

Figure 28: Simplified depiction of the FSM for the microalgorithmic processing in
the Processing Node. Includes all waiting states for the calculation of euclidean
distance values.
After being notified that the storing of the parent individuals is complete, the
child FSM switches to the state “LOAD_CHILD” (Figure 30), where the copying of
a chunk of nodes is initiated. Instead of loading all nodes of the chunk of random
length into the child register file at once, the transfer of information is
organized node by node. For this purpose, the counter
“city_selector_child_1_counter_1” (taking the child 1 FSM as an example) is used
not only to address the empty spots in the child register file beginning at the
very first position, but also to select the nodes to be copied from the register
file of parent 1 beginning at the randomly selected “crossover_point_child_1”.
Thus, the process of copying the sequence of nodes is variable in length,
depending on the randomly selected “crossover_1_length_i”.
Following this loading process, the FSM switches to the state
“INSERT_REST_NODES”, where the remaining empty spots in the child register file
are filled up with not yet present nodes from the other parent individual. For
this purpose, two different counters are used:
• city_selector_child_1/2_counter_1: This counter addresses the next slot to
be filled in the child register file. The currently selected element from the
parent individual is always copied to this spot, even before the uniqueness
check for the copied node has run. Therefore, this counter is only
incremented if the node copied to the corresponding spot has proven to be
unique in the child individual. Otherwise, the counter keeps marking the same
spot in the child tour until it is filled with a node that has not been copied
to the child individual so far. At the beginning of the node insertion
process, this counter is set to the spot after the last copied element.
• city_selector_child_1/2_counter_2: This counterpart of the first pointer is
used to address the elements to be copied from parent 2 to child 1 or from
parent 1 to child 2. Unlike the child-related counter, this value is increased
by one in each clock cycle in order to test each node of the parent, one after
the other, for whether it is already present in the child individual.

Data flow (schematic): The node of Parent 2 selected by
city_selector_child_1_counter_2 (here Node 2.6) is copied to the spot of Child 1
selected by city_selector_child_1_counter_1 and compared (=) in parallel with the
prefilled nodes of Child 1; the comparison results form the signal
node_already_in_child_1.

Figure 29: Data flow for the comparison of a node from parent 2 to be inserted
into the next empty spot of child 1.
As one can see in Figure 29, the whole comparison of the node to be inserted
into the child individual with the nodes already present in the child is done
in parallel with the help of combinational logic. Thus, each check requires only
one clock cycle, which limits the maximal duration of the whole node insertion
process to a number of clock cycles equal to the number of nodes in a parent tour.
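The two crossover sub-processes and their counters can be reproduced in a short
Python model. It is a behavioural sketch only: the parallel comparison is replaced
by a membership test, the copied chunk is assumed not to wrap around the end of
the parent tour, and the function and variable names are hypothetical
simplifications of the counters described above.

```python
def crossover(parent_a, parent_b, crossover_point, crossover_length):
    """Create one raw child from two parent tours (lists of distinct nodes)."""
    n = len(parent_a)
    child = [None] * n

    # LOAD_CHILD: copy the chunk node by node to the front of the child.
    counter_1 = 0
    while counter_1 < crossover_length:
        child[counter_1] = parent_a[crossover_point + counter_1]
        counter_1 += 1

    # INSERT_REST_NODES: one candidate of the other parent per step.
    counter_2 = 0
    while counter_2 < n and counter_1 < n:
        candidate = parent_b[counter_2]
        already_in_child = candidate in child[:counter_1]
        child[counter_1] = candidate  # copied before the check is evaluated
        counter_2 += 1
        if not already_in_child:
            counter_1 += 1  # keep the node only if it is new to the child
    return child
```

The second loop runs at most once per node of the other parent, which mirrors the
upper bound on the number of clock cycles stated above.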


Mutation
The whole mutation process, if applied, is managed in the state
“PERFORM_MUTATION”: Starting at the randomized starting point of the chunk of
cities to be reversed and within its randomized length, the positions of two
nodes are exchanged in each clock cycle. Therefore, the duration of the mutation
process depends on the length of the chunk to be reversed.
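The reversal by pairwise exchange can be written down as a minimal Python sketch.
One loop iteration stands for one clock cycle; the function name `reverse_chunk`
and the returned cycle count are illustrative additions.

```python
def reverse_chunk(tour, start, length):
    """Reverse tour[start:start + length] by swapping one pair per step.

    Returns the mutated tour and the number of swap steps (modelled cycles).
    """
    tour = list(tour)
    cycles = 0
    for k in range(length // 2):
        i = start + k               # node at the front of the chunk
        j = start + length - 1 - k  # mirrored node at the back of the chunk
        tour[i], tour[j] = tour[j], tour[i]
        cycles += 1
    return tour, cycles
```

A chunk of length L therefore needs L // 2 swap steps in this model, which
illustrates why the duration grows with the chunk length.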
Depot insertion
After mutation, the FSM switches to the state “DEPOT_INSERTION” to
correctly set all depot bits where they are required by the summed-up demands
of the nodes within the children tours. The process itself is rather simple: The
counter “city_selector_child_1/2_counter_1” (see Figure 30) is used to iterate
through the nodes of the children tours. At the same time, a register called
“capacity_counter_child_1/2” stores the accumulated demand of all visited
nodes. Generally speaking, the depot bits of all visited nodes are set to one and
“capacity_counter_child_1/2” is increased. Only if “capacity_counter_child_1/2”
exceeds the delivery capacity of the truck in the benchmark problem, the depot
bit of the node before the current one is set to zero, and the node iterator
value is decreased by one to start iterating through the nodes of the new subtour.
After this process, the FSM switches over to the 2-opt heuristic algorithm.
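The effect of this step, splitting a tour into subtours that respect the vehicle
capacity, can be sketched in Python as follows. The sketch deliberately simplifies
the hardware procedure: it looks ahead instead of stepping back by one node, it
assumes that every single demand fits into the vehicle, and the polarity of the
returned flag (1 marks the last node before a return to the depot) is an
assumption. The function name `insert_depot_stops` is hypothetical.

```python
def insert_depot_stops(demands, vehicle_capacity):
    """Return one flag per node; 1 marks the last node of a subtour.

    The final subtour is closed implicitly by the end of the tour,
    so the last node is not flagged in this sketch.
    """
    depot = [0] * len(demands)
    load = 0  # accumulated demand of the current subtour
    for i, demand in enumerate(demands):
        if i > 0 and load + demand > vehicle_capacity:
            depot[i - 1] = 1  # previous node ends the current subtour
            load = 0          # a new subtour starts with the current node
        load += demand
    return depot
```

For demands `[4, 3, 5, 2, 6]` and a capacity of 10, the tour is split after the
second and after the fourth node.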
2-opt processing
The definitely most complex part of the micro-algorithmic processing in the
children FSMs is the implementation of the 2-opt heuristic, which is used to find
and eliminate crossing routes within subtours of individuals in order to improve
the overall fitness of the children. Generally speaking, the whole algorithm can be

79


•




•
•

IDLE
child_1_finished <= 0
finished_parent_2_i:

city_selector_child_1_counter_1 <= 0

city_selector_child_1_counter_2 <= 0

child_1_iterator <= 0

capacity_counter_child_1 <= 0

init_mode_set_child_1 <= init_mode_i

mutation_process_counter_child_1 <= 0

fitness_value_child_1 <= 0

INSERT_REST_NODES
child_1_tour
[city_selector_child_1_counter_1] <= parent_2_tour
[city_selector_child_1_counter_2]
city_selector_child_1_counter_2++
!node_already_in_child_1:

city_selector_child_1_counter_1++
city_selector_child_1_counter_2
==
pNumCities:
•
do_mutation_i
||
init_mode_set_child_1:

first_mutation_point_child_1 <=
random_city_number_mutation_1_i
•
else:

city_selector_child_1_counter_1
<= 0

capacity_counter_child_1 <= 0

•

c_s_c_1_c_1
== pNumCities
&&
(do_mutation_i
||
ini_mode_set_
child_1)

•

DEPOT_INSERTION
capacity_counter_child_1 <= pVehicleCapacity:

capacity_counter_child_1 <= capacity_counter_child_1 + child_1_tour
[city_selector_child_1_counter_1].no
deCapacity

child_1_tour
[city_selector_child_1_counter_1].de
pot <= 0

city_selector_child_1_counter_1 <=
city_selector_child_1_counter_1 + 1
else:

capacity_counter_child_1 <= 0

city_selector_child_1_counter_1 <=
city_selector_child_1_counter_1 - 1

child_1_tour
[city_selector_child_1_counter_1].de
pot <= 1

•

•

city_selector_child_1_counter_1 != pNumCities

•

LOAD_CHILD
city_selector_child_1_counter_1 < crossover_1_length_i:

child_1_tour
[city_selector_child_1_counter_1]
<=
parent_1_tour[crossover_point_child_1
+ city_selector_child_1_counter_1]

city_selector_child_1_counter_1++
else:

city_selector_child_1_counter_1
<=
crossover_1_length_i

child_1_iterator <= 0

•

c_s_c_1_c_1
== pNumCities
&& !
(do_mutation_i
||
ini_mode_set_
child_1)

PERFORM_MUTATION
city_selector_child_1_counter_1 < pMutationDistance/2:

child_1_tour
[first_mutation_point_child_1
+
city_selector_child_1_counter_1] <=
child_1_tour
[first_mutation_point_child_1 + pMutationDistance
city_selector_child_1_counter_1 - 1]

child_1_tour
[first_mutation_point_child_1 + pMutationDistance
city_selector_child_1_counter_1 - 1]
<=
child_1_tour
[first_mutation_point_child_1
+
city_selector_child_1_counter_1]
else:

city_selector_child_1_counter_1 <= 0
•
init_mode_set_cild_1
&&
mutation_process_counter_child_1 < pInitMutationProcesses:

first_mutation_point_child_1 <=
random_city_number_mutation_1_i

mutation_process_counter_child_1 <=
mutation_process_counter_child_1 +
1
•
else:

city_selector_child_1_counter_1
<= 0

capacity_counter_child_1 <= 0

else
!(c_s_c_1_c_1 < pMutationDistance/2) && init_mode_set_child_1 && mutation_process_counter_child_1 <
pInitMutationProcesses

Figure 30: First part of the Finite State Machine of the Processing Node that
is in charge of processing the children, including crossover, mutation and depot
insertion. Complete overview of all used variables and registers.


broken down into three steps, which are directly incorporated by the three main
states in the FSM that are related to that part of the algorithm (Figure 32):
• TWO_OPT_CALCULATION: This state determines the two nodes that are under
investigation as possible start and end points of a chunk of nodes to be
reversed as part of the 2-opt improvements. To this end, the x- and
y-coordinates are selected and passed on to the distance calculation hardware
implemented on DSP slices (Figure 31). Two positions are important here: the
first pointer “city_selector_child_1/2_counter_1” addresses the starting point
of a possible chunk of nodes, while “city_selector_child_1/2_counter_2” marks
the end of this potential sequence of cities. As long as no potential
improvement through a reversal of order is detected, the latter counter is
increased. Once it reaches the last node within the current subtour, marked by
the set depot bit, the first counter is increased by one and the second counter
is set to the position following this new starting node. Once the first counter
points to the end of the current subtour, a special re-arrangement of the
pointers is necessary: if improvements through re-ordering of nodes were made,
the two counters are set back to the beginning of the subtour for another 2-opt
pass over it. If not, the pointers are set to the beginning of the next
subtour. Once the end of the individual is reached, the 2-opt algorithm
terminates.
Regarding the distance comparison that is used to evaluate the necessity of a
swap of nodes, four distances have to be evaluated: the distance between the
starting point of the potentially reversed chunk and its predecessor, the
distance between the end point of the chunk and its successor, and the
distances between those two framing points of the chunk and their potential
new neighboring nodes after a reversal of order. For this, as mentioned
before, the coordinates are fed into a DSP-based combinational distance
calculation logic, which delivers the required squared (non-Euclidean)
distance values within the same clock cycle, so that each checking step of the
algorithm takes only one clock cycle.
• TWO_OPT_EVALUATION: After each counter increment initiated in
TWO_OPT_CALCULATION, the FSM switches to this state to check the calculated
distance values and to evaluate whether the order of the nodes within the chunk
of cities currently under investigation should be reversed. If
part_dist_1 + part_dist_3 > part_dist_5 + part_dist_4 - which means that the
distance from the framing nodes of the potential chunk to their current
neighbors is longer than the distance to their potential new neighbors after
a reversal of the order of the nodes - such a reversal is initiated. The first
swap, namely the one between the two framing nodes, is processed in this state
itself, where it is important to check for the case that the end point of the
chunk is followed by a depot stop. In either case, the FSM then switches to
TWO_OPT_SWAP to reverse the order of the remaining nodes between the two
framing cities. If the condition does not hold, the automaton jumps back to
TWO_OPT_CALCULATION.
• TWO_OPT_SWAP: The swapping of the remaining nodes in the chunk is organized
in this state of the FSM. In each step, two nodes at opposite ends of the chunk
are selected and exchange their positions. Once this process is finished for
all nodes within the chunk, the FSM switches back to the standard
TWO_OPT_CALCULATION state.
[Figure 31: calculation network. The x- and y-coordinates of the nodes C.1 to
C.6 (X1/Y1 to X6/Y6) are fed into subtractors, multipliers and adders that
produce part_dist_1, part_dist_3, part_dist_4 and part_dist_5. Two further
adders and a comparator (a < b) derive the swap decision from these values.]

Figure 31: Calculation network implemented to check for the necessity of swaps
within the 2-opt algorithm. All calculation-related circuits, especially the adders
and multipliers, are later mapped onto the dedicated DSP slices of the FPGA.

However, it is up to the final tuning of the algorithm to decide whether to
keep this squared distance metric or to use the real Euclidean distance
instead. In the latter case, the pre-calculated distances from the DSP slices
are fed into the square-root calculator modules. While the square roots are
being calculated, the FSM switches to the state TWO_OPT_WAIT_STATE and waits
until the results are available. Only then can the evaluation of those
distances begin in TWO_OPT_EVALUATION (Figure 28).
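
The swap criterion and the pairwise reversal described above can be sketched in
software. The following Python model is only an illustration under stated
assumptions: the function names, the tuple-based node representation and the
step counter are hypothetical and not taken from the hardware design, the
squared distance stands in for the DSP-based calculation of Figure 31, and the
handling of depot bits is left out.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def squared_distance(a: Point, b: Point) -> int:
    """Squared Euclidean distance; no square root is taken (assumed metric)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def reversal_pays_off(pred: Point, start: Point, end: Point, succ: Point) -> bool:
    """Mirror of the check part_dist_1 + part_dist_3 > part_dist_5 + part_dist_4."""
    part_dist_1 = squared_distance(pred, start)  # predecessor -> chunk start
    part_dist_3 = squared_distance(end, succ)    # chunk end -> successor
    part_dist_5 = squared_distance(pred, end)    # predecessor -> chunk end
    part_dist_4 = squared_distance(start, succ)  # chunk start -> successor
    return part_dist_1 + part_dist_3 > part_dist_5 + part_dist_4


def reverse_chunk(tour: List[Point], first: int, last: int) -> int:
    """Reverse tour[first..last] in place, exchanging one pair of nodes per step.

    Returns the number of swap steps, a rough stand-in for clock cycles.
    """
    steps = 0
    while first + steps < last - steps:
        i, j = first + steps, last - steps
        tour[i], tour[j] = tour[j], tour[i]
        steps += 1
    return steps


if __name__ == "__main__":
    # Hypothetical four-node example: predecessor, chunk start, chunk end, successor.
    tour = [(0, 0), (3, 0), (1, 0), (4, 0)]
    if reversal_pays_off(tour[0], tour[1], tour[2], tour[3]):
        reverse_chunk(tour, 1, 2)
    print(tour)
```

Note that a comparison of sums of squared distances is not guaranteed to agree
with a comparison of sums of Euclidean distances, which is why the choice of
metric is a tuning decision.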
Distance calculation
To provide the correct Euclidean fitness value for the child individuals,
the counter “city_selector_child_1/2_counter_1” is reused to iterate through the
individual and to feed the coordinates into the same DSP-based calculation
hardware as before. In contrast to the 2-opt heuristic, the distance
results are then passed to the external square-root calculator module to
obtain the exact Euclidean distances. Since the calculation time of this
module depends on the specific calculation, the FSM has to switch to the
DISTANCE_CALCULATION_WAIT_STATE to wait for the final outcome of the
calculation (Figure 28). Special treatment must be given to those nodes
which have a set depot bit and are therefore followed by a depot stop. In
such a case, the distance to the depot needs to be calculated, as well as the
distance from the depot to the next node stored in the individual. Once the
counter reaches the final node of the individual, the process finishes in the
state DISTANCE_CALCULATION_FINAL. Once both children FSMs have reached this
final state, a combined signal tells the main FSM to start the comparison of the
fitness values and then the re-transmission of individuals.
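
The fitness calculation with depot stops can be illustrated by a short software
model. This is a minimal sketch under assumptions: the `Node` type and the
function names are hypothetical, the tour is assumed to start and end at the
depot, and floating-point square roots are used instead of modelling the
precision of the hardware square-root calculator.

```python
import math
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    x: int
    y: int
    depot: bool  # set if the vehicle returns to the depot after this node


def euclidean(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Euclidean distance between two points."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def fitness(tour: List[Node], depot: Tuple[int, int]) -> float:
    """Total route length of an individual, including all depot stops."""
    if not tour:
        return 0.0
    # Assumption: the first leg leads from the depot to the first node.
    total = euclidean(depot, (tour[0].x, tour[0].y))
    for index, node in enumerate(tour):
        here = (node.x, node.y)
        if index == len(tour) - 1:
            # The final node is always followed by the return to the depot.
            total += euclidean(here, depot)
        elif node.depot:
            # Depot bit set: drive to the depot, then on to the next node.
            nxt = (tour[index + 1].x, tour[index + 1].y)
            total += euclidean(here, depot) + euclidean(depot, nxt)
        else:
            nxt = (tour[index + 1].x, tour[index + 1].y)
            total += euclidean(here, nxt)
    return total
```

For a hypothetical individual with three nodes and one depot bit, the function
adds up five legs: depot to first node, first to second node, second node to
depot, depot to third node, and third node to depot.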
[Figure 32: state diagram with the states TWO_OPT_CALCULATION,
TWO_OPT_EVALUATION, TWO_OPT_SWAP, DISTANCE_CALCULATION,
DISTANCE_CALCULATION_DEPOT_STOP, DISTANCE_CALCULATION_INTERMEDIATE_FINAL and
DISTANCE_CALCULATION_FINAL. In the diagram, child_1_tour is abbreviated as
c_1_t, city_selector_child_1_counter_1/2 as c_s_c_1_c_1/2 and
child_1_swap_counter as c_1_s_c.]

Figure 32: Second part of the Finite State Machine of the Processing Node that is
in charge of the 2-opt heuristic and distance calculation for the child individuals.
The waiting states for the square-root calculation are not included.

6.4.4 Square-root calculator

The square-root calculator, which is used for the calculation of the Euclidean
fitness values, was adopted as an external IP core from [37] and modified in some
details to fit the communication standards and design decisions of the Processing
Node architecture. The module as finally used in the Processing Nodes has the
following I/O ports:
• clk: The root calculator receives the same system-wide clock signal as all
other parts of the design.
• start: This signal has to be set to start a calculation. Normally, this bit is
asserted at the same time as the input value is applied to the bus “rad”.
• busy: During the calculation, this signal is held high by the calculator. It
only returns to 0 once the calculation is over.
• valid: As a counterpart to busy, this bit is set high once the calculation is
finished and the final results can be read from the buses “root” and “rem”.
• rad: This bus transmits the radicand to the square-root calculator and
therefore carries the input value of the calculation. The bit width of the
radicand, and thus of the buses used for input and output communication, is set
via parameters when the hardware modules are instantiated.
• root: This bus transmits the actual root and therefore the main result of the
square-root calculation.
• rem: In addition to the root, the remainder of the square-root calculation is
transmitted via this bus. This value is unused in the Processing Node design.
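
The relationship between the buses “rad”, “root” and “rem” can be illustrated
with a bit-serial integer square root. The internal algorithm of the IP core
from [37] is not described here, so the digit-by-digit (restoring) method below
is an assumption, as are the function name and the `width` parameter; only the
port names are taken from the list above.

```python
from typing import Tuple


def sqrt_with_remainder(rad: int, width: int) -> Tuple[int, int]:
    """Integer square root of a `width`-bit radicand, returned as (root, rem).

    One result bit is produced per loop iteration, similar to a multi-cycle
    hardware calculator that keeps `busy` high until the last bit is done.
    """
    if width % 2:
        width += 1  # process the radicand in pairs of bits
    if not 0 <= rad < (1 << width):
        raise ValueError("radicand does not fit into the given bit width")
    root = 0
    rem = 0
    for pair in range(width // 2 - 1, -1, -1):
        # Bring down the next two bits of the radicand.
        rem = (rem << 2) | ((rad >> (2 * pair)) & 0b11)
        trial = (root << 2) | 1  # value to subtract if the next root bit is 1
        root <<= 1
        if rem >= trial:
            rem -= trial
            root |= 1
    return root, rem
```

For every valid input, the results satisfy rad = root * root + rem, which is the
property that makes the otherwise unused remainder output well defined.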
6.4.5 Evolution of the ProcessingNode-Design during development

During the development process, multiple design changes were made, especially
regarding the usage of the LUT-based register files as storage for the parent and
child individuals in the module. As explained before, a major drawback of this
design decision is the relatively large resource footprint. By reducing the degree
of parallelism in accessing those large register files, the utilization of FPGA
resources could be improved at the expense of calculation speed (see Chapter 8
for the specific numbers). The design stages in the development process of the
Processing Nodes were the following:
• Design #1: In the initial design, all accesses to the register files were
parallelized to the maximum possible extent. This included in particular the
following critical points (Figure 33):
1. In the first step of the crossover process, all nodes of the selected part
of one parent were passed on to the child at once, thus in only one clock
cycle.
2. In the second step of the crossover process, the comparison of nodes to
be inserted in the remaining empty spots of the child was also done in
combinational logic and in parallel in only one clock cycle.
3. The reversal of nodes in a specified section of the newly generated
child individuals was done in only one clock cycle and was therefore
completely parallelized.
4. The incoming 72 bits of data from the population memory were loaded
into the parent individuals completely in parallel and thus in only one
clock cycle.
5. Analogously, for the output transmission, 72 bits at once were always read
from the child individuals and sent via the outgoing data bus.
• Design #2: The next design iteration omitted the parallel comparison of the
nodes to be inserted in the empty spots of the child individual. Instead, only
one comparison per clock cycle was performed. This change heavily impacts the
duration of each calculation process, since such a comparison process can
easily last a few hundred clock cycles instead of only one for larger
problems.
• Design #3: The third design kept the changes of the second iteration and
added the serialization of the copying process of the first nodes from the
parent to the child.
• Design #4: Again, all former changes were kept, while the mutation process
was additionally completely serialized. This means that only two nodes changed
place in one clock cycle instead of all nodes to be reversed at once.
• Design #5: Adding to the already existing changes, the initial input reading
of the incoming memory lines to the parent register files was organized in
node-wise steps.
• Design #6: In the sixth design iteration, this principle was also applied
to the output reading of values from the children register files to the
transmission bus.
• Design #7: Based on the test results with respect to the hardware
utilization after synthesis and implementation, the final design iteration, as
presented extensively in this thesis, features all changes to the original
implementation except for those introduced in Design #2 and Design #6.
Thus, the comparison of nodes to be inserted with all the nodes already
present in the child is parallelized, as is the output transmission process
from the children register files to the output bus.

[Figure 33: parent register files P1.1 to P1.6 and P2.1 to P2.6 fed from
pop_data_i, and a child register file C1.1 to C1.4 driving pop_data_o. The
marked critical points are: (1) copying of nodes to a child, (2) insertion of
rest nodes, (3) reversal of nodes during mutation, (4) copying nodes from the
input bus, (5) transmitting nodes via the output bus.]

Figure 33: Visualization of the critical points of the different design iterations for
the Processing Nodes.
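
The cost of serializing the comparison step (Design #2) can be made tangible
with a small software model. The following Python sketch is an illustration
only: the function name, the fill order taken from the second parent, and the
cycle accounting (one comparison per clock cycle with early exit in the serial
case, one cycle per candidate when all comparisons run in parallel) are
assumptions and are not derived from the hardware description. It also assumes
that the selected chunk does not wrap around the end of the parent.

```python
from typing import List, Tuple


def fill_child(parent_1: List[int], parent_2: List[int],
               crossover_point: int, crossover_length: int
               ) -> Tuple[List[int], int, int]:
    """Build a child and count comparison cycles for both design variants.

    Returns (child, serial_cycles, parallel_cycles).
    """
    # Step 1: copy the selected chunk of parent 1 to the front of the child.
    child = parent_1[crossover_point:crossover_point + crossover_length]
    serial_cycles = 0
    parallel_cycles = 0
    # Step 2: insert the remaining nodes in the order given by parent 2.
    for candidate in parent_2:
        parallel_cycles += 1  # all comparisons of one candidate in one cycle
        already_present = False
        for node in child:
            serial_cycles += 1  # one comparison per clock cycle
            if node == candidate:
                already_present = True
                break
        if not already_present:
            child.append(candidate)
    return child, serial_cycles, parallel_cycles
```

Even for a toy problem with six nodes, the serial variant in this model needs
three times as many cycles as the parallel one, and the gap grows with the
number of nodes already present in the child.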

6.5 Design for Experimentation - the CVRPWrapper
As explained before, the whole algorithm as implemented in the hardware
design runs for the predefined number of runs and then transmits the best
result found via the UART connection to the host PC. While one can easily
imagine using the FPGA in exactly this way for real-world routing applications,
a single execution of the algorithm with a single result does not match the
experimental design that was set up to compare the three computational
platforms: to observe the behaviour of CPU, GPU and FPGA in terms of both time
and quality, multiple “intermediate” results are needed that show how the
algorithmic outcome develops over time.
To automatically receive such results from the FPGA, a special property of the
design was used: whenever the “reset” signal of the main controller is set high
and then released again, all internal registers of the design are set to their
initial default values - with the exception of the content of the population
memory. Thus, the algorithm restarts for another execution of the predefined
number of crossovers, but starts with the already optimized population
inherited from previous runs.
For that, the CVRPWrapper module (Figure 34) senses via a dedicated signal
when the central controller has finished the UART transmission. The “reset”
signal of the controller is then held high for a certain period of time, during
which the host PC can process the received data stream. After that, the Wrapper
restarts the central controller by releasing the “reset” signal.
Another benefit of this centralized design is that all parameters required
for tuning the algorithm - that is, for optimizing the algorithmic behaviour
for the Capacitated Vehicle Routing Problem in general and for selected
problem sets in particular - need to be set only once, in this Wrapper module.

[Figure 34: block diagram of the CVRPWrapper with its Initialization block,
which is connected to the Central Controller via rst_i and the parameters.]

Figure 34: Design diagram for the wrapper module and its connections to the main
controller of the hardware architecture

The relevant parameters to be set before an experiment are the following:
• pNumInitRuns: The number of runs that are performed with multiple
mutation processes at the beginning of each algorithmic execution.
• pNumRuns: The overall number of runs that are performed during a full
algorithmic execution.
• pPopSize: The size of the population, which is crucial for maintaining
beneficial diversity. Experiments on the GPU found that a 20n population,
hence a population with 20 times the number of nodes in the problem, would
contribute to a good result of the GA. However, the maximum population
size is also limited by the available hardware resources.
• pMutationProbabilityThreshold: This parameter determines the probability
with which mutations are performed in non-init crossover processes.
The actual probability can be calculated as
$\mathit{Probability}_{\mathit{Mutation}} = \frac{\mathit{pMutationProbabilityThreshold}}{2^{10}}$.
• pWildChildProbabilityThreshold: This parameter determines the probability
with which WildChild processes, i.e. the propagation of children back
to the main population without comparison of their individual fitness, are
enforced. The actual probability can be calculated as
$\mathit{Probability}_{\mathit{WildChild}} = \frac{\mathit{pWildChildProbabilityThreshold}}{2^{10}}$.
• pWildChildLimit: Determines the latest crossover process up to which a
WildChild behaviour might be enforced. After that limit, all crossovers
require proper fitness comparisons of children and parents.
• pSizeElitistList: Determines the size of the Elitist list in the algorithm.
Experiments on the GPU revealed that a 5% elitism, thus an elitist list
with 5% of all individuals from the population, gives promising results.
• pElitistSelectionProbabilityThreshold: This parameter determines the
probability with which individuals from the Elitist list, instead of
individuals selected from the general population, are chosen as potential
parents for a crossover. The actual probability can be calculated as
$\mathit{Probability}_{\mathit{ElitistSelection}} = \frac{\mathit{pElitistSelectionProbabilityThreshold}}{2^{14}}$.

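
The threshold parameters translate into probabilities by a simple division. The
following Python sketch illustrates this relationship; the function names are
hypothetical, and the assumption that an event fires when a uniformly
distributed random number is smaller than the threshold is not stated in the
text. The sizing helper merely restates the 20n and 5% guidelines from the GPU
experiments and ignores the hardware resource limit.

```python
from typing import Tuple


def probability_from_threshold(threshold: int, random_bits: int) -> float:
    """Probability that corresponds to a threshold for an n-bit random number."""
    if not 0 <= threshold <= 2 ** random_bits:
        raise ValueError("threshold outside the range of the random number")
    return threshold / 2 ** random_bits


def event_fires(random_value: int, threshold: int) -> bool:
    """Assumed comparison: the event fires if the random value is below the threshold."""
    return random_value < threshold


def suggested_sizes(num_nodes: int) -> Tuple[int, int]:
    """Population size (20n) and elitist list size (5% of the population)."""
    population_size = 20 * num_nodes
    elitist_size = population_size * 5 // 100
    return population_size, elitist_size
```

For example, a mutation threshold of 256 with a 10-bit random number corresponds
to a probability of 0.25, and the same probability for the elitist selection
with its 14-bit range requires a threshold of 4096.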

CHAPTER 7
Workflow of Design and Testing
7.1 Selection of an FPGA device
Bearing in mind the architectural design idea and the need to scale the
various components of the hardware to benchmark problems with up to a few
thousand cities, it was clear that one of the most capable FPGA chips
available on the market had to be chosen for this project.
Furthermore, the initial design draft still included High-Bandwidth Memory as
a key component for the population memory, which therefore also became a
decisive point for the selection of a suitable hardware platform. Given those
circumstances, the choice could be narrowed down to the respective top-tier FPGAs
of the two dominant vendors in this part of the semiconductor business: on the
one hand, the Intel Stratix 10 MX 2100 chip included in the DK-DEV-1SMX-H-A
development board by Intel-Altera [38], and on the other hand the Xilinx Virtex
UltraScale+ XCVU37P-L2FSVH2892E chip embedded in the Virtex UltraScale+
HBM VCU128 FPGA Evaluation Kit by Xilinx [39], which was recently acquired
by AMD. Both devices are largely comparable in their technical specifications
(Table 2).
At first glance, twice the available HBM memory speaks for the Intel board.
However, early calculations for the test problems to be solved with the board
showed that even for the largest benchmark tests the FPGA would not utilize
that amount of memory. Instead, it turned out very quickly that the limiting
factor in hardware would be the number of available logic cells, or rather
look-up tables, which are used for constructing the actual hardware structure
needed for calculations. Since the Xilinx chip has a 37.6% advantage in this
category and also provides 42.3% more BRAM as a back-up solution for a
population memory, it was finally easy to justify the decision for this FPGA.

                        Intel Stratix 10 MX 2100 FPGA    Xilinx Virtex UltraScale+ XCVU37P
Included Logic Cells    2073000                          2852000
Available HBM           16 GB                            8 GB
Available BRAM          239.5 Mb                         340.9 Mb
DSP-Slices              3960                             9024
I/O-Ports               656                              624

Table 2: Comparison of the technical specifications of the high-performance
FPGAs by Intel-Altera and Xilinx
7.2 Hardware development toolchain
The decision for one of the development boards did not only mean a decision
for a hardware platform, but also had a significant impact on the development
toolchain, since both Intel and Xilinx include a license for their respective
Integrated Development Environment (IDE) for hardware implementation with the
chips. Therefore, Xilinx’s Vivado ML 2021.2 was used as the development platform
for all hardware-related tasks. This software package integrates all tools
required for FPGA implementations, in particular a text editor, an interpreter
and simulator for all modern Hardware Description Languages (HDLs), and a
combined tool chain of synthesizer and implementation software that takes the
specified HDL modules and converts them into a bitstream containing all the
information required for the configuration and set-up of the FPGA.
For reasons of practicality and pre-existing knowledge, SystemVerilog was chosen
as the main HDL for all newly designed modules, especially the central
controller and the ProcessingNodes. Some parts of the integrated IP cores were
written in classic Verilog, which is no problem due to the cross-compatibility of
both languages and the ability of the Vivado HDL interpreter to combine both
versions of the HDL.
The development process was executed as recommended for any hardware-focused
project, but with several deviations due to the differences between the
implementation of a single Genetic Algorithm and - for example - a complete CPU
design. The most important steps were the following:
1. After the theoretical design of the various submodules, they were
implemented in SystemVerilog, beginning with the ProcessingNodes as the cores
of the micro algorithm.
2. To test the functionality of the individual components, simple testbenches
were created as classic Verilog modules:
• Due to the central importance of the ProcessingNodes, they were again
tested first. Since there are no true “edge cases” or different operators
that alter the functionality of those modules, it was sufficient to inspect
the outcome of the internal data manipulation with an exemplary set
of two individuals. At this point, very short examples with only a few
nodes were used to allow for manual checking of calculated values such
as the fitness or the depot placement based on the summation of demands.
• For isolated testing of the main controller, it was necessary to write
so-called “stubs” of the essential submodules that are connected to
it, especially of the ProcessingNode, as long as this module itself was
not completely validated. Those “lightweight” implementations do not
feature the full internal data flow of the real modules, but show exactly
the same I/O behaviour in terms of following the defined communication
protocols. This made it possible to examine the correct functionality
of the control flow in the main controller. Furthermore, the very first
tests of this module used simple register files as population memory because
they are simple to read and write.
3. Once the correctness of the isolated modules had been verified, an integrated
test run was started, in which the stubs in the main controller were replaced
by the real implementations of the ProcessingNodes as well as a BRAM-based
population memory. At this point, the “testbench” was only used as a
clock and reset generator and to instantiate the main controller with
all required parameters, since this main module can be seen as a stand-alone
implementation of an algorithm that does not need any external input.
Instead, the population memory was pre-filled with the starting
population, which allows the main module to start the macro algorithm, which
automatically includes the processing of individuals in the ProcessingNodes.
This procedure turned out to be valuable not only for validating the correct
functionality of the design, but also for running actual benchmark problems
as long as some remaining problems with the UART communication were not
solved. Due to the cycle-exact simulation, it was no problem to get an
exact value for the calculation time as well as all interim results. However,
the execution of such a large-scale simulation is very time- and
resource-consuming, which meant that even the simulation of only a few thousand
crossover operations on very powerful server hardware would require several
hours.
4. The final stage of testing is the normal operation of the implemented
digital design as originally planned: since the hardware is customized for each
benchmark problem, it requires adaptations of the parameters as well as the
initial set-up of the population memory before synthesis and implementation
can be started. Depending on the size of the problem and therefore also of the
design, this process takes some hours. Once it has finished, the bitstream can
be loaded onto the FPGA, which then performs the pre-defined algorithms and
transmits the results to the workstation for further evaluation.
7.3 Benchmark problems and software tools for adaptation
Keeping in mind the main goal of this work, namely to compare the computational
performance of GPU, CPU and FPGA, it is of utmost importance to agree on
common benchmark problems which can be executed on all platforms to provide
statistical data on processing time and solution quality. Since the exploration
of Vehicle Routing Problems has a long tradition in both systems and computer
engineering, a wide range of such problems is available online [40]. The size of
such benchmark problems ranges from 12 up to 30000 cities in a map.
7.3.1 Composition of a benchmark problem

For each benchmark problem, two files in a common format are provided, which can
be included in the processing tool chain of the Genetic Algorithm:
• The first and most important file is named problem_name.vrp and contains all
information about the problem map. The header of the file contains lines with
the dimension of the problem (“DIMENSION: #”), i.e. the number of cities in the
problem including the depot, and the capacity of the delivery truck
(“CAPACITY: #”). This initial information is followed by the
“NODE_COORD_SECTION”, where the coordinates of each node including the depot
are given. Each line consists of a problem-specific node-ID in ascending order,
followed by the x- and y-coordinates, with the values separated by a single
space. In general, the coordinates are chosen from ℝ - they can have decimal
points and be negative. The next part of this file is the “DEMAND_SECTION”,
where each line again holds the node-ID together with the demand value, chosen
from ℕ. One can easily identify the depot node with node-ID 1 by its demand
value of 0.
• The second file for each problem is named problem_name.opt and contains the
best known routing - thus the solution with the best known fitness value - for
the specific problem. This “optimal” routing is not available for all problems,
since research is still ongoing, especially for larger problems. Smaller maps
might have been evaluated with exact, but very time-consuming methods of
solution finding.
Within this file, each line represents a single subtour of the optimal routing
and therefore begins with “Route #(num of line):”, followed by the
node-IDs of the cities in this subtour in the right order. One has to note
that the node-IDs in this file are shifted by one compared to the original .vrp
problem statement: in the .vrp file, the first non-depot city has the
node-ID 2, but in the .opt file, it is shifted to 1. The last line of the file
holds the overall fitness of the stated solution.
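
A reader for the two files can be sketched in a few lines of Python. This is a
minimal illustration and not the parser used in this work: the function names
are hypothetical, headers are accepted with or without a space before the
colon, any keyword line other than the two named sections is assumed to end
the current section, and the fitness is assumed to be the last
whitespace-separated token of the final line of the .opt file.

```python
from typing import Dict, List, Optional, Tuple


def parse_vrp(text: str):
    """Return (dimension, capacity, coords, demands) from the content of a .vrp file."""
    dimension: Optional[int] = None
    capacity: Optional[int] = None
    coords: Dict[int, Tuple[float, float]] = {}
    demands: Dict[int, int] = {}
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("DIMENSION"):
            dimension = int(line.split(":")[1])
        elif line.startswith("CAPACITY"):
            capacity = int(line.split(":")[1])
        elif line.startswith("NODE_COORD_SECTION"):
            section = "coords"
        elif line.startswith("DEMAND_SECTION"):
            section = "demands"
        elif section is not None and line[0].isdigit():
            fields = line.split()
            node_id = int(fields[0])
            if section == "coords":
                coords[node_id] = (float(fields[1]), float(fields[2]))
            else:
                demands[node_id] = int(fields[1])
        else:
            section = None  # any other keyword line ends the current section
    return dimension, capacity, coords, demands


def parse_opt(text: str) -> Tuple[List[List[int]], Optional[float]]:
    """Return (routes, fitness); node-IDs are shifted back to the .vrp numbering."""
    routes: List[List[int]] = []
    fitness: Optional[float] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("Route"):
            ids = line.split(":", 1)[1].split()
            routes.append([int(node_id) + 1 for node_id in ids])
        else:
            fitness = float(line.split()[-1])
    return routes, fitness
```

Adding one to every node-ID in `parse_opt` undoes the shift between the two
files, so that the routes can be looked up directly in the coordinate and
demand dictionaries.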
7.3.2 Software-toolchain for problem adaptation

While the software-based implementations of the Genetic Algorithm on
GPU and CPU can simply read in the .vrp file as the starting point of
their processing flow, this is generally not possible for the hardware
implementation on an FPGA. Instead, the information from the problem file is
needed to adapt the hardware design before synthesis and implementation can
take place. Therefore, it makes sense to use a software-based
evaluation of the files to extract essential information for the hardware
set-up, such as the bit widths of various processing elements and, most
importantly, the initial configuration of the population memory.
The same is true for the evaluation of the final processing outcome of the
algorithm: while CPU and GPU are always part of a conventional computing setup
and can therefore directly print their results, including the best fitness
value found, on the screen, the FPGA development board is more or less a
stand-alone computing device, which does not directly drive a screen to produce
human-readable output. Instead, the final results are sent to a workstation via
a serial UART connection. Again, it makes sense to read and interpret this
binary output with a dedicated software suite. In a last step, it is very
convenient to be able to use another software tool to produce visualizations of
the best routing found, so that one can check the actual composition of the
subtours. Given a series of such visualizations over time, it is also possible
to identify the different optimization steps performed by the Genetic
Algorithm.
Problem-based hardware generator
The most important software tool for the FPGA-based processing of CVRP
benchmark problems is the hardware generator. It reads the .vrp and .opt
files to extract the necessary information about the nodes, their coordinates
and demands, processes it and generates the information that is necessary to
adjust the Verilog files of the hardware design before synthesis and
implementation of a design, which is then completely adapted to the specific
problem, can be started.
From the .vrp file, the code, which was written in Python for reasons of
simplicity and availability of functional libraries, reads the list of demands
and coordinates as well as the position of the depot stop (Figure 35).
To meet the specific properties of the hardware implementation, the coordinates
in particular then have to undergo a processing tool chain: while coordinates
are chosen from ℝ and can thus be negative decimal numbers, the decision was
made to only use positive integers in the hardware-based calculations to
simplify the whole process and save valuable hardware resources. Therefore, a
three-stage process has to be applied (Figure 36):
• First, the list of coordinates is scanned for negative numbers and, if any are
found, for the smallest x- and y-coordinates. The absolute values of these numbers
are then used as offsets, which are added to all coordinates. This mechanism ensures
that the entire group is moved to the first quadrant of the coordinate system, where
all coordinates are either positive or zero.
• In the next step, the coordinates are scaled by multiplying them with a power of
ten to turn them into integers. However, the scaling factor directly impacts the
absolute value of the coordinates and thus also the number of bits that is later
necessary to represent them in hardware. As a consequence, the smallest scaling
factor is searched for which all nodes with scaled-up coordinates have a unique
location. This might lead to inaccuracies in comparison to the GPU-based
implementation, which directly uses floating-point numbers for the representation of
coordinates, but is an acceptable trade-off for less hardware usage and therefore
higher efficiency.
• Within all scaled coordinates, the largest single number is searched to
calculate the number of bits necessary to represent this value. The same happens
for the list of demands. By adding the depot-indicating bit, one can therefore
directly determine the specific data representation for the given benchmark problem.
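The three stages can be sketched in a few lines of Python. This is a minimal sketch under
assumptions, not the code of the actual script: the function names are chosen for this
example, the scaling factor is passed in instead of being searched, and the coordinates
are given as strings so that Decimal arithmetic avoids floating-point rounding. The input
is the EX-n10-k1 example of Figure 36 with its scaling factor of 10.

```python
from decimal import Decimal


def shift_and_scale(coords, factor):
    """Shift coordinates into the first quadrant, then scale them to integers.

    coords: list of (x, y) strings, depot first.
    factor: power of ten used for scaling (given here, not searched).
    """
    xs = [Decimal(x) for x, _ in coords]
    ys = [Decimal(y) for _, y in coords]
    # Offset = absolute value of the smallest coordinate if it is negative.
    off_x = max(Decimal(0), -min(xs))
    off_y = max(Decimal(0), -min(ys))
    return [(int((x + off_x) * factor), int((y + off_y) * factor))
            for x, y in zip(xs, ys)]


def node_bit_width(scaled, demands):
    """Bits per coordinate, bits per demand, bits per node (incl. depot bit)."""
    coord_bits = max(max(x, y) for x, y in scaled).bit_length()
    demand_bits = max(demands).bit_length()
    return coord_bits, demand_bits, 2 * coord_bits + demand_bits + 1


coords = [("0.0", "0.0"), ("5.4", "-2.8"), ("-6.2", "-3.7"), ("3.5", "2.3"),
          ("1.1", "-4.7"), ("-5.9", "3.4"), ("-2.0", "2.8"), ("5.2", "6.3"),
          ("-2.5", "-2.0"), ("0.9", "6.0")]
demands = [0, 5, 3, 11, 7, 2, 13, 12, 4, 1]

scaled = shift_and_scale(coords, factor=10)
widths = node_bit_width(scaled, demands)   # (7, 4, 19) for this example
```

For this input, the depot ends up at (62, 47) and each node needs 7 + 7 + 4 + 1 = 19 bits,
which matches the header values listed in Figure 36.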
With this information, a “problem_name_processed.txt” file is created, which
mainly consists of two parts: The header of the text file contains the various
parameters that need to be inserted into the Verilog modules to adapt them to the
benchmark problem. The second part of the file is Verilog code itself: it contains
the declaration of all cities as localparams, as well as a for-loop that
pre-initializes the population memory. Those lines just need to be copied to the
respective BRAM module.

[Figure 35: flow diagram. From problem_name.vrp, the demand list, the depot position and
the coordinate list (first shifted, then scaled) are extracted. From problem_name.opt, the
assignment of all nodes to their respective subtours in the best known routing is read
and the optimal fitness is calculated. The outputs are problem_name_processed.txt (number
of nodes, truck capacity, bit widths of coordinates and demands, optimal fitness, and the
memory configuration with localparams and Verilog code of all nodes) and
problem_name_coordinate_list.txt (list of the coordinates of all nodes in the problem).]
Figure 35: Processing flow in the “main_float.py” Python script: Based on
the problem file and the optimal tour, the basic parameters and the memory
configuration file for the hardware adaption as well as a coordinate list for later
visualization are created.
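The node declarations in the second part of the file follow the bit layout given in
Figure 36 (x-coordinate, y-coordinate, demand, depot-stop bit, from most to least
significant bit). A minimal sketch of how such a line could be generated is shown below;
the helper names encode_node and localparam_line are assumptions made for this example.

```python
def encode_node(x, y, demand, depot_stop, coord_bits, demand_bits):
    """Pack one node as X | Y | demand | depot-stop bit (MSB to LSB)."""
    return (format(x, f"0{coord_bits}b") + format(y, f"0{coord_bits}b")
            + format(demand, f"0{demand_bits}b") + str(int(depot_stop)))


def localparam_line(x, y, demand, coord_bits, demand_bits):
    """Build one Verilog localparam line in the style of Figure 36."""
    bits = encode_node(x, y, demand, 0, coord_bits, demand_bits)
    return f"localparam node_{x}_{y}_{demand} = {len(bits)}'b{bits};"


# Node 2 of the EX-n10-k1 example: scaled position (116, 19), demand 5
line = localparam_line(116, 19, 5, coord_bits=7, demand_bits=4)
```

For node 2 of the example, this reproduces the entry
localparam node_116_19_5 = 19'b1110100001001101010; from Figure 36.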
At the same time, the subtour ordering from the .opt file is used to calculate the
Euclidean fitness of this optimal routing, which is also part of the parameters for
the hardware modules.
Furthermore, a list of the adjusted coordinates is printed to the
“problem_name_coordinate_list.txt” file, which can later be used for visualization.

[Figure 36: worked example for the problem EX-n10-k1, reconstructed from the figure panels.]

Input file EX-n10-k1.vrp (NAME: EX-n10-k1, TYPE: CVRP, DIMENSION: 10, CAPACITY: 25,
depot = node 1):

Node    original x / y    shifted x / y    scaled x / y    demand
1 (D)     0.0 /  0.0        6.2 /  4.7       62 /  47         0
2         5.4 / -2.8       11.6 /  1.9      116 /  19         5
3        -6.2 / -3.7        0.0 /  1.0        0 /  10         3
4         3.5 /  2.3        9.7 /  7.0       97 /  70        11
5         1.1 / -4.7        7.2 /  0.0       72 /   0         7
6        -5.9 /  3.4        0.3 /  8.1        3 /  81         2
7        -2.0 /  2.8        4.2 /  7.5       42 /  75        13
8         5.2 /  6.3       11.4 / 11.0      114 / 110        12
9        -2.5 / -2.0        3.7 /  2.7       37 /  27         4
10        0.9 /  6.0        7.1 / 10.7       71 / 107         1

Coordinate shift: x-coordinates +6.2, y-coordinates +4.7 (absolute values of the smallest
x- and y-coordinate).
Coordinate scaling: multiply each coordinate by 10 to lose the decimal.
Calculation of bit widths: largest x-coordinate 116 (7 bit), largest y-coordinate 110
(7 bit), largest demand 13 (4 bit).
Node representation: XXXXXXX YYYYYYY DDDD S (7 bit + 7 bit + 4 bit + 1 bit = 19 bits per
node).

Output file EX-n10-k1_processed.txt:

EX-n10-k1
Number of cities: 9
Bits per coordinate: 7
Bits per demand: 4
Bits per node: 19
Depot X-coord: 62
Depot Y-coord: 47
Capacity of the vehicle: 25
Optimal fitness due to own metrics: 9457
Optimal fitness due to euclidean metrics: 349
localparam initial_fitness = 19'b1111111111111111111;
localparam node_116_19_5   = 19'b1110100001001101010;
localparam node_0_10_3     = 19'b0000000000101000110;
localparam node_97_70_11   = 19'b1100001100011010110;
localparam node_72_0_7     = 19'b1001000000000001110;
localparam node_3_81_2     = 19'b0000011101000100100;
localparam node_42_75_13   = 19'b0101010100101111010;
localparam node_114_110_12 = 19'b1110010110111011000;
localparam node_37_27_4    = 19'b0100101001101101000;
localparam node_71_107_1   = 19'b1000111110101100010;

Figure 36: Example of the coordinate-adaption mechanism in the “main_float.py” Python
script: Coordinates are shifted and scaled to integers. Then, the final data
representation can be calculated.

UART-reader script
As explained before, the UART module implemented in hardware transmits
the current value of the clock counter as well as the memory lines belonging to the
best solution found by the algorithm once the pre-defined number of crossover
processes is reached. Therefore, suitable software has to fulfill two tasks
(Figure 37):
• First of all, it has to control the serial I/O communication of the computer
to receive the transmitted bits. Here, Python proves its strengths as the
language of choice: The publicly available library “serial” includes the class
“serial.Serial”, which allows setting up a serial communication with essential
parameters such as the baud rate, the size of each transmitted byte and
communication-specific details such as stop bits. The received bits can then
simply be appended to a binary string, which is processed locally.
• The main task is then to interpret the raw data received from the FPGA.
Therefore, the receiver software has to know the composition of an individual
for the specific benchmark problem under test. For this purpose, the software
first reads the “problem_name_processed.txt” file and extracts the included
information on the number of cities and the bit widths of single nodes,
coordinates and demand sections. Thus, lists of all nodes, the positions of
depot stops and the demands can be created. To check the feasibility of the
created solution, the software then checks for duplicate and missing cities,
calculates the Euclidean fitness value with software-based functions and
compares it to the transmitted fitness value. Furthermore, the software uses the
received data to generate a “problem_name_own_best.opt” file, in which each line
represents a single subtour of the overall routing and each node is given by its
x- and y-coordinate.
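The interpretation of the received bits can be sketched as follows. This is a minimal
sketch under assumptions: the field order (clock counter, fitness, then the nodes) and the
per-node layout follow Figure 37, whereas the widths of the clock counter and fitness
fields (32 and 16 bits here) and all function names are hypothetical. The rule that a set
depot-stop bit closes a subtour is inferred from the example in Figure 37.

```python
def decode_stream(bits, counter_bits, fitness_bits, coord_bits, demand_bits):
    """Split a received bit string into clock counter, fitness and nodes.

    counter_bits and fitness_bits are hypothetical parameters; the real
    widths depend on the hardware configuration.
    """
    pos = 0

    def take(width):
        nonlocal pos
        value = int(bits[pos:pos + width], 2)
        pos += width
        return value

    clock_counter = take(counter_bits)
    fitness = take(fitness_bits)
    node_bits = 2 * coord_bits + demand_bits + 1
    nodes = []
    while pos + node_bits <= len(bits):
        # field order per node: x, y, demand, depot-stop bit
        nodes.append((take(coord_bits), take(coord_bits),
                      take(demand_bits), take(1)))
    return clock_counter, fitness, nodes


def split_subtours(nodes):
    """Group nodes into subtours; a set depot-stop bit closes a subtour."""
    subtours, current = [], []
    for x, y, demand, depot_stop in nodes:
        current.append((x, y, demand))
        if depot_stop:
            subtours.append(current)
            current = []
    if current:
        subtours.append(current)
    return subtours


# Node values of Figure 37, re-encoded with 5-bit coordinates and 4-bit demands
fig37 = [(13, 7, 4, 0), (16, 8, 2, 0), (4, 23, 5, 0), (7, 8, 1, 1),
         (9, 16, 9, 0), (23, 17, 8, 0), (5, 15, 11, 1)]
payload = "".join(f"{x:05b}{y:05b}{d:04b}{s}" for x, y, d, s in fig37)
stream = f"{1234:032b}{567:016b}" + payload   # made-up counter and fitness values
counter, fitness, nodes = decode_stream(stream, 32, 16, 5, 4)
subtours = split_subtours(nodes)
```

With the node values of Figure 37, the decoded routing consists of two subtours with four
and three nodes, corresponding to the two lines of the .opt file shown there.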
Visualizing tool
During the development cycle, it proved valuable to be able to plot
the actual routing in a coordinate system to see the development of the subtours
and their connections over time, for example to check for possible intersections in
routes, which should be eliminated after applying the 2-opt heuristic. Therefore,
reading in the coordinate list together with the “problem_name_own_best.opt” file
allows plotting all cities by their coordinates and connecting those that are part of
the same subtour with a single-colored line. For this, the Python library
“matplotlib.pyplot” was used.
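A minimal sketch of such a plotting routine is given below. The line format of the .opt
file is taken from the example in Figure 37; the function names, the closing of each
subtour via the depot and the output as an image file are assumptions made for this
example.

```python
def parse_opt_line(line):
    """Turn '13 7; 16 8;' into [(13.0, 7.0), (16.0, 8.0)]."""
    points = []
    for chunk in line.split(";"):
        if chunk.strip():
            x, y = chunk.split()
            points.append((float(x), float(y)))
    return points


def plot_routing(opt_lines, depot, out_path):
    """Draw every subtour as one single-colored line that starts and ends at the depot."""
    import matplotlib
    matplotlib.use("Agg")               # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for line in opt_lines:
        route = [depot] + parse_opt_line(line) + [depot]
        xs, ys = zip(*route)
        ax.plot(xs, ys, marker="o")     # each call uses the next color of the cycle
    ax.plot(depot[0], depot[1], marker="s", color="black")
    ax.set_aspect("equal")
    fig.savefig(out_path)
    plt.close(fig)


subtour = parse_opt_line("13 7; 16 8; 4 23; 7 8;")
```

A call such as plot_routing(["13 7; 16 8; 4 23; 7 8;", "9 16; 23 17; 5 15;"],
depot=(12, 12), out_path="routing.png") would then draw the two subtours of Figure 37;
the depot position used here is made up for illustration.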

[Figure 37: flow diagram. The FPGA sends a UART bitstream to the workstation. Using
problem_name_processed.txt, the stream is split into the fields ClockCounter, Fitness,
Node 1, Node 2, Node 3, ...; each node is split further into X-coord, Y-coord, Demand and
Depot-Stop. In the example shown, the decoded nodes (X, Y, Demand, Depot-Stop) are
(13, 7, 4, 0), (16, 8, 2, 0), (4, 23, 5, 0), (7, 8, 1, 1), (9, 16, 9, 0), (23, 17, 8, 0)
and (5, 15, 11, 1), which are written to problem_name_own_best.opt as the two lines
“13 7; 16 8; 4 23; 7 8;” and “9 16; 23 17; 5 15;”. Three feasibility checks are applied:
1. check for duplicate cities in the coordinate section; 2. check for alignment of
subtours and added node demands in comparison to the truck capacity; 3. calculate the
Euclidean fitness in software to compare it to the hardware-computed number.]
Figure 37: Processing work flow in the UART-reader software: With information
taken from the “problem_name_processed.txt” file, the received bitstream is split
up and interpreted, so that the feasibility of the transmitted route can be checked.
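The three feasibility checks named in Figure 37 could be sketched as follows. This is a
minimal sketch and not the actual reader software: the function name, the data layout and
the tolerance for the fitness comparison are assumptions, and possible rounding in the
hardware fitness calculation is not modelled.

```python
import math


def check_routing(subtours, depot, capacity, expected_nodes, hw_fitness, tol=1.0):
    """Run the three checks on a decoded routing.

    subtours: list of subtours, each a list of (x, y, demand) tuples.
    tol: allowed difference between software and hardware fitness
         (hypothetical; hardware rounding is not modelled here).
    """
    visited = [(x, y) for tour in subtours for x, y, _ in tour]
    # 1. every city exactly once: no duplicates, none missing
    no_duplicates = len(visited) == len(set(visited))
    none_missing = set(visited) == set(expected_nodes)
    # 2. the summed demand of each subtour must fit into one truck
    capacity_ok = all(sum(d for _, _, d in tour) <= capacity for tour in subtours)
    # 3. Euclidean length of depot -> nodes -> depot, summed over all subtours
    fitness = 0.0
    for tour in subtours:
        stops = [depot] + [(x, y) for x, y, _ in tour] + [depot]
        fitness += sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))
    fitness_ok = abs(fitness - hw_fitness) <= tol
    return no_duplicates, none_missing, capacity_ok, fitness_ok, fitness


# Made-up toy routing: two subtours with one city each, depot at the origin
result = check_routing([[(3, 4, 5)], [(6, 8, 4)]], depot=(0, 0), capacity=5,
                       expected_nodes=[(3, 4), (6, 8)], hw_fitness=30)
```

In the toy example, each subtour is a round trip of length 10 and 20, respectively, so the
software-calculated fitness is 30 and all four flags are True.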


CHAPTER 8
Hardware Utilization
As is always the case for such hardware-related projects, the utilization of the
available hardware resources of the FPGA platform is an essential question. Two
factors in particular impact the percentage of used hardware elements in this design
and are at the same time optimization milestones of the architecture itself:
• The size of the problems for which the specific design is adapted is the most
crucial factor for the hardware utilization of that design. Given the design
structure, the large register files for the storage of parent and children
individuals in the Processing Nodes, all related counters, connectors and
comparator structures, and the size of the population memory are all heavily
impacted by the number of nodes in the given problem map and the bit width of
each node in that problem. While it is one of the main goals of the whole thesis
project to be able to run the largest possible problems in hardware, the size
itself is a limiting factor for the further hardware development.
• Furthermore, the clock speed also influences the percentage of utilized
hardware, while at the same time the maximum clock frequency is limited by the
complexity of the design. As before, the maximization of the clock frequency
remains one of the important optimization goals of the complete design draft,
since a higher clock speed means faster processing of the algorithm.
To simplify the whole optimization and evaluation process, the decision was made
to give the capability of running larger problems a higher priority than the
optimization for the highest possible clock speed. Instead, a fixed design goal
of a stable clock speed of 100 MHz was set for at least the smaller problems under
investigation. If necessary, a lower frequency is accepted for larger problems.
During the development phase, multiple changes to the architectural design were
made in order to improve the overall hardware utilization. The next sections
describe the various steps towards the final design as described in chapter 6.
8.1 Development stages of the Processing Node
In subsection 6.4.5, various design decisions regarding the degree of parallelism
for the accesses to the register files that hold the parent and children individuals in
the Processing Nodes (Figure 33) were examined. Those changes, made to optimize
the resource footprint of the module, have a crucial impact on the utilization of the
various components of the FPGA: For reasons of simplicity, an early engineering
prototype of the architecture, still with 16 Processing Nodes and a register-file-based
population memory, was used to evaluate the various development steps described
before (Table 3) in terms of their hardware footprint when configured for one of the
smallest official test problems, the 15-node n16 k8 test set. This small and
easy-to-solve problem also had the advantage that it could be run until completion,
that is, until the best known solution was found, so that the impact of the different
designs on the calculation time could be analyzed.

Design       LUT     FF      DSP     # of clock cycles until completion
Design #1    0.57    0.06    0.18    35261
Design #2    0.58    0.07    0.18    51564
Design #3    0.51    0.06    0.18    53332
Design #4    0.43    0.06    0.18    55565
Design #5    0.32    0.06    0.18    71489
Design #6    0.38    0.07    0.18    n/a
Design #7    0.32    0.06    0.18    52495

Table 3: Hardware footprint of a single Processing Node in the various design
steps, configured for the problem set n16 k8. For Look-Up-Tables (LUTs), Flip-Flops
(FFs) and Digital Signal Processing slices (DSPs), the percentage usage
of all available hardware resources of that kind is specified.

One can clearly see that each change away from parallelism towards serialized
processing within the Processing Node also meant a clear increase in the clock
cycles required to run the problem to completion: Especially meaningful is the
step from the original Design #1 to Design #2, which introduced the serialized
comparison of nodes to be inserted into the empty spots of the child individual:
While this step meant a 46.24% increase in clock cycles until completion (from
35261 to 51564), it did not lead to any improvement in hardware utilization; in
fact, one can even observe a slight increase in the usage of Look-Up-Tables (from
0.57 to 0.58). This explains why this specific step towards serialization was not
included in the final design of the Processing Nodes. The same holds for the
serialization of the output reading introduced for Design #6, which directly led to
an 18.75% increase in LUT usage in comparison to the design stage before (from
0.32 to 0.38). In the end, this resulted in a combination of the design stages 2, 3
and 4 for the final Design #7.
[Figure 38: combined chart “Design Development for the Processing Node”. Axes: design
stages (Design #1 to Design #7); clock cycles until completion (0 to 80000); utilized
percentage of all available LUTs (0 to 0.7).]
Figure 38: Diagram of the hardware utilization of the Processing Node in the
various design stages. For Design #6, no data was recorded for the clock cycles.

8.2 Hardware utilization of the overall design for different sizes
Besides the stand-alone optimization of the Processing Node, the overall design in
its final configuration with a BRAM-based population memory and two Processing
Nodes also had to be tested regarding its hardware footprint. In line with the
general selection of the Uchoa benchmarks, the idea was to test the hardware
utilization and the maximum reachable clock speed for some of the problems
of that set, most importantly for the starting problem X-n101-k25 (Table 4).

Problem       Population size   LUT     FF     BRAM    DSP    Maximum clock speed (MHz)
X-n101-k25    20n = 2000         8.09   1.34    9.92   1.17   100
X-n376-k94    3n = 1125         20.97   3.72   20.54   1.15   71.4

Table 4: Percentage hardware utilization and maximum reachable clock speed
for the full architecture with two Processing Nodes and BRAM-based population
memory, configured for two of the Uchoa test problems. Note the different
population sizes for the two configurations as well as the fact that
the X-n376-k94 problem was synthesized with a slightly different architectural
implementation from an older point of development. The latest version of the
hardware would cause a slightly higher hardware usage, especially for the LUTs.
Neither of the synthesis results includes the use of the Elitist list, which would
again increase the utilization of LUTs.

Obviously, the number of nodes in the problem has a defining impact on the
hardware utilization as well as on the maximum reachable clock speed: In each
category of hardware components where an increase in usage was to be expected,
namely Look-Up-Tables, Flip-Flops and BRAM, such an increase can be observed. The
BRAM figures are not directly comparable, however, because the populations of the
Genetic Algorithm differ in size. It is also obvious that the utilized hardware
resources do not scale proportionally with the number of nodes in the problem:
While the number of city nodes grows to 375% of its original value from the first
to the second problem (100 to 375 nodes), the LUT utilization only rises to 259%
(8.09% to 20.97%) and the FF utilization to 278% (1.34% to 3.72%).

Configuration                  Population size   LUT    FF     BRAM   DSP
X-n101-k25 w/o Elitist list    20n = 2000        8.09   1.34   9.92   1.17
X-n101-k25 w/ Elitist list     20n = 2000        9.07   1.45   9.92   1.17

Table 5: Percentage hardware utilization for the full architecture with two
Processing Nodes and BRAM-based population memory, configured for the X-n101-k25
Uchoa problem. Shown is the difference between a design with and without Elitist
list, where the Elitist list holds references to 100 individuals and therefore 5%
of the population.
8.3 Impact of elitism in the main controller
None of the designs discussed so far features elitism in the form of the Elitist
list in the main controller, so it is important to see how this additional hardware
structure impacts the utilization of the available resources: Since the memory
table with references to the population is implemented entirely with register files,
the percentage of utilized Look-Up-Tables increases. As one can see in
Table 5, the utilization for the standard X-n101-k25 problem goes up from 8.09%
to 9.07%, which is a reasonable additional cost given the tuning effects of
the Elitist list. However, this additional use of LUTs depends directly on the
chosen size of the Elitist list and is thus also a question of algorithmic tuning.


8.4 Problem-dependent memory utilization
A critical limitation of the hardware approach is the maximum available memory
space for the stored population of individuals, mainly for two reasons:
• The population size is always relative to the problem size, that is, to the
number of nodes within the problem. Earlier examinations with the GPU-based
algorithms have shown that a 20n population, that is, a population with a number
of individuals equal to twenty times the number of nodes in the problem, is
optimal regarding the population diversity and thus the evolutionary development
of the solution quality. This means that larger problems need a larger
population in terms of the number of individuals.
• At the same time, each individual requires more memory space as the
problem gets larger, for two reasons:
– First of all, the number of nodes per individual is given directly by
the problem size of the adapted benchmark problem.
– Furthermore, each node is likely to need more bits for larger problems.
This is because a higher resolution of coordinates is needed to distinguish
between more coordinates in a map of the same size as for the smaller
problems.
Thus, it is of interest to analyze the memory space utilization for the relevant Uchoa
test problems (Table 6): While the increase in bit width is driven entirely by
the increased demand of each node and therefore the need for more bits to
store this demand (the coordinates stay at 10 bits throughout), the increasing number
of memory lines per individual impacts a crucial parameter of the algorithm: As
visible in Figure 39, the optimal 20n population size can only be implemented for the
small problems up to X-n298-k31 with 297 nodes. For all problems starting with
X-n856-k95, only a suboptimal 1n population size can be realized.

               # of    Bits /  Bits /  Mem lines  % mem util  % mem util  % mem util  % mem util
Problem        nodes   coord   node    / ind.     for 1n      for 3n      for 10n     for 20n
X-n101-k25      100    10      28       51         0.486       1.459       4.865       9.729
X-n148-k46      147    10      29       74         1.038       3.113      10.376      20.752
X-n200-k36      199    10      29      100         1.898       5.694      18.982      37.963
X-n251-k28      250    10      29      126         3.005       9.014      30.046      60.092
X-n298-k31      297    10      30      149         4.221      12.663      42.211      84.421
X-n351-k40      350    10      30      176         5.876      17.627      58.757      /
X-n401-k29      400    10      30      201         7.669      23.007      76.689      /
X-n449-k29      448    10      30      225         9.615      28.844      96.148      /
X-n502-k39      501    10      30      251        11.995      35.984      /           /
X-n548-k50      547    10      31      274        14.296      42.888      /           /
X-n599-k92      598    10      31      300        17.112      51.336      /           /
X-n655-k131     654    10      31      328        20.461      61.383      /           /
X-n701-k44      700    10      31      351        23.436      70.308      /           /
X-n749-k98      748    10      31      396        28.254      84.761      /           /
X-n801-k40      800    10      31      401        30.599      91.798      /           /
X-n856-k95      855    10      31      428        34.905      /           /           /
X-n895-k37      894    10      31      448        38.203      /           /           /
X-n957-k87      956    10      31      479        43.679      /           /           /
X-n1001-k43    1000    10      31      501        47.788      /           /           /

Table 6: Various aspects of the memory usage of the GA-hardware configured for
the relevant problems of the Uchoa test set. A “/” marks population sizes that do
not fit into the population memory.
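The relation behind Table 6 can be sketched as a short calculation: the memory demand
grows with the product of population size and memory lines per individual. The following
Python sketch rests on an assumption that is not stated in this section: the total
capacity of the population memory is set to 2^20 lines, a value chosen only because it
reproduces the percentages of Table 6 to within a few hundredths of a percentage point.
The function names are likewise chosen for this example.

```python
CAPACITY_LINES = 2 ** 20   # assumed total number of population memory lines


def memory_utilization(n_nodes, lines_per_individual, pop_factor):
    """Percentage of the population memory used by a (pop_factor * n) population."""
    used = pop_factor * n_nodes * lines_per_individual
    return 100.0 * used / CAPACITY_LINES


def largest_population(n_nodes, lines_per_individual, factors=(20, 10, 3, 1)):
    """Largest population factor of Table 6 that still fits, or None."""
    for factor in factors:
        if memory_utilization(n_nodes, lines_per_individual, factor) <= 100.0:
            return factor
    return None


# Node counts and memory lines per individual taken from Table 6
util_n101 = memory_utilization(100, 51, 1)    # X-n101-k25, 1n population
best_n298 = largest_population(297, 149)      # X-n298-k31
best_n856 = largest_population(855, 428)      # X-n856-k95
```

Under this assumption, the sketch yields the same limits as the text: a 20n population
still fits for X-n298-k31, while X-n856-k95 only allows a 1n population.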

[Figure 39: chart “Memory-space usage for the population memory of the GA-hardware
configured for various Uchoa-test problems”. Horizontal axis: configurations based on the
various Uchoa problems; vertical axis: percentage usage of the BRAM-based population
memory (0 to 120); series: memory utilization for 1n, 3n, 10n and 20n.]
Figure 39: Diagram of the population memory utilization for a 20n, a 10n, a 3n
and a 1n population for the relevant Uchoa test problems. The usage of
the larger population sizes is limited to the small problems of that set.

CHAPTER 9
Comparison of the algorithmic results of a GA for CVRP on GPU,
CPU and FPGA
With one of the core targets of the project being the comparison between
GPU, CPU and FPGA as computational platforms for the execution of Genetic
Algorithms for the Capacitated Vehicle Routing Problem (CVRP), it is necessary
to determine the cornerstones of an appropriate comparative metric for
those very different devices.
9.1 Design decisions for an appropriate comparison of GPU, CPU and FPGA
Due to the very open nature of the comparative tests of the different
computational platforms, deciding on an appropriate test strategy for the three
platforms turned out to be a crucial point of the scientific work for this project.
The following cornerstones of testing turned out to be useful:
9.1.1 Test design
Comparing the algorithmic structure as implemented on GPU and CPU
(chapter 4) with the design decisions made for the FPGA (chapter 5), one can
easily identify some major differences between those realizations of the Genetic
Algorithm, although all three versions aim to generate optimized routings for the
same benchmark problems:
• GPU and CPU, which share an almost identical algorithmic approach, rely
on the classic generation-based processing with distinct, time-based groups of
individuals and restrictions on the combination of parents, while the FPGA
implements a pooling-based GA, where no such age- or group-related limits
for crossovers of individuals exist.
• Closely related to that difference is the fact that GPU and CPU both
include some form of elitism, which allows very good parents to survive
their own children. Such a mechanism is not included in the FPGA-based
design in order to simplify the memory access mechanism.
Those divergent design decisions have a direct impact on how comparative tests
between the platforms can be performed:
• Former publications regarding the GPU-based algorithm featured
comparisons of different parameter set-ups as well as of the GPU versus
other, already existing software implementations and their performance
metrics. For all of that, an important metric was “time per loop”, that is,
the required calculation time for all crossovers of one generation
with a 20n population. Such an approach can’t be applied to the
FPGA, since it doesn’t include the concept of generations in the first place.
• Furthermore, it doesn’t seem appropriate to require the same algorithm
tuning, that is, the same set of parameters, on all computational
platforms in order to achieve the highest comparability. The most striking
example of why this general rule of comparative test design is not appropriate
for three very different algorithms with different limitations might be the
crucial populationSize: While it is no problem for devices such as GPU and
CPU, which can utilize virtual memory space and memory swaps, to always
run with the optimal number of 20n individuals, chapter 8 shows that
it is physically impossible for the FPGA to work with such populations for
larger problems.
Considering all those points, the decision was made to set up a very lightweight
framework for the testing: Only the common benchmark problems are standardized. Apart
from that, each algorithm is tuned on its own according to its very specific requirements.
In the end, the outputs of the algorithm over time are plotted relative to the best
known result as specified in the .opt files of the benchmarks. Thus, the test is as
close to reality as possible, with “quality of solution over time” as the only and
decisive algorithmic metric.
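The metric can be written down in a few lines. The sketch below assumes the common
definition of the quality gap as the relative difference between the fitness of the
current best routing and the best known fitness; the function name and all numbers are
made up for illustration and are not measurement data.

```python
def quality_gap(fitness, best_known):
    """Relative gap of a routing length to the best known routing length."""
    return (fitness - best_known) / best_known


# Made-up sample points (seconds, fitness) and a made-up best known value
best_known = 1000.0
samples = [(10, 1400.0), (100, 1150.0), (1000, 1055.0)]
curve = [(t, quality_gap(f, best_known)) for t, f in samples]
```

Plotting such a curve for each platform over the computing time gives diagrams of the
kind shown in the following sections.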
9.1.2 Selection of benchmark problems
The task was to find a set of benchmark problems that are available together
with best-known results and cover a range from around 100 to 1000 nodes per map,
since this covers the reasonable design size of the FPGA architecture. Thus, it
seemed logical to select the group of Uchoa benchmarks ([41]): This set includes
benchmark problems in the required sizes, so that the whole range from 100 to 1000
nodes can be covered in steps of roughly 50 nodes. Furthermore, those problems are
“not easy” to solve, since they don’t show a clear arrangement of the nodes in
groups, as is the case for some older benchmark sets (Figure 40). This increases
the complexity of routing and therefore creates more valuable test cases for the
Genetic Algorithms.
[Figure 40: two scatter plots, “Problem Map for M-n101-k10” and “Problem Map for
X-n101-k25”; legend: cities in the problem setting, depot.]
Figure 40: Plots of the problem sets M-n101-k10 and X-n101-k25 (Uchoa benchmark
set). The nodes are visibly grouped in the old M-n101-k10 test map, while the
more modern X-n101-k25 map shows the nodes spread out without such groups.

9.2 Comparison results
The comparison of solution quality over time of the various computational
devices gives the following results:
9.2.1 X-n101-k25 - without elitism and with non-euclidean distance calculation for 2-opt
The results for X-n101-k25 in the first configuration, without an Elitist list and
with non-euclidean distance calculation in 2-opt, are very clear and not
advantageous for the FPGA (Figure 41): Both CPU and GPU reach a quality gap of
around 5-6% to the best known solution, with the GPU delivering a slightly better
and, more importantly, much faster result: After less than 1000 seconds, roughly
17 minutes, the performance peak is reached. At the same point in time, the FPGA
still shows a quality gap of around 25%. Even after around 5500 seconds of running
time, its quality gap is still considerably higher than for the other devices, at
around 15%.

[Figure 41: line chart “X-n101-k25: Comparison of GPU, CPU and FPGA”. Horizontal axis:
computing time in sec (0 to 6000); vertical axis: percentage quality gap to the best
known solution (0 to 0.9); series: GPU, CPU, FPGA.]
Figure 41: Performance of GPU, CPU and FPGA for the X-n101-k25 problem.
Displayed is the quality gap to the best known solution over time.

However, the change in the routing from the first to the last optimization run is
quite obvious (Figure 42): Fewer intersecting routes and routes that “go around”
the depot stop are visible, as one could have expected for an optimization of the
truck delivery route.

[Figure 42: two route plots, each titled “Problem Map for X-n101-k25”, with both axes
ranging from 0 to 1000.]
Figure 42: Routing for X-n101-k25 after the first optimization run with 100,000
crossovers on the FPGA and after the last of such optimization runs.

Figure 43: Comparison of the crossover processes calculated on FPGA, GPU and
CPU in one second for the current X-n101-k25 problem.
Advantages of the FPGA: Raw processing power and frugal computing
Furthermore, the performance difference between GPU and CPU on the one
hand and the FPGA on the other hand cannot be explained by calculation speed:
When analyzing the specific algorithmic processing of the different devices, one
finds that the processing speeds per time unit differ quite drastically (Figure 43):
• GPU: 33600 crossover runs/second
• CPU: 34651 crossover runs/second
• FPGA: 53378 crossover runs/second
Thus, one could argue that the FPGA is roughly 54% faster in raw calculation
speed than the CPU (and roughly 59% faster than the GPU), but achieves much less
improvement per crossover run than the other devices. Therefore, the problem
doesn’t seem to be related to the hardware design, but to the algorithmic
implementation in the FPGA and especially the tuning of the algorithm.


Figure 44: Comparison of the power consumed by FPGA, GPU and CPU for the
current X-n101-k25 problem. Data for the FPGA is taken as an estimation from
the Vivado CAD tool.
At the same time, a closer look at the power consumption of the three
computational platforms reveals another big advantage of the FPGA
(Figure 44): It requires only a small fraction of the electrical power that
GPU and CPU draw for their calculation. While the GPU consumes around
70 Watt for that relatively small problem, the CPU goes up to around 300 Watt.
The Vivado tool, in contrast, gives an estimate of well below 5 Watt for the
FPGA in its current configuration for this 100-node problem. Thus, one could say
that the FPGA is by far superior in the field of frugal computing and therefore the
most power-efficient computational platform. This could certainly play a key role
when it comes to the application of such a hardware-based routing optimization tool
in the field of robotics or autonomous vehicles.
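Combining the approximate power figures with the crossover rates listed above gives a
rough energy cost per crossover run. The following sketch is only a back-of-the-envelope
calculation under assumptions: the power values are the rounded numbers from the text,
the FPGA is entered with its upper bound of 5 Watt, and the metric “energy per crossover”
is introduced here for illustration only.

```python
# Throughput (crossovers/s) and approximate power (W) as quoted in the text;
# the FPGA value of 5 W is an upper bound ("well below 5 Watt").
platforms = {
    "GPU": {"crossovers_per_s": 33600, "power_w": 70.0},
    "CPU": {"crossovers_per_s": 34651, "power_w": 300.0},
    "FPGA": {"crossovers_per_s": 53378, "power_w": 5.0},
}


def millijoule_per_crossover(platform):
    """Energy per crossover in mJ: power divided by throughput."""
    return 1000.0 * platform["power_w"] / platform["crossovers_per_s"]


energy = {name: millijoule_per_crossover(p) for name, p in platforms.items()}
speedup_vs_cpu = (platforms["FPGA"]["crossovers_per_s"]
                  / platforms["CPU"]["crossovers_per_s"])
```

With these inputs, the GPU needs about 2.1 mJ and the CPU about 8.7 mJ per crossover,
while the FPGA stays below 0.1 mJ. The number says nothing about how many crossovers each
platform needs to reach a given solution quality, so it complements the
quality-over-time comparison rather than replacing it.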
9.2.2 X-n101-k25 - with elitism and all-euclidean distance calculations for 2-opt and fitness
The problem of lacking algorithmic efficiency of the FPGA in comparison to GPU
and CPU can be solved if three measures are taken (Figure 45):


• First of all, elitism is applied with an Elitist list that always holds the top
5% of the population to protect very good parents from being erased by
their children.
• Additionally, both the fitness calculation and the 2-opt algorithm have to
use the euclidean distance metric to avoid remaining intersections and
irregularities caused by the squared distance calculations.
• Based on those two design decisions, an appropriate algorithmic tuning has to
be found. The setting of the relevant parameters should aim to maintain high
diversity in the population while forcing evolutionary development towards
better fitness. Two settings in particular turned out to be successful in this
regard:
– Keeping in mind the protection of high-quality parents through
the Elitist list, the WildChildProbabilityThreshold and therefore the
percentage of “wild children” can be increased in comparison to the
first experiments. As a result, “fresh” genes from mutation processes in
the re-transmitted children enter the genetic pool of the population from
time to time, without good parents being lost in the process. Higher
diversity is thus balanced against the need for forced evolutionary
development. In the end, a threshold value of around 100 (and thus a
WildChildProbability of around $\frac{100}{2^{10}} \approx 10\%$)
was applied for this and all following experiments.
– At the same time, the probability for parents to be selected from the
Elitist list instead of the general population was kept low to avoid an
“over-use” of currently promising parents, which could in the long run
again harm the population diversity. For this as well as for the other
problems, a pElitistSelectionProbabilityThreshold of around 400 (and
thus a probability for the selection of such elite parents of approximately
$\frac{400}{2^{14}} \approx 2.5\%$) turned out to be successful.
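The two probabilities follow directly from the ratio of the threshold to the range of the
random number it is compared with. The sketch below assumes that the decision is made by
comparing a uniformly distributed 10-bit or 14-bit random number against the threshold;
this mechanism is inferred from the fractions above and is not confirmed by this section.
The function names are chosen for this example.

```python
import random


def threshold_probability(threshold, bits):
    """Probability that a uniform `bits`-bit random number is below the threshold."""
    return threshold / 2 ** bits


def fires(threshold, bits, rng):
    """One draw of the (assumed) compare-against-threshold mechanism."""
    return rng.getrandbits(bits) < threshold


p_wild = threshold_probability(100, 10)    # 100 / 1024
p_elite = threshold_probability(400, 14)   # 400 / 16384

# Empirical check of the wild-child probability with a seeded generator
rng = random.Random(0)
trials = 100_000
observed_wild = sum(fires(100, 10, rng) for _ in range(trials)) / trials
```

The exact values are 100/1024 ≈ 9.8% and 400/16384 ≈ 2.4%, which the text rounds to 10%
and 2.5%.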

With those settings, the FPGA is much closer to the performance of the
two other platforms: it gets down to a quality gap of around 5%, but still
needs a little more time for that development than GPU or CPU. Furthermore,
the superiority in pure processing speed is lost due to the use of time-consuming
euclidean distance calculations in the 2-opt algorithm: In this case, the FPGA
only performs around 30000 crossovers per second, almost exactly what both
GPU and CPU are able to achieve. Therefore, the new setup eliminates one of
the striking advantages of the FPGA, the pure processing speed, but allows it
to achieve a very comparable algorithmic performance with only a fraction of the
power consumption of the two other devices. Finally, this experimental result
stresses again that the FPGA should be seen as a crucial step towards
frugal and efficient computing.
9.2.3 P-n50-k7 - with elitism and all-Euclidean distance calculations for 2-opt and fitness

Based on the experimental findings regarding the optimized algorithmic
performance of the FPGA with Elitism and all-Euclidean distance metrics, the
design was tested with other, smaller problems from an older problem set
([42]). A striking success was achieved for a small 50-node problem (Figure 46)
from the Augerat P-set: directly with the first run of crossovers after 2.7 seconds,
a performance gap of only 1.65% is reached. This is remarkably faster than both
CPU and GPU for such a result. Overall, the performance of the FPGA for this
problem is better than that of the GPU at any point in time. Only the CPU manages to
achieve a slightly better performance gap, but this takes around 200 seconds. In
conclusion, the FPGA turns out to be the device of choice when it comes to the
processing of small problems under time and energy limitations.

Figure 45: Performance of GPU, CPU and FPGA for the X-n101-k25 problem.
This time, the FPGA was set up with an Elitist list and Euclidean distance metrics
for both fitness and 2-opt calculation. Displayed is the quality gap to the best
known solution over time. [Plot: X-n101-k25 with elitism and all Euclidean
distances, comparison of GPU, CPU and FPGA; x-axis: computing time in sec
(0 to 6000); y-axis: percentage quality gap to the best known solution (0 to 0.9).]

Figure 46: Performance of GPU, CPU and FPGA for the P-n50-k7 problem. The
FPGA was set up with an Elitist list and Euclidean distance metrics for both
fitness and 2-opt calculation. Displayed is the quality gap to the best known
solution over time. [Plot: P-n50-k7, comparison of GPU, CPU and FPGA (Elitism,
all-Euclidean); x-axis: computing time in sec (0 to 9000); y-axis: percentage
quality gap to the best known solution (0 to 0.12).]
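
The performance gap quoted throughout this chapter is the relative distance of a solution to the best known solution, and the comparison between devices is essentially a question of how quickly a given gap is reached. A minimal Python sketch of both quantities follows; the function names and the trace values are made up for illustration and are not measurements from the experiments.

```python
from typing import Optional, Sequence, Tuple


def quality_gap(cost: float, best_known: float) -> float:
    """Relative gap of a solution cost to the best known solution (0.0165 means 1.65%)."""
    if best_known <= 0:
        raise ValueError("best_known must be positive")
    return (cost - best_known) / best_known


def time_to_gap(trace: Sequence[Tuple[float, float]], target_gap: float) -> Optional[float]:
    """First time stamp at which the gap is at or below target_gap, or None if never."""
    for time_sec, gap in trace:
        if gap <= target_gap:
            return time_sec
    return None


# Hypothetical samples of (time in seconds, solution cost) for a best known cost of 1000.
best_known = 1000.0
costs = [(1.0, 1100.0), (2.7, 1016.5), (50.0, 1012.0)]
trace = [(t, quality_gap(c, best_known)) for t, c in costs]
first_hit = time_to_gap(trace, 0.02)
```

Comparing `time_to_gap` for the same target across devices gives a single number per platform, which is one way to summarize curves such as those in the figures of this section.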
9.2.4 P-n70-k10 - with elitism and all-Euclidean distance calculations for 2-opt and fitness

In contrast to the 50-node problem before, the FPGA does not outperform either
GPU or CPU on a 70-node problem (Figure 47): while it still reaches a 6.5%
performance gap in the end, the development over time is much slower than
for the two other devices. Interestingly, the CPU still outperforms the GPU
drastically, both in quality and calculation speed. This can be explained by the
non-parallelizable control overhead in such a small problem compared to the larger
benchmarks. In such a case, the superior clock speed of the CPU beats the
multi-core design of the GPU.
However, the most important fact for the FPGA development is that a very similar
algorithmic tuning produces very different results for a 50- and a 70-node problem
from the same set of CVRP benchmarks. This allows the final conclusion that the
tuning of the FPGA, unlike the parametric control of the implementations on
GPU and CPU, is sensitive to the exact problem, which requires more research
on the parametric tuning of the hardware architecture in the future. Another
interesting observation regarding the algorithmic tuning and its effects is the clear
spikes in performance gap that can be seen during all phases of FPGA processing:
those are most likely caused by the WildChildRate, which in rare conditions
may cause the selection of low-quality individuals as final results of the algorithm.
On the positive side, those spikes only occur occasionally, and the general
development afterwards returns to the former quality level of individuals.
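
When such traces are evaluated offline, temporary spikes can be separated from the underlying trend by taking the running minimum of the gap. The following Python sketch is a post-processing helper under that assumption; it is not part of the test framework described in this thesis, and the sample values are invented.

```python
from itertools import accumulate
from typing import List, Sequence


def best_so_far(gaps: Sequence[float]) -> List[float]:
    """Running minimum of a gap trace, i.e. the best quality seen up to each sample."""
    return list(accumulate(gaps, min))


def spike_indices(gaps: Sequence[float], tolerance: float = 0.0) -> List[int]:
    """Indices where the reported gap is worse than the best gap seen so far."""
    envelope = best_so_far(gaps)
    return [i for i, (g, b) in enumerate(zip(gaps, envelope)) if g > b + tolerance]


# Hypothetical gap trace with one temporary spike at index 3.
gaps = [0.30, 0.12, 0.08, 0.25, 0.08, 0.07]
envelope = best_so_far(gaps)
spikes = spike_indices(gaps)
```

Plotting `envelope` instead of the raw trace would hide the spikes, while `spikes` shows how often they occur.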
9.2.5 X-n125-k30 - with elitism and all-Euclidean distance calculations for 2-opt and fitness

Based on the findings with the simpler 50- and 70-node problems from [42],
the next step was to test some more of the more complex and modern Uchoa problems
([41]) in the same 100-node range. For the X-n125-k30 problem, the FPGA
shows a behaviour very similar to that seen for P-n70-k10 (Figure 48): while GPU and
CPU show an almost indistinguishable performance with slight speed benefits for
the classic processor, the FPGA performs worse regarding the quality of solutions.
Time-wise, the development of solutions is almost identical to the GPU, at least
in the beginning, but after around 100 seconds the FPGA algorithm seems to
have reached a limit in performance gap, while both of the other platforms show
a further fast solution development. Interestingly, this again supports the theory
of problem-dependent algorithm tuning: X-n125-k30 should be very comparable
to the slightly smaller X-n101-k25, yet for that smaller problem the FPGA does reach
the same quality gap as CPU and GPU. Again, a different algorithmic tuning
might have helped to perform on par with those competitors.

Figure 47: Performance of GPU, CPU and FPGA for the P-n70-k10 problem.
The FPGA was set up with an Elitist list and Euclidean distance metrics for both
fitness and 2-opt calculation. Displayed is the quality gap to the best known
solution over time. [Plot: P-n70-k10, comparison of GPU, CPU and FPGA (Elitism,
all-Euclidean); x-axis: computing time in sec (0 to 25000); y-axis: percentage
quality gap to the best known solution (0 to 0.45).]

Figure 48: Performance of GPU, CPU and FPGA for the X-n125-k30 problem.
The FPGA was set up with an Elitist list and Euclidean distance metrics for both
fitness and 2-opt calculation. Displayed is the quality gap to the best known
solution over time. [Plot: X-n125-k30, comparison of GPU, CPU and FPGA (Elitism,
all-Euclidean); x-axis: computing time in sec (0 to 25000); y-axis: percentage
quality gap to the best known solution (0 to 0.25).]
9.2.6 X-n139-k10 - with elitism and all-Euclidean distance calculations for 2-opt and fitness

Another good example of the problem-dependent efficiency of the algorithmic
tuning is the X-n139-k10 problem: in this case, all three devices end up with a
very similar performance gap of around 5-8%, but again the CPU is faster than both
GPU and FPGA (Figure 49).

Figure 49: Performance of GPU, CPU and FPGA for the X-n139-k10 problem.
The FPGA was set up with an Elitist list and Euclidean distance metrics for both
fitness and 2-opt calculation. Displayed is the quality gap to the best known
solution over time. [Plot: X-n139-k10, comparison of GPU, CPU and FPGA (Elitism,
all-Euclidean); x-axis: computing time in sec (0 to 50000); y-axis: percentage
quality gap to the best known solution (0 to 1.4).]
9.2.7 X-n148-k46 - with elitism and all-Euclidean distance calculations for 2-opt and fitness

The most striking observation one can make from the X-n148-k46 problem is
that the GPU begins to show its advantages at this problem size: while the CPU
is still a little faster in the beginning, the GPU achieves a better performance
gap in the long run. Generally speaking, one would assume that the GPU
outperforms both other platforms when it comes to large problem sets. In those
cases, it can really benefit from its massively parallel core-based architecture to run
a maximum number of crossovers at the same time. The control-flow overhead
becomes less important, and with it the generally higher clock speed of the
CPU, which cannot compete with the GPU in terms of parallelism.
For this specific problem, the FPGA is not really competitive, neither time-wise
nor regarding the solution quality achieved in the end. However, this does not seem
to be related to the problem size, since the achieved performance gap
for the next bigger problem is on par with the other devices again.

Figure 50: Performance of GPU, CPU and FPGA for the X-n148-k46 problem.
The FPGA was set up with an Elitist list and Euclidean distance metrics for both
fitness and 2-opt calculation. Displayed is the quality gap to the best known
solution over time. [Plot: X-n148-k46, comparison of GPU, CPU and FPGA (Elitism,
all-Euclidean); x-axis: computing time in sec (0 to 25000); y-axis: percentage
quality gap to the best known solution (0 to 0.6).]
9.2.8 X-n167-k10 - with elitism and all-Euclidean distance calculations for 2-opt and fitness

The trends described before can be observed again in this 166-node problem:
GPU and CPU are now almost identical in their development towards a
performance gap of around 10%. The FPGA shows that its tuning fits this problem
almost ideally, so that the development is even faster than on GPU and CPU in the
beginning, and the final performance gap ends up at around 11.3% (Figure 51).

Figure 51: Performance of GPU, CPU and FPGA for the X-n167-k10 problem.
The FPGA was set up with an Elitist list and Euclidean distance metrics for both
fitness and 2-opt calculation. Displayed is the quality gap to the best known
solution over time. [Plot: X-n167-k10, comparison of GPU, CPU and FPGA (Elitism,
all-Euclidean); x-axis: computing time in sec (0 to 35000); y-axis: percentage
quality gap to the best known solution (0 to 1.4).]

CHAPTER 10
Conclusion and Future Work
Looking back on the initial objective of this thesis and the final results, one
can record quite a lot of successful milestones:
• A fully functional SystemVerilog implementation of the GA for realization
on a high-performance Xilinx FPGA was created and tested. It works stably
and reproducibly for the various standardized testbenches.
• This design further achieves the main goal of higher calculation performance
than the competing implementations on GPU and CPU: more crossovers
per unit of time can be executed than on those platforms, at least if a
non-Euclidean distance metric is used for the 2-opt algorithm.
• Furthermore, with applied elitism and a Euclidean distance metric for all
processing steps including 2-opt, the FPGA implementation shows an
algorithmic efficiency that is, at least for problems up to around 170 nodes,
comparable to both GPU and CPU. However, in this setup it
loses its superiority in pure processing speed due to the long square-root
calculation times in 2-opt.
• Finally, a highly functional framework for automated testing, including
the generation of Verilog memory modules and the reading and
interpretation of UART transmissions, was written in Python. It allows for
easy usage of the FPGA design even for researchers new to the topic.
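
To give an impression of what the memory-generation step of such a framework might look like, the following Python sketch packs node coordinates into hexadecimal words, one word per line, which is the text format read by the Verilog system task `$readmemh`. The word layout (x in the upper half, y in the lower half), the 16-bit width and the function name are assumptions for illustration; the framework of this thesis generates Verilog memory modules and may be organized differently.

```python
from typing import Iterable, List, Tuple


def coords_to_memh(coords: Iterable[Tuple[int, int]], bits_per_axis: int = 16) -> List[str]:
    """Pack each (x, y) pair into one hex word, one word per line, as $readmemh expects."""
    hex_digits = (2 * bits_per_axis + 3) // 4
    limit = 1 << bits_per_axis
    lines = []
    for x, y in coords:
        if not (0 <= x < limit and 0 <= y < limit):
            raise ValueError(f"coordinate ({x}, {y}) does not fit into {bits_per_axis} bits")
        word = (x << bits_per_axis) | y  # assumed layout: x in the upper bits, y in the lower bits
        lines.append(f"{word:0{hex_digits}x}")
    return lines


# Hypothetical node coordinates; a real run would read them from a CVRPLIB instance file.
memh_lines = coords_to_memh([(0, 0), (10, 255), (4096, 1)])
```

Writing `"\n".join(memh_lines)` to a file would give a memory image that a testbench could load at simulation start.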
Keeping in mind the two different results in the experiments with the FPGA
in comparison to GPU and CPU, two different further research directions might
be interesting to follow:

Using pure processing power in a hybrid computer architecture
Keeping in mind the first experimental run of the FPGA for the 100-node
problem (Figure 41), it is obvious that the original FPGA design with a squared
distance metric for the 2-opt algorithm has a huge benefit in pure processing power
(Figure 44), but lacks the efficiency in algorithmic processing that one can achieve
with GPU and CPU. Furthermore, the observations regarding the memory usage
(Figure 39) imply that the current FPGA setup cannot be used for the largest CVRP
benchmarks known today.
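
A small worked example shows how a squared distance metric and a Euclidean metric can disagree about a single 2-opt move. The Python sketch below uses four hypothetical points and the standard 2-opt exchange of edges (a,b) and (c,d) for (a,c) and (b,d); it only illustrates the general effect and does not establish that this effect explains the efficiency difference observed in the experiments.

```python
import math
from typing import Callable, Tuple

Point = Tuple[float, float]


def euclidean(p: Point, q: Point) -> float:
    """Euclidean distance, which requires a square root."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def squared(p: Point, q: Point) -> float:
    """Squared distance, which avoids the square root."""
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * dx + dy * dy


def two_opt_delta(a: Point, b: Point, c: Point, d: Point,
                  dist: Callable[[Point, Point], float]) -> float:
    """Cost change when edges (a,b) and (c,d) are replaced by (a,c) and (b,d)."""
    return dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d)


# Hypothetical points: one short and one long edge are swapped for two medium edges.
a, b, c, d = (0, 0), (10, 0), (-20, 26), (30, 26)
delta_euclid = two_opt_delta(a, b, c, d, euclidean)   # positive: the tour gets longer
delta_squared = two_opt_delta(a, b, c, d, squared)    # negative: the move looks like a gain
```

Here the Euclidean tour length grows by about 5.6, while the squared metric reports a change of -448, so a 2-opt step driven by squared distances would accept a move that lengthens the route.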
Both algorithmic efficiency and memory availability are linked only to the
macro-algorithm and its implementation in hardware, while the micro-algorithm as
directly implemented in the ProcessingNodes is exactly the same as the one
found on GPU and CPU. Therefore, it seems to make sense to combine the
best of both worlds: while using the superior algorithmic efficiency and basically
unlimited memory space of the classic computational platforms GPU and CPU,
it seems beneficial to trust the FPGA with all micro-algorithmic processing steps
(Figure 52). Such a hybrid architecture would use a high-bandwidth interconnect
between the population memory and macro-algorithm on, for example, a CPU and
multiple ProcessingNodes on FPGAs, over which parents and children could be
transmitted. From a hardware perspective, the chosen Xilinx FPGA seems to be
the perfect platform for experiments of that kind, since it provides four QSFP28
interfaces with 30 GB/s as I/O interfaces. This high-bandwidth interconnect
would allow for an almost uninterrupted flow of data between ProcessingNodes
and CPU. Future work on the topic could therefore concentrate on implementing
such a hybrid design, to test it especially against the latest approaches with
multi-GPU networks for larger problems. Interestingly, this conclusion is similar to
the approaches found in the early days of study with FPGAs for the VRP and
TSP [26]. While the still existing limitations in hardware resources, in combination
with a beneficial algorithmic performance in software, strengthen this idea as a
possible way forward for very large problems, the promising results with elitism and
better tuning have also shown that a fully integrated GA for CVRP on FPGA is
possible nowadays.

Figure 52: Design study for a hybrid computer system consisting of CPU for
the macro-algorithm and one or multiple FPGAs for the micro-processing steps,
connected via a high-bandwidth interconnect. [Diagram: a workstation running
the macro-algorithm (easier memory management for larger populations and
elitism, easier to debug) is connected via QSFP28 interfaces to an FPGA running
the micro-algorithm (superior processing power in the ProcessingNodes,
completely equal to the GPU/CPU implementation).]
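
The division of labour in such a hybrid system can be sketched in a few lines of Python. In this sketch the host keeps the population and selects parents (macro-algorithm), while worker threads stand in for the ProcessingNodes (micro-algorithm). The thread pool, the simplified crossover and all names are hypothetical placeholders; a real system would transmit parents and children over the interconnect instead of calling a local function.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Tour = List[int]


def processing_node(pair: Tuple[Tour, Tour]) -> Tour:
    """Hypothetical stand-in for one FPGA ProcessingNode: a simple order-preserving crossover."""
    parent_a, parent_b = pair
    head = parent_a[: len(parent_a) // 2]
    taken = set(head)
    # Fill up with the remaining cities in the order in which they appear in parent_b.
    return head + [city for city in parent_b if city not in taken]


def macro_step(population: List[Tour], rng: random.Random, n_nodes: int = 4) -> List[Tour]:
    """Host-side step: select parent pairs, dispatch them to the nodes, collect the children."""
    pairs = [tuple(rng.sample(population, 2)) for _ in range(n_nodes)]
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        return list(pool.map(processing_node, pairs))


rng = random.Random(0)
cities = list(range(1, 9))
population = [rng.sample(cities, len(cities)) for _ in range(6)]
children = macro_step(population, rng)
```

Keeping selection and population memory on the host side means that only the function behind `processing_node` would have to be replaced by a call into the FPGA.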
Finding the perfect tuning for the FPGA-based algorithm
Contrary to the first findings above, the later addition of elitism and the
introduction of Euclidean distance metrics for all processing steps proved that
the FPGA can achieve a competitive algorithmic efficiency, at least for smaller
problems. However, the experiments for various problems of the Uchoa benchmark
sets revealed that the algorithmic outcome depends very much on the specific
problem (Figure 45, Figure 48, Figure 49, Figure 50, Figure 51). It seems that
the tuning of the algorithm is much more important on the FPGA than it is
on both GPU and CPU, while it also includes parameters unknown to the other
platforms, for example regarding the WildChild mechanism. At the same time, the
process of algorithmic tuning is much more time-consuming and therefore more
difficult than on the other platforms: each set of parameters has to be used as a
starting point for its own synthesis run, followed by the actual test procedure. On
top of that, the FPGA itself is a “closed device”, which does not allow insight
into, for example, the development of the population diversity over run time, as
opposed to what is possible with the software-controlled GPU and CPU. Furthermore,
the research recently conducted at URI features specific AI-based tuning mechanisms
for software implementations of GAs for the CVRP, which are not directly
applicable to the FPGA in this form. Hence, it seems beneficial to develop an exact
software replica of the FPGA-based hardware algorithm in the next step, so that
this “software twin” can be used for further tuning measures. Such a simulator
could be run in parallel on multiple servers to test various parameter combinations
against each other while providing detailed insight into the algorithmic structures
during run time. This would speed up the development process
and allow for a better problem adaptation of the FPGA-based GA in the future.
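
The structure of such a parameter study could look like the following Python sketch. The function `simulate_twin` is a pure placeholder that returns a synthetic number instead of running a real simulation, the parameter values are arbitrary, and the thread pool only stands in for a distribution over several servers; none of this exists in the present work.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Dict, List, Tuple

Params = Tuple[int, int]  # (WildChild threshold, Elitist selection threshold)


def simulate_twin(params: Params, seed: int = 0) -> float:
    """Placeholder for the software twin: returns a synthetic final gap, not a real result."""
    wild_child_threshold, elitist_threshold = params
    rng = random.Random(f"{wild_child_threshold}-{elitist_threshold}-{seed}")
    return rng.uniform(0.05, 0.20)


def sweep(grid: List[Params], workers: int = 4) -> Dict[Params, float]:
    """Evaluate every parameter combination in parallel and collect the final gaps."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        gaps = list(pool.map(simulate_twin, grid))
    return dict(zip(grid, gaps))


# Hypothetical grid of threshold combinations to be compared against each other.
grid = list(product([50, 100, 200], [200, 400, 800]))
results = sweep(grid)
best_params = min(results, key=results.get)
```

With a real simulator behind `simulate_twin`, each combination would no longer need its own synthesis run, and a process pool or a job scheduler would replace the thread pool for CPU-bound runs.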


LIST OF REFERENCES
[1] J.-P. Rodrigue, “Logistics costs, United States, 1980-2017,” The Geography
of Transport Systems. [Online]. Available: https://transportgeography.org/
contents/chapter7/logistics-freight-distribution/logistics-costs-united-states/
[2] International Energy Agency. “Tracking transport 2020.” Website. Date
accessed: 01/19/2022. [Online]. Available: https://www.iea.org/reports/
tracking-transport-2020
[3] G. B. Dantzig and J. H. Ramser, “The truck dispatching problem,”
Management science, vol. 6, no. 1, pp. 80–91, 1959.
[4] M. F. Abdelatti and M. S. Sodhi, “An Improved GPU-Accelerated
Heuristic Technique Applied to the Capacitated Vehicle Routing Problem,”
in Proceedings of the 2020 Genetic and Evolutionary Computation
Conference, ser. GECCO ’20. New York, NY, USA: Association for
Computing Machinery, 2020, p. 663–671. [Online]. Available: https:
//doi.org/10.1145/3377930.3390159
[5] H. Zhang, H. Ge, J. Yang, and Y. Tong, “Review of Vehicle Routing
Problems: Models, Classification and Solving Algorithms,” Archives of
Computational Methods in Engineering, vol. 29, pp. 195–221, 01 2022.
[Online]. Available: https://doi.org/10.1007/s11831-021-09574-x
[6] P. Bowes. “Pitney Bowes Parcel Shipping index.” Website. Date
accessed: 01/19/2022. [Online]. Available: https://www.pitneybowes.
com/us/shipping-index.html
[7] Amazon.com, Inc.; Massachusetts Institute of Technology; MIT Center
for Transportation & Logistics. “Amazon Last Mile Routing Research
Challenge.” Website. Date accessed: 01/17/2022. 2021. [Online]. Available:
https://routingchallenge.mit.edu/
[8] A. Fender. “Nvidia brings ai to the supply chain.” Website. Date accessed:
01/19/2022. 2021. [Online]. Available: https://blogs.nvidia.com/blog/2021/
11/09/reopt-ai-software-supply-chain/
[9] G. Laporte, “The vehicle routing problem: An overview of exact and
approximate algorithms,” European Journal of Operational Research, vol. 59,
no. 3, pp. 345–358, 1992. [Online]. Available: https://www.sciencedirect.
com/science/article/pii/037722179290192C


[10] G. Laporte and Y. Nobert, “Exact Algorithms for the Vehicle Routing
Problem,” in Surveys in Combinatorial Optimization, ser. North-Holland
Mathematics Studies, S. Martello, G. Laporte, M. Minoux, and C. Ribeiro,
Eds. North-Holland, 1987, vol. 132, pp. 147–184. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0304020808732353
[11] K. Deb, “An introduction to genetic algorithms,” Sadhana, vol. 24, pp.
293–315, 1999. [Online]. Available: https://doi.org/10.1007/BF02823145
[12] B. M. Baker and M. Ayechew, “A genetic algorithm for the vehicle routing
problem,” Computers & Operations Research, vol. 30, no. 5, pp. 787–800,
2003. [Online]. Available: https://www.sciencedirect.com/science/article/
pii/S0305054802000515
[13] C. Prins, “A simple and effective evolutionary algorithm for the vehicle
routing problem,” Computers & Operations Research, vol. 31, no. 12,
pp. 1985–2002, 2004. [Online]. Available: https://www.sciencedirect.com/
science/article/pii/S0305054803001588
[14] A. Garcia-Najera and J. A. Bullinaria, “Comparison of Similarity
Measures for the Multi-Objective Vehicle Routing Problem with Time
Windows,” in Proceedings of the 11th Annual Conference on Genetic
and Evolutionary Computation, ser. GECCO ’09. New York, NY, USA:
Association for Computing Machinery, 2009, p. 579–586. [Online]. Available:
https://doi.org/10.1145/1569901.1569982
[15] K. Sörensen, “Distance measures based on the edit distance for permutation-type representations,” Journal of Heuristics, vol. 13, no. 1, pp. 35–47, 2007.
[Online]. Available: https://doi.org/10.1007/s10732-006-9001-3
[16] A. Garcia-Najera and J. A. Bullinaria, “A multi-objective density
restricted genetic algorithm for the vehicle routing problem with time
windows,” in 2008 UK Workshop on Computational Intelligence, 2008,
pp. 47–52. [Online]. Available: https://www.researchgate.net/publication/
228641635_A_Multi-Objective_Density_Restricted_Genetic_Algorithm_for_
the_Vehicle_Routing_Problem_with_Time_Windows
[17] A. Garcia-Najera and J. A. Bullinaria, “Bi-objective optimization for the
vehicle routing problem with time windows: Using route similarity to enhance
performance,” in Evolutionary Multi-Criterion Optimization, M. Ehrgott,
C. M. Fonseca, X. Gandibleux, J.-K. Hao, and M. Sevaux, Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2009, pp. 275–289.
[18] A. Garcia-Najera and J. A. Bullinaria, “An improved multi-objective
evolutionary algorithm for the vehicle routing problem with time
windows,” Computers & Operations Research, vol. 38, no. 1, pp.
287–300, 2011, project Management and Scheduling. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0305054810001176
[19] A. Garcia-Najera, “Preserving Population Diversity for the Multi-Objective
Vehicle Routing Problem with Time Windows,” in Proceedings of the 11th
Annual Conference Companion on Genetic and Evolutionary Computation
Conference: Late Breaking Papers, ser. GECCO ’09. New York, NY,
USA: Association for Computing Machinery, 2009, p. 2689–2692. [Online].
Available: https://doi.org/10.1145/1570256.1570385
[20] R. Bent and P. V. Hentenryck, “A two-stage hybrid algorithm for
pickup and delivery vehicle routing problems with time windows,”
Computers & Operations Research, vol. 33, no. 4, pp. 875–893,
2006, part Special Issue: Optimization Days 2003. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0305054804001911
[21] P. Larrañaga, C. Kuijpers, R. Murga, I. Inza, and S. Dizdarevic, “Genetic
algorithms for the travelling salesman problem: A review of representations
and operators,” Artificial intelligence review: An international survey and
tutorial journal, vol. 13, no. 2, pp. 129–170, Apr. 1999.
[22] A. Benaini and A. Berrajaa, “Genetic algorithm for large dynamic vehicle
routing problem on gpu,” in 2018 4th International Conference on Logistics
Operations Management (GOL). IEEE, 2018, pp. 1–9. [Online]. Available:
https://ieeexplore.ieee.org/document/8378082
[23] I. Coelho, P. Munhoz, L. Ochi, M. Souza, C. Bentes, and R. Farias, “An
integrated CPU–GPU heuristic inspired on variable neighbourhood search
for the single vehicle routing problem with deliveries and selective pickups,”
International Journal of Production Research, vol. 54, no. 4, pp. 945–962,
2016. [Online]. Available: https://doi.org/10.1080/00207543.2015.1035811
[24] P. Jog, J.-Y. Suh, and D. V. Gucht, “Parallel genetic algorithms applied to
the traveling salesman problem,” SIAM J. Optim., vol. 1, pp. 515–529, 1991.
[25] S. Hougardy, F. Zaiser, and X. Zhong, “The approximation ratio of the 2-opt heuristic for the metric traveling salesman problem,” Operations Research
Letters, vol. 48, 2020.
[26] P. Graham and B. Nelson, “A hardware genetic algorithm for the traveling
salesman problem on splash 2,” in Field-Programmable Logic and Applications,
W. Moore and W. Luk, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg,
1995, pp. 352–361.
[27] I. Skliarova and A. F. Ferrari, “FPGA-Based Implementation of Genetic
Algorithm for the Traveling Salesman Problem and Its Industrial
Application,” Lecture Notes in Computer Science, vol. 2358, 2002.

[28] Y. Jewajinda and P. Chongstitvatana, “FPGA Implementation of a Cellular
Compact Genetic Algorithm,” in 2008 NASA/ESA Conference on Adaptive
Hardware and Systems, 2008, pp. 385–390.
[29] Z. Yan-cong, G. Jun-hua, D. Yong-feng, and H. Huan-ping, “Implementation
of genetic algorithm for TSP based on FPGA,” in 2011 Chinese Control and
Decision Conference (CCDC), 2011, pp. 2226–2231.
[30] M. Vega-Rodriguez, R. Gutierrez-Gil, J. Avila-Roman, J. Sanchez-Perez, and
J. Gomez-Pulido, “Genetic algorithms using parallelism and FPGAs: the
TSP as case study,” in 2005 International Conference on Parallel Processing
Workshops (ICPPW’05), 2005, pp. 573–579.
[31] U. Farooq, Z. Marrakchi, and H. Mehrez, Tree-Based Heterogeneous FPGA
Architectures: Application Specific Exploration and Optimization. Springer
Publishing Company, Incorporated, 2014.
[32] Y. Tavares, D. Belfort, S. Catunda, S. Rodrigues, and J. Tandon,
“On the development of an island-style FPGA,” Revista Principia
- Divulgação Cientı́fica e Tecnológica do IFPB, vol. 1, p. 85,
03 2020. [Online]. Available: https://periodicos.ifpb.edu.br/index.php/
principia/article/viewFile/3494/1243
[33] C. Darwin, On the Origin of Species by Means of Natural Selection. London:
Murray, 1859.
[34] H. Spencer, The Principles of Biology. London: Williams and Norgate, 1864.
[35] R. Merrick. “LFSR in an FPGA - VHDL & Verilog Code.” Website. Date
accessed: 03/15/2022. [Online]. Available: https://www.nandland.com/vhdl/
modules/lfsr-linear-feedback-shift-register.html
[36] R. Merrick. “UART, Serial Port, RS-232 Interface.” Website. Date accessed:
07/01/2022. [Online]. Available: https://www.nandland.com/vhdl/modules/
module-uart-serial-port-rs232.html
[37] W. Green. “Square Root in Verilog.” Website. Date accessed: 06/20/2022.
[Online]. Available: https://projectf.io/posts/square-root-in-verilog/
[38] Intel Corporation. “Intel®Stratix®10 MX 2100 FPGA.” Website. Date
accessed: 05/20/2022. [Online]. Available: https://www.intel.com/content/www/us/
en/products/sku/210297/intel-stratix-10-mx-2100-fpga/specifications.html
[39] AMD Xilinx. “Virtex UltraScale+ HBM VCU128 FPGA Evaluation
Kit.” Website. Date accessed: 05/20/2022. 2022. [Online]. Available:
https://www.xilinx.com/products/boards-and-kits/vcu128.html


[40] I. Lima, E. Uchoa, D. Pecin, A. Pessoa, M. Poggi, T. Vidal, A. Subramanian,
D. Oliveira, and E. Queiroga. “CVRPLIB - Capacitated Vehicle Routing
Problem Library.” Date accessed: 08/02/2022. 2014 - present. [Online].
Available: http://vrp.atd-lab.inf.puc-rio.br/index.php/en/about
[41] E. Uchoa, D. Pecin, A. Pessoa, M. Poggi, T. Vidal, and A. Subramanian,
“New benchmark instances for the Capacitated Vehicle Routing Problem,”
European Journal of Operational Research, vol. 257, no. 3, pp. 845 – 858,
2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/
S0377221716306270
[42] P. Augerat, J. M. Belenguer, E. Benavent, A. Corberán, D. Naddef, and
G. Rinaldi, “Computational results with a branch and cut code for the
capacitated vehicle routing problem,” 01 1995.


BIBLIOGRAPHY
Abdelatti, M. F. and Sodhi, M. S., “An Improved GPU-Accelerated
Heuristic Technique Applied to the Capacitated Vehicle Routing Problem,”
in Proceedings of the 2020 Genetic and Evolutionary Computation
Conference, ser. GECCO ’20. New York, NY, USA: Association for
Computing Machinery, 2020, p. 663–671. [Online]. Available: https:
//doi.org/10.1145/3377930.3390159
Amazon.com, Inc.; Massachusetts Institute of Technology; MIT Center
for Transportation & Logistics. “Amazon Last Mile Routing Research
Challenge.” Website. Date accessed: 01/17/2022. 2021. [Online]. Available:
https://routingchallenge.mit.edu/
AMD Xilinx. “Virtex UltraScale+ HBM VCU128 FPGA Evaluation Kit.”
Website. Date accessed: 05/20/2022. 2022. [Online]. Available: https:
//www.xilinx.com/products/boards-and-kits/vcu128.html
Augerat, P., Belenguer, J. M., Benavent, E., Corberán, A., Naddef, D., and
Rinaldi, G., “Computational results with a branch and cut code for the
capacitated vehicle routing problem,” 01 1995.
Baker, B. M. and Ayechew, M., “A genetic algorithm for the vehicle routing
problem,” Computers & Operations Research, vol. 30, no. 5, pp. 787–800,
2003. [Online]. Available: https://www.sciencedirect.com/science/article/pii/
S0305054802000515
Benaini, A. and Berrajaa, A., “Genetic algorithm for large dynamic vehicle
routing problem on gpu,” in 2018 4th International Conference on Logistics
Operations Management (GOL). IEEE, 2018, pp. 1–9. [Online]. Available:
https://ieeexplore.ieee.org/document/8378082
Bent, R. and Hentenryck, P. V., “A two-stage hybrid algorithm for
pickup and delivery vehicle routing problems with time windows,”
Computers & Operations Research, vol. 33, no. 4, pp. 875–893,
2006, part Special Issue: Optimization Days 2003. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0305054804001911
Bowes, P. “Pitney Bowes Parcel Shipping index.” Website. Date
accessed: 01/19/2022. [Online]. Available: https://www.pitneybowes.
com/us/shipping-index.html

Coelho, I., Munhoz, P., Ochi, L., Souza, M., Bentes, C., and Farias, R., “An
integrated CPU–GPU heuristic inspired on variable neighbourhood search
for the single vehicle routing problem with deliveries and selective pickups,”
International Journal of Production Research, vol. 54, no. 4, pp. 945–962,
2016. [Online]. Available: https://doi.org/10.1080/00207543.2015.1035811
Dantzig, G. B. and Ramser, J. H., “The truck dispatching problem,” Management
science, vol. 6, no. 1, pp. 80–91, 1959.
Darwin, C., On the Origin of Species by Means of Natural Selection. London:
Murray, 1859.

Deb, K., “An introduction to genetic algorithms,” Sadhana, vol. 24, pp. 293–315,
1999. [Online]. Available: https://doi.org/10.1007/BF02823145
Farooq, U., Marrakchi, Z., and Mehrez, H., Tree-Based Heterogeneous FPGA
Architectures: Application Specific Exploration and Optimization. Springer
Publishing Company, Incorporated, 2014.
Fender, A. “Nvidia brings ai to the supply chain.” Website. Date accessed:
01/19/2022. 2021. [Online]. Available: https://blogs.nvidia.com/blog/2021/
11/09/reopt-ai-software-supply-chain/
Garcia-Najera, A., “Preserving Population Diversity for the Multi-Objective
Vehicle Routing Problem with Time Windows,” in Proceedings of the 11th
Annual Conference Companion on Genetic and Evolutionary Computation
Conference: Late Breaking Papers, ser. GECCO ’09. New York, NY,
USA: Association for Computing Machinery, 2009, p. 2689–2692. [Online].
Available: https://doi.org/10.1145/1570256.1570385
Garcia-Najera, A. and Bullinaria, J. A., “A multi-objective density restricted
genetic algorithm for the vehicle routing problem with time windows,”
in 2008 UK Workshop on Computational Intelligence, 2008, pp.
47–52. [Online]. Available: https://www.researchgate.net/publication/
228641635_A_Multi-Objective_Density_Restricted_Genetic_Algorithm_for_
the_Vehicle_Routing_Problem_with_Time_Windows
Garcia-Najera, A. and Bullinaria, J. A., “Bi-objective optimization for the vehicle
routing problem with time windows: Using route similarity to enhance
performance,” in Evolutionary Multi-Criterion Optimization, Ehrgott, M.,
Fonseca, C. M., Gandibleux, X., Hao, J.-K., and Sevaux, M., Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2009, pp. 275–289.
Garcia-Najera, A. and Bullinaria, J. A., “Comparison of Similarity Measures
for the Multi-Objective Vehicle Routing Problem with Time Windows,” in
Proceedings of the 11th Annual Conference on Genetic and Evolutionary
Computation, ser. GECCO ’09. New York, NY, USA: Association for
Computing Machinery, 2009, p. 579–586. [Online]. Available: https:
//doi.org/10.1145/1569901.1569982

Garcia-Najera, A. and Bullinaria, J. A., “An improved multi-objective
evolutionary algorithm for the vehicle routing problem with time
windows,” Computers & Operations Research, vol. 38, no. 1, pp.
287–300, 2011, project Management and Scheduling. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0305054810001176
Graham, P. and Nelson, B., “A hardware genetic algorithm for the traveling
salesman problem on splash 2,” in Field-Programmable Logic and Applications,
Moore, W. and Luk, W., Eds. Berlin, Heidelberg: Springer Berlin Heidelberg,
1995, pp. 352–361.
Green, W. “Square Root in Verilog.” Website. Date accessed: 06/20/2022.
[Online]. Available: https://projectf.io/posts/square-root-in-verilog/
Hougardy, S., Zaiser, F., and Zhong, X., “The approximation ratio of the 2-opt heuristic for the metric traveling salesman problem,” Operations Research
Letters, vol. 48, 2020.
Intel Corporation. “Intel®Stratix®10 MX 2100 FPGA.” Website. Date accessed:
05/20/2022. [Online]. Available: https://www.intel.com/content/www/us/
en/products/sku/210297/intel-stratix-10-mx-2100-fpga/specifications.html
International Energy Agency. “Tracking transport 2020.” Website. Date
accessed: 01/19/2022. [Online]. Available:
https://www.iea.org/reports/tracking-transport-2020
Jewajinda, Y. and Chongstitvatana, P., “FPGA Implementation of a Cellular
Compact Genetic Algorithm,” in 2008 NASA/ESA Conference on Adaptive
Hardware and Systems, 2008, pp. 385–390.
Jog, P., Suh, J.-Y., and Gucht, D. V., “Parallel genetic algorithms applied to the
traveling salesman problem,” SIAM J. Optim., vol. 1, pp. 515–529, 1991.
Laporte, G., “The vehicle routing problem: An overview of exact and approximate
algorithms,” European Journal of Operational Research, vol. 59, no. 3, pp.
345–358, 1992. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/037722179290192C
Laporte, G. and Nobert, Y., “Exact Algorithms for the Vehicle Routing
Problem,” in Surveys in Combinatorial Optimization, ser. North-Holland
Mathematics Studies, Martello, S., Laporte, G., Minoux, M., and Ribeiro,
C., Eds. North-Holland, 1987, vol. 132, pp. 147–184. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0304020808732353
Larrañaga, P., Kuijpers, C., Murga, R., Inza, I., and Dizdarevic, S., “Genetic
algorithms for the travelling salesman problem: A review of representations
and operators,” Artificial intelligence review: An international survey and
tutorial journal, vol. 13, no. 2, pp. 129–170, Apr. 1999.
Lima, I., Uchoa, E., Pecin, D., Pessoa, A., Poggi, M., Vidal, T., Subramanian, A.,
Oliveira, D., and Queiroga, E. “CVRPLIB - Capacitated Vehicle Routing
Problem Library.” Date accessed: 08/02/2022. 2014 - present. [Online].
Available: http://vrp.atd-lab.inf.puc-rio.br/index.php/en/about
Merrick, R. “LFSR in an FPGA - VHDL & Verilog Code.” Website. Date
accessed: 03/15/2022. [Online]. Available:
https://www.nandland.com/vhdl/modules/lfsr-linear-feedback-shift-register.html
Merrick, R. “UART, Serial Port, RS-232 Interface.” Website. Date accessed:
07/01/2022. [Online]. Available:
https://www.nandland.com/vhdl/modules/module-uart-serial-port-rs232.html
Prins, C., “A simple and effective evolutionary algorithm for the vehicle routing
problem,” Computers & Operations Research, vol. 31, no. 12, pp. 1985–2002,
2004. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0305054803001588
Rodrigue, J.-P., “Logistics costs, United States, 1980-2017,” The Geography
of Transport Systems. [Online]. Available:
https://transportgeography.org/contents/chapter7/logistics-freight-distribution/logistics-costs-united-states/
Skliarova, I. and Ferrari, A. F., “FPGA-Based Implementation of Genetic
Algorithm for the Traveling Salesman Problem and Its Industrial
Application,” Lecture Notes in Computer Science, vol. 2358, 2002.
Sörensen, K., “Distance measures based on the edit distance for permutation-type
representations,” Journal of Heuristics, vol. 13, no. 1, pp. 35–47, 2007.
[Online]. Available: https://doi.org/10.1007/s10732-006-9001-3
Spencer, H., The Principles of Biology. London: Williams and Norgate, 1864.
Tavares, Y., Belfort, D., Catunda, S., Rodrigues, S., and Tandon, J.,
“On the development of an island-style FPGA,” Revista Principia
- Divulgação Científica e Tecnológica do IFPB, vol. 1, p. 85,
Mar. 2020. [Online]. Available:
https://periodicos.ifpb.edu.br/index.php/principia/article/viewFile/3494/1243
Uchoa, E., Pecin, D., Pessoa, A., Poggi, M., Vidal, T., and Subramanian, A.,
“New benchmark instances for the Capacitated Vehicle Routing Problem,”
European Journal of Operational Research, vol. 257, no. 3, pp. 845–858,
2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0377221716306270
Vega-Rodriguez, M., Gutierrez-Gil, R., Avila-Roman, J., Sanchez-Perez, J., and
Gomez-Pulido, J., “Genetic algorithms using parallelism and FPGAs: the
TSP as case study,” in 2005 International Conference on Parallel Processing
Workshops (ICPPW’05), 2005, pp. 573–579.
Yan-cong, Z., Jun-hua, G., Yong-feng, D., and Huan-ping, H., “Implementation
of genetic algorithm for TSP based on FPGA,” in 2011 Chinese Control and
Decision Conference (CCDC), 2011, pp. 2226–2231.
Zhang, H., Ge, H., Yang, J., and Tong, Y., “Review of Vehicle Routing Problems:
Models, Classification and Solving Algorithms,” Archives of Computational
Methods in Engineering, vol. 29, pp. 195–221, Jan. 2022. [Online]. Available:
https://doi.org/10.1007/s11831-021-09574-x


