Representation of Evolutionary Algorithms in FPGA Cluster for Project of
  Large-Scale Networks by Perina, Andre B. et al.
Andre B. Perina1, Marcilyanne M. Gois2, Paulo Matias3, Joao M. P. Cardoso4,
Alexandre C. B. Delbem1, Vanderlei Bonato1
1Institute of Mathematical and Computer Sciences
University of São Paulo (USP) – São Carlos, SP – Brazil
2São Carlos School of Engineering
University of São Paulo (USP) – São Carlos, SP – Brazil
3São Carlos Institute of Physics
University of São Paulo (USP) – São Carlos, SP – Brazil
4Faculty of Engineering
University of Porto (UP) – Porto – Portugal
{abperina,mmgois,paulo.matias}@usp.br,jmpc@acm.org,{acbd,vbonato}@usp.br
Abstract. Many problems are related to network projects, such as electric distribution
and telecommunication. Most of them can be represented by graphs with thousands or
millions of nodes, which makes obtaining real-time solutions almost impossible. Many
efficient solutions use Evolutionary Algorithms (EAs), and research shows that the
performance of EAs can be substantially raised by using an appropriate representation,
such as the Node-Depth Encoding (NDE). The objective of this work was to partition a
single-FPGA (Field-Programmable Gate Array) implementation based on NDE, supporting
512 nodes, into a multi-FPGA approach, expanding the system to 4096 nodes.
1. Introduction
Nowadays, it is not difficult to find problems that can be mapped to a network in order
to improve solutions for power distribution networks, highways and similar systems.
These problems can be modeled as graphs, and the solution is often based on finding
the Minimum Spanning Tree (MST) of the modeled graph. These graphs are normally
extensive and complex, leading to complex computational systems with a high time cost.
Many efficient solutions for these problems are based on Evolutionary Algorithms (EAs),
which have shown good performance and efficiency [De Jong 2006].
Finding an MST is solvable in polynomial time by two well-known algorithms: Prim's and
Kruskal's. However, these algorithms fail when a degree restriction is applied to all
nodes of a graph, which is not an uncommon scenario in real-world problems. In this
case, the problem becomes NP-Hard [Knowles and Corne 2000].
In this paper we propose an expansion of the platform developed by [Gois et al. 2014]
in order to solve larger problems by using multiple FPGAs. We present a framework for
modeling and then a real implementation on hardware.
arXiv:1412.5384v1 [cs.DC] 17 Dec 2014
2. Materials and Methods
In a previous work, [Gois et al. 2014] developed a platform called NDEWG, which finds
an MST for a given graph by using an EA with complexity of $O(\sqrt{n})$
[Delbem et al. 2012] with serial processing elements, or near
$O(\sqrt{n}/\sqrt{n}) = O(1)$ if $\sqrt{n}$ processing elements (henceforth called
Workers) work in parallel. This complexity was achieved by using a special
representation for trees, called Node-Depth Encoding, or NDE [Santos et al. 2010].
The NDE represents a tree as a linear list of pairs $(n_x, d_x)$, where $n_x$ is a tree
node and $d_x$ its depth. The pairs are arranged in the list in the order in which a
depth-first search traverses the tree.
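As an illustration of the encoding, the following minimal software sketch builds the node-depth list of a small tree. The function name `node_depth_encoding`, the adjacency-list input format and the 6-node example tree are assumptions made for this example; they are not taken from the NDEWG hardware.

```python
def node_depth_encoding(adjacency, root):
    """Return the list [(node, depth), ...] of a tree in depth-first order.

    `adjacency` maps each node to the list of its neighbours in the tree
    (hypothetical input format chosen for this sketch).
    """
    encoding = []
    stack = [(root, 0, None)]  # entries are (node, depth, parent)
    while stack:
        node, depth, parent = stack.pop()
        encoding.append((node, depth))
        # Push neighbours in reverse so children are visited in listed order.
        for child in reversed(adjacency[node]):
            if child != parent:
                stack.append((child, depth + 1, node))
    return encoding


# Hypothetical tree rooted at node 0: edges 0-1, 0-2, 1-3, 1-4, 2-5.
tree = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5], 3: [1], 4: [1], 5: [2]}
nde = node_depth_encoding(tree, root=0)
print(nde)  # [(0, 0), (1, 1), (3, 2), (4, 2), (2, 1), (5, 2)]
```

In this form, the subtree rooted at any node is the contiguous run of entries that follows it with a strictly greater depth, which is what makes subtree moves cheap on the list.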
The NDEWG hardware transforms a graph into a forest of spanning trees by using a
modified version of the Kruskal algorithm. A pair of trees is randomly chosen and a
tree operator called Preserve Ancestor Operator (PAO) [Santos et al. 2010],
[De Lima et al. 2008] is applied to both, generating two different but still spanning
trees from the original graph, respecting any applied degree restriction. The results
are analysed for improvements in overall weight and discarded if no optimisation is
achieved. Many Workers may apply the PAO in parallel with different random seeds, each
checking for the best weight reduction. After many iterations, the result is an
optimised spanning tree with degree restriction.
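The prune-and-graft step behind an operator such as PAO can be sketched in software on node-depth lists. This is only a sketch under stated assumptions, not the hardware implementation: the function names are hypothetical, the caller is assumed to have already chosen a prune node and a node adjacent to it in the graph, and the weight evaluation and degree-restriction checks are omitted.

```python
def subtree_end(tree, start):
    """Index just past the subtree rooted at tree[start] in a node-depth list."""
    root_depth = tree[start][1]
    end = start + 1
    while end < len(tree) and tree[end][1] > root_depth:
        end += 1
    return end


def prune_and_graft(tree_from, tree_to, prune_node, adjacent_node):
    """Move the subtree rooted at prune_node below adjacent_node.

    Both trees are lists of (node, depth) pairs in depth-first order.
    Assumes prune_node is in tree_from, adjacent_node is in tree_to, and
    the edge between them exists in the underlying graph.
    """
    ip = next(i for i, (n, _) in enumerate(tree_from) if n == prune_node)
    ia = next(i for i, (n, _) in enumerate(tree_to) if n == adjacent_node)
    end = subtree_end(tree_from, ip)
    # The pruned root becomes a child of adjacent_node, so shift all depths.
    shift = tree_to[ia][1] + 1 - tree_from[ip][1]
    moved = [(n, d + shift) for n, d in tree_from[ip:end]]
    new_from = tree_from[:ip] + tree_from[end:]
    new_to = tree_to[:ia + 1] + moved + tree_to[ia + 1:]
    return new_from, new_to


# Hypothetical forest of two trees.
tree_a = [(0, 0), (1, 1), (3, 2), (4, 2), (2, 1), (5, 2)]
tree_b = [(6, 0), (7, 1), (8, 1)]
new_a, new_b = prune_and_graft(tree_a, tree_b, prune_node=1, adjacent_node=7)
print(new_a)  # [(0, 0), (2, 1), (5, 2)]
print(new_b)  # [(6, 0), (7, 1), (1, 2), (3, 3), (4, 3), (8, 1)]
```

The pruned node stays the root of the moved subtree, and both results are again valid node-depth lists, so the same step can be repeated on them.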
This project was implemented on an Altera Cyclone II FPGA, running up to 512 nodes, and
later on a Stratix IV GX, reaching up to 1024 nodes. Further expansion was not possible
due to bus-width restrictions. To increase the size of solvable graphs, the system bus
was expanded to 64 bits.
Since most of the hardware resources of the platform are dedicated to Worker synthesis,
which increases linearly with the graph size, the Workers were moved to a separate set
of FPGAs, enabling scalability. The proposed platform is a star architecture consisting
of one Central FPGA, responsible for managing the spanning trees to be processed by the
Workers, and many Satellite FPGAs, each holding a number of Workers. Each Satellite FPGA
communicates with the Central FPGA over a dedicated link. Figure 1 shows how the logic
of each FPGA was implemented.
Figure 1. Central FPGA (left) and Sat. FPGA (right) organisations
For the validation on physical hardware, it was still necessary to develop a network
system for connecting the FPGAs. We connected the systems using high-speed 10 Gbps
Ethernet interfaces. A complete system, called Network Abstraction System (NAS), was
developed to abstract all network-layer aspects.
The NAS implements master and slave Avalon Memory-Mapped frontends, which are compatible
with the network ports of the previously described Central and Satellite modules. The
system converts the transactions to a streaming-based Avalon bus, which is used to
communicate with the Ethernet MAC. The Ethernet MAC and XAUI PHY operate on the
physical network layer.
The Central, Satellite and NAS projects were merged to validate the whole system. Two
Stratix V GX Development Kits (5SGXEA7) were used, connected through two Dual XAUI to
SFP+ HSMC boards and two full-duplex SFP+ Avago AFBR-703SDZ optical interfaces, with
speeds of up to 10 Gbps. Figure 2 shows the final validation system.
Figure 2. Final Validation Architecture [block diagram: a Central FPGA and a Satellite
FPGA, each attached to a PC; block labels include Central Cont., Worker Set, Flow
Contr., NAS, NIOS II, DDR3 Memory, On-chip RAM and JTAG]
3. Results
The Central and Satellite modules were first tested by simulation, where memory and
interconnection delays were not considered. Table 1 shows results from the simulation
and synthesis tools, where one local memory was used for each Sat. module,
$c^{c}_{elem}$ and $c^{s}_{elem}$ are the numbers of Logic Elements (ALUTs) used by the
Central and Satellite modules respectively, $f^{c}_{max}$ and $f^{s}_{max}$ are the
maximum frequencies (MHz) of the Central and Sat. modules, and the average time per
iteration (in seconds) is for solving graphs of up to 1024 nodes.
Table 1. Results for Central/Sat Simulation Model
Mem.  Sat. FPGAs  $c^{c}_{elem}$  $c^{s}_{elem}$  $f^{c}_{max}$ (MHz)  $f^{s}_{max}$ (MHz)  Avg. Time per Iter. (s)  Speedup
1     1           1552            56352           174.4                75.91                1.44e-07                 1x
1     4           2945            13956           124.42               122.43               8.34e-08                 1.73x
1     8           5051            7142            122.99               123.92               9.43e-08                 1.53x
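The Speedup column appears to be the average time of the single-Satellite configuration divided by the average time of each configuration. The short check below reproduces the column under that assumption, using only the values from Table 1; the variable names are chosen for this example.

```python
# Average time per iteration (s) from Table 1, keyed by number of Sat. FPGAs.
avg_time = {1: 1.44e-07, 4: 8.34e-08, 8: 9.43e-08}

baseline = avg_time[1]  # single-Satellite configuration
speedup = {sats: round(baseline / t, 2) for sats, t in avg_time.items()}
print(speedup)  # {1: 1.0, 4: 1.73, 8: 1.53}
```

Under this reading, the configuration with 8 Satellite FPGAs is slightly slower per iteration than the one with 4 in the simulation results.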
On the final hardware platform, the size of solvable graphs was increased to 4096
nodes; expansion to 8192 nodes was not possible due to a lack of available on-chip
memory. Figure 3 shows the average processing time on DNDEWG-64 (multi-FPGA approach),
NDEWG-32 (previous work) and an Intel Core 2 Quad Q6600 (2.40 GHz) with the operator
coded in C and running in parallel using OpenMP. Data for NDEWG-32 and the PC are
available only up to 512 nodes.
Figure 3. Average processing time between systems [plot: time per iteration (s) versus
number of nodes, for DNDEWG-64, NDEWG and PC]
The NDEWG shows better efficiency, since it is not affected by the latency generated by
the network interface. However, its scalability is strongly limited by its monolithic
nature, being restricted to the resources of only one FPGA. For 1024 nodes, the
DNDEWG-64 has an average of approx. 85 µs, whereas in simulation (1 memory, 1 Sat.
module) the average time was 0.144 µs. Thus the bottleneck is concentrated in the
network system, which must be investigated for optimisations.
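Taking the two averages quoted above at face value, the gap between the hardware measurement and the simulation can be expressed as a ratio. The snippet below is only this arithmetic, with variable names chosen for the example.

```python
hardware_time_s = 85e-6      # approx. average on DNDEWG-64 for 1024 nodes
simulation_time_s = 1.44e-7  # simulation average (1 memory, 1 Sat. module)

ratio = hardware_time_s / simulation_time_s
print(f"hardware is about {ratio:.0f}x slower than simulation")
```

Since the simulation did not consider memory and interconnection delays, a ratio of this size is consistent with the network system dominating the iteration time.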
4. Conclusion
Parallelised Evolutionary Algorithms are good candidates for optimising NP-Hard
problems. This kind of solution can be applied to real-world problems, such as electric
distribution, where complex networks can have up to 100,000 nodes. Much work must still
be done to improve the multi-FPGA approach. The NAS operates at a low speed compared to
the high-speed link, hence improvements to the system must be made and are objectives
for further research.
References
De Jong, K. A. (2006). Evolutionary Computation: A Unified Approach. MIT Press,
Cambridge.
De Lima, T. W., Rothlauf, F., and Delbem, A. C. (2008). The node-depth encoding:
analysis and application to the bounded-diameter minimum spanning tree problem. In
GECCO ’08: Proceedings of the 10th annual conference on Genetic and evolutionary
computation, pages 969–976, New York, NY, USA. ACM.
Delbem, A. C. B., De Lima, T., and Telles, G. P. (2012). Efficient forest data structure for
evolutionary algorithms applied to network design. IEEE Transactions on Evolutionary
Computation, PP(99):1.
Gois, M. M., Matias, P., Perina, A. B., Bonato, V., and Delbem, A. C. B. (2014). A
parallel hardware architecture based on node-depth encoding to solve network design
problems. International Journal of Natural Computing Research, 4(1):54–75.
Knowles, J. and Corne, D. (2000). A new evolutionary approach to the degree-constrained
minimum spanning tree problem. IEEE Transactions on Evolutionary Computation,
4(2):125–134.
Santos, A., Delbem, A., London, J., and Bretas, N. (2010). Node-depth encoding and
multiobjective evolutionary algorithm applied to large-scale distribution system
reconfiguration. IEEE Transactions on Power Systems, 25(3):1254–1265.
