Mapping applications onto FPGA-centric clusters by Guo, Anqi
Boston University
OpenBU http://open.bu.edu
Theses & Dissertations Boston University Theses & Dissertations
2020
https://hdl.handle.net/2144/40943
BOSTON UNIVERSITY
COLLEGE OF ENGINEERING
Thesis
MAPPING APPLICATIONS ONTO FPGA-CENTRIC
CLUSTERS
by
ANQI GUO
B.S., Lanzhou University, 2014
Submitted in partial fulfillment of the
requirements for the degree of
Master of Science
2020
© 2020 by
ANQI GUO
All rights reserved
Approved by
First Reader
Martin C. Herbordt, Ph.D.
Professor of Electrical and Computer Engineering
Second Reader
Tali Moreshet, Ph.D.
Senior Lecturer & Research Assistant Professor of Electrical and
Computer Engineering
Third Reader
Anthony Skjellum, Ph.D.
Professor of Computer Science and Engineering
University of Tennessee – Chattanooga
When pride comes, then comes disgrace, but with the humble is wisdom.
Proverbs 11:2
Acknowledgments
First and foremost, I would like to thank my advisor, Prof. Martin Herbordt. During
my Master's years working with him, he has offered me limitless help and guidance on
my work. I feel motivated because he has inspired me in many ways and understood every
detail of my project. I feel touched because he has been thoughtful about my personal
life. I feel honored and proud to have him as my advisor.
Secondly, I would like to thank my other committee members. Professor Tali
Moreshet has invested a significant amount of time on my thesis and addressed many
points that I missed. Professor Anthony Skjellum, who is also a collaborator on our
project, has provided brilliant ideas and helped me address several technical issues.
Thirdly, I would like to take this opportunity to thank my colleagues in the CAAD
lab, Dr. Chen Yang, Qingqing Xiong, Tianqi Wang, Tong Geng, Ahmed Sanaullah,
Rushi Patel, Robert Munafo, Chunshu Wu, Pouya Haghi, and Pierre-François Wolfe.
They offered a great deal of help and guided me in my work. I also want to thank
my friends who share the same lab with me for their support and encouragement, and
the friends with whom I have spent happy years during my Master's
studies.
Lastly, I want to thank my family for their unconditional and limitless support. I
thank my dad and mum for their love and for fully supporting my decisions in life. I
could not be who I am without my dearest family.
MAPPING APPLICATIONS ONTO FPGA-CENTRIC
CLUSTERS
ANQI GUO
ABSTRACT
High Performance Computing (HPC) is becoming increasingly important through-
out science and engineering as ever more complex problems must be solved through
computational simulations. In these large computational applications, the latency of
communication between processing nodes is often the key factor that limits perfor-
mance. An emerging alternative computer architecture that addresses the latency
problem is the FPGA-centric cluster (FCC); in these systems, the devices (FP-
GAs) are directly interconnected and thus many layers of hardware and software
are avoided. The result can be scalability not currently achievable with other tech-
nologies.
In FCCs, FPGAs serve multiple functions: accelerator, network interface card
(NIC), and router. Moreover, because FPGAs are configurable, there is substan-
tial opportunity to tailor the router hardware to the application; previous work has
demonstrated that such application-aware configuration can effect a substantial im-
provement in hardware efficiency. One constraint of FCCs is that it is convenient
for their interconnect to be static, direct, and have a two or three dimensional mesh
topology. Thus, applications that are naturally of a different dimensionality (have a
different logical topology) from that of the FCC must be remapped to obtain optimal
performance.
In this thesis we study various aspects of the mapping problem for FCCs. There
are two major research thrusts. The first is finding the optimal mapping of logical
to physical topology. This problem has received substantial attention by both the
theory community, where topology mapping is referred to as graph embedding, and
by the High Performance Computing (HPC) community, where it is a question of
process placement. We explore the implications of the different mapping strategies
on communication behavior in FCCs, especially on resulting load imbalance.
The second major research thrust is built around the hypothesis that applications
that need to be remapped (due to differing logical and physical topologies) will have
different optimal router configurations from those applications that do not. For ex-
ample, due to remapping, some virtual or physical communication links may have
little occupancy; therefore fewer resources should be allocated to them. Critical here
is the creation of a new set of parameterized hardware features that can be config-
ured to best handle load imbalances caused by remapping. These two thrusts form
a codesign loop: certain mapping algorithms may be differentially optimal due to
application-aware router reconfiguration that accounts for this mapping.
This thesis has four parts. The first part introduces the background and previous
work related to communication in general and, in particular, how it is implemented
in FCCs. We build on previous work on application-aware router configuration. The
second part introduces topology mapping mechanisms including those derived from
graph embeddings and a greedy algorithm commonly used in HPC. In the third part,
topology mappings are evaluated for performance and imbalance; we note that differ-
ent mapping strategies lead to different imbalances both in the overall network and in
each node. The final part introduces a reconfigurable router design that allocates resources
based on the different imbalance situations caused by different mapping behaviors.
Contents
1 Introduction
1.1 High Performance Computing and FPGA-Centric Clusters
1.2 Application Mapping in FCCs
1.3 Outline and Contributions
2 Background, Context, and Methods
2.1 HPC and FPGAs
2.2 Related Communication Background
2.2.1 Network Topology
2.2.2 Routing Algorithms
2.2.3 Switch Arbitration Policies
2.3 HPC Applications on FPGA-Centric Clusters
2.4 Basics of MPI
2.5 Router Architecture
2.5.1 Livelock and Deadlock
2.5.2 VC-based Wormhole Router
2.5.3 Wormhole Router with Advance Flow Control
2.5.4 Virtual Cut-Through Router
2.6 Related Work on Topology Mapping
2.7 Usage Models and Configurable Router Design Space
2.7.1 Soft Configuration
2.7.2 Fully Dynamic Configuration
2.7.3 Hard Configuration
2.7.4 Partial (Hard) Configuration
2.8 Experimental Setup
3 Topology Mapping
3.1 Communication Patterns
3.2 Packet Visualization
3.3 The Cut-and-Fold Mapping Algorithm
3.4 Topology Emulation Performance
3.4.1 Fold and Cut Performance
3.4.2 Comparison Between Graph Embedding and Greedy Algorithm
3.4.3 Resource and Bandwidth Imbalance
4 Dynamic Hardware Design
4.1 Virtual Cut-Through Router with Dynamic Buffers
4.2 Dynamic Buffer Design Performance Evaluation
4.2.1 Regular Patterns
4.2.2 Irregular Patterns
4.2.3 Resource Utilization Comparison
4.2.4 Impact of mapping on hardware usage
4.3 Switch Design Based on Imbalance Analysis
4.4 Summary
5 Conclusions and Future Work
5.1 Conclusion
5.2 Future Work
References
Curriculum Vitae
List of Tables
3.1 HPC Application Communication Pattern
3.2 Max Load Table using different mapping mechanism
3.3 Min Load Table using different mapping mechanism
List of Figures
1·1 Model Work Flow
2·1 indirect network
2·2 direct network
2·3 Mesh and Torus Topologies of two and three dimensions
2·4 2x2 grid
2·5 Packets on clockwise and counterclockwise direction
2·6 Forbidden turns in 3D-torus
2·7 Wormhole VC-based router architecture
2·8 virtual channel
2·9 Wormhole Router with Advanced Flow Control
2·10 Virtual Cut Through Router Architecture
2·11 Virtual Cut Through Router Architecture
2·12 Rectangular mesh into cube mesh (from Tvrdik (Tvrdik, 1999))
3·1 Packet Visualization shows network congestion situation of each cycle
3·2 Step 1: Fold
3·3 Step 2: Cut
3·4 Step 3: Z-Fold
3·5 communication intensity matrix and physical topology matrix
3·6 Generic Topology Mapping Strategies from Torsten Hoefler
3·7 2D to 3D Mapping Performance: All-to-All Batch Latency
3·8 2D to 3D Mapping Performance: All-to-All Throughput
3·9 2D to 3D Mapping Performance: Square Nearest Neighbor Batch Latency
3·10 2D to 3D Mapping Performance: Square Nearest Neighbor Throughput
3·11 Link usage under different fold and cut configurations
3·12 16x16 2D mapping to 8x8x8 3D, All to All and Square Nearest Neighbor latency comparison (X-axis: Packet size (flit size=16B), Y-axis: Latency)
3·13 16x16 2D mapping to 8x8x8 3D, All to All and Square Nearest Neighbor worst case comparison (X-axis: Packet size (flit size=16B), Y-axis: Latency)
3·14 Buffer Usage All to All node wave (X-axis: Cycle, Y-axis: Buffer Usage)
3·15 Buffer Usage Square NN node wave (X-axis: Cycle, Y-axis: Buffer Usage)
3·16 Overall Link Usage All to All (X-axis: Direction, Y-axis: Link Usage)
3·17 Overall Link Usage Square NN (X-axis: Direction, Y-axis: Link Usage)
3·18 Link Usage All to All node wave (X-axis: Cycle, Y-axis: Link Usage)
3·19 Link Usage Square NN node wave (X-axis: Cycle, Y-axis: Link Usage)
4·1 Dynamic Router Architecture
4·2 Dynamic Router Architecture
4·3 Basic Shared Buffer Block
4·4 Shared Buffer Pool
4·5 Dynamic VCT and VCT comparison running 3D all to all pattern (X-axis: Packet Size (flit=16B), Y-axis: Latency)
4·6 Dynamic VCT and VCT comparison running 3D cube nearest neighbor pattern (X-axis: Packet Size (flit=16B), Y-axis: Latency)
4·7 Dynamic VCT and VCT comparison running 2D all to all pattern (X-axis: Packet Size (flit=16B), Y-axis: Latency)
4·8 Dynamic VCT and VCT comparison running 2D square nearest neighbor pattern (X-axis: Packet Size (flit=16B), Y-axis: Latency)
4·9 Dynamic Buffer Usage All to All (with no mapping) node wave (X-axis: Cycle, Y-axis: Buffer Block Number (Block memory size=256B))
4·10 Dynamic Buffer Usage All to All node wave (X-axis: Cycle, Y-axis: Buffer Block Number (Block memory size=256B))
4·11 Dynamic Buffer Usage Square NN node wave (X-axis: Cycle, Y-axis: Buffer Block Number)
4·12 Buffer Usage All to All node wave (X-axis: Cycle, Y-axis: Buffer Usage)
4·13 Crossbar and Reduction tree based switch
4·14 Dynamic Buffer Usage fold and cut node wave (X-axis: Cycle, Y-axis: Buffer Block Number)
4·15 Reconfigure crossbar based on resource usage
List of Abbreviations
2D . . . . . . . . . . . . . Two Dimensional
3D . . . . . . . . . . . . . Three Dimensional
AMG . . . . . . . . . . . . . Algebraic Multigrid
ASIC . . . . . . . . . . . . . Application-Specific Integrated Circuit
BRAM . . . . . . . . . . . . . Block RAM
CPU . . . . . . . . . . . . . Central Processing Unit
DOR . . . . . . . . . . . . . Dimension Order Routing
DSP . . . . . . . . . . . . . Digital Signal Processor
FCC . . . . . . . . . . . . . FPGA-Centric Cluster
FIFO . . . . . . . . . . . . . First In First Out
FLOPS . . . . . . . . . . . . . Floating Point Operations per Second
FPGA . . . . . . . . . . . . . Field Programmable Gate Array
GPU . . . . . . . . . . . . . Graphics Processing Unit
HDL . . . . . . . . . . . . . Hardware Description Language
HLS . . . . . . . . . . . . . High-Level Synthesis
HPC . . . . . . . . . . . . . High Performance Computing
IC . . . . . . . . . . . . . Integrated Circuit
MD . . . . . . . . . . . . . Molecular Dynamics
MGT . . . . . . . . . . . . . Multi-Gigabit Transceiver
MPI . . . . . . . . . . . . . Message Passing Interface
NIC . . . . . . . . . . . . . Network Interface Card
RAM . . . . . . . . . . . . . Random Access Memory
RC . . . . . . . . . . . . . Routing Compute
VCT . . . . . . . . . . . . . Virtual Cut Through
VLSI . . . . . . . . . . . . . Very Large Scale Integration
Chapter 1
Introduction
1.1 High Performance Computing and FPGA-Centric Clusters
High Performance Computing (HPC) is becoming increasingly important through-
out science and engineering (NSF, 2016; DOE ASCR, 2017; DOE ASCR, 2018)
as ever more complex problems must be solved through computational simulations
(NSF CFD, 2014); these are often essential in augmenting the traditional methods
of experiment and mathematical theory (PITAC, 2005; Lindtjorn et al., 2011; Bailey
et al., 2002). There has been an immense amount of research supporting computational
simulation, including parallel algorithms, parallel programming, network-based
HPC environments, and communication mechanisms such as message passing. An important
mechanism in advancing HPC (as with computer systems in general) is codesign,
where applications push the progress of HPC cluster development and then
cluster advances lead to modifications and tuning of the applications (Shalf et al.,
2011; Barrett et al., 2013).
A critical issue in HPC architecture research has been the fact that while advances
in process technology may have slowed somewhat, they are far from over (Arden,
2002; IRDS, 2018). For this, and other technical reasons, processing nodes continue
to be ever more powerful with more devices and more complex internal interconnects
(Hager and Wellein, 2010). In contrast, communication mechanisms have not been
able to keep up: while bandwidth can be handled somewhat by shifting resources,
i.e., by spending more resources on the network, improving latency appears to be
intractable. As a result, many applications fail in strong scaling: as the HPC system
grows, there is more internal overhead and packets traverse longer distances making
the communication/computation imbalance even worse (Gropp et al., 1996).
Communication performance has thus become a bottleneck and is a critical research topic
(Expósito et al., 2013; Muhammed et al., 2020).
A continuing trend in HPC systems is that they are being built with more powerful
nodes whose increasing computational power (in FLOPS) is largely due to the use
of accelerators, especially GPUs. These integrated circuits (ICs) have more hardware
resources devoted to computation than CPUs and for many applications their use
results in increased performance. Recently another accelerator candidate, the FPGA,
has also been explored (Herbordt et al., 2007b; Herbordt et al., 2008a; VanCourt
and Herbordt, 2009). FPGAs have a number of distinguishing characteristics; we now
describe the two most important for this study. First, unlike Application-Specific In-
tegrated Circuits (ASICs), FPGAs can be reconfigured to fit the application rather
than vice versa. As a result, applications running on FPGAs often obtain high uti-
lization which can rarely be realized by other computing technologies. And second,
since high-end FPGAs have been used primarily as communication devices, they are
built with a large number of very high-speed communication ports (multi-gigabit
transceivers or MGTs). For example, certain current FPGAs have over 120 MGTs
each operating at 40Gbps (Xilinx, 2009; Intel, 2019).
These FPGA attributes–high-performance, low power, configurability, connectiv-
ity (Hauck and DeHon, 2010)–lead to unique usage scenarios. In particular, FPGAs
can perform the role of any system component: CPU, accelerator, NIC, and router.
While any specialized IC will do a better job than the FPGA for a particular task,
the fact that the FPGA can do them all means that the functionality of these diverse
tasks can be co-located on the same device (Munafo, 2018). FPGA-centric clusters
(FCCs) are computing systems that take advantage of this capability (Porrmann
et al., 2010; Putnam, 2014; George et al., 2016; Russinovich, 2017; Miyajima et al.,
2018; Plessl, 2018; Boku et al., 2019; Mondigo et al., 2020). In one common design,
FCCs have a number of standard nodes with FPGAs as the accelerator connected
to the CPU through PCIe. But the FPGAs themselves are interconnected with each
other through their MGTs (Sheng et al., 2014; Sheng et al., 2015b; George et al.,
2016). This allows packets to traverse the cluster without going through any other
device and cuts layers of hardware and software overhead (Sheng et al., 2016b; Sheng
et al., 2016a; Sheng et al., 2017; Sheng et al., 2018b; Sheng et al., 2018a). Point-
to-point communication is thus application-to-application and can take place in less
than 100ns (Sheng et al., 2015b).
To take maximal advantage of these direct connections, FCCs are generally built
with a direct network having a topology that matches the problems being studied,
generally a mesh/torus. Ideally, the dimension of the network also matches the prob-
lem, typically three; however, this also depends on the available connectivity of the
FPGA boards being used. The model application in our work is Molecular Dynamics
simulation (MD). While this application has many and diverse phases, an FCC with
a 3D torus interconnect appears to be preferred. A detriment of this fixed direct con-
nectivity is that physical rewiring is generally impractical and so some applications
will map better to a given FCC than others.
Not surprisingly, configurability provides a key benefit in FCC use. As is standard
when FPGAs are used as accelerators, they are reconfigured for each application. But
when the FPGA is also performing basic system functions–e.g., as controller, NIC,
and router–per-application reconfiguration of those parts is also beneficial. Note that
this latter reconfiguration of system functions is very different from the common
reprogramming of the device to implement an application: it is rather a change in
common system components that are invariably assumed to be fixed. Yet, it has
been found in previous work that different applications perform better with different
router designs (Sheng, 2017; Yang, 2019). For example, different applications prefer
different routing algorithms or arbitration policies, or might need less or more buffer
space than usual.
Another issue with FCCs is integration with standard communication middleware
so that the programmer does not need to worry about the underlying hardware. To
do this, HPC programmers almost always use an MPI framework to express and im-
plement communication. The function of MPI is to provide a standard portable user
interface that guarantees reasonable performance. Achieving these goals for new sys-
tems, in this case FCCs, requires implementing the basic operations and integrating
them into a standard package (e.g., MPICH or OpenMPI). Previous work has found
that FCCs, due to their close integration of communication and computation func-
tions, do this particularly efficiently (Xiong et al., ; Xiong et al., 2018; Xiong et al.,
2019; Xiong et al., 2020; Stern et al., 2017; Stern et al., 2018).
1.2 Application Mapping in FCCs
We now come to the problem addressed in this thesis. While an FCC is often configured
to suit certain applications, e.g., as a three dimensional torus for MD,
that configuration may not suit other applications. Applications that are naturally of a different
dimensionality (what we refer to as having a different logical topology) from the target
FCC must be remapped to obtain good performance. That is, logical nodes in the
application (e.g., MPI ranks) should be mapped to physical nodes in such a way as
to reduce additional communication overhead due to the logical/physical topology
mismatch. Note that this mapping is transparent to the program: an application
where the programmer has assumed a two dimensional mesh to facilitate communication
is not rewritten for the three dimensional physical network. Rather, the
underlying communication mechanism handles the remapping.
The general mapping problem arises in a number of areas in computer science,
including data structures to storage, VLSI layout (Leiserson, 1980), and processes to
distributed compute nodes (Kung and Stevenson, 1977). In the theory community the
mapping problem is abstracted into graph embeddings and the goal is to minimize such
quantities as dilation, the number of extra hops that a packet must travel due to the
embedding. Some early surveys include (Rosenberg, 1980; Monien and Sudborough,
1990); of particular interest to this thesis is mapping of meshes of different types onto
one another, e.g., (Aleliunas and Rosenberg, 1982; Ma and Tao, 1988; Ma and Tao,
1993). In HPC it is a question of process placement (Kung and Stevenson, 1977;
Pellegrini and Roman, 1996; Hoefler and Snir, 2011; Chung et al., 2011).
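To make the notion of dilation concrete, the following minimal Python sketch embeds a long 2D mesh into a cubic 3D torus in two ways and measures the result. It is an illustration under stated assumptions, not the mapping algorithm developed later in this thesis: the function names (`snake_fold`, `row_major`, `dilation`), the mesh shape (side² × side), and the cubic torus are all hypothetical choices. Dilation is computed here as the largest number of physical hops needed for one logical hop, so a value of 1 means no extra hops.

```python
from itertools import product


def torus_distance(a, b, dims):
    """Minimum hop count between nodes a and b on a torus with the given dimensions."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


def snake_fold(i, j, side):
    """Map node (i, j) of a (side*side) x side mesh onto a side^3 torus.

    Assumed illustrative mapping: the long dimension is folded back and
    forth, so logical neighbors across a fold end up one hop apart in z.
    """
    z, r = divmod(i, side)
    x = r if z % 2 == 0 else side - 1 - r
    return (x, j, z)


def row_major(i, j, side):
    """Assumed naive alternative: wrap the long dimension without reversing."""
    z, x = divmod(i, side)
    return (x, j, z)


def dilation(mapping, side):
    """Largest physical distance between images of logically adjacent mesh nodes."""
    dims = (side, side, side)
    width, height = side * side, side
    worst = 0
    for i, j in product(range(width), range(height)):
        for di, dj in ((1, 0), (0, 1)):
            ni, nj = i + di, j + dj
            if ni < width and nj < height:
                d = torus_distance(mapping(i, j, side), mapping(ni, nj, side), dims)
                worst = max(worst, d)
    return worst


if __name__ == "__main__":
    for name, mapping in (("snake_fold", snake_fold), ("row_major", row_major)):
        print(name, dilation(mapping, 4))
```

For a 16 × 4 mesh on a 4 × 4 × 4 torus, the folded mapping keeps every logical neighbor one physical hop away (dilation 1), while the unfolded wrap needs two hops at each slab boundary (dilation 2). The torus wrap-around link is what keeps the second value small; on a mesh without wrap-around links the penalty would grow with the side length.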
A critical observation is that many mappings cause load imbalance on some re-
source. For example, some physical links may need to handle communication traffic
from different numbers of logical links. It follows that mappings of different logical
topologies onto the same physical topology will cause different types of load imbal-
ance. Moreover, the choice of mapping algorithm will also cause different types of
load imbalance. Ideally, to handle this situation, the router would be configured so
that resources are applied where needed. Moreover, the router would be reconfig-
ured as this resource imbalance changes, e.g., due to a change in logical topology or
in mapping algorithm. Our thesis is that a parameterized family of router
designs can be created; and, based on the application mapping, the pre-
ferred router design can be selected, the FCC can be configured with that
design, and that the result is improved performance or reduced resource
consumption.
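The load imbalance described above can be illustrated with a small Python sketch that counts how many logical nearest-neighbor messages cross each physical link after remapping. Everything here is an assumption made for illustration: the folded mapping (`snake_fold`), the 16 × 4 logical mesh, the 4 × 4 × 4 torus, and the use of dimension-order routing with one message per logical link. It is not the simulator or the mapping used for the results in this thesis.

```python
from collections import Counter


def snake_fold(i, j, side):
    """Assumed mapping: fold the long mesh dimension back and forth across z."""
    z, r = divmod(i, side)
    x = r if z % 2 == 0 else side - 1 - r
    return (x, j, z)


def dor_route(src, dst, dims):
    """Yield the directed links (node, axis, step) of a dimension-order route on a torus."""
    cur = list(src)
    for axis, n in enumerate(dims):
        while cur[axis] != dst[axis]:
            forward = (dst[axis] - cur[axis]) % n
            # Take the shorter way around the ring; ties go in the + direction.
            step = 1 if forward <= n - forward else -1
            yield (tuple(cur), axis, step)
            cur[axis] = (cur[axis] + step) % n


def link_loads(mapping, side):
    """Count messages per physical link for one 2D nearest-neighbor exchange."""
    dims = (side, side, side)
    width, height = side * side, side
    loads = Counter()
    for i in range(width):
        for j in range(height):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < width and 0 <= nj < height:
                    src, dst = mapping(i, j, side), mapping(ni, nj, side)
                    for link in dor_route(src, dst, dims):
                        loads[link] += 1
    return loads


def usage_by_direction(loads):
    """Aggregate link loads into the six torus directions (axis, +1 or -1)."""
    usage = Counter()
    for (_node, axis, step), count in loads.items():
        usage[(axis, step)] += count
    return usage


def idle_links(loads, side):
    """Number of directed physical links that carry no traffic at all."""
    return 6 * side ** 3 - len(loads)


if __name__ == "__main__":
    loads = link_loads(snake_fold, 4)
    print(sorted(usage_by_direction(loads).items()))
    print("idle links:", idle_links(loads, 4))
```

In this configuration each x and y direction carries 48 messages while each z direction carries only 12, and 168 of the 384 directed links stay idle. A router that gives every port identical buffers and crossbar resources is therefore over-provisioned in z, which is the kind of imbalance a per-mapping router configuration could exploit.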
In this thesis we study various aspects of the mapping problem for FCCs. There
are two major research thrusts. The first is finding the optimal mapping of logical
to physical topology. As discussed, this problem has received substantial attention
by both the theory community, where topology mapping is referred to as graph em-
bedding, and by the High Performance Computing (HPC) community, where it is a
question of process placement. We explore the implications of the different mapping
strategies on communication behavior in FCCs, especially on resulting load imbal-
ance. Related to this research thrust is creating software support to perform the
mapping itself.
The second major research thrust is built around the hypothesis that applications
that need to be remapped (due to differing logical and physical topologies) will have
different optimal router configurations from those applications that do not. For ex-
ample, due to remapping, some virtual or physical communication links may have
little occupancy; therefore fewer resources should be allocated to them. Alterna-
tively, some applications may improve with some express connections that facilitate
long range communication. Critical here is the creation of a new set of parameterized
hardware features that can be configured to best handle load imbalances caused by
remapping.
While previous research in application-aware configuration of FCC routers has
concentrated on hardware changes, much previous router research has been done on
soft reconfiguration. In fact, it is common for routers to allocate resources as needed.
Examples include sharing buffer space among virtual channels (Nicopoulos et al.,
2006; Lu et al., 2007; Parik et al., 2008; Su et al., 2018), or dynamically allocating
link bandwidth (Shin and Daniel, 1996). Certainly only soft methods can work at
a fine-grained time scale. One interesting question is the hardware costs required to
implement soft methods and whether these can be eliminated with hard configuration.
In this thesis we explore both hard and soft methods.
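As a minimal sketch of the soft approach, the Python model below contrasts statically partitioned virtual-channel buffers with a shared pool. The class names, the one-slot-per-channel reservation, and the slot counts are illustrative assumptions; this is neither one of the cited designs nor the dynamic buffer presented in Chapter 4.

```python
class StaticBuffers:
    """Each virtual channel owns a fixed, equal share of the buffer slots."""

    def __init__(self, num_vcs, total_slots):
        self.capacity = total_slots // num_vcs
        self.used = [0] * num_vcs

    def push(self, vc):
        """Try to store one flit on channel vc; return True if it was accepted."""
        if self.used[vc] < self.capacity:
            self.used[vc] += 1
            return True
        return False

    def pop(self, vc):
        """Release one slot held by channel vc, if any."""
        if self.used[vc] > 0:
            self.used[vc] -= 1


class SharedBuffers:
    """All channels draw from one pool; a small reservation per channel
    (an assumed policy) keeps a busy channel from starving the others."""

    def __init__(self, num_vcs, total_slots, reserved=1):
        self.reserved = reserved
        self.shared_capacity = total_slots - reserved * num_vcs
        self.used = [0] * num_vcs

    def _shared_in_use(self):
        # Slots beyond a channel's reservation are charged to the shared pool.
        return sum(max(0, u - self.reserved) for u in self.used)

    def push(self, vc):
        """Try to store one flit on channel vc; return True if it was accepted."""
        if self.used[vc] < self.reserved or self._shared_in_use() < self.shared_capacity:
            self.used[vc] += 1
            return True
        return False

    def pop(self, vc):
        """Release one slot held by channel vc, if any."""
        if self.used[vc] > 0:
            self.used[vc] -= 1


def accepted(buffers, arrivals):
    """Number of flits accepted when the given channel ids arrive with no departures."""
    return sum(buffers.push(vc) for vc in arrivals)


if __name__ == "__main__":
    skewed = [0] * 10 + [1] * 2  # most traffic lands on virtual channel 0
    print("static:", accepted(StaticBuffers(4, 16), skewed))
    print("shared:", accepted(SharedBuffers(4, 16), skewed))
```

With 16 slots, four channels, and the skewed burst above, the static partition accepts 6 of the 12 flits while the shared pool accepts all 12. The price is the extra bookkeeping in `push`, which stands in for the hardware cost of soft methods raised in the paragraph above.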
These two thrusts, mapping and reconfiguration, form a codesign loop: certain
mapping algorithms may be differentially optimal due to application-aware router
reconfiguration that accounts for this mapping.
1.3 Outline and Contributions
This thesis has four parts. The first part introduces the background and previous
work related to communication in general and, in particular, how it is implemented
in FCCs. We build on previous work on application-aware router configuration. The
second part introduces topology mapping mechanisms including those derived from
graph embeddings and a greedy algorithm commonly used in HPC. In the third part,
topology mappings are evaluated for performance and imbalance; we note that differ-
ent mapping strategies lead to different imbalances both in the overall network and in
each node. The final part introduces a reconfigurable router design that allocates resources
based on the different imbalance situations caused by different mapping behaviors.
Figure 1·1: Model Work Flow
Figure 1·1 shows the basic workflow. As always in computer architecture, we begin
with the set of likely applications. Since this is much too large a set to be practical, we
analyze those applications and obtain characteristic communication patterns. These
patterns, which naturally have a logical topology embedded in them, provide the
input for the mapping algorithm. The other input at this stage is the topology and
the embedding method. In parallel, we develop a configurable hardware router model.
Since it is difficult to optimize both simultaneously, we begin with proxy metrics such
as link occupancy, latency, and resource utilization. Starting from a previous FCC router
design, hardware reconfiguration aims to mitigate imbalance issues. Depending on the
mapping behavior and communication pattern, workload varies with both
node location and time. We therefore generate a hardware configuration profile for each
node to save resources and improve network performance.
In FCCs we have an opportunity to make an application-aware design with configurable
devices. We use Molecular Dynamics as our model application and extract
communication patterns from real HPC applications. Experiments are run on our
extensible C++ cycle-accurate simulator, which exactly matches the hardware
behavior. In this design, the Virtual Cut-Through router architecture is selected and
implemented in the simulator because of its simple design, simple flow control, and better
handling of head-of-line blocking. As network performance
metrics, we use link usage, latency, and resource utilization. We measure the performance
of different mapping mechanisms, the resulting network imbalance, and the corresponding
hardware configurations.
We now summarize the contributions of this thesis.
1. New mapping algorithm. We propose a new fold-and-cut algorithm that
transforms a 2D topology into 3D. Graph embedding was well studied in the 1980s in
the theory community. Our approach is similar but with changes
to better fit our 3D torus FCC. Choosing the optimal number of folds and cuts increases
mapping efficiency for a given topology.
2. Comparison of mapping efficiency. The well-known greedy algorithm commonly
used in HPC systems is compared with our new fold-and-cut algorithm.
We observe that fold-and-cut takes user communication into account. When
communication relies more on locality, fold-and-cut performs better than the
greedy algorithm.
3. A parameterized router model. With a parameterized application-aware
model, choosing optimal parameters boosts communication performance
in comparison with the previous parameterized virtual cut-through router
design. We parameterize both the mapping algorithm and the hardware reconfiguration in order
to find the best network configuration.
4. Evaluation of the model with respect to mappings. Network performance
metrics including latency, link usage, and resource utilization are profiled to
indicate network performance and utilization. We evaluate at both the overall
scope and the local scope. Depending on the mapping mechanism and communication
pattern, performance is differentially affected. We find that load imbalance
becomes the significant issue that we need to solve. A general hardware architecture
and customized configuration for each node target both scopes of evaluation.
5. Hardware configurability corresponding to evaluation of mapping helps
performance of remapped applications. Hardware configuration is introduced
to target load imbalance and performance improvement. Dynamic buffers
address imbalance at the overall scope. Customized configuration of each node
further boosts performance and resource utilization.
Chapter 2
Background, Context, and Methods
In this chapter, we provide background and related work for this thesis and an
overview of the methods used. We start from the basic background of HPC systems
and the advantages of using reconfigurable devices. The basics of MPI are
also introduced and typical HPC applications are described. Related previous work
on FCC design and topology mapping is presented.
2.1 HPC and FPGAs
HPC has become a crucial part of science and engineering development and research.
As the performance demand of HPC applications increases, different types
of accelerators have been integrated into HPC systems. Accelerators are often used
to boost the cluster's compute performance, but their high power consumption
must be traded off against that gain. Meanwhile, with more powerful
compute units, data movement becomes the bottleneck of many applications. As a
result, compute efficiency, power consumption, and communication latency are three
major factors that limit HPC performance. FPGAs may be emerging as a device that
could address all these challenges.
FPGAs have certain advantages, especially when applications need tight coupling
between computation and communication. With a large number of configurable logic
units, Block RAMs, DSPs, and high-speed communication ports, FPGAs are widely
used in routers and in signal and image processing. Lately their use has expanded
into datacenters, especially for use in I/O devices and controllers, NICs, and in a
Bump-in-the-Wire (BitW) configuration. Offloading communication operations onto
FPGAs instead of software reduces communication overhead. Traditionally,
FPGAs have been programmed with Hardware Description Languages (HDLs)
to define the configuration file; lately the use of High-Level Synthesis (HLS) (Xilinx, 2018)
methods, including OpenCL (Altera, 2017), has become widespread, making it much
easier to reconfigure FPGAs (Yang et al., 2017a; Sanaullah and Herbordt, 2017;
Sanaullah and Herbordt, 2018b; Sanaullah and Herbordt, 2018c; Sanaullah and Her-
bordt, 2018a; Sanaullah et al., 2018a; Herbordt, 2019). Other work in programma-
bility and performance includes (Herbordt and VanCourt, 2005; VanCourt and Her-
bordt, 2005a; VanCourt and Herbordt, 2006a; VanCourt and Herbordt, 2006c; Khan
et al., 2011; Meng et al., 2016).
One FPGA characteristic especially relevant for this thesis is the availability of
Multi-Gigabit Transceivers (MGTs) in high-end FPGAs. MGTs are a high-speed
connection interface that enables inter-node communication with low latency and
high bandwidth (Altera, 2014; Intel, 2019; Xilinx, 2009). In FCCs, FPGAs are
connected to each other directly by physical links. With the switch inside the FPGA
and connections made directly over physical MGT links, the coupling of computation
and communication can serve the high-performance demands of HPC applications and,
especially, facilitate strong scaling.
2.2 Related Communication Background
An important part of this thesis work is about parameter space exploration in the
design of communication routers. We therefore go into some detail reviewing this
design space (Herbordt et al., 1999; Dally and Towles, 2004).
There are two types of networks: direct and indirect. A network whose endpoints
reside outside the network is an indirect network (Figure 2·1); there is a
clear boundary between compute nodes and network. In direct networks, endpoints
sit inside the network (Figure 2·2). Each node has associated with it both compute
and switching components.
Direct networks are appropriate for FCCs for two reasons. The first is that they
better take advantage of the great strength of FPGAs: the co-location of compute and
communication. The second is that many of the applications of greatest interest
for FCCs, those that struggle with strong scaling, have mesh logical topologies. Since
for meshes an indirect network makes no sense, again a direct network is the obvious
choice.
Figure 2·1: indirect network
2.2.1 Network Topology
Since FCCs are likely to have direct networks, we can safely ignore topologies that are
preferred for indirect networks such as fat trees and variants of the butterfly. For direct
networks the likely choices are variants of meshes where the primary design decisions
are about wrap-around and dimensionality. To constrain the scope of this thesis we
focus on the 3D torus. Three dimensions suit many spatially mapped applications; with
the torus the edge nodes are connected to each other, so performance is not sensitive
to tasks operating at the edge of the cube (Figure 2·3).
Figure 2·2: direct network
Figure 2·3: Mesh and Torus Topologies of two and three dimensions
2.2.2 Routing Algorithms
The space of routing algorithms is vast. In previous work we have found that a few
simple algorithms are most likely to be preferred in our configuration space. We now
briefly describe them.
Dimension Ordering Routing (DOR)
The first is simple dimension-order routing. In the 3D torus FCC network,
there are 6 variants of this algorithm (XYZ, XZY, YXZ, YZX, ZXY, ZYX). A packet
must finish traversing the first dimension before it is allowed to enter the second,
and then the third. Since there is no performance difference between these six
variants, we choose XYZ as our algorithm. DOR is an oblivious routing algorithm
that is simple and can be implemented at low cost.
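The XYZ ordering described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the function names, the coordinate tuples, and the tie-break toward the + direction when both ways around a ring are equally long are assumptions made for the example.

```python
def torus_step(src, dst, size):
    """Return +1, -1, or 0: the minimal direction along one ring of `size` nodes."""
    if src == dst:
        return 0
    forward = (dst - src) % size      # hops when travelling in the + direction
    backward = (src - dst) % size     # hops when travelling in the - direction
    # Assumption: ties (distance exactly size/2) are broken toward +.
    return 1 if forward <= backward else -1


def dor_next_hop(cur, dst, dims=(8, 8, 8)):
    """Next node under XYZ dimension-order routing, or None if cur == dst."""
    for axis in range(3):             # finish X first, then Y, then Z
        step = torus_step(cur[axis], dst[axis], dims[axis])
        if step != 0:
            nxt = list(cur)
            nxt[axis] = (cur[axis] + step) % dims[axis]
            return tuple(nxt)
    return None


def dor_path(src, dst, dims=(8, 8, 8)):
    """All nodes visited from src to dst, both included."""
    path = [src]
    while (nxt := dor_next_hop(path[-1], dst, dims)) is not None:
        path.append(nxt)
    return path
```

For example, `dor_path((0, 0, 0), (7, 1, 0))` uses the wraparound link in X before taking a single hop in Y.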
Orthogonal One-turn Routing (O1TURN)
This is another oblivious routing algorithm that is similar to DOR. In
the 3D torus network, DOR uses a single fixed dimension order out of the six possible.
O1TURN considers all six orders and randomly chooses one of them as the
packet’s routing path. The dimension order is determined before the packet is sent
out from the source. In order to avoid deadlock, some of the turns are forbidden, so
some of the orders are removed.
Randomized, Oblivious, Multi-phase, Minimal Routing (ROMM)
The ROMM algorithm randomly selects one or more intermediate nodes between the
packet’s source and destination. The intermediate nodes are chosen inside the minimal
submesh spanned by the source and destination, so the route through them remains
minimal. This algorithm adds randomization to the routing paths.
Randomized Load Balance Routing (RLB)
RLB is a non-minimal, oblivious routing algorithm that balances load across the
network. In DOR, a packet always travels each dimension in the
direction that gives the minimal path. In RLB, a packet has two
choices of direction in each dimension. To avoid adding to congestion
in the network, packets randomly choose one of the two directions so that network
imbalance is mitigated.
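As a rough sketch of the direction choice described above, the following picks + or - with equal probability in each dimension and reports the resulting hop count. The uniform choice, the function name, and the `(direction, hops)` return format are assumptions for illustration; published RLB variants may weight the choice by distance.

```python
import random


def rlb_directions(src, dst, dims=(8, 8, 8), rng=random):
    """For each dimension, pick + or - at random; return a list of (direction, hops)."""
    plan = []
    for s, d, n in zip(src, dst, dims):
        if s == d:
            plan.append((0, 0))               # nothing to do in this dimension
            continue
        direction = rng.choice((1, -1))       # assumption: both directions equally likely
        hops = (d - s) % n if direction == 1 else (s - d) % n
        plan.append((direction, hops))
    return plan
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible; a non-minimal choice in a ring of 8 costs `8 - minimal` hops.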
Adaptive Routing Algorithms
The algorithms introduced so far are unaware of congestion in the overall
network; they just follow their rules (obliviously), so load imbalance can become
an issue for the system. An adaptive routing algorithm instead collects local or global
information to decide the path of each packet, which is another approach to improving
communication performance. The information collected can include the surrounding nodes’
idle buffer space and VC occupancy. Based on this information, the least congested
node is the better choice for the packet’s next hop. Credit Count Adaptive Routing
(CCAR) is one such algorithm, implemented in NoC routers. At appropriate
cycles, the upstream node sends a credit packet to the downstream node with information
describing the state of the current node. When the downstream node receives the
credit packet, the best path can be decided based on this information.
2.2.3 Switch Arbitration Policies
When two packets contend for the same output port (or internal link), a switch
arbitration policy selects one of them. We describe the arbitration
policies in our design space.
Farthest First (FF)
The farthest first policy gives the highest priority to the packet that is farthest
from its source. The header contains the source and destination information, so every
time a packet enters a router, its priority can be calculated by the switch.
Oldest First (OF)
The oldest first policy chooses the packet that has spent the longest time in the network.
Dedicated bits in the header store the age of the packet, i.e., the time it has spent
inside the network. Based on previous work, a 16-bit age field is enough for our workloads.
Mixed
The farthest first and oldest first policies are combined. Farthest first is applied
first: the packet that is farthest from its source has the highest priority.
Among packets at the same distance, the oldest first policy is used to choose one of
them.
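The three policies can be expressed as different sort keys over the contending packets. The sketch below is illustrative only; the `Packet` fields and function names are hypothetical and stand in for the header fields the switch would read.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    hops_from_source: int   # distance already travelled from the source
    age: int                # cycles spent in the network so far


def farthest_first(candidates):
    """Winner is the packet farthest from its source."""
    return max(candidates, key=lambda p: p.hops_from_source)


def oldest_first(candidates):
    """Winner is the packet that has been in the network longest."""
    return max(candidates, key=lambda p: p.age)


def mixed(candidates):
    """Farthest first, with age used only to break distance ties."""
    return max(candidates, key=lambda p: (p.hops_from_source, p.age))
```

With packets at distances 3, 5, and 5 and ages 10, 2, and 7, oldest first picks the first packet while the mixed policy picks the third.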
2.3 HPC Applications on FPGA-Centric Clusters
An underlying assumption of this thesis is that FCCs are particularly well suited for
certain applications. In this section we very briefly describe methods and applications
of particular interest.
System studies. This includes basic work in programmability and performance
(Herbordt and VanCourt, 2005; VanCourt and Herbordt, 2005a; VanCourt and Her-
bordt, 2006a; VanCourt and Herbordt, 2006c; Khan et al., 2011; Meng et al., 2016),
and FPGA system design and architecture (Pascoe et al., 2010; Khan and Herbordt,
2012).
Previous Work on Molecular Dynamics Simulation on FPGAs. Surveys
include (Chiu et al., 2008; Sukhwani and Herbordt, 2009; Sukhwani and Herbordt,
2010a; Chiu and Herbordt, 2010b; Herbordt, 2013; Khan et al., 2013; Sukhwani and
Herbordt, 2014). The first generation of MD work uses FPGAs only for the range
limited force while using CPU for the rest of computation (Gu et al., 2006c) and
included several studies on datapath optimization (Gu et al., 2006a; Gu et al., 2006b;
Gu et al., 2008) and handling neighbor lists (Chiu and Herbordt, 2009; Chiu and
Herbordt, 2010a; Chiu et al., 2011). Work on the long range force has included
particle mapping (Sanaullah et al., 2016a; Sanaullah et al., 2016b), multigrid (Gu
and Herbordt, 2007b; Gu and Herbordt, 2007a), and the 3D FFT (Humphries et al.,
2014; Sheng et al., 2014). Other MD work with FPGAs has been on the bonded force
(Xiong and Herbordt, 2017) and complete FPGA integration (Yang et al., 2019b;
Yang et al., 2019a).
Previous Work on Accelerating Algebraic Multigrid on FPGAs. With
a configurable datapath and flexible memory, FPGAs are efficient devices for AMG
computing (Haghi et al., 2020). A novel and scalable architecture has been
proposed to obtain full utilization. Meanwhile, multi-FPGA work on AMG computing
is being conducted, with a focus on internode communication.
Previous Work on Accelerating Bioinformatics on FPGAs. This includes
studies of dynamic programming based algorithms (VanCourt and Herbordt, 2004;
VanCourt and Herbordt, 2007), heuristic sequence alignment such as BLAST (Her-
bordt et al., 2006; Herbordt et al., 2007a; Park et al., 2009; Park et al., 2010; Mahram
and Herbordt, 2010; Mahram and Herbordt, 2012b; Mahram and Herbordt, 2012a;
Mahram and Herbordt, 2015), multiple sequence alignment (Mahram and Herbordt,
2012b), and other string matching applications (Conti et al., 2004).
Other HPC work on FPGAs. Other HPC applications include Discrete Molec-
ular Dynamics (Model and Herbordt, 2007; Herbordt et al., 2008b; Herbordt et al.,
2009; Khan and Herbordt, 2011), Molecular Docking (VanCourt et al., 2004a; Van-
Court and Herbordt, 2005b; VanCourt and Herbordt, 2006b; Sukhwani and Herbordt,
2008; Sukhwani and Herbordt, 2010b; Landaverde and Herbordt, 2014), Microarray
Analysis (VanCourt et al., 2003; VanCourt et al., 2004b), Adaptive Mesh Refinement
(Wang et al., 2019b; Wang et al., 2019a), Sensing (Sheng et al., 2015a; Liu et al.,
2016; Yang et al., 2017b; Xiang et al., 2018), and Machine Learning (Liu et al., 2016;
Sanaullah et al., 2018b; Geng et al., 2018b; Geng et al., 2018a; Geng et al., 2019c;
Geng et al., 2019b; Geng et al., 2019a; Li et al., 2019; Shi et al., 2020).
2.4 Basics of MPI
HPC applications are written using MPI. Since MPI has certain characteristics that
influence how communication gets implemented, we give some background here.
HPC computing requires large numbers of nodes. MPI, a well-known
communication middleware, handles communication between the nodes.
Its efficiency and standardization make MPI the most popular programming model
for large computing systems; it supports point-to-point communication and collective
operations. MPI was developed in the 1990s. Before that, writing and running parallel
applications was a difficult and tedious task with no standard way of doing it. At that
time most of the applications were in science and research. In the message passing
model, an application passes messages among processes in order to perform
a task, which suits parallel programs well. For instance, a master
process can assign work to a slave process by passing it a description of
that work. By 1994, MPI-1, a well-defined standard for the message passing
interface, had been introduced to developers. As complete implementations
became available, MPI was widely adopted and became a standard method for writing
message passing applications.
In MPI, the predefined communicator MPI_COMM_WORLD defines the group of all nodes in
the MPI application, i.e., the world. A communicator is the context in which
nodes talk to each other and pass messages. Communicators prevent
messages from different contexts from interfering with each other. A rank identifies a
node within a communicator; it is the processor’s id in that communication world. To start
an MPI application, MPI_Init must be called to enable communication before any other
communication routines.
Point to Point operation Point-to-point operations are the most fundamental
message passing operations. The communication happens between
two nodes: one calls send and the other calls receive. A message sent by the
sender is composed of two parts: data
and a message envelope. The message envelope contains information such as source,
destination, tag, and communicator.
Collectives operation Collective operations are widely used not only because
they simplify programming but also because of their communication performance.
Collective communication is a method of communication that involves the
participation of all processes in a communicator. Collective
communication requires synchronization, meaning that all processes must reach the
same point in their code before they can proceed. Collective operations typically
include broadcasts and reductions. Instead of performing a large number of
point-to-point communications, data is moved along a specific path to gather or
broadcast information, which reduces communication
intensity.
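One common way to move data "along a specific path" for a broadcast is a binomial tree, in which every rank that already holds the message forwards it in each round. The sketch below only computes the send schedule; it is an illustrative assumption and does not describe what any particular MPI implementation does internally.

```python
def binomial_broadcast_schedule(num_ranks):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs.

    Rank 0 is the root. After round k, ranks 0 .. 2**(k+1) - 1 hold the message.
    """
    rounds = []
    have_data = 1                      # ranks 0 .. have_data-1 already hold the message
    while have_data < num_ranks:
        pairs = [(r, r + have_data)
                 for r in range(have_data)
                 if r + have_data < num_ranks]
        rounds.append(pairs)
        have_data *= 2
    return rounds
```

For 8 ranks the schedule has 3 rounds, whereas a root that sends to every other rank one after another needs 7 sequential sends.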
Topology Mapping A communicator describes a group of processes, but not every
process communicates with every other process. For example, in a computation
defined on a Cartesian 2D grid, each node communicates only with its
N/S/E/W neighbors. If MPI knows such information, it can
optimize the mapping of the virtual topology, renumbering the ranks to bring
communicating processes closer together in the physical topology and so achieve
better communication performance.
MPI provides two types of topologies: one based on Cartesian grids and the other
based on graphs. In a graph-based topology, vertices stand for processes and edges
connect processes that communicate with each other. A large number of applications are
based on grid topologies with two or more dimensions. Grid structures are mapped using a
row-major numbering system. Figure 2·4 shows an example of a 2x2 grid.
Figure 2·4: 2x2 grid
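The row-major numbering can be made concrete with a small sketch that converts between ranks and grid coordinates and lists a rank's grid neighbors. The helper names are hypothetical; in an MPI program the equivalent information comes from the Cartesian topology routines (for example MPI_Cart_create and MPI_Cart_coords).

```python
def rank_to_coords(rank, dims):
    """Row-major rank -> coordinates; the last dimension varies fastest."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def coords_to_rank(coords, dims):
    """Coordinates -> row-major rank."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + c
    return rank


def neighbors(rank, dims, periodic=False):
    """Ranks one step away along each dimension (wrapping only if periodic)."""
    coords = rank_to_coords(rank, dims)
    result = []
    for axis, extent in enumerate(dims):
        for step in (-1, 1):
            c = coords[axis] + step
            if periodic:
                c %= extent
            elif not 0 <= c < extent:
                continue                  # off the edge of a non-periodic grid
            nxt = coords[:axis] + (c,) + coords[axis + 1:]
            result.append(coords_to_rank(nxt, dims))
    return result
```

On the 2x2 grid, ranks 0 to 3 map to (0,0), (0,1), (1,0), and (1,1).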
2.5 Router Architecture
In this section, three router architectures and the flow control mechanism are
explained. These are again based on work done by Sheng (Sheng, 2017) and Yang (Yang,
2019). Two wormhole-style routers and a virtual cut-through router are covered in
this section. Based on FCC characteristics, the major function units are implemented in
the FCC router.
2.5.1 Livelock and Deadlock
Livelock
Livelock happens when packets keep moving through the network but make no
progress toward their destination. With the routing
algorithms introduced above, however, livelock does not happen. In DOR, for instance,
packets enter the next dimension only after they have finished the
previous dimension. Once a packet is injected into the network, its routing path
is determined and it is guaranteed to make progress toward the destination.
Deadlock
Deadlock happens when a sequence of packets is frozen because they are all waiting
for each other to release resources. This occurs when multiple packets
moving in similar directions form a dependency loop: each packet is waiting
for the packet ahead of it to release a resource while itself being waited on by the
packet behind it. Deadlock can be disastrous for the network: those packets will be
stuck inside the network and never reach their destinations. In our FCC 3D torus,
deadlock can happen in two ways: within a single dimension or across multiple dimensions.
Within a single dimension, the wraparound link forms a ring in every dimension.
Dally and Seitz (Dally and Seitz, 1987) proposed that this deadlock can be avoided by
using VCs and a dateline. We divide the packets into two classes, class 0 and class 1.
A packet may only enter the VCs or resources belonging to its class. To prevent a
dependency loop within a single dimension, no loop may form within the same class.
A dateline is therefore added on each torus ring between node N-1 and node 0. Each
packet is injected with class 0; when it passes the dateline, its class is changed to 1.
Figure 2·5 shows how the dateline works.
Figure 2·5: Packets on clockwise and counterclockwise direction
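The class change at the dateline can be traced with a short sketch of a single ring. The function names are hypothetical, and the sketch assumes that the class recorded for a hop is the class of the buffer the packet occupies on arrival.

```python
def ring_hop(node, direction, vc_class, n):
    """Advance one hop on an n-node ring; return (next_node, next_class).

    The dateline sits on the link between node n-1 and node 0, so it is
    crossed by a + hop out of n-1 or a - hop out of 0.
    """
    nxt = (node + direction) % n
    crossed = (direction == 1 and node == n - 1) or (direction == -1 and node == 0)
    return nxt, (1 if crossed else vc_class)


def classes_along_ring(src, dst, direction, n):
    """VC class used after each hop from src to dst in a fixed direction."""
    node, vc_class, used = src, 0, []     # every packet is injected as class 0
    while node != dst:
        node, vc_class = ring_hop(node, direction, vc_class, n)
        used.append(vc_class)
    return used
```

A packet going from node 6 to node 1 in the + direction of an 8-node ring switches to class 1 on the hop from 7 to 0; a packet that never crosses the dateline stays in class 0.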
Across multiple dimensions, forbidding certain turns at the routing
algorithm level is a better way to avoid deadlock. Once a packet has travelled in
certain directions, it is not allowed to turn into certain other directions. Six turns are
forbidden to break all possible dependency loops. Y- to X+ and Y- to X- are the two
forbidden turns that avoid deadlock in the XY plane. Z- to X+ and Z- to X- avoid
deadlock in the ZX plane. Z- to Y+ and Z- to Y- avoid deadlock in the ZY plane. As a
result, Z- is the last direction travelled by any packet, as Figure 2·6 shows.
Figure 2·6: Forbidden turns in 3D-torus
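The six forbidden turns listed above amount to a small lookup table. The following sketch checks whether a sequence of travel directions respects them; the string encoding of directions and the function names are assumptions for illustration.

```python
# (direction travelled so far, direction turned into), as listed in the text.
FORBIDDEN_TURNS = {
    ("Y-", "X+"), ("Y-", "X-"),   # XY plane
    ("Z-", "X+"), ("Z-", "X-"),   # ZX plane
    ("Z-", "Y+"), ("Z-", "Y-"),   # ZY plane
}


def turn_allowed(incoming, outgoing):
    """True unless the turn from `incoming` to `outgoing` is forbidden."""
    return (incoming, outgoing) not in FORBIDDEN_TURNS


def path_allowed(directions):
    """Check every consecutive pair of hops along a path."""
    return all(turn_allowed(a, b) for a, b in zip(directions, directions[1:]))
```

A path such as X+, Y-, Z- is accepted, while any path that leaves Z- for an X or Y direction is rejected.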
2.5.2 VC-based Wormhole Router
The VC-based wormhole router proposed by William Dally (Dally, 1992; Dally and Aoki,
1993) has been well studied. The wormhole router has the advantages of low resource
utilization and high throughput and has become an industry standard. This subsection
briefly introduces the principle and mechanism of the classic VC-based wormhole router.
The architecture is shown in Figure 2·7.
First, packets are broken into flits, which are the basic unit of flow control.
There are several types of flits: HEAD, BODY, TAIL, SINGLE, and
CREDIT. If a packet is large enough, it is broken into HEAD, BODY, and
TAIL flits. The HEAD flit contains the basic information of the packet including source,
destination, size, priority, and so on. The flits of a packet arrive at the downstream
node’s input port sequentially. When a HEAD flit arrives, the routing compute unit
applies the routing algorithm to the information it carries and
decides the next output direction. Each input port contains a set of virtual channels.
These channels hold the arriving flits until the output direction is
available for the current packet. A wormhole VC router tends to have small buffers
and therefore low resource utilization. A large packet with many flits is spread over
several nodes along its path when the virtual channel buffers are not big enough to hold
the whole packet. When blocking happens, the following flits are also blocked because
they cannot be sent to the next node, which lacks virtual channel buffer space. In
fact, the buffer inside a virtual channel can hold only one or two flits at a time.
This is why the VC-based wormhole router uses fewer resources.
On the switch side, when an output port is available to send out a new packet, the switch
arbiter arbitrates among the VCs and one of the VCs wins.
The selected VC then sends its flit to the switch output.
Figure 2·7: Wormhole VC-based router architecture
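The flit types named above can be illustrated by splitting a payload into fixed-size pieces. This is a simplified sketch: the 16-byte flit size and the function name are assumptions, the header fields carried by a real HEAD flit are omitted, and CREDIT flits (which belong to flow control) are not generated here.

```python
def packetize(payload, flit_bytes=16):
    """Split payload bytes into a list of (flit_type, chunk) tuples."""
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)] or [b""]
    if len(chunks) == 1:
        return [("SINGLE", chunks[0])]          # packet fits in one flit
    types = ["HEAD"] + ["BODY"] * (len(chunks) - 2) + ["TAIL"]
    return list(zip(types, chunks))
```

A 40-byte payload becomes HEAD, BODY, and TAIL flits, while a 10-byte payload becomes one SINGLE flit.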
The reason for having multiple virtual channels is to reduce blocking. A VC buffer
holds its flits while their output port is unavailable; with a single channel, the
input port would be unable to receive more packets even though the output port of
another incoming packet may be available. With more VCs, the router’s resources are
more fully utilized, as Figure 2·8 shows.
Figure 2·8: virtual channel
2.5.3 Wormhole Router with Advanced Flow Control
This design is similar to the first one but with some architecture optimizations. There
are two main improvements in this design, function unit rearrangement and advanced
flow control.
First, the design removes the input buffer and allocates each VC unit as a FIFO
buffer. In the previous design, only the head flit carries the routing information such
as source, destination, class, and so on, so the flits of a packet
must stay contiguous both on the physical link and inside the input buffer. As a
result, the transmission of a packet’s flits cannot be interleaved with flits belonging
to other packets. To mitigate the blocking issue, the input buffer needs to be large
enough to hold whole packets. Freeing the links that a blocked packet occupies along its
path thus requires a large input buffer, which consumes substantial resources.
Meanwhile, the input buffer also introduces head-of-line blocking: with multiple packets
stored in the input buffer, a block in one of the VCs freezes the input buffer, and the
packets behind, which would have had the chance to enter an unoccupied VC, cannot proceed.
In this design, the input buffer is completely removed and the VCs serve as the input
buffer slots. Meanwhile, we perform VC allocation ahead of time for the
downstream node. This preallocation is possible only in this design. In this way, the
router can recover the connection between upstream and downstream nodes.
When congestion happens, the router can give the link to other packets
that have a chance to enter other, unoccupied VCs. The interrupted transmission
can be resumed later because the downstream VC allocation id is known. Moreover, VC
buffers do not need to be large enough to hold whole packets, which saves resources
on the chip.
Figure 2·9: Wormhole Router with Advanced Flow Control
2.5.4 Virtual Cut-Through Router
Virtual cut-through is another widely adopted switching technique. Compared to the
previous wormhole routers, it takes more resources, but its simple flow
control and lower link congestion make it a promising router architecture. The
router stores a packet in the intermediate node when the next node is congested or
busy. The upstream node can send a packet as soon as the next node has enough space
for it, without waiting for the entire packet to arrive. Because
packets are stored whole in the buffer space, a credit flit only needs to tell the
upstream node how much space is left. Flow control is therefore simple
and efficient. When the next node is congested, the buffer space is large enough to store
the whole packet, which frees the physical links and gives way to other packets.
Figure 2·10: Virtual Cut Through Router Architecture
The architecture (Yang, 2019) is shown in Figure 2·10 and the input buffer
datapath is shown in Figure 2·11. The pipeline stages are similar to the previous designs;
the differences are the input buffer and the credit unit. All
incoming packets are stored in a RAM, and each address slot is large enough
to hold a whole packet. The Available Queue and the Empty Queue are two FIFOs that
track, respectively, the valid packets waiting to be sent out and the empty addresses
of the RAM. We call a slot in these queues a token. These FIFOs can be viewed as
tables that describe the status of the RAM. When a new packet arrives at the input port,
it traverses the routing compute unit and enters the packet RAM. The empty queue
pops a token containing the address for the packet, and the packet is stored
in the RAM at that address. At the same time, an available queue token stores the basic
information of the packet, including RAM address, packet flit id, size, and so on. After
the initialization stage, the empty queue is full and the available queue is empty
because there are no packets available for output.
Figure 2·11: Virtual Cut Through Router Architecture
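The interplay of the packet RAM and the two token FIFOs can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, a single available queue is used instead of one group per output port, and a token records only the RAM address and the packet size.

```python
from collections import deque


class InputBuffer:
    """Packet RAM managed by an empty queue and an available queue of tokens."""

    def __init__(self, slots):
        self.ram = [None] * slots
        self.empty_queue = deque(range(slots))   # after initialization: all addresses free
        self.available_queue = deque()           # and no packet is ready for output

    def store(self, packet):
        """Write an arriving packet; return False if no slot is free."""
        if not self.empty_queue:
            return False
        addr = self.empty_queue.popleft()        # pop a token holding a free address
        self.ram[addr] = packet
        self.available_queue.append({"addr": addr, "size": len(packet)})
        return True

    def send(self):
        """On a grant from the switch, pop the oldest ready packet (or None)."""
        if not self.available_queue:
            return None
        token = self.available_queue.popleft()
        packet = self.ram[token["addr"]]
        self.ram[token["addr"]] = None
        self.empty_queue.append(token["addr"])   # the address becomes an empty token again
        return packet
```

The number of tokens in the empty queue is the free space that a credit would report to the upstream node.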
The available queue is divided into 6 groups, one for each of the 6 output ports. Each
output direction can fetch the packets destined for it when it is available. The output
signal is sent from the switch. When the input buffer unit receives the signal from the
switch, the token is popped from the available queue and the packet is sent to the
corresponding output direction. At the same time, an empty token is generated to replace
the one popped from the available queue. These steps are
repeated for each new request from the switch. However, head-of-line blocking is
more likely in this design. When multiple input buffers request the
same output port, the packets queued behind the losing packets are frozen even if an
output port is available for them. To solve this, 10 registers called Peek Flits are
added, which hold a copy of the top packet of the available
queue. These packets are connected directly to the Switch Allocation Unit. When
the allocation unit signals a grant based on the result of the arbitration
policy, the matching packet is sent to the switch for output. In this way, we
get rid of head-of-line blocking.
2.6 Related Work on Topology Mapping
There has been a large amount of work on topology mapping, both in graph embedding
and in algorithms. Some influential early work and surveys include (Rosenberg, 1975;
Rosenberg, 1980; Aleliunas and Rosenberg, 1982; Monien and Sudborough, 1990).
Lecture notes by Tvrdik provide a particularly fine introduction (Tvrdik, 1999).
Embedding rectangular meshes into cube meshes is a promising way to embed a 2D
mesh into a 3D torus topology (Ma and Tao, 1988; Annexstein et al., 1990; Ma and
Tao, 1993; Obrenić et al., 1999). Given a 2D mesh, roll it as in Figure 2·12(b). This
yields a long rectangle, which is cut and piled up to get a cube as in (c).
Figure 2·12: Rectangular mesh into cube mesh (from Tvrdik (Tvrdik,
1999))
Another approach is proposed by Hoefler and Snir (Hoefler and Snir, 2011). The topology
mapping problem in general is NP-complete (although optimal solutions are
known for low-dimensional special cases). They propose efficient and fast heuristics
based on graph similarity and test their performance using real application
communication patterns. Their algorithms support heterogeneous networks
and show a reduction of congestion on different topologies and irregular
communication patterns. The greedy heuristic starts from some vertex, chooses the heaviest
vertex, and greedily maps the heaviest neighboring vertices to the heaviest connections.
Applying this procedure recursively yields the mapping. The second
algorithm, recursive bisection mapping, recursively splits along the minimum weighted
edge-cut into equal halves to determine the mapping.
SCOTCH (Pellegrini and Roman, 1996) is another software package and library for static
mapping. Static mapping means optimally assigning the communicating
processes of a parallel program onto a parallel machine so as to minimize the overall
execution time. The SCOTCH library defines an undirected source graph that represents
the virtual topology. Within the graph, each vertex and edge has a weight that
represents the computation and communication cost of the corresponding process
and link. The target physical topology is represented in a similar way,
with weights for the processing cost of the processors and the cost of communication.
The algorithm recursively partitions both the source virtual topology and the target
physical topology.
2.7 Usage Models and Configurable Router Design Space
Our overall premise is that performance can be improved (or cost in chip area reduced)
by configuring the router/switch with respect to the workload. There are several
mechanisms by which this configuration can be made, most of which are already well
studied (Dally and Towles, 2004; Sheng, 2017; Yang, 2019). They differ by the time-
scale of the configuration and by the type of change to the hardware that is made.
The time-scale also determines the usage model. We now categorize configuration
types and give examples of each.
2.7.1 Soft Configuration
Definition: We define soft configuration to be changes to the router that can be made
with no change to the hardware; i.e., only device memory is changed. Moreover, no
special hardware design is needed to effect this configuration, although not all switches
support each kind of soft configuration.
Examples: If the routing algorithm is implemented using table-based routing, then
a change of routing algorithm can be made by soft configuration. Dynamic
priority is another application of soft configuration: the switch can dynamically change
the priority of each direction based on buffer usage. Allocating priority to congested
directions at runtime is thus a soft method.
Timescale: Since only a few KB to MB of memory need to be loaded, soft configu-
ration can take place on the scale of microseconds.
Usage Model: The timescale indicates that changes can be made within the running
of an application without serious loss of performance. The scope of the change,
however, means that care would need to be taken. Changing a basic mechanism like
routing algorithm cannot be done within a single batch communication, but could be
done between communications, or especially, phases of an application.
2.7.2 Fully Dynamic Configuration
Definition: Fully Dynamic configuration refers to changes in how the router works
as determined by the underlying hardware design. This method is well-known and in
general use in routers of all types.
Examples: One example is the allocation of channel bandwidth to virtual channels.
Another is the amount of storage that is allocated to a virtual channel. We use fully
dynamic configuration in the dynamic version of the VCT router. This design
dynamically allocates input buffer sizes for each of the six input directions.
Timescale: The timescale is on the order of a few cycles.
Usage Model: Fully dynamic configuration is generally transparent to the rest of
the system, although one could easily imagine having modifiable parameters.
2.7.3 Hard Configuration
Definition: We refer to as Hard Configuration changes to the router that require
the FPGA (or analogous configurable hardware) to be reconfigured. This is the
underlying mechanism of some previous FCC router work (Sheng, 2017; Yang, 2019).
Examples: One example is the switching mechanism, such as how the internal chan-
nels are multiplexed between input and output. Another is the total buffer allocation.
Timescale: The timescale depends on the support given for reconfiguration by the
board or router design. It can vary from milliseconds to seconds.
Usage Model: If the timescale is on the order of seconds, then configuration is likely
only viable when an application is being loaded. If milliseconds, then finer gradations
are possible, such as between phases of an application.
2.7.4 Partial (Hard) Configuration
Definition: Partial Hard Configuration is a variation of Hard Configuration, but
done with partial reconfiguration (Vipin and Fahmy, 2018).
Examples: There are no implemented instances, but we could envision isolated
changes being made. Perhaps the amount of allocated storage could be changed in
this way.
Timescale: On the order of milliseconds.
Usage Model: This timescale is on the order of soft configuration and so would
have a similar usage scenario.
2.8 Experimental Setup
A cycle-accurate simulator was built to evaluate different architectural designs
and measure the performance of our FCC system. We have used as our starting point
previous work, especially by Sheng (Sheng, 2017) and Yang (Yang, 2019). In order
to evaluate the network performance of different designs and features
quickly and accurately, the simulator is written in C++. Implementing the system in HDL
would be laborious and hard to debug, with long simulation times. The simulator
exposes the details and accurately reflects the status of the router cycle by cycle. With
real parameters based on the capabilities of current FPGA devices, the torus size of
the FCC cluster is 8 by 8 by 8.
Each module is implemented as a C++ class, following the same hierarchy as the
real HDL system. To simulate the clock, the entire simulator runs inside a while
loop, and each iteration of the loop increments the cycle counter by one.
Each module has an input port and an output port, and modules are connected
through these ports to mimic the pipeline. The simulator uses the
“producer-consumer” model introduced in (Sheng, 2017). Each module has three
basic functions: Initialization, Consume, and Produce. At the beginning of the
simulation, the Initialization function assigns variables their proper values and
points pointers to the right variables. In each cycle, the Consume function first
fetches the value presented at the input port of the module’s pipeline stage. The
Produce phase follows: it evaluates the latched value, executes the module’s data
path, and updates the value of the output port. At the end of the cycle, the cycle
counter increments by 1 and the simulator moves to the next cycle. With the help
of the simulator, the performance of all candidate architecture designs and
features is evaluated.
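The Initialization, Consume, and Produce phases can be illustrated with a minimal sketch. The simulator itself is written in C++; the Python below is only an illustration, and the `Stage` class, its port attributes, and the `run` helper are hypothetical names that do not come from the simulator.

```python
class Stage:
    """Hypothetical pipeline module with one input and one output port."""

    def __init__(self, name):
        self.name = name
        self.initialize()

    def initialize(self):
        # Initialization: give every variable a defined starting value.
        self.upstream = None   # module whose output port feeds this one
        self.latched = None    # value captured during Consume
        self.out_port = None   # value visible to the next module

    def consume(self):
        # Consume: latch whatever the upstream output port currently holds.
        if self.upstream is not None:
            self.latched = self.upstream.out_port

    def produce(self):
        # Produce: evaluate the latched value and update the output port.
        # The "data path" here simply forwards the value unchanged.
        self.out_port = self.latched


def run(stages, inputs, cycles):
    """Drive a linear pipeline; return (cycle, value) pairs seen at its end."""
    for up, down in zip(stages, stages[1:]):
        down.upstream = up
    seen = []
    cycle = 0
    while cycle < cycles:
        # All modules consume before any module produces, so a value
        # advances exactly one stage per simulated cycle.
        for s in stages:
            s.consume()
        # The first stage has no upstream; feed it from the input list.
        stages[0].latched = inputs[cycle] if cycle < len(inputs) else None
        for s in stages:
            s.produce()
        if stages[-1].out_port is not None:
            seen.append((cycle, stages[-1].out_port))
        cycle += 1  # end of cycle: advance the cycle counter
    return seen


pipeline = [Stage("rc"), Stage("buffer"), Stage("switch")]
trace = run(pipeline, ["flit0", "flit1"], cycles=5)
```

Separating the consume pass from the produce pass is what keeps the model cycle-accurate in this sketch: no module can see a value that was produced in the same cycle.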
This work was done in the context of creating the Novo-G# testbed on a cluster
of 64 FPGAs (George et al., 2016). The simulator is fully validated on a 4-node
subsystem (Sheng, 2017; Yang, 2019). In order to test our proposed FCC cluster, the
network size of the simulator is set to 8 x 8 x 8. In our target system, each board
has a Stratix V FPGA that supports six Multigigabit Transceiver (MGT) links. The
target link bandwidth is 20 Gbps; with a 256-bit phit size, the MGT’s recovered user
input frequency is around 80 MHz. Our flit size is 128 bits, which makes the router’s
target frequency 160 MHz; this meets the place-and-route timing requirement.
Based on measurements from previous work, the link delay is set to 175 ns, which is
28 cycles.
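The clock and delay figures above follow from simple arithmetic, which the short check below reproduces. It assumes that one phit is transferred per MGT user clock and that each 256-bit phit is split into two 128-bit flits consumed one per router cycle; the variable names are illustrative only.

```python
LINK_BANDWIDTH_BPS = 20e9   # target link bandwidth from the text
PHIT_BITS = 256
FLIT_BITS = 128
LINK_DELAY_NS = 175

# One phit per MGT user clock: bandwidth divided by phit size.
mgt_user_clock_mhz = LINK_BANDWIDTH_BPS / PHIT_BITS / 1e6

# Assumption: a phit carries two flits, so the router runs at twice the
# MGT user clock (rounded to 80 MHz, as in the text).
flits_per_phit = PHIT_BITS // FLIT_BITS
router_clock_mhz = 80 * flits_per_phit

# Link delay in router cycles: delay [ns] * frequency [MHz] / 1000.
link_delay_cycles = LINK_DELAY_NS * router_clock_mhz / 1000
```

The exact MGT user clock is 78.125 MHz, which the text rounds to about 80 MHz.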
Chapter 3
Topology Mapping
As described in previous chapters, the topology of interest here is a 3D-torus. Molec-
ular Dynamics is a good example of an application that runs well on this architecture
as it assigns molecules to nodes based on their spatial location. However, not all ap-
plications follow 3D-torus topology. It would be expensive to rewire and reconfigure
the physical cluster topology to fit the application’s logical topology. Rather, the
application is mapped onto the physical cluster to maximize performance.
As early as the 1980s, topology mapping was already recognized as an important
problem. There appear to be two approaches to developing mapping strategies. One
emerges from the theory community and is based on abstract metrics such as dilation
and congestion. It gives provably optimal solutions, but is limited by the underlying
computational model. The other approach emerges from the HPC community and
has a more practical set of metrics. The drawback of this approach is that most HPC
networks nowadays are indirect “xyz-flies” or fat trees, rather than the meshes of
interest in the present study.
In this chapter, we investigate representative mapping mechanisms from both of
these communities. The goal is that users do not need to consider topology issues:
based on the user’s topology or application characteristics, the mechanism finds an
optimized way to map the application onto the physical cluster. The first mechanism
is an algorithm that we have developed as part of the thesis work: a “fold-and-cut”
mechanism that transforms 2D logical coordinates onto the 3D physical topology,
extending the work of, e.g., (Aleliunas and Rosenberg, 1982; Ma and Tao, 1988; Ma and
Tao, 1993) and the generic methods described by (Tvrdik, 1999). The second is a
greedy algorithm that maps the user’s application according to its communication
patterns so as to minimize the overall communication link usage (Hoefler and Snir,
2011).
We first motivate communication patterns of interest; then introduce a packet
visualization utility we have created to help the development of intuition; we follow
this by presenting the mapping algorithms, the new fold-and-cut, and the well-known
greedy algorithm; finally come experimental results for analyzing behaviors and com-
paring performance.
3.1 Communication Patterns
One drawback of the theory approach is that it is generally limited to static properties
of the topology. It would appear that this approach ignores critical information: how
the virtual links are actually used. To take this information into account, we must
first find the likely or common communication patterns. Some patterns are extracted
from real applications and some are synthetic ones intended to test our mapping
mechanism and dynamic hardware. We pick some typical ones as our experimental
communication patterns, as shown in Table 3.1.
The patterns most widely used in real HPC applications are All to All, Cube Nearest
Neighbor, and Nearest Neighbor. Take Molecular Dynamics (MD) as an example:
Cube Nearest Neighbor and All to All both derive from MD simulation. Cube Nearest
Neighbor is performed twice in each Range-Limited force evaluation iteration. In the
evaluation phase, each node broadcasts its particle information to its neighbor nodes, and
when each node finishes its calculation, data needs to be sent back from 16 nearest
neighbors. When performing Long Range potential evaluation on multiple FPGAs,
All to All is performed. Stencil computation needs Nearest Neighbor and 3-hop
Diagonal Nearest Neighbor. Bit Complement, Transpose, and Tornado are synthetic
patterns for evaluating router performance.
Table 3.1: HPC Application Communication Pattern
Pattern Name                      Pattern
All-to-All                        (x,y,z) → all the other nodes
Nearest Neighbor                  (x,y,z) → (x+1,y,z), (x-1,y,z), ..., (x,y,z-1)
Cube Nearest Neighbor             (x,y,z) → [x+1,x-1][y+1,y-1][z+1,z-1]
Transpose                         (x,y,z) → (z,y,x)
3-hop Diagonal Nearest Neighbor   (x,y,z) → (x+1,y+1,z+1), (x+1,y+1,z-1), ..., (x-1,y-1,z-1)
Tornado                           (x,y,z) → (x,y+YSIZE/2-1,z)
Bit Complement                    (x,y,z) → (bitcomplement(x,y,z))
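As an illustration of how the patterns in Table 3.1 translate into destination lists, the following sketch generates destinations for a source node on the 8 by 8 by 8 torus. The function names are hypothetical. Two points are assumptions rather than statements from the table: coordinates wrap around modulo the torus size, and Bit Complement inverts each coordinate over its three bits. Cube Nearest Neighbor is omitted because the table does not spell out its exact neighbor set.

```python
from itertools import product

SIZE = 8  # nodes per dimension, matching the 8 by 8 by 8 torus


def wrap(v):
    # Assumption: coordinates wrap around, as on a torus.
    return v % SIZE


def nearest_neighbor(x, y, z):
    # Six destinations, one hop along each axis in each direction.
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [(wrap(x + dx), wrap(y + dy), wrap(z + dz)) for dx, dy, dz in steps]


def diagonal_3hop(x, y, z):
    # Eight destinations at the corners of the surrounding cube.
    return [(wrap(x + dx), wrap(y + dy), wrap(z + dz))
            for dx, dy, dz in product((1, -1), repeat=3)]


def transpose(x, y, z):
    return (z, y, x)


def tornado(x, y, z):
    return (x, wrap(y + SIZE // 2 - 1), z)


def bit_complement(x, y, z):
    # Assumption: each coordinate is complemented over log2(SIZE) bits.
    mask = SIZE - 1
    return (x ^ mask, y ^ mask, z ^ mask)


def all_to_all(x, y, z):
    return [n for n in product(range(SIZE), repeat=3) if n != (x, y, z)]
```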
3.2 Packet Visualization
To show the workload and state of the network, we have created a visualization
toolkit that monitors the network, shown in Figure 3·1. The toolkit shows the
network congestion in each cycle. As the figure shows, we divide the 3D cluster
into eight 2D meshes, each of which shows the link usage of one layer. The cross
and dot show the usage status of the Z-dimension links. Green, yellow, and red
indicate the level of congestion, and the number next to each link indicates the
number of packets traversing that link. With the packet visualization toolkit,
communication patterns are visualized and network congestion is shown directly.
With this support, it is easier to improve the communication mechanism and boost
communication performance.
3.3 The Cut-and-Fold Mapping Algorithm
A great diversity of user application topologies can be mapped onto the physical
topology; as our example we map a lower-dimensional (2D) topology onto the
higher-dimensional (3D) one. When the user’s topology is known but the
communication characteristics are not, a graph embedding mechanism is suitable.
Different mapping behaviors lead to different load and communication imbalance. The
evaluation of the mapping strategies and of the imbalance issue is presented in the
last section of this chapter.
Figure 3·1: Packet Visualization shows network congestion situation
of each cycle
Graph Embedding
In this section, we propose a generalized mapping of a 2D logical topology onto the
3D physical topology using the fold-and-cut mechanism. When the user supplies a
lower-dimensional (2D) logical topology, a middleware software layer needs to map the
user’s topology onto our physical 3D-torus cluster. There are two reasons for
implementing this. First, the 2D logical topology is larger than one layer of the 3D
cluster, so we cannot directly map the 2D mesh onto a single layer. The mechanism
handles this by compacting the 2D mesh as much as possible so that it fits into the
3D physical topology. Second, by converting 2D logical coordinates to 3D physical
coordinates, the FCC router infrastructure can be used directly, and the two extra
Z-dimension links provided by the FCC hardware make better performance achievable.
The graph mapping is carried out in a fold-and-cut manner. Because we do not know
the communication pattern characteristics of the user’s application and have only
the user’s logical topology, the transformation needs to preserve the
nearest-neighbor connections of the original logical topology as much as possible.
The first step is roll and fold; Figure 3·2 shows a 3-fold. Based on the size of
our physical cluster, the software simulation can find an appropriate number of
folds, which can also be chosen by the user. A smaller number of folds is
recommended in order to preserve spatial locality for the next steps.
Figure 3·2: Step 1: Fold
The second step is to cut (see Figure 3·3): the folded blocks are cut evenly into
smaller blocks. As before, the number of cuts is calculated by the simulator
by minimizing the overall neighbor distance.
The last step is to pile up those small blocks. The piling is done in a Z-fold
manner so that nearest neighbors stay adjacent wherever possible, or at least close
(see Figure 3·4).
With this approach, we preserve the original logical topology as much as possible. We
can vary the number of cuts and folds, which gives different performance. The more
cube-like the transformed topology is, the better the performance. Meanwhile, fewer
cuts give better performance, because more cuts stretch the distance between the
nodes along the cutting edges. The example we take is a 16 by 16 2D mesh mapped onto
the 8 by 8 by 8 3D physical torus. The number of folds varies from 2 to 4 and the
number of cuts varies from 1 to 3.
Figure 3·3: Step 2: Cut
Figure 3·4: Step 3: Z-Fold
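To make the fold, cut, and pile steps concrete, the sketch below shows one possible coordinate transformation. It is an interpretation under stated assumptions, not the implementation used in this work: folding is applied along x in a serpentine fashion, cutting is applied along y, and every other block is flipped when piled so that the two sides of a cut share the same y coordinate. The function names `fold_and_cut`, `torus_distance`, and `mean_neighbor_distance` are hypothetical.

```python
from math import ceil


def fold_and_cut(x, y, width, height, folds, cuts):
    """Map 2D mesh coordinate (x, y) to a 3D coordinate (px, py, pz)."""
    # Fold: split the x axis into `folds` strips and stack them,
    # reversing every other strip so nodes at a crease stay adjacent.
    strip = ceil(width / folds)
    layer, off = divmod(x, strip)
    px = off if layer % 2 == 0 else strip - 1 - off

    # Cut: split the y axis into `cuts + 1` blocks of equal height.
    block_h = ceil(height / (cuts + 1))
    block, yoff = divmod(y, block_h)
    # Pile: stack the blocks along z, flipping every other block (Z-fold)
    # so the two sides of a cut share the same y coordinate.
    py = yoff if block % 2 == 0 else block_h - 1 - yoff
    pz = block * folds + layer
    return px, py, pz


def torus_distance(a, b, dims):
    """Hop count between two nodes on a torus with the given dimensions."""
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def mean_neighbor_distance(width, height, folds, cuts, dims=(8, 8, 8)):
    """Average physical hop count between nodes adjacent in the 2D mesh."""
    total = count = 0
    for x in range(width):
        for y in range(height):
            here = fold_and_cut(x, y, width, height, folds, cuts)
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < width and ny < height:
                    there = fold_and_cut(nx, ny, width, height, folds, cuts)
                    total += torus_distance(here, there, dims)
                    count += 1
    return total / count


# Fold/cut pairs within the ranges given above, for the 16 by 16 mesh.
for f, c in [(2, 1), (3, 1), (4, 1), (2, 2), (2, 3)]:
    print(f, c, round(mean_neighbor_distance(16, 16, f, c), 3))
```

Under these assumptions, neighbors across a crease stay one hop apart, while neighbors across a cut end up `folds` layers apart in z, which is consistent with the observation that more cuts stretch the distance between nodes along the cutting edges.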
Greedy embedding algorithm
When the application’s communication pattern is known, we can map nodes onto the
physical topology based on the communication intensity between nodes, but finding
the optimal mapping is an NP-complete problem. We therefore use a greedy algorithm
inspired by the work of Hoefler (Hoefler and Snir, 2011). Since the communication
intensity is known, the algorithm greedily maps the most intensely communicating
nodes as near to each other as possible.
We start by generating a communication intensity table, which records how many
times each pair of nodes communicates (Figure 3·5). The table is an adjacency
matrix in which each entry is the number of times the two nodes communicate. An
application may have hybrid communication patterns; we combine them to
achieve the best overall performance.
Figure 3·5: communication intensity matrix and physical topology
matrix
Given the communication intensity, we can greedily map the logical topology.
Starting with some vertex in H (the physical topology graph), the algorithm chooses
one of the heaviest vertices in G (the communication graph) and maps it to the
closest available neighbor. The process then repeats recursively. The
greedy algorithm is a general solution to topology mapping. The detailed algorithm
is shown below (Figure 3·6).
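The greedy idea can be sketched in a few lines. This is a simplified variant written for illustration, not the exact algorithm of Hoefler and Snir reproduced in Figure 3·6: it places the heaviest vertex first, then repeatedly takes the unmapped vertex with the most traffic to already placed vertices and puts it on the free physical node with the lowest traffic-weighted hop count. The function names, the tie-breaking rules, and the example traffic values are all assumptions.

```python
from itertools import product


def torus_distance(a, b, dims):
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def greedy_map(traffic, num_nodes, dims):
    """traffic[(u, v)] = number of times logical nodes u and v communicate."""
    # Build a symmetric communication intensity matrix.
    w = [[0] * num_nodes for _ in range(num_nodes)]
    for (u, v), count in traffic.items():
        w[u][v] += count
        w[v][u] += count
    free = sorted(product(*(range(d) for d in dims)))
    mapping = {}

    # Start with the heaviest logical vertex at an arbitrary physical node.
    first = max(range(num_nodes), key=lambda v: (sum(w[v]), -v))
    mapping[first] = free.pop(0)

    while len(mapping) < num_nodes:
        unmapped = [v for v in range(num_nodes) if v not in mapping]
        # Next: the unmapped vertex with the most traffic to mapped ones.
        nxt = max(unmapped,
                  key=lambda v: (sum(w[v][u] for u in mapping), sum(w[v]), -v))

        def cost(node):
            # Traffic-weighted hop count to already placed partners.
            return sum(w[nxt][u] * torus_distance(node, mapping[u], dims)
                       for u in mapping)

        best = min(free, key=lambda n: (cost(n), n))
        free.remove(best)
        mapping[nxt] = best
    return mapping


# Hypothetical 4-node application on a 2 by 2 by 2 torus.
traffic = {(0, 1): 100, (1, 2): 10, (2, 3): 50, (0, 3): 1}
placement = greedy_map(traffic, num_nodes=4, dims=(2, 2, 2))
```

In this example the heavily communicating pairs (0, 1) and (2, 3) end up on adjacent physical nodes, regardless of how the logical nodes were originally numbered.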
Figure 3·6: Generic Topology Mapping Strategies from Torsten Hoefler
3.4 Topology Emulation Performance
3.4.1 Fold and Cut Performance
In the previous section, we proposed the fold-and-cut algorithm to map the
user’s 2D mesh topology onto the 3D physical torus cluster. We use our 8 by 8 by 8
FCC simulator to evaluate the performance of different combinations of cuts and
folds, with the coordinates transformed automatically. We implemented all-to-all and
square nearest neighbor as our communication patterns, because these two
patterns represent global and local communication, respectively, and could have a
significant effect on the overall performance. The performance is tested on three
router designs: Baseline, Wormhole router, and Virtual Cut Through.
For the all-to-all communication pattern, the performance is shown in
Figures 3·7 and 3·8.
The latency and throughput results show that the three router designs behave
similarly, so the router architecture does not significantly affect the performance
of cut and fold. For batch latency, 4-fold and 1-cut gives the lowest latency across
all designs; 3-fold and 1-cut performs better in Designs 1 and 3, with 4-fold and
1-cut the best in Design 2.
For the square nearest neighbor pattern, the performance is shown in
Figures 3·9 and 3·10. 2-fold and 1-cut always gives the lowest latency, with a similar
trend in all three router designs.
In the all-to-all pattern, each node sends packets to every other node in the
cluster, so spatial locality is less important. The more cube-like the mapped
topology is, that is, the more densely the logical nodes are packed and the closer
they are to each other, the better the performance. The more densely the nodes are
packed, the more each node can use the extra Z links, which do not exist in the 2D
logical topology, to send packets along shorter paths. This is why more folds and
cuts boost performance. Nevertheless, cut and fold has a limit: too many cuts or
folds make the cube too flat, or prevent it from fitting into our 8 by 8 by 8 FCC
cluster.
Figure 3·7: 2D to 3D Mapping Performance: All-to-All Batch Latency
(2D size 16*16; panels: Design 1, Design 2, Design 3; x-axis: f=2/c=1, f=3/c=1,
f=4/c=1, f=2/c=2, f=2/c=3; series: DOR, ROMM, CCAR, O1TURN, and RLB, each with
FF, OF, and MIX)
In the square nearest neighbor pattern, fewer cuts and folds give the best
performance, because with fewer operations on the logical 2D topology, more of the
original topology’s characteristics are preserved. Square nearest neighbor depends
more on spatial locality.
Figure 3·8: 2D to 3D Mapping Performance: All-to-All Throughput
(2D size 16*16; panels: Design 1, Design 2, Design 3; x-axis: f=2/c=1, f=3/c=1,
f=4/c=1, f=2/c=2, f=2/c=3; series: DOR, ROMM, CCAR, O1TURN, and RLB, each with
FF, OF, and MIX)
Fold and Cut Link Usage Imbalance
Embedding the 2D logical topology into the 3D physical torus topology increases the
dimension and provides two extra links in the Z dimension. These extra Z links give
more flexibility in packet routing and more bandwidth, so performance improves.
However, the nodes and the communication are not uniformly distributed, which leads
to link usage imbalance.
Figure 3·9: 2D to 3D Mapping Performance: Square Nearest Neighbor Batch Latency
(2D size 16*16; panels: Design 1, Design 2, Design 3; x-axis: f=2/c=1, f=3/c=1,
f=4/c=1, f=2/c=2, f=2/c=3; series: DOR, ROMM, CCAR, O1TURN, and RLB, each with
FF, OF, and MIX)
Figure 3·11 depicts the measured link usage. For each pattern and each fold-and-cut
configuration, we collect the average link usage across all the links that
are used at least once during the routing process. From the figure, we find that for
patterns like all-to-all and nearest neighbor, the link usage is imbalanced. The figure
also shows why we chose all-to-all and nearest neighbor as our example communication
patterns: these two patterns not only represent the global and local communication
cases but also exhibit the most imbalance.
Figure 3·10: 2D to 3D Mapping Performance: Square Nearest Neighbor Throughput
(2D size 16*16; panels: Design 1, Design 2, Design 3; x-axis: f=2/c=1, f=3/c=1,
f=4/c=1, f=2/c=2, f=2/c=3; series: DOR, ROMM, CCAR, O1TURN, and RLB, each with
FF, OF, and MIX)
Figure 3·11: Link usage under different fold and cut configurations
3.4.2 Comparison Between Graph Embedding and Greedy Algorithm
For these two mapping mechanisms, we pick two typical patterns on the 2D topology,
all-to-all and square nearest neighbor. All-to-all reflects global communication
because each node sends packets to every other node. With our fold-and-cut
mechanism, the all-to-all pattern can make use of the extra Z links that do
not exist in the 2D topology. In the square nearest neighbor pattern, each node
sends packets to its 8 nearest surrounding nodes in the 2D square. This is
the weak point of the mapping mechanism, because cutting extends the distance between
neighbor nodes on either side of the cut.
The greedy algorithm maps nodes greedily based on the distance between nodes.
Regardless of the original 2D topology, it only keeps the most expensive links or the
most intensively communicating nodes as close as possible. The greedy algorithm
therefore fits better than fold-and-cut for applications in which specific nodes
communicate intensively, or for physical topologies in which the link weights are
not all the same.
Figure 3·12 shows the latency of the fold-and-cut graph embedding and of the greedy
algorithm for the 2D all-to-all and square nearest neighbor patterns. In both
patterns, graph embedding has lower latency than the greedy algorithm, and the
difference increases as the packet size grows. All-to-all is a global pattern in
which packets take advantage of the extra Z links, which do not exist in the 2D
topology. The square nearest neighbor pattern, by contrast, sends packets only to
neighbors, and fold-and-cut breaks the original topology and extends the distance
between nodes on the cutting side. We therefore expected the difference to be
smaller in the square nearest neighbor pattern, yet the greedy algorithm can fare
worse there than in the all-to-all pattern. A possible reason is that the greedy
algorithm does not take the 2D topology into account and only considers mapping the
most expensive nodes as close as possible. Because these patterns depend on the
nodes’ spatial locality in the 2D topology, graph embedding can perform better than
the greedy algorithm.
Figure 3·13 shows the worst-case comparison of the two mapping mechanisms. The
worst case follows the same trend as the latency comparison between the two
mechanisms. This supports our observation that, because the greedy algorithm
does not take topology into account, regular spatial patterns perform worse with the
greedy algorithm than with graph embedding.
The greedy algorithm would perform better on specific physical clusters or for
specific applications. When the link weights are not all the same, it may take
longer to traverse some of the links. Likewise, in some applications communication
is not regular: some nodes communicate more intensively than others. The greedy
algorithm is also suitable for clusters running multiple applications, where mapping
applications and nodes with the greedy algorithm would give better performance.
Figure 3·12: 16x16 2D mapping to 8x8x8 3D, All to All and
Square Nearest Neighbor latency comparison (X-axis: Packet size (flit
size=16B), Y-axis: Latency)
Figure 3·13: 16x16 2D mapping to 8x8x8 3D, All to All and
Square Nearest Neighbor worst case comparison (X-axis: Packet size (flit
size=16B), Y-axis: Latency)
3.4.3 Resource and Bandwidth Imbalance
Different mapping mechanisms affect resource and link usage. Each node’s buffer
usage and link usage reflect the resource and bandwidth imbalance. Table 3.2 and
Table 3.3 give the max load, the min load, the position of the max-load node, and
the position of the min-load node for specific communication patterns and mapping
mechanisms. From the tables, we observe that the max load, the min load, and the
corresponding locations vary a lot. For the 2D All to All pattern, the result shows
the same trend as before: the max load of fold and cut, at node (0,0,0), is about 37
percent lower than that of the greedy algorithm at node (0,7,0). The min load of fold
and cut at node (0,0,0) is much larger than that of the greedy algorithm at node
(4,6,0). Fold and cut therefore evens out the workload across nodes, whereas the
variance of the workload under the greedy algorithm is much larger, with hot spots at
different positions in the network. For the 2D square nearest neighbor pattern, the
greedy algorithm’s max load is 250 percent larger than fold and cut’s. The min load
under greedy is lower than under fold and cut as well, following the same rule as the
all-to-all pattern.
Table 3.2: Max Load Table using different mapping mechanism
Pattern Name    Mapping Mechanism   Packet Size   Max Load   Max Position
2D All to All   Fold and Cut        1*8           2560       (0,0,0)
2D All to All   Greedy              1*8           4064       (0,7,0)
2D Square NN    Fold and Cut        4*8           192        (0,0,1)
2D Square NN    Greedy              4*8           672        (0,7,0)
Table 3.3: Min Load Table using different mapping mechanism
Pattern Name    Mapping Mechanism   Packet Size   Min Load   Min Position
2D All to All   Fold and Cut        1*8           1536       (0,0,0)
2D All to All   Greedy              1*8           8          (4,6,0)
2D Square NN    Fold and Cut        4*8           96         (0,0,0)
2D Square NN    Greedy              4*8           32         (5,2,0)
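The relative differences implied by Tables 3.2 and 3.3 can be checked with a few lines of arithmetic. The helper names below are illustrative; the load values are copied from the two tables.

```python
def percent_larger(a, b):
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


# (fold and cut, greedy) loads copied from Tables 3.2 and 3.3.
max_load = {"2D All to All": (2560, 4064), "2D Square NN": (192, 672)}
min_load = {"2D All to All": (1536, 8), "2D Square NN": (96, 32)}

for pattern, (fold_cut, greedy) in max_load.items():
    print(pattern, "greedy max load is",
          round(percent_larger(greedy, fold_cut), 2), "% larger")


def spread(pattern, column):
    """Max load minus min load (column 0 = fold and cut, 1 = greedy)."""
    return max_load[pattern][column] - min_load[pattern][column]
```

For both patterns the spread between max and min load is several times larger under the greedy mapping than under fold and cut, which is the imbalance the text describes.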
Resource Usage Imbalance
To show the resource usage imbalance over time, we take the max-load nodes from the
previous tables: node (0,0,0) for fold and cut and node (0,7,0) for greedy, both
under the all-to-all pattern. As Figure 3·14 shows, the different mappings make each
node’s usage vary a lot. With fold and cut, all directions follow almost the same
trend over time, while with greedy each direction fluctuates irregularly. Fold and
cut uses less buffer space than greedy, which agrees with the result from the
previous section. Between cycles 0 and 2000, node (0,0,0) under the all-to-all
pattern uses more buffer space in xneg and xpos than in the other directions. We
could use this behavior to predefine each direction’s buffer size in order to save
chip resources. At node (0,7,0) under the all-to-all pattern, after around 3500
cycles the zneg direction barely uses its input buffer, so the buffer resources
allocated to zneg can be freed to save chip area. For the square nearest neighbor
pattern, we can follow the same rule as for all-to-all to optimize our router.
Comparing overall buffer resource usage, fold and cut uses much less buffer resource
than greedy, and its variance over time is smaller.
Figure 3·14: Buffer Usage All to All node wave (X-axis: Cycle, Y-axis:
Buffer Usage)
Bandwidth Usage Imbalance
To characterize bandwidth, we use link usage to measure how busy each output link
is. Following the results of the previous section, we choose the all-to-all and
square nearest neighbor patterns to represent the global and local communication
intensity of the network. From Figure 3·16 and Figure 3·17 we find that, for the
all-to-all pattern, the two mechanisms have similar overall link usage but different
imbalance among the six directions. For the square nearest neighbor pattern,
however, fold and cut clearly has lower link usage in all six directions than
greedy, which also means fold and cut has better mapping efficiency. Moreover, fold
and cut shows much less usage imbalance.
Figure 3·15: Buffer Usage Square NN node wave (X-axis: Cycle, Y-axis:
Buffer Usage)
Figure 3·16: Overall Link Usage All to All (X-axis: Direction, Y-axis:
Link Usage)
Figure 3·17: Overall Link Usage Square NN (X-axis: Direction, Y-axis:
Link Usage)
Figure 3·18 shows the link usage for the all-to-all pattern at node (0,0,0) with fold
and cut and at node (0,7,0) with greedy mapping. We choose these two nodes because,
according to the previous results, they have the most intense link usage, which also
shows that different mapping behavior moves the hot spot in the network. For fold
and cut, after 200 time units the xneg link usage drops significantly, to near zero.
After 250 time units, the link usage across directions becomes more imbalanced, which
makes xpos more congested. With greedy, the six directions all follow the same trend
of link usage and are more balanced than with fold and cut.
Figure 3·18: Link Usage All to All node wave (X-axis: Cycle, Y-axis:
Link Usage)
Figure 3·19 depicts the link usage for the square nearest neighbor pattern. Node
(0,0,0) shows quite imbalanced link usage across the six directions. Between time
units 0 and 15, xneg and yneg are more heavily used than the other directions. With
greedy, node (0,7,0) shows the six directions following the same trend, with more
balanced intensity.
Figure 3·19: Link Usage Square NN node wave (X-axis: Cycle, Y-axis:
Link Usage)
Chapter 4
Dynamic Hardware Design
Under different mapping behaviors, resource and link usage vary with node
position and with time. Our hypothesis is that we can change the hardware
configuration both across nodes and over time. As examples, we consider different
parameters (buffer size, packet priority in switch arbitration, and switch design)
and see how they affect performance for different mapping mechanisms and
communication patterns. We implemented the hardware router design with both hard and
soft configuration methods. Soft methods can work at a fine-grained time scale but
incur hardware costs that can be eliminated with hard configuration.
4.1 Virtual Cut-Through Router with Dynamic Buffers
To allocate resources in the router dynamically and so deal with load imbalance,
we can use the prediction given by the algorithm of the previous section.
Furthermore, when a basic input buffer unit reaches its threshold, more resources
are assigned to that direction. In this work, we design a dynamic input buffer
architecture that is able to allocate resources based on the current traffic (see
Figure 4·1). The Virtual Cut-Through switching technique is used in our
architecture.
Figure 4·1: Dynamic Router Architecture
Router Design
Before discussing the router design with adaptive buffer allocation, we start with
the conventional router. In this work, we choose Virtual Cut-Through (VCT) as our
router design; it is a widely used switching technique. Each input port has its own
input buffer, which stores blocked packets at an intermediate node when the required
downstream channel is not available. As long as the downstream node has enough
buffer space, the upstream node can forward the packet without waiting for the
entire packet to arrive. Hence, the input buffer plays a key role in the routing
process.
Routing compute (RC), the input buffer, and switch allocation are the main
components of the VCT router. The RC unit determines where packets are forwarded
based on the routing algorithm. The input buffer is connected to RC and buffers the
incoming packets. The switch and switch allocator follow the input buffer and use a
round-robin arbitration scheme.
For this design, we use the Dimensional Order Routing (DOR) algorithm, also referred
to as XYZ routing. DOR fits our current design because the route is predetermined,
which the buffer allocation prediction algorithm can exploit.
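A minimal sketch of dimension-order (XYZ) routing on a torus illustrates why the route is predetermined: the output direction at every hop depends only on the current node and the destination. The function names are hypothetical. Taking the shorter way around each ring, with the positive direction on a tie, is an assumption and not a documented property of the FCC router.

```python
def dor_next_hop(current, dest, dims=(8, 8, 8)):
    """Return (direction, next_node) for one XYZ routing step, or None."""
    names = ("x", "y", "z")
    for axis, (c, d, size) in enumerate(zip(current, dest, dims)):
        if c == d:
            continue  # this dimension is done; move on to the next one
        forward = (d - c) % size  # hops needed when going in the + direction
        # Assumption: take the shorter way around the ring, + on a tie.
        step = 1 if forward <= size - forward else -1
        nxt = list(current)
        nxt[axis] = (c + step) % size
        return names[axis] + ("pos" if step == 1 else "neg"), tuple(nxt)
    return None  # already at the destination


def dor_route(src, dest, dims=(8, 8, 8)):
    """List of output directions a packet takes from src to dest."""
    path, node = [], src
    while (hop := dor_next_hop(node, dest, dims)) is not None:
        path.append(hop[0])
        node = hop[1]
    return path
```

Because the full list of directions is known as soon as the source and destination are known, a prediction algorithm can count in advance how many packets will arrive on each input direction of each node.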
This type of architecture handles balanced workloads quite efficiently, when the
traffic rate is nearly the same in the different channels. However, the
communication patterns of different applications are not evenly distributed. From
the previous chapter, we know that the workload can be quite imbalanced, so
dynamically allocating resources to congested channels is a way to mitigate the
imbalance and boost communication performance. For instance, when our prediction
control logic detects a resource bottleneck, or an input buffer threshold has been
reached, more buffer space is assigned to that channel. Once the congested channel’s
usage drops, the allocated resources are released so that they can relieve
congestion in other channels.
We therefore propose a dynamic resource allocation architecture in the next
section to deal with the load-imbalanced communication patterns of different
applications.
Adaptive Buffer Router Architecture
Figure 4·2: Dynamic Router Architecture
A fixed VCT router architecture cannot deal with load imbalance. Moreover, because
buffer space is limited, making maximal use of the buffer resources is crucial
for performance.
In this work, the router is a fully flexible architecture that can allocate an
available buffer block from a shared buffer pool to a congested channel that needs
more buffer space (see Figure 4·2). The dynamic VCT design can be treated as soft
configuration: control logic dynamically allocates resources to congested directions
based on their needs at run time.
For this purpose, a shared buffer pool is created to serve every channel’s input
buffer. When a channel’s input buffer needs more space, a request is sent to the
buffer control module. The control module then assigns an available shared buffer
block to the requesting channel.
Figure 4·3: Basic Shared Buffer Block
Initially, each input buffer is assigned a basic input buffer unit whose size can
handle a light workload. When one of the directions becomes congested, it requests
more buffer space from the shared buffer pool.
Figure 4·4: Shared Buffer Pool
The shared buffer pool consists of basic shared buffer blocks, each of which is
composed of a fixed number of BRAMs (see Figure 4·3). Using a fixed number of BRAMs
per block makes the block easier to control and the design simpler. For each basic
shared buffer block, a 6-in, 1-out MUX controlled by the buffer control determines
which input direction is using the block (see Figure 4·4). The buffer control
contains a table for each input channel that records which basic buffer blocks have
been assigned to that channel. The load imbalance changes over time, so after a
congested channel finishes sending its blocked packets, the buffer blocks it
requested are released back to the shared buffer pool for future congestion.
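The bookkeeping performed by the buffer control can be modeled in software as follows. This is a behavioral sketch only, not the hardware design: the `BufferControl` class, its method names, the direction labels, and the pool size of four blocks are assumptions chosen for illustration.

```python
DIRECTIONS = ("xpos", "xneg", "ypos", "yneg", "zpos", "zneg")


class BufferControl:
    """Hypothetical model of the buffer control and shared buffer pool."""

    def __init__(self, num_blocks):
        # mux_select[b] is the direction using block b, or None if free.
        self.mux_select = [None] * num_blocks
        # One table per input channel listing its assigned blocks.
        self.table = {d: [] for d in DIRECTIONS}

    def request(self, direction):
        """Assign a free block to a congested channel; False if none left."""
        for block, owner in enumerate(self.mux_select):
            if owner is None:
                self.mux_select[block] = direction
                self.table[direction].append(block)
                return True
        return False

    def release(self, direction):
        """Return all blocks of a channel once its congestion has cleared."""
        for block in self.table[direction]:
            self.mux_select[block] = None
        self.table[direction] = []

    def free_blocks(self):
        return self.mux_select.count(None)


ctrl = BufferControl(num_blocks=4)
ctrl.request("xpos")
ctrl.request("xpos")
ctrl.request("zneg")
```

Each entry of `mux_select` plays the role of the select signal of one 6-in, 1-out MUX, and the per-direction tables correspond to the tables kept by the buffer control.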
We use credit-based flow control. The payload of a credit flit is the number of
basic shared buffer blocks still available. The six input channels share the same
shared buffer pool, so a threshold is set for the case in which packets from all six
channels contend for the last basic buffer block in the shared buffer pool. A credit
flit is generated every several cycles and sent to the downstream nodes. Whenever a
packet is received at the current node, the credit control module subtracts 1 from
the credit value. Backpressure is generated when the threshold is reached and is
sent to the six downstream nodes at the same time. When a downstream node receives
the backpressure credit flit, switch arbitration for that output, and injection, are
paused.
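The credit mechanism can be sketched as a small counter model. The class and method names, the dictionary used to represent a credit flit, and the specific credit, threshold, and period values in the usage note are assumptions made for illustration; the actual credit flit format is not specified here.

```python
class CreditControl:
    """Hypothetical credit counter for one node's shared buffer pool."""

    def __init__(self, credits, threshold, period):
        self.credits = credits      # buffer blocks still available
        self.threshold = threshold  # backpressure when credits <= threshold
        self.period = period        # cycles between regular credit flits

    def on_packet_received(self):
        # Each received packet consumes one credit.
        self.credits -= 1

    def on_block_released(self):
        # A block returned to the shared pool restores one credit.
        self.credits += 1

    def tick(self, cycle):
        """Return the credit flit to send this cycle, or None."""
        backpressure = self.credits <= self.threshold
        if backpressure or cycle % self.period == 0:
            # The payload is the remaining credit; the flag asks neighbors
            # to pause arbitration and injection toward this node.
            return {"credits": self.credits, "backpressure": backpressure}
        return None
```

For example, with `CreditControl(credits=3, threshold=1, period=4)`, a regular credit flit is sent at cycle 0, and after two packets have been received the next call to `tick` returns a flit with the backpressure flag set.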
The switch connected to the input buffers uses the same switch arbitration policy as
VCT. The switch sends a signal to each input direction’s input buffer to select the
desired packets from the peek flits, each of which holds a copy of a packet with a
different output direction.
In conclusion, with this mechanism we are able to dynamically allocate input
buffer resources based on both the predicted traffic and the state of the current
node’s input buffers.
4.2 Dynamic Buffer Design Performance Evaluation
We divide the evaluation of dynamic VCT and VCT into two parts: performance on
regular patterns and performance on irregular patterns. Dynamic VCT can
dynamically allocate buffer resources to different link directions based on the
node’s run-time link usage, whereas in VCT each direction’s buffer size and
resources are fixed. We therefore expect dynamic VCT to perform better than VCT
on irregular patterns. Irregular patterns arise when a 2D topology is mapped onto
a 3D physical cluster, or in some specific applications. The regular patterns are
those shown previously in Section 3.1.
The results show that dynamic VCT still needs improvement, which we leave to
future work. In the regular patterns, dynamic VCT’s latency is larger than VCT’s,
and its worst case is larger as well. Head-of-line blocking is still the
bottleneck that constrains the dynamic version’s performance. However, dynamic VCT
is better able to deal with irregular patterns: even though its worst case is
larger, its latency is lower than VCT’s. As a result, if we can improve the
dynamic VCT architecture, it could achieve better performance than VCT even in
regular patterns.
4.2.1 Regular Patterns
In this section, we use all to all and cube nearest neighbor as our comparison
patterns. These two patterns are representative of global communication and
spatially local communication, respectively.
First, in the regular all to all pattern, Figure 4·5 shows that dynamic VCT
latency is larger than VCT latency. At packet workload 12, VCT is 15 percent
faster than the dynamic buffer design, and its worst case is 26 percent better
than that of the dynamic version. Dynamic VCT is only an initial version, so
there is considerable room for improvement.
Figure 4·5: Dynamic VCT and VCT comparison running 3D all to all pattern
(X-axis: Packet Size (flit = 16B), Y-axis: Latency)
The same trend appears in the cube nearest neighbor pattern: dynamic VCT latency
is worse than VCT latency. In the worst case, however, at packet workload 24 VCT
is much worse than the dynamic version.
In the regular 3D all to all and 3D cube nearest neighbor patterns, dynamic VCT
latency is worse than VCT latency. One reason is that all packets in the current
basic buffer block must be sent out, leaving the block empty, before the produce
block moves on to the next basic block. It is possible that the current produce
block holds no packets for the desired output while the next or following blocks
do. Head-of-block blocking happens in this situation, and it is the main factor
limiting performance. The next improvement to dynamic VCT is therefore to let the
switch search the following blocks for the desired packets. We believe that once
the head-of-block issue is solved, performance can be better than that of the VCT
design.
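The head-of-block situation and the proposed remedy can be illustrated with a short Python sketch. The function name, the list-of-lists representation of buffer blocks, and the lookahead parameter are assumptions introduced for illustration; they do not describe the implemented switch.

```python
from typing import List, Optional, Set, Tuple

# Each block is a list of the output directions its buffered packets want.
Block = List[str]


def select_packet(blocks: List[Block], free_outputs: Set[str],
                  lookahead: int = 0) -> Optional[Tuple[int, int]]:
    """Return (block index, packet index) of a packet that can be sent, or None.

    blocks[0] is the current produce block. With lookahead=0 only that block is
    searched, which models the behavior described in the text. A larger
    lookahead models the proposed improvement of searching following blocks.
    """
    for b, block in enumerate(blocks[: lookahead + 1]):
        for p, wanted in enumerate(block):
            if wanted in free_outputs:
                return (b, p)
    return None
```

With blocks [["xpos", "xpos"], ["yneg"]] and only the yneg output free, the call with lookahead=0 finds nothing to send even though a sendable packet is buffered, whereas lookahead=1 finds the packet in the second block.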
Figure 4·6: Dynamic VCT and VCT comparison running 3D cube nearest neighbor
pattern (X-axis: Packet Size (flit = 16B), Y-axis: Latency)
4.2.2 Irregular Patterns
In this section, we analyze the performance of dynamic VCT and VCT running
irregular patterns. Irregular patterns arise in specific applications or from
2D-to-3D topology embedding. Here we use the 2D-to-3D graph embedding mechanism
with the all to all and square nearest neighbor patterns to test the performance
of dynamic VCT and VCT.
Figure 4·7 shows the latency comparison between dynamic VCT and VCT running the
2D all to all pattern. In the worst case, dynamic VCT is much worse than VCT,
because the dynamic VCT design still has considerable room for improvement.
However, the latency of dynamic VCT is 15 percent better than that of VCT. As a
result, the dynamic buffer is better able to deal with irregular patterns and
load-imbalanced applications.
Figure 4·7: Dynamic VCT and VCT comparison running 2D all to all pattern
(X-axis: Packet Size (flit = 16B), Y-axis: Latency)
Figure 4·8 shows the performance when running the 2D square nearest neighbor
pattern. Here dynamic VCT performs better than VCT. 2D square nearest neighbor is
a more imbalanced pattern than all to all, so the advantage of dynamically
allocating resources shows up: when a specific link direction is more congested,
more resources, such as buffer space, are allocated to it. In conclusion, dynamic
VCT performs better than VCT in irregular patterns.
Figure 4·8: Dynamic VCT and VCT comparison running 2D square nearest neighbor
pattern (X-axis: Packet Size (flit = 16B), Y-axis: Latency)
4.2.3 Resource Utilization Comparison
With the support of the dynamic buffer, dynamic VCT reduces resource utilization
by allocating more resources to the congested direction, whereas VCT only
supports a fixed buffer size for each of the 6 directions. Figure 4·9 depicts the
dynamic buffer usage of node (0,0,0) running the all to all pattern with no
mapping. At around cycle 4000, the total number of blocks required by the six
directions is 20, which is the size of the shared buffer pool. With this
combination of parameters, assigning 20 blocks of block memory is enough for node
(0,0,0). In the VCT design, however, the input buffers of all six directions are
fixed, so the buffer size must satisfy the direction with the highest demand. At
around cycle 4000, the direction with the highest demand is yneg, with 10 blocks
of memory. As a result, 6 (directions) x 10 = 60 blocks of memory are required.
Dynamic VCT therefore needs 20 blocks where VCT needs 60, saving a large amount
of resources and chip area.
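The comparison between 20 and 60 blocks follows from two different sizing rules, which the following Python sketch makes explicit. Only the cycle-4000 total of 20 blocks and the yneg peak of 10 blocks come from the text; every other value in the demand log, and the function names, are hypothetical and chosen only so that the example is self-contained.

```python
from typing import Dict

# Hypothetical per-direction block demand at three sampled cycles. Only the
# cycle-4000 total (20) and the yneg peak (10) are taken from the text.
demand_log: Dict[int, Dict[str, int]] = {
    2000: {"xpos": 2, "xneg": 1, "ypos": 3, "yneg": 6, "zpos": 1, "zneg": 1},
    4000: {"xpos": 2, "xneg": 2, "ypos": 4, "yneg": 10, "zpos": 1, "zneg": 1},
    6000: {"xpos": 1, "xneg": 1, "ypos": 2, "yneg": 4, "zpos": 1, "zneg": 1},
}


def shared_pool_blocks(log: Dict[int, Dict[str, int]]) -> int:
    """Blocks needed by a shared pool: the peak over time of the summed demand."""
    return max(sum(per_dir.values()) for per_dir in log.values())


def fixed_buffer_blocks(log: Dict[int, Dict[str, int]]) -> int:
    """Blocks needed when every direction gets the same fixed size:
    the number of directions times the largest single-direction demand."""
    num_directions = len(next(iter(log.values())))
    peak = max(max(per_dir.values()) for per_dir in log.values())
    return num_directions * peak


print(shared_pool_blocks(demand_log), fixed_buffer_blocks(demand_log))
```

The shared pool is sized by the peak of the total demand, while fixed buffers are sized by the peak of a single direction multiplied by six, which is why the fixed design needs three times as many blocks in this case.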
Figure 4·9: Dynamic Buffer Usage All to All (with no mapping) node wave
(X-axis: Cycle, Y-axis: Buffer Block Number (Block memory size = 256B))
4.2.4 Impact of Mapping on Hardware Usage
From the previous results, we know that different mapping strategies can lead to
different resource demands at different nodes. Figure 4·10 and Figure 4·11 depict
the buffer blocks needed in the six directions. For the all to all pattern,
Figure 4·10 shows that, within the same pattern, different nodes have their own
demands for buffer size. Node (0,0,0) requires at most 20 blocks of RAM, while
node (0,2,1) requires at most 16. In node (0,0,0), yneg and ypos require more
buffer space than the others, and the other 4 directions use roughly one basic
block each. Congestion will therefore occur at the yneg and xneg input ports
during switch arbitration.
Figure 4·10: Dynamic Buffer Usage All to All node wave (X-axis: Cycle,
Y-axis: Buffer Block Number (Block memory size = 256B))
In the square nearest neighbor pattern, fold and cut follows the same pattern as
in all to all. With greedy mapping, however, every direction uses only one buffer
block.
From the previous figures, we observe that a log of buffer usage can be generated
for each node. Each node requires a different buffer size for each of its six
directions. Configuring each node and assigning resources based on its needs
could save chip area for the compute part and increase resource utilization.
Figure 4·11: Dynamic Buffer Usage Square NN node wave (X-axis: Cycle,
Y-axis: Buffer Block Number)
Figure 4·12: Buffer Usage All to All node wave (X-axis: Cycle, Y-axis:
Buffer Usage)
4.3 Switch Design Based on Imbalance Analysis
Dynamic Priority
From the previous section, we know that mapping behavior affects the resources
and workload of different nodes. Figure 4·12 shows that, over time, a node has
varying requests for buffer resources, which reflect the requests arriving at the
input side of the switch. The figure shows the buffer usage of node (0,0,0) with
the fold and cut mapping mechanism. Before cycle 2000, the xpos and yneg
directions have much higher buffer usage than the other 4 directions. Because
more packets are stored in xpos and yneg, they need higher priority when
arbitrating for outputs. However, usage changes over time: after cycle 2000,
buffer usage tends to be more even, so the priorities should change as well.
Each node could generate a log based on its requests for buffer resources at
different times, which also indicates the requests for switch arbitration. We can
then dynamically adjust each direction’s arbitration priority based on that
direction’s buffer usage, assigning higher priority to the directions with more
packets in their buffers. The priority assigned to a direction is based on its
share of the overall buffer usage. This not only gives the congested directions
more chances to send out packets but also ensures that the less congested ones
still have a chance to send out packets. With control logic determining how to
dynamically assign the priority of each direction, this can be achieved at
runtime with soft configuration.
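One way to realize a priority that follows the usage share while still serving lightly loaded directions is a weighted round-robin. The following Python sketch uses a smooth weighted round-robin, which is an assumption of this example and not a scheme prescribed by the design; the function names and the usage values are likewise hypothetical.

```python
from typing import Dict, List


def priority_shares(usage: Dict[str, int]) -> Dict[str, float]:
    """Priority of each direction = its share of the overall buffer usage."""
    total = sum(usage.values())
    if total == 0:
        return {d: 0.0 for d in usage}
    return {d: u / total for d, u in usage.items()}


def arbitration_order(usage: Dict[str, int], grants: int) -> List[str]:
    """Grant sequence from a smooth weighted round-robin over non-empty directions.

    Directions with more buffered packets win more often, but every direction
    with at least one packet is granted within one full round.
    """
    weights = {d: u for d, u in usage.items() if u > 0}
    total = sum(weights.values())
    current = {d: 0 for d in weights}
    order: List[str] = []
    for _ in range(grants if weights else 0):
        for d, w in weights.items():
            current[d] += w
        winner = max(current, key=current.get)  # ties go to the first direction listed
        current[winner] -= total
        order.append(winner)
    return order


# Hypothetical buffer usage (packets currently stored per direction).
usage = {"xpos": 3, "yneg": 2, "zpos": 1, "xneg": 0}
order = arbitration_order(usage, grants=6)
```

Over one round of six grants, xpos is granted three times, yneg twice, and zpos once, matching their usage shares, while the empty xneg direction is never granted.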
Dynamic Switch
The dynamic switch is left as future work. Crossbar and reduction-tree-based
switches are two of the switches that have been proposed in previous work on
router design. The crossbar is commonly used in NoCs, but it scales poorly. When
a large number of input and output ports is required, as in a wormhole router
with several virtual channels per input direction, the switch becomes so complex
that the arbitration logic cannot finish in one cycle, and the switch consumes
too much of the chip’s resources. The reduction tree switch was introduced to
achieve scalability and to make timing easier to meet. Its overhead, however, is
2-5 cycles because of the longer pipeline, and it also consumes a lot of chip
area. Building on these switches, we propose a hybrid switch design based on each
node’s circumstances. The hybrid switch design requires hard partial
configuration of each node, which takes longer but saves considerable resources.
Future work will investigate whether the hardware resources required by soft
methods can be eliminated with hard configuration.
Figure 4·13: Crossbar and reduction tree based switch
The imbalance can be quite extreme at some nodes. We take node 001 in 2D square
nearest neighbor with fold and cut as an example. Figure 4·14 shows the buffer
usage in each direction. xneg and yneg are the two directions with heavy buffer
usage at the input of the switch. We apply a 2 (input) x 6 (output) crossbar to
perform switch allocation only for these highly demanded directions. For the less
utilized directions, we can create a FIFO that gathers their packets and connects
to the output ports. The control unit can then schedule a proper arbitration
between the FIFO and the crossbar outputs (see Figure 4·15).
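The choice of which input directions feed the crossbar and which are merged into the FIFO can be derived from a node's buffer usage log. The Python sketch below is illustrative only: ranking directions by peak buffer usage is an assumption, and the usage values and the function name are hypothetical. The crossbar width of two inputs follows the example in the text.

```python
from typing import Dict, List, Tuple


def partition_inputs(usage: Dict[str, int], crossbar_inputs: int = 2
                     ) -> Tuple[List[str], List[str]]:
    """Split input directions into crossbar inputs and FIFO-merged inputs.

    The `crossbar_inputs` most heavily used directions feed the crossbar; the
    remaining directions share one FIFO.
    """
    ranked = sorted(usage, key=lambda d: usage[d], reverse=True)
    return ranked[:crossbar_inputs], ranked[crossbar_inputs:]


# Hypothetical peak buffer-block usage per direction for one node.
peak_usage = {"xpos": 1, "xneg": 9, "ypos": 1, "yneg": 7, "zpos": 1, "zneg": 2}
crossbar, fifo = partition_inputs(peak_usage)
```

With these values, xneg and yneg are assigned to the 2 x 6 crossbar and the other four directions are merged into the FIFO.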
Figure 4·14: Dynamic Buffer Usage fold and cut node wave (X-axis: Cycle,
Y-axis: Buffer Block Number)
Figure 4·15: Reconfigure crossbar based on resource usage
4.4 Summary
From the previous sections, we observed that different mapping strategies can
produce varying load and communication imbalance, both locally and globally. This
imbalance affects the corresponding hardware configuration. Using both hard and
soft configuration, we configure each router according to its needs.
First, regarding resource allocation, we proposed a dynamic buffer architecture
for the overall router architecture. This is a soft method in which control logic
allocates resources at runtime. Under load imbalance, the 6 directions have
varying demands for buffer resources. With the support of the dynamic buffer,
each direction can request buffer space on demand instead of holding idle buffer
space, which would waste resources and power. Our results show that the dynamic
buffer design is better able to deal with imbalanced communication.
We also analyzed network resource usage in the local scope. Nodes have varying
workloads, leading to a different hardware configuration on each node. From the
results we generated, each node can be configured with a buffer size matching its
heaviest workload. Chip area otherwise used for buffer resources can therefore be
saved for computation.
The switch is another part we can configure to accommodate the imbalance. First,
based on each node’s buffer usage over time, we can modify the priority of the
different input directions at different cycles. Directions with more packets have
higher priority than the others and a higher chance of gaining control of the
output port. In this manner, the imbalance can be mitigated with a soft method.
Second, we can reconfigure the switch architecture based on each node’s buffer
usage. If some directions have far fewer packets than others, we can serve the
high-demand directions with a crossbar and the low-demand directions with a FIFO
that combines their packets, with the control unit deciding the priority between
the outputs of these two parts. Hard partial configuration in this hybrid switch
design would save chip resources while improving router performance. Future work
will quantify the hardware resources saved by these methods and determine whether
the resources used by the soft method can be eliminated by hard configuration.
Chapter 5
Conclusions and Future Work
This chapter provides the conclusion and summary of our work, followed by proposed
future work.
5.1 Conclusion
In this thesis, we explore mapping of applications onto FCCs through mapping of
logical onto physical topologies and configuring a corresponding hardware reconfigu-
ration solution. With regard to different applications and logical topologies, we find
that new mapping mechanisms could have a significant impact on the efficiency of
communication. Also, we find that mapping strategies could generate different and
irregular communication patterns, i.e., different from applications where logical and
physical topologies match. This leads to load and communication imbalances which
would also affect the choice of hardware configuration. In order to deal with the im-
balance issue, we proposed several ways to reconfigure hardware using soft and hard
methods based on the imbalance situation.
First, we propose two ways of mapping a given application and topology onto an
FCC physical cluster: Fold-and-Cut and Greedy mapping. Mapping a 2D mesh onto a
3D torus cluster is the topology mapping we choose as our test case. All to all
and square nearest neighbor are the communication patterns we choose to reflect
application communication. The fold-and-cut mechanism maintains the original 2D
topology as much as possible. Cutting sides limits the communication performance
but can be optimized. Greedy mapping places the most intensely communicating
nodes as close together as possible. It also avoids “heavy” links that take more
time to traverse. Comparing the two methods for mapping a 2D logical topology
onto a 3D torus, fold-and-cut has better performance than greedy, most likely
because fold-and-cut takes the original topology into account. Another result is
that different mapping mechanisms and communication patterns have a different
impact on load, resource, and link imbalance. Also, resource demands evolve
throughout the communication phase.
Second, based on the previous result, we find that mapping behavior could have
different impacts on resource demand and communication imbalance. These
imbalances affect the corresponding hardware configuration. With both hard and
soft configurations, we configure each router according to its needs. We explore
a dynamic buffer design that mitigates the different buffer requirements triggered
by communication imbalance issues; this uses soft reconfiguration applied at
runtime. On the other hand, in the local scope, nodes have various workloads,
leading to different hardware configurations on each node. From the results we
generated, each node can be configured with a buffer size matching its heaviest
workload. Therefore, chip area, e.g. for buffer resources, can be saved for
computing area.
5.2 Future Work
Our design provides a model of mapping applications onto FCCs by extracting the
given application’s communication pattern and implementing the optimized mapping
mechanism. After analyzing resource and link usage of the network, we apply the
corresponding hardware configuration. Future work includes exploring more mapping
mechanisms targeting more communication patterns and logical topologies, as well
as adding support for different physical clusters with various topologies. For
more fine-grained mapping, hardware partial configuration can be done at run
time. There also remains the question of the hardware costs required to implement
soft methods and whether these can be eliminated with hard configuration. There
are also more ways to explore hardware configuration in future work.
For example, the switch is another part we can work on. The 6 input directions
could be given different priorities at different cycles. Directions with more
packets would have higher priority than others and a higher chance of gaining
control of the output port. Also, the switch architecture can be reconfigured
based on each node’s buffer resource requirement. In the hybrid switch, the
hardware partial configuration method saves chip resources but requires a longer
partial configuration time. A combination of the crossbar and FIFO design can
save resources and improve efficiency.
References
Aleliunas, R. and Rosenberg, A. (1982). On embedding rectangular grids in square
grids. IEEE Transactions on Computers, C-31(9):907–913.
Altera (2014). Cyclone V Device Handbook vol. 2: Transceivers.
Altera (2017). Intel FPGA SDK for OpenCL: Programming Guide.
Annexstein, F., Baumslag, M., Herbordt, M., Obrenic, B., Rosenberg, A., and Weems,
C. (1990). Achieving Multigauge Behavior in Bit-Serial SIMD Architectures Via
Emulation. In 3rd Symposium on the Frontiers of Massively Parallel Computation.
doi: 10.1109/FMPC.1990.89459.
Arden, W. M. (2002). The international technology roadmap for semiconductors:
perspectives and challenges for the next 15 years. Current Opinion in Solid State
and Materials Science, 6(5):371–377. doi: 10.1016/S1359-0286(02)00116-X.
Bailey, D. H., Broadhurst, D., Hida, Y., Li, X. S., and Thompson, B. (2002). High
performance computing meets experimental mathematics. In SC ’02: Proceedings
of the 2002 ACM/IEEE Conference on Supercomputing. doi: 10.1109/SC.2002.10060.
Barrett, R., Borkar, S., Dosanjh, S., S.D. Hammond, M.A. Heroux, X. H., Luitjens,
J., Parker, S., Shalf, J., and Tang, L. (2013). On the role of co-design in high
performance computing. In Advances in Parallel Computing Vol 24. Transition of
HPC Towards Exascale Computing, pages 141–155.
Boku, T., Kobayashi, R., Fujita, N., Amano, H., Sano, K., Hanawa, T., and Yam-
aguchi, Y. (2019). Cygnus: GPU meets FPGA for HPC. In Proceedings of the
IEEE International Conference on Supercomputing.
https://www.r-ccs.riken.jp/labs/lpnctrt/assets/img/lspanc2020jan_boku_light.pdf.
Chiu, M. and Herbordt, M. (2009). Efficient filtering for molecular dynamics sim-
ulations. In 2009 International Conference on Field Programmable Logic and
Applications. doi: 10.1109/ FPL15426.2009.
Chiu, M. and Herbordt, M. (2010a). Molecular dynamics simulations on high per-
formance reconfigurable computing systems. ACM Transactions on Reconfigurable
Technology and Systems, 3(4):1–37. doi: 10.1145/1862648.1862653.
Chiu, M. and Herbordt, M. (2010b). Towards production FPGA-accelerated molec-
ular dynamics: Progress and challenges. In 2010 4th High Performance Reconfig-
urable Technology and Applications. doi: 10.1109/HPRCTA.2010.5670800.
Chiu, M., Herbordt, M., and Langhammer, M. (2008). Performance potential of
molecular dynamics simulations on high performance reconfigurable computing sys-
tems. In 2008 Second International Workshop on High-Performance Reconfigurable
Computing Technology and Applications. doi: 10.1109/HPRCTA.2008.4745685.
Chiu, M., Khan, M., and Herbordt, M. (2011). Efficient calculation of pairwise
nonbonded forces. In 2011 IEEE 19th Annual International Symposium on Field-
Programmable Custom Computing Machines. doi: 10.1109/FCCM.2011.34.
Chung, I.-H., Lee, C.-R., Zhou, J., and Chung, Y.-C. (2011). Hierarchical mapping
for HPC applications. Parallel Processing Letters, 21(03):279–299.
doi: 10.1142/S0129626411000229.
Conti, A., VanCourt, T., and Herbordt, M. (2004). Processing Repetitive Structures
with Mismatches at Streaming Rate. In Field Programmable Logic and Applica-
tion. FPL 2004. Lecture Notes in Computer Science, vol 3203. Springer, Berlin,
Heidelberg. doi: 10.1007/978-3-540-30117-2_131.
Dally, W. and Seitz, C. (1987). Deadlock-free message routing in multiprocessor
interconnection networks. IEEE Transactions on Computers, 36(5).
Dally, W. and Towles, B. (2004). Principles and Practices of Interconnection Net-
works. Elsevier.
DOE ASCR (2017). Crosscut report: Exascale requirements reviews. Technical
report, Department of Energy. doi: 10.2172/1417653.
DOE ASCR (2018). Extreme heterogeneity 2018: Productive computational science
in the era of extreme heterogeneity. Technical report, Department of Energy. doi:
10.2172/1473756.
Expósito, R. R., Taboada, G. L., Ramos, S., Touriño, J., and Doallo, R. (2013).
Performance analysis of HPC applications in the cloud. Future Generation Computer
Systems, 29(1):218–229. doi: 10.1016/j.future.2012.06.009.
Geng, T., Wang, T., Li, A., Jin, X., and Herbordt, M. (2019a). A Scalable Framework
for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Weight
and Workload Balancing. In ArXiv Preprint arXiv:1901.01007.
Geng, T., Wang, T., Sanaullah, A., Yang, C., Xu, R., Patel, R., and Herbordt,
M. (2018a). A framework for acceleration of CNN training on deeply-pipelined
FPGA clusters with work and weight load balancing. In 2018 28th International
Conference on Field Programmable Logic and Applications (FPL 2018), pages 394–402.
doi: 10.1109/FPL.2018.00074.
Geng, T., Wang, T., Sanaullah, A., Yang, C., Xu, R., Patel, R., and Herbordt,
M. (2018b). FPDeep: Acceleration and Load Balancing of CNN Training on
FPGA Clusters. In 2018 IEEE 26th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM), pages 81–84.
doi: 10.1109/FCCM.2018.00021.
Geng, T., Wang, T., Wu, C., Yang, C., Li, A., Song, S., and Herbordt, M. (2019b).
LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism. In 2019 IEEE
30th International Conference on Application-specific Systems, Architectures and
Processors (ASAP), volume 2160, pages 9–16. doi: 10.1109/ASAP.2019.00-43.
Geng, T., Wang, T., Wu, C., Yang, C., Wu, W., Li, A., and Herbordt, M. (2019c).
O3BNN: An Out-Of-Order Architecture for High-Performance Binarized Neural
Network Inference with Fine-Grained Pruning. In ICS ’19: Proceedings of the
ACM International Conference on Supercomputing, volume 2160, pages 461–472.
doi: 10.1145/3330345.3330386.
George, A., Herbordt, M., Lam, H., Lawande, A., Sheng, J., and Yang, C. (2016).
Novo-G#: A Community Resource for Exploring Large-Scale Reconfigurable Com-
puting Through Direct and Programmable Interconnects. In 2016 IEEE High Per-
formance Extreme Computing Conference (HPEC), Waltham, MA, pages 1–7. doi:
10.1109/HPEC.2016.7761639.
Gropp, W., Lusk, E., Doss, N., and Skjellum, A. (1996). A high-performance,
portable implementation of the MPI message passing interface standard. Parallel
Computing, 22:789 – 828. doi: 10.1016/0167-8191(96)00024-5.
Gu, Y. and Herbordt, M. (2007a). Amenability of multigrid computations to FPGA-
based acceleration. Available at https://archive.ll.mit.edu/HPEC/agendas/
proc07/Day2/05_Gu_Pres.pdf.
Gu, Y. and Herbordt, M. (2007b). FPGA-based multigrid computations for molecu-
lar dynamics simulations. In 15th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, pages 117–126. doi: 10.1109/FCCM.2007.42.
Gu, Y., VanCourt, T., and Herbordt, M. (2006a). Accelerating molecular dynamics
simulations with configurable circuits. IEE Proceedings on Computers and Digital
Technology, 153(3):189–195. doi: 10.1049/ip-cdt:20050182.
Gu, Y., VanCourt, T., and Herbordt, M. (2006b). Improved interpolation and sys-
tem integration for FPGA-based molecular dynamics simulations. In 2006 Inter-
national Conference on Field Programmable Logic and Applications, pages 21–28.
doi: 10.1109/FPL.2006.311190.
Gu, Y., VanCourt, T., and Herbordt, M. (2006c). Integrating FPGA acceleration
into the ProtoMol molecular dynamics code: Preliminary report. In 2006 14th
Annual IEEE Symposium on Field-Programmable Custom Computing Machines,
pages 315–316. doi: 10.1109/FCCM.2006.52.
Gu, Y., VanCourt, T., and Herbordt, M. (2008). Explicit design of FPGA-based
coprocessors for short-range force computation in molecular dynamics simulations.
Parallel Computing, 34(4-5):261–271. doi: 10.1016/j.parco.2008.01.007.
Hager, G. and Wellein, G. (2010). Introduction to high performance computing for
scientists and engineers. CRC Press.
Haghi, P., Geng, T., Guo, A., Wang, T., and Herbordt, M. (2020). FP-AMG: FPGA-
Based Acceleration Framework for Algebraic Multigrid Solvers. In 28th IEEE
International Symposium on Field-Programmable Custom Computing Machines.
Hauck, S. and DeHon, A. (2010). Reconfigurable computing: the theory and practice
of FPGA-based computation. Elsevier.
Herbordt, M. (2013). Architecture/algorithm codesign of molecular dynamics pro-
cessors. In 2013 Asilomar Conference on Signals, Systems, and Computers, pages
1442–1446. doi: 10.1109/ACSSC.2013.6810534.
Herbordt, M. (2019). Advancing OpenCL for FPGAs: Boosting Performance with
Intel FPGA SDK for OpenCL Technology. In The Parallel Universe, pages 17–32.
Herbordt, M., Gu, Y., VanCourt, T., Model, J., Sukhwani, B., and Chiu, M. (2008a).
Computing models for FPGA-based accelerators with case studies in molecular
modeling. Computing in Science and Engineering, 10(6):35–45.
doi: 10.1109/MCSE.2008.143.
Herbordt, M., Khan, M., and Dean, T. (2009). Parallel discrete event simulation of
molecular dynamics through event-based decomposition. In 2009 20th IEEE
International Conference on Application-specific Systems, Architectures and
Processors, Boston, MA, pages 129–136. doi: 10.1109/ASAP.2009.39.
Herbordt, M., Kosie, F., and Model, J. (2008b). An efficient O(1) priority queue
for large FPGA-based discrete event simulations of molecular dynamics. In
2008 16th International Symposium on Field-Programmable Custom Computing
Machines, pages 248–257. doi: 10.1109/FCCM.2008.49.
Herbordt, M., Model, J., Sukhwani, B., Gu, Y., and VanCourt, T. (2006). Single pass,
BLAST-like, approximate string matching on FPGAs. In 2006 14th Annual IEEE
Symposium on Field-Programmable Custom Computing Machines, pages 217–226.
doi: 10.1109/FCCM.2006.64.
Herbordt, M., Model, J., Sukhwani, B., Gu, Y., and VanCourt, T. (2007a). Single
pass streaming BLAST on FPGAs. Parallel Computing, 33(10-11):741–756.
Herbordt, M., Olin, K., and Le, H. (1999). Design trade-offs of low-cost multicom-
puter networks. In The 7th Symposium on the Frontiers of Massively Parallel
Computation, pages 25–34. doi: 10.1109/FMPC.1999.750581.
Herbordt, M. and VanCourt, T. (2005). System and method for programmable logic
acceleration of data processing applications and compiler therefore. United States
Patent Application Publication. Pub. no.: US2007/0277161 A1.
Herbordt, M., VanCourt, T., Gu, Y., Sukhwani, B., Conti, A., Model, J., and DiS-
abello, D. (2007b). Achieving high performance with FPGA-based computing.
IEEE Computer, 40(3):42–49.
Hoefler, T. and Snir, M. (2011). Generic topology mapping strategies for large-scale
parallel architectures. In ICS ’11: Proceedings of the international conference on
Supercomputing, pages 75–84. doi: 10.1145/1995896.1995909.
Humphries, B., Zhang, H., Sheng, J., Landaverde, R., and Herbordt, M. (2014). 3D
FFT on a Single FPGA. In 2014 IEEE 22nd Annual International Symposium on
Field-Programmable Custom Computing Machines. doi: 10.1109/FCCM.2014.28.
Intel (2019). Intel Stratix 10 L- and H-Tile Transceiver PHY User Guide.
IRDS (2018). International roadmap for devices and systems: Executive summary.
Technical Report 2018 Edition, IEEE.
Khan, M., Chiu, M., and Herbordt, M. (2013). FPGA-Accelerated Molecular Dy-
namics. In Benkrid, K. and Vanderbauwhede, W., editors, High Performance
Computing Using FPGAs, pages 105–135. Springer Verlag.
doi: 10.1007/978-1-4614-1791-0_4.
Khan, M. and Herbordt, M. (2011). Parallel discrete event simulation of molecular
dynamics with speculation and in-order commitment. Journal of Computational
Physics, 230(17):6563–6582. doi: 10.1016/j.jcp.2011.05.001.
Khan, M. and Herbordt, M. (2012). Communication requirements for FPGA-centric
molecular dynamics. In Symposium on Application Accelerators for High Performance
Computing. https://www.bu.edu/caadlab/saahpc12.pdf.
Khan, M. A., Hankendi, C., Coskun, A. K., and Herbordt, M. C. (2011). Software
optimization for performance, energy, and thermal distribution: Initial case studies.
In 2011 International Green Computing Conference and Workshops (IGCC), pages
1–6. doi: 10.1109/IGCC.2011.6008575.
Kung, H. and Stevenson, D. (1977). A software technique for reducing the routing
time on a parallel computer with a fixed interconnection network. In High Speed
Computer and Algorithm Optimization. Academic Press, New York.
Landaverde, R. and Herbordt, M. (2014). GPU Optimizations for a Production
Molecular Docking Code. In 2014 IEEE High Performance Extreme Computing
Conference (HPEC). doi: 10.1109/HPEC.2014.7040981.
Leiserson, C. (1980). Area-Efficient Graph Layouts (for VLSI). In 21st Annual
Symposium on Foundations of Computer Science (sfcs 1980).
doi: 10.1109/SFCS.1980.13.
Li, A., Geng, T., Wang, T., Herbordt, M., Song, S., and Barker, K. (2019). BSTC:
A Novel Binarized-Soft-Tensor-Core Design for Accelerating Bit-Based Approxi-
mated Neural Nets. In SC ’19: Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis.
doi: 10.1145/3295500.3356169.
Lindtjorn, O., Clapp, R., Pell, O., Fu, H., Flynn, M., and Mencer, O. (2011). Be-
yond traditional microprocessors for geoscience high-performance computing appli-
cations. IEEE Micro, 31(2):41–49. doi: 10.1109/MM.2011.17.
Liu, Y., Sheng, J., and Herbordt, M. (2016). A Hardware Prototype for In-Brain
Neural Spike-Sorting. In 2016 IEEE High Performance Extreme Computing Con-
ference (HPEC). doi: 10.1109/HPEC.2016.7761590.
Lu, Z., Liu, M., and Jantsch, A. (2007). Layered switching for networks on chip.
In DAC ’07: Proceedings of the 44th annual Design Automation Conference, pages
122–127. doi: 10.1145/1278480.1278511.
Ma, E. and Tao, L. (1988). Embeddings among toruses and meshes. Technical
Report No. MS-CIS-88-63, Department of Computer and Information Science,
University of Pennsylvania.
Ma, E. and Tao, L. (1993). Embeddings among meshes and tori. Journal of Parallel
and Distributed Computing, 18:44–55. doi: 10.1006/jpdc.1993.1043.
Mahram, A. and Herbordt, M. (2010). Fast and Accurate NCBI BLASTP: Accel-
eration with Multiphase FPGA-Based Prefiltering. In ICS ’10: Proceedings of
the 24th ACM International Conference on Supercomputing, pages 73–82. doi:
10.1145/1810085.1810099.
Mahram, A. and Herbordt, M. (2012a). CAAD BLASTP 2.0: NCBI BLASTP
accelerated with pipelined filters. In 22nd International Conference on Field
Programmable Logic and Applications (FPL), pages 217–223.
doi: 10.1109/FPL.2012.6339184.
Mahram, A. and Herbordt, M. (2012b). FMSA: FPGA-Accelerated ClustalW-Based
Multiple Sequence Alignment through Pipelined Prefiltering. In 2012 IEEE 20th
International Symposium on Field-Programmable Custom Computing Machines,
pages 177–183. doi: 10.1109/FCCM.2012.38.
Mahram, A. and Herbordt, M. (2015). NCBI BLASTP on High Performance Recon-
figurable Computing Systems. ACM Transactions on Reconfigurable Technology
and Systems, 15(4):6.1–6.20. doi: 10.1145/2629691.
Meng, J., LLomosi, E., Kaplan, F., Zhang, C., Sheng, J., Herbordt, M., Schirner, G.,
and Coskun, A. (2016). Joint optimization of communication and cooling costs in
HPC data centers. Journal of Parallel and Distributed Computing, 96.
Miyajima, T., Ueno, T., Koshiba, A., Huthmann, J., Sano, K., and Sato, M. (2018).
High-Performance Custom Computing with FPGA Cluster as an Off-loading Engine.
In Proceedings of the ACM/IEEE International Conference for High Performance
Computing, Networking, Storage and Analysis.
https://sc19.supercomputing.org/proceedings/tech_poster/poster_files/rpost174s2-file3.pdf.
Model, J. and Herbordt, M. (2007). Discrete event simulation of molecular dynamics
with configurable logic. In 2007 International Conference on Field Programmable
Logic and Applications, pages 151–158. doi: 10.1109/FPL.2007.4380640.
Mondigo, A., Ueno, T., Sano, K., and Takizawa, H. (2020). Comparison of Direct and
Indirect Networks for High-Performance FPGA Clusters. In Rincon, F., Barba,
J., So, H., Diniz, P., and Caba, J., editors, ARC 2020. Lecture Notes in Computer
Science, vol 12083. Springer. doi: 10.1007/978-3-030-44534-8_24.
Monien, B. and Sudborough, H. (1990). Embedding one interconnection network
in another. In Tinhofer, G., Mayr, E., Noltemeier, H., and Syslo, M., editors,
Computational Graph Theory. Springer, Vienna. doi: 10.1007/978-3-7091-9076-0_13.
Muhammed, T., Mehmood, R., Albeshri, A., and Alsolami, F. (2020). HPC-smart
infrastructures: A review and outlook on performance analysis methods and tools.
In Smart Infrastructure and Applications, pages 427–451. Springer. doi:
10.1007/978-3-030-13705-2_18.
Munafo, R. (2018). Cooperative High-Performance Computing with FPGAs: Matrix
Multiply Case Study. Master’s thesis, Department of Electrical and Computer
Engineering, Boston University. https://open.bu.edu/handle/2144/30740.
Nicopoulos, C. A., Park, D., Kim, J., Vijaykrishnan, N., Yousif, M. S., and Das, C. R.
(2006). ViChaR: A dynamic virtual channel regulator for network-on-chip routers.
In 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO’06), pages 333–346. IEEE. doi: 10.1109/MICRO.2006.50.
NSF (2016). Future directions for NSF advanced computing infrastructure to support
U.S. science and engineering in 2017-2020. Technical report, National
Science Foundation.
NSF CFD (2014). Future Directions for NSF Advanced Computing Infrastructure.
National Academies Press.
Obrenić, B., Herbordt, M., Rosenberg, A., and Weems, C. C. (1999). Using emulations
to construct high-performance virtual parallel architectures. IEEE Transactions
on Parallel and Distributed Systems, 10(10):1–15. doi: 10.1109/71.808155.
Parik, N., Deniz, O., Kim, P., and Li, Z. (2008). Buffer allocation approaches for
virtual channel flow control. Technical report, Department of Electrical Engineering,
Stanford University.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.909&rep=rep1&type=pdf.
Park, J., Qiu, Y., and Herbordt, M. (2009). CAAD BLASTP: NCBI BLASTP
Accelerated with FPGA-Based Pre-Filtering. In 2009 17th IEEE Symposium on
Field Programmable Custom Computing Machines, Napa, CA, pages 81–87. doi:
10.1109/FCCM.2009.27.
Park, J., Qiu, Y., and Herbordt, M. (2010). CAAD BLASTn: Accelerated NCBI
BLASTn with FPGA Prefiltering. In Proceedings of the IEEE International Symposium
on Circuits and Systems, pages 3797–3800. doi: 10.1109/ISCAS.2010.5537721.
Pascoe, C., Lawande, A., Lam, H., George, A., Sun, Y., Farmerie, W., and Her-
bordt, M. (2010). Reconfigurable supercomputing with scalable systolic arrays
and in-stream control for wavefront genomics processing. In Proceedings of the
Symposium on Application Accelerators for High Performance Computing.
https://www.researchgate.net/publication/265931244_Reconfigurable_Supercomputing_with_Scalable_Systolic_Arrays_and_In-Stream_Control_for_Wavefront_Genomics_Processing.
Pellegrini, F. and Roman, J. (1996). Scotch: A software package for static map-
ping by dual recursive bipartitioning of process and architecture graphs. In In-
ternational Conference on High-Performance Computing and Networking, pages
493–498. Springer. doi: 10.1007/3-540-61142-8_588.
PITAC (2005). Computational Science: Ensuring America’s Competitiveness. Na-
tional Coordination Office for Information Technology Research and Development,
http://www.nitrd.gov.
Plessl, C. (2018). Bringing FPGAs to HPC Production Systems and Codes. In
H2RC’18 workshop at Supercomputing (SC’18). doi: 10.13140/RG.2.2.34327.42407.
Porrmann, M., Hagemeyer, J., Pohl, C., Romoth, J., and Strugholtz, M. (2010).
Raptor – a scalable platform for rapid prototyping and FPGA-based cluster computing.
Parallel Computing: From Multicores and GPU's to Petascale, Advances in Parallel
Computing, 19. doi: 10.3233/978-1-60750-530-3-592.
Putnam, A. (2014). A Reconfigurable Fabric for Accelerating Large-Scale Datacenter
Services. In Proceedings of the International Symposium on Computer Architecture,
pages 13–24. doi: 10.1109/ISCA.2014.6853195.
Rosenberg, A. (1975). Preserving proximity in arrays. SIAM Journal on Computing,
4(4):443–460. doi: 10.1137/0204038.
Rosenberg, A. (1980). Issues in the study of graph embeddings. In International
Workshop on Graph-Theoretic Concepts in Computer Science, pages 150–176. doi:
10.1007/3-540-10291-4_12.
Russinovich, M. (2017). Inside Microsoft’s FPGA-Based Configurable Cloud.
https://channel9.msdn.com/Events/Build/2017/B8063. Accessed 6/2017.
Sanaullah, A. and Herbordt, M. (2017). OpenCL for HPC/FPGAs: Case Study
with 3D FFT. In HEART 2018: Proceedings of the 9th International Symposium
on Highly-Efficient Accelerators and Reconfigurable Technologies, pages 1–6. doi:
10.1145/3241793.3241800.
Sanaullah, A. and Herbordt, M. (2018a). An Empirically Guided Optimization
Framework for FPGA OpenCL. In 2018 International Conference on Field Pro-
grammable Technology (FPT), pages 46–53. doi: 10.1109/FPT.2018.00018.
Sanaullah, A. and Herbordt, M. (2018b). FPGA HPC using OpenCL: Case Study
in 3D FFT. In HEART 2018: Proceedings of the 9th International Symposium
on Highly-Efficient Accelerators and Reconfigurable Technologies, pages 1–6. doi:
10.1145/3241793.3241800.
Sanaullah, A. and Herbordt, M. (2018c). Unlocking Performance-Programmability
by Penetrating the Intel FPGA OpenCL Toolflow. In 2018 IEEE High Performance
Extreme Computing Conference (HPEC). doi: 10.1109/HPEC.2018.8547646.
Sanaullah, A., Khoshparvar, A., and Herbordt, M. (2016a). FPGA-Accelerated
Particle-Grid Mapping. In 2016 IEEE 24th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM), pages 192–195. doi:
10.1109/FCCM.2016.53.
Sanaullah, A., Lewis, K., and Herbordt, M. (2016b). Accelerated Particle-Grid Mapping.
In Proceedings of the ACM/IEEE International Conference for High Performance
Computing, Networking, Storage and Analysis.
http://sc16.supercomputing.org/sc-archive/tech_poster/poster_files/post257s2-file3.pdf.
Sanaullah, A., Sachdeva, V., and Herbordt, M. (2018a). SimBSP: Enabling RTL
Simulation for Intel FPGA OpenCL Kernels. doi: 10.1186/s12859-018-2505-7.
Sanaullah, A., Yang, C., Alexeev, Y., Yoshii, K., and Herbordt, M. (2018b). Real-
Time Data Analysis for Medical Diagnosis using FPGA Accelerated Neural Net-
works. BMC Bioinformatics, 19 Supplement 18. doi: 10.1186/s12859-018-2505-7.
Shalf, J., Quinlan, D., and Janssen, C. (2011). Rethinking hardware-software code-
sign for exascale systems. Computer, 44(11):22–30. doi: 10.1109/MC.2011.300.
Sheng, J. (2017). High Performance Communication on Reconfigurable Clusters.
PhD thesis, Department of Electrical and Computer Engineering, Boston Univer-
sity. https://open.bu.edu/handle/2144/27045.
Sheng, J., Humphries, B., Zhang, H., and Herbordt, M. (2014). Design of 3D FFTs
with FPGA Clusters. In Proceedings of the IEEE High Performance Extreme
Computing Conference. doi: 10.1109/HPEC.2014.7040997.
Sheng, J., Xiong, Q., Yang, C., and Herbordt, M. (2015a). Hardware-Efficient Com-
pressed Sensing Encoder Designs for WBSNs. In 2015 IEEE High Performance
Extreme Computing Conference (HPEC). doi: 10.1109/HPEC.2015.7322437.
Sheng, J., Xiong, Q., Yang, C., and Herbordt, M. (2016a). Collective Communi-
cation on FPGA Clusters with Static Scheduling. ACM SIGARCH Computer
Architecture News, 44(4). doi: 10.1145/3039902.3039904.
Sheng, J., Yang, C., Caulfield, A., Papamichael, M., and Herbordt, M. (2017).
HPC on FPGA Clouds: 3D FFTs and Implications for Molecular Dynamics. In
2017 27th International Conference on Field Programmable Logic and Applications
(FPL). doi: 10.23919/FPL.2017.8056853.
Sheng, J., Yang, C., and Herbordt, M. (2015b). Towards Low-Latency Communication
on FPGA Clusters with 3D FFT Case Study. In Proceedings of the International
Symposium on Highly Efficient Accelerators and Reconfigurable Technologies.
https://pdfs.semanticscholar.org/832d/c69145f5ba0ed6a951583201b1b20dd2096e.pdf.
Sheng, J., Yang, C., and Herbordt, M. (2016b). Application-Aware Collective Com-
munication on FPGA Clusters. In IEEE 24th Annual International Symposium
on Field-Programmable Custom Computing Machines (FCCM). doi:
10.1109/FCCM.2016.55.
Sheng, J., Yang, C., and Herbordt, M. (2018a). High Performance Dynamic Com-
munication on Reconfigurable Clusters. In 28th International Conference on Field
Programmable Logic and Applications (FPL). doi: 10.1109/FPL.2018.00044.
Sheng, J., Yang, C., and Herbordt, M. (2018b). High Performance Dynamic Com-
munication on Reconfigurable Clusters (Extended Abstract). In 2018 IEEE 26th
Annual International Symposium on Field-Programmable Custom Computing Machines
(FCCM), pages 219–219. doi: 10.1109/FCCM.2018.00053.
Shi, R., Dong, P., Geng, T., Ding, Y., Ma, X., So, H., Herbordt, M., Li, A., and Wang,
Y. (2020). CSB-RNN: A Faster-than-Realtime RNN Acceleration Framework with
Compressed Structured Blocks. In Proceedings of the International Conference on
Supercomputing.
Shin, K. G. and Daniel, S. W. (1996). Analysis and implementation of hybrid switching.
IEEE Transactions on Computers, 45(6):684–692. doi: 10.1109/12.506424.
Stern, J., Xiong, Q., Sheng, J., Skjellum, A., and Herbordt, M. (2017). Accelerating
MPI Reduce with FPGAs in the Network. In Proceedings of the Workshop on
Exascale MPI. https://www.bu.edu/caadlab/exampi17.pdf.
Stern, J., Xiong, Q., Skjellum, A., and Herbordt, M. (2018). A Novel Approach
to Supporting Communicators for In-Switch Processing of MPI Collectives. In
Proceedings of the Workshop on Exascale MPI.
https://www.bu.edu/caadlab/ExaMPI18a.pdf.
Su, N., Gu, H., Wang, K., Yu, X., and Zhang, B. (2018). A highly efficient dynamic
router for application-oriented network on chip. The Journal of Supercomputing,
74(7):2905–2915. doi: 10.1007/s11227-018-2334-5.
Sukhwani, B. and Herbordt, M. (2008). Acceleration of a Production Rigid Molecule
Docking Code. In 2008 International Conference on Field Programmable Logic
and Applications, pages 341–346. doi: 10.1109/FPL.2008.4629955.
Sukhwani, B. and Herbordt, M. (2009). FPGA-Acceleration of CHARMM Energy
Minimization. In HPRCTA ’09: Proceedings of the Third International Workshop
on High-Performance Reconfigurable Computing Technology and Applications, pages
1–10. doi: 10.1145/1646461.1646462.
Sukhwani, B. and Herbordt, M. (2010a). Fast Binding Site Mapping using GPUs and
CUDA. In 2010 IEEE International Symposium on Parallel Distributed Processing,
Workshops and Phd Forum (IPDPSW), pages 1–8. doi: 10.1109/IPDPSW.2010.5470895.
Sukhwani, B. and Herbordt, M. (2010b). FPGA Acceleration of Rigid Molecule Docking
Codes. IET Computers and Digital Techniques, 4(3):184–195. doi:
10.1049/iet-cdt.2009.0013.
Sukhwani, B. and Herbordt, M. (2014). Increasing Parallelism and Reducing Thread
Contentions in Mapping Localized N-body Simulations to GPUs. In Kindratenko,
V., editor, Numerical Computations with GPUs, pages 379–405. Springer Verlag.
doi: 10.1007/978-3-319-06548-9_18.
Tvrdik, P. (1999). Embeddings and simulations of interconnection networks. In
Course Notes, Topics in Parallel Computing, Computer Science Department, University
of Wisconsin. doi: 10.1007/978-3-7091-9076-0_13.
VanCourt, T., Gu, Y., and Herbordt, M. (2004a). FPGA acceleration of rigid
molecule interactions. In 12th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, pages 300–301. doi: 10.1109/FCCM.2004.33.
VanCourt, T. and Herbordt, M. (2004). Families of FPGA-based algorithms for
approximate string matching. In Proceedings. 15th IEEE International Conference
on Application-Specific Systems, Architectures and Processors, 2004., pages 354–
364. doi: 10.1109/ASAP.2004.1342484.
VanCourt, T. and Herbordt, M. (2005a). LAMP: A tool suite for families of FPGA-
based application accelerators. In International Conference on Field Programmable
Logic and Applications, 2005. doi: 10.1109/FPL.2005.1515797.
VanCourt, T. and Herbordt, M. (2005b). Three dimensional template correlation:
Object recognition in 3D voxel data. In Seventh International Workshop on
Computer Architecture for Machine Perception (CAMP’05), pages 153–158. doi:
10.1109/CAMP.2005.52.
VanCourt, T. and Herbordt, M. (2006a). Application-dependent memory interleav-
ing enables high performance in FPGA-based grid computations. In Proceedings
of the IEEE Conference on Field Programmable Logic and Applications, pages 395–
401. doi: 10.1109/FCCM.2006.25.
VanCourt, T. and Herbordt, M. (2006b). Rigid molecule docking: FPGA reconfigu-
ration for alternative force laws. Journal on Applied Signal Processing, v2006:1–10.
doi: 10.1155/ASP/2006/97950.
VanCourt, T. and Herbordt, M. (2006c). Sizing of processing arrays for FPGA-based
computation. In 2006 International Conference on Field Programmable Logic and
Applications, pages 755–760. doi: 10.1109/FPL.2006.311307.
VanCourt, T. and Herbordt, M. (2007). Families of FPGA-based accelerators for
approximate string matching. Microprocessors and Microsystems, 31(2):135–145.
doi: 10.1016/j.micpro.2006.04.001.
VanCourt, T. and Herbordt, M. (2009). Elements of high performance reconfigurable
computing. In Zelkowitz, M., editor, Advances in Computers, volume 75, pages
113–157. Elsevier. doi: 10.1016/S0065-2458(08)00802-4.
VanCourt, T., Herbordt, M., and Barton, R. (2003). Case study of a functional
genomics application for an FPGA-based coprocessor. In International Conference
on Field Programmable Logic and Applications, pages 365–374. doi:
10.1016/j.micpro.2004.03.005.
VanCourt, T., Herbordt, M., and Barton, R. (2004b). Microarray data analysis using
an FPGA-based coprocessor. Microprocessors and Microsystems, 28(4):213–222.
doi: 10.1016/j.micpro.2004.03.005.
Vipin, K. and Fahmy, S. A. (2018). FPGA Dynamic and Partial Reconfiguration:
A Survey of Architectures, Methods, and Applications. ACM Computing Surveys
(CSUR), 51(4):1–39. doi: 10.1145/3193827.
Wang, T., Geng, T., Jin, X., and Herbordt, M. (2019a). Accelerating AP3M-Based
Computational Astrophysics Simulations with Reconfigurable Clusters. In 2019
IEEE 30th International Conference on Application-specific Systems, Architectures
and Processors (ASAP), pages 181–184. doi: 10.1109/ASAP.2019.000-5.
Wang, T., Geng, T., Jin, X., and Herbordt, M. (2019b). FP-AMR: A Reconfigurable
Fabric Framework for Block-Structured Adaptive Mesh Refinement Applications.
In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pages 245–253. doi: 10.1109/FCCM.2019.00040.
Xiang, Z., Wang, T., Geng, T., Xiang, T., Jin, X., and Herbordt, M. (2018). Soft-
Core, Multiple-Lane, FPGA-based ADCs for a Liquid Helium Environment. pages
1–6. doi: 10.1109/HPEC.2018.8547550.
Xilinx (2009). Virtex-5 FPGA RocketIO GTP Transceiver. Xilinx.
Xilinx (2018). Vivado Design Suite User Guide: High-Level Synthesis. Xilinx.
Xiong, Q., Bangalore, P., Skjellum, A., and Herbordt, M. (2018). MPI Derived
Datatypes: Performance and Portability Issues. In EuroMPI’18: Proceedings of
the 25th European MPI Users’ Group Meeting. doi: 10.1145/3236367.3236378.
Xiong, Q. and Herbordt, M. (2017). Bonded Force Computations on FPGAs. In
2017 IEEE 25th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pages 72–75. doi: 10.1109/FCCM.2017.49.
Xiong, Q., Skjellum, A., and Herbordt, M. (2018). Accelerating MPI Message Matching
Through FPGA Offload. In 2018 28th International Conference on Field
Programmable Logic and Applications (FPL), pages 191–1914. doi:
10.1109/FPL.2018.00039.
Xiong, Q., Yang, C., Haghi, P., Skjellum, A., and Herbordt, M. (2020). Accelerating
MPI Collectives with FPGAs in the Network and Novel Communicator Support.
In Proceedings of the IEEE Symposium on Field Programmable Custom Computing
Machines.
Xiong, Q., Yang, C., Patel, R., Geng, T., Skjellum, A., and Herbordt, M. (2019).
GhostSZ: A Transparent SZ Lossy Compression Framework with FPGAs. In
2019 IEEE 27th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pages 258–266. doi: 10.1109/FCCM.2019.00042.
Yang, C. (2019). High-Performance Communication Infrastructure Design on FPGA-
Centric Clusters. PhD thesis, Department of Electrical and Computer Engineering,
Boston University. https://open.bu.edu/handle/2144/38207.
Yang, C., Geng, T., Wang, T., Patel, R., Xiong, Q., Sanaullah, A., Lin, C., Sachdeva,
V., Sherman, W., and Herbordt, M. (2019a). Fully Integrated FPGA Molecular
Dynamics Simulations. In SC ’19: Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis, pages 1–31. doi:
10.1145/3295500.3356179.
Yang, C., Geng, T., Wang, T., Sheng, J., Lin, C., Sachdeva, V., Sherman, W.,
and Herbordt, M. (2019b). Molecular Dynamics Range-Limited Force Evalu-
ation Optimized for FPGA. In 2019 IEEE 30th International Conference on
Application-specific Systems, Architectures and Processors (ASAP), pages 263–271.
doi: 10.1109/ASAP.2019.00016.
Yang, C., Sheng, J., Patel, R., Sanaullah, A., Sachdeva, V., and Herbordt, M. (2017a).
OpenCL for HPC with FPGAs: Case Study in Molecular Electrostatics. In 2017
IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8. doi:
10.1109/HPEC.2017.8091078.
Yang, C., Sheng, J., Sridhar, A., Herbordt, M., Nicoloff, C., and Battat, J. (2017b).
An FPGA-based Data Acquisition System for Directional Dark Matter Detection.
In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pages
1–8. doi: 10.1109/HPEC.2017.8091079.
CURRICULUM VITAE
