Hierarchical and scalable bus architecture generation on FPGAs with high-level synthesis by Chen, Ying
c© 2015 Ying Chen
HIERARCHICAL AND SCALABLE BUS ARCHITECTURE
GENERATION ON FPGAS WITH HIGH-LEVEL SYNTHESIS
BY
YING CHEN
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2015
Urbana, Illinois
Adviser:
Associate Professor Deming Chen
ABSTRACT
This thesis presents and evaluates a bus-based system for FCUDA, a trans-
lation tool enabling CUDA code to be run on FPGAs. With the goal of
constructing a solid light-weight back-end with optimized performance, we
choose AXI4 as the communication protocol and plug in all necessary com-
ponents on a hierarchical bus system. Then, FCUDA cores are added in the
back-end and the comprehensive system is automated into a single tool chain.
Several optimizations are added in this automated FCUDA bus system for
the delivery of better performance. For example, FCUDA cores are tiled
into clusters based on configuration inputs, and clock domains are separated
to reduce long wires. For the experiments, this work adjusts the existing
resources and period models and enhances the system latency model by in-
corporating off-chip memory communication latency. The system is proved
to be light-weight based on post-routing resource reports. Design space ex-
ploration among multilevel granularity parallelisms is performed to get the
system’s best performance, with which a comparison with GPU is made. Our
system can achieve at most 2.08 performance improvement when compared
with the execution latency on GPU.
ii
ACKNOWLEDGMENTS
The completion of this thesis and my graduate work at the University of
Illinois would not have been possible without the support and assistance of
the following individuals and groups.
I will always value Prof. Deming Chen as my advisor and am especially
grateful to Prof. Chen for his continued support, encouragement, and guid-
ance throughout my time as a master student. His positive influence on my
professional and insistent personal development will continue to benefit me
throughout my life.
Many thanks also to Eric Liang of Peking University and Tan Nguyen,
who invested significant time and energy into helping my project succeed. I
would also like to thank Dr. Kyle Rupnow and Swathi Gurumani of ADSC
for their advice and suggestions on this thesis.
I am also indebted to the other members of Prof. Chen’s research group
for ideas, comradery, and quite a lot of experience sharing as a graduate
student. In particular, thanks to Anand Ramachandran, Chen-Hsuan Lin,
and Yi Liang who were occasional sounding boards for my project. Many
thanks to Heo Yun and Keith Campbell, our wonderful server maintainer
who ensures that everything runs smoothly on CAD, our great server.
Finally, I am forever grateful to my family, for cheering me up and keeping
me sane; to my husband, Mingcheng, for his constant love and encouraging
me to hang on until difficulties get solved; to my friends, for helping me to
enjoy my life as a graduate student.
iii
TABLE OF CONTENTS
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . v
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . 1
CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . 3
2.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 High-Level Synthesis and Multi-Level Paralleled FCUDA . . . 4
2.3 On-Chip Buses . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Communication Architectures and AMBA Bus . . . . . . . . . 6
2.5 AXI4 Compliant Design Methodology in Xilinx Platform . . . 7
2.6 How the AXI in Vivado Design Suite Works . . . . . . . . . . 8
CHAPTER 3 IMPLEMENTATION . . . . . . . . . . . . . . . . . . . 15
3.1 Bus Connected FCUDA System Overview . . . . . . . . . . . 16
3.2 Traffic on Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Integrating FCUDA-Generated Code with AXI Bus . . . . . . 20
3.4 Hierarchies in Bus System . . . . . . . . . . . . . . . . . . . . 28
3.5 Placement of Components and Clock Domain Optimization . . 31
3.6 Bus Communication Verification . . . . . . . . . . . . . . . . . 35
3.7 Automated Bus-Based FCUDA Tool Flow . . . . . . . . . . . 38
3.8 System Debug . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
CHAPTER 4 EXPERIMENTAL RESULTS . . . . . . . . . . . . . . 47
4.1 Design Space Exploration . . . . . . . . . . . . . . . . . . . . 48
4.2 Comparison with GPU . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Block Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
CHAPTER 5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Future Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
iv
LIST OF ABBREVIATIONS
BRAM Block RAM
GPU Graphics Processing Unit
FPGA Field Programmable Gate Array
AXI Advanced eXtensible Interface
SoC System-on-a-Chip
CUDA Compute Unified Device Architecture
IP Intellectual Property
IC Integrated Circuits
CA Communication Architecture
AMBA Advanced Microcontroller Bus Architecture
LUT Look-Up Table
CUDA Compute Unified Device Architecture
HLS High-Level Synthesis
DSP Digital Signal Processing
FF Flip Flop
CLB Configurable Logic Block
ASMBL Advanced Silicon Modular Block
P & R Place & Route
Tcl Tool Command Language
RTL Register Transfer Level
ML-GPS Multilevel Granularity Parallelism Synthesis
v
CHAPTER 1
INTRODUCTION
Fast, inexpensive processors are essential for nearly all human endeavors.
During the past several years, there has been a dramatic shift in comput-
ing trends toward concurrent computation. For decades, integrated circuits
scaled in size according to Moore’s law [1]. More and more transistors with
shrinking size are crammed on a single integrated circuit. While Moore’s law
continues to hold true, manufacturers have faced significant challenges in
power [2] and heat dissipation due to the breakdown of Dennard scaling [3].
The end of dramatic exponential growth in single-processor performance
marks the end of dominance of the single microprocessor in computing. The
era of sequential computing gives way to a new era in which parallelism
enabled by multicore processors is at the forefront [4]. While the chip makers
turn to multicore processors [5, 6], the fundamental problem of reaching limits
of the Dennard scaling is not solved [7, 8]. Also, multithreaded programs for
multicore processors are notoriously difficult to write [9]. Additional cost and
effort are required to solve problems like competition between threads, which
prohibit the production significantly compared to sequential programming
models [10].
In light of the rapidly growing interest for parallelism in a wider and
coarser level than feasible traditional processors, the potential of GPUs and
FPGAs has been realized. NVIDIA’s CUDA [11, 12], provides a program-
ming model for their GPUs which consist of hundreds of processing cores.
Intensive compute loads can be handled with a high degree of data-level
parallelism. FPGAs, on the other hand, offer efficient application-specific
parallelism extraction through the flexibility of their reconfigurable fabric.
1
However, FPGAs’ development requires programmers with in-depth hard-
ware expertise and the debugging can be very painful. FCUDA [13] is a
design flow that enables CUDA programmers to easily exploit the inherent
parallelism of an FPGA using ordinary CUDA code.
This thesis presents an automatically generated, intrinsically scalable, and
light-weight AXI-based bus backend for FCUDA which enables efficient com-
munication between FCDUA cores and other devices (such as off-chip mem-
ory). With such a solid backend, FCUDA can be used as a real system for
computation with high efficiency. Also, our analysis of the system’s band-
width and latency provides the basis for new promising optimization oppor-
tunities. The goal of this thesis is to fit FCUDA generated C code into a real
communication system with high-performance delivery.
2
CHAPTER 2
BACKGROUND
This chapter provides a brief background on related topics of our bus-based
system including FPGAs, multilevel paralleled FCUDA, on-chip buses, and
AXI working mechanisms.
2.1 FPGAs
A field-programmable gate array (FPGA) is an integrated circuit (IC) de-
signed to be configured by a customer or a designer after manufacturing.
Before FPGAs were widely adopted, application-specific integrated circuits
(ASICs), had long been the manufacturing technology of choice for custom
digital hardware. With hundreds of millions of gates, an ASIC can be de-
signed to implement any arbitrary design, with the benefits of low cost at very
high scale [14]. However, ASICs face many limitations, including high non-
recurring expenses (NRE) costs, increasing mask costs [15], long turnaround
time and time-to-market. Manufacturing ASICs can be risky: if a bug is
found in an ASIC design or any middle step after fabrication, all the taped
out chips have to be thrown away and completely new devices must be fab-
ricated.
Compared with ASICs, FPGAs overcome many of the barriers by enabling
users to quickly design and reconfigure one piece of hardware, without the
high initial cost of ASIC designs. Therefore, after FPGAs became commer-
cially viable in the mid 1980s [16], ASICs have been increasingly challenged
3
while FPGAs embraced its explosive period of time. In 1990s, FPGAs have
found their way to consumers in various industrial applications [17].
Despite the advantages such as faster time-to-market, simpler design cycle,
no NRE costs, field programmability, and more predictable project cycle,
FPGAs do have some disadvantages when compared to ASICs, such as higher
unit cost at volume, significant power/area drawbacks, and lower speed [18].
For designers, the architecture of FPGAs forces difficult trade-offs. Certain
hardware structures in FPGAs are plentiful, while others are limited to a
significant degree. For different applications, the resource requirements vary
a lot. Thus, a designer must take care to take advantage of the strengths of
the FPGA platform while working around the weaknesses [14].
2.2 High-Level Synthesis and Multi-Level Paralleled
FCUDA
Programming FPGAs often requires hardware design expertise, as using
hardware description language (HDL) involving the RTL level can be er-
ror prone and hard to debug. A simple function in C level with only several
lines of code turns out to be a very big project on FPGAs using HDL and
requires several months of debugging. The growing capabilities of silicon
technology and the increasing complexity of applications in recent decades
have forced design methodologies and tools to move to higher abstraction
levels [19].
High-Level Synthesis (HLS) translates applications users written in high-
level languages to hardware designs on FPGAs [20]. Such a synthesis greatly
improves the designer’s productivity by increasing the abstraction level at
which the designer works. Some new HLS flows, in addition to reducing the
time for creating the hardware, also help reduce the time to verify it as well
as facilitate other flows such as power analysis [21].
4
FCUDA [13, 22] is a high-level-synthesis design flow that compiles CUDA [11]
kernels into RTL for FPGAs using a commercial tool, Vivado HLS [23] (pre-
viously AutoPilot [24]). Multi-level parallelisms are extended on the baseline
of FCUDA, including core, thread, and on-chip-memory levels [22]. Among
all these levels, design space exploration is performed in order to find the
best configuration in terms of performance delivered.
The output of FCUDA flow is a set of computation cores at the RTL level
onto which the original computation is mapped. Using FCUDA, program-
mers familiar with CUDA can easily implement an application on FPGAs and
efficiently utilize the on-board resources by doing a design space exploration.
2.3 On-Chip Buses
Shrinking process technologies and increasing design sizes have led to highly
complex billion-transistor ICs. Driven by Moore’s law, manufacturers are
integrating an increasing number of diversity components on a single chip. A
heterogeneous SoC may contain one or more programmable components such
as general-purpose processors, digital signal processors cores, application-
specific intellectual property (IP) cores, as well as an analog front end, on-
chip memory, I/O devices, and other application specific circuits. With the
extraordinary advantage of high-speed capability, low resource cost and low
power consumption, SoCs are very important in many application areas, such
as the mobile electronics market and many other embedded systems.
On-chip bus organized communication architecture (CA) is among the top
challenges in CMOS SoC technology due to rapidly increasing operation fre-
quencies and growing chip size. In general, the performance of the SoC
design heavily depends upon the efficiency of its bus structure. The balance
of computation and communication in any application or task is one of the
most important factors that affect the system’s delivered performance [25].
Traditionally, buses are handcrafted carefully to meet the needs of a narrow
5
target market, or to provide support for a specific processor.
IP cores, as constituents of SoCs, are usually designed with many dif-
ferent interfaces and communication protocols. Integrating such cores in a
SoC often requires insertion of much sub-optimal “glue” logic. To avoid this
problem, standards of on-chip bus structures were developed [25]. Currently,
there are several publicly available bus architectures from leading manufac-
turers, such as Core-Connect from IBM [26], AMBA from ARM [27, 28],
Silicon Backplane from Sonics [29], and many others. As bus architectures
are usually tied closely to processors’ architectures, manufactures provide
cores optimized to work with their bus architectures, such as the PowerPC
or the ARM processor. Therefore, minimal extra interface logic is needed,
and the burden of system designers is greatly relieved.
2.4 Communication Architectures and AMBA Bus
All the designs of on-chip CAs address the following three issues [30, 31]:
1. Definition of CA topology - defines the physical structure of the CA.
There are numerous CA typologies, ranging from a single shared bus to
more complex architectures such as bus hierarchies, token ring, cross
bar, or custom networks.
2. Selection and configuration of the communication protocols - for each
channel/bus in the CA, communication protocols specify how each com-
munication transaction occurs. One of the most important transaction
manners specified in the protocols is the arbitration mechanism, such as
round robin access, priority-based selection, time division multiplexed
access. Also, these arbiters can be implemented either centralized or
distributed in the bus.
3. Communication mapping - refers to the process of associating abstract
6
system-level communications with physical communication paths in the
CA topology [30].
Advanced microcontroller bus architecture (AMBA) [32, Chapter 20] is a
bus standard devised by ARM with the aim to support efficient on-chip com-
munications among ARM processor cores. Nowadays, AMBA is one of the
leading on-chip busing systems used in high-performance SoC designs. Two
AMBA bus segments, system-bus and peripheral-bus, hierarchically organize
the whole system into two parts. A bridge that buffers data and operations
between these two parts mutually connects the two parts together. AMBA
specifications define the standard bus protocols of the on-chip components’
connections for different SoC structures. The first-generation AMBA bus
is the advanced system bus (ASB). It is used for simple cost-effective de-
signs that support burst transfer operation, pipe-lined transfer operation,
and multiple bus masters. Advanced high-performance bus (AHB), as a
later generation of the AMBA bus is intended for high-performance high-
clock synthesizable designs. Another bus specified within AMBA bus is the
advanced peripheral bus (APB), which is used to connect general-purpose
low-speed low-power peripheral devices.
2.5 AXI4 Compliant Design Methodology in Xilinx
Platform
AMBA AXI is a relatively new bus specified in AMBA. The first version of
AXI was first included in AMBA 3.0, released in 2003. AMBA 4.0, released
in 2010, includes the second major version of AXI, AXI4. Currently, AXI
is the most widespread AMBA interface. It can connect up to hundreds of
masters and slaves in complex SoCs.
The AMBA AXI protocol is an open standard, on-chip interconnect speci-
fication for the connection and management of functional blocks in a system-
7
on-chip (SoC). Featured by higher productivity, greater flexibility, broader
IP compatibility, it is well-supported by considerable EDA vendors and has
driven a comprehensive market in third-party IP products and tools for de-
velopment.
Xilinx, well known as for inventing the FPGAs and the developer of various
RTL design tools, introduced AXI4 interfaces in its Vivado Design Suite
2014 and ISE Design Suite 14 [33] to provide the benefits of productivity,
flexibility, and availability. AMBA AXI4, the fourth generation of the AMBA
interface specification from ARM, contains three types of AXI4 interfaces.
The first one is AXI4, which is used for high-performance memory-mapped
requirements. The second one is AXI4-Lite. This is an interface that can
be added for simple, low-throughput memory-mapped communication. The
last one is AXI4-Stream and it is designed for high-speed streaming data.
2.6 How the AXI in Vivado Design Suite Works
The AXI Interconnect cores we are using in the system is reconfigured from
Vivado AXI Interconnect IP. Based on reference of Xilinx [34], this section
gives a brief introduction of how AXI works in Vivado Design Suite platform.
2.6.1 Memory-Mapped Protocols and AXI4-Stream Protocol
The protocols of the three AXI interfaces use are different: memory-mapped
protocol and AXI-Stream protocol. The first one is used by AXI and AXI4-
Lite, while the latter one is used by AXI4-stream.
In memory mapped AXI protocol, all transactions with data to be trans-
ferred are interacting with a target address within the system’s memory
8
space [34]. When data is sent from the master to a slave, the slave’s unique
address space is added as a target base address to the destination address.
When the response is sent back, ID of the waiting master is appended as the
base target address.
Such a target-address-based transaction mode where all the IPs operate
around a defined memory space makes the system a homogeneous one. In
contrast, AXI4-Stream protocol is used for applications that typically focus
on a data-centric and data-flow paradigm where the concept of a target
address is not present or not required. Each AXI4-Stream acts as a single
unidirectional channel for a handshake data flow [34].
Compared to memory mapped AXI protocol types, AXI-Stream’s mecha-
nism of moving data between IPs is defined and efficient, but there is no uni-
fying address context between IPs, which makes it more specialized around
some given applications.
In our applications, IPs are translated from CUDA to C source code, all
the transactions are specified by the master cores in either burst or standard
mode. Addresses are clearly specified in each transactions, therefore AXI4
protocol is chosen for our bus system.
2.6.2 AXI4 Interface between Single Master and Single Slave
This part gives a basic overview of the interface between a single master and
a single slave in memory mapped AXI protocol. This is the simplest case of
the protocol, as well as the building block of our bus system.
Five different channels are used for AXI4 interfaces: Read Address Chan-
nel, Write Address Channel, Read Data Channel, Write Data Channel, and
Write Response Channel. Figure 2.1 shows how an AXI4 Read transaction
uses the Read address and Read data channels.
9
Figure 2.1: Channel Architectures of Reads
Figure 2.2: Channel Architectures of Writes
Figure 2.2 shows how an AXI4 write transaction uses the Write address
channel, Write data channel and Write Response channel.
10
As shown in preceding figures, AXI4 provides separate data and address
connections for Reads and Writes, which allows simultaneous, bidirectional
data transfer. Besides standard read or write mode, there is also a burst
mode, under which AXI4 requires a single address and then bursts up to
256 words of data. Besides burst, a variety of options are enabled in Vi-
vado Design Suite allowing AXI4-compliant systems to achieve higher data
throughput. Some of these features, in addition to bursting, are: data up-
sizing and down-sizing, pipeline, and so on. We will describe a couple of
these techniques later in this chapter.
2.6.3 AXI Interconnect Core Features
AXI4 Interconnect IP cores connect one or more AXI memory-mapped mas-
ter devices to one of more memory-mapped slave devices. When connecting
one master to one slave and not performing any conversions or address range
checking, the AXI Interconnect core is implemented as wires, with no re-
sources, no delay and no latency. When there are multiple masters or slaves,
arbiter or decoder will be added.
In many cases, an AXI Interconnect IP connects cores with different output
different width or different clock rate. According to the needs of different ap-
plications, different conversion logic will be used in the AXI Interconnect IP,
such as data width conversion, clock rate conversion, AXI-Lite slave adap-
tation, and so on. These additional conversion functions are optional and
only used when the master and slave cores have mismatches in between.
Figure 2.3 illustrates a top-level AXI Interconnect core with several optional
logic inside.
Width Conversion In Figure 2.3, both up-sizers and down-sizers belong
to width conversion logic. MI stands for master interface and SI stands for
slave interface.
11
Figure 2.3: Top-Level AXI Interconnect [34]
The AXI Interconnect core has a parametrically-defined, internal, native
data-width that supports 32, 64, 128, and 256 bits. The AXI data channels
that span the crossbar are sized to the “native” width of the AXI Interconnect
core [34].
When any of the SI slots or MI slots is sized differently, the AXI Inter-
connect core inserts width conversion units to adapt the slot width to the
AXI Interconnect core native width before transmitting the crossbar to the
other hemisphere. All MI and SI slots have associated individual parametric
data-width values. The AXI Interconnect core adapts each MI and SI slot
automatically to the internal native data-width as follows:
• When the data width of an SI slot is wider than the internal native data
width of the AXI Interconnect, a down-sizing conversion is performed
along the pathways of the SI slot.
• When the internal native data width of the AXI Interconnect core is
wider than that of an MI slot, a down-sizing conversion is performed
along the pathways of the MI slot.
• When the data width of an SI slot is narrower than the internal na-
tive data width of the AXI Interconnect, an up-sizing conversion is
12
performed along the pathways of the SI slot.
• When the internal native data width of the AXI Interconnect core is
narrower than that of an MI slot, an up-sizing conversion is performed
along the pathways of the MI slot.
When down-sizers and up-sizers are used in pathways, data serialization or
data merging occurs according to system needs for both reads and writes.
Clock Conversion Clock conversion comprises the following [34]:
• Clock-Rate Reduction: An integer division (N:1) of the clock-rates is
performed from its input (SI) to its output (MI) side.
• Clock-Rate Acceleration: An integer multiplication (1:N) of the clock-
rates is performed from its input (SI) to output (MI) side.
• Asynchronous Clock Conversion: An asynchronous clock conversion
module performs either reduction or acceleration of clock-rates by pass-
ing the channel signals through an asynchronous FIFO.
In a system organized by AXI Interconnect core into masters and slaves,
three clock vector inputs are tied, one is the clock rate for master domain,
one is for slave domain, the other one is for AXI Interconnect cores. The
AXI Interconnect core adapts the clock rate of each MI and SI slot automat-
ically to the native clock rate of the AXI Interconnect. Usually, the three
clock inputs are centralized and controlled by a single clock & reset synchro-
nize processor IP and they use the same clock rates. However, sometimes,
different clock domains are necessary and clock conversion will be used.
Pipelining Pipelining is a common technique used for improving through-
put at the expense of a little increased latency. AXI Interconnect also uses
13
pipelining by buffering data bursts for throughput improvement. This is usu-
ally adopted when the data rate at an SI or MI slot differs from the native
data rate of the AXI Interconnect core due to data width or clock rate con-
version. Besides pipelining buffers, peripheral register slices and data path
FIFOs are also commonly used logic to accommodate the data flow rate or
improve system timing.
14
CHAPTER 3
IMPLEMENTATION
With only FCUDA generated C code, we can neither guarantee the correct-
ness of FCUDA cores at the RTL level, nor do we know the real performance
delivered by FCUDA. For this work, we set up a comprehensive AXI4-based
bus system for FCUDA by adding all the necessary components as well as
their proper communication interfaces.
The AXI4 bus system with multi-hierarchies and memory-mapped proto-
cols is used as the communication architecture to support a large enough
number of cores. A round-robin arbitration mechanism is used to ensure the
fairness among cores and therefore, achieve the best performance.
Ports are combined and pragmas are added on the memory interface during
Vivado HLS synthesis. When FCUDA generated C code is translated into
the RTL design, multiple ports of each core are combined and an AXI master
wrapper is added to accommodate AXI protocols. Then, after being exported
as IPs, FCUDA generated C cores are ready to be connected to the bus
system as AXI4 masters.
Several enhancements are added in the system, such as clock domain divi-
sion and placement constraints, to consolidate and improve our bus system.
Finally, we automate FCUDA compilation together with the bus back-end
into an unified tool chain with the help of scripts. The tool chain flow includes
front-end CUDA to C compilation, Vivado HLS synthesis, batch script mode
system set up in Vivado, and synthesis or simulation launching. During the
15
automatic workflow process, manual configuration inputs are barely needed,
which greatly releases the burdens of users.
3.1 Bus Connected FCUDA System Overview
Among all the various communication architectures, our back-end system
uses bus as on-chip communication infrastructure to connect all components
and arbitrate all the requests for communicating with off-chip memory, since
bus is enough for a system with less than 100 cores like ours and it is light
weighted compared with others, such as NOC.
The bus we use is based on AMBA Advanced eXtensible Interface 4 (AXI4)
protocol, which is the fourth generation of the AMBA interface specification
from ARM. Using Xilinx Vivado Design Suite with AXI4 Compliant Plug-
and-Play IP extension as the develop environment, a broad set of AXI4-based
IPs with a single open standard interface across the Embedded, DSP, and
Logic domains are offered, thus facilitate a more convenient, higher produc-
tive and more flexible development.
To support FCUDA computation cores work correctly, the most important
thing is to make the cores communicate effectively and efficiently with the
off-chip storage. Figure 3.1 shows an overview of our bus system.
DDR3 SDRAM, a modern type of dynamic random-access memory with
a high bandwidth interface, is embedded in most of Xilinx FPGAs. Consid-
ering its large storage and fast speed of communication, we choose DDR3
as the off-chip storage device in our system. The DDR3 is connected to
FPGA where our FCUDA cores are built in through an AXI wrapped mem-
ory controller, which consists of all the necessary components for processing
the “fetch” or “store” requests it has received. A memory controller deals
with all processing for memory operation including DDR3 address decoding,
16
Figure 3.1: Bus System Designed for FCUDA
data channel selection, DDR3 activation and charging, and so on. Outside
the memory controller, an AXI slave wrapper which is compliant with the
communication protocol in our system is wrapped. Such a slave wrapper
packs all the input and output signals into an AXI slave interface so that the
whole memory part serves as a slave in the bus system, detecting requests
sent from masters and sending back responses.
Bus connections inside the system is divided into two hierarchies, which we
call Bus Hierarchy 1 and Bus Hierarchy 2. Both of these two bus hierarchies
are composed of several connected AXI IP infrastructures. FCUDA cores are
the masters of this system. As shown in Figure 3.1, the cores in each red dash
line box are grouped as a tile. In each tile, an AXI bus interconnect IP from
the second level of bus hierarchy is used to arbitrate requests sent from (or
responses sent to) the cores in this tile. All the AXI bus interconnect IPs in
these tiles form a lower level of our system. Then, all these interconnect IPs
are organized by another AXI bus interconnect IP, which is the higher level
of our system. The AXI IP from this higher level arbitrates all the requests
from (or processed responses to) each tile. The goal of this hierarchical
bus structures ensures that at one time, there is only one message being
17
transferred, as required by the bus.
3.2 Traffic on Bus
Each FCUDA core keeps executing until the task is finished. During the
execution, requests for the memory to fetch data or store results are contin-
uously sent through the AXI interface to the bus. Usually, a large number of
cores are initiated in the system for reducing latency. While the bus can be
occupied by only one request or response at a time, all the cores are sending
out requests and waiting for responses whenever they need to.
To solve this conflict, FIFOs for buffering unprocessed requests and re-
sponses are used inside the bus for each master and slave interface, as shown
in Figure 3.2.
In Figure 3.2, we use a four-core system as an example. Buffers to store
requests are added for masters as well as slaves. The arbiter inside bus
system scans each buffer to detect whether there are requests waiting to be
processed. Similarly, for the memory slave side, a buffer is added to store the
processed responses until it is its turn to use the bus and send the responses
back.
We use round robin arbitration. The reason we choose round robin as the
strategy is twofold.
1. All cores get started at the same time. However, the finish time of the
system is the time that all the cores finished their execution. Based on
the task split mechanism in our CUDA to C compilation, all the cores
execute the same code in each thread block loop. In other words, it
takes almost the same time for all the cores to generate the requests.
When all of these cores are waiting for the response to continue their
18
Master 
request buf0
Round_Robin Arbitrators in multiple AXI Interconnect IP
request
/respond
buf
MC
DDR3
Bus System
Master 
request buf1
Master 
request buf2
Master 
request buf3
Slave
respond buf0
AXI wrapper
core
0
AXI wrapper
core
1
AXI wrapper
core
2
AXI wrapper
core
3
Figure 3.2: Buffers for Master and Slaves inside Bus
computation, we need to ensure fairness among them so the total time
would be shortest. Otherwise, it is very possible that serialized execu-
tion happens: some cores always receive faster responses while others
always wait. The total performance delivered by the whole system
depends on the slowest core.
2. Resource consumption is less with a round robin arbiter. The round
robin arbitrator is easy to implement. It scans the buffers of all the con-
nected master interfaces and processes one stored request each round
without considering other factors. Using such a simple arbitrator leads
to the result of a smaller resource consumption. The resource is very
valuable when we optimize the performance of the FCUDA system.
The more resource left on FPGA for us to use, the more cores or more
threads that can be instantiated and run simultaneously. Hence, a
shorter time will be taken by our system to finish the task.
Each request carried on the bus is tied with a unique ID, indicating which
19
master it is sent from. Also, all the slaves on the bus have their own unique
address space. When a master sends out a request, it attaches the target
address information to direct the request to the proper slave’s space. Here
in our system, it is set to be the memory address starting from 0x80000000,
which is for the off-chip memory slave in our system memory map. Once
the request is received by the memory controller, the AXI slave wrapper
buffers its ID. After the memory controller and DDR3 finish all processing
and have its response prepared, the slave wrapper ties the buffered ID onto
the response and sends it back to its original master via the bus.
3.3 Integrating FCUDA-Generated Code with AXI
Bus
The FCUDA compiler generates C code of multiples cores into a single func-
tion. Putting such C code directly through Vivado HLS synthesis cannot
generate AXI master RTL cores for the bus system. Preprocessing steps for
FCUDA cores are necessary: core control decentralization, interface protocol
conversion and ports combination.
3.3.1 Decentralized Core Control
The generated C code of FCUDA is a C file with a single top-level function
containing all the cores’ code. Putting such C code into Vivado HLS will get
us a single opaque top-level module that includes all the cores’ computation
logic. This can be viewed as a centralized control which hides cores from
outside accesses shown as Figure 3.3.
However, in our bus-based multi-core system, all FCUDA cores need to
work independently and separately with their own connections to other com-
20
Centralized Control
FCUDA
Core
FCUDA
Core
FCUDA
Core
FCUDA
Core
FCUDA
Core
FCUDA
Core
Figure 3.3: Centralized Control of FUCDA Cores
ponents via different interfaces on the bus. Therefore, in order to instantiate
each core individually, we decentralized the control of FCUDA cores by sep-
arating the single top-level function into multiple functions, one for each
core.
For example, for a system with four cores inside, codes are shown as Fig-
ure 3.4. Each core has a unique function name formed by an application
name and a number representing its core ID.
The decentralized core codes reside in one C file to remain neat and clear.
To generate IP core for each function, Vivado HLS is invoked on this C file
multiple times, each of which is done with a different top function name,
until every core gets synthesized and exported as an independent IP.
21
void matrixMul_core0(/* args */) {
#pragma HLS interface ap_bus port=memport
#pragma HLS RESOURCE variable=memport core=AXI4M
for (loop = 0; loop < gridDim.x; loop +=4) {
blockIdx.x = loop + 0;
//computation using blockIdx.x
}
}
void matrixMul_core1(/* args */) {
#pragma HLS interface ap_bus port=memport
#pragma HLS RESOURCE variable=memport core=AXI4M
for (loop = 0; loop < gridDim.x; loop +=4) {
blockIdx.x = loop + 1;
//computation using blockIdx.x
}
}
void matrixMul_core2(/* args */) {
#pragma HLS interface ap_bus port=memport
#pragma HLS RESOURCE variable=memport core=AXI4M
for (loop = 0; loop < gridDim.x; loop +=4) {
blockIdx.x = loop + 2;
//computation using blockIdx.x
}
}
void matrixMul_core3(/* args */) {
#pragma HLS interface ap_bus port=memport
#pragma HLS RESOURCE variable=memport core=AXI4M
for (loop = 0; loop < gridDim.x; loop +=4) {
blockIdx.x = loop + 3;
//computation using blockIdx.x
}
}
Figure 3.4: Decentralized Control FCUDA-Generated C Code
3.3.2 Interfaces Protocol Conversion
Signals sent from raw FCUDA cores are intended for direct connection with a
memory controller instead of a bus. These signals can be roughly categorized
22
into four types:
• Control signals - control the behavior of all components in the system,
for example, signals indicating a core is starting a read, and signals
indicating responses should be written to on-chip memory.
• Status signals - the status inside cores. Some operations can only be
activated once the expected status is detected. An example is that
receiving responses from memory must happen after a status signal
indicates that the inside buffer for storing is not full, otherwise, received
responses would be lost.
• Address data - address channel sending out raw address data for DDR3.
• Read or write data - data information that needs to be stored in DDR3,
or the data fetched from DDR3 which is to be used for computation.
An AXI interface is necessary for connecting FCUDA cores on the bus. To
uniform all these signals in the right time order as AXI protocol, we generate
a “wrapper” outside each FCUDA core by inserting Vivado HLS pragmas
to C source code before synthesis from C to RTL. This wrapper consists of
logic that encodes different formats of information or messages so it can be
understand by the bus system.
Creating RTL Port The first step in creating a bus interface is to create
an RTL port which supports the required bus interface. In Vivado HLS, an
ap bus interface is used to communicate with a bus bridge. The interface is
generic enough to be used with a bus bridge which arbitrates with the system
bus, instead of adhering to any specific bus standard [23].
An ap bus interface can be used in two ways:
• Standard Mode: The standard mode of operation is to perform indi-
vidual read or write operation, specifying the address of each.
23
// Function for fetching matrix A’s data from memory to FCUDA
core0 in standard read mode.
void matrixMul_fetchAs_core0(DATATYPE * A, ...) {
for (threadIdx.y=0;threadIdx.y<blockDim.y ;
threadIdx.y=threadIdx.y+1)
for (threadIdx.x=0;threadIdx.x<blockDim.x ;
threadIdx.x=threadIdx.x+1) {
As[ty_block[threadIdx.y][threadIdx.x]]
[tx_block[threadIdx.y][threadIdx.x]]=
A[((a+(wA*ty_block[threadIdx.y][threadIdx.x]))
+tx_block[threadIdx.y][threadIdx.x])];
}
}
Figure 3.5: Genearetd C Code with Standard Data Fetching
• Burst Mode: In burst mode, the base address and the size of the trans-
fer are indicated by the port. The data will be quickly transferred in
consecutive cycles. Burst mode is activated if “memcpy” function is
used in the C source code.
We take the data fetch in matrix multiplication benchmark as an example.
Figure 3.5 is the function of fetching a tile of matrix A’s data from off-chip
DDR3 to on-chip BRAM in standard read mode, as a normal assignment
syntax. After adding an ap bus pragma on the data port A, the data will be
fetched one at a time with source address and destination address explicitly
specified.
Figure 3.6 shows the function of fetching a tile of data in matrix A from off-
chip DDR3 to on-chip BRAM. When ap bus is applied to argument pointer
A, the “memcpy” function operated on A will be translated into a burst read
in Verilog with the size of 16. Each burst read fetches 16 consecutive data
values starting from the source address, A + fetchAs As X 0 + threadIdx.y
* fetchAs As c 1, in DDR3 to port A of FCUDA cores. Then the 16 data
values will be stored into 16 consecutive places in the on-chip BRAM starting
from destination address, fetchAs As offset + As[threadIdx.y]. Meanwhile,
24
// Function for fetching matrix A’s data from memory to FCUDA
core0 in burst read mode.
void matrixMul_fetchAs_core0(DATATYPE * A, ...) {
// Variables defination and initialization.
if (enableSignal_fetchAs)
for (threadIdx.y = 0;threadIdx.y < blockDim.y ; threadIdx.y =
threadIdx.y+1)
memcpy(fetchAs_As_offset + As[threadIdx.y],
//destination starting address
A + fetchAs_As_X_0 + threadIdx.y * fetchAs_As_c_1,
//data source starting address
16 * sizeof (DATATYPE)); // size of data to fetch
}
Figure 3.6: Generated C Code with Burst Fetching
the next burst read will enter the request buffer and get launched when the
bus is free. Therefore, fetching the whole tile of data needs blockDim.y times
of burst read.
Typically, reading in burst mode is much faster than standard mode for
fetching or storing the same amount of data in consecutive addresses. This
is because bursts reduce address arbitration/control cycles and help keep the
memory controller in the same row, band, and read/write direction longer.
However, the down side of the burst mode read is that when an application
tries to use non-consecutive data in memory, redundant data has to be fetched
if the burst mode is used. For example, instead of fetching a matrix’s data
in row pattern, we want to fetch in a column pattern, which means our
target dataset is stored in separate places in DDR3. In such circumstances,
system bandwidth is wasted due to more traffic on the bus for transferring
consecutive data which has no use. In addition, on-chip BRAMs are also
wasted for storing unnecessary data. Both the system bandwidth and on-
chip BRAM are very limited. Wasting any of them could lead to a significant
degradation in performance.
25
AXI Interface The second step is adding an AXI interface. In addition to
the standard RTL ports like ap bus, ap hs, Vivado HLS can also automati-
cally add bus interfaces to an RTL design. The primary difference between
bus interfaces and RTL ports created by interface synthesis (ap hs, ap bus,
etc.) is that the bus interfaces are added to the design during the Export
RTL process [23], which is after C code gets synthesized. So the generated
Verilog code for the AXI Interface is not available in a project after synthesis.
And the synthesis report from Vivado HLS does not include the bus inter-
faces either. This is also why we do not use the resource report from Vivado
HLS, but instead use the report generated after the synthesis and P & R in
the Vivado Design Suite.
Several types of bus interfaces are available in Vivado HLS:
• AXI4 Lite Slave
• AXI4 Master
• AXI4 Stream
• PLB 4.6 Slave
• PLB 4.6 Master
• FSL
• NPI
Each bus interface depends on certain kinds of RTL interface types. For
example, the AXI4 Stream Interface can only be added to ap fifo, while the
AXI4 Master can only be applied to ap bus. In our system, FCUDA cores
are masters which are responsible for starting requests and then waiting for
responses from the off-chip memory. All pragmas can be added either in C
source files or as directives included in an independent Tcl file. Since analyz-
ing the communication ports among all the arguments for each application is
easier during source code translation, we insert pragmas to the proper argu-
ments in FCUDA compilation. Code with pragmas for both RTL ports and
bus interfaces added is shown as Figure 3.7.
26
// Top level function for core 0. DATATYPE can be any type.
void matrixMul_core0(int wA, int wB, DATATYPE * A, dim3 gridDim,
dim3 blockDim) {
#pragma HLS interface ap_bus port=A
#pragma HLS RESOURCE variable=A core=AXI4M
}
Figure 3.7: Pragmas Added for AXI Wrapper
// Top level function for core 0. DATATYPE can be any type.
void matrixMul_core0(DATATYPE * A, DATATYPE * B, DATATYPE * C,) {
//matrix A and B are two input matrixes. C is the result.
}
Figure 3.8: Generated C Code with Multi Ports
3.3.3 Memory Ports Combination
Many benchmarks have multiple arguments as pointers for arrays stored in
off-chip memory, corresponding to the case that CUDA programs have mul-
tiple arrays reside in GPU’s global memory. For example, in the generated
C code of matrix multiplication, there are two input matrices and one re-
sult matrix stored in off-chip memory. An intuitive way of function interface
implementation with an argument list is shown in Figure 3.8.
For code in Figure 3.8, Vivado HLS will add ap bus and AXI interfaces
for each of the data ports. As a consequence, there will be multiple separate
AXI master or slave interfaces for each core. Unfortunately, the bus system
does not scale well with an increasing number of ports, and the maximum
number of cores that can be initiated in the system is significantly affected
by the number of ports. Ports combination has to be performed.
There are two potential solutions to this problem. One solution is to build
an extra module which connects multiple AXI interfaces of a core into one
single interface. This extra external module is basically a system with an
arbiter and a multiplexer between the FCUDA core and the slave interface
27
// Top level function for core 0. DATATYPE can be any type.
void matrixMul_core0(DATATYPE * memport) {
//matrix A, B and C are stored after another.
DATATYPE *A = &memport[A_size];
DATATYPE *B = &memport[B_size];
DATATYPE *C = &memport[C_size];
}
Figure 3.9: Ports Combination at Source Code Level
on the bus. However, such a hardware solution is generally inflexible as well
as difficult to debug and maintain. Therefore, a software solution is adopted
in our system.
We choose a software solution that combines all the pointer ports into one
before synthesizing by Vivado HLS and adhering AXI interfaces. A simple
source code level transformation is sufficient to combine multiple ports into
one, as shown in the Figure 3.9.
We do such a transformation for each benchmark followed by adding all
the ports and interfaces described previously. Finally, the structure of an
FCUDA core is shown in Figure 3.10.
Now the FCUDA cores can be used to connect on the bus interface as a
working master.
3.4 Hierarchies in Bus System
The bus interconnection cores we use in the system are reconfigured from AXI
Interconnect IP in the Vivado Xilinx IP library. Each interconnect IP can
have at most 16 slave interfaces, which means that one AXI interconnect core
can connect with only up to 16 master cores. Compared to our system with
tens of FCUDA master cores, the connection ability of one AXI infrastructure
28
AXI master 
interface
RTL code 
translated from 
FCUDA 
generated C
Exported FCUDA IP
ap_bus added
AXI BUS
combined
memport
Figure 3.10: AXI Wrapped FCUDA Core and Its Connection to Bus
is far from enough.
Cascading more AXI interconnect cores can solve this problem. We use
a hierarchical bus interconnection network with two levels of cascaded AXI
cores, so that the system can be extended to a larger connection supporting
up to 16 ∗ 16 = 256 FCUDA cores. The hierarchies in the bus system is
shown in Figure 3.11.
Such a number of cores is enough for FCUDA applications and the hierar-
chical bus system is reasonable because of the following:
1. Due to the limitation of resources on FPGA, such as LUT, BRAM,
and Flip-Flop, barely any current FPGAs can support more than 200
FCUDA cores.
2. If more threads are still needed for performance improvement, loop
29
AXI Wrapped 
FCUDA CORE 0 
...
BUS IP0
Hier 2
BUS IP1
Hier 2
BUS IP1
Hier 2
BUS 
IP1
Hier 1
AXI Wrapped 
DDR3 Memory 
Controller 
AXI Wrapped 
FCUDA CORE N1 
DDR3
AXI Wrapped 
FCUDA CORE 
N1+1 
...
AXI Wrapped 
FCUDA CORE 
N2 
AXI Wrapped 
FCUDA CORE 
N1+1 
...
AXI Wrapped 
FCUDA CORE 
N2 
Cores in Tiles
Figure 3.11: Bus System Hierarchies of FCUDA
unroll and memory array partition are the best choices to reduce clock
cycles instead of simply instantiating more cores. Having more cores
will introduce long wires and complicated routing on the board. The
frequency of the system will be slower and the performance will be
affected.
3. Bus, as a light communication architecture, is fast, resource-efficient
and convenient for on-chip components. However, such an advantage
no longer exists when the system is enlarged to include more than
hundreds of cores. In that case, other communication architecture,
like NOC (Network-On-Chip), should be considered to achieve better
performance instead.
Despite the advantage of supporting more cores, another advantage of
our two-level hierarchical bus structure is providing more opportunities to
achieve better performance. This is because different hierarchical configura-
tions make substantial difference in terms of resources and frequency.
30
For example, suppose there are 16 cores to be placed in our system. The
system can be configured as following:
1. We use only one level and all cores connected to the single AXI Bus
Interconnect IP.
2. We can divide them into two tiles with eight cores in each tile.
3. We can even divide them into four tiles with four cores in each tile.
The difference between the three mentioned design choices is the trade-off
between the period and resources. The first one uses the fewest resources
since it uses only one Bus Interconnect IP, and the resources it saves can be
used to instantiate more cores or unroll more times for reducing clock cycles.
However, the fastest frequency it can meet is probably the lowest among the
three designs since many cores need to use long wires to connect with the
Bus IP, which are surrounded by many other cores. In contrast, the third
choice uses the most IP resources but due to fewer cores in each tile and
more Bus IPs for connection, it can run at a faster frequency with reduced
long wires and simpler routing.
3.5 Placement of Components and Clock Domain
Optimization
3.5.1 Placement Optimization
Placement with constraints of putting related components together can effec-
tively reduce long wires and complex crossings between wires, thus improve
the system’s running frequency. In Vivado, this can be implemented by plac-
ing Pblocks during P & R stage by assigning their resources with coordinates.
31
Tile0 Tile1
Bus Hier2
Tile2
Bus
Tile6
Bus
Tile7
Bus
Tile8
Bus
Tile3 Bus Tile5BusTile0BusBUS Hier 1
Bus Hier2 s Hier2
s Hier2Bus Hier2
Bus Hier2 s Hier2Bus Hier2
Figure 3.12: Ideal Placement of Core Tiles
In our FCUDA tiled system design, what we would like to achieve is that
cores from the same tile are placed closely, and different tiles are separated
into different non-overlapped placement blocks. This is easy to understand
since cores in one tile need to be connected to the same bus IP. If all the
cores are mixing together as would occur if default placement by Vivado was
used, long wires will be used to connect cores to its corresponding bus IP.
Take a system with nine tiles as an example. The ideal placement of core
tiles is shown in Figure 3.12, where all the tiles are almost equally far from
the bus IP in the first hierarchy and there is no overlap between each tile, so
that there are no extremely long wires or complex crossings in the system.
The way we assign resources to each tile is based on resources arrangement
on FPGAs. Xilinx created the Advanced Silicon Modular Block (ASMBL)
architecture to enable FPGA platforms with varying features optimized for
different application domains [35]. All Xilinx 7 series FPGAs adopt this
ASMBL architecture, which organizes different types of resources, like DSP,
BRAM, Slice logic, in a column-based form with unique coordinates. Ac-
32
cording to these resources coordinates, we can assign a specific area on the
FPGA into one Pblock. The devision of FPGA into Pblocks is based on
the tile configuration we predefined, which is a part of system configuration
in the design space. Tile configuration is defined in the format of (m,n),
meaning that there are m rows and n columns of Pblocks with the same size
on the whole FPGA. Each Pblock is a “virtual” tile, holding a physical tile
consists of FCUDA cores and one bus IP from the second bus hierarchy.
3.5.2 Clock Domain Division
Optimized placement of system components can help improve the system’s
running frequency and reduce complex routing. However, there is still one
problem left. Long wires can still be introduced when connecting FCUDA
core tiles with the memory controller and the clock & reset synchronize pro-
cessor. These long wires are called “system long wires” since they are usually
connections for system-wise control, such as cock synchronization and reset
signals.
Since the places of on-chip ports for connection with off-chip DDR3 are
fixed, the memory controller has to be placed around these ports. It can
be very far from the bus hierarchies as well as the master FCUDA core
tiles. Long wires and complex routing have to be used by Vivado during
P & R. The total period will be slowed down, not because of our FCUDA
cores’ performance, but the way we organize the bus system. Consequently,
accurate analysis of the fastest period FCDUA cores can run to is prohibited.
Dividing the whole design into two separate clock domains can effectively
eliminate the interference of irrelevant long wires because of system architec-
ture. Figure 3.13 gives an illusion of the two clock domains.
As shown in Figure 3.13, tiles of FCUDA cores and memory controller
are separated into two clock domains by an asynchronous buffer inside the
33
Figure 3.13: Clock Domains in Bus System
bus interconnect module. Each clock domain uses a clock & reset synchro-
nize processor to synchronize the clock phases, frequency, control and status
signals, and so on. Let us use clock1 to represent the clock signal in clock
domain 1 and use clock2 for clock domain 2. We can estimate the worst neg-
ative and positive slack for each domain with their own PLL added. Long
wires of control signals from the memory controller for synchronization with
other components in the design are all included in clock2. Synchronization
is left to the asynchronous buffer and the two clock & reset synchronize pro-
cessors to deal with. The clock division enables us to analyze the FCUDA
cores’ fastest frequency by using only timing constraints and reports of clock1
without interference of other system connections. The timing analysis we get
is exactly the result of FCUDA cores’ configuration and implementation.
Actually, the advantages of the clock domains’ division are much more
than what we have mentioned above.
Firstly, according to our P & R result, the memory controller part usually
run at a faster frequency. However, that may not always be the case after real
board implementation. Components are very possibly slowed down or sped
34
up with different degrees during the system’s on-board implementation. With
our design of clock domain division, the frequency of the memory controller
can hardly affect the frequency of FCUDA computation cores.
On the other hand, the memory controller we are using is reconfigured
from Xilinxs IP library, MIG3.7 for DDR3. It requires all the connected
components maintain a ratio as either 1:2 or 1:4 with the memory controller.
In other word, the FCUDA cores’ executing frequency is limited according to
the settings of the memory controller IP. This can be troublesome when we
adjust the memory controller’s bandwidth by changing its frequency, since it
will affect the whole system’s running frequency. Putting FCUDA cores into
a separate clock domain helps it avoid being affected.
Considering all the limitations and advantages discussed, the clock domain
division is highly demanded.
3.6 Bus Communication Verification
To verify that the AXI-based bus is working correctly in fetching data, getting
results computed, and storing results back to DDR3, we design a verifica-
tion flow which includes feeding DDR3 with initialization data, starting the
FCUDA IP cores for computation, and finally comparing generated results
stored back to DDR3 with software computation results.
Initializing data in DDR3 is different from other types of storage devices on
board, like SDRAM or BRAM. There is no existing control panel or similar
tool for loading data files into DDR3 with an easy click. An initialization
module has be to added in our system. We investigated different ways of
initialization module’s insertion including in different formats and places.
Soft IP processors like MicroBlaze or ARM cores can be used to manipu-
35
late data in DDR3. However, the generality of processors usage on FPGAs
are not ideal. The 7-series Xilinx FPGA boards have different architectures.
While Zynq boards have two built-in ARM cores connected to the AXI bus
around the entire board, other series like Virtex FPGAs can only use in-
stantiated MicroBlaze soft IP cores without an existing bus connected to
DDR3, which means that other peripheral hardware components such as Di-
rect Memory Access (DMA) are needed. In other words, whenever we change
the platform of our experiments, we need to set up a new software environ-
ment together with hardware components, which can be problematic. Also,
using an instantiated processor IP simply for initialization wastes too many
resources.
Another way is adding hardware modules connected to DDR3 memory con-
troller. The modules can be constructed inside a memory controller’s AXI
wrapper. The advantage of this method is that without AXI wrappers be-
tween the initialization module and DDR3 memory controller, we do not need
to bother with one more layer of signal translation to AXI protocol. While
the disadvantage is that the memory controller and its AXI wrapper have
to be split in between so that the raw signals of the memory controller are
exposed for connection with initialization module. What makes things worse,
removing the initialization module is not convenient at all. And adding an
extra wrapper to the memory controller can be messy with different control
signals when considering different traffic conditions.
The last method is adding an extra hardware module to the system as
an independent plug-in AXI master core for DDR3 initialization. Although
converting signals to AXI protocol is a little bothering, it is much easier
to ensure the correctness and debugging since the behavior of the wrapped
initialization module is simple and deterministic. Also, it does not consume
that many resources and is convenient to add or remove from the rest of
the bus system. Compared with the previous two methods, we adopt this
initialization method for our bus communication verification.
Figure 3.14 shows how we add the initialization module as an independent
core in the system.
36
Figure 3.14: Initialization Module Inserted in the System
Figure 3.15: Initialization Module State Machine
As shown in Figure 3.14, no change is needed on the original system ex-
cept adding one bus interface for connection of the initialization module.
When the system gets started, the AXI wrapped initialization module starts
first. It writes data from a user’s input files to DDR3 in our defined burst
mode, then reading is performed from DDR3 to check whether writing is
performed correctly. After the check is finished, FUCDA cores are started
and application computation begins.
Figure 3.15 shows the state diagram of our initialization module.
TG stands for traffic generation including reading from input data file and
writing to DDR3. At the beginning, the module starts from an initial state,
37
TG IDLE. After that, it enters TG GEN state, where initialization module
starts generating the information of burst length, addresses and data for each
writing or reading. If writing is to be performed, TG WR CMD is selected
once write command signals is received, followed by signals representing data
to be stored, until finally the signal representing writing is done. If reading
is selected, commands and addresses information will be sent out and data
are from DDR3 are retrieved.
Using HDL to interact with memory controller and memory is prone to
errors and debugging takes considerable effort. In our initialization module,
not only need we add Verilog code for interaction behaviours with memory
controller, but also the AXI protocol conversion logic for all signals, like
control signals and the addresses. This makes our work arduous and diffi-
cult. Considering its necessity, we managed to figure it out after long time
debugging.
In our system, we use read operations to do a check on whether data are
stored correctly in memory. So at the end of reading, there is a check per-
formed to see whether data in DDR3 matches user’s input file. If matching,
a start signal will be sent to FCUDA cores for computation to get started.
3.7 Automated Bus-Based FCUDA Tool Flow
FCUDA, as a new computation tool on programmable hardware, has been
proved very promising in improving performance and saving power compared
to other prevalent computation architectures [22, 13, 36]. To make it a real
helpful tool with practical value, we should make the system as convenient
as possible: users start from CUDA code all the way to running simulation
or synthesis of the whole system with as few inputs or manual code changes
as possible. In order to accomplish this goal, we automate and integrate all
the system processing and code changes into a single cohesive tool chain.
38
CUDA
Insert 
FCUDA 
pragmas
FCUDA
Generated  C Code for 
each core with 
Vivado_HLS pragmas 
for AXI wrapper
User input 
configuration
Front-end C level
Back-end HDL level
Vivado HLS
AXI wrapped  
FCUDA core IP
 Backend scripts and 
HDL files prepartion Vivado
Simulation files 
preparation & 
Simulation
Input initial value for 
parameters & 
simulation tool config
Setup synthesis 
environment
Synthesis and 
reports collected
Vivado 
TCL
Automatic 
changed 
HDL files 
imported
Figure 3.16: Automated FCUDA Bus System Tool Flow
Figure 3.16 shows the tool flow of our bus-based FCUDA system. The
blue blocks represent original application source code or generated code for
later steps. The green blocks are automatic code transformation done by our
automated flow. The orange blocks are steps where users’ input or manual
configuration is needed. This section describes in detail what happens in
each step of our automated FCUDA bus-based system tool chain.
3.7.1 CUDA Code
The input to our system, as with FCUDA, is CUDA code, written to be
executed on an NVIDIA GPU. User pragmas are used for explicit state of
the combination of ports. For example, in matrix multiplication, ports A,
B and C are combine into one memport with different offsets. Figure 3.17
shows user input pragmas for matrix multiplication CUDA code. WA, HA
39
#pragma FCUDA PORTNAME remove_port_name=A offset=0
#pragma FCUDA PORTNAME remove_port_name=B offset=WA*HA
#pragma FCUDA PORTNAME remove_port_name=C offset=WA*HA+WB*HB
__global__ void matrixMul( DATATYPE *C, DATATYPE *A, DATATYPE *B,
int wA, int wB) {
}
Figure 3.17: Pragmas for Port Combination
are width and height for matrix A, similarly WB and HB are the width and
height for matrix B.
Based on pragmas in Figure 3.17, matrix A’s data is stored from the be-
ginning of DDR3, followed by matrix B and matrix C. With these pragmas,
FCUDA compiler can be directed to combine indicated data ports as what
we have described in Section 3.3.
3.7.2 User Input Configuration
Multilevel granularity parallelism is added in the FCUDA compiler to achieve
better performance. Besides the traditional core-level parallelism, thread-
level and on-chip memory access-level parallelism extractions contribute to
compute logic replication [22]. All these parallelism configurations need to
be predefined by users. Also, users need to set information such as the tile
configuration for the bus back-end.
3.7.3 FCUDA
Given a few required user inputs, FCUDA is automatically launched and
synthesizable C code is generated. The FCUDA compiler we use is based
on the one described in [22]. Several adjustments were made on top of that
40
in order to generate cores that fit in our bus system, including adding AXI
interfaces, decentralization of core control, and data ports combination.
3.7.4 Scripts for Bus System Setup
Tool command language (Tcl) is a very powerful and easy-to-use dynamic
programming language, suitable for a wide range of applications, including
web and desktop applications, networking, administration, testing and many
more [37]. The Vivado tool environment integrates Tcl as a script-based
compilation style method in which users can manage sources and the design
process by themselves, also know as non-project mode. Alternatively, users
can use a project-based method to automatically manage the design process
and design data using projects and project states, also known as project
mode [38].
Taking advantage of the non-project mode in the Vivado Design Suite,
the whole back-end system is able to be constructed from scratch by simply
executing our prepared Tcl scripts, which contain the commands of importing
user-defined IPs, reconfiguring IPs from the Vivado IP Library, connecting
wires among components, launching simulation and synthesis, and so on.
Our Tcl scripts of system setup for each application are automatically con-
figured and prepared based on users’ predefined inputs. With these scripts,
users do not need to bother with all the system construction’s details, which
greatly release the burden from CUDA programmers. Setting up the sys-
tem using non-project mode in Vivado also reduces the probability of errors
resulting from users’ improper operations and platform dependencies.
41
3.7.5 Generating HDL Files
Vivado generates HDL files for block diagram design. However, new HDL
wrappers for FCUDA applications have to be made by us to hide top-level
I/O ports.
For CUDA applications, there is usually more than one argument even
after we combine all the memory ports. When translated to RTL code, each
of these arguments needs a connection to some signals from outside. What
makes things worse, these arguments are usually of an integer or float data
type, which causes each of them to consume at least 32 I/O ports.
For example, in matrix multiplication, besides the data ports for the arrays
we combined, there are still several integer type arguments, like the width
and height of each matrix. All of these integer variables will be translated
to 32 bit-wise signals in the top-level HDL file connecting with 32 input or
output I/O ports on board during synthesize. In addition, all the benchmarks
have the arguments called block dimension and grid dimension, each of which
is made of three integers, not to mention all the system control and status
signals that we have to be connected with outside, like ap start, ap ready,
ap done, etc. The I/O resources on board are too limited compared to what
we need and can barely support all these input and output signals.
A C code level solution for this problem does not exist, since we need to
assign initial values to these arguments when doing RTL simulation, argu-
ments cannot be hidden in the C level. Otherwise, there will be nowhere to
find them after Vivado HLS translation.
To solve this problem, a new wrapper with initial values of an applica-
tion’s arguments is created. In this wrapper, the arguments’ initial values
are stored as constants in on-chip registers. RTL signals representing FCUDA
arguments are made to connect with these registers using simple wires. This
wrapper ensures that only a few necessary ports for application control and
off-chip memory communication are connected through I/O ports to the out-
42
side, which consume only a small portion of the on board I/O resources.
When our HDL wrappers are generated for each application, the tool au-
tomatically extracts the names of the arguments to be hidden and asks users
to input initial values for each. With the initial values, connections for most
parameters can be easily wrapped as internal wires, leaving only a few nec-
essary wires to be connected outside. These are the only outside connections
for each application.
There are several other HDL files we automatically generate for the sys-
tem. For example, the DDR3 behavior simulation and parameters files are
extracted from Xilinx’s IP repository and imported. Also, the testbench
files written in HDL for simulation are also prepared automatically in this
step. Since application-specific arguments are hidden inside the top-level
HDL wrapper, the testbench files remain almost the same among all the
applications.
3.7.6 Invoking Vivado
Vivado is launched in non-project mode automatically to execute the Tcl
scripts we have prepared for bus system’s setup as a part of our tool chain.
After the system is set up, IP Integrator, which is one of the new features in
Vivado Design Suite helping users to improve productivity, is also very helpful
to use. Such an IP integrator environment gives a very clear overview of the
system and IP status. Opening the constructed project as project mode
in Vivado IP integrator, user can clearly see the hierarchies of the bus and
the connections between components, which is much more convenient than
inferring from the scripts we prepared in advance for system users.
With the IP Integrator, users can also reconfigure all the IPs inside the
system, or even modify the system’s structure. Tcl commands will be gener-
43
ated and can be added to our original scripts if users would like to keep the
changes permanently.
3.7.7 Simulation
The simulation required files are essentially complete and requires little user
intervention. The testbench we prepared and imported in previous steps has
instantiated the bus system and made all the necessary connections for it.
Initial values have been specified by users while a top-level HDL wrapper file
is made. If users would like to view waveforms, they can open the project
using Vivado Project-Mode and select either Vsim built inside or invokes
Modelsim to run the simulation.
3.7.8 Synthesis
The synthesis portion of the flow is integrated in the tool chain and can
be run through directly to get result reports. Corresponding constraints for
different tile configurations are automatically chosen and added. After RTL
synthesis and P & R, reports for timing analysis and resources utilization
will be generated and collected to a specific report folder inside FCUDA
application’s directory.
3.8 System Debug
During our work on the FCUDA bus-based system construction, which con-
tains both the multilevel paralleled FCUDA and the AXI-based bus system,
we did substantial debugging work to ensure the whole system works cor-
44
rectly. Debugging this comprehensive system related to both software trans-
lation as well as hardware connection could be overwhelming since bugs can
exist in any portion.
A systematic debugging flow was used during our system’s development by
separating the whole system into different levels, testing each portion of the
system and making sure each level generates correct results before continuing
to the next step. Here, we make a summary of how we found some main bugs
and the ways we fixed them.
1. Fixes in FCUDA compiler: To ensure the generated system gets the
correct computation result, transformations in FCUDA should be ver-
ified to be working correctly. So we add testbenches written in C code
which call the FCDUA decentralized functions of each core to simulate
the all cores’ behavior. Given some input data, we compare the golden
results computed by standard C code with FCUDA cores’ results. Us-
ing such a method, we were able to find bugs in the original FCUDA
compiler related to burst write destination addresses, control signals of
fetch and computation, and so on.
2. Fixes in Vivado HLS: Although the AXI wrapper can be automated
added outside the FCUDA C code core with insertion of pragmas, the
exported IP can still have problems in generated HDL code. CHDL
Cosimulation enables debugging with the generated HDL and C code
testbench. After simulation using the Vivado HLS cosimulation tool
as well as Modelsim, we finally located state machine bugs of the AXI
wrappers related to memcpy functions in the Vivado HLS translation
tool. And we got the behavior fixed by separating the burst operations
of different interfaces.
3. Fixes in Bus System: When the generated FCUDA IPs are ensured to
be correct, Simulation in Modelsim with increasingly more components
of our bus system is the way we ensure the whole system works cor-
rectly. For example, since DDR3’s behaviors are complicated, we first
changed it with an AXI wrapped single port BRAM in which a small
amount of data is stored. Then we move on to simulate only the part
45
of the DDR3 memory and its controller. An initialization module was
developed to feed DDR3 with our initialization data. The initializa-
tion module performs a read-and-compare check to make sure DDR3
is working correctly. Finally, we started the simulation on the whole
system, retrieved results stored in DDR3, and compared them with
software generated results.
After performing the three main steps, we were able to guarantee the
correctness of our AXI-based bus system’s implementation.
46
CHAPTER 4
EXPERIMENTAL RESULTS
Since we have the FCUDA generated cores working with a solid bus and
storage devices, we can evaluate FCUDA compilation together with our bus
system from a more accurate and comprehensive point of view. In this chap-
ter, we will describe the experiments we have done and provide analysis for
the results.
Our experiments are extended from FCUDA with multi-level granularity
parallelism [22] by adding the AXI-based bus system. Experimental results
were gathered in several ways in order to quantify the characteristics of our
FCUDA system. Such a system also provides us more insights and opportu-
nities of optimizations as a whole picture.
As an efficient and powerful FPGA design tool, the Vivado Design Suite is
used as our platform for simulation and synthesis. Together with ModelSim
and Vivado HLS, results of timing and resources are obtained to calculate
the system’s models. We adjust the models from [22] for our system, trying
to be as accurate as possible. Finally, with performance comparison results
with a comparable GPU device, we are able to derive subjective evaluations
of our system.
Also, we get many more benchmarks to work with our tool chain, includ-
ing some from the Rodinia Benchmark Suite [39], a collection of accelera-
tors for heterogeneous computing infrastructure. By extending our FCUDA
toolchain to more applications, we hope to make it a real helpful parallel
computation architectures in more areas of realistic applications.
47
4.1 Design Space Exploration
FCUDA has different levels of parallelism, which form a large design space for
users to specify. As proposed in [22], the granularities of parallelism include:
1. Threads: unfolding of thread-loop iterations through unroll-and-jam
transformation. Each unroll results in one more replication of resources,
meanwhile reduces half of the original computation time.
2. Arrays: on-chip array access concurrency is controlled by the degree of
array partitioning, which divides arrays into separate partitions. Each
partition is mapped onto a separate BRAM and thus, the array acquires
multiple memory ports.
3. Cores: unfolding of threadblock-loop iterations through replication of
thread-loop function calls. As we have described in previous sections,
by separating functions in generated C code, we make instantiation of
each core in parallel by invoking Vivado HLS multiple times.
4. Core-cluster: the set of thread-blocks is partitioned into subsets, with
each subset organized by one AXI Interconnection IP from Hierarchy
2.
The multilevel granularity parallelism synthesis (ML-GPS) FCUDA flow is
divided into three automated main steps, shown as Figure 4.1. Firstly, a ker-
nel profiling step is performed in order to build the resource estimation model
for each kernel. Profiling entails feeding the FCUDA source to source trans-
formation with a small set of multilevel granularity configurations which are
subsequently synthesized to generate corresponding utilization reports and
timing analysis reports. The original experiments in [22] use resource uti-
lization estimations from Vivado HLS since no solid back-end was developed.
Now with FCUDA cores fitted in our newly developed bus system, we are
able to synthesize the whole system and generate post P & R reports, which
are much more accurate.
48
Figure 4.1: ML-GPS Flow
A kernel-specific resource model is then built using regression analysis.
The number of the profiled configuration points determines the accuracy of
the resource estimation model generated. More configurations result in extra
profiling time while ensures higher accuracy.
Secondly, after profiling, we scope the kernel’s design space by deriving
maximum core, unroll and array partition parameters based on FPGA device
parameters, kernel features, and the resource models we have derived. The
list of available core-cluster configurations is determined by considering both
the resource model and the features of FPGA resource distribution.
Finally in the last step, design space exploration is performed by comparing
latencies among all configuration points. Latency is computed based on the
models and clock cycle estimations from Vivado HLS. To accelerate the search
of best configuration in design space, a binary search heuristic is used to trim
the number of HLS invocations and prune the design space.
By performing such design space exploration, we are able to identify the
most optimal configuration of FCUDA in order to maximize the performance
of the CUDA kernal on selected FPGA device.
49
Our experiments are done on eight benchmarks, each with both 32-bit in-
terger and 16-bit short versions. Table 4.1 is a summary of all the benchmarks
we are using. HotSpot, LavaMD, and PathFinder are the three benchmarks
we added from Rodinia [39], a benchmark suite designed for heterogeneous
computing infrastructures with OpenMP, OpenCL, and CUDA implementa-
tions.
4.1.1 Models of System
The exploration of multilevel granularity space is based on estimations of
resource, clock period and cycles. We estimate resource and clock period
due to routing based on regression analysis results, whereas get cycle estima-
tions from Vivado HLS. The formulas we use for resource and clock period
estimations are modified from the ones that have been used in [22], which
include the degradation due to routing and P & R effect. These formulas are
presented as following.
Resource Model The resource model is built during the profiling step of
our flow in Figure 4.1. Post-routing utilization reports of a small number
of points in each application’s design space are generated by running our
automated bus system flow. These points expose parallelism at levels of
core (CR), thread (TH), and array-partitioning (AP), which are suitable for
regression analysis. Instead of retrieving utilization reports from Vivado HLS
in paper [22], full synthesis and P & R are performed for each point in this
step with our constructed bus back-end to make the resource estimation more
accurate. The formula of the system’s resources is shown as following:
R = (R0 + R1 × CR + R2 × CR× TH + R3 × CR
× AP + R4 × TH × AP + IP BUS) ∗ CL
+ (System Hierarchy − 1) ∗ IP BUS
+ IP Peripheral (4.1)
50
Table 4.1: Benchmarks Description
Application (Kernel
Name)
Thread Blocks Description
Matrix Multiply (mm) 4096
Computes multiplication
results of two matrices, a
standard and common
benchmark for parallel
computing
Fast Walsh Transform
(fwt1)
8192
Walsh-Hadamart
transform, a generalized
Fourier transformation
used in various engineering
applications
Fast Walsh Transform
(fwt2)
8192 Same as fwt1
Discreet Wavelet
Transform (dwt)
4096
Compute 1D partial
wavelet decomposition
using Haar basis
Coulombic Potential (cp) 32768
Computation of
electrostatic potential in a
volume containing charged
atoms
HotSpot 4096
A widely used tool to
estimate processor
temperature based on an
architectural floorplan and
simulated power
measurements
LavaMD 4096
Calculation of particle
potential and relocation
due to mutual forces
between particles within a
large 3D space
PathFinder 4096
Dynamic programming
used on 2D grid to find a
path with smallest
accumulated weights
The resources of the FCUDA bus system contains the resources of three
parts: FCUDA cores, AXI interconnection cores, as well as other peripheral
IPsE for system synchronization. Since each core cluster has its own AXI IP
inside, we incorporate its resource consumption as with the FCUDA cores
51
resource in Equation 4.1. However, notice that when there is only one core
cluster in the system, the system does not need an IP BUS at Hierarchy 1
and System Hierarchy − 1 = 0. In other words, one IP BUS from inside
the core cluster is enough to organize all the cores when the system has only
one core cluster.
Equation 4.1 characterizes the resource utilization of the bus system on
the core number (R1), count of threads (R2), array partitioning (R3), and
the interaction between unrolling and array partitioning (R4). The resource
of BUS IP and Peripheral IPs for system signals control can be viewed as
constant during regression. For each type of resources (LUT, Flip-Flop,
BRAM, and DSP), we construct a separate equation which represents the
system resource usage as a function of the different parallelism granularities.
Usually, Xilinx’s VC709 FPGA can tolerate more than 70 cores (differs
from application to application) based on resource estimation. As a result,
the resource consumption of different configurations varies a lot. For exam-
ple, there are two configurations, unroll and memory partition parameters
are both 1, while one configuration contains 1 core, and 70 cores for the other.
Obviously, the latter configuration equation counters more during regression
profiling. In order to keep fairness between different configuration points, we
add another step of normalization before linear regression to avoid such a big
difference in the configuration magnitudes, which can significantly affect the
accuracy of regression results.
Period Model The period model we use is similar to the one proposed
in [22], which aims to capture the clock period degradation resulting from
wire routing within the core cluster. In our system, we added multi-clock
domains in the system to help reduce irrelevant long wires for system connec-
tion. Thus, the clock period reports we use from Vivado are only for routing
of FCUDA cores.
Through our period model as the following equation, we incorporate the
52
degradation after synthesis and routing as a function of different parallelism
granularities as well as layout information on interconnection wires.
Period = (P0 + P1 ×Diag + P2 × Util + P3 × AP + P4 × TH (4.2)
Diag in Equation 4.2 is calculated using Equation 4.3. According to
the core cluster configuration we predefined, resources on board are divided
evenly into tiles, which are called physical tiles. Synthesized FCUDA core
clusters are called virtual tiles. All virtual tiles are of same sizes and similar
as all physical tiles. In Equation 4.3, minDim corresponds to the mini-
mum dimension of a physical tile in slices and Rslice is the slice count of the
core-cluster logic.
Diag2 =
{
2×Rslice : ifRslice ≤ minDim2
minDim2 +
(
Rslice
minDim
)2
: ifRslice > minDim
2
(4.3)
Diag corresponds to the diagonal length (in slices) of a virtual tile with
the following properties: (i) the total core-cluster slices can fit in the virtual
tile, (ii) the dimensions of the virtual tile do not exceed the dimensions of
the allocated physical tile and (iii) the diagonal length of the virtual tile is
minimal given the two previous constraints.
Util in Equation 4.2 represents the slice utilization rate of the physical
tile by the core-cluster logic. Parameters Rslice (hence Diag) and Util in
Equation 4.2 are calculated leveraging the resource model described above.
Hence, parameter Diag incorporates the core-cluster resource area and layout
information while Util incorporates the routing flexibility into the period
model.
53
Figure 4.2: Sequential Task Synchronization
Figure 4.3: Ping-Pong Task Synchronization
4.1.2 Latency Computation
Execution Stages of FCUDA During FCUDA compilation, the kernel
is decomposed into compute tasks and data-transfer tasks [36]. Computation
and data-transfer tasks can be either scheduled in a sequential synchroniza-
tion scheme or ping-pong scheme. Two schemes of scheduling are shown in
the order of time as Figure 4.2 and Figure 4.3.
In FCUDA, each application’s execution contains a bunch of thread blocks.
Figure 4.2 and Figure 4.3 show all the steps in the execution of one block.
Both serial and ping-pong synchronizations start from Init stage, in which
all the on-chip BRAM are initialized to be 0. Then fetch and computation
are performed alternatively, either serially or pipelined. Finally in wb stage,
when all the data of current block are processed, results will be written back
via system bus to off-chip DDR3 memory.
In ping-pong scheme, tasks communicate through double BRAM buffers in
a pipelined fashion where the data producing/consuming task interchange-
ably writes/reads to/from one of the two intermediate BRAM buffers [36].
In Figure 4.3, the number after Fetch and Comp means which BRAM buffer
is being used. For instance, Fetch1 means the system bus is fetching and
storing data to BRAM buffer 1.
All the experiments we did are based on the ping-pong scheme. Ping-
54
Figure 4.4: Computation Dominate Ping-Pong
Figure 4.5: Memory Transfer Dominate Ping-Pong
pong is preferred not only because it offers potential benefits of higher off-
chip memory bandwidth utilization, but also because it enables computation
and data-transfer overlap at a coarser granularity for more efficient kernel
execution.
Enhanced System Latency Model Despite the ideal case where Fetch
and Comp are perfectly overlapping each other shown in Figure 4.3, usually,
the execution in reality is either computation dominate like Figure 4.4 or
memory transfer dominate like Figure 4.5.
Based on such observation, we enhance our original latency computation
model as following:
Latency = max{cmp latency,mem latency}+ other latencies (4.4)
In Equation 4.4, compute latency is the computation latencies for all thread
blocks. mem latency is the latency for all memory transfer overlapped with
computation, whichever is longer. The item other latencies is all the other
latencies include latencies of Init, Wb, and processing delay between blocks.
other latencies is much shorter compared with the computation and memory
transfer latencies.
Computation latency When computation takes more time than the mem-
ory transfer, the total latency of kernel execution is computation latencies
55
of all blocks plus the other latencies. Since all the blocks are executed as a
for loop with same code and all the cores are executing the same code, total
computation latency can be estimated using Equation 4.5.
cmp latency(TH,AP,CR,CL) = Cyc× Nblock
CR× CL × Period (4.5)
In Equation 4.5, Nblock represents the total number of kernel thread-blocks,
Cyc is the number of execution cycles required for one thread-block and
Period is the shortest clock period the system can meet. Period is affected
by all the design space dimensions and is estimated through our estimation
model in Equation 4.2. On the other hand, Cyc is affected by the TH and
AP dimensions and is generated by invoking Vivado HLS engine, which is
almost as accurate as ModelSim simulation results while consumes less than
1/100 of the simulation time needed by ModelSim.
Memory Latency Memory latency is the latency of data preparation
stage for the next stage of computation in FCUDA. For some benchmarks,
such latency only contains data fetching from off-chip DDR3, while for the
others, this may also contains writing intermediate results back to DDR3.
Moving data between DDR3 and computation logic in FCUDA cores is
a pipeline with multiple stages. These middle stages are either buffers or
processing logic on the path of data transferring, including FCUDA AXI
wrappers’ output buffer, BUS interfaces consists of buffers, BUS arbitration
logic, Memory Controller as a single stage, and buffers inside DDR3. All
of these stages have processing latencies or throughput bandwidth. The
memory latency for the system depends on the longest stage.
So we investigate and compare bandwidth of these intermediate stages by
ModelSim Waveform Simulation. It turns out that memory controller is the
bottle neck of memory latency. Compared to the time memory controller’s
responses, the other parts on the path between FCDUA cores and DDR3,
such as various buffers for interfaces and arbitration logic, have enough time
to finish their processing during the pipeline. This observation lives up to
56
our intuitive impression since buffers behaving in a first-in-first-out pattern
does not need much logical processing and their simple wire connections
provide high bandwidth. As for the arbitration and conversions inside the bus
interconnect, round robin scans every interface and no complicated protocol
conversion is used, plus that pipeline is used for acceleration. There is no
bottleneck of bandwidth or long latency needed before Bus infrastructures’
responses can be made.
Thus the bandwidth and latency of memory controller decides system
memory latency. For the DDR3 memory controller, the maximum theoretical
bandwidth can be computed as following equation:
2 ∗MeomoryClockFrequency ∗ PhysicalDDR3DataWidth/8 (4.6)
On VC709 board, the physical DDR3 data width is 64 bits, and clock rates for
memory is 400 MHz at the most. Thus the maximum theoretical bandwidth
we could derive with one controller is 2 ∗ 400 ∗ 64/8 = 6.4GB/sec. In reality,
only about 50% to 70% can be achieved leading to a bandwidth of about
3.2 GB/sec to 4.48 GB/sec. This is actually a very very small bandwidth
and a comparison with the available bandwidth on GPU will make this fact
more intuitive. The GPU we are using in experiments that can make a fair
computation performance comparison is K20 from Nvidia, whose bandwidth
for fetching/storing block’s data from/to global memory is claimed to be 208
GB/sec.
Based on the above analysis, a model can be constructed now for calculat-
ing memory response latency based on the number of data to be transferred
and the bandwidth of memory controller. The memory latency model is
shown as Equation 4.7.
mem latency = iterNum× CR× CL× burstNumPerBlock×
max(minResponseLatency, burstDataSize/bandwidth)
(4.7)
57
Figure 4.6: Minimum Response Latency and Bandwidth Bottleneck
minResponseLatency in Equation 4.7 is the minimum latency of DDR3
response time, which means the necessary time for dealing with requests of
one core by the memory controller with no competitors in the system. When
the number of waiting requests at one moment in the system becomes larger,
pipeline of memory controller will be filled and round robin arbitration among
cores’ requests will be performed. Hence, the latency of memory controller
due to bandwidth limitation will overcome the minimum response time and
becomes the main reason for memory latency increase.
Such a mechanism is shown in a more clear way as Figure 4.6. As shown in
Figure 4.6, the green blocks represent the time when the corresponding core
is sending out new request via bus to memory. The yellow blocks represent
the time when the core is waiting for responses.
From the figure, we can see that, when the core number is less than 4, the
memory latency is exactly the minimum response latency no matter what the
bandwidth is. Otherwise with more than 4 cores, like a 6 core system, where
the pipeline stages are full, throughput based on traffic conditions and the
58
Table 4.2: Bus Back-End Resource Utilization
LUT Flip-Flop BRAM DSP
System IPs 7584 6275 0 0
Total 433200 866400 1470 3600
Util 1.75% 0.72% 0% 0%
bandwidth of memory should be used to calculate total latencies. This is the
same as various other pipeline cases. Only when traffic in system becomes
crowded, pipeline is fully used and bandwidth is considered to measure the
throughput, otherwise, minimum latency should be kept.
In our experiments, real values of minResponseLatency and bandwidth
are measured based on Simulation results in ModelSim. burstDataSize and
burstNumPerBlock are application features related to burst transfer for
each thread block, which are easy to calculated from the transferred C code.
iterNum is the number of execution iterations for each FCUDA core.
Now With calculated mem latency, we can compare it with cmp latency
and derive the whole system’s latency for each benchmark.
4.1.3 Resources of Bus Back-End
Our bus system is very light weighted compared to other communication
architectures such as NOC. This is very important in FPGA-based CUDA
system since resource is the most important limitation for adding more par-
allelisms. Table 4.2 illustrates the resources utilization of all components
for bus back-end connection except FCUDA cores in our system. The sys-
tem IPs include clock & reset synchronize processor, all AXI interconnect
cores for bus hierarchy, as well as the memory controller IPs. All these IPs
are fundamental components for our bus-based system, where FCUDA cores
will be fit in. Hence, we call them system IPs. System IPs consume only a
59
very small portion of resources, leaving most to FCUDA cores. Compared
to NOC which needs considerable Flip-Flop and LUT for many FIFOs, our
bus system has the attractive advantage for designers as being light weight.
4.1.4 Design Space Results
Though the Vivado Design Suite supports all Xilinx 7 series FPGAs, we select
the Xilinx VC709 board with a Virtex-7 FPGA on board as our target FPGA
board in the experiments. This is a large FPGA with 433,200 LUT Elements,
866,400 FlipFlops, 1470 Block RAMs, and 3600 DSPs. For memory, we take
advantage of its dual memory controller to increase the off-chip memory
bandwidth.
Such a large FPGA board enables ML-GPS FCUDA to have a huge de-
sign space. We depict the whole design space of benchmark mm and fwt2,
as Figure 4.7 and Figure 4.8. Both pictures contain more than 2000 de-
sign configuration points, each of which represents a unique configuration of
core-cluster, core, thread, and array parameters. The latency of each config-
uration is system total latency for the execution of 8296 thread blocks, and
is evaluated by models we described in Section 4.1.2.
Figure 4.7 is the design space of matrix multiplication benchmark. From
the figure, we can see that, usually, the more slices the configuration point
consumes, the less latency it needs. This makes sense to us since when we
have very little parallelism in the system, such as only one core with no unroll
or array partition, the latency of the system must be very large due to the
long computation time it needs. However, there are also some points with
high resource consumption and long latency. This is because these points all
have a large number of cores. Although computation latency is reduced due
to more core-level parallelism, the data traffic is also increased significantly
as so many cores are sending out requests together, resulting in long memory
latency and thus, a long total latency.
60
Figure 4.7: Design Space of Matrix Multiplication
Figure 4.8: Design Space of FWT2
Figure 4.8 is the design space of the fwt2 benchmark. From the figure, we
can see that most design points are formed into a thick line, meaning that
a great number of configurations have almost the same total latency. This
is because compared to benchmark mm, fwt2 needs to fetch and store much
61
more data for each thread block. Although computation latency decreases as
more parallelism is exposed, memory latency is too long to be hidden by the
computation. Therefore, all those points with the same total core numbers
have the same memory latency, which is also the total latency, no matter
what are the configurations of unroll, memory partition, and tile.
Based on whether memory latency or computation latency dominates in
the system total latency, benchmarks can be categorized into two kinds,
computation bound and memory bound. MM is a computation bound ap-
plication, while fwt2 is a memory bound application, as we can see from the
big difference in their design space trends.
4.1.5 ML-GPS Computation Performance Improvement
ML-GPS has been an effective way to reduce computation latency as men-
tioned in [22]. With our new developed bus system and adjusted system
models, we perform the computation performance improvement experiments
of ML-GPS over SL-GPS again.
With the thousands of configuration points in the design space of ML-GPS
FCUDA, a design space exploration is performed to minimize computation
latency. As shown in Figure 4.9, we compare the minimum computation
latency within the ML-GPS design space with the SL-GPS with best tile
configuration. The best tile configuration of SL-GPS is chosen by first setting
the unroll and memory partition number to be 1, then searching all tile
configurations with all available core numbers, and selecting the combination
with the shortest computation latency.
From Figure 4.9, we can see that dwt, hotspot, lavaMD, and pathFinder
have smaller computation performance improvement. This is because, for
these benchmarks, unroll and memory partition are not very effective in
shortening computation latency resulting from the existence of loop depen-
62
Figure 4.9: Computation Performance: ML-GPS vs. SL-GPS with Tile
Profiling
dencies. Hence increasing the numbers of unroll and memory partition are
not as beneficial as instantiating more cores. Actually, memory partition
does not successfully reduce clock cycles much in several benchmarks. When
unroll and memory partition are not used, the rest of the design space con-
sists only of the core number and tile configurations, which is the same as
that of SL-GPS with tile profiling. In such cases, the performance rate is
almost 1.
We also did the experiments of comparing the best performance of ML-
GPS with SL-GPS without tile profiling as in paper [22]. However, the
limitation of a single AXI IP’s connections makes the maximum number of
cores in our system not only decided by resources, but also up bounded by
the ability of AXI IP connection. Figure 4.10 depicts the results of ML-GPS
in comparison with SL-GPS with a single tile.
From Figure 4.9 and Figure 4.10, we can see that though improvements
over SL-GPS with a single tile is several times larger than SL-GPS with
tile profiling, the trend of different improvements between benchmarks are
63
Figure 4.10: Computation Performance: ML-GPS vs. SL-GPS with Single
Tile
the same, resulting from the different ML-GPS features of the benchmarks
themselves.
4.1.6 ML-GPS System Performance Improvement over
SL-GPS
After adding memory latency and computing the results of the system’s
latency based on our models, we get performance improvement results of
ML-GPS as shown in Figure 4.11 and Figure 4.12. Comparing the results
of system performance with the computation performance, we can see the
improvements decrease a lot because of the memory bandwidth bound. The
mm benchmark, as an exception in the charts, is not effected much because it
is computation bounded and does not reach bandwidth limitations at the best
design space point. There are other computation bounded benchmarks, like
lavaMD. However, since they do not have a big improvement in computation
in the first place, they do no have an obvious difference in system performance
64
Figure 4.11: System Performance: ML-GPS vs. SL-GPS with Tile Profiling
Figure 4.12: System Performance: ML-GPS vs. SL-GPS with Single Tile
either.
65
Figure 4.13: FPGA(32-Bit Implementation) vs. GPU: Latency Comparison
4.2 Comparison with GPU
In this set of experiments, we compare the performance of FPGA-based hard-
ware configuration identified by ML-GPS with the software execution on the
GPU. For the GPU performance evaluation, we use the NVIDIA Tesla K20
accelerator with 2496 processor cores and 706 MHz processor core clock. In
terms of FPGA, we target the largest Xilinx Virtex7 devices (VC709) which
include a rich resource collection to perform a fair comparison with the GPU.
In the comparison results, we include both the compute latencies and the
memory latency to/from off-chip memories. For FPGA-based implementa-
tion, the ping-pong scheme is used. The K20 device offers a peak bandwidth
of 208 GB/sec. In comparison, the average bandwidth of VC709 is only about
12.8 GB/sec. To view the limitation of memory bandwidth more clearly, we
add another set of results with the assumption that we have a larger memory
communication bandwidth solution of 208 GB/sec.
Figure 4.13 and Figure 4.14 depicts the FPGA execution latencies for the
66
Figure 4.14: FPGA(16-Bit Implementation) vs. GPU: Latency Comparison
ML-GPS chosen configuration based on the result of the design space explo-
ration. From Figure 4.13 and Figure 4.14, we can observe that the 16-bit
kernels perform better than the corresponding 32-bit versions on the FPGA
(note that all the GPU execution latencies we use are based on 32-bit ker-
nel versions). This is due to a smaller communication datasize of 16 bits as
well as shorter computation cycles. Also, based on the charts, we can see
that some of the applications are computation bounded, like MM, DWT,
and lavaMD. They are not very sensitive to off-chip communication band-
width changes. However, the others are memory bounded. Reducing off-chip
communication could be very helpful to improve total latencies.
4.3 Block Fusion
With FUCDA fitted in our bus system, more insights into the whole system
can be gained and promising optimizations are proposed, such as block fusion.
67
ML-GPS has proved its great effect in improving computation latency.
While adding more cores is helpful to reduce computation cycles, memory
latency can be increased very fast. From our preceding figures, we can see
that memory bound is the bottleneck for many applications and it prohibits
FPGA-based CUDA applications to achieve better performance. To solve
this problem, we analyze our bus-based FCUDA system and come up with
an idea called “block fusion”.
The mechanism of block fusion is basically selecting blocks with the most
reused off-chip data, fusing these blocks by fetching the data of all these
blocks to on-chip BRAMs at once and computing as one thread block. In
this way, replicated data fetches from the same DDR3 addresses can be re-
duced and off-chip communication bandwidth utilization can be improved by
storing reused data on chip.
There are two key points in this optimization method. One is the selection
of thread blocks to fuse. Block fusion saves total data fetch times at the
expense of using more on-chip BRAMs inside each core. If thread blocks
with few off-chip data addresses shared are fused, almost all the data of
these thread blocks needs to be fetched and stored in the generated core’s
on-chip BRAM, leading to a side effect of resources shortage for instantiating
more cores.
The other key point is reorganizing the structure of the FCUDA generated
C code. Fetching or storing multiple thread blocks’ data before starting the
computation breaks original thread loops structure.
Since block fusion has a strong relationship with the on-chip BRAM re-
source consumption, this can also be added in the design space exploration as
a third dimension and profiling it with unroll and memory partition together.
68
CHAPTER 5
CONCLUSION
This thesis developed an hierarchical and well-automated AXI-protocol bus
system for ML-GPS FCUDA, an optimized FPGA-based CUDA implemen-
tation. The system makes FCUDA a real computation tool that can commu-
nicate with off-chip data storage devices. Also, the model of system’s latency
is enhanced by incorporating memory latency. Such a system makes FCUDA
a real system, ensures the correctness of FCUDA on the circuit level, and
provides more optimization opportunities for the whole system.
Experiments show that our bus back-end is very light-weight and works
well when the number of cores is less than 256, which is large enough for
FPGA-based implementation. Design space explorations of a bunch of bench-
marks are done and comparisons with various SL-GPS according to the char-
acteristics of our bus hierarchy are made for analysis. In addition, we com-
pare the total latency of FCUDA in the bus system with GPU. Results show
that although our memory latency may sometimes be a limitation for those
memory intensive benchmarks, the performance of our FCUDA bus-based
system is very competitive.
5.1 Future Studies
The AXI-based bus system for FCUDA presents a broad platform for poten-
tial future work. There are several interesting avenues that have yet to be
explored:
69
• Block fusions within exported FCUDA cores for improved off-chip band-
width utility. Fusing blocks with replicated off-chip data fetch or store
can be an effective solution for the limitation of memory bandwidth.
Intermediate data for later use can be stored in on-chip BRAMs to
reduce the communication burden with off-chip DDR3 memory.
• Shared memory between cores. If two cores access the same data, the
data can be stored only a single time in shared memory. Cores can
either access BRAMs using simple wires or using a local bus.
• Additional design space exploration of memory latency profiling. For
example, with block merging or fusion, memory latency can be re-
duced and computation latency will increase as more BRAMs for each
FCUDA IP is needed. To what extent is the best performance achieved?
• Alternatives of DDR3 as off-chip storage device. DDR3 is difficult
to manipulate, such as initialization, as well as limited in bandwidth.
There are other ways for off-chip communication on FPGA. For exam-
ple, PCIe as a widely adopted bus standard, can be used for high-speed
data transfer.
70
REFERENCES
[1] G. E. Moore et al., “Cramming more components onto integrated cir-
cuits,” Proceedings of the IEEE, vol. 86, no. 1, pp. 82–85, 1998.
[2] N. S. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. S. Hu, M. J.
Irwin, M. Kandemir, and V. Narayanan, “Leakage current: Moore’s law
meets static power,” Computer, vol. 36, no. 12, pp. 68–75, 2003.
[3] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R.
LeBlanc, “Design of ion-implanted MOSFET’s with very small physical
dimensions,” Solid-State Circuits, IEEE Journal of, vol. 9, no. 5, pp.
256–268, 1974.
[4] L. I. Millett, S. H. Fuller et al., The Future of Computing Performance:
Game Over or Next Level? National Academies Press, 2011.
[5] D. Geer, “Chip makers turn to multicore processors,” Computer, vol. 38,
no. 5, pp. 11–13, 2005.
[6] H. Sutter, “The free lunch is over: A fundamental turn toward con-
currency in software,” Dr. Dobbs Journal, vol. 30, no. 3, pp. 202–210,
2005.
[7] M. B. Taylor, “Is dark silicon useful?: Harnessing the four horsemen of
the coming dark silicon apocalypse,” in Proceedings of the 49th Annual
Design Automation Conference. ACM, 2012, pp. 1131–1136.
[8] H. Esmaeilzadeh, E. Blem, R. St Amant, K. Sankaralingam, and
D. Burger, “Dark silicon and the end of multicore scaling,” in Com-
puter Architecture (ISCA), 38th Annual International Symposium on.
IEEE, 2011, pp. 365–376.
[9] H. Sutter and J. Larus, “Software and the concurrency revolution,”
Queue, vol. 3, no. 7, pp. 54–62, 2005.
71
[10] L. Hochstein, J. Carver, F. Shull, S. Asgari, V. Basili, J. K.
Hollingsworth, and M. V. Zelkowitz, “Parallel programmer productiv-
ity: A case study of novice parallel programmers,” in Supercomputing,
2005. Proceedings of the ACM/IEEE SC 2005 Conference. IEEE, 2005,
pp. 35–35.
[11] NVIDIA, “CUDA C Programming Guide,” 2012. [Online]. Available:
http://docs.nvidia.com/cuda/pdf/CUDA C Programming Guide.pdf
[12] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel
programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, 2008.
[13] A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong, and
W.-M. Hwu, “FCUDA: Enabling efficient compilation of CUDA kernels
onto FPGAs,” in Application Specific Processors, 2009. SASP’09. IEEE
7th Symposium on. IEEE, 2009, pp. 35–42.
[14] J. Tolar, “A directory enhanced network on chip for FPGA,” M. S.
thesis, University of Illinois at Urbana-Champaign, Urbana, IL, 2013.
[15] P. S. Zuchowski, C. B. Reynolds, R. J. Grupp, S. G. Davis, B. Cremen,
and B. Troxel, “A hybrid ASIC and FPGA architecture,” in Computer
Aided Design, 2002. ICCAD 2002. IEEE/ACM International Confer-
ence on. IEEE, 2002, pp. 187–194.
[16] I. Kuon, R. Tessier, and J. Rose, “FPGA architecture: Survey and
challenges,” Foundations and Trends in Electronic Design Automation,
vol. 2, no. 2, pp. 135–253, 2008.
[17] C. Maxfield, The Design Warrior’s Guide to FPGAs: Devices, Tools
and Flows. Elsevier, 2004.
[18] S. Brown and J. Rose, “FPGA and CPLD architectures: A tutorial,”
IEEE Design & Test of Computers, vol. 13, no. 2, pp. 42–57, 1996.
[19] P. Coussy, D. D. Gajski, M. Meredith, and A. Takach, “An introduction
to high-level synthesis,” IEEE Design & Test of Computers, no. 4, pp.
8–17, 2009.
[20] K. Rupnow, Y. Liang, Y. Li, D. Min, M. Do, and D. Chen, “High level
synthesis of stereo matching: Productivity, performance, and software
constraints,” in Field-Programmable Technology (FPT), 2011 Interna-
tional Conference on. IEEE, 2011, pp. 1–8.
[21] G. Martin and G. Smith, “High-level synthesis: Past, present, and fu-
ture,” IEEE Design & Test of Computers, no. 4, pp. 18–25, 2009.
72
[22] A. Papakonstantinou, Y. Liang, J. A. Stratton, K. Gururaj, D. Chen,
W.-M. Hwu, and J. Cong, “Multilevel granularity parallelism synthe-
sis on FPGAs,” in Field-Programmable Custom Computing Machines
(FCCM), 2011 IEEE 19th Annual International Symposium on. IEEE,
2011, pp. 178–185.
[23] Xilinx, “Vivado High-Level-Synthesis User Guide 902,” Jul 2012.
[Online]. Available: http://www.xilinx.com/support/documentation/
sw manuals/xilinx2012 2/ug902-vivado-high-level-synthesis.pdf
[24] Z. Zhang, Y. Fan, W. Jiang, G. Han, C. Yang, and J. Cong, “Autopi-
lot: A platform-based ESL synthesis system,” in High-Level Synthesis.
Springer, 2008, pp. 99–112.
[25] M. Mitic´ and M. Stojcˇev, “An overview of on-chip buses,” Facta
Universitatis-Series: Electronics and Energetics, vol. 19, no. 3, pp. 405–
428, 2006.
[26] A. Goel and W. R. Lee, “Formal verification of an IBM CoreConnect
processor local bus arbiter core,” in Proceedings of the 37th Annual De-
sign Automation Conference. ACM, 2000, pp. 196–200.
[27] A. AMBA, “2.0 Specification,” 2004.
[28] A. AMBA, “3.0 AXI Specification,” 2011.
[29] D. Wingard, “MicroNetwork-based integration for SOCs,” in Design
Automation Conference, 2001. Proceedings. IEEE, 2001, pp. 673–677.
[30] K. Lahiri, A. Raghunathan, and S. Dey, “Design space exploration for
optimizing on-chip communication architectures,” Computer-Aided De-
sign of Integrated Circuits and Systems, IEEE Transactions on, vol. 23,
no. 6, pp. 952–961, 2004.
[31] A. Raghunathan, G. Lakshminarayana, K. Lahiri, and S. Dey, “Method-
ology for the design of high-performance communication architectures
for system-on-chips using communication architecture tuners,” Dec. 20
2005, US Patent 6,978,425.
[32] R. Zurawski, Embedded Systems Handbook. CRC Press, Jul 2009, ch. 20.
[33] Xilinx, “AMBA AXI4 Interface Protocol: Overview,” 2015. [Online].
Available: http://www.xilinx.com/ipcenter/axi4.html
[34] Xilinx, “AXI Reference Guide,” Mar 2011. [On-
line]. Available: http://www.xilinx.com/support/documentation/
ip documentation/ug761 axi reference guide.pdf
73
[35] Xilinx, “7 Series FPGAs Configurable Logic Block,” Jul 2012.
[Online]. Available: http://www.xilinx.com/support/documentation/
user guides/ug474 7Series CLB.pdf
[36] A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong,
and W.-M. W. Hwu, “Efficient compilation of CUDA kernels for high-
performance computing on FPGAs,” ACM Transactions on Embedded
Computing Systems (TECS), vol. 13, no. 2, p. 25, 2013.
[37] “Tcl Developer Xchange.” [Online]. Available: http://www.tcl.tk/
[38] Xilinx, “Vivado Design Suite Tcl Command
Reference Guide,” Dec 2013. [Online]. Avail-
able: http://www.xilinx.com/support/documentation/sw manuals/
xilinx2013 4/ug835-vivado-tcl-commands.pdf
[39] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and
K. Skadron, “Rodinia: A benchmark suite for heterogeneous comput-
ing,” in Workload Characterization, 2009. IISWC 2009. IEEE Interna-
tional Symposium on. IEEE, 2009, pp. 44–54.
74
