Architectural Synthesis of Multi-SIMD Dataflow Accelerators for FPGA by Wu, Yun & McAllister, John
Wu, Y., & McAllister, J. (2018). Architectural Synthesis of Multi-SIMD Dataflow Accelerators for FPGA. IEEE
Transactions on Parallel and Distributed Systems, 29(1), 43-55. DOI: 10.1109/TPDS.2017.2746081
Published in:
IEEE Transactions on Parallel and Distributed Systems
Document Version:
Peer reviewed version
Publisher rights
Copyright 2017 IEEE. This work is made available online in accordance with the publisher’s policies. Please refer to any applicable terms of
use of the publisher.
Architectural Synthesis of multi-SIMD Dataflow
Accelerators for FPGA
Yun Wu, Member, IEEE, and John McAllister, Senior Member, IEEE
Abstract—Field Programmable Gate Array (FPGA) boast abundant
resources with which to realise high-performance accelerators for com-
putationally demanding operations. Highly efficient accelerators may be
automatically derived from Signal Flow Graph (SFG) models by using
architectural synthesis techniques, but in practical design scenarios,
these currently operate under two important limitations - they cannot
efficiently harness the programmable datapath components which make
up an increasing proportion of the computational capacity of modern
FPGA and they are unable to automatically derive accelerators to meet
a prescribed throughput or latency requirement. This paper addresses
these limitations. SFG synthesis is enabled which derives software-
programmable multicore single-instruction, multiple-data (SIMD) accel-
erators which, via combined offline characterisation of multicore perfor-
mance and compile-time program analysis, meet prescribed through-
put requirements. The effectiveness of these techniques is demon-
strated on tree-search and linear algebraic accelerators for 802.11n WiFi
transceivers, an application for which satisfying real-time performance
requirements has, to this point, proven challenging for even manually-
derived architectures.
Index Terms—Field Programmable Gate Array (FPGA), Dataflow,
Signal Flow, Architectural Synthesis, Single-Instruction Multiple-Data
(SIMD)
1 INTRODUCTION
Field Programmable Gate Arrays (FPGA) offer enormous
computational capacity and distributed memory
resources for high-performance, low-cost realisation of signal,
image and data processing [1], high performance computing
and big data analytics [2] and industrial control [3] opera-
tions. As custom computing devices, FPGA typically host
accelerators - components whose circuit architecture is tuned
to realise a specific function with performance and cost well
beyond that available via software-programmable devices,
such as multicore processors or graphics processing units.
To achieve these benefits, accelerators have traditionally
been developed manually at Register Transfer Level (RTL)
[4]. This low level of design abstraction enables highly
efficient results, but imposes a heavy development load which
becomes increasingly unproductive as the scale of modern FPGA
devices increases. Architectural Synthesis (AS) eases this burden
by automating the derivation of pipelined accelerators from
Signal Flow Graph (SFG) models and has proven highly
successful in producing high-performance, efficient results
[5], [1].
Authors undertook this work at the Institute of Electronics, Communications
and Information Technology (ECIT), Queen’s University Belfast, UK. E-mail:
jp.mcallister@qub.ac.uk
In the context of modern FPGA, current SFG-AS ap-
proaches have two shortcomings. Firstly, the accelerators
they produce are networks of fixed-function components,
such as adders, multipliers or dividers. However, mod-
ern FPGA increasingly rely on multi-functional or pro-
grammable components, such as the DSP48E1 slice in Xilinx
FPGA [6], to provide computational capacity. When these are
used to realise fixed-function components, several are
required to effect different operations where otherwise one
would suffice, leading potentially to increased resource cost.
No current SFG-AS approach can harness these compo-
nents’ programmability. Additionally, when designing for
an industrial operating context or for standards-based sys-
tems, accelerators need to meet a prescribed throughput
or latency. No current SFG-AS approach can automatically
derive accelerators to meet such requirements and, as a
result, iterative cycles of time-consuming FPGA place-and-route
are needed to refine results until the required performance
is met. This manual iteration is a major barrier to
high-productivity accelerator design.
This paper addresses these shortcomings. Specifically,
three principal contributions are made:
1) A novel SFG-AS approach is presented which derives
accelerators composed of custom multicore SIMD pro-
cessor architectures utilising the programmable datap-
aths on modern FPGA.
2) It is shown how, via off-line estimation of multicore
performance and compile-time application analysis, ac-
celerators may be automatically produced which meet
a pre-defined throughput requirement.
3) Automatic AS of accelerators with demanding real-
time requirements is demonstrated by application to
the design of transceivers for 802.11n WiFi.
The remainder of this paper is structured as follows.
Section 2 and Section 3 outline the multi-SIMD accelerator
design problem, before Sections 4 - 5 describe the synthesis
approach and Section 6 applies this to the design of 802.11n
transceiver accelerators.
2 BACKGROUND
Modern FPGA boast enormous on-chip computation,
distributed memory and communications resources. For
example, Xilinx’s Virtex®-7 FPGA family offers per-second access
to up to $7 \times 10^{12}$ multiply-accumulate (MAC) operations
and $40 \times 10^{12}$ bits/s of memory bandwidth via programmable DSP48E1
[6], Look-Up Table (LUT) and Block RAM (BRAM) [7]
resources. Along with the abundance of on-chip registers
on modern FPGA, these resources are ideal for creating
high-throughput, deeply-pipelined accelerators [1], [8], [2].
However, to date accelerators have been designed at RTL, a
low-level, manual process made increasingly unproductive
as the scale of modern FPGA increases.
Currently, two popular classes of approach address this
productivity problem by adopting more abstract design en-
try points. High-Level Synthesis (HLS) translates programs
written in popular software languages, such as C, C++ or
OpenCL, to accelerators [9], [10], [11]. These tools derive
RTL circuit architectures from the input and allow a de-
signer to manipulate performance and cost by transforming
the C source via, for example, loop unrolling, or by issuing
synthesis directives. They support bit-true hardware data-
types for arithmetic and automatically generate RTL code,
frequently via the use of advanced scheduling and resource
sharing in order to affect performance and cost.
An alternative approach uses AS techniques to derive
accelerators from SFG models described in tools such as
MATLAB Simulink. Exemplified by tools such as Xilinx’s
System Generator, Altera’s DSP Builder and Synopsys’ Syn-
plify DSP, these empower a designer to specify the be-
haviour of an RTL component before generating code in the
form of VHDL or Verilog. They support bit-true hardware
datatypes, automatic RTL code generation, close integra-
tion with vendor synthesis and place-and-route tools and
hardware-in-the-loop emulation. The SFG models created
are also ideal for application of AS transformations such as
automatic pipelining/retiming and graph folding or unfold-
ing to trade the performance and cost of the accelerator [5],
[1].
Regardless of the approach chosen, however, some re-
strictions are apparent. Consider the designer’s concern:
to realise a given function, on a given FPGA device, with
a throughput or latency which is prescribed either by an
industrial operating context, or by standards to which the
equipment of which it is a part must comply. In this
scenario, SFG-AS and HLS tools’ capabilities are currently
lacking in two important ways. Whilst transformation tech-
niques such as retiming [12] and folding or unfolding
[5], [13] allow accelerator performance to be traded with
resource or energy cost [13], [1], [14], current approaches
provide only partial support for deriving accelerators with
a given performance. This is because they measure perfor-
mance and cost in terms of abstract units such as clock
cycles (latency), samples/cycle (throughput) or number of
arithmetic components [13]. This puts the effect on actual
performance and cost in doubt, since facets such as number
of LUTs, or the length of each clock period cannot be
accurately estimated until the entire, highly time-consuming
FPGA synthesis toolchain, including place-and-route, has
been traversed. Hence creating an accelerator of a given
performance is a very unproductive, manual process.
Furthermore, all of these techniques derive RTL circuits
composed of fixed-function components, such as arithmetic
components, buffers or switches. However a substantial
proportion of the computational resource on modern FPGA
is increasingly made up of components which are multi-
functional or even programmable, such as the DSP48E1 on
Xilinx Virtex®-7 FPGA. If these are restricted to performing
only one operation each, accelerators of greater cost may
result than otherwise necessary. To the best of the authors’
knowledge, no current AS approach addresses either of
these shortcomings.
The programmable components which these processes
should target require both control logic and memory re-
source to store and manage delivery of instructions and
operands. These structures evoke the notion of software-
programmable processors, the use of which on FPGA has
been growing in recent times and has evolved into interme-
diate fabrics or overlays [15], [16], [17]. These take a wide
variety of forms, including vector processors [18], [19], GPU-
like structures [20], [21], [22] or domain-specific processors
[23], [16]. These are all founded on components such as
the DSP48E1 but impose large resource and performance
overheads to enable program and data control. To enable
efficient accelerators, these overheads must be minimised
via a soft design approach which customises their structure
to the workload at hand. An approach which adheres to
this philosophy is described in [24], [25]. Here, very fine-
grained processors are used as building blocks for large-
scale multi-SIMD structures whose architectures are tuned
to the workload. This approach has been shown to support
accelerators with performance and cost which are highly
competitive with those from libraries such as Xilinx’s Core
Generator or Spiral [26]. However, there is no technology to
automate their generation.
This paper devises an SFG-AS approach which derives
custom multi-SIMD processors built around programmable
on-chip computation units. Section 3 introduces the target
architectures in more detail.
3 HETEROGENEOUS MULTI-SIMD FOR FPGA
3.1 FPGA Processing Elements
An approach to the realisation of FPGA accelerators for
signal, image and data processing, based on the template
architecture shown in Fig. 1, is described in [24], [25].
Accelerators are realised using networks of Processing Elements
(PEs) - software-programmable SIMD soft processors. The
execution of each SIMD is decoupled from all others and
communication is via point-to-point links. The structure
of the network, the communications links and the widths
of each PE are customisable at design time to maximise
performance and minimise cost for the workload at hand.
Fig. 1: multi-SIMD Architecture Template (PEs with differing numbers of lanes and an interface controller, connected by a point-to-point FIFO network)
To enable highly efficient accelerators, two key features
are demanded of the PEs. They must be lean, incurring very
low resource cost, to enable scalability to many hundreds
of units for complex accelerators; associated with this re-
quirement is the need for standalone operation - the ability
to process data, access and manage memory and commu-
nicate externally without the need for a host processor. The
FPGA PE (FPE) [24] is a RISC load-store PE which fulfils
these requirements; SIMD and SISD (i.e. single-lane SIMD)
variants of the FPE are shown in Fig. 2. The FPE includes
only vital components - a program counter, program mem-
ory, instruction decoder, register file, branch detection, data
memory, immediate memory and an arithmetic logic unit
based on the DSP48E1 in Xilinx FPGA [6]. A COMM module
allows direct insertion/extraction of data into and out of the
FPE pipeline. In addition, the FPE’s architecture is highly
configurable for tuning to a specific workload [24].
Fig. 2: The FPGA Processing Element: (a) FPE - SISD Mode, (b) FPE - SIMD Mode
By ensuring absolute lowest cost, economies of scale
enable significant multicore resource cost savings. The price
for this efficiency, however, is flexibility - the FPE is not a
run-time general-purpose component because its architec-
ture is highly tuned to the application at hand. In addition,
it is domain-specific, enabling very high performance for
certain types of operations, with performance degradation
for others [25]. The benefit, however, is very high per-
formance; a 16-bit SISD FPE on Xilinx Virtex 5 VLX110T
supports 480 MMACs/s requiring 90 LUTs; this is just 14%
of the cost of a general-purpose Xilinx Microblaze processor
and 35% of that of the iDEA processor [23] on the same
device. This efficiency enables processor-based accelerators
for a range of applications whose performance and cost
are highly competitive with hand-crafted accelerators. The
structures which result are heterogeneous, as illustrated for
an example symbol detector for 4 × 4 16-QAM
Multiple-Input, Multiple-Output (MIMO) 802.11n transceivers in Fig.
3. This architecture includes clusters of MIMD structures
(4-FPE1) and nine 16-lane SIMD structures (FPE16). This
accounts for a total of 288 processing lanes, communicating
via point-to-point data queues to realise the functionality
required. The performance and cost of this architecture has
been shown to be highly competitive with hand-crafted
realisations of the same behaviour for this and a range of
other functions [24], [25].
Fig. 3: FPE-based SD for 4 × 4 802.11n
3.2 Synthesis of FPE-based Accelerators
The goal, in this paper, is to generate a multi-FPE accelerator
architecture from an application such that a prescribed
throughput, expressed as a number of iterations n of the
application per second, is satisfied. Accelerators based on
the FPE promise high performance and efficiency, but a
series of substantial design challenges must be overcome
to automate their derivation to meet a given real-time per-
formance. These are summarised in Fig. 4. A series of key
sub-tasks are involved [4]:
• Allocation of a set of SIMDs to realise the application,
• Partitioning & Binding of application tasks to SIMDs
and insertion of point-to-point communication links,
• Scheduling of the operations on each SIMD,
• Estimation of the performance of the result in order to
ensure requirements are satisfied,
• Code Generation of source for each PE.
Fig. 4: FPE-based Accelerator Synthesis (the application and n are input to architectural synthesis: allocation, partitioning & binding, scheduling, estimation and code generation)
Partitioning, scheduling and estimating the performance
of application workloads for programmable multicore and
GPU devices is an active topic of research [27], [28].
However, these works differ from that described here. For
instance, in many cases they can reduce the scheduling
load on the compiler by employing hardware scheduling
circuitry [27], [29], [28]; this is not available in the FPE. In
all cases, they do not have to allocate a processing resource,
as is required for the FPE and their performance estimation,
e.g. [30], does not extend to estimating the physical length
of a clock cycle, as is required for FPGA.
The SFG modelling entry point is well-suited to synthe-
sis problems such as this and has already been adopted in
numerous FPGA AS tools as detailed in Section 2. It is a
highly restricted form of dataflow, a domain of modelling
languages which have been shown well-suited to rapid
synthesis of digital signal processing operations for both
multicore and FPGA [31], [32], [33], [5]. Specifically, SFG
is a sub-class of synchronous dataflow [34], a decidable
dataflow dialect [35] which has consistently demonstrated
outstanding support for compile-time analysis and genera-
tion of efficient code. Its popularity has led to a considerable
body of work in the areas of partitioning and binding,
scheduling and code generation of dataflow applications for
multiprocessors [36], [37], [32]. Hence, this paper focuses on
the key novel aspects of this work: allocation of multi-SIMD
PE architectures and mapping and scheduling of SFG tasks
across PEs to meet a given real-time performance.
4 ARCHITECTURAL SYNTHESIS OF HETEROGE-
NEOUS MULTI-SIMD ACCELERATORS
4.1 Synthesis Strategy
SFG models of two classes of tree-search Sphere Decoding
(SD) operation - Fixed-Complexity Sphere Decoder (FSD)
and Selective Spanning with Fast Enumeration (SSFE)-
[1, 1, 2, 4] - for 4 × 4 16-QAM MIMO for 802.11n Wi-Fi are
shown in Fig. 5 and will be used to illustrate the synthesis
process as it progresses. An SFG G = (N, E) describes a set
of nodes or actors N and a set of edges E ⊆ N × N - directed
First-In, First-Out (FIFO) queues of data tokens. A node is
said to fire, consuming/producing a pre-specified number
of tokens, known as the rate, from each incoming/outgoing
edge. In an SFG, all rates are 1 and are not quoted.
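The firing rule can be made concrete with a small sketch. The following Python fragment is illustrative only and is not part of the synthesis flow described in this paper; the class `SFG` and its method names are hypothetical. It models each edge as a FIFO queue and fires an actor by consuming one token from every incoming edge and producing one token on every outgoing edge, i.e. all rates are 1.

```python
from collections import deque


class SFG:
    """Toy signal flow graph: actors joined by FIFO edges, all rates equal to 1."""

    def __init__(self):
        self.actors = {}  # actor name -> function applied when the actor fires
        self.edges = {}   # (src, dst) -> FIFO queue of tokens

    def add_actor(self, name, fn):
        self.actors[name] = fn

    def add_edge(self, src, dst):
        self.edges[(src, dst)] = deque()

    def inputs(self, name):
        return [e for e in self.edges if e[1] == name]

    def outputs(self, name):
        return [e for e in self.edges if e[0] == name]

    def can_fire(self, name):
        # An actor may fire once every incoming edge holds at least one token.
        return all(self.edges[e] for e in self.inputs(name))

    def fire(self, name):
        # Consume one token per incoming edge, produce one per outgoing edge.
        args = [self.edges[e].popleft() for e in self.inputs(name)]
        result = self.actors[name](*args)
        for e in self.outputs(name):
            self.edges[e].append(result)
        return result


# Three-actor chain: src -> double -> sink
g = SFG()
g.add_actor("src", lambda: 3)
g.add_actor("double", lambda x: 2 * x)
g.add_actor("sink", lambda x: x)
g.add_edge("src", "double")
g.add_edge("double", "sink")
```

In this chain, `double` cannot fire until `src` has produced a token, which is the data-driven ordering that the scheduling step must respect.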
In Fig. 5, the SFG is composed of 108 subactors, each
of which processes a single Orthogonal Frequency Division
Multiplexing (OFDM) subdivision of the allocated frequency
band. For each sub-band $i$, two pieces of data are
input: a channel matrix $H_i$ and a received symbol vector
$y_i$. In the cases considered in this paper, $H \in \mathbb{C}^{4 \times 4}$ and
$y \in \mathbb{C}^{4 \times 1}$. Preprocessing is applied by pp, ordering the
entries of both $y$ and $H$ according to the distortion on
each path through the wireless channel, before an equalised
version of $y$ ($y_{eq}$) is produced. The resulting data are then
refined to an estimate $\hat{s}$ of the transmitted symbol vector
$s \in \mathbb{C}^{4 \times 1}$ by a sequence of Euclidean distance cost functions
$a1_i$ to $a4_i$ in Fig. 5a ($e1_i$ to $e4$ in Fig. 5b). Further details of
these algorithms are available in [38], [39].
Each of the SFGs in Fig. 5 contains many instances of
actors of a restricted range of classes, replicated in a very
regular data/task parallel fashion. For instance, in FSD the
sequence of actors {a4, a3, a2, a1} forms all 16 branches of
the SD tree, replicated 108 times to constitute 1728 instances
of this same sequence. Similarly, there are 16 data parallel
instances of min, one per tree. These repeated parallel
sequences are well suited to SIMD realisation, and we propose
to exploit this feature to derive multi-SIMD realisations via
a two-step process illustrated in Fig. 6.
Fig. 5: FSD and SSFE SDF Models: (a) FSD, (b) SSFE-[1, 1, 2, 4]
Fig. 6: Synthesis Process Overview
As shown on the left of Fig. 6, the process commences
with the SFG model and the definition of a set K of kernels.
Kernels are considered the fundamental units of ’work’,
with the accelerator created to realise these kernels. This
approach is in keeping with that of common heterogeneous
computing languages such as OpenCL [40], where they are
known as work-items, and CUDA. A kernel k ∈ K is a
subgraph of G such that G may be subdivided into a set of
partitions P = {p1, p2, ..., pn}, with each pi ∈ P an instance
of a kernel k ∈ K such that p1 ∪ p2 ∪ ... ∪ pn = G and pi ∩ pj = ∅
for all i, j with i ≠ j, i.e. every actor in G is a member of
precisely one kernel instance. Kernels may take any form
and are defined by the designer. However, in order to realise
the most effective multi-SIMD realisations, these should be
chosen to expose large numbers of similar, data parallel
operations. In the FSD application, for example, both the
FSD tree branches and the min actors, highlighted in Fig. 6
as k1 and k2 respectively, are ideal kernels.
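The covering and disjointness conditions on kernel instances are straightforward to check mechanically. The sketch below is a hypothetical helper, not part of the tool flow described here; the function name `is_valid_partition` and the actor labels are assumptions made for illustration. The toy graph holds only the 16 FSD branches and the min actor of a single sub-band (pp is left out of this toy graph).

```python
def is_valid_partition(actors, instances):
    """Return True if every actor belongs to exactly one kernel instance."""
    covered = set()
    for inst in instances:
        inst = set(inst)
        if covered & inst:      # an actor appears in two instances
            return False
        covered |= inst
    return covered == set(actors)  # no actor is left unassigned


# One sub-band of the FSD model: 16 branches (instances of k1) plus min (k2).
branches = [{f"a{level}_{b}" for level in (4, 3, 2, 1)} for b in range(1, 17)]
actors = set().union(*branches) | {"min"}
instances = branches + [{"min"}]
```

Here `is_valid_partition(actors, instances)` holds, whereas dropping the min instance leaves an actor uncovered and listing it twice breaks disjointness.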
A workload is derived from the SFG and kernel definitions.
It is a sequence of kernel batches, with a
multi-SIMD workgroup synthesised to process a prescribed
number, n, of batches per second. Since all SIMDs in each
workgroup execute the same kernel, a distinct workgroup is
required per kernel class; this paper illustrates the synthesis
process for k1 in Fig. 6.
4.2 Workload Synthesis
Consider the FSD model in Fig. 5a. Each OFDM sub-band
contains multiple instances of the kernel k1, all of which
depend on data emanating from a pp node - yeq and H -
which are local to that sub-band. Hence, when realising the
set of k1 kernels, if two instances from the same sub-band
are realised on different SIMDs or different lanes of the same
SIMD, this local data will have to be stored in multiple
different memories, increasing the total memory capacity
required and the total FPGA resource cost. Conversely, if
both are realised on the same SIMD lane, a substantial re-
source saving may result. This data locality is ensured by the
SFG model’s hierarchy and, as a result, it is important that
this is maintained and exploited to guide the workgroup
synthesis process to realise these kernels on the same SIMD
resource. Accordingly, the workload is described as a batch
of kernel clusters, formed to emphasise local communication
and memory storage.
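The memory argument above can be illustrated by counting how many lane memories must hold a copy of a sub-band's local data. This is a simplified sketch under the assumption that each SIMD lane has its own data memory, so the data is stored once per lane hosting at least one kernel from that sub-band; the function name and the mapping format are hypothetical.

```python
def local_data_copies(mapping):
    """Count lane memories that must hold a copy of some sub-band's local data.

    mapping: dict kernel_id -> (sub_band, simd, lane)
    """
    return len({(sub_band, simd, lane) for sub_band, simd, lane in mapping.values()})


# 16 k1 kernels of one sub-band spread over the 16 lanes of one SIMD ...
spread = {b: (0, 0, b) for b in range(16)}
# ... versus the same 16 kernels executed in sequence on a single lane.
packed = {b: (0, 0, 0) for b in range(16)}
```

Under this assumption the spread mapping needs sixteen copies of the sub-band data and the packed mapping needs one, at the price of executing the kernels sequentially.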
Realising this feature requires two capabilities. The
graph G must be reformulated to express its behaviour in
terms of the kernel set K , with any instance of a kernel
k ∈ K in G replaced by a single actor k, whilst kernels of
the same class within the same composite node need to be
clustered for batch formation. The effect of this clustering
on the FSD and SSFE-[1, 1, 2, 4] SFGs is shown in Fig. 7.
Note that there are 108 disjoint FSD subgraphs, Q1 − Q108,
each representing an OFDM subcarrier. Two kernels are
identified: k2 designates the min actor as a kernel, whilst k1
identifies a branch of the FSD tree as a kernel. The SFG is fac-
tored to replace the subgraphs represented by each of these
kernels with a single ’kernel’ actor. In addition, the similar
kernels in each disparate subgraph Qi are composed into
clusters and hence two clusters arise - C1 and C2, composed
respectively of all instances of k1 and k2. Similarly, the three
kernels identified in Fig. 7b result in three clusters for each
Qi. The SFG model reformulation process is performed as
described in Algorithm 1.
Input to this process are the set of kernels K and the SFG
G. The goal is to derive G′, a SFG of equivalent behaviour
to G whose child actors are all kernels and members of K .
In the process the set of kernel clusters C is also derived.
The reformulation finds every instance of every kernel in G
(line 6) by isolating its disjoint subgraphs Q ⊆ G (line 3).
All instances of each kernel k in Q (Qk) are replaced with a
single actor representing the kernel (line 7), with the set of
kernels for each subgraph appended to the cluster definition
(line 8). This process is repeated for every disjoint subgraph
and every kernel type, with G′ and C returned.
Fig. 7: Clustered FSD and SSFE Results: (a) Clustered FSD, (b) Clustered SSFE-[1, 1, 2, 4]
Algorithm 1 DFG Clustering
1: procedure CLUSTERDFG(G, K)
2:   i, j ← 1
3:   while ∃ (Q ⊆ G) : (E(Q) ∩ E(G − Q) = ∅) do
4:     c_{i,j} ← ∅
5:     while j ≤ |K| do
6:       Q_k ← Q ∩ k_j
7:       Q ← REPLACE(Q, Q_k, k_j)
8:       c_{i,j} ← c_{i,j} ∪ Q_k
9:       j ← j + 1
10:    i ← i + 1
11: return G′, C
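A simplified Python sketch of the clustering idea follows. It is not the authors' implementation: it assumes that the disjoint subgraphs are the connected components of G and that every actor has already been labelled with its kernel class and instance (the hypothetical `kernel_of` dictionary). It derives only the clusters C and does not rewrite the graph into G′.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Split an undirected view of the graph into its disjoint subgraphs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def cluster(nodes, edges, kernel_of):
    """Return one dict per disjoint subgraph: kernel class -> set of instance ids.

    kernel_of maps actor -> (kernel class, instance id); actors that are not
    part of any kernel (e.g. pp in this toy example) are simply absent.
    """
    clusters = []
    for comp in connected_components(nodes, edges):
        by_class = defaultdict(set)
        for actor in comp:
            if actor in kernel_of:
                cls, inst = kernel_of[actor]
                by_class[cls].add(inst)
        clusters.append(dict(by_class))
    return clusters


# Toy model: 2 sub-bands, each with pp feeding 3 two-actor branches and one min.
nodes, edges, kernel_of = [], [], {}
for q in range(2):
    pp, mn = f"pp{q}", f"min{q}"
    nodes += [pp, mn]
    kernel_of[mn] = ("k2", f"min{q}")
    for b in range(3):
        a2, a1 = f"a2_{q}_{b}", f"a1_{q}_{b}"
        nodes += [a2, a1]
        edges += [(pp, a2), (a2, a1), (a1, mn)]
        kernel_of[a2] = kernel_of[a1] = ("k1", f"branch{q}_{b}")

clusters = cluster(nodes, edges, kernel_of)
```

Each entry of `clusters` corresponds to one subgraph Qi and holds one cluster per kernel class, mirroring the C1 and C2 clusters described for the FSD example.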
4.3 Design Space Scaffolding
The workgroup synthesis strategy adopted is illustrated in
Fig. 8a. A workgroup is synthesised for each class of kernel.
It executes all instances of the kernel batches, where each
batch is a set of clusters (as determined in Section 4.2) of
sufficient size to meet the system throughput requirement.
In deriving the workgroup, there are three key challenges:
determining the batch size, the number of SIMDs and the
width of each. To aid this process, a template workgroup
structure is assumed, illustrated in Fig. 8b.
A workgroup is a two-dimensional SIMD structure, the
rows of which are composed of SIMD units with col-
umn i (i = 1, ..., l) formed by the composite of lanes
(j, i) , j = 1, ..., d. To derive such a structure, d and l must
be determined to execute a batch with a given throughput.
The dimensions (d, l) can vary between:
• (1, 1): one single-lane SIMD is employed to process wi
kernels sequentially,
• (wi, 1): multiple single-lane SIMDs are employed, with
each SIMD realising a single kernel
• (1, wi): one multi-lane SIMD is employed, each lane of
which realises a single kernel
Fig. 8: Workgroup Structure & Synthesis: (a) Synthesis Overview, (b) 2D multi-SIMD Resource Overview
To guide the selection of the appropriate combination of
rows/columns, two key observations may be made:
• To achieve highest efficiency, the kernel load of each
column should be balanced so that no lane is idle
awaiting others to finish. This implies that the number
of kernels executed per column is an integer factor of
the number contained in the batch.
• FPE performance scales linearly with number of lanes
up to a width of 16, after which clock period constraints
resulting from wide instruction broadcast impose
increasingly sublinear scaling [41].
A batch describes a subset of the clusters associated
with each class of kernel; the set of viable batch sizes is
$S = \{s \in \mathbb{Z} : 1 \le s \le |C|\}$. For each viable batch size
$s_i \in S$, a multi-phase workload may be defined as a sequence
$W$, with each $w_i \in W$ determining the number of kernels
executed during that phase of the sequence. Specifically, for
each $s \in S$, $W = \{w_i\}_{i=1}^{\lceil |C|/s \rceil}$, where

$$
w_i =
\begin{cases}
\sum\limits_{j=s\cdot(i-1)+1}^{s\cdot i} |C_j| & \text{when } i \le \frac{|C|}{s} \\
\sum\limits_{j=s\cdot(i-1)+1}^{|S|} |C_j| & \text{when } i = \frac{|C|}{s} + 1
\end{cases}
\qquad (1)
$$
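A short Python sketch of the workload enumeration in (1) is given below. It is illustrative only: the function name mirrors ENUMERATEWORKLOAD in Algorithm 2, but the signature, which takes the list of cluster sizes |C_j| explicitly, is an assumption. Slicing handles the final, partial phase that arises when s does not divide |C| (note that |S| = |C| by the definition of S).

```python
from math import ceil


def enumerate_workload(cluster_sizes, s):
    """Return W = [w_1, ..., w_ceil(|C|/s)] for batch size s, following (1).

    cluster_sizes[j] is the number of kernels in cluster j (0-based here).
    """
    phases = ceil(len(cluster_sizes) / s)
    # Phase i (0-based) covers clusters s*i ... s*(i+1)-1; the last slice is
    # truncated automatically when fewer than s clusters remain.
    return [sum(cluster_sizes[s * i: s * (i + 1)]) for i in range(phases)]


# FSD kernel k1: 108 clusters (one per sub-band) of 16 branch kernels each.
fsd_k1 = [16] * 108
even = enumerate_workload(fsd_k1, 4)    # 4 divides 108: equal phases
uneven = enumerate_workload(fsd_k1, 5)  # 5 does not: last phase is smaller
```

With a batch size of 4 every phase holds 64 kernels; with a batch size of 5 the first 21 phases hold 80 kernels and the last holds the remaining 48.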
Given these observations, a set L of candidate work-
group widths (i.e. number of columns) can be enumerated
as the integer factors of the workload size, up to a limit of
16:
$$
L = \left\{ l \in \mathbb{Z}^{+} : \left( \sum_{i=1}^{|C|} |c_i| \right) \bmod l = 0 \ \text{and}\ l \le 16 \right\} \qquad (2)
$$
Given this set of widths, the viable depths d can be
determined by subdividing the batch across the workgroup
columns and determining the number of SIMDs required by
comparing estimates of the iteration rate to the requirement.
This is achieved via Algorithm 2.
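The candidate widths of (2) can be enumerated directly, as sketched below. The function name follows ENUMERATEWIDTHS in Algorithm 2; passing the workload size as a single integer and exposing the limit of 16 as a `max_width` parameter are assumptions made for this illustration.

```python
def enumerate_widths(workload_size, max_width=16):
    """Integer factors of the workload size, limited to max_width lanes, as in (2)."""
    return [l for l in range(1, max_width + 1) if workload_size % l == 0]


widths_full = enumerate_widths(16 * 108)  # all 1728 FSD k1 kernels
widths_batch = enumerate_widths(64)       # a batch of 4 clusters of 16 kernels
```

Restricting the widths to integer factors keeps the kernel load of every column equal, so no lane idles while others finish.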
Algorithm 2 Workgroup Synthesis
1: procedure WORKGROUPSYNTHESIS(C, n)
2:   M, M_c ← ∅
3:   S ← {s ∈ Z : 1 ≤ s ≤ |C|}
4:   for each s ∈ S do
5:     W ← ENUMERATEWORKLOAD(s)
6:     L ← ENUMERATEWIDTHS(w_1)
7:     n′ ← n × |W|
8:     for each l ∈ L do
9:       M′ ← ∅
10:      M′ ← DEPLOY(w_1, C, l, n′)
11:      M_c ← M_c ∪ M′
12:  M ← SELECTCANDIDATE(M_c)
13:  return M
Two pieces of information allow the workgroup to be
created: the kernel clusters C and n, which defines the
number of iterations of the clusters required every sec-
ond. The goal of workgroup synthesis is to determine a
two-dimensional set M , with each element mij ∈ M the
sequence of kernels to be executed on the lane at row
i, column j of the workgroup. To derive this, batch size
and workgroup widths are successively enumerated and
the batches deployed on the corresponding workgroups.
The best of each of these options is chosen as the final
deployment. Specifically, the set of all viable batch sizes S is
enumerated (line 3 of Algorithm 2), and the corresponding
sequence of batches W and workgroup widths L defined
in lines 5 and 6, as in (1) and (2) respectively.
The workload is executed in |W | phases and hence the
iteration rate n is scaled accordingly (line 7). Then, for each
candidate workgroup width, viable depths and workload
mappings are derived (line 10, as described in Section 4.4)
and appended to the set of candidate solutions Mc (line 11).
The final result M is selected from this set (line 12).
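The enumeration loop can be sketched in Python as below. This is a minimal illustration, not the authors' tool: `toy_deploy` is a hypothetical stand-in for DEPLOY that assumes a fixed cycle count per kernel and a fixed clock frequency (both invented values), omits the cluster argument C, and returns only the workgroup dimensions rather than a full kernel mapping. Candidate selection is assumed to pick the solution with fewest lanes.

```python
from math import ceil


def enumerate_workload(cluster_sizes, s):
    phases = ceil(len(cluster_sizes) / s)
    return [sum(cluster_sizes[s * i: s * (i + 1)]) for i in range(phases)]


def enumerate_widths(size, max_width=16):
    return [l for l in range(1, max_width + 1) if size % l == 0]


def toy_deploy(w, l, n_scaled, cycles_per_kernel=100, f_clk=300e6):
    """HYPOTHETICAL cost model standing in for DEPLOY.

    One SIMD row executes w // l kernels per column in sequence; enough rows
    are allocated for the batch rate n_scaled to be met.
    """
    per_column = w // l
    seconds_per_batch = per_column * cycles_per_kernel / f_clk
    d = max(1, ceil(seconds_per_batch * n_scaled))
    return {"width": l, "depth": d, "lanes": l * d}


def workgroup_synthesis(cluster_sizes, n, deploy=toy_deploy):
    candidates = []
    for s in range(1, len(cluster_sizes) + 1):   # every viable batch size
        W = enumerate_workload(cluster_sizes, s)
        n_scaled = n * len(W)                    # |W| phases per iteration
        for l in enumerate_widths(W[0]):         # every candidate width
            candidates.append(deploy(W[0], l, n_scaled))
    # Assumed selection criterion: fewest lanes in total.
    return min(candidates, key=lambda m: m["lanes"])


relaxed = workgroup_synthesis([16] * 108, n=1)
demanding = workgroup_synthesis([16] * 108, n=100000)
```

Under this toy model a very low required rate is met by a single lane, while a demanding rate forces more lanes; the real flow replaces the cost model with the estimation described in Section 4.6.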
4.4 Workgroup Derivation
The DEPLOY process maps a batch w of kernels onto a work-
group of a given width l such that a number of iterations
per second of the batch n is achieved. The behaviour of this
process is described in Algorithm 3.
Algorithm 3 Multi-SIMD Workload Deployment
1: procedure DEPLOY(w,C, l, n)
2: A,M ′ ← ∅
3: A← SHAPEWORKLOAD(w, C , l)
4: M ′ ← ALLOCATEMAPSCHEDULE(A, n)
5: return M ′
There are two key steps: the batch is subdivided across
the columns which make up the width of the workgroup
(SHAPEWORKLOAD in line 3, described in Section 4.5),
before the resulting arrangement, denoted by the set A, is
used to determine the number of SIMDs required in order to
execute the kernels assigned to each column in satisfaction
of n (line 4, described in Section 4.6). The cardinality of the
resulting two-dimensional set M′ defines the dimensions of
the SIMD array, and its entries define the sequence of
kernels executed on each lane of each SIMD.
4.5 Workload Shaping
Workload shaping assigns kernels for execution on the
columns of the two-dimensional workgroup; at the point of
entry only the width of the workgroup (i.e. the number of
lanes in each SIMD) is defined, with the number of SIMDs
initially assumed to be one. The goal of this process is to
subdivide the batch across a given number of columns (i.e.
workgroup lanes), with the number of rows (i.e. SIMDs)
to be later derived. In order to ensure that kernels sharing
local data variables are assigned to the same SIMD lane,
they must be assigned to the same workgroup column and
hence the mapping of kernels to columns is guided by C
according to Algorithm 4.
Algorithm 4 Workload Shaping
1: procedure SHAPEWORKLOAD(w,C, l)
2: cl ← w / l
3: S ← C
4: A← ∅
5: i, j, k ← 0
6: while i ≤ |l| do
7: R← {q ⊆ sj : |q| = min(|sj |, cl − |sj |)}
8: ai ← ai ∪R
9: sj = sj \ R
10: if |ai| = cl then
11: k ← max(1,mod(k + 1, w))
12: if |sj | = 0 then
13: j ← j + 1
14: if k = 1 then
15: i← i+ 1
16: return A
The ultimate aim is to map kernels from
the batch w to workgroup lanes, producing a set A in which each
element ai ∈ A defines the kernels assigned to
lane i. To derive this subdivision from w, the number of
kernels per lane is calculated (line 2) and the kernels for each
lane are isolated (line 6). The assignment maintains column-local
communication, i.e. kernels from the same cluster are
assigned to the same column, as far as is possible. From
each cluster are extracted kernels which number the lower
of either the number of unmapped kernels in the cluster or
the number required in order to fully load the current lane
(line 7). These kernels are assigned to the current lane (line
8) and removed from the cluster (line 9). When a lane is fully
loaded the next is considered (lines 10, 11); otherwise if all
kernels in the cluster have been mapped the process repeats
for the next cluster (lines 12, 13).
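The shaping step can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: clusters are plain lists of kernel identifiers, each lane holds at most ⌈|w|/l⌉ kernels, and the number of kernels taken from a cluster is the lower of its unmapped kernels and the remaining capacity of the current lane, following the prose description. The names `shape_workload`, `clusters` and `lanes` are hypothetical.

```python
from math import ceil

def shape_workload(clusters, num_lanes):
    """Assign kernels to lanes, keeping each cluster on one lane where possible.

    clusters:  list of lists; each inner list holds the kernels of one cluster.
    num_lanes: workgroup width l.
    Returns a list A of num_lanes lists; A[i] holds the kernels mapped to lane i.
    """
    total = sum(len(c) for c in clusters)
    per_lane = ceil(total / num_lanes)          # kernels per lane (c_l)
    lanes = [[] for _ in range(num_lanes)]
    lane = 0
    for cluster in clusters:
        remaining = list(cluster)               # unmapped kernels of this cluster
        while remaining:
            space = per_lane - len(lanes[lane])
            take = min(len(remaining), space)   # lower of unmapped vs. lane space
            lanes[lane].extend(remaining[:take])
            remaining = remaining[take:]
            if len(lanes[lane]) == per_lane and lane < num_lanes - 1:
                lane += 1                       # lane fully loaded: move on
    return lanes

# 54 clusters of 16 kernels shaped onto a 3-lane workgroup
clusters = [[(c, k) for k in range(16)] for c in range(54)]
A = shape_workload(clusters, 3)
```

With these inputs each lane receives 18 whole clusters, so no cluster is split across columns.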
4.6 Allocation, Mapping and Scheduling
Given the mapping of kernels to workgroup lanes, it re-
mains to determine the number of SIMDs required in order
to execute each lane’s load to meet the system’s throughput
requirements. This is performed by a joint allocation/map-
ping/scheduling process which has three main objectives:
• Determine the number of SIMDs.
• Assign each kernel to a specific lane of a specific SIMD.
• Order the execution of kernels on each SIMD.
This requires a procedure with two inputs: a definition
of the kernels assigned to each workgroup lane, A, and the
required number of iterations per second, n. The resulting
two-dimensional set M ′
describes the sequence of kernels executed on each lane of
each SIMD, derived via Algorithm 5.
Algorithm 5 Allocation, Mapping and Scheduling
1: procedure ALLOCATEMAPSCHEDULE(A,n)
2: dmax ← |a1|, dmin ← 1
3: M ′ ← ∅
4: while 1 ≤ dmin ≤ dmax do
5: d = dmin + ⌈(dmax − dmin)/2⌉
6: k = |a1|/d
7: ne = ESTIMATETHROUGHPUT(k, l, d)
8: if ne > n then
9: M ′ ← MAP(A, d)
10: dmax ← d− 1
11: else
12: dmin ← d+ 1
13: return M ′
There are, potentially, a significant number of options
for the number of SIMDs - any integer number up to a
maximum of |ai| - and a greedy design space pruning
process is used to determine the appropriate value. Upper
and lower bounds on the number of SIMDs, dmax and
dmin are defined (line 2), with the range between these
limits successively halved over multiple iterations. In each
iteration, the performance of the mid-point of the range
is estimated. In the case where it is too low, the upper
half is chosen on the next iteration or, in case it exceeds
the requirement, the lower half is chosen. In each iteration
the number of kernels assigned to each workgroup row is
determined (line 6) and its throughput estimated (line 7
- see Section 5). If the estimated throughput exceeds the
requirement (line 8), the allocation is valid and the kernel
load is mapped across the d SIMDs (line 9) before the upper
bound on the search space is lowered to d − 1 (line 10)
and the process repeated to determine potentially lower-cost
solutions. When performance is not sufficient, the lower
bound dmin is increased to d + 1 (line 12) and the process
repeats until the search range is exhausted, i.e. until the
lower bound exceeds the upper bound. The result is the last
workgroup derived which exceeds the performance requirement.
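The range-halving search behaves like a binary search over the number of SIMDs, and can be sketched as below. This is an illustrative sketch, not the authors' code: `estimate_throughput` is a hypothetical stand-in for ESTIMATETHROUGHPUT, the kernels per row are rounded up, and the estimate is assumed not to decrease as SIMDs are added.

```python
from math import ceil

def allocate(kernels_per_lane, required_rate, estimate_throughput):
    """Search for the smallest SIMD count d whose estimate exceeds the target.

    kernels_per_lane:    |a_1|, the kernel load of one workgroup lane.
    required_rate:       n, the required iterations per second.
    estimate_throughput: callable (k, d) -> estimated iterations per second.
    Returns the chosen d, or None if no candidate meets the target.
    """
    d_min, d_max = 1, kernels_per_lane
    best = None
    while 1 <= d_min <= d_max:
        d = d_min + ceil((d_max - d_min) / 2)   # mid-point of the current range
        k = ceil(kernels_per_lane / d)          # kernels per SIMD row
        if estimate_throughput(k, d) > required_rate:
            best = d                            # valid: look for a cheaper solution
            d_max = d - 1
        else:
            d_min = d + 1                       # too slow: more SIMDs are needed
    return best

# Toy model (assumption): throughput is inversely proportional to kernels per row.
chosen = allocate(288, 25, lambda k, d: 1000.0 / k)
```

Because each accepted candidate lowers the upper bound, the value returned is the smallest d visited that satisfies the target.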
The process of mapping kernels to SIMD rows is a trivial
subdivision of each a ∈ A into d subsequences each of
length ⌈|a|/d⌉, where each subsequence describes the kernels
to be executed on each row of the workgroup. This process
is not described further here.
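For completeness, one way this subdivision could look is sketched below. The function names and the row ordering are assumptions; the source does not specify which subsequence is placed on which row.

```python
from math import ceil

def map_rows(lane_kernels, d):
    """Split one lane's kernel sequence into d rows of at most ceil(|a|/d) kernels."""
    size = ceil(len(lane_kernels) / d)
    return [lane_kernels[r * size:(r + 1) * size] for r in range(d)]

def map_workgroup(A, d):
    """Return M' where M'[row][lane] is the kernel sequence for that SIMD lane."""
    per_lane = [map_rows(a, d) for a in A]
    return [[per_lane[lane][row] for lane in range(len(A))] for row in range(d)]

# Three lanes of 18 clusters each, mapped onto d = 2 SIMDs
A = [list(range(0, 18)), list(range(18, 36)), list(range(36, 54))]
M = map_workgroup(A, 2)
```

Here each of the two SIMDs executes nine clusters per lane.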
4.7 FSD Example
Fig. 9 illustrates the process of synthesising a workgroup
to realise k1 for FSD. As shown, there are 108 clusters,
each containing 16 instances of k1. Accordingly S =
{1, 2, ..., 108}. For each si ∈ S, a workload can be derived.
In the case where si = 54, a two-phase workload results,
each phase of which executes 864 kernels. Given the process-
ing of 54 clusters per batch, the viable widths of workgroup,
i.e. the integer factors of the number of clusters, are given by
L = {1, 2, 3, 6, 9}. For each li ∈ L the number of workgroup
rows may then be defined as D = {di ∈ Z+ ∩ [1, li]}. Fig. 9
illustrates the final arrangement when d = 2 and the kernel
load for each workgroup column subdivided thereon.
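The arithmetic of this example can be checked with a few lines of Python. The upper limit of 16 lanes used to filter the factors is an assumption, suggested by the 1 to 16 lane range profiled in Section 5; the source does not state why larger factors of 54 are excluded from L.

```python
def divisors(n, limit):
    """Integer factors of n that do not exceed limit."""
    return [d for d in range(1, min(n, limit) + 1) if n % d == 0]

clusters, kernels_per_cluster = 108, 16
batch = 54                                        # clusters per phase (s_i = 54)
phases = clusters // batch                        # number of workload phases
kernels_per_phase = batch * kernels_per_cluster   # kernels executed in each phase
widths = divisors(batch, limit=16)                # candidate workgroup widths (assumed cap)
```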
Fig. 9: Process Overview for FSD
Key to this process is its ability to estimate the through-
put of a given workload, on a given workgroup and to
account for potential estimation inaccuracies. Techniques to
facilitate both these objectives are described in Section 5.
5 PERFORMANCE ESTIMATION AND SELF-
CORRECTION
5.1 Throughput Estimation
To estimate throughput, two key metrics are required: the
number of cycles required to execute the workload and
the length of each cycle, i.e. the clock period of the architecture.
Each SIMD executes a sequence of kernels and hence one
prominent component of the throughput estimation prob-
lem is determining the number of instructions and cycles
required to execute a given number of identical kernels.
Given this information, the estimation process can employ
any scheduling approach desired. Since the accelerator ar-
chitecture exploits numerous copies of a single component
(the FPE) in various SIMD configurations, the instruction
stream for a kernel will be identical, regardless of the
component of the final architecture on which it is deployed. This
allows pre-synthesis characterisation of the performance
and cost of a kernel, a characterisation which may be used
to enable the allocation process.
In order to reduce resource cost, forwarding hardware
has been omitted from the FPE. In order, then, to avoid data
hazards, NOPs must be inserted in the instruction stream
realising a kernel in order to synchronise operand accesses.
This leads to kernel instruction sequences such as that in
Fig. 10a. Consider the resulting effect on the execution of a
sequence of similar kernels by the FPE. Fig. 10 shows two
example two-kernel workloads.
Fig. 10: Kernel Interleaving Illustration. (a) Kernel Instruction Streams; (b) Interleaved Kernels.
In (a), each kernel i has the stream Ai1, NOP, NOP, NOP, Ci1, Ci2, NOP, NOP, NOP, NOP, Ai2, Ai3, Ai4, NOP.
In both these cases, a series of Effective Instructions
(EIs) is interspersed with NOPs for the purposes of data
synchronisation. Assume that each kernel also requires r
register file locations. In Fig. 10a, the two kernels may
be executed sequentially, requiring only r RF locations;
however, they may also be interleaved, with the EIs from
one kernel occupying the NOPs from the other as in Fig.
10b. The interleaved version enables higher efficiency and
throughput, but has increased RF cost. Hence each SIMD
should interleave kernels as much as possible, so long as
RF capacity constraints allow. The estimation problem is to
determine the number of cycles required to execute a given
multi-kernel workload, within a given constraint on r. This
is determined by profiling the kernels, deriving instruction-level
statistics of their computational operations and NOPs
and combining these into a single cost metric.
Assuming an RF occupancy per kernel of r registers,
then given a constraint on the number of RF locations rc¹,
the maximum number of interleaved kernels f is given by:
$$f = \left\lfloor \frac{r_c}{r} \right\rfloor \tag{3}$$
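As a quick numerical illustration of (3), the sketch below evaluates f for the 64-location RF assumed in this paper. The per-kernel occupancy of 10 registers is a hypothetical value chosen only for the example.

```python
def max_interleaved_kernels(rf_capacity, rf_per_kernel):
    """Equation (3): number of kernels whose registers fit in the RF at once."""
    return rf_capacity // rf_per_kernel

# rc = 64 locations; r = 10 is an illustrative assumption
f = max_interleaved_kernels(64, 10)
```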
The PM cost effect of interleaving successive kernels can
be estimated by considering each kernel to be a sequence
of instructions subdivided into a sequence of blocks demar-
cated at {NOP,EI} sequence boundaries - i.e. the first EI
following a NOP represents the start of a new block. Each
block consists of a set of EIs EI followed by a set of NOPs
NOP and may then be represented by a coefficient γ:
$$\gamma = \frac{|EI|}{|NOP|} \tag{4}$$
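The block partitioning and the coefficient in (4) can be illustrated with a small Python sketch applied to the kernel stream of Fig. 10a. The function name and the list-of-strings representation are assumptions made for the example; a trailing block with no NOPs has no defined coefficient and is reported as `None`.

```python
from fractions import Fraction

def block_coefficients(stream):
    """Split an instruction stream into blocks and return gamma for each block.

    A new block starts at the first effective instruction (EI) following a NOP.
    """
    blocks, eis, nops = [], 0, 0
    for op in stream:
        if op == "NOP":
            nops += 1
        else:
            if nops > 0:                 # EI after a NOP: close the previous block
                blocks.append((eis, nops))
                eis, nops = 0, 0
            eis += 1
    if eis or nops:
        blocks.append((eis, nops))       # close the final block
    return [Fraction(e, n) if n else None for e, n in blocks]

kernel = ["A11", "NOP", "NOP", "NOP", "C11", "C12", "NOP", "NOP", "NOP", "NOP",
          "A12", "A13", "A14", "NOP"]
gammas = block_coefficients(kernel)
```

For this stream the three blocks contain 1, 2 and 3 EIs followed by 3, 4 and 1 NOPs respectively.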
Letting PIL denote the maximum pipeline stage length²,
kernel instruction statistics are categorised into PIL catalogue
sets depending on (0 ≤ γ ≤ 1), (1 ≤ γ ≤ 2), . . .,
(γ = PIL − 1). Defining two cost vectors, a and b, where
ai and bi indicate respectively the number of EIs and total
instructions of a block in the ith catalogue (i ∈ [1, PIL]),
the PM size increment ∆p of adding a further interleaved
kernel is given by [42]:
$$\Delta p = \sum_{i=1}^{k-2} a_i + k \cdot a_{k-1} - b_{k-1} \tag{5}$$
Hence, for k kernels mapped to an FPE, the total PM cost
is given by
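$$p = \left\lfloor \frac{k}{f} \right\rfloor \left[ p + \Delta p\,(f - 1) + \min(x, P_{IL} - 1) \right] + \left( p + \Delta p \left[ (k \bmod f) - 1 \right] \right) \tag{6}$$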
The two additive terms in (6) respectively represent
the total PM cost of the ⌊k/f⌋ full interleaves and the final
interleave, which may or may not be fully occupied. The
value p denotes the number of cycles required to execute
the multi-kernel workload for each FPE.
This analysis allows compile-time evaluation of the
number of cycles required to execute a given set of kernels.
The final performance in real-world terms depends not only
on the number of cycles, but the length of each, as dictated
by the clock period of the synthesised architecture. This
period is determined by vendor place-and-route tools, such
as Xilinx ISE or Vivado and the quality of the final result can
be optimised by using additional intelligence to guide the
process [43]. However, for this process the primary concern
is the ability to estimate the final result, without undergoing
the long delays associated with executing these functions.
We need to be able to accurately estimate the length of each
cycle that will result from any approaches such as these
without actually executing them. This is achieved by pre-profiling,
via RTL synthesis, varying numbers of SIMDs
of varying width. Fig. 11 illustrates this profile for 1 to 20
SIMDs, each of which has 1 to 16 lanes, on Xilinx Virtex-5.
1. For the remainder of this paper, assume a maximum RF size of 64 locations.
2. For the remainder of this paper, PIL = 6.
Fig. 11: Clock Rate Database for Virtex-5 (axes: SIMD Width, No. of SIMD, clock (MHz))
As this shows, the highest clock rate - approximately
370 MHz - is achieved by a single SISD processor, with the
lowest experienced for 20 SIMD processors with 16 FPEs. As
shown in Fig. 11, the anticipated clock rate trends are observed:
as the total resource realised on the device increases
(represented by points towards the front left hand corner),
clock rate reduces, a natural result of the optimization
algorithms executed by Xilinx ISE increasingly struggling to
find low-cost/high-performance design space points as the
scale of the gate-level netlist being mapped increases. Given
this profiling and the estimation of the number of cycles
required for workload execution, the throughput of a realisation,
in iterations per second, may be estimated. Letting ce
denote the estimated clock frequency, the estimated number
of iterations per second ne - used in Algorithm 5 to determine the
viability of a realisation - is given by
$$n_e = \frac{c_e}{p} \tag{7}$$
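A minimal sketch of how (7) might be combined with a clock-rate database is given below. The dictionary layout and the (2, 4) entry are hypothetical placeholders, not values read from Fig. 11; only the approximate 370 MHz figure for a single SISD processor comes from the text.

```python
def estimate_throughput(cycles, clock_mhz):
    """Equation (7): iterations per second = clock frequency / cycles per iteration."""
    return clock_mhz * 1e6 / cycles

# Hypothetical excerpt of a clock-rate database keyed by (num_simds, width).
clock_db_mhz = {
    (1, 1): 370.0,   # approximate value quoted in the text for a single SISD
    (2, 4): 350.0,   # placeholder, not a measured value
}

def lookup_clock(num_simds, width):
    """Return the pre-profiled clock estimate for a given array shape."""
    return clock_db_mhz[(num_simds, width)]

# A workload needing 10 cycles per iteration on a single SISD processor
n_e = estimate_throughput(10, lookup_clock(1, 1))
```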
5.2 Self-Correction
At the point of estimation the number of instructions can, in
fact, be measured rather than estimated. However, the clock
frequency is a true estimate: the precise value cannot be
known until after FPGA place-and-route is complete. At the
proposed pre-synthesis point of estimation there is likely
to be some error between the estimated and actual clock
frequencies. Since this estimate is an intrinsic part of the
design process the inherent inaccuracy may preclude the
result from meeting the intended real-time performance.
Suppose that the estimated clock frequency is higher
than the post-place-and-route actual frequency; this reduced
clock frequency will lead to a reduced real-time perfor-
mance, which in turn may be below the threshold perfor-
mance target. In this case, allocation needs to be repeated to
account for the discrepancy. To automatically derive a viable
accelerator whilst accounting for the estimation discrepancy,
the multi-phase synthesis process in Fig. 12 is employed.
As this shows, an iterative process adjusts the throughput
target to account for inaccuracies in the estimated clock
frequency ce. If ce exceeds the actual clock rate ca, and ca is
sufficiently low that the actual number of iterations na < n,
where n is the throughput requirement, then the threshold
is adjusted (increased) to account for the differential, scaling
by the ratio of the estimated and actual clock periods.
Fig. 12: Iterative Synthesis Process
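The iterative loop can be sketched as follows. This is an illustration under stated assumptions: `synthesise` and `place_and_route` are hypothetical callables standing in for the synthesis flow and the vendor tools, and the target is assumed to be scaled by ce/ca, which raises it whenever the clock was overestimated.

```python
def synthesise_with_correction(n, synthesise, place_and_route, max_rounds=10):
    """Repeat synthesis with a raised target until the measured rate meets n.

    synthesise(target)      -> (design, estimated_clock, cycles)   [hypothetical]
    place_and_route(design) -> actual_clock                        [hypothetical]
    Returns (design, actual_rate), or None if max_rounds is exhausted.
    """
    target = n
    for _ in range(max_rounds):
        design, c_est, cycles = synthesise(target)
        c_act = place_and_route(design)
        n_act = c_act / cycles                 # actual iterations per second
        if n_act >= n:
            return design, n_act               # requirement met: stop
        target = target * c_est / c_act        # assumed scaling of the target
    return None

# Toy model: a design is just a SIMD count d; 100 cycles are shared across d SIMDs.
def toy_synthesise(target):
    d = int(target * 100 / 300) + 1            # smallest d meeting target at 300 Hz
    return d, 300.0, 100.0 / d

result = synthesise_with_correction(25, toy_synthesise, lambda design: 270.0)
```

In the toy model the first pass selects nine SIMDs, which falls short at the lower actual clock, and the second pass selects ten.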
6 EXPERIMENTS
To illustrate the capability of the proposed synthesis process,
seven exemplar accelerators are addressed for 4 × 4 MIMO
transceivers:
1) FSD, SSFE-[1, 1, 1, 4] and SSFE-[1, 1, 2, 4] tree-search SD
2) Zero-Forcing (ZF) and Minimum Mean Square Error
(MMSE) equalisation
3) Sorted QR Decomposition (SQRD) pre-processing
The capability of the FPE AS approach is evaluated
in the context of 802.11n, which
demands 480 Mbps throughput for FSD and SSFE detection and
ZF/MMSE equalisation, and 30 × 10⁶ iterations/second for
SQRD - demanding requirements even for hand-crafted
accelerators [44], [24]. This application has been chosen
because it requires a range of operation types typical in
signal, image and data processing - linear algebraic (matrix
decomposition, matrix-vector and matrix-matrix multiplica-
tion) and tree-search operations, in a demanding real-time
setting.
The SFG-AS process described in Sections 4 and 5 has
been realised in a prototype, the behaviour of which is
described in Fig. 13. In the main, XML is used for all input
and intermediate data exchange, with the final result being
VHDL and C sources describing the respective structure
and executables for the multi-FPE accelerator derived. The
intermediate processing stages match those in Section 4 and
5 and are realised using Java. The C source for each FPE is
compiled using a custom LLVM-based compiler, to produce
assembly. The RTL source is translated to Xilinx Virtex-5
XC5VSX240T via ISE 14.2. In line with standard practice,
to permit objective analysis of the performance and cost of
the accelerators produced and comparison with existing and
future approaches in the areas of HLS [10], [11], [13], [26],
[43], FPGA-based processors [15], [16], [18], [19], [20], [22],
[23], [24], [25], [43] and accelerators [1], [2], [13], [26], all
performance and cost metrics are measured post-place and
route, independent of a specific hardware platform.
Fig. 13: Prototype SFG-AS Tool Structure
6.1 Tree Search: FSD & SSFE
In order to realise FSD and SSFE-[1, 1, 2, 4] the SFG appli-
cation models and corresponding kernels are respectively
shown in Fig. 5 and Fig. 7. The kernels for SSFE-[1, 1, 1, 4]
are illustrated in Fig. 14. Real-time operation for 4 × 4, 16-
QAM 802.11n MIMO demands 480 Mbps throughput and
is taken as the performance target. The resulting accelera-
tors are itemised in Table 1. The FSD and SSFE-[1, 1, 1, 4]
accelerators are depicted in Fig. 14.
The key features of the synthesis process are evident in
the SIMD structures in Fig. 14. For FSD (Fig. 7a), two work-
groups are created, one each for realisation of k1 and k2 in
Fig. 6, with point-to-point FIFOs realising the dependencies
between the two. For k1 a workgroup of twelve 16-way
SIMDs is realised, whilst for k2, the workgroup consists of
two 12-way SIMDs. Similarly, three workgroups are created
for the three kernels which describe SSFE-[1, 1, 1, 4] - a 9-
way SIMD for k1, three 12-way SIMDs for k2 and a 6-way
SIMD for k3.
TABLE 1: 4 × 4 FSD Implementation Results
Modulation           16-QAM   64-QAM
SIMDs                14       25
DSP48E1              216      304
LUTs (×10³)          31.1     154.42
Clock (MHz)          298      278
Throughput (Mbps)    483.2    506.4
Clock Est. (MHz)     289      276
Error (%)            3        0.72
There are a number of notable aspects of the results
in Tables 1 and 2. Immediately obvious is that the real-time
performance requirements have been satisfied; to the
best of the authors’ knowledge, this is the first record of
automatic derivation of multicore accelerators in satisfaction
of a pre-defined performance requirement. In addition, it
is worth noting the effectiveness of the proposed process
in guiding the creation of each accelerator. In no case was
the relative error in the estimated clock rate greater than
3%. This indicates that the pre-synthesis clock rate estimates
were very accurate. Indeed, it is perhaps notable that in a
number of instances, clock rate was underestimated.
Fig. 14: FSD and SSFE SIMD Architectures. (a) FSD multi-SIMD Architecture; (b) SSFE-[1, 1, 1, 4] multi-SIMD Architecture
TABLE 2: 4 × 4 16/64-QAM SSFE Implementations
                     [1,1,1,4]         [1,1,2,4]
QAM                  16       64       16       64
SIMDs                5        4        6        6
DSP48E1              45       33       80       49
LUTs (×10³)          8.35     6.3      18.3     11.7
Clock (MHz)          355      357      354      353
Throughput (Mbps)    541.2    499.1    544.87   536.6
Clock Est. (MHz)     353      357      347      348
Error (%)            0.6      0        2.0      1.4
6.2 Equalisation: ZF & MMSE
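In MIMO communications, ZF or MMSE equalisation forms
an estimate $\hat{\mathbf{x}}$ of the transmitted symbol vector x by forming
the product of the received symbol vector y and an equalisation
matrix W:
$$\hat{\mathbf{x}} = \mathbf{W} \cdot \mathbf{y} \tag{8}$$
where $\mathbf{y} \in \mathbb{C}^{4 \times 1}$ and $\mathbf{W} \in \mathbb{C}^{4 \times 4}$. The equalisation matrix W
takes different forms depending on whether a ZF or MMSE
equalisation strategy is to be employed. For ZF
$$\mathbf{W}_{ZF} = \left(\mathbf{H}^{H} \cdot \mathbf{H}\right)^{-1} \cdot \mathbf{H}^{H} \tag{9}$$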
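where $\mathbf{H} \in \mathbb{C}^{4 \times 4}$ is the channel matrix and $\mathbf{H}^{H}$ denotes
the Hermitian transpose of H. For MMSE equalisation,
$$\mathbf{W}_{MMSE} = \left(\mathbf{H}^{H} \cdot \mathbf{H} + \frac{\mathbf{I}_{M}}{\rho}\right)^{-1} \cdot \mathbf{H}^{H} \tag{10}$$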
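where $\mathbf{I}_{M}$ is an identity matrix of order 4. The very high
complexity of matrix inversion is the major challenge presented
by this operation. To address this issue, QR decomposition
is applied to H to produce:
$$\begin{aligned} \mathbf{W} &= \left( (\mathbf{Q} \cdot \mathbf{R})^{H} \cdot (\mathbf{Q} \cdot \mathbf{R}) \right)^{-1} \cdot \mathbf{H}^{H} \\ &= \mathbf{R}^{-1} \cdot \left(\mathbf{R}^{-1}\right)^{H} \cdot \mathbf{H}^{H} \end{aligned} \tag{11}$$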
where both $\mathbf{Q}, \mathbf{R} \in \mathbb{C}^{4 \times 4}$. According to this reformulation,
the SDF application model for ZF and MMSE equalisation
is shown in Fig. 15. As this shows, multiple operations
are invoked, including QR decomposition of the channel
matrix H, followed by back-substitution to derive the inverse $\mathbf{R}^{-1}$
of the R matrix produced. Subsequently the products of
$\mathbf{R}^{-1}$, its Hermitian transpose and $\mathbf{H}^{H}\mathbf{y}$ are formed to derive $\hat{\mathbf{x}}$. Real-time
operation for 4 × 4 802.11n MIMO requires 480 Mbps
throughput - the throughput and cost metrics obtained are
described in Table 3.
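The simplification in (11) rests on Q having orthonormal columns, so that $\mathbf{Q}^{H}\mathbf{Q} = \mathbf{I}$. For the ZF form (9), the intermediate steps can be written out as a short derivation using only the definitions above:

```latex
\begin{align*}
(\mathbf{Q}\mathbf{R})^{H}(\mathbf{Q}\mathbf{R})
  &= \mathbf{R}^{H}\mathbf{Q}^{H}\mathbf{Q}\mathbf{R}
   = \mathbf{R}^{H}\mathbf{R}, \\
\left(\mathbf{R}^{H}\mathbf{R}\right)^{-1}
  &= \mathbf{R}^{-1}\left(\mathbf{R}^{H}\right)^{-1}
   = \mathbf{R}^{-1}\left(\mathbf{R}^{-1}\right)^{H}.
\end{align*}
```

The inversion of a full matrix is thereby replaced by the inversion of the triangular factor R, which is what makes back-substitution applicable.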
Fig. 15: ZF / MMSE SFG Model
It is again notable that, in both cases, real-time accel-
erators automatically result; to the best of the authors’
knowledge, this is the first time this capability has been
demonstrated for algebraic operations, such as the matrix
triangularisation, inversion and multiplication operations.
TABLE 3: 4 × 4 ZF & MMSE Implementations
                     ZF       MMSE
SIMD                 13       24
DSP48E1              180      384
LUTs (×10³)          31.6     73.1
Clock (MHz)          292      238
Throughput (Mbps)    507.69   591.6
Clock Est. (MHz)     330      255
Error (%)            11.5     6.7
In addition, note again the effectiveness of the design process
in estimating and refining the accelerator architecture.
In the case of the MMSE accelerator, the estimated clock
rate is only 6.7% in error. The error is somewhat larger
for ZF, where an 11.5% error in the initial estimate is
encountered; whilst this is higher than any other estimate, it
is still modest in absolute terms.
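The error figures quoted for Table 3 appear consistent with an error taken relative to the estimated clock rate. This definition is inferred from the tabulated values rather than stated in the text, and the function name below is hypothetical.

```python
def clock_error_percent(estimated_mhz, actual_mhz):
    """Relative clock-rate error in percent, taken here relative to the estimate."""
    return abs(estimated_mhz - actual_mhz) / estimated_mhz * 100

zf_error = round(clock_error_percent(330, 292), 1)     # ZF:   11.5
mmse_error = round(clock_error_percent(255, 238), 1)   # MMSE: 6.7
```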
6.3 Preprocessing: SQRD
In order to realise real-time ordering for 4 × 4 802.11n MIMO,
the resulting accelerator has to operate at a rate of 30 × 10⁶
iterations per second. SQRD merges QR decomposition of
the channel matrix H with heuristic-based sorting to ensure
that the decoding process addresses antennas in the correct
order to account for the relative distortion experienced by
each [45]. Table 4 reports both MMSE and ZF variants.
TABLE 4: 4 × 4 SQRD SD Preprocessing
                               ZF       MMSE
SIMD                           5        11
DSP48E1                        72       180
LUTs (×10³)                    39.2     30.9
Clock (MHz)                    347      314
Throughput (×10⁶ iterations/s) 31.1     30.4
Clock Est. (MHz)               345      335
Error (%)                      0.6      6.3
Once again, it is notable that these automatically derived
accelerators meet the real-time performance requirement
and that the estimation-based design process has been
highly effective. For MMSE, the estimated clock rate and
throughput are only 6.3% in error. Similarly, the ZF esti-
mates are actually underestimates.
7 CONCLUSION & FUTURE WORK
This paper has presented an approach for AS of accelerators
for modern FPGA which achieves two unique capabilities.
By deriving custom multi-SIMD processors it can harness
the programmable datapath resources which increasingly
make up a substantial portion of the computational capacity
of modern FPGA. Furthermore, it automates the generation
of accelerators satisfying real-time performance requirements
prescribed by their industrial operating context or
resulting from their use in standards-based equipment. This process is
facilitated by offline characterisation of the performance
of multi-SIMD topologies, compile-time evaluation of the
cycle cost of SIMD programs and a self-correcting synthesis
strategy which adapts to account for errors in the estimation
process. When applied to the design of large-scale linear-
algebraic (matrix triangularisation and multiplication) and
tree-search operations, it automatically produces a series
of accelerators capable of supporting real-time performance
for 4 × 4 802.11n MIMO employing either 16-QAM or 64-QAM.
This is a notable achievement since, on the same
FPGA technology, implementations of the same operations
have had to be enabled by hand-crafted RTL design, if indeed
these previously existed - the authors are unaware of any
work which enables real-time FSD employing 64-QAM, for
instance. Furthermore, this paper targets Virtex-5 FPGA, but
the techniques presented are applicable to later generations
since the FPE 'virtualizes' the FPGA: the synthesis process derives
networks of FPEs and instructions for execution on each FPE, rather
than targeting the FPGA device architecture.
Despite the effectiveness of this approach, a series of fur-
ther improvements could be made. For instance, it does not
consider the resource cost of inter-processor communication
and does not explore the potential for cost reduction via
different mapping of the same application on an allocation.
Similarly, automatically tuning the FPE RTL architecture
to its functionality is not considered. Since the DSP48E
slices targeted natively only support fixed-point arithmetic,
the only way to support floating-point is via emulation,
addition of floating-point co-processors next to, or in place
of, the DSP48E in Fig. 2, or by combining this work with
standard AS techniques which derive networks of fixed-
function floating-point components. In addition, previous
work [43] has shown the benefit of considering the nature of
the processing architecture being realised when optimizing
its mapping to the FPGA, and it is likely that a similar
approach could yield increased performance and/or lower
cost FPE-based accelerators. Each of these considerations
offers potentially considerable performance/cost benefit.
REFERENCES
[1] R. Woods, J. McAllister, G. Lightbody, and Y. Yi, FPGA-based
Implementation of Signal Processing Systems. John Wiley & Sons,
2008.
[2] W. Vanderbauwhede and K. Benkrid, High-Performance Computing
using FPGAs. Springer, 2013.
[3] E. Monmasson et al., “FPGAs in Industrial Control Applications,”
IEEE Trans. Industrial Informatics, vol. 7, no. 2, pp. 224–243, May
2011.
[4] D. Gajski, S. Abdi, A. Gerstlauer, and G. Schirner, Embedded System
Design: Modelling, Synthesis and Verification. Springer, 2009.
[5] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and
Implementation. Wiley, 1999.
[6] Xilinx Inc., 7 Series DSP48E1 Slice User Guide, Aug. 2013.
[7] ——, 7 Series FPGAs Memory Resources User Guide, Jan. 2014.
[8] O. Pell and V. Averbukh, “Maximum Performance Computing
with Dataflow Engines,” Computing in Science and Engineering,
vol. 14, no. 4, pp. 98–103, 2012.
[9] D. Pellerin and S. Thibault, Practical FPGA Programming in C,
1st ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2005.
[10] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and
Z. Zhang, “High-Level Synthesis for FPGAs: From Prototyping
to Deployment,” IEEE Trans. Computer-Aided Design of Integrated
Circuits and Systems, vol. 30, no. 4, pp. 473–491, April 2011.
[11] A. Canis et al., “LegUp: An Open-source High-level Synthesis
Tool for FPGA-based Processor/Accelerator Systems,” ACM Trans.
Embed. Comput. Syst., vol. 13, no. 2, pp. 24:1–24:27, Sep. 2013.
[12] C. E. Leiserson and J. B. Saxe, “Retiming Synchronous Circuitry,”
Algorithmica, vol. 6, no. 1, pp. 5–35, 1991. [Online]. Available:
http://dx.doi.org/10.1007/BF01759032
[13] Y. Yi and R. Woods, “Hierarchical Synthesis of Complex DSP Func-
tions using IRIS,” IEEE Trans. Computer-Aided Design of Integrated
Circuits and Systems, vol. 25, no. 5, pp. 806–820, May 2006.
[14] S. J. E. Wilton, S.-S. Ang, and W. Luk, “The Impact of Pipelining
on Energy per Operation in Field-Programmable Gate Arrays,” in
14th Intl. Conf. on Field Programmable Logic and Applications (FPL),
2004, pp. 791–728.
[15] G. Stitt and J. Coole, “Intermediate Fabrics: Virtual Architectures
for Near-Instant FPGA Compilation,” IEEE Embedded Systems Let-
ters, vol. 3, no. 3, pp. 81–84, Sept 2011.
[16] A. K. Jain, D. L. Maskell, and S. A. Fahmy, “Throughput Oriented
FPGA Overlays using DSP Blocks,” in 2016 Design, Automation Test
in Europe Conference Exhibition (DATE), March 2016, pp. 1628–1633.
[17] C. Wang, J. Zhang, X. Li, A. Wang, and X. Zhou, “Hardware Im-
plementation on FPGA for Task-Level Parallel Dataflow Execution
Engine,” IEEE Trans. Parallel and Distributed Systems, vol. 27, no. 8,
pp. 2303–2315, Aug 2016.
[18] J. Yu, C. Eagleston, C. H. Chou, M. Perreault, and G. Lemieux,
“Vector Processing as a Soft Processor Accelerator,” ACM Trans.
Reconfigurable Technology and Systems, vol. 2, no. 2, June 2009.
[19] P. Yiannacouras, J. Steffan, and J. Rose, “Portable, Flexible, and
Scalable Soft Vector Processors,” IEEE Trans. Very Large Scale
Integration (VLSI) Systems, vol. 20, no. 8, pp. 1429–1442, Aug. 2012.
[20] K. Andryc, M. Merchant, and R. Tessier, “FlexGrip: A soft GPGPU
for FPGAs,” in 2013 International Conference on Field-Programmable
Technology (FPT), Dec 2013, pp. 230–237.
[21] A. Al-Dujaili, F. Deragisch, A. Hagiescu, and W. F. Wong, “Guppy:
A GPU-like Soft-core Processor,” in 2012 International Conference
on Field-Programmable Technology, Dec 2012, pp. 57–60.
[22] M. Al Kadi, B. Janssen, and M. Huebner, “FGPU: An SIMT-
Architecture for FPGAs,” in Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, ser.
FPGA ’16. New York, NY, USA: ACM, 2016, pp. 254–263.
[Online]. Available: http://doi.acm.org/10.1145/2847263.2847273
[23] H. Y. Cheah, B. F., S. Fahmy, and D. L. Maskell, “The iDEA DSP
Block Based Soft Processor for FPGAs,” ACM Trans. Reconfigurable
Technology and Systems, vol. 7, no. 1, Feb. 2014.
[24] X. Chu and J. McAllister, “Software-Defined Sphere Decoding
for FPGA-Based MIMO Detection,” IEEE Trans. Signal Processing,
vol. 60, no. 11, pp. 6017–6026, Nov. 2012.
[25] P. Wang and J. McAllister, “Streaming Elements for FPGA Signal
and Image Processing Accelerators,” IEEE Trans. Very Large Scale
Integration (VLSI) Systems, vol. 24, no. 6, pp. 2262–2274, June 2016.
[26] P. Milder, F. Franchetti, J. Hoe, and M. Püschel, “Computer
Generation of Hardware for Linear Digital Signal Processing
Transforms,” ACM Trans. Design Automation for Electronic Systems,
vol. 17, no. 2, pp. 15:1–15:33, Apr. 2012.
[27] M. K. Yoon, K. Kim, S. Lee, W. W. Ro, and M. Annavaram,
“Virtual Thread: Maximizing Thread-Level Parallelism beyond
GPU Scheduling Limit,” in 43rd ACM/IEEE Intl. Symp. on Computer
Architecture (ISCA), June 2016, pp. 609–621.
[28] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally,
E. Lindholm, and K. Skadron, “A Hierarchical Thread Scheduler
and Register File for Energy-Efficient Throughput Processors,”
ACM Trans. Comput. Syst., vol. 30, no. 2, pp. 8:1–8:38, Apr. 2012.
[Online]. Available: http://doi.acm.org/10.1145/2166879.2166882
[29] C. M. Wittenbrink, E. Kilgariff, and A. Prabhu, “Fermi GF100 GPU
Architecture,” IEEE Micro, vol. 31, no. 2, pp. 50–59, March 2011.
[30] Y. Zhang and J. D. Owens, “A Quantitative Performance Analysis
Model for GPU Architectures,” in 17th IEEE Intl. Symp. High
Performance Computer Architecture, Feb 2011, pp. 382–393.
[31] E. Lee and T. Parks, “Dataflow Process Networks,” Proceedings of
the IEEE, vol. 83, no. 5, pp. 773–801, May 1995.
[32] S. Sriram and S. Bhattacharyya, Embedded Multiprocessors: Schedul-
ing and Synchronization. Marcel Dekker, Inc., 2000.
[33] S. Bhattacharyya, P. Murthy, and E. Lee, “Synthesis of Embedded
Software from Synchronous Dataflow Specifications,” Journal of
VLSI Signal Processing, vol. 21, no. 2, pp. 151–166, June 1999.
[34] E. A. Lee and D. G. Messerschmitt, “Synchronous Data flow,”
Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, Sept 1987.
[35] S. S. Bhattacharyya, E. F. Deprettere, R. Leupers, and J. Takala,
Handbook of Signal Processing Systems. Springer, 2010, vol. 1.
[36] E. A. Lee and D. G. Messerschmitt, “Static Scheduling of Syn-
chronous Data Flow Programs for Digital Signal Processing,” IEEE
Trans. Computers, vol. C-36, no. 1, pp. 24–35, Jan 1987.
[37] S. S. Bhattacharyya, E. A. Lee, and P. K. Murthy, Software Synthesis
from Dataflow Graphs. Norwell, MA, USA: Kluwer Academic
Publishers, 1996.
[38] L. G. Barbero and J. S. Thompson, “Fixing the Complexity of
the Sphere Decoder for MIMO Detection,” IEEE Trans. Wireless
Communications, vol. 7, no. 6, pp. 2131–2142, June 2008.
[39] Min Li et al., “Selective Spanning with Fast Enumeration: A
Near Maximum-Likelihood MIMO Detector Designed for Parallel
Programmable Baseband Architectures,” in 2008 IEEE Intl. Conf.
on Communications (ICC), May 2008, pp. 737–741.
[40] Khronos OpenCL Working Group, “The OpenCL C Specification,”
Sept 2015.
[41] X. Chu, J. McAllister, and R. Woods, “A Pipeline Interleaved Het-
erogeneous SIMD Soft Processor Array Architecture for MIMO-
OFDM Detection,” in 7th Intl. Symp. on Reconfigurable Computing:
Architectures, Tools and Applications, Mar. 2011, pp. 133–144.
[42] C. Zheng, J. McAllister, and Y. Wu, “A Kernel Interleaved Schedul-
ing Method for Streaming Applications on Soft-core Vector Pro-
cessors,” in 2011 International Conference on Embedded Computer
Systems (SAMOS), July 2011, pp. 278–285.
[43] C. E. LaForest and J. G. Steffan, “Maximizing Speed and Density
of Tiled FPGA Overlays via Partitioning,” in 2013 Intl. Conf. on
Field-Programmable Technology (FPT), Dec 2013, pp. 238–245.
[44] L. Ma, K. Dickson, J. McAllister, and J. McCanny, “QR
Decomposition-Based Matrix Inversion for High Performance Em-
bedded MIMO Receivers,” IEEE Trans. Signal Processing, vol. 59,
no. 4, pp. 1858–1867, April 2011.
[45] Y. Wu, J. McAllister, and P. Wang, “High Performance Real-
time Pre-Processing for Fixed-Complexity Sphere Decoder,” in
2013 IEEE Global Conference on Signal and Information Processing
(GlobalSIP), Dec 2013, pp. 1250–1253.
Yun Wu received the B.S. degree in Electronic
and Information Engineering from Dalian Na-
tionalities University, Dalian, China, in 2003; the
M.Sc. in Circuits and System from Hunan Uni-
versity, Changsha, China, in 2007; the M.Sc. in
Radio Frequency Communication Systems from
University of Southampton, Southampton, U.K., in
2008; and the Ph.D. degree in Electronic Engineering
from Queen’s University Belfast, Belfast,
U.K. in 2014. Before that, he was a wireless
algorithm engineer with ZTE Shanghai R & D
Centre from 2008 to 2010. He is currently a Research Fellow with
Queen’s University Belfast. His current research interests include signal
processing, processor architecture synthesis, and energy proportional
computing.
John McAllister (S’02-M’04-SM’12) received
the Ph.D. degree in Electronic Engineering from
Queen’s University Belfast, U.K., in 2004. He
is currently a member of academic staff in the
Institute of Electronics, Communications and In-
formation Technology (ECIT) at the same in-
stitution. His research interests are in custom
stream computing systems, including domain-
specific languages, compilers and computing ar-
chitectures. He is a cofounder of Analytics En-
gines Ltd., a member of the Advisory Board to
the IEEE Technical Committee on Design and Implementation of Sig-
nal Processing Systems (DISPS), a former Associate Editor of IEEE
Transactions on Signal Processing, Chief Editor of the IEEE Signal
Processing Society Resource Center and a member of the editorial
board of Springer’s Journal of Signal Processing Systems.
