Harmonic-summing Module of SKA on FPGA--Optimising the Irregular Memory
  Accesses by Wang, Haomiao et al.
Haomiao Wang, Prabu Thiagaraj, and Oliver Sinnen
Abstract—The Square Kilometre Array (SKA), which will be
the world’s largest radio telescope, will enhance and boost a large
number of science projects, including the search for pulsars. The
frequency domain acceleration search is an efficient approach
to search for binary pulsars. A significant part of it is the
harmonic-summing module, which is the research subject of this
paper. Most of the operations in the harmonic-summing module
are relatively cheap operations for FPGAs. The main challenge
is the large number of point accesses to off-chip memory
which are not consecutive but irregular. Although harmonic-
summing alone might not be targeted for FPGA acceleration,
it is a part of the pulsar search pipeline that contains many
other compute-intensive modules, which are efficiently executed
on FPGA. Hence having the harmonic-summing also on the
FPGA will avoid off-board communication, which could destroy
other acceleration benefits. Two types of harmonic-summing
approaches are investigated in this paper: 1) storing intermediate
data in off-chip memory and 2) processing the input signals
directly without storing. For the second type, two approaches of
caching data are proposed and evaluated: 1) preloading points
that are frequently touched and 2) preloading all necessary points
that are used to generate a chunk of output points. OpenCL is
adopted to implement the proposed approaches. In an extensive
experimental evaluation, the same OpenCL kernel codes are
evaluated on FPGA boards and GPU cards. Among the
proposed preloading methods, preloading all necessary points
while reordering the input signals is faster than all the
other methods. While in raw performance a single FPGA board
cannot compete with a GPU, in terms of energy dissipation
the GPU consumes up to 2.6 times more energy than the FPGAs
when executing the same NDRange kernels.
Index Terms—Irregular memory access optimisation,
harmonic-summing, field programmable gate arrays (FPGA),
OpenCL.
I. INTRODUCTION
The Square Kilometre Array (SKA) is built to extend our understanding of the Universe and ourselves, and it will
be the world’s largest radio telescope array when finished [6].
A number of key science goals are targeted by the SKA [2]
project and one of them is strong-field tests of gravity using
pulsars, which are highly magnetized rotating neutron stars.
Since most pulsar signals are weaker than white noise and their
details are unknown, a number of techniques are employed
to search for different types of pulsars over a wide range
of searching scales (e.g. sky coverage, frequency, bandwidth,
and integration time) [21]. Given the enormous signal rate of the
SKA, it is extremely difficult to complete the searching tasks
within the given time period using general-purpose processors
alone.
Taking the high-performance computing ability, power
consumption, and flexibility into consideration, the field-
programmable gate array (FPGA) seems to be an ideal device
to accelerate the Central Signal Processor (CSP) of the SKA
project. The SKA stage 1 (SKA1) project plans to adopt high-
end FPGAs to accelerate part of the function modules in
the CSP regarding pulsar search such as frequency domain
acceleration search. However, the general hardware descrip-
tion language (HDL, e.g. Verilog HDL and VHDL) based
development process makes it hard to achieve fast prototyping
design and design space exploration. Additionally, developers
of an internationally distributed team, including non-hardware
experts, would need to understand the hardware structure of
FPGA devices.
To address these problems, we employ a high-level approach
that uses a high-level language instead of HDL. In
this paper, we take a pulsar search module called harmonic-
summing as a case study. The harmonic-summing module is a
part of the Fourier domain acceleration search (FDAS) mod-
ule that contains a compute-intensive module. The compute-
intensive module performs very well on FPGAs [20], so in
order to avoid unnecessary data transfer, it is important to
have the harmonic-summing module on the FPGA. The main
feature of the harmonic-summing module is that the access
to the input signals is irregular and this affects the hardware
accelerator in achieving high-performance computing. We
investigate a number of methods and architectures to optimise
the irregular memory accesses of the harmonic-summing
module and use Open Computing Language (OpenCL) for
the prototype design. The main contributions are as follows:
1) Reducing Intermediate Data Accesses: The straight-
forward and proposed approaches for the harmonic-
summing module are investigated and designed. The
proposed approach reduces the total number of off-chip
memory accesses by changing the processing order and
storing the intermediate data in on-chip memory.
2) Preloading Data: Based on the proposed approach,
two preloading data methods are investigated by: 1)
loading points with high touch frequency and 2) loading
necessary points that are needed to calculate a block
of points. Both these methods preload data to on-chip
memory before processing and further reduce the total
amount of off-chip memory accesses.
3) Reordering Input: Based on the preloading necessary
points method, we investigate reordering the input points
to improve the memory access speed. After reordering
the input, the data needed for each work group are from
consecutive addresses and they can be streamed to the
FPGA from off-chip memory.
4) Across Device Evaluation: The proposed methods are
implemented on FPGA using OpenCL. We adjust and
port the implementations to different devices and evaluate
them on different series of FPGAs, general-purpose graphics
processing units (GPGPUs), and CPUs for comparison.
arXiv:1805.12258v2 [cs.DC] 29 Jun 2018
The rest of the paper is organized as follows. Section II reviews
related work on optimising irregular memory accesses and
high-level tools for FPGA development. Section III provides
the details of the harmonic-summing module, and Section IV
presents the proposed methods and the design goals. In Section V,
two approaches of OpenCL-based designs of the harmonic-summing
module are proposed and compared. Section VI presents and
discusses the evaluation results. Finally, the conclusions are
given in Section VII.
II. RELATED WORK
A. Irregular Memory Access Optimisation
In hardware-based high-performance computing, the effi-
ciency of data transfer between the accelerator and the memory
system is an important factor. A large amount of research
has been done to improve the memory access efficiency for
accelerators such as GPGPUs [15] and FPGAs.
For some applications, the accesses to memory are irregular,
which limits the performance of the accelerator, and this problem
has been well studied [11]. For most applications with irregular
memory access, there are mainly two types of optimisation
techniques: 1) reducing the number of accesses and 2) scheduling
as many accesses as possible in parallel [28]. These two methods can
be applied to various platforms such as FPGAs [29]. For some
graph computation problems in [27], an on-chip distributed
off-chip shard memory architecture with high-performance
shuffle network was investigated and the intermediate buffers
were reduced to save off-chip memory bandwidth. In [30],
prefetching is researched to reduce the number of memory
accesses. In [14], an irregular stream buffer (ISB) that tar-
gets the irregular sequences of temporally correlated memory
references is proposed. Data and computation reordering is
employed in [17] to improve memory hierarchy performance.
Besides these approaches, many compilers focus on irregular
memory access, such as ROCCC [10] for FPGAs and sparse
matrix-vector multiplication (SMVM) [8].
Regarding the optimisation of the two-dimensional harmonic
summing calculations done in this research, we are not aware
of any prior work that investigates it on a large scale,
especially in the context of acceleration devices such as GPUs
and FPGAs.
B. FPGA as an Accelerator
High-end FPGAs have been widely adopted as accelerators
in many commercial applications and research areas such as
high-frequency trading [16] and cloud computing [7]. Because
of the outstanding energy-efficient performance over GPGPU
devices, Microsoft applied high-end FPGAs in their data
centers [20], and FPGA-based accelerators appear in other
cloud data centers as well [24]. Several science projects in
different areas, such as SKA [25], CERN [23], and DNA
sequence analysis [12], employ a large number of
FPGA devices for acceleration, connected through the PCI
Express (PCIe) bus or Ethernet cable.
Besides these, FPGAs are widely employed in radio astron-
omy projects as accelerators. In [5], hundreds of Xilinx Virtex-
4 FPGAs are used to implement the correlator of the SKAMP
project. In [22], FPGA platforms are employed to accelerate
digital channelised receivers. The Berkeley CASPER group,
MeerKAT, and NRAO released an FPGA-based acceleration
device for implementing the FX correlator for radio telescope
array [18].
C. High-level Synthesis
One barrier to employing FPGAs as accelerators is the
usual HDL-based development process, which makes
the time-to-market longer than for GPGPUs and multi-core
processors. To address this, many high-level synthesis tools have
been released. Two primary FPGA vendors, Intel and Xilinx,
provide developers with their high-level tools. Intel released
several high-level development tools such as the high-level
synthesis (HLS) compiler, which supports C++ based development,
and the FPGA SDK for OpenCL, which supports OpenCL [3], [4]
based development. Xilinx provides two main tools: 1) high-
level synthesis of C/C++ and SystemC and 2) SDAccel that
supports OpenCL. Besides these official tools, there are several
open source high-level synthesis tools such as LegUp [1].
OpenCL for Intel FPGA: OpenCL is an open parallel pro-
gramming language. The main advantage of OpenCL is that it
is compatible with different types of acceleration devices such
as GPGPUs, CPUs, and FPGAs. Intel released a dedicated
FPGA development tool using OpenCL, which is called Intel
FPGA SDK for OpenCL (AOCL). An FPGA-based OpenCL
application is divided into two parts: the host programs and
the kernels for devices. The host program is written in C/C++.
Before launching an OpenCL kernel in the host program, the
arguments of it are set, and all necessary data are sent to the
off-chip memory of FPGA devices through the PCIe bus. OpenCL
classifies memory into two main types, local memory and global
memory, with the understanding that access to local memory
is faster than global, but sharing is limited. For OpenCL on an
FPGA, local memory corresponds to on-chip memory such as
BRAM and global memory corresponds to off-chip memory
such as DDR3 on the FPGA board. In this research, the
Intel FPGAs are adopted to implement the harmonic-summing
module, so the optimisation syntax and techniques that are
mentioned in this paper are targeting Intel FPGAs and AOCL.
Single Work-item and NDRange Kernels: NDRange is an
important attribute of an OpenCL kernel that represents its
index space. Based on OpenCL 1.0 [9], it contains three integer
values, where each value specifies the extent of the index
space in a dimension. The FPGA-based OpenCL kernels can
be classified into two types based on their NDRange sizes:
single work-item kernel and NDRange kernel. For the single
work-item kernel, the NDRange size is (1,1,1), which means
the index space in all three dimensions is one, resulting
in a single work-group with one work-item. The kernel code
of a single work-item kernel looks more like C/C++ code
than that of NDRange kernels. However, some OpenCL-based
optimisation attributes are included within the kernel code.
Generally, there is at least one loop in a single work-item
kernel and the number of iterations equals the global work
size of the NDRange kernel. The ideal case for the single work-
item kernel is to launch one iteration of the outermost loop
per clock cycle, which is called loop pipelining. For
NDRange kernels, the NDRange size is larger than (1,1,1) and
the overall work size has to be divided into small work groups,
each of which processes a small group of data. The
size of an NDRange kernel is normally related to the details
of a task. For example, if a two-dimensional NDRange kernel
is designed to process an image with 256 points (16 × 16), its
global work size can be set as (16,16,1). In our research, both
kernel types are studied and the combination of single
work-item and NDRange kernels is investigated.
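The relationship between the two kernel types can be illustrated with a small host-side emulation. This is only a sketch of the index space, not OpenCL code: the function name `ndrange_work_items` and the work-group size (4,4,1) are assumptions chosen for illustration, and the global size is assumed to be a multiple of the work-group size in every dimension.

```python
from itertools import product

def ndrange_work_items(global_size, local_size):
    """Enumerate (group_id, local_id, global_id) for every work-item.

    Assumes each global dimension is a multiple of the local dimension.
    """
    groups = [g // l for g, l in zip(global_size, local_size)]
    items = []
    for group_id in product(*(range(n) for n in groups)):
        for local_id in product(*(range(n) for n in local_size)):
            global_id = tuple(
                g * l + i for g, l, i in zip(group_id, local_size, local_id)
            )
            items.append((group_id, local_id, global_id))
    return items

GLOBAL_SIZE = (16, 16, 1)  # from the text: one work-item per image point
LOCAL_SIZE = (4, 4, 1)     # assumed work-group size, for illustration only

items = ndrange_work_items(GLOBAL_SIZE, LOCAL_SIZE)
num_groups = len({group_id for group_id, _, _ in items})

# A single work-item kernel covers the same index space with one loop
# whose trip count equals the global work size.
single_work_item_iterations = GLOBAL_SIZE[0] * GLOBAL_SIZE[1] * GLOBAL_SIZE[2]
```

With these sizes the NDRange kernel consists of 16 work groups of 16 work-items each, and the equivalent single work-item kernel runs a 256-iteration loop.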
III. HARMONIC-SUMMING MODULE
The harmonic-summing module is a part of the frequency
domain acceleration search (FDAS) module [21] of the pulsar
search engine (PSS), whose details are depicted in Figure 1.
In the FT-based convolution module, the overlap-save algo-
rithm [19] is employed to process the input signals in the
frequency domain and the outputs are divided into chunks,
each several thousand values long. The final output from the
FT-based convolution module, which is also the input of
the harmonic-summing module, is called filter-output-plane
(FOP). The size of the FOP equals NtempNchan, with Ntemp
being the number of templates in the FT-based convolution
and Nchan being the number of channels. In essence,
each template is an FIR filter, and the FIR filter lengths of
different templates are different. The total Ntemp templates
can be divided into three groups, group one (index 1 to
(Ntemp − 1)/2), group two (index -1 to −(Ntemp − 1)/2), and
the (unfiltered) input signals (index 0, one-tap FIR filter). The
number of channels is the same as the length of input array of
the FT convolution module. In our previous work [26], the FT
convolution module has been implemented in an FPGA using
OpenCL. Based on current requirements, an FOP contains
85 × 2^21 single precision floating-point (SPF) points, that is,
Ntemp = 85 and Nchan = 2^21.
The harmonic-summing module (Figure 1, right) consists
of two parts: 1) harmonic plane calculation and 2)
candidate detection. The task of the harmonic plane calculation
part is to generate Nhp harmonic planes using the FOP. First,
the FOP is stretched by an integer k to obtain the kth stretch
plane SPk , which is computed separately for template group
one and template group two by generating Nhp stretch planes
with Equation (1)
$$SP_k(i, j) = SP_1\left(\left\lfloor \frac{i}{k} \right\rfloor, \left\lfloor \frac{j}{k} \right\rfloor\right), \quad k = 2, 3, \ldots, N_{hp} \tag{1}$$
where SP1 is the FOP and the ranges of i and j are [−(Ntemp−
1)/2, (Ntemp − 1)/2] and [0, Nchan − 1], respectively. After
all Nhp − 1 stretch planes are generated, the FOP and these
Nhp−1 stretch planes are progressively added to form Nhp−1
harmonic planes (HPs):
$$HP_k(i, j) = HP_{k-1}(i, j) + SP_k(i, j), \quad k = 2, 3, \ldots, N_{hp}. \tag{2}$$
Table I
SPECIFICATION OF THE HARMONIC-SUMMING MODULE
Parameter | Description | Value
Ntemp | Number of templates of the FOP (rows) | 85
Nchan | Number of channels of the FOP (columns) | 2^21
Nhp | Total number of harmonic planes | 8
Ncand | Number of candidates per harmonic plane | 200
tlimit | Computation time limit of each DM trial | 88 ms
It can be seen that the size of each HPk is the same as that
of the FOP.
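Equations (1) and (2) can be unrolled to list the FOP points that contribute to a single harmonic-plane point. The sketch below is illustrative only: the helper names are hypothetical, the sample coordinates (row 40, channel 1000) are arbitrary, and it assumes that the floor in Equation (1) is applied to |i| with the sign restored, which is one reading of "computed separately for template group one and template group two".

```python
def stretch_index(i, j, k):
    """Index into SP_1 (the FOP) for point (i, j) of SP_k, after Equation (1).

    Assumption: the floor acts on |i| so that both template groups shrink
    towards row 0 (the unfiltered input).
    """
    sign = -1 if i < 0 else 1
    return sign * (abs(i) // k), j // k

def harmonic_sources(i, j, k):
    """FOP points summed into HP_k(i, j) when Equation (2) is unrolled."""
    return [stretch_index(i, j, m) for m in range(1, k + 1)]

# Eight scattered FOP points feed one point of HP_8.
sources = harmonic_sources(40, 1000, 8)
```

For this sample point the eight source points sit in eight different FOP columns, which is the origin of the irregular access pattern discussed in the rest of the paper.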
For the candidate detection, a threshold-detection logic is
applied and the potential candidates are recorded. For each
harmonic plane, a threshold array (TA) that contains Ntemp
thresholds is employed and one threshold corresponds to one
row (Nchan points) of the harmonic plane. For example,
TA(k, i) is the threshold for the ith row of HPk . In each
harmonic plane, at most Ncand candidates are stored and the
maximum size of the candidate list for each dispersion
measure (DM) trial is NhpNcand . The output from the candidate
detection part is the candidate list and it will be sent to
the Fourier Domain Candidates optimisation (FDAO) module
for further processing (which is part of the post-processing in
Figure 1).
Each candidate in the candidate list contains four elements:
periodicity, orbits, pulse-width, and signal power of each
detection. We use {Fi, Hi, Bi, Ai} to represent the ith candidate
in the candidate list, where Fi, Hi, and Bi are the indices
of the filter, harmonic plane, and bin, and Ai is the amplitude
of the ith candidate. To minimize the data transfer bandwidth
and save off-chip memory, we use two 32-bit numbers, CLi1
and CLi2, to store the ith candidate. For Fi, Hi, and Bi,
the minimum number of bits required is ⌈log2(∗)⌉, so the
data sizes for them are 7 bits (85), 3 bits (8), and 21 bits (2^21),
respectively. These three fields can be combined to
form one 32-bit integer CLi1 by using the following formula:
$$CL_{i1} = F_i \times 2^{24} + H_i \times 2^{21} + B_i .$$
In terms of the amplitude (spectral power), since the default
data type from the FT-based convolution module is SPF
(Single Precision Floating point, 32-bit), the same data type
is maintained after the harmonic-summing calculation, which
means CLi2 = Ai .
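The packing of one candidate into two 32-bit words can be sketched as follows. The function names are hypothetical, and the sketch assumes that the filter index is stored as a non-negative value (0 to 84) and the harmonic-plane index as a zero-based value (0 to 7); the text does not state how the signed template index is mapped.

```python
import struct

F_SHIFT, H_SHIFT = 24, 21      # from CL_i1 = F * 2^24 + H * 2^21 + B
H_MASK = (1 << 3) - 1          # 3 bits for the harmonic-plane index
B_MASK = (1 << 21) - 1         # 21 bits for the bin index

def pack_candidate(f, h, b, amplitude):
    """Return (CL_i1, CL_i2) as two unsigned 32-bit words."""
    cl1 = (f << F_SHIFT) | (h << H_SHIFT) | b
    # CL_i2 is the amplitude itself; here it is shown as raw float32 bits.
    cl2 = struct.unpack("<I", struct.pack("<f", amplitude))[0]
    return cl1, cl2

def unpack_candidate(cl1, cl2):
    """Recover (F, H, B, A) from the two 32-bit words."""
    f = cl1 >> F_SHIFT
    h = (cl1 >> H_SHIFT) & H_MASK
    b = cl1 & B_MASK
    amplitude = struct.unpack("<f", struct.pack("<I", cl2))[0]
    return f, h, b, amplitude

cl1, cl2 = pack_candidate(84, 7, 2**21 - 1, 1.5)
restored = unpack_candidate(cl1, cl2)
```

Because the three index fields occupy 7 + 3 + 21 = 31 bits, the largest packed value still fits in a 32-bit word.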
The details of the harmonic summing algorithm are given in
Algorithm 1, where the order of the three for loops can be
interchanged. The basic parameters of the harmonic-summing
module are shown in Table I.
IV. PROPOSED METHODS
The main problem for the harmonic summing module is the
irregular memory accesses of the harmonic plane calculation
part and it limits the data transfer efficiency. We consider two
types of memory access optimisation methods while designing
the harmonic plane calculation part: 1) increasing the off-
chip memory bandwidth and 2) reducing the number of off-
chip memory accesses. Based on the number of processed
harmonic planes at a time, two approaches are investigated:
the SINGLEHP method (processing a single harmonic plane
at a time) and the MULTIPLEHP method (processing multiple
harmonic planes at a time).
Figure 1. The processing flow of the Pulsar Search Engine (PSS) of SKA1-MID CSP system and the details of harmonic-summing module. [Diagram: over 2,000 beams are formed at 4,096 channels/beam; the signals of each beam are de-dispersed for 6,000 DMs; each PSS engine consists of pre-processing, the single pulse search, time domain acceleration or FDAS modules (FT convolution module and harmonic-summing module), and post-processing. The FOP (85 × 2^21 points) is stretched by k and added to HPk−1 to form HPk (HP2 to HP8); the detection logic compares the harmonic planes with the threshold arrays (TAs) and fills the candidate list.]
Algorithm 1 General Harmonic-summing Algorithm (SINGLEHP)
SP1 ← (filter-output-plane)
CL ← 0 {initialize the detection output}
for k = 1 to Nhp do
  for i = −(Ntemp − 1)/2 to (Ntemp − 1)/2 do
    for j = 0 to Nchan − 1 do
      SPk(i, j) ← stretch(SP1, k, i, j) {generate the value in stretched plane}
      HPk(i, j) ← HPk−1(i, j) + SPk(i, j) {based on the stretched plane, generate the value in harmonic plane}
      CL ← append detection[HPk(i, j), TA(k, i)] {threshold-detection logic to identify valid peak signals}
    end for
  end for
end for
Candidate List ← CL
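A minimal software sketch of the SINGLEHP loop order of Algorithm 1 is given below, for a toy FOP rather than the real problem size. The function names, the toy data and the thresholds are all made up for illustration; the sketch also assumes HP0 = 0 (so that HP1 equals the FOP), a strict "greater than" comparison, and a floor applied to |i| in the stretch.

```python
def stretch_index(i, j, k):
    """Equation (1); assumption: floor on |i| with the sign restored."""
    sign = -1 if i < 0 else 1
    return sign * (abs(i) // k), j // k

def single_hp(fop, n_hp, thresholds):
    """Process one harmonic plane at a time, keeping the whole plane stored.

    fop[r][j] holds template i = r - half; thresholds[k - 1][r] is TA(k, i).
    Returns candidates as (filter, harmonic plane, bin, amplitude) tuples.
    """
    n_temp, n_chan = len(fop), len(fop[0])
    half = (n_temp - 1) // 2
    hp = [[0] * n_chan for _ in range(n_temp)]  # HP_0 = 0, updated in place
    candidates = []
    for k in range(1, n_hp + 1):
        for i in range(-half, half + 1):
            for j in range(n_chan):
                si, sj = stretch_index(i, j, k)
                hp[i + half][j] += fop[si + half][sj]  # Equation (2)
                if hp[i + half][j] > thresholds[k - 1][i + half]:
                    candidates.append((i, k, j, hp[i + half][j]))
    return candidates

# Toy 3 x 4 FOP (templates -1, 0, 1) and two harmonic planes.
fop = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
thresholds = [[10.5] * 3, [14.5] * 3]
candidates = single_hp(fop, n_hp=2, thresholds=thresholds)
```

The stored plane `hp` plays the role of the harmonic plane kept in off-chip memory: every point is read and written once per harmonic plane.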
A. Design Goals
In designing the harmonic-summing module, we mainly
consider the latency and energy dissipation of calculating the
harmonic planes and detecting the candidates using high-end
FPGAs. There are two major factors that affect the execution
latency and energy dissipation: 1) parallelisation capacity of an
FPGA and 2) data transfer rate between the FPGA and off-chip
memory. Most operations in the harmonic-summing module
are floating-point operations; however, they are inexpensive
functions such as floating-point additions and comparisons
with a constant. High-end FPGAs have hundreds
of DSP blocks (to implement floating-point operations) and
hundreds of thousands of logic elements that can handle these
operations effectively.
In the harmonic plane calculation, the accesses to off-chip
memory are not consecutive but irregular due to the
index calculations in Equation (1). Ideally, the data transfer
bandwidth of a design equals the device’s theoretical
maximum bandwidth; however, this cannot be achieved easily
in the harmonic-summing module. Taking a small FOP
(64 × 2^12) as an example, the touching frequencies of the
FOP elements in calculating 7 harmonic planes are depicted
in Figure 2. For example, 8 points from different positions are
needed to calculate point (1000, 60) of HP8. In this figure,
the deep red area covers only 1.7% of the whole FOP,
yet each value in it is touched 204 times. The high
touching frequency area (zoomed-in area) has a size of 16 × 2^10, and the
touches in this area account for 73.4% of the overall
touching times. It can be seen that the distribution of touching
frequency, and hence the memory access pattern during the
calculation, is very complex. In this paper, we investigate a general
design of the harmonic-summing module with low latency, by
optimising memory accesses.
Figure 2. Touching frequency of each point in the FOP and an example of
calculating point (1000, 60) of HP8 (axes: frequency channel and FIR filters)
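The touching frequency of each FOP point can be obtained by counting how often Equation (1) maps to it. The following sketch does this for a toy plane; the function names and the toy size (5 templates, 16 channels, 4 harmonic planes) are assumptions for illustration, as is the choice of applying the floor to |i|. It does not attempt to reproduce the numbers quoted for Figure 2.

```python
from collections import Counter

def stretch_index(i, j, k):
    """Equation (1); assumption: floor on |i| with the sign restored."""
    sign = -1 if i < 0 else 1
    return sign * (abs(i) // k), j // k

def touch_counts(n_temp, n_chan, n_hp):
    """Count how many times each FOP point is read over all harmonic planes."""
    half = (n_temp - 1) // 2
    counts = Counter()
    for k in range(1, n_hp + 1):
        for i in range(-half, half + 1):
            for j in range(n_chan):
                counts[stretch_index(i, j, k)] += 1
    return counts

counts = touch_counts(n_temp=5, n_chan=16, n_hp=4)
total_touches = sum(counts.values())
hottest = max(counts.values())
```

Even in this toy case the reads concentrate near row 0 and the low channel indices, while most other points are read only a few times.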
The input to the harmonic-summing module, which is the
FOP, is up to 710 MBytes under current requirements, which
exceeds the on-chip memory size of high-end FPGAs and
other types of processors. Though the FOP can be transferred
to FPGAs through the PCIe bus or an Ethernet cable in practice, it
is assumed in this research that the FOP is stored in off-chip
memory before the harmonic-summing module starts processing (for
example, as the output of the FT convolution also executed on
the FPGA device [26]).
In terms of the candidate detection of the harmonic sum-
ming module, when there are more than Ncand candidates
detected in one harmonic plane, the strategy of sorting candi-
dates has not yet been settled in the PSS sub-project. There
are a number of plausible strategies to select candidates, such
as storing the largest Ncand candidates or first/last Ncand
candidates.
Due to the lack of a settled requirement, and with the
assumption that there are usually fewer than Ncand candidates
(which can be tuned by increasing the thresholds), we investigate
the method of storing the last Ncand candidates. The FPGA
device needs to go through all the candidates from each
harmonic plane. When there are at most Ncand
candidates in one harmonic plane, all the candidates will be
recorded. Note that, depending on the method and processing order of
the harmonic plane calculation, the recorded last Ncand candidates
might vary between different approaches.
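The "store the last Ncand candidates" policy behaves like a fixed-length shift register: once it is full, every new candidate pushes out the oldest one. A minimal sketch, with a hypothetical function name and made-up data, is:

```python
from collections import deque

def detect_last(values, threshold, n_cand):
    """Keep only the last n_cand values that exceed the threshold."""
    kept = deque(maxlen=n_cand)  # oldest entry is dropped when full
    for index, value in enumerate(values):
        if value > threshold:
            kept.append((index, value))
    return list(kept)

row = [1, 5, 2, 7, 9, 3, 8]
last_two = detect_last(row, threshold=4, n_cand=2)
all_kept = detect_last(row, threshold=4, n_cand=10)
```

Since the surviving entries depend on the order in which points are visited, two implementations with different loop orders can return different lists whenever more than Ncand points exceed the threshold.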
B. SINGLEHP
In Algorithm 1, the processor needs to
calculate all harmonic planes individually. The SINGLEHP
method is a straightforward implementation of the harmonic-
summing module.
To calculate the points of the kth harmonic plane HPk (k ≥
2), points of the FOP and the (k−1)th harmonic plane HPk−1 are
required. During processing, each generated point of HPk is
compared with a threshold. Since the FOP size, NtempNchan,
exceeds the on-chip memory of FPGA devices, the FOP and
other generated harmonic planes have to be stored in the off-
chip memory of FPGA device.
The accesses of loading points from HPk−1 and storing
points to HPk are both in-order and to consecutive addresses.
However, the addresses of the points loaded from the FOP cannot
be calculated as a simple offset, so the data cannot be streamed
between off-chip memory and the device while processing. To
optimise the memory accesses of the SINGLEHP method, the
overall pipeline can be parallelised to increase the off-chip
memory bandwidth, and we use that in our implementation.
C. MULTIPLEHP
In the harmonic summing module, only the candidates are
recorded for further processing, so it is unnecessary to store the
data of all harmonic planes in off-chip memory. To reduce
the number of off-chip memory accesses, we investigate a
method that avoids storing any harmonic plane except the
FOP. If the points of the same index in multiple or all Nhp
harmonic planes can be generated in parallel, these points can
be discarded directly after candidate detection. Without storing
the generated points back to off-chip memory, the number of
overall off-chip memory accesses can be halved.
By reordering the three for loops in Algorithm 1, we
obtain Algorithm 2, where the innermost for loop can be
parallelised and the points are discarded after detection.
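The effect of the reordered loops can be sketched in software: with the harmonic index innermost, a running sum replaces the stored harmonic planes. As before, this is an illustrative sketch with hypothetical names and toy data; it assumes HP0 = 0, a strict "greater than" comparison, and a floor applied to |i| in the stretch.

```python
def stretch_index(i, j, k):
    """Equation (1); assumption: floor on |i| with the sign restored."""
    sign = -1 if i < 0 else 1
    return sign * (abs(i) // k), j // k

def multiple_hp(fop, n_hp, thresholds):
    """Generate the same (i, j) point of all harmonic planes back to back.

    Only the FOP is read; no harmonic plane is written back.
    """
    n_temp, n_chan = len(fop), len(fop[0])
    half = (n_temp - 1) // 2
    candidates = []
    for j in range(n_chan):
        for i in range(-half, half + 1):
            running = 0  # HP_k(i, j) for the current k, discarded afterwards
            for k in range(1, n_hp + 1):
                si, sj = stretch_index(i, j, k)
                running += fop[si + half][sj]
                if running > thresholds[k - 1][i + half]:
                    candidates.append((i, k, j, running))
    return candidates

fop = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
thresholds = [[10.5] * 3, [14.5] * 3]
candidates = multiple_hp(fop, n_hp=2, thresholds=thresholds)
```

On this toy input the detected candidates are the same five as with the plane-by-plane order, but they are emitted in a different sequence, which is why a "last Ncand" policy can select different candidates under the two methods.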
To optimise the MULTIPLEHP method by reducing the off-
chip memory accesses, part of the FOP can be loaded before
calculating a chunk of points of all harmonic planes. Two
alternatives are proposed and based on the loaded data, they
can be distinguished as 1) high touching frequency (by loading
as many points as possible in the high touching frequency area
of the FOP) and 2) necessary points (by loading points that are
needed to calculate a chunk of points in all harmonic planes
such as one or several columns of all harmonic planes). For
the second method, an FOP reordering method is proposed
below to increase data transfer efficiency. Each of these three
MULTIPLEHP-based methods adopts at least one type of
memory access optimisation method and the details of them
are as follows.
Algorithm 2 Multiple Harmonic-summing planes based method (MULTIPLEHP)
SP1 ← (filter-output-plane)
CL ← 0 {initialize the detection output}
for j = 0 to Nchan − 1 do
  for i = −(Ntemp − 1)/2 to (Ntemp − 1)/2 do
    for k = 1 to Nhp do
      SPk(i, j) ← stretch(SP1, k, i, j) {generate the value in stretched plane}
      HPk(i, j) ← HPk−1(i, j) + SPk(i, j) {based on the stretched plane, generate the value in harmonic plane}
      CL ← detection[HPk(i, j), TA(k, i)] {threshold-detection logic to identify valid peak signals}
    end for
    discard[HP1(i, j), HP2(i, j), ..., HPNhp(i, j)] {discard the point of same index after detection}
  end for
end for
Candidate List ← CL
Figure 3. Relationship between the size of preloaded points and the reduced
number of global memory accesses (x-axis: percentage over the FOP size (%); y-axis: percentage over total touch times (%))
Preloading Points with High Touching Frequency: To create
and threshold test 8 consecutive harmonic planes, each point
with the highest touching frequency needs to be loaded over
200 times. If most points with high touching frequency can be
preloaded, a large number of load operations can be saved. To
further reduce the amount of off-chip memory accesses, part
or all of the high touching frequency points can be preloaded
in on-chip memory. We use MULTIPLEHP-H to represent the
preloading points with high touching frequency method.
The main factor of the MULTIPLEHP-H method is the num-
ber of preloaded high touching points NMultipleHP−H−preld .
If the points in the FOP are sorted by touching times, the
relationship between the percentage of the FOP size and the
percentage of overall touching times is depicted in Figure 3.
It can be seen that 2.2% of the points in the FOP account for about 50%
of the overall touching times and 25% of the points account for 90%
of the overall touching times.
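The curve in Figure 3 corresponds to a cumulative sum over the touch counts sorted in descending order. A minimal sketch of that calculation is shown below; the function name and the six touch counts are made up for illustration and are not measured data.

```python
def preload_coverage(touch_counts, n_preload):
    """Fraction of all FOP reads served on-chip when the n_preload most
    frequently touched points are preloaded."""
    ordered = sorted(touch_counts, reverse=True)
    return sum(ordered[:n_preload]) / sum(ordered)

# Hypothetical touch counts of six FOP points.
example_counts = [42, 42, 20, 10, 5, 1]
coverage_top2 = preload_coverage(example_counts, n_preload=2)
```

In this made-up example, preloading one third of the points already serves 70% of the reads, which mirrors the strongly skewed distribution described above.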
Loading Necessary Points: For the Naïve MULTIPLEHP
method, calculating one point with the same index in all Nhp
harmonic planes requires loading at most Nhp points from
the FOP. However, calculating a chunk of points in all Nhp
harmonic planes needs fewer than Nhp times the number of
points in the chunk. Take one column with Ntemp points as an example:
it needs Ntemp points for HP1, but only 2⌈(Ntemp − 1)/(2 · 2)⌉ + 1
points for HP2, 2⌈(Ntemp − 1)/(2 · 3)⌉ + 1 points for HP3, and so
on. To save loading operations, the harmonic plane calculation
task can be decomposed into a number of work-groups. The
task of each work-group is to generate a number of columns
NMultipleHP−N−col of all Nhp harmonic planes, where each
column has Ntemp points. In a pipeline, the loading part of
a work-group can overlap with the computing part of the
previous work-group. We use MULTIPLEHP-N to represent
the loading necessary points method.
Figure 4. Relationship between columns per work group and the number of
points per column for the MULTIPLEHP-N method (x-axis: number of columns per work group NMultipleHP−N−col; y-axis: average needed data per column)
For the MULTIPLEHP-N method, NMultipleHP−N−col is an
important factor that affects how many off-chip memory
accesses are saved. Assuming the task for each work-group is to generate
one column (NMultipleHP−N−col = 1) of Nhp harmonic planes
(NhpNtemp points in total), the maximum needed data is
$$2 \sum_{i=1}^{N_{hp}} \left\lceil \frac{N_{temp} - 1}{2i} \right\rceil + N_{hp}$$
points instead of NhpNtemp points. When NMultipleHP−N−col is
larger than one, more off-chip memory accesses can be
saved. However, the amount of data needed for the
same harmonic plane varies based on the column index.
For example, if the work-group generates eight columns
(NMultipleHP−N−col = 8) of all harmonic planes, the data
needed to generate the 3rd harmonic plane are 3 to 4 columns
of the FOP. To guarantee that the amount of data loaded for
each work-group is a constant (which is needed for efficient
pipelining), the maximum number of points for each harmonic
plane is chosen.
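As a worked instance of the one-column expression, substituting the values of Table I (Ntemp = 85, Nhp = 8) gives the following; the arithmetic is ours and only illustrates the formula.

$$2 \sum_{i=1}^{8} \left\lceil \frac{84}{2i} \right\rceil + 8 = 2\,(42 + 21 + 14 + 11 + 9 + 7 + 6 + 6) + 8 = 2 \cdot 116 + 8 = 240$$

This compares with $N_{hp} N_{temp} = 8 \cdot 85 = 680$ loads when every point of the column is fetched separately for each harmonic plane, that is, roughly 35% of the original number of loads.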
In this case, when NMultipleHP−N−col is specified,
the needed number of columns for each harmonic plane
can be listed and then the number of needed points for
NMultipleHP−N−col columns can be calculated. Based on the
number of overall needed points, the average needed points
per column for a work-group is plotted in Figure 4. It
can be seen that the average amount drops fast while
NMultipleHP−N−col is smaller than 16 (green dotted
line) and decreases only slightly towards 64 (red dotted line) as
NMultipleHP−N−col increases. Besides this, the larger
NMultipleHP−N−col is, the more on-chip memory space
it needs, so the on-chip memory size limits how large
NMultipleHP−N−col can be. As a
consequence, it is unnecessary to assign tens or hundreds of
columns to a work group.
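The average number of needed points per column can be estimated with a short calculation. The sketch below is our reading of the procedure, not the authors' code: it takes the number of FOP rows per harmonic plane from the one-column expression, and it assumes that the maximum number of FOP columns touched by harmonic plane k is the worst-case count of distinct values of ⌊j/k⌋ over a window of consecutive columns. All function names are hypothetical.

```python
def rows_needed(n_temp, k):
    """FOP rows touched by harmonic plane k: 2 * ceil((n_temp - 1) / (2k)) + 1."""
    return 2 * -(-(n_temp - 1) // (2 * k)) + 1

def max_cols_needed(n_col, k):
    """Worst-case number of distinct floor(j / k) over n_col consecutive j."""
    return (n_col + k - 2) // k + 1

def avg_points_per_column(n_temp, n_hp, n_col):
    """Average number of FOP points loaded per generated column."""
    total = sum(
        rows_needed(n_temp, k) * max_cols_needed(n_col, k)
        for k in range(1, n_hp + 1)
    )
    return total / n_col

one_column = avg_points_per_column(85, 8, 1)
sixteen_columns = avg_points_per_column(85, 8, 16)
```

Under these assumptions the estimate is 240 points for a single column and about 141.4 points per column for 16 columns, close to the value of about 141.5 that the text quotes for 16 columns.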
Reordering the FOP: Compared with the Naïve MULTIPLEHP
method, the MULTIPLEHP-N method can further reduce
the total amount of off-chip memory accesses. However,
the points needed for each work-group are from at least Nhp
blocks in the FOP and they are from non-consecutive addresses.
Thus, the points for each work-group cannot be streamed
between off-chip memory and the FPGA device.
To optimise the off-chip memory bandwidth of the
MULTIPLEHP-N method, we propose the MULTIPLEHP-
R method which reorders the FOP to form the reordered
FOP (RFOP). After reordering, the needed points to calculate
NMultipleHP−R−col columns of all harmonic planes are from
consecutive addresses that can be streamed to the FPGA
while processing. However, the size of the reordered FOP is
larger than the standard FOP size. Theoretically, the number
of rows in the RFOP is increased from Ntemp to
the average needed points per column in Figure 4. Taking
NMultipleHP−R−col = 16 as an example, the smallest average
needed points per column is 141.5, which makes the size of the RFOP
at least 1.66 times the original FOP size. It can
be seen that the larger the NMultipleHP−R−col , the smaller the
relative size of the RFOP. The details of RFOP generation and
optimisation are discussed in Section V-B. The latency of the extra
data transfer and the FOP reordering has to be considered in the
evaluation of the MULTIPLEHP-R method.
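The idea of the reordering can be sketched as gathering, for each work-group, all FOP points it needs into one contiguous block. This is a simplified illustration and not the RFOP layout of the paper: the function names are hypothetical, duplicates within a block are removed, and the blocks are allowed to differ in length, whereas the paper pads every work-group to a constant size.

```python
def stretch_index(i, j, k):
    """Equation (1); assumption: floor on |i| with the sign restored."""
    sign = -1 if i < 0 else 1
    return sign * (abs(i) // k), j // k

def reorder_fop(fop, n_hp, n_col):
    """Return one contiguous block of FOP values per group of n_col columns."""
    n_temp, n_chan = len(fop), len(fop[0])
    half = (n_temp - 1) // 2
    blocks = []
    for j0 in range(0, n_chan, n_col):
        needed = {}  # dict keeps first-use order and removes duplicates
        for k in range(1, n_hp + 1):
            for i in range(-half, half + 1):
                for j in range(j0, min(j0 + n_col, n_chan)):
                    needed.setdefault(stretch_index(i, j, k), None)
        blocks.append([fop[si + half][sj] for si, sj in needed])
    return blocks

fop = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
blocks = reorder_fop(fop, n_hp=2, n_col=2)
rfop = [value for block in blocks for value in block]
```

Each block can be read with consecutive addresses, at the cost of duplicating points that several work-groups share: the toy RFOP holds 13 values although the FOP has only 12.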
V. ARCHITECTURE AND OPTIMISATION
In this section, we investigate the architecture of the
proposed methods and employ OpenCL as the high-level
language, whose kernels can be executed on both FPGAs
and GPUs. That said, since the goal is to evaluate
FPGA performance, the optimisation techniques and syntax
are dedicated to FPGAs.
A. SINGLEHP kernel
The basic structure of the SINGLEHP kernel while process-
ing the kth harmonic plane HPk is depicted in Figure 5, where
Nparal is the parallelisation factor that is restricted by global
memory (off-chip memory in this research) bandwidth (GMB)
and the logic resources of the FPGA. One optimisation goal for
the SINGLEHP kernel is to find the maximum parallelisation
factor Nparal−max that leads to a required GMB which equals
the physical off-chip memory bandwidth of a specific device.
The FOP, HPk−1, HPk , candidate list and TA are all stored
in global memory before launching the kernel. When the
kernel is launched, Nparal points from HPk−1 and ⌈Nhp/k⌉
points from FOP are loaded per clock cycle. These points
are summed, according to Equation (1) and Equation (2), to
calculate Nparal points of HPk . The generated Nparal points
are compared with the corresponding thresholds and detected
candidates are saved in a shift register or local memory (on-
chip memory in this research) of length Ncand , until all
FOP points have been processed. Then these Nparal points
overwrite the values at the same address of HPk−1.
In OpenCL, both single work-item and NDRange kernel
types can be adopted to implement the SINGLEHP kernel.
Figure 5. Architecture of the SINGLEHP kernel
To parallelise Nparal points in a single work-item kernel,
we partially unroll the outermost loop by a factor of Nparal
(#pragma unroll Nparal). Before partially unrolling the
outermost loop, the innermost loops are completely unrolled
to achieve loop pipelining.
For the NDRange kernel, kernel vectorisation
(num_simd_work_items(Nparal)) and compute unit
replication (num_compute_units(Nparal)) techniques can
be employed to parallelise the kernel. Note that the detected
Ncand candidates might differ between the vectorized and
replicated kernels if the threshold has been triggered more
than Ncand times. As only the last Ncand candidates are
stored, different parallelisation schemes result in different
execution orders and hence different candidates.
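The order dependence of the stored candidates can be illustrated with a small host-side model. This is only a sketch: the shift register is modelled as a bounded deque, and the interleaved visiting order is an assumed example of what a replicated kernel might produce, not the order generated by the actual hardware.

```python
from collections import deque


def keep_last(detections, n_cand):
    """Model a length-n_cand shift register: only the newest n_cand entries survive."""
    register = deque(maxlen=n_cand)
    for point in detections:
        register.append(point)
    return list(register)


hits = [0, 1, 2, 3, 4, 5]            # indices of points above the threshold

# Order 1: points are visited strictly in index order.
in_order = keep_last(hits, n_cand=2)

# Order 2 (assumed): two units split the range in halves and alternate.
interleaved = [0, 3, 1, 4, 2, 5]
alternating = keep_last(interleaved, n_cand=2)

print(in_order, alternating)
```

When the number of detections does not exceed Ncand, both orders keep the same set of candidates; the difference only appears once older entries are shifted out.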
The SINGLEHP kernel can be implemented as a generic
kernel that needs to be launched Nhp times (multiple launches)
or as a specific kernel that only needs to be launched once
(single launch) to generate the candidate lists of Nhp harmonic
planes. The overhead of launching a kernel, such as setting
kernel arguments, affects the overall latency, especially
when the kernel execution latency is short, so the kernel
launch time is an important factor for the SINGLEHP kernel.
The multiple-launch kernel provides more flexibility than the
single-launch SINGLEHP kernel, as it can be used for any
harmonic plane configuration. Both single-launch and
multiple-launch kernels are evaluated in Section VI.
B. MULTIPLEHP Methods based Kernels
Although parallelising the SINGLEHP kernel can shorten
kernel execution latency by increasing GMB, the total amount
of global memory accesses (GMA) is not reduced. The main
advantage of the MULTIPLEHP method is the reduction of the
required GMA by processing multiple harmonic planes at the
same time. A number of optimisation techniques are investi-
gated for the MULTIPLEHP-based methods in the following.
Naïve MULTIPLEHP: The Naïve MULTIPLEHP kernel calculates
Nparal points of all Nhp harmonic planes with the
same index, where Nparal is the parallelisation factor. The
architecture of the Naïve MULTIPLEHP kernel is shown in
Figure 6, where the operations in the red dotted rectangle have
to be replicated Nparal times to process Nparal points of all
harmonic planes. In OpenCL, this is implemented as a single
work-item kernel, and #pragma unroll Nparal is added
before the main for loop in the kernel code.
Figure 6. Architecture of the Naïve MULTIPLEHP kernel (Single work-item)
The FOP is stored in global memory and Nhp points
((i, j), (⌊i/2⌋, ⌊j/2⌋), ..., (⌊i/Nhp⌋, ⌊j/Nhp⌋)) are loaded in
parallel to generate point (i, j) of all Nhp harmonic planes.
Then these Nhp points are compared with the corresponding
thresholds, stored as constant memory. Nhp independent arrays
of size Ncand , one corresponding to each harmonic plane,
are employed to store the candidates. Both local memory and
shift register can be adopted to implement Nhp arrays and the
performance difference is evaluated in Section VI. After all
Nhp harmonic planes have been processed, the Nhp candidate
arrays are sent back to global memory. Because the loading
accesses to the global memory are irregular, a high memory
stall percentage will impede the kernel from achieving a high
performance.
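The per-point work of the Naïve MULTIPLEHP kernel can be sketched on the host as follows. This is a minimal illustration under stated assumptions: indices are 0-based, each harmonic plane is taken to be the previous plane plus the FOP point at the floor-divided index (the cumulative form described for the SINGLEHP kernel), and the helper names, the tiny FOP and the thresholds are hypothetical.

```python
def harmonic_sums(fop, i, j, n_hp):
    """Return [HP1(i,j), ..., HPn_hp(i,j)] using floor-divided FOP indices."""
    sums = []
    acc = 0.0
    for k in range(1, n_hp + 1):
        acc += fop[i // k][j // k]   # one irregular FOP access per harmonic
        sums.append(acc)
    return sums


def detect(sums, thresholds):
    """Return the 1-based harmonic indices whose sum exceeds its threshold."""
    return [k + 1 for k, (s, t) in enumerate(zip(sums, thresholds)) if s > t]


fop = [[1.0, 2.0, 3.0, 4.0],
       [5.0, 6.0, 7.0, 8.0]]
sums = harmonic_sums(fop, i=1, j=3, n_hp=3)
hits = detect(sums, thresholds=[9.0, 9.0, 11.0])
print(sums, hits)
```

Each output index triggers Nhp reads scattered over the FOP, which is the irregular access pattern that causes the memory stalls mentioned above.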
MULTIPLEHP-H: The MULTIPLEHP-H kernel builds on
the Naïve MULTIPLEHP kernel, which is a single work-
item kernel. MULTIPLEHP-H is however split into two
parts, preloading and computing. The NMultipleHP−H−preld
preloaded points that can be seen as constant cache memory
are loaded into a FIFO at runtime. In processing one FOP, there
is no overlap between the prefetching and computing parts.
The available local memory of the FPGA and the number
of points with high touching frequency affect the performance
of the MULTIPLEHP-H kernel. If the FOP size is comparable
to the available local memory, most of the points with high
touching frequency can be preloaded and most of the global
memory accesses can be avoided. However, if the number of
such points is significantly larger than the local memory
size, the device cannot hold most of these important points.
In addition, using a large proportion of the on-chip memory
might lower the kernel frequency. In this case, it is
necessary to search for
the suitable NMultipleHP−H−preld for the target FPGA by
testing a range of preloading data sizes. The relationship
between the NMultipleHP−H−preld and the kernel performance
is investigated in Section VI.
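How much a given preload size can help may be estimated offline by counting how often each FOP point is touched. The sketch below assumes the floor-divided access pattern (⌊i/k⌋, ⌊j/k⌋) with 0-based indices; the function names and the tiny 2 × 4 FOP are hypothetical, and the sketch only models the access counts, not the FIFO or the kernel itself.

```python
from collections import Counter


def touch_counts(n_rows, n_cols, n_hp):
    """Count how often each FOP point is read when all harmonic planes are built."""
    counts = Counter()
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(1, n_hp + 1):
                counts[(i // k, j // k)] += 1
    return counts


def preload_coverage(counts, n_preload):
    """Fraction of all accesses served by the n_preload most-touched points."""
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(n_preload))
    return covered / total


counts = touch_counts(n_rows=2, n_cols=4, n_hp=2)
coverage = preload_coverage(counts, n_preload=2)
print(coverage)
```

Sweeping n_preload in such a model gives a first idea of the range of preload sizes worth compiling for a target device.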
MULTIPLEHP-N: The MULTIPLEHP-N method is a
memory-access-saving method, as discussed in Section IV-C.
It decomposes the overall task into a number of work-groups,
and the task of each work-group is to process
NMultipleHP−N−col columns of all harmonic planes. The
NDRange kernel type is employed, and the preloading part
of a work-group overlaps with the computing part of the
previous work-group. In an NDRange kernel, different
work-groups do not share local memory, and it is inefficient
to save candidates in global memory during processing.
Therefore, a hybrid kernel that contains both a single
work-item kernel and an NDRange kernel is employed to
implement the preloading necessary points kernel
(MULTIPLEHP-N).
The relationship between the work group size of
the NDRange kernel and the execution latency is stud-
ied next. The task of each work-group is to gener-
ate NMultipleHP−N−col columns of all harmonic planes,
which contains NMultipleHP−N−colNchan points. For each
work-group, NhpNMultipleHP−N−colNchan points are stored
in local memory using the OpenCL barrier technique
(barrier(CLK_LOCAL_MEM_FENCE)). A number of points in
these NhpNMultipleHP−N−colNchan points are from the same
index in the FOP and they only need to be loaded once.
The NDRange harmonic plane calculation kernel is connected
with the single work-item candidate detection kernel
through OpenCL channels, which are FIFO buffers in essence.
An OpenCL channel is an effective way to transfer data
between kernels without touching global memory.
The candidate detection part is the same as that of the Naïve
MULTIPLEHP and MULTIPLEHP-H kernels.
MULTIPLEHP-R: The MULTIPLEHP-R kernel is based
on the MULTIPLEHP-N kernel and the main difference is
the order of the data for each work-group. After reordering,
the points needed for a work-group are from consecutive
addresses.
The total amount of needed data for a work-group
(Ntotal/wg) is the product of the average needed data per
column and the number of columns per work-group
(NMultipleHP−R−col) (see also Figure 4). To achieve stream
mode in global memory access, the number of loaded points
per clock cycle (Nlpoints/cc) has to be an integer constant,
which makes the product of Nlpoints/cc and the work-group
size (Sworkgroup) usually larger than Ntotal/wg and never
smaller:
$$N_{total/wg} \le N_{lpoints/cc}\, S_{workgroup}.$$
If the two differ, the input array for each work-group has
to be padded with dummy values at the end. The relationship
between Nlpoints/cc and NMultipleHP−R−col is shown in
Table II, where Npoints/wi is the number of points of all
harmonic planes processed per work-item. The value in
brackets (×∗) represents the ratio of the total loaded points
to the FOP size:
$$\frac{N_{lpoints/cc}\, S_{workgroup}\, N_{workgroup}}{N_{chan}\, N_{temp}},$$
where Nworkgroup is the total number of work-groups. We
use MULTIPLEHP-R-(NMultipleHP−R−col, Npoints/wi) to
represent kernel MULTIPLEHP-R with the specified settings.
The larger NMultipleHP−R−col and Npoints/wi are, the less
data needs to be loaded from global memory. Because of the
physical limitation, if the bandwidth needed to load
Nlpoints/cc points per clock cycle exceeds the total off-chip
memory bandwidth of the device, the performance will not
increase, and such kernels were not implemented.
Table II
NUMBER OF LOADED POINTS PER CLOCK CYCLE Nlpoints/cc FOR
DIFFERENT Npoints/wi AND NMultipleHP−R−col COMBINATIONS FOR
GENERAL (×) AND OPTIMISED (√) MULTIPLEHP-R (NUMBER IN (×∗) SHOWS
TOTAL LOADED POINTS IN RELATION TO FOP SIZE)
Columns | Opt. | Npoints/wi: ×1 | ×2 | ×4 | ×8
1 | × | 3 (×3) | 6 (×3) | 12 (×3) | 23 (×2.9)
1 | √ | 4 (×4) | 8 (×4) | 16 (×4) | 32 (×4)
4 | × | 2 (×2) | 4 (×2) | 8 (×2) | 15 (×1.9)
4 | √ | 2 (×2) | 4 (×2) | 8 (×2) | 16 (×2)
16 | × | 2 (×2) | 4 (×2) | 7 (×1.8) | 13 (×1.6)
16 | √ | 2 (×2) | 4 (×2) | 8 (×2) | 16 (×2)
64 | × | 2 (×2) | 4 (×2) | 7 (×1.8) | 13 (×1.6)
64 | √ | 2 (×2) | 4 (×2) | 8 (×2) | 16 (×2)
Figure 7. Needed points for one work group of MULTIPLEHP-R: the input
array reordered without optimising Nlpoints/cc (top) and the optimised
input array when Nlpoints/cc is a power of 2 (bottom)
It is clear that Nlpoints/cc, NMultipleHP−R−col, and
Npoints/wi are the three main parameters for kernel
MULTIPLEHP-R and they have to be balanced to achieve good
performance. Using the AOCL compiler, it becomes apparent
that powers of 2 for Nlpoints/cc result in more efficient
implementations than other numbers. Hence, to make the value
of Nlpoints/cc a power of 2, more data might need to be
loaded for each work-group. Taking the kernel
MULTIPLEHP-R-(8, 8) as an example, the value of Nlpoints/cc
is 13, and it has to be increased to the nearest higher power
of 2, which is 16. Since the number of points loaded per
work-group is Nlpoints/ccSworkgroup, the increase of
Nlpoints/cc leads to an increase in loading operations (as
can be seen in the example in Figure 7). The optimised
Nlpoints/cc values, each the lowest power of 2 greater than
or equal to the corresponding value without optimisation, are
listed in Table II. When NMultipleHP−R−col ≥ 4, the total
loaded data of the optimised kernels is twice the FOP size
(value in brackets).
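The power-of-2 rounding and its effect on the loaded data volume can be checked with a short sketch. It assumes that the bracketed ratio in Table II reduces to Nlpoints/cc divided by Npoints/wi, which holds if every work-item produces Npoints/wi output points so that the work-items together cover the FOP exactly once; this reading is an inference from the table values, and the function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of 2 that is >= n (n is a positive integer)."""
    power = 1
    while power < n:
        power *= 2
    return power


def load_ratio(n_lpoints_cc, n_points_wi):
    # Assumption: total loaded points / FOP size reduces to this quotient.
    return n_lpoints_cc / n_points_wi


# General (non-optimised) Nlpoints/cc values from Table II for Npoints/wi = 8,
# keyed by the number of columns per work-group.
general = {1: 23, 4: 15, 16: 13, 64: 13}
for columns, n_lp in general.items():
    optimised = next_power_of_two(n_lp)
    print(columns, n_lp, load_ratio(n_lp, 8), optimised, load_ratio(optimised, 8))
```

Under this assumption, rounding 13 up to 16 raises the loaded volume from about 1.6 to 2 times the FOP size, in line with the corresponding Table II entries.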
Taking NMultipleHP−R−col = 16 and the half FOP as an
example, the input array needed for one work-group is depicted
in Figure 7, where SPi represents the points needed to form
the ith stretch plane and ’PADDED’ marks the dummy points
padded at the end of each array. It can be seen that when
Nlpoints/cc is optimised to a power of 2, more points need to
be loaded during processing.
For the hybrid kernels (combining NDRange
and single work-item kernels) MULTIPLEHP-H,
MULTIPLEHP-N, and MULTIPLEHP-R, adding
attributes num_simd_work_items(Nparal) or
num_compute_units(Nparal) can only parallelise the
NDRange part but not the single work-item part. To vectorize
the hybrid kernel and make it execute in a single instruction
multiple data (SIMD) fashion [13], it has to be parallelised
manually in the kernel code.
C. Comparison
The main challenge for FPGA devices in efficiently imple-
menting the harmonic-summing module is the global memory
bandwidth and the number of global memory accesses. In this
section, we analyze the GMA of the kernels discussed above;
the GMB is evaluated in Section VI.
For the SINGLEHP method, to process Nhp harmonic
planes, the minimum number of FOP point accesses is
$$\sum_{i=1}^{N_{hp}} \left\lceil \frac{N_{chan}}{i} \right\rceil \left\lceil \frac{N_{temp}}{i} \right\rceil.$$
Except when calculating HP1, the points loaded from the FOP
have to be summed with the points of harmonic plane HPk,
and the generated points of the next harmonic plane HPk+1
are then stored. The numbers of load and store operations of
this part are both (Nhp − 1)NtempNchan. Hence, the minimum
amount of global memory accesses for the SINGLEHP method,
GMASingleHP−min, is the sum of these accesses:
$$GMA_{SingleHP-min} = \sum_{i=1}^{N_{hp}} \left\lceil \frac{N_{chan}}{i} \right\rceil \left\lceil \frac{N_{temp}}{i} \right\rceil + 2(N_{hp} - 1)\, N_{temp}\, N_{chan}.$$
In the MULTIPLEHP methods, only the FOP and candidates
need to be stored in global memory and the number of store
operations to global memory for all MULTIPLEHP kernels is 0.
For the Naïve MULTIPLEHP kernel, at most Nhp points from
FOP need to be loaded to calculate one point of the same
index in all Nhp harmonic planes. In this case, the maximum
number of memory accesses is GMANaïve−MultipleHP−max =
NhpNtempNchan.
For the preloading high touching frequency method
MULTIPLEHP-H, the GMA depends on NMultipleHP−H−preld.
The maximum amount of memory accesses is
GMAMultipleHP−H−max = NhpNtempNchan − NMultipleHP−H−preld,
and the minimum is reached when the whole FOP is stored in
local memory, which means GMAMultipleHP−H−min = NtempNchan.
For the preloading necessary points method MULTIPLEHP-N
and the reordered FOP method MULTIPLEHP-R,
GMAMultipleHP−N and GMAMultipleHP−R are both multiples
of the FOP size NtempNchan. Table III summarizes the GMA
of the different kernels. C0, C1 and C2 are all constants no
less than 1. The range of C0 is
[1, NhpNtempNchan/NMultipleHP−H−preld]. For C1 and C2,
C1 ≤ C2 and, in Table II, C2 = 2 when
NMultipleHP−R−col ≥ 4.
The number of load accesses of the SINGLEHP kernel
alone is larger than the overall number of point accesses
(store + load) of each MULTIPLEHP-based kernel.
This is the major advantage of the MULTIPLEHP method
over the SINGLEHP method.
Table III
NUMBER OF ACCESSES TO AND FROM GLOBAL MEMORY
Kernels | GMA (Store) | GMA (Load)
SINGLEHP | (Nhp − 1) × NtempNchan | NhpNtempNchan + $\sum_{i=2}^{N_{hp}} \lceil N_{chan}/i \rceil \lceil N_{temp}/i \rceil$
Naïve MULTIPLEHP | 0 | NhpNtempNchan
MULTIPLEHP-H | 0 | NhpNtempNchan − C0NMultipleHP−H−preld
MULTIPLEHP-N | 0 | C1NtempNchan
MULTIPLEHP-R | 0 | C2NtempNchan
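The access counts for the SINGLEHP and Naïve MULTIPLEHP kernels can be evaluated numerically with a short sketch. The function names are hypothetical and the small parameter values are chosen only so that the result can be checked by hand; the formulas are the SINGLEHP minimum and the Naïve MULTIPLEHP maximum given in this section.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def gma_singlehp_min(n_hp, n_temp, n_chan):
    """Minimum global memory accesses (loads + stores) of the SINGLEHP method."""
    fop_loads = sum(ceil_div(n_chan, i) * ceil_div(n_temp, i)
                    for i in range(1, n_hp + 1))
    hp_loads_and_stores = 2 * (n_hp - 1) * n_temp * n_chan
    return fop_loads + hp_loads_and_stores


def gma_naive_multiplehp_max(n_hp, n_temp, n_chan):
    """Maximum global memory accesses (loads only) of the Naive MULTIPLEHP kernel."""
    return n_hp * n_temp * n_chan


small = gma_singlehp_min(n_hp=2, n_temp=4, n_chan=6)
naive = gma_naive_multiplehp_max(n_hp=2, n_temp=4, n_chan=6)
print(small, naive)
```

The same functions can be called with the experiment-scale values (for example eight harmonic planes) to compare the methods before compiling any kernel.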
VI. EXPERIMENTAL EVALUATION
To experimentally evaluate the harmonic-summing module,
the straightforward SINGLEHP method and the proposed
MULTIPLEHP-based methods are evaluated in this section.
The FPGA-based harmonic-summing kernels are assessed ac-
cording to their resource usage, execution latency, and energy
dissipation. Additionally, we compare those results to latency
and energy dissipation of the kernels implemented on GPU
and multicore CPUs.
A. Experimental Setup
Four different devices are employed to evaluate the perfor-
mance of the proposed designs on CPU, GPU, and FPGAs.
Two types of Intel FPGAs (Stratix V, referred to as S5, and
Arria 10, referred to as A10) are compared with one mid-range
AMD R7 GPU, referred to as R7, and a general Intel i7 CPU,
referred to as I7. The specifications of these platforms are
given in Table IV. The FPGA and GPU cards are connected
to the host processor through the PCIe bus.
All FPGA-targeting OpenCL kernels are compiled using
AOCL version 16.0.0.222 and GPU-targeting kernels are
compiled using AMD APP SDK version 3.0. For the CPU
platform, the C code, which is based on the same kernel code,
is compiled using GCC, using OpenMP for parallelisation.
Since the top half (from row 1 to (Ntemp − 1)/2) and the
bottom half (from row (1 − Ntemp)/2 to −1) of the FOP are
independent for the harmonic-summing module and require
identical processing (Section III), we investigate the
performance, in terms of execution latency and energy
dissipation, of half of the FOP as specified in Table I,
whose size is 42 × 2^21. The size of the candidate list is 200.
B. Resource Usage
Because the harmonic-summing module is not a compute-
intensive application, the DSP block utilization of all imple-
mentations is less than 5%. We discuss the logic utilization,
RAM blocks utilization, and kernel frequency in this section.
SINGLEHP: A series of SINGLEHP kernels with different
parallelisation factors Nparal are evaluated. These kernels are
employed to generate eight harmonic planes of the half FOP.
All these kernels are NDRange kernels and the work-group
sizes are set to 256. The usage of logic cells and RAM blocks
of these kernels is given in Figure 8, where ’S’ and ’M’
represent single launch and multiple launches, and ’V’ and ’R’
represent kernel vectorization and replication. The candidate
detection part is included, and local memory is employed
to store the candidates during processing. When Nparal = 1,
the kernel is not parallelised, and neither vectorization nor
replication is employed.
Table IV
SPECIFICATIONS OF CPU, GPU AND FPGA PLATFORMS
Device | Terasic DE5-Net (S5) | Nallatech 385A (A10) | Sapphire Nitro R7 370 (R7) | Intel CPU Host (I7)
Hardware | Intel Stratix V 5SGXA7 | Intel Arria 10 GX1150 | AMD Radeon R7 370 | Intel Core i7-6700K
Technology | 28nm | 20nm | 28nm | 14nm
Compute resource | 622,000 LEs, 256 DSP blocks | 1,506,000 LEs, 1,518 DSP blocks | 1,024 Stream Processors (16 Compute Units) | 8 Processors (4 Cores)
On-chip memory size | 50Mb | 53Mb | - | 64Mb
Off-chip memory size | 2 x 2GB DDR3 | 2 x 4GB DDR3 | 4GB GDDR5 | 64GB DDR4
Memory interface width | 2 x 64-bit | 2 x 72-bit | 256-bit | -
Max clock frequency | 600MHz | 1.5GHz | 985MHz | 4.2GHz
Max power consumption | - | 75W | 150W | -
Figure 8. Logic utilization and RAM block usage of SINGLEHP kernels on
A10
It can be seen that the usage of both resources increases
as Nparal increases. These trends are similar to those
observed on S5. The kernel frequency drops as the resource
usage increases across all kernels. Taking the kernel
SINGLEHP-(M,V) on A10 as an example, its frequency
decreases from 266.9MHz at Nparal = 1 to 236.8MHz at
Nparal = 16.
MULTIPLEHP: In terms of the MULTIPLEHP designs,
Naïve MULTIPLEHP, MULTIPLEHP-H, MULTIPLEHP-N, and
MULTIPLEHP-R (Section V) are evaluated.
Naïve MULTIPLEHP and MULTIPLEHP-H: MULTIPLEHP-H
is based on Naïve MULTIPLEHP, and the main difference is
that it preloads a block of data before calculating. The
resource usage of these kernels is plotted over the preloaded
data size in Figure 9. The points for
NMultipleHP−H−preld = 0 correspond to Naïve MULTIPLEHP.
The logic utilization is not affected by the increase of
NMultipleHP−H−preld; the RAM block utilization, however,
increases. The kernel frequency is around 210MHz for
S5-based implementations and 220MHz for A10-based
implementations.
MULTIPLEHP-N and MULTIPLEHP-R: In contrast to
MULTIPLEHP-H, kernels MULTIPLEHP-N and MULTIPLEHP-R
do not depend heavily on the local memory size.
MULTIPLEHP-R is based on MULTIPLEHP-N; however, it does
not need to load points from different locations.
For MULTIPLEHP-N, different column numbers
(NMultipleHP−N−col) are evaluated, and the results are listed
in Table V. With increasing NMultipleHP−N−col, both logic
cell and RAM block utilization increase. For most of the
kernels, the kernel frequency decreases as
NMultipleHP−N−col increases.
Figure 9. Resource usage of MULTIPLEHP-H with different
NMultipleHP−H−preld on S5 and A10
Table V
RESOURCE USAGE AND KERNEL FREQUENCY OF MULTIPLEHP-N WITH
CANDIDATE DETECTION ON A10
Columns | 1 | 2 | 4 | 6 | 8
Logic cells | 17% | 25% | 25% | 29% | 30%
RAM blocks | 19% | 44% | 49% | 56% | 69%
Frequency (MHz) | 276.54 | 193.38 | 171.11 | 148.67 | 165.48
Latency (ms) | 328.0 | 469.0 | 530.1 | 610.1 | 548.1
Regarding MULTIPLEHP-R, to arrange the data for each
work-group into a consecutive address area, the half FOP
is reordered into a half RFOP (Section IV-C) in the host
program using memcpy(). The reordering latency on the
employed host is 87.8ms. The performance of two variants
of the MULTIPLEHP-R kernel (generating 16 and 64 columns
of all eight harmonic planes per work-group) is evaluated
and shown in Table VI. Four different points per work-item
values Npoints/wi (1, 2, 4, and 8) are tested in this
research. Since the values of Nlpoints/cc for Npoints/wi = 1
and Npoints/wi = 2 are already powers of 2, we focus on the
other two conditions (Npoints/wi = 4 and Npoints/wi = 8),
and the resource usage of the general and the optimised
implementations with these values is given in Table VI. For
the optimised implementations, the values of Nlpoints/cc are
powers of 2, and this costs fewer logic cells than the general
implementations. Since more points are loaded per clock cycle,
the optimised implementations consume more RAM blocks.
In addition, the kernel frequency of the optimised
implementations is higher than that of the general
implementations.
Figure 10. GMBs and execution latency of SINGLEHP on A10
Since NMultipleHP−R−col, Npoints/wi, and Nlpoints/cc are the
three main parameters that affect the performance of
MULTIPLEHP-R, we investigate the effect of changing these
parameters, but here without candidate detection; hence the
values in Table VI are only for the NDRange part. We do this
because, after combining with the candidate detection, some
of the MULTIPLEHP-R kernels such as MULTIPLEHP-R-(64, 8)
cannot be compiled because of the limited resources, and we
wanted to explore the influence of the parameters over a wide
range. We employ the MULTIPLEHP-R-(16, 4) kernel with
candidate detection, which can be compiled on both S5 and
A10, to compare with the other methods. As FPGA technology
advances and the number of on-chip logic cells and RAM
blocks increases, the values of NMultipleHP−R−col and
Npoints/wi can be raised, and the execution latency is likely
to be shorter than that achieved in Table VI.
C. Latency Evaluation
Harmonic Plane Calculation on FPGA: To find the suitable
design for a specific device, we evaluate the overall execu-
tion latency of the harmonic-summing module, including the
harmonic plane calculation and the candidate detection. The
points of the 8th harmonic plane are compared with the result
of a Matlab implementation to verify the correctness of the
harmonic plane calculation in the different designs.
SINGLEHP: The used GMBs and execution latencies of
the SINGLEHP kernel with various Nparal in Section VI-B
are shown in Figure 10. As Nparal increases the GMBs of
all SINGLEHP kernels increase, however, not all execution
latencies are decreased.
For the two multiple launches (’M’) kernels SINGLEHP-
(M,V) and SINGLEHP-(M, R), the launching overhead is
hundreds of times smaller than the kernel execution latency
and hence negligible. For the two single launch kernels SIN-
GLEHP-(S,V) and SINGLEHP-(S, R), the performance stops
increasing when Nparal is larger than 8. When Nparal =
8, kernel SINGLEHP-(S, R) performs better than other ker-
nels and the SINGLEHP-(M, R) kernel performs best when
Nparal = 16, which is about 7.5 times faster than SINGLEHP-
(M, ) with Nparal = 1.
Figure 11. Execution latencies of the MULTIPLEHP-H kernels with different
sizes of preloaded points
Naïve MULTIPLEHP: The execution latency of kernel
Naïve MULTIPLEHP on S5 is over one second (1,210ms);
however, the same kernel performs better on A10, with a
latency of less than 400ms. The main reason is that the kernel
frequency achieved on A10 is over two times higher than that
on S5. This might be caused by the board support packages
(BSPs) provided by the different vendors.
MULTIPLEHP-H: The relationship between the number
of preloaded data points NMultipleHP−H−preld and the
execution latency of MULTIPLEHP-H is investigated on both
S5 and A10. The half FOP is transposed and then processed
row by row (each row has (Ntemp − 1)/2 points). The
execution latencies of these kernels are depicted in Figure 11.
It is clear that the execution latency does not have a linear
relationship with NMultipleHP−H−preld, and it might even
increase as NMultipleHP−H−preld gets larger. Unfortunately,
even the largest NMultipleHP−H−preld (5 × 2^15) used in the
experiments, which is limited by the available FPGA
resources, covers only 4.7% of all memory accesses. The best
performance on both S5 and A10 is achieved by kernel
MULTIPLEHP-H-(5 × 2^13). Some improvement could be made
by overlapping the loading of the high touching frequency
points with the computing part, but not a substantial one.
Overall, MULTIPLEHP-H does not gain performance if the
local memory is not large enough to hold most of the
points with high touching frequency.
MULTIPLEHP-N: For kernel MULTIPLEHP-N, the necessary
data for each work-group come from non-consecutive
addresses, which prevents the loading section from achieving
the streaming mode that is crucial to fully use the available
theoretical bandwidth. Although executing more columns per
work-group can reduce GMA, increasing NMultipleHP−N−col
does not improve performance. The execution latency of
MULTIPLEHP-N is affected by the kernel frequency, which is
given in Table V. We employ the kernel with the shortest
execution latency, MULTIPLEHP-N-(1), to compare with the
other methods.
MULTIPLEHP-R: The kernel execution latency and the
global memory occupancy during execution on A10 are given
in Table VI as well. When the value of Nlpoints/cc is a power
of 2, the execution latency decreases as Npoints/wi increases.
Although the occupancy of loading operations drops as
Npoints/wi increases, it decreases more slowly for the
optimised kernels than for the general kernels. The fastest
variant of MULTIPLEHP-R in Table VI is MULTIPLEHP-R-(64, 8).
Adding the candidate detection increases the execution
latency; however, the kernel remains faster than the other
MULTIPLEHP kernels. For kernel MULTIPLEHP-R-(16, 4), the
execution latencies on a single S5 and a single A10 are
143ms and 120ms, respectively.
Table VI
RESOURCE USAGE AND EXECUTION LATENCY OF MULTIPLEHP-R (NDRANGE PART ONLY) WITH (Nlpoints/cc IS POWER OF 2) AND WITHOUT OPTIMISING
GMB ON A10 (without candidate detection)
Columns 1-2 give Npoints/wi and Nlpoints/cc; columns 3-7 are for NMultipleHP−R−col = 16 and columns 8-12 for NMultipleHP−R−col = 64.
Npoints/wi | Nlpoints/cc | Logic utilization | RAM blocks | Freq. (MHz) | Latency (ms) | Occup. | Logic utilization | RAM blocks | Freq. (MHz) | Latency (ms) | Occup.
1 | 2 | 14% | 12% | 269.2 | 350.8 | 93.4% | 15% | 12% | 266.7 | 336.2 | 98.3%
2 | 4 | 15% | 13% | 286.5 | 176.7 | 87% | 16% | 12% | 252.8 | 180.9 | 96.6%
4 | 7 | 22% | 14% | 196.3 | 189.1 | 59.4% | 22% | 15% | 161.9 | 248.1 | 55%
4 | 8 | 19% | 16% | 263.0 | 107.7 | 77.8% | 19% | 30% | 229.6 | 102.5 | 93.8%
8 | 13 | 35% | 17% | 130.3 | 163.9 | 51.6% | 37% | 17% | 135.1 | 158.4 | 51.6%
8 | 16 | 28% | 29% | 168.1 | 93.1 | 70.5% | 29% | 48% | 171.4 | 71.7 | 89.7%
Figure 12. Execution latency of proposed harmonic summing methods with
candidate detection on A10, where SHP represents SINGLEHP and MHP
represents MULTIPLEHP
Overall Comparison: Based on the discussion above,
the execution latency of each well-optimised method with
candidate detection is given in Figure 12 for both types of
FPGA devices, where the red dashed line is the current time
limit for the SKA harmonic summing. We also evaluate a
setting where three A10 FPGAs are used in parallel.
Note that SINGLEHP-(M, R) with Nparal = 16 on S5 requires
a large number of RAM blocks and cannot be compiled; hence,
SINGLEHP-(S, R) with Nparal = 8 is used. MULTIPLEHP-N-(1)
is faster than Naïve MULTIPLEHP and MULTIPLEHP-H-(8, 192);
however, it is about 3 times slower than MULTIPLEHP-R-(16, 4).
Except for Naïve MULTIPLEHP on S5, all MULTIPLEHP kernels
perform better than SINGLEHP-(S, R) with Nparal = 8.
Although the performance is improved by adopting
MULTIPLEHP kernels, none of these kernels on a single A10
meets the requirement. With three A10 FPGA cards installed,
the cards can work in parallel, each processing a different
half FOP. The average execution latencies per half FOP using
three A10 cards are given in Figure 12 as well. It can be seen
that kernel MULTIPLEHP-R on three A10 cards is over 2 times
faster than the required time limit, so three A10 cards can
process the whole FOP while meeting the requirements.
Comparison with CPU and GPU: We now compare
the performance of the proposed kernels on GPU (using
adjusted OpenCL code) and CPU (using equivalent OpenMP
implementations). SINGLEHP-(M, ) is evaluated on the R7 GPU,
and the host argument settings are the same as for the
FPGA-based implementation. The straightforward C code with
OpenMP directives, using three levels of for loops as in
Algorithm 1, is evaluated on the I7 CPU using
all four cores. The execution latency of SINGLEHP using one
core of the I7 CPU (I7-1C) is taken as the baseline, and the
speedups over it on the other devices are given in Table VII,
where I7-4C represents using four cores of the I7 CPU. It can
be seen that R7 performs best among these devices and it is
about 3.6 times faster than the A10 FPGA. The R7 has two
major advantages over S5 and A10: 1) operating frequency and
2) off-chip memory bandwidth. Though the maximum frequency
of A10 is higher than that of R7, the frequencies of the
implemented kernels are less than 300MHz in this work.
Table VII
SPEEDUP OF MULTI-CORE CPU, GPU, AND FPGA PLATFORMS OVER
SINGLE CORE CPU IN PROCESSING SINGLEHP KERNEL INCLUDING
CANDIDATES DETECTION
Device | Execution latency (ms) | Speedup over I7-1C
S5 | 875 | 4.8
A10 | 671 | 6.2
R7 | 119 | 35.2
I7-4C | 1,100 | 3.8
I7-1C | 4,174 | 1
Regarding the MULTIPLEHP kernels on GPU, OpenCL code
similar to that used for the FPGA kernels Naïve
MULTIPLEHP and MULTIPLEHP-H is tested. The execution
latencies of both kernels are over 30 seconds, which
is about a hundred times slower than on a single
A10 FPGA. Because these two variants are single work-item
kernels, the GPU cannot parallelise operations across multiple
stream processors. For the fastest MULTIPLEHP kernel on
A10, which is MULTIPLEHP-R-(64, 8) (NDRange kernel part),
the execution latency (without candidate detection) on
R7 is 19.7 ms, which is 3.7 times faster than on A10.
After combining it with the candidate detection, which is a
single work-item kernel, the performance drops as Ncand
increases. When Ncand = 1, the execution latency is 46.8 ms.
However, when Ncand is increased to 200, the latency increases
to 10 seconds. Since single work-item kernels cannot
exploit the performance potential of the GPU, we only compare
the performance of NDRange kernels on FPGA and GPU devices.
Based on the above, an R7 is over 3.7 times faster than
an A10 in executing the same NDRange kernels. Regarding
the single work-item kernels, GPU implementations cannot
compete with FPGAs, being tens to hundreds of times slower
than FPGAs.

Table VIII
POWER CONSUMPTION AND ENERGY DISSIPATION OF FPGA, GPU, AND
CPU PLATFORMS (WITHOUT CANDIDATE DETECTION)
Kernel-Setting (Device)         Power (watts)   Energy (Joules)   Saving ratio
SINGLEHP-(M, R) (A10 × 3)       23              3.36              19.9
SINGLEHP-(M, ) (R7)             65              8.9               7.5
SINGLEHP (I7-4C)                43              47.85             1.4
SINGLEHP (I7-1C)                16              66.8              1
Naïve MULTIPLEHP (A10 × 3)      7               0.91              73.4
MULTIPLEHP-H (A10 × 3)          10              1.75              38.2
MULTIPLEHP-N (A10 × 3)          14              1.11              60.0
MULTIPLEHP-R (R7)               49              0.965             69.2
MULTIPLEHP-R (A10 × 3)          22              0.526             127.0
D. Energy Dissipation and Power Consumption
The execution latency is a significant performance criterion
for the harmonic-summing module. However, in the context
of the pulsar search engine in SKA1-MID, over 2,000 beams
need to be computed in parallel, continuously and for many
years. As a result, the power
consumption is another essential criterion, which we investigate
in this subsection.
To do so, we calculate the difference between the system
power consumption in idle status, P_idle (including the
acceleration device), and the power consumption P_running
when the system is executing the kernel. To make sure the value
of P_running is stable, each kernel is launched hundreds of
times in a loop, which takes several minutes.
The power consumption is measured using a plug-in power
meter (Ego smart socket ESS-AU). For the FPGA measurements,
the reported power consumption is that of
three A10 cards in one host. The power consumption and
energy dissipation of executing different kernels are given in
Table VIII. The energy cost is the energy dissipated in
processing the input half FOP, and the energy saving ratio is
relative to the I7-1C. Since the execution latencies of
MULTIPLEHP kernels with the single work-item kernel (in
Section VI-C) on GPU are over ten times larger than those on
FPGA, the MULTIPLEHP kernels with a single work-item part are
not compared with GPU.
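The relation between the measured quantities can be sketched as follows. This is a minimal illustration, assuming that the energy dissipation equals the net power (running minus idle) multiplied by the execution latency, and that the saving ratio is the baseline energy divided by the device energy. The wall-socket readings of 149 W and 100 W are hypothetical and chosen only so that their difference matches the 49 W listed for MULTIPLEHP-R on R7 in Table VIII; the function names are illustrative.

```python
def net_power_w(p_running_w, p_idle_w):
    """Power attributed to the kernel: running power minus idle power."""
    return p_running_w - p_idle_w


def energy_j(power_w, latency_ms):
    """Energy in joules, assumed to be power (W) times latency (s)."""
    return power_w * latency_ms / 1000.0


def saving_ratio(device_energy_j, baseline_energy_j):
    """Energy saving ratio relative to the baseline device."""
    return baseline_energy_j / device_energy_j


# Hypothetical meter readings (assumption): only their difference, 49 W,
# is taken from Table VIII.
power = net_power_w(p_running_w=149.0, p_idle_w=100.0)

# 19.7 ms is the reported latency of MULTIPLEHP-R on R7 without
# candidate detection; 66.8 J is the I7-1C energy from Table VIII.
energy = energy_j(power, latency_ms=19.7)
ratio = saving_ratio(energy, baseline_energy_j=66.8)

if __name__ == "__main__":
    print(f"net power: {power:.0f} W")
    print(f"energy:    {energy:.3f} J")
    print(f"saving:    {ratio:.1f}")
```

Under these assumptions the sketch yields the energy and saving ratio listed for MULTIPLEHP-R on R7 in Table VIII.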
Although the execution latency on R7 is lower than that on
A10, the energy dissipation of R7 is over 1.8 times higher than
that of three A10s. An interesting observation from Table VIII
is that the power consumption of kernels SINGLEHP-(M, R)
and MULTIPLEHP-R on A10 is significantly higher than that of
the other MULTIPLEHP kernels on A10. The main reason is that
the GMB used by SINGLEHP-(M, R) and MULTIPLEHP-R
is optimised and much higher than that of the other kernels.
Streaming data between off-chip memory and the FPGA makes the
power consumption of a kernel up to 3 times higher than that of
the other MULTIPLEHP kernels.
In summary, a single R7 needs over 2 times
more power than three A10 cards. Regarding the energy
dissipation, the cost of R7 is up to 2.6 times higher than that
of three A10 cards in executing the same kernels while providing
similar performance.
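The ratios quoted in this summary can be checked against Table VIII. The sketch below is illustrative only: the dictionary layout and function name are assumptions, and the SINGLEHP rows use setting (M, ) on R7 and (M, R) on A10, as listed in the table.

```python
# (power in watts, energy in joules) per kernel, copied from Table VIII.
r7 = {"SINGLEHP": (65, 8.9), "MULTIPLEHP-R": (49, 0.965)}
a10_x3 = {"SINGLEHP": (23, 3.36), "MULTIPLEHP-R": (22, 0.526)}


def r7_over_a10_ratios():
    """Return {kernel: (power ratio, energy ratio)} of one R7 over three A10s."""
    ratios = {}
    for kernel, (gpu_power, gpu_energy) in r7.items():
        fpga_power, fpga_energy = a10_x3[kernel]
        ratios[kernel] = (gpu_power / fpga_power, gpu_energy / fpga_energy)
    return ratios


if __name__ == "__main__":
    for kernel, (power_ratio, energy_ratio) in r7_over_a10_ratios().items():
        print(f"{kernel:13s} power x{power_ratio:.2f}  energy x{energy_ratio:.2f}")
```

For both kernels the power ratio exceeds 2, and the energy ratio ranges from about 1.8 (MULTIPLEHP-R) to about 2.6 (SINGLEHP).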
VII. CONCLUSIONS
In this paper, we investigated FPGA designs of one module
of the SKA pulsar search engine called harmonic-summing.
OpenCL was chosen to implement the proposed designs,
and two types of FPGA cards (Intel Stratix V and Arria
10 FPGAs) and a GPU card were employed for evaluation.
Two approaches to harmonic-summing were studied: 1) storing
intermediate data in off-chip memory and 2) processing the
input signals directly without storing intermediate data. For
the second approach, since a naive implementation does not
provide good performance, two approaches to preloading data
were proposed and evaluated: 1) preloading the points that are
touched most and 2) preloading all necessary points that are
used to generate a chunk of output points. For the latter
preloading approach, reordering the input signals was
investigated as well.
The extensive experimental evaluation demonstrated that
kernels with intermediate data storage perform worse than
kernels without storing intermediate data in both execution
latency and power consumption. A single FPGA can achieve
9.5x speedup over single-core CPU using the general SIN-
GLEHP method. By using three A10 FPGAs, the NDRange
MULTIPLEHP kernels perform significantly better than a
single R7 GPU in power consumption, while only being
slightly slower regarding execution latency. To process the
same amount of data using the same OpenCL kernel, the R7 GPU
consumes up to 2.6 times more energy than three A10 FPGAs.
This work shows that FPGA devices can be a good solution for
the SKA project for the processing parts of the pulsar search
pipeline.
ACKNOWLEDGMENT
The authors acknowledge discussions with the TDT, a
collaboration between Manchester and Oxford Universities
and MPIfR Bonn; this work benefitted from that
collaboration. We would like to thank Petr Dobias and Emmanuel
Casseau from IRISA, University of Rennes 1. We gratefully
acknowledge that this research was financially supported by
the SKA funding of the New Zealand government through the
Ministry of Business, Innovation and Employment (MBIE).
REFERENCES
[1] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed
Kammoona, Jason H Anderson, Stephen Brown, and Tomasz Czajkowski.
Legup: high-level synthesis for fpga-based processor/ accelerator systems.
In Proceedings of the 19th ACM/SIGDA international symposium on Field
programmable gate arrays, pages 33-36. ACM, 2011.
[2] Christopher Carilli and Steve Rawlings, Science with the Square Kilome-
ter Array: motivation, key science projects, standards and assumptions,
arXiv preprint astro-ph/0409274, 2004.
[3] Doris Chen and Deshanand Singh, Invited paper: Using OpenCL to
evaluate the efficiency of CPUS, GPUS and FPGAS for information
filtering, In 22nd International Conference on Field Programmable Logic
and Applications (FPL), 5–12. IEEE, 2012.
[4] Tomasz S Czajkowski, Utku Aydonat, Dmitry Denisenko, John Freeman,
Michael Kinsner, David Neto, Jason Wong, Peter Yiannacouras, and
Deshanand P Singh. From OpenCL to high-performance hardware on
FPGAs. In 22nd International Conference on Field Programmable Logic
and Applications (FPL), 531–534. IEEE, 2012.
[5] Ludovico De Souza, John D Bunton, Duncan Campbell-Wilson, Roger J
Cappallo, and Bart Kincaid. A radio astronomy correlator optimized for
the Xilinx Virtex-4 SX FPGA. In International
Conference on Field Programmable Logic and Applications (FPL), pages
62–67. IEEE, 2007.
[6] Peter E Dewdney, Peter J Hall, Richard T Schilizzi, and T Joseph LW
Lazio. The square kilometre array. Proceedings of the IEEE, 97(8):1482–
1496, 2009.
[7] Ken Eguro and Ramarathnam Venkatesan. FPGAs for trusted cloud
computing. In 22nd International Conference on Field Programmable
Logic and Applications (FPL), pages 63–70. IEEE, 2012.
[8] Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S Chung, and Greg
Stitt. A high memory bandwidth fpga accelerator for sparse matrix-
vector multiplication. In Field-Programmable Custom Computing Ma-
chines (FCCM), 2014 IEEE 22nd Annual International Symposium on,
pages 36–43. IEEE, 2014.
[9] Khronos OpenCL Working Group et al. The OpenCL specification, version
1.0.29, 8 December 2008.
[10] Robert J Halstead, Jason Villarreal, and Walid Najjar. Exploring irregular
memory accesses on fpgas. In Proceedings of the 1st Workshop on
Irregular Applications: Architectures and Algorithms, pages 31–34. ACM,
2011.
[11] Antal Hiba, Zoltan Nagy, and Miklos Ruszinko. Memory access opti-
mization for computations on unstructured meshes, In Cellular Nanoscale
Networks and Their Applications (CNNA), 2012 13th International Work-
shop on, pages 1–5. IEEE, 2012.
[12] Sitao Huang, Gowthami Jayashri Manikandan, Anand Ramachandran,
Kyle Rupnow, W Hwu Wen-mei, and Deming Chen. Hardware Ac-
celeration of the Pair-HMM Algorithm for DNA Variant Calling. In
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pages 275–284. ACM, 2017.
[13] Intel. Intel FPGA SDK for OpenCL Best Practices Guide, 2016.
[14] Akanksha Jain and Calvin Lin. Linearizing irregular memory accesses
for improved correlated prefetching. In Proceedings of the 46th Annual
IEEE/ACM International Symposium on Microarchitecture, pages 247–
259. ACM, 2013.
[15] Byunghyun Jang, Dana Schaa, Perhaad Mistry, and David Kaeli. Ex-
ploiting memory access patterns to improve memory performance in
data-parallel architectures. IEEE Transactions on Parallel and Distributed
Systems, 22(1):105–118, 2011.
[16] Christian Leber, Benjamin Geib, and Heiner Litz. High frequency trading
acceleration using fpgas. In Field Programmable Logic and Applications
(FPL), 2011 International Conference on, pages 317–322. IEEE, 2011.
[17] John Mellor-Crummey, David Whalley, and Ken Kennedy. Improving
memory hierarchy performance for irregular applications using data and
computation reorderings. International Journal of Parallel Programming,
29(3):217–247, 2001.
[18] Aaron Parsons, Dan Werthimer, Donald Backer, Tim Bastian, Geoffrey
Bower, Walter Brisken, Henry Chen, Adam Deller, Terry Filiba, Dale
Gary, et al. Digital instrumentation for the radio astronomy community.
arXiv preprint arXiv:0904.1181, 2009.
[19] Pavel Karas and David Svoboda. Algorithms for efficient computation
of convolution. In Design and Architectures for Digital Signal Processing.
InTech, 2013.
[20] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou,
Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fow-
ers, Gopi Prashanth Gopal, Jan Gray, et al. A reconfigurable fabric for
accelerating large-scale datacenter services. In Computer Architecture
(ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13–24.
IEEE, 2014.
[21] Scott M Ransom, Stephen S Eikenberry, and John Middleditch. Fourier
techniques for very long astrophysical time-series analysis. The Astro-
nomical Journal, 124(3):1788, 2002.
[22] MA Sanchez, Mario Garrido, Marisa Lopez-Vallejo, Jesus Grajal, and
Carlos Lopez-Barrio. Digital channelised receivers on fpgas platforms.
In Radar Conference, 2005 IEEE International, pages 816–821. IEEE,
2005.
[23] Srikanth Sridharan, Paolo Durante, Christian Faerber, and Niko Neufeld.
Accelerating particle identification for high-speed data-filtering using
opencl on fpgas and other architectures. In Field Programmable Logic
and Applications (FPL), 2016 26th International Conference on, pages
1–7. IEEE, 2016.
[24] Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto
Leon-Garcia, and Paul Chow. Enabling flexible network fpga clusters in a
heterogeneous cloud data center. In Proceedings of the 2017 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages
237–246. ACM, 2017.
[25] Haomiao Wang, Ming Zhang, Prabu Thiagaraj, and Oliver Sinnen.
FPGA-based Acceleration of FDAS Module Using OpenCL. In Field
Programmable Technology (FPT), 2016 International Conference on,
pages 53–60. IEEE, 2016.
[26] Haomiao Wang, Ming Zhang, Prabu Thiagaraj, and Oliver Sinnen.
FPGA-based Acceleration of FDAS Module Using OpenCL. In Field
Programmable Technology (FPT), 2016 International Conference on,
pages 53–60. IEEE, 2016.
[27] Xu Wang, Linan Huang, Yongxin Zhu, Yipeng Zhou, Huwan Peng, and
Haifei Xiong. Addressing memory wall problem of graph computation in
reconfigurable system. In High Performance Computing and Communica-
tions (HPCC), 2015 IEEE 7th International Symposium on Cyberspace
Safety and Security (CSS), 2015 IEEE 12th International Conference on
Embedded Software and Systems (ICESS), 2015 IEEE 17th International
Conference on, pages 302–307. IEEE, 2015.
[28] Markus Weinhardt and Wayne Luk. Memory access optimization and
ram inference for pipeline vectorization. In International Conference
on Field Programmable Logic and Applications (FPL), pages 61–70.
Springer, 1999.
[29] Markus Weinhardt and Wayne Luk. Memory access optimisation for
reconfigurable systems. IEE Proceedings-Computers and Digital Tech-
niques, 148(3):105–112, 2001.
[30] Hsin-Jung Yang, Kermin Fleming, Michael Adler, and Joel Emer.
Optimizing under abstraction: Using prefetching to improve fpga per-
formance. In Field Programmable Logic and Applications (FPL), 2013
23rd International Conference on, pages 1–8. IEEE, 2013.
