AccD: A Compiler-based Framework for Accelerating Distance-related
  Algorithms on CPU-FPGA Platforms by Wang, Yuke et al.
Yuke Wang1, Boyuan Feng1, Gushu Li2, Lei Deng2, Yuan Xie2, and Yufei Ding1
1Department of Computer Science
2Department of Electrical and Computer Engineering
1{yuke_wang,boyuan,yufeiding}@cs.ucsb.edu
2{gushuli,leideng,yuanxie}@ece.ucsb.edu
University of California, Santa Barbara
Abstract—As a promising solution to boost the performance
of distance-related algorithms (e.g., K-means and KNN),
FPGA-based acceleration has attracted considerable attention, but it
also comes with numerous challenges. In this work, we propose AccD,
a compiler-based framework for accelerating distance-related
algorithms on CPU-FPGA platforms. Specifically, AccD provides
a Domain-specific Language to unify distance-related algorithms
effectively, and an optimizing compiler to reconcile the benefits
of algorithmic optimization on the CPU and hardware
acceleration on the FPGA. The output of AccD is a
high-performance and power-efficient design that can be easily
synthesized and deployed on mainstream CPU-FPGA platforms.
Extensive experiments show that AccD designs achieve 31.42×
speedup and 99.63× better energy efficiency on average over
standard CPU-based implementations.
I. INTRODUCTION
Distance-related algorithms (e.g., K-means [1], KNN [2],
and N-body Simulation [3]) play a vital role in many domains,
including machine learning and computational physics.
However, these algorithms often come with high computational
complexity, leading to poor performance and limited applicability.
To improve their performance, FPGA-based acceleration
has gained considerable interest from both industry and the research
community, given its high performance and energy efficiency. However,
accelerating distance-related algorithms on FPGAs requires
non-trivial effort, including hardware expertise, time, and
monetary cost. While existing works try to ease this process,
they inevitably fall short in one of the following aspects.
Rely on problem-specific design and optimization while
missing effective generalization. There is no unified
abstraction to formalize the definition and optimization of
distance-related algorithms systematically. Most previous hardware
designs and optimizations [4–7] are heavily tailored to a
specific algorithm (e.g., K-means) and cannot be shared
across different distance-related algorithms. Moreover, these
"hard-coded" strategies can fail to keep up with
ever-changing upper-level algorithmic optimizations and
underlying hardware settings, which can result in a large
cost of re-design and re-implementation as the design
evolves.
Lack of algorithm-hardware co-design. Previous algorithmic
[8, 9] and hardware optimizations [4–7, 10] are usually
applied separately instead of being combined collaboratively.
Existing algorithmic optimizations, most of which are based
on Triangle Inequality (TI) [8, 9, 11, 12], are crafted for
sequential CPUs. Despite removing a large number of
distance computations, they also incur high computation
irregularity and memory overhead. Therefore, directly applying
these algorithmic optimizations to massively parallel platforms
without appropriate hardware-aware adaptation could
lead to inferior performance.
Count on FPGAs as the only source of acceleration.
Previous works [4, 7, 13–16] place the whole algorithm on
the FPGA accelerator without considering assistance from the
computing resources of the host CPU. As a result, their designs
are usually limited by the on-chip memory and computing
elements, and cannot fully exploit the power of the FPGA.
Moreover, they miss the full performance benefits of the
heterogeneous computing paradigm, such as using the CPU
for complex logic and control operations while offloading the
compute-intensive tasks to the FPGA.
Lack of well-structured design workflow. Previous
works [5–7, 13, 15] follow the traditional way of hardware
implementation and require intensive user involvement in
hardware design, implementation, and extra manual tuning,
which usually leads to long development-to-validation
cycles. Also, the problem-specific strategy leads to a
case-by-case design process, which cannot be widely applied to
handle different problem settings.
To this end, we present
a compiler-based optimization framework, AccD, to automatically
accelerate distance-related algorithms on the CPU-FPGA
platform (shown in Figure 1). First, AccD provides a
Distance-related Domain-Specific Language (DDSL) as a
problem-independent abstraction to unify the description and
optimization of various distance-related algorithms. With the
assistance of the DDSL, end users can easily create highly efficient
CPU-FPGA designs by focusing only on the high-level problem
specification, without touching the algorithmic optimization or
hardware implementation.
arXiv:1908.11781v1 [cs.DC] 26 Aug 2019
[Figure 1: A DDSL algorithm description enters the AccD Compiler, which contains the Design Space Explorer (DSE), Algorithmic Optimization (Generalized Triangle-Inequality (GTI): Two-landmark Bound, Trace-based Bound, Group-level Bound), and Hardware Optimization (Distance Computation Kernel, Memory Layout). The compiler emits host source, built with the GCC compiler and linked against the OpenCL library, and kernel source, built with the OpenCL kernel compiler. The resulting design is deployed and executed across PCIe.]
Fig. 1: AccD Overview.
Second, AccD offers a novel algorithm-hardware
co-optimization scheme to reconcile the acceleration from both
sides. At the algorithmic level, AccD incorporates a novel
Generalized Triangle Inequality (GTI) optimization to eliminate
unnecessary distance computations while maintaining
computation regularity to a large extent. At the hardware level,
AccD employs a specialized data layout to enforce memory
coalescing and an optimized distance computation kernel to
accelerate the distance computations on the FPGA.
Third, AccD leverages both the host and accelerator side of
the CPU-FPGA heterogeneous system for acceleration. In
particular, AccD assigns the algorithm-level optimization
(e.g., data grouping and distance computation filtering), which
consists of complex operations with execution dependencies and
offers little pipelining or parallelism, to the CPU. On the other
hand, AccD assigns hardware-level acceleration (e.g., distance
computations), which is composed of simple and
vectorizable operations, to the FPGA. Such a mapping capitalizes
on the strength of the CPU in managing control-intensive tasks and
the advantage of the FPGA in accelerating computation-intensive
workloads.
Lastly, the AccD compiler integrates an intelligent Design
Space Explorer (DSE) to pinpoint the "optimal" design for
different problem settings. In general, there is no
"one size fits all" solution: the best configuration for algorithmic and
hardware optimization differs across distance-related
algorithms and across different inputs of the same
algorithm. To produce a high-quality optimization
configuration automatically and efficiently, the DSE combines
design modeling (performance and resource) with a Genetic
Algorithm to guide the design space search.
Overall, our contributions are:
• We propose the first optimization framework that can
automatically optimize and generate high-performance
and power-efficient designs of distance-related algorithms
on CPU-FPGA heterogeneous computing platforms.
• We develop a Domain-specific Language, DDSL, to unify
different distance-related algorithms in an effective and
succinct manner, laying the foundation for general opti-
mizations across different problems.
• We build an optimizing compiler for the DDSL, which
automatically reconciles the benefits from both the algo-
rithmic optimization on CPU and hardware acceleration
on FPGA.
• Extensive experiments on several popular algorithms
across a wide spectrum of datasets show that
AccD-generated CPU-FPGA designs achieve 31.42×
speedup and 99.63× better energy efficiency on average
compared with standard CPU-based implementations.
II. RELATED WORK
Previous research accelerates distance-related algorithms in
two aspects: Algorithmic Optimization and Hardware Acceler-
ation. More details are discussed in the following subsections.
A. Algorithmic Optimization
From the algorithmic standpoint, previous research high-
lights two optimizations. The first one is KD-tree based opti-
mization [17–21], which relies on storing points in special data
structures to enable nearest neighbor search without computing
distances to all target points. These methods often deliver
3× ∼ 6× performance improvement [17–21] compared with
the unoptimized versions in low dimensional space, while suf-
fering from a serious performance degradation when handling
large datasets with high dimension (d ≥ 20) due to their
exponentially-increased memory and computation overhead.
The second one is TI-based optimization [8, 9, 11, 12],
which replaces computation-expensive distance computations
with cheaper bound computations and has demonstrated its
flexibility and scalability. It not only reduces the computation
complexity at different levels of granularity but is also
more adaptive and robust to datasets with a wide range
of sizes and dimensions. However, most existing works focus
on one specific algorithm (e.g., KNN [12], K-means [8, 9],
etc.) and lack extensibility and generality across different
distance-related problems. An exception is a recent work,
TOP [11], which builds a unified framework to optimize
various distance-related problems with pure TI optimization
on CPUs. Our work shares a similar high-level motivation
but targets a more challenging scenario: algorithmic
and hardware co-optimization on CPU-FPGA platforms.
B. Hardware Acceleration
From the hardware perspective, several FPGA accelerator
designs have been proposed, but still suffer from some major
limitations.
First, previous FPGA designs are generally built for a specific
distance-related algorithm and hardware. For example,
the works in [4–6] target KNN FPGA acceleration, while
those in [7, 13, 14] focus on K-means. Moreover,
previous designs [4, 5] usually assume that the dataset fits
entirely into the FPGA on-chip memory, and they are only evaluated
on a limited number of small datasets. For example, in [5],
K-means acceleration is evaluated on a micro-array dataset with
only 2,905 points. These designs often encounter portability
issues when transferred to different settings. Besides, these
"hard-coded" designs and optimizations make a
fair comparison among different designs difficult, which hampers future
studies in this direction.
The second problem with previous works is that they
fail to incorporate algorithmic optimizations in the hardware
design. For example, the works in [4, 6, 7, 13] directly port
the standard K-means and KNN algorithms to the FPGA and
only apply hardware-level optimization. One exception is a
recent work [22], which proposes combining TI optimization
and FPGA acceleration for K-means. It gives a considerable
speedup compared to state-of-the-art methods, showcasing
the great opportunity of applying algorithm-hardware
co-optimization. Nevertheless, this idea is far from well explored,
possibly because it requires domain knowledge and expertise
in both algorithms and hardware to combine them
effectively.
In addition, previous works largely rely on the traditional
hardware design flow, which requires a long implementation
cycle and substantial manual effort. For example, the works in [4,
6, 10, 15, 16, 23–25] build their designs with a VHDL/Verilog
design flow, which requires hardware expertise and
months of arduous development. In contrast, our AccD design
flow brings significant advantages in programmability and
flexibility due to its high-level OpenCL-based programming
model, which minimizes user involvement in the tedious
hardware design process.
III. DISTANCE-RELATED ALGORITHM DOMAIN-SPECIFIC
LANGUAGE (DDSL)
Distance-related algorithms share commonalities across different
application domains and scenarios, even though they
look different in their high-level algorithmic description.
Therefore, it is possible to generalize these distance-related
algorithms. The AccD framework defines a DDSL, which provides
a high-level programming interface to describe distance-related
algorithms in a unified manner. Unlike the API-based
programming interface used in the TOP framework [11],
DDSL is built on a C-like language and provides more flexibility
in low-level control and performance tuning, which is
crucial for FPGA accelerator design.
Specifically, DDSL uses several constructs to describe
the basic components (Definition, Operation, and Control) of
distance-related algorithms, and also identifies potential
parallelism and pipelining opportunities at design time.
We detail these constructs in the remainder of this section.
A. Data Construct
Data construct is a basic Definition Construct. It uses the
DVar and DSet primitives to name a single variable or a dataset,
and the DType primitive to specify the type of the
defined variable. Data construct serves as the basis for the AccD
compiler to understand the algorithm description input, such as
the data points that are used in the distance-related algorithms.
An example of data constructs is shown in the code below,
where we define a variable and a dataset using the DDSL data
construct.
/* Define a single variable */
DVar [setName] DType [Optional_Initial_Value];
/* Define the matrix of dataset */
DSet [setName] DType [size] [dim];
In most distance-related algorithms, the dataset can be
defined as the source set and the target set. For example, in K-
means, the source set is the set of data points, and the target
set is the set of clusters. Currently, AccD supports several
data types including int (32-bit), float (32-bit), double (64-
bit) based on the users’ requests, algorithm performance, and
accuracy trade-offs.
B. Distance Compute Construct
Distance computation is the core Operation Construct for
distance-related algorithms, which measures the exact distance
between two different data points. This construct requires
several fields, including data dimensionality, distance metrics,
and weight matrix (if weighted distance is specified).
AccD_Comp_Dist(Input p1, Input p2, Output disMat, Output idMat,
               Dim dim, Met mtr, Weg mat)

Parameter   Description
p1, p2      Input data matrices (n1 × d, n2 × d)
disMat      Output distance matrix (n1 × n2)
idMat       Output ID matrix (n1 × n2)
dim         Dimensionality of the input data points
mtr         Distance metric (Weighted | Unweighted)
mat         Weight matrix, used for weighted distance (1 × d)
TABLE I: Distance Compute Construct Parameters.
C. Distance Selection Construct
Distance selection construct is an Operation Construct for
distance value selection. It returns the Top-K smallest or
largest distances and the IDs of their corresponding points
from the provided distance and ID lists. This construct helps
the AccD compiler to understand which distances the user is interested in.
AccD_Dist_Select(Input distMat, Input idMat, Output TopKMat,
                 Range ran, Scope scp)

Parameter   Description
TopKMat     Top-K ID matrix (n1 × k)
ran         Scalar value of K (e.g., K-means, KNN) or
            distance threshold (e.g., N-body Simulation)
scp         Top-K (smallest | largest) values
TABLE II: Distance Selection Construct Parameters.
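To make the two operation constructs concrete, the following Python sketch gives one plausible reading of their semantics. It is not code generated by AccD: the function names comp_dist and dist_select are hypothetical, the L1 metric is an assumption borrowed from the "Unweighted L1" string in the K-means example, and idMat is assumed to hold the target index of each distance entry.

```python
def comp_dist(p1, p2, weights=None):
    """Return (dist_mat, id_mat) for every pair of rows in p1 and p2.

    Assumption: L1 (Manhattan) distance, optionally weighted per dimension.
    """
    dim = len(p1[0])
    w = weights if weights is not None else [1.0] * dim
    dist_mat, id_mat = [], []
    for a in p1:
        row = [sum(w[k] * abs(a[k] - b[k]) for k in range(dim)) for b in p2]
        dist_mat.append(row)
        id_mat.append(list(range(len(p2))))  # assumed: target ID of each column
    return dist_mat, id_mat


def dist_select(dist_mat, id_mat, k, scope="smallest"):
    """Return the IDs of the k smallest (or largest) distances in each row."""
    top_k = []
    for dists, ids in zip(dist_mat, id_mat):
        order = sorted(range(len(dists)), key=lambda j: dists[j],
                       reverse=(scope == "largest"))
        top_k.append([ids[j] for j in order[:k]])
    return top_k


points = [[0.0, 0.0], [5.0, 5.0]]
centers = [[1.0, 0.0], [4.0, 4.0], [9.0, 9.0]]
dist_mat, id_mat = comp_dist(points, centers)
nearest = dist_select(dist_mat, id_mat, k=1)
```

Here nearest holds, for each point, the ID of its closest center, which is the information a K-means update step would consume.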
D. Data Update Construct
Data update construct is an Operation Construct for
updating the data points based on the results from the prior
constructs. For example, K-means updates the cluster centers
by averaging the positions of the points assigned to each cluster. This construct
requires the variable to be updated and the additional information
needed to finish this update, such as the point-to-cluster distances. The
status of the data update is returned after all of its inner
operations complete; the status variable tells whether
the data update changed anything.
AccD_Update(Update var, Input p1, ..., Input pm, Status s)

Parameter     Description
var           Input data/dataset to be updated
p1, ..., pm   Additional information used in the update
s             Status of the update operation
TABLE III: Data Update Construct Parameters.
E. Iteration Construct
Iteration construct is a top-level Control Construct. It is
used to describe distance-related algorithms that require
iteration, such as K-means. Iteration construct requires users
to provide either the maximum number of iterations or another
exit condition.
AccD_Iter(maxIterNum|exitCond){
subConstruct sc1;
subConstruct sc2;
...
subConstruct scn;
}
F. Example: K-means
To show the expressiveness of DDSL, we take K-means as
an example. From the code shown below, with no more than
20 lines of code, DDSL can capture the key components of
user-defined K-means algorithm, which is essential for AccD
compiler to generate designs for CPU-FPGA platforms.
DVar K int 10;
DVar D int 20;
DVar psize int 1400;
DVar csize int 200;
DSet pSet float psize D;
DSet cSet float csize D;
DSet distMat float psize csize;
DSet idMat int psize csize;
DSet pkMat int psize K;
AccD_Iter(S){
    S = false;
    /* Compute the inter-dataset distances */
    AccD_Comp_Dist(pSet, cSet, distMat, idMat,
                   D, "Unweighted L1", 0);
    /* Select the distances of interest */
    AccD_Dist_Select(distMat, idMat, pkMat,
                     K, "smallest");
    /* Update the cluster centers */
    AccD_Update(cSet, pSet, pkMat, S);
}
IV. ALGORITHM OPTIMIZATION
This section explains a novel TI optimization tailored
for CPU-FPGA platforms. TI has been used for optimizing
distance-related problems, but mostly on sequential processing
systems. Our design features an innovative way of
applying TI to obtain low-overhead distance bounds that eliminate
unnecessary distance computations while maintaining
computation regularity to ease hardware acceleration
on FPGAs.
A. TI in Distance-related Algorithm
As a simple but powerful mathematical concept, TI has
been used to optimize distance-related algorithms. Figure 2a
gives an illustration. It states that d(A,B) ≤ d(A,Lref) +
d(Lref,B), where d(A,B) represents the distance between
points A and B under some metric (e.g., Euclidean distance).
The auxiliary point Lref is a landmark point for reference.
Directly from the definition, we can compute both a lower
bound and an upper bound of the distance between two points A and B:
lb(A,B) = |d(A,Lref) − d(Lref,B)| ≤ d(A,B) ≤ d(A,Lref) + d(Lref,B) = ub(A,B).
This is the standard and most
common usage of TI for deriving bounds of distance.
In general, bounds can be used as a substitute for the exact
distances in distance-related data analysis. Take N-body
simulation as an example. It needs to find the target points
that are within R (the radius) of each given query point.
Suppose we get lb(A,B) = 10 and 10 > R; then we
are certain that point A is not within R of
query point B. As a result, there is no need to compute
the exact distance between points A and B. Otherwise, the
exact distance computation is still carried out for direct
comparison. While many previous works [5, 8, 9, 26, 27]
succeed in directly porting the above point-based TI to
optimize distance-related algorithms, they usually suffer from
memory overhead and computation irregularity, which result
in inferior performance.
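The radius test described above can be sketched in a few lines of Python. This is a minimal illustration rather than AccD's implementation: the helper names ti_bounds and within_radius are hypothetical, the Euclidean metric is assumed, and a real system would reuse precomputed landmark distances instead of recomputing them per query.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def ti_bounds(d_a_ref, d_b_ref):
    """Bounds on d(A, B) from the distances of A and B to one landmark."""
    lower = abs(d_a_ref - d_b_ref)
    upper = d_a_ref + d_b_ref
    return lower, upper


def within_radius(a, b, landmark, radius):
    """Decide whether d(a, b) <= radius, using the exact distance only if needed."""
    lower, upper = ti_bounds(dist(a, landmark), dist(b, landmark))
    if lower > radius:
        return False, "pruned by lower bound"
    if upper <= radius:
        return True, "accepted by upper bound"
    return dist(a, b) <= radius, "exact distance computed"


far = within_radius((0.0, 0.0), (10.0, 0.0), (0.0, 1.0), 5.0)
near = within_radius((0.0, 0.0), (1.0, 0.0), (0.5, 0.0), 2.0)
unsure = within_radius((0.0, 0.0), (3.0, 0.0), (0.0, 4.0), 3.5)
```

Only the third query falls between the two bounds and therefore pays for an exact distance computation.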
B. Generalized Triangle Inequality (GTI)
AccD uses a novel Generalized TI (GTI) to remove redundant
distance computations. It generalizes the traditional
point-based TI while significantly reducing the overhead of
bound computations. The traditional point-based TI focuses
on tighter bounds (closer to the exact distance) to remove
more distance computations, but it induces extra bound
computations, which can become the new performance bottleneck
even after many distance calculations have been removed.
In contrast, GTI strikes a good balance between distance
computation elimination and bound computation overhead. In
particular, AccD applies GTI from three perspectives:
Two-landmark bound computation, Trace-based bound computation,
and Group-level bound computation.
a) Two-landmark Bound Computation: The Two-landmark
scheme aims at reducing the bound computation through
effective distance reuse. In this case, the distance bound
between two points is measured through two landmarks as
the reference points. As illustrated in Figure 2b, the distance
bound between points A and B can be computed from
d(A,Aref), d(B,Bref), and d(Aref,Bref) through Equation
1, where Aref and Bref are the landmark points for points A
and B, respectively.
$$
\begin{aligned}
lb(A,B) &\ge d(A_{ref}, B_{ref}) - d(A, A_{ref}) - d(B, B_{ref}) \\
ub(A,B) &\le d(A_{ref}, B_{ref}) + d(A, A_{ref}) + d(B, B_{ref})
\end{aligned}
\tag{1}
$$
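[Figure 2: six panels. (a) One-landmark bound between points A and B through landmark Lref; (b) Two-landmark bound through landmarks Aref and Bref; (c) Trace-based bound where B moves to B′; (d) Trace-based bound where A and B move to A′ and B′; (e) Group-level bound combined with the Two-landmark case; (f) Group-level bound combined with the Trace-based case. Legend: point, old point position, landmark (X), distance bound, distance.]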
Fig. 2: TI Optimization.
One representative application scenario of Two-landmark
bound computation is KNN-join, where two disjoint sets of
landmarks are selected for the query and target point sets.
In this case, far fewer bound computations are required
than in the one-landmark case (shown in Figure 2a).
This can be validated through a simple calculation.
Assume that in KNN-join we have m query points, n target
points, zqry query landmarks, and ztrg target landmarks, with
zqry << m and ztrg << n in general. We then
need m + n + zqry × ztrg bound computations in the
Two-landmark case, which is much smaller than in the One-landmark
case (m × ztrg + n or n × zqry + m).
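The following Python sketch illustrates Equation 1 and the cost comparison above. The function names and the problem sizes are hypothetical and chosen only for illustration; the lower bound is additionally clamped at zero, which is a small convenience beyond Equation 1 since a distance is never negative.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def two_landmark_bounds(d_ref, d_a, d_b):
    """Equation 1: bounds on d(A, B) from d(Aref, Bref), d(A, Aref), d(B, Bref)."""
    lower = max(d_ref - d_a - d_b, 0.0)  # clamp: a distance is never negative
    upper = d_ref + d_a + d_b
    return lower, upper


def bound_costs(m, n, z_qry, z_trg):
    """Bound computations as counted in the text, for both schemes."""
    two_landmark = m + n + z_qry * z_trg
    one_landmark_trg = m * z_trg + n   # landmarks chosen on the target side
    one_landmark_qry = n * z_qry + m   # landmarks chosen on the query side
    return two_landmark, one_landmark_trg, one_landmark_qry


a, a_ref = (0.0, 0.0), (1.0, 0.0)
b, b_ref = (10.0, 0.0), (9.0, 0.0)
lower, upper = two_landmark_bounds(dist(a_ref, b_ref), dist(a, a_ref), dist(b, b_ref))

# Illustrative sizes: 10,000 query and target points, 100 landmarks per side.
costs = bound_costs(10_000, 10_000, 100, 100)
```

With these illustrative sizes the Two-landmark scheme needs 30,000 bound computations against 1,010,000 for either One-landmark variant.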
b) Trace-based Bound Computation: Trace-based bound
computation finds its strength in iterative distance algorithms
with point updates, since it can largely reduce the bound
computation overhead over many iterations. The key to
Trace-based bound computation is selecting appropriate landmark
points as references. For example, in K-means, only the
target points (clusters) change their positions across iterations.
Therefore, we can choose the positions of the clusters
in the last iteration as the landmarks for bound computation
in the current iteration, since these "old" cluster positions are
usually close enough to the current cluster positions to offer a "tight"
bound. This process is illustrated in Figure 2c, where the
distance bound of d(A,B′) can be calculated from d(B,B′)
and d(A,B), where B′ is the new point position and B is
the old point position from the last iteration.
In addition, Trace-based bound computation can also work
together with the Two-landmark case. For example, in
N-body simulation, the source and target points are essentially
the same dataset and get updated across iterations. We
can choose the "old" position of each point from the last
iteration as the landmark for the bound computation in the
current iteration, due to its closeness to the current point
position. This case is shown in Figure 2d, where A
and B are the "old" point positions from the last iteration, and
A′ and B′ are the new source and target points. Based
on d(A,B), d(A,A′), and d(B,B′), the new distance bound
between A′ and B′ can be easily derived by using the old
points A and B as the reference points. The cost of this is
as low as O(n), where n is the number of particles, since
each point only needs to maintain the shift distance between
its new position and its old position from the last iteration (note that
we only compute the inter-point distances in the first iteration).
In contrast, applying Two-landmark alone, without the effective
temporal reuse of the old point positions, results in a
complexity of at least O(n × z), where z is the number of landmarks,
because the distances between each point and all
the landmarks have to be computed so that each point
knows its new closest landmark before the bound
computation is applied.
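A minimal Python sketch of the trace-based bound follows. It assumes the Euclidean metric, and the name trace_bounds and the sample coordinates are hypothetical. The old inter-point distance is taken as known from the previous iteration, so the only new work per iteration is one shift distance per point.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def trace_bounds(d_old, shift_a, shift_b=0.0):
    """Bounds on the new distance d(A', B') from the old distance d(A, B).

    shift_a = d(A, A') and shift_b = d(B, B') are how far each point moved.
    Pass 0 for a point that did not move (e.g., the data points in K-means).
    """
    total_shift = shift_a + shift_b
    return max(d_old - total_shift, 0.0), d_old + total_shift


old = [(0.0, 0.0), (10.0, 0.0)]
new = [(0.5, 0.0), (10.0, 1.0)]

# One shift distance per point: O(n) work per iteration.
shifts = [dist(p, q) for p, q in zip(old, new)]

# Recomputed here only to keep the sketch self-contained.
d_old = dist(old[0], old[1])
lower, upper = trace_bounds(d_old, shifts[0], shifts[1])
```

The exact new distance is about 9.55, which lies inside the interval [8.5, 11.5] obtained without touching the new pair of positions.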
c) Group-level Bound Computation: Group-level bound
computation aims at reducing the bound computation overhead
while maintaining computation regularity. It can be
combined with the two aforementioned bound cases to form a hybrid bound
computation. When combined with the Two-landmark case,
as shown in Figure 2e, the points in each group (A and B)
share the same landmark (Aref and Bref) as the reference
point. Based on d(Aref,Bref), dmax(a,Aref),
and dmax(b,Bref), we obtain the group-level bound from
Equation 2, where dmax(a,Aref) and dmax(b,Bref) are
the distances between the farthest point within each group and
its group reference point.
$$
\begin{aligned}
lb(A,B) &\ge d(A_{ref}, B_{ref}) - d_{max}(a, A_{ref}) - d_{max}(b, B_{ref}) \\
ub(A,B) &\le d(A_{ref}, B_{ref}) + d_{max}(a, A_{ref}) + d_{max}(b, B_{ref})
\end{aligned}
\tag{2}
$$
When combined with the Trace-based case, it generates
a hierarchical bound as a hybrid solution, which includes
point-group and point-point bound computation. As
exemplified in Figure 2f, each group regards its old group
center as the landmark for reference, and each point relies on
its old position as the landmark for reference. Based on the shifts
d(A,A′), d(B,B′), and d(d,d′), and the old distances d(c,A),
d(c,B), and d(c,d), where A and B are the point groups and
point d is the closest point to point c in the last iteration, we
can calculate lb(c,G′) and ub(c,d′) with Equation 3,
$$
\begin{aligned}
lb(c,G') &\ge d(c,G) - d_{max}(G,G') \\
ub(c,d') &\le d(c,d) + d_{max}(d,d')
\end{aligned}
\tag{3}
$$
where G ∈ {A,B}. If d(c,G) − dmax(G,G′) >
d(c,d) + dmax(d,d′), no point inside
group G′ can become the closest point to c in the
current iteration. Therefore, the distance computations between
point c and all points inside that group can be safely
avoided.
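The pruning test built on Equation 3 can be sketched as follows. The text does not spell out how d(c,G) and dmax(G,G′) are measured, so this sketch makes two explicit assumptions: d(c,G) is the smallest old distance from c to a member of G, and dmax(G,G′) is the largest shift of any member of G. The function name can_skip_group and the sample coordinates are hypothetical, and a real implementation would reuse old distances instead of recomputing them.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def can_skip_group(c, group_old, group_new, d_old, d_new):
    """Return True if no point of the moved group can be closer to c than d'.

    c         : query point (assumed not to move)
    group_old : positions of the group's points in the last iteration (G)
    group_new : positions of the same points in the current iteration (G')
    d_old     : last position of c's previous closest point (d)
    d_new     : current position of that point (d')
    """
    d_c_group = min(dist(c, g) for g in group_old)          # assumed d(c, G)
    max_shift = max(dist(g, g2)                             # assumed dmax(G, G')
                    for g, g2 in zip(group_old, group_new))
    lower = d_c_group - max_shift                           # lb(c, G')
    upper = dist(c, d_old) + dist(d_old, d_new)             # ub(c, d')
    return lower > upper


c = (0.0, 0.0)
d_old, d_new = (1.0, 0.0), (1.2, 0.0)

far_old, far_new = [(10.0, 0.0), (11.0, 0.0)], [(9.5, 0.0), (11.0, 0.5)]
near_old, near_new = [(2.0, 0.0)], [(1.1, 0.0)]

skip_far = can_skip_group(c, far_old, far_new, d_old, d_new)
skip_near = can_skip_group(c, near_old, near_new, d_old, d_new)
```

The far group is skipped without any new distance to its members, while the near group must still be examined because one of its points may now be closer than d′.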
In addition to saving distance computations, group-level bound
computation offers two further benefits that facilitate the underlying
hardware acceleration. First, the remaining distance computations
become more regular than with
point-level bound computation, since the points inside
each group share the same computation pattern, which
facilitates parallelization. For example,
point-level bound computation usually results in a large
divergence of distance computations among different points, as
shown in Figure 3a, which hinders parallelization and
pipelining. In group-level bound computation, however, the points
inside the same source group always maintain the same
groups of target points for distance computation, as shown in
Figure 3b.
[Figure 3: (a) point-level mapping from source points to target points: sp1 → tp1, tp2, tp4; sp2 → tp3, tp4; sp3 → tp4; sp4 → tp2, tp3. (b) group-level mapping from source groups (SG) to target groups (TG): SG1 → TG1, TG2; SG2 → TG2.]
Fig. 3: Bound Computation at (a) Point-level, (b) Group-level.
Second, group-level bound computation reduces memory
overhead. Assume we have m
source points, n target points, zsrc source groups, and ztrg
target groups. The memory overhead of maintaining distance
bounds is O(m × n) in the point-level bound computation
case. In the group-level bound computation case, however,
we only have to maintain distance bounds among groups, and
the memory overhead is O(zsrc × ztrg), where zsrc << m and
ztrg << n. Therefore, in terms of memory efficiency,
group-level bound computation outperforms point-level bound
computation to a great extent.
V. HARDWARE ACCELERATION
The AccD design is built on the CPU-FPGA architecture, which
offers significant performance and energy efficiency
and has been widely adopted as a modern data center
solution for high-performance computing and acceleration.
The host side of the AccD design is responsible for
data grouping and distance computation filtering, which consist
of complex operations with execution dependencies and
offer little pipelining or parallelism. On the other hand, the FPGA
side of the AccD design is built for accelerating the distance
computations, which are composed of simple and vectorizable
operations.
While the FPGA accelerator offers high computation
capability, the memory bandwidth bottleneck constrains the
overall design performance. Therefore, optimizing data placement
and memory architecture is the key to improving memory
performance. In addition, the OpenCL-based programming
model adds a layer of architectural complexity to kernel
design and management, which is also critical to the design
performance. The AccD framework distinguishes itself by using a
novel memory and kernel optimization strategy that is tailored
for TI-optimized distance-related algorithms on CPU-FPGA
designs.
A. Memory Optimization
After applying the GTI optimization to remove the redundant
distance computations, each source point group has
different target groups as candidates for distance computation,
as shown in Figure 4a, where Source-grp is the ID of the source
group and Target-grp is the ID of the target group. However, this
raises two concerns about performance degradation.
Source-grp Target-grp
s1 t1, t4, t6
s2 t8, t10, t12
... ...
s5 t2, t4, t6
s6 t8, t10, t12
(a)
Source-grp Target-grp
s1 t2, t4, t6
s5 t2, t4, t6
s2 t8, t10, t12
s6 t8, t10, t12
... ...
(b)
Fig. 4: (a) Non-optimized inter-group memory access; (b)
Optimized inter-group memory access.
The first issue is inter-group memory irregularity and low
data reuse. For example, the target group information (t1, t4,
t6) required by source group s1 cannot be reused by s2, since
s2 requires quite different target groups (t8, t10, and t12) for
distance computation. Thus, additional costly memory accesses
have to be carried out. To tackle this problem, AccD places
source groups in contiguous memory space to maximize
the memory access efficiency whenever these source groups
have the same set of target groups as candidates for distance
computation. An example is shown in Figure 4b, where
the source groups s2 and s6 are placed side by side in
memory, since they have the same list of target groups (t8, t10,
and t12), which lets them take advantage of temporal
locality without issuing another memory access.
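One way to derive such a layout on the host is to bucket source groups by their candidate list, as in the Python sketch below. The function name order_source_groups is hypothetical and the candidate lists are the ones shown in Figure 4b; the actual AccD layout pass may differ.

```python
from collections import defaultdict


def order_source_groups(candidates):
    """Place source groups with identical target-group lists next to each other.

    candidates maps a source-group ID to the list of target-group IDs that
    survive filtering. Returns the source-group IDs in their new memory order.
    """
    buckets = defaultdict(list)
    for src, targets in candidates.items():
        # Sort so that the same set of targets always yields the same key.
        buckets[tuple(sorted(targets))].append(src)
    layout = []
    for targets in buckets:  # dicts keep insertion order (Python 3.7+)
        layout.extend(buckets[targets])
    return layout


candidates = {
    "s1": ["t2", "t4", "t6"],
    "s2": ["t8", "t10", "t12"],
    "s5": ["t2", "t4", "t6"],
    "s6": ["t8", "t10", "t12"],
}
layout = order_source_groups(candidates)
```

In the resulting order, s1 sits next to s5 and s2 next to s6, so the target groups fetched for one source group can be reused by its neighbor.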
The second issue is intra-group memory irregularity. For
example, points from groups 1, 2, and 3 are interleaved in
the memory space, as shown in Figure 5b. However,
a group of points is usually accessed simultaneously due
to the GTI optimization. This causes frequent, inefficient
memory accesses for fetching individual points scattered across
non-contiguous memory addresses. To solve this issue, AccD
offers a second memory optimization that re-organizes the
target/source points of the same target/source group into
contiguous memory space within the same memory bank,
as illustrated in Figure 5c. This strategy largely benefits
memory coalescing and external memory bandwidth while
minimizing access contention, since points inside the same
bank can be accessed efficiently and points inside different
banks can be accessed in parallel.
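The re-organization can be sketched as a simple packing step in Python. The function name pack_by_group is hypothetical, the group-to-point mapping is taken from the Figure 5 table, and the assignment of groups to memory banks is deliberately not modeled here.

```python
def pack_by_group(points, groups):
    """Copy points so that every group occupies one contiguous range.

    points : mapping from point ID to its record
    groups : mapping from group name to the IDs of its points
    Returns the packed list and, per group, its (start, end) range in it.
    """
    packed, ranges = [], {}
    for name, ids in groups.items():
        start = len(packed)
        packed.extend(points[i] for i in ids)
        ranges[name] = (start, len(packed))
    return packed, ranges


# Nine points with IDs 1..9; the coordinates are placeholders.
points = {i: (float(i), 0.0) for i in range(1, 10)}
groups = {"Grp1": [3, 8, 9], "Grp2": [5, 6, 7], "Grp3": [1, 2, 4]}

packed, ranges = pack_by_group(points, groups)
packed_ids = [int(p[0]) for p in packed]
```

After packing, fetching a whole group is a single contiguous read of the range recorded for that group.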
Group Points
Grp1 3, 8, 9
Grp2 5, 6, 7
Grp3 1, 2, 4
... ...
(a)
[Figure 5 (b) and (c): memory layout diagrams showing how points 1 to 9 of Grp 1, Grp 2, and Grp 3 are placed across Bank 1, Bank 2, and Bank 3 before and after alignment; the exact placement is not recoverable from the extracted text.]
Fig. 5: (a) Group-point mapping; (b) Non-aligned intra-group
memory; (c) Aligned intra-group memory.
B. Distance Computation Kernel
Distance computation dominates the time complexity of
distance-related algorithms. In AccD, after TI filtering on
the CPU, the remaining distance computations are accelerated on
the FPGA. Points involved in the remaining distance computations
are organized into two sets, the source set and the target set,
which are stored as two matrices, $\mathit{Mat}_A$ ($m \times d$)
and $\mathit{Mat}_B$ ($n \times d$), respectively, where each row
represents a point with $d$ dimensions. The distance computation
between $\mathit{Mat}_A$ and $\mathit{Mat}_B$ can be decomposed
into three parts, as shown in Equation 4,
$$(\mathit{Mat}_A - \mathit{Mat}_B)^2 = \mathit{Mat}_A^2 - 2\,\mathit{Mat}_A \cdot \mathit{Mat}_B + \mathit{Mat}_B^2 \tag{4}$$
where $\mathit{Mat}_A^2$ and $\mathit{Mat}_B^2$ only take
$O(m \times d)$ and $O(n \times d)$, respectively, while
$\mathit{Mat}_A \cdot \mathit{Mat}_B$ takes $O(m \times n \times d^2)$,
which dominates the overall computation complexity. AccD
accelerates $\mathit{Mat}_A \cdot \mathit{Mat}_B$ through
highly efficient matrix-matrix multiplication, which maps well
to a hardware implementation on the FPGA.
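The decomposition in Equation 4 can be checked on a tiny example. The following is a plain-Python sketch rather than the OpenCL kernel used by AccD; the function names are illustrative, and the cross term is written as the inner product of every source row with every target row.

```python
def row_square_sum(mat):
    """Row-wise square sum: one scalar per point."""
    return [sum(x * x for x in row) for row in mat]

def cross_term(mat_a, mat_b):
    """Inner product of every source row with every target row."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in mat_b]
            for ra in mat_a]

def squared_distances(mat_a, mat_b):
    """Squared Euclidean distances via |a|^2 - 2 a.b + |b|^2."""
    rss_a, rss_b = row_square_sum(mat_a), row_square_sum(mat_b)
    dots = cross_term(mat_a, mat_b)
    return [[rss_a[i] - 2.0 * dots[i][j] + rss_b[j]
             for j in range(len(mat_b))]
            for i in range(len(mat_a))]

mat_a = [[0.0, 0.0], [3.0, 4.0]]                # m = 2 source points, d = 2
mat_b = [[3.0, 4.0], [6.0, 8.0], [0.0, 1.0]]    # n = 3 target points
dist2 = squared_distances(mat_a, mat_b)
# dist2 == [[25.0, 100.0, 1.0], [0.0, 25.0, 18.0]]
```

Only `cross_term` touches every source/target pair, which is why that part is the one worth mapping to a matrix-multiplication kernel.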
[Figure 6: the source set (A) and target set (B) matrices feed Row Square Sum (RSS) units and an OpenCL matrix-multiplication (MM) kernel organized into kernel blocks (work groups) with on-chip local memory; the MM results are scaled by 2 and combined with the RSS terms to give the distance results.]
Fig. 6: AccD Matrix-based Distance Computation.
The overall computation process is shown in Figure 6. The
Row-wise Square Sums (RSS) of the source set ($\mathit{Mat}_A$)
and the target set ($\mathit{Mat}_B$) are pre-computed in a
fully parallel manner. The vector multiplication between each
source and target point is mapped to an OpenCL kernel thread for
fine-grained parallelization. Moreover, a block of threads,
highlighted by the red square box in Figure 6, forms a kernel
thread workgroup, which shares a part of the source and
target points to increase on-chip data locality. Based on
this kernel organization, the AccD hardware architecture
offers several tunable hyperparameters for trading off
performance against resources, such as the size of the kernel
block and the number of parallel pipelines in each kernel block.
To efficiently find the "optimal" parameters that maximize
overall performance while respecting the constraints, we use
the AccD explorer for design space search, which is detailed in
Section VI-B.
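The kernel organization described above can be mimicked on the host to see what the block size controls. This is a sequential Python stand-in, not OpenCL code: each tile plays the role of one work group, the sliced lists stand in for on-chip local memory, and the name `blocked_distances` and its `blk` argument are chosen here for illustration.

```python
def blocked_distances(src, trg, blk):
    """Squared distances computed tile by tile.

    Each (i0, j0) tile models one kernel work group: it loads up to blk
    source rows and blk target rows once and reuses them for every
    source/target pair inside the tile.
    """
    rss_s = [sum(x * x for x in p) for p in src]    # pre-computed RSS
    rss_t = [sum(x * x for x in p) for p in trg]
    out = [[0.0] * len(trg) for _ in src]
    for i0 in range(0, len(src), blk):
        src_tile = src[i0:i0 + blk]                 # shared by the tile
        for j0 in range(0, len(trg), blk):
            trg_tile = trg[j0:j0 + blk]
            for di, a in enumerate(src_tile):       # one "thread" per pair
                for dj, b in enumerate(trg_tile):
                    dot = sum(x * y for x, y in zip(a, b))
                    i, j = i0 + di, j0 + dj
                    out[i][j] = rss_s[i] - 2.0 * dot + rss_t[j]
    return out

src = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
trg = [[3.0, 4.0], [0.0, 1.0]]
tiled = blocked_distances(src, trg, blk=2)
```

The result does not depend on `blk`; a larger block only changes how many pairs reuse each loaded row, which is the data-reuse versus on-chip-memory trade-off that the explorer tunes.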
VI. ACCD COMPILER
In this section, we detail the AccD compiler in two aspects:
design parameters and constraints, and design space exploration.
A. Design Parameters and Constraints
AccD uses a parameterized design strategy for better design
flexibility and efficiency. It takes the design parameters and
constraints from both the algorithm and the hardware to explore
and locate the "optimal" design point tailored for the specific
application scenario. At the algorithm level, the number of
groups affects the distance computation filtering performance.
At the hardware level, there are three parameters: 1) Size of
computation block, which decides the size of the data shared by
a group of computing elements; 2) SIMD factor, which decides the
number of computing elements inside each computation block;
3) Unroll factor, which sets the degree of parallelization within
each single distance computation. In addition, there are several
hardware constraints, such as the on-chip memory size, the
number of logic units, and the number of registers. All of
these parameters and constraints are included in our analytical
model for design exploration.
B. Design Space Exploration
Finding the best combination of design configurations (a
set of hyper-parameters) under the given constraints requires
non-trivial efforts in the design space search. Therefore, we
incorporate an AccD explorer in our compiler framework for
efficient design space exploration (Figure 7). AccD explorer
takes a set of raw configurations (hyper-parameters) as the
initial input, and generates the optimal configuration as the
output through several iterations of the design configuration
optimization process. In particular, AccD explorer consists of
three major phases: Configuration Generation and Selection,
Performance and Resource Modeling, Constraints Validation.
a) Configuration Generation and Selection: The functionality
of this phase depends on its input. There are two kinds
of inputs. If the input is the set of initial configurations,
this phase directly feeds these configurations to the modeling
phase for performance and resource evaluation. If the input is
the result of the constraints validation in the previous
iteration, this phase uses the genetic algorithm to cross over
the "premium" configurations kept from the previous iteration
and generates a new set of configurations for the modeling phase.
[Figure 7: the AccD compiler takes a DDSL algorithm description through data-algorithm abstraction (data and operation specifications), grouping strategy selection, filtering strategy selection, and the AccD explorer, and targets the Intel OpenCL SDK. The AccD explorer takes initial configurations through configuration generation and selection (genetic algorithm), performance and resource modeling (hardware micro-benchmark), and constraints validation (design constraints); it loops while constraints or performance are unsatisfied and outputs optimized configurations.]
Fig. 7: AccD Explorer.
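b) Performance Modeling: Performance modeling measures
the design latency and bandwidth requirement based
on the input design configurations. We formulate the design
latency by using Equation 5,
$$\mathit{Latency} = \mathit{Latency}_{\text{filt}} + \mathit{Latency}_{\text{comp}} \tag{5}$$
where $\mathit{Latency}_{\text{filt}}$ and $\mathit{Latency}_{\text{comp}}$
are the time of the GTI filtering process and of the remaining
distance computations, respectively. They can be calculated as in Equation 6,
$$\begin{aligned}
\mathit{Latency}_{\text{filt}} &= \frac{n_{\text{trg\_grp}} \times n_{\text{src\_grp}} \times \mathit{src}_{\text{size}} \times \mathit{trg}_{\text{size}} \times d}{n_{\text{iteration}}} \\
\mathit{Latency}_{\text{comp}} &= \frac{\mathit{src}_{\text{size}} \times \mathit{trg}_{\text{size}} \times \mathit{ratio}_{\text{save}} \times d}{\mathit{blk}^2 \times \mathit{frequency} \times \mathit{unroll} \times \mathit{simd}}
\end{aligned} \tag{6}$$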
where $n_{\text{src\_grp}}$ and $n_{\text{trg\_grp}}$ are the number of
groups for source and target points, respectively;
$\mathit{src}_{\text{size}}$ and $\mathit{trg}_{\text{size}}$ are
the number of points inside the source and target set, respectively;
$d$ is the data dimensionality; $n_{\text{iteration}}$ is the number of
grouping iterations; $\mathit{blk}$ is the size of the computation kernel block;
$\mathit{frequency}$ is the FPGA design clock frequency; $\mathit{unroll}$ is
the distance computation unroll factor; $\mathit{simd}$ is the number
of parallel worker threads inside each computation block;
$\mathit{ratio}_{\text{save}}$ is the distance saving ratio through GTI filtering
(Equation 7),
$$\mathit{ratio}_{\text{save}} = \frac{n_{\text{iteration}}}{\alpha} \times \sqrt{\frac{\mathit{src}_{\text{size}} \times \mathit{trg}_{\text{size}}}{n_{\text{src\_grp}} \times n_{\text{trg\_grp}}}} \tag{7}$$
where α is the density of the point distribution. This formula
also shows that increasing the number of iterations or the number
of points inside each group improves the GTI filtering
performance, whereas increasing the point distribution density α
(i.e., points are closer to each other) decreases the GTI
filtering performance.
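To get the required bandwidth $BW$ of the current design,
we use Equation 8,
$$BW = \frac{(\mathit{src}_{\text{size}} + \mathit{trg}_{\text{size}}) \times d \times \mathit{size}_{\text{data\_type}}}{\mathit{Latency}} \tag{8}$$
where $\mathit{size}_{\text{data\_type}}$ can be either 32-bit for int and float
or 64-bit for double.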
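Equations 5 to 8 translate directly into a small evaluation routine of the kind the modeling phase needs. The sketch below transcribes the formulas as written and does not add units or calibration. The function names are illustrative, expressing the data type size in bytes is an assumption, and the input values are toy numbers chosen only to keep the arithmetic easy to follow.

```python
import math

def ratio_save(n_iteration, alpha, src_size, trg_size, n_src_grp, n_trg_grp):
    """Equation 7, transcribed as written."""
    return (n_iteration / alpha) * math.sqrt(
        (src_size * trg_size) / (n_src_grp * n_trg_grp))

def latency(src_size, trg_size, d, n_src_grp, n_trg_grp, n_iteration,
            alpha, blk, frequency, unroll, simd):
    """Equations 5 and 6: filtering latency plus computation latency."""
    filt = (n_trg_grp * n_src_grp * src_size * trg_size * d) / n_iteration
    save = ratio_save(n_iteration, alpha, src_size, trg_size,
                      n_src_grp, n_trg_grp)
    comp = (src_size * trg_size * save * d) / (
        blk ** 2 * frequency * unroll * simd)
    return filt + comp

def bandwidth(src_size, trg_size, d, size_data_type, total_latency):
    """Equation 8: data volume divided by latency."""
    return (src_size + trg_size) * d * size_data_type / total_latency

# Toy inputs, not measurements from the paper.
params = dict(src_size=4, trg_size=4, d=2, n_src_grp=2, n_trg_grp=2,
              n_iteration=2, alpha=8.0, blk=2, frequency=1.0,
              unroll=1, simd=1)
lat = latency(**params)                       # 64 (filt) + 4 (comp) = 68
bw = bandwidth(4, 4, 2, size_data_type=4, total_latency=lat)
```

Because every term is a closed-form expression of the configuration, thousands of candidate designs can be scored without synthesizing any of them.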
c) Resource Modeling: Directly measuring the hardware
resource usage of the accelerator design from a high-level
algorithm description is challenging because of the hidden
transformations and optimizations in the hardware design suite.
Instead, AccD uses a micro-benchmark based methodology to
estimate the hardware resource usage through analytical
modeling. The major hardware resource consumption of AccD comes
from the distance computation kernel, which depends on several
design factors, including the kernel block size and the number
of SIMD workers.
In the AccD resource analytical model, the design factors are
classified into two categories: dataset-dependent and
dataset-independent factors. The main idea behind AccD resource
modeling is to obtain exact hardware resource consumption
statistics by micro-benchmarking hardware designs with different
dataset-independent factors. For example, we can benchmark a
single distance computation kernel with different computation
block sizes to get its resource statistics, since this factor is
dataset-independent and can be decided before knowing the
dataset details. To estimate the resource consumption for
datasets with different sizes and dimensionalities, AccD then
uses a formula-based approach (Equation 9) that combines online
information (e.g., kernel organization and dataset properties)
with offline information (e.g., micro-benchmark statistics).
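$$\mathit{Resource}_{\text{est}} = \mathit{Resource}_{\text{single}} \times \left\lceil \frac{\mathit{src}_{\text{size}}}{\mathit{blk}} \right\rceil \times \left\lceil \frac{\mathit{trg}_{\text{size}}}{\mathit{blk}} \right\rceil \tag{9}$$
where the type of $\mathit{Resource}$ can be on-chip memory,
computing units, or logic operation units;
$\mathit{Resource}_{\text{est}}$ is the estimated usage of a
certain type of resource for the overall design;
$\mathit{Resource}_{\text{single}}$ is the usage of that type
of resource for only one distance computation kernel block.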
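Equation 9 is a one-line computation once the per-block figure is known. In the sketch below, the function name and the per-block value of 2 units are hypothetical stand-ins for a micro-benchmark result; the dataset sizes are those of Poker Hand in Table V (25,010 points, 158 clusters).

```python
import math

def estimate_resource(resource_single, src_size, trg_size, blk):
    """Equation 9: per-block usage scaled by the number of blocks
    needed to cover the source and target sets."""
    return (resource_single
            * math.ceil(src_size / blk)
            * math.ceil(trg_size / blk))

# resource_single = 2 is a made-up micro-benchmark figure for blk = 16.
blocks_src = math.ceil(25010 / 16)     # 1564 blocks along the source set
blocks_trg = math.ceil(158 / 16)       # 10 blocks along the target set
estimate = estimate_resource(2, src_size=25010, trg_size=158, blk=16)
```

The ceiling terms account for partially filled blocks at the edges of the source and target sets.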
d) Constraints Validation: Constraints validation is the
third phase of the AccD explorer. It checks whether the
resource consumption of a given configuration is within the
budget of the given hardware platform. The inputs of this phase
are the resource estimation results from the resource modeling
step. The design constraint inequalities are listed in
Equation 10, which covers $\mathit{Mem}$ (the size of on-chip
memory), $BW$ (the bandwidth of data communication between
external memory and on-chip memory), $\mathit{Computing\ Unit}$
(the number of computing units), and $\mathit{Logic\ Unit}$ (the
number of logic units):
$$\begin{aligned}
BW &\le BW_{\max}, \\
\mathit{Mem} &\le \mathit{Mem}_{\max}, \\
\mathit{Computing\ Unit} &\le \mathit{Computing\ Unit}_{\max}, \\
\mathit{Logic\ Unit} &\le \mathit{Logic\ Unit}_{\max}
\end{aligned} \tag{10}$$
The constraints validation phase also discards the
configurations that do not meet the performance and constraint
requirements, and keeps only the "well-performed" configurations
for further optimization in the next iteration. It also records
the modeling results of the best configuration in the last
iteration. The optimization process terminates when the
difference in modeling results between the best configurations
of two consecutive iterations falls below a predefined
threshold, which avoids unnecessary time cost. After the AccD
explorer terminates, the "best" configuration with maximum
design performance under the given constraints is output as
the "optimal" solution for the AccD design.
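The three explorer phases can be put together as a small search loop. The sketch below is a simplified illustration and not AccD's implementation: the uniform crossover, the `keep` and `threshold` defaults, the toy scoring function, and the toy memory budget are all assumptions introduced here.

```python
import random

def explore(initial, score, is_valid, keep=4, threshold=1e-3,
            max_iters=50, seed=0):
    """Iterate: validate -> keep the best -> crossover -> repeat.

    initial:  list of configurations, each a dict of hyper-parameters
    score:    callable giving modeled performance (higher is better)
    is_valid: callable returning True if all constraints are met
    """
    rng = random.Random(seed)
    population, best, best_score = list(initial), None, float("-inf")
    for _ in range(max_iters):
        valid = [c for c in population if is_valid(c)]   # constraints check
        if not valid:
            break
        valid.sort(key=score, reverse=True)
        premium = valid[:keep]                 # configurations kept
        top_score = score(premium[0])
        improved = top_score - best_score
        if top_score > best_score:
            best, best_score = premium[0], top_score
        if improved < threshold:               # converged: stop early
            break
        # Crossover: each child takes every parameter from one of two parents.
        children = []
        for _ in range(len(initial)):
            pa, pb = rng.choice(premium), rng.choice(premium)
            children.append({k: rng.choice((pa[k], pb[k])) for k in pa})
        population = premium + children
    return best

def score(cfg):                                # toy performance model
    return cfg["blk"] * cfg["simd"]

def is_valid(cfg):                             # toy on-chip memory budget
    return cfg["blk"] ** 2 * cfg["simd"] <= 1024

initial = [{"blk": 4, "simd": 1}, {"blk": 32, "simd": 4},
           {"blk": 8, "simd": 2}, {"blk": 16, "simd": 1}]
best = explore(initial, score, is_valid)
```

Keeping the premium configurations in the next population means the best score never decreases, so the threshold test on consecutive iterations is a safe stopping rule.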
VII. EVALUATION
In this section, we choose three representative benchmarks
(K-means, KNN-join, and N-body Simulation) and evaluate
their corresponding AccD designs on the CPU-FPGA plat-
form.
a) K-means: K-means [1, 18, 28–30] clusters a set of
points into several groups in an iterative manner. At each
iteration, it first computes the distances between each point and
all clusters, and then updates the clusters based on the average
position of the points assigned to them. We choose it as a
benchmark since it shows the benefits of AccD hierarchy
(Trace-based + Group-level) bound computation optimization on
iterative algorithms with disjoint source and target sets.
b) KNN-join: KNN-join Search [2, 19, 31] finds the Top-
K nearest neighbor points for each point in the source set
from the target set. It first computes the distances between
each source point and all the target points. Then it ranks
the K-smallest distances for each source point and gets its
corresponding closest Top-K target points. KNN-join can
help to demonstrate the effectiveness of AccD hybrid (Two-
landmark + Group-level) bound computation optimization
on non-iterative algorithms.
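For reference, the two steps of KNN-join described above (compute all distances, then rank the K smallest) look as follows in their unoptimized form. This sketch corresponds to the naive algorithm only, with no TI filtering or hardware acceleration; the function name and the sample points are illustrative.

```python
import heapq

def knn_join(source, target, k):
    """For each source point, return the indices of its k closest
    target points, nearest first."""
    result = []
    for s in source:
        # Step 1: squared distance from s to every target point.
        dist2 = [sum((a - b) ** 2 for a, b in zip(s, t)) for t in target]
        # Step 2: keep the k smallest distances.
        result.append(heapq.nsmallest(k, range(len(target)),
                                      key=dist2.__getitem__))
    return result

source = [(0.0, 0.0), (5.0, 5.0)]
target = [(1.0, 0.0), (4.0, 4.0), (0.0, 2.0), (6.0, 5.0)]
neighbors = knn_join(source, target, k=2)
# neighbors == [[0, 2], [3, 1]]
```

Step 1 is where all source/target pairs are touched, so it is the part that bound-based filtering and FPGA acceleration are meant to shrink.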
c) N-body Simulation: N-body Simulation [32, 33] mim-
ics the particle movement within a certain range of 3D space.
At each time step, distances between each particle and its
neighbors (within a radius R) are first computed, and then
the acceleration and the new position of each particle will be
updated based on these distances. While N-body simulation
is also iterative, it has several differences compared with K-
means algorithm: 1) N-body simulation has the same dataset
(particles) for source and target set, whereas K-means operates
on different source (point) and target (cluster) sets; 2) All
points in the N-body simulation would change their positions
according to the time variation, whereas in K-means only the
target set (cluster) would change their positions during the
center update; 3) N-body simulation has the same size of
source and target set, whereas K-means target set (cluster)
is much smaller than source set (point) in general. N-body
simulation can help us to show the strength of AccD hybrid
bound computation (Two-landmark + Trace-based + Group-
level) on iterative algorithms with the same source and target
set.
A. Experiment Setup
a) Tools and Metrics: In our evaluation, we use Intel
Stratix 10 DE10-Pro [34] as the FPGA accelerator and run
the host side software program on Intel Xeon Silver 4110 pro-
cessor [35] (8-core 16-thread, 2.1GHz base clock frequency,
85W TDP). DE10-Pro FPGA has 378,000 Logic elements
(LEs), 128,160 adaptive logic modules (ALM), 512,640 ALM
registers, 648 DSPs, and 1,537 M20K memory blocks. We
implement AccD design on DE10-Pro by using Intel Quartus
Prime Software Suite [36] with Intel OpenCL SDK included.
To measure the system power consumption (Watt) accurately,
we use the off-the-shelf Poniie PN2000 as the external power
meter to get the runtime power of Xeon CPU and DE10 Pro
FPGA.
TABLE IV: Implementation Description.
Name | Techniques | Description
Baseline | Standard Algorithm without any optimization, CPU. | Naive for-loop based implementation on CPU.
TOP | Point-based Triangle-inequality Optimized Algorithms, CPU. | TOP [11] optimized distance-related algorithm running on CPU.
CBLAS | CBLAS library Accelerated Algorithms, CPU. | Standard distance-related algorithm with CBLAS [37] acceleration.
AccD | Algorithmic-hardware co-design, CPU-FPGA platform. | GTI filtering and FPGA acceleration of distance computations.
b) Implementations: The CPU-based implementations
consist of three types of programs: the naive for-loop se-
quential implementation without any optimization (selected as
our Baseline to normalize the speedup and energy-efficiency),
the algorithm optimized by TOP [11] framework and the
algorithm optimized by CBLAS [37] computing library. Note
that the TOP + CBLAS implementation is not included in our
evaluation, since after applying TOP point-based TI filtering,
each point in the source set has a distinctive list of points
from the target set for distance computation, whereas CBLAS
requires uniformity in the distance computations. Therefore, it
is challenging to combine TOP and CBLAS optimization.
c) Dataset: In the evaluation, we use six datasets for
each algorithm. The selected datasets cover a wide
spectrum of mainstream datasets, including datasets from the UCI
Machine Learning Repository [38] and datasets that have
been used by previous papers [9, 11, 12] in the related
domains. Details of these datasets are listed in Table V.
Note that the KNN-join algorithm finds the Top-1000 closest
neighbors of each query point.
B. Comparison with Software Implementation
a) Performance Comparison: As shown in Figure 8,
TOP, CBLAS, and AccD achieve average speedups of 9.12×, 9.19×,
and 31.42× over Baseline across all algorithm and
dataset settings, respectively. The AccD design always
maintains the highest speedup among these implementations.
This is largely due to the AccD GTI optimization, which reduces
distance computations, and its efficient hardware acceleration
of the distance computation on the FPGA.
We also observe that the TOP implementation shows its strength
on large datasets. For example, on dataset 3D Spatial Network
(n = 434,874) in KNN-join, the TOP implementation
K-means | | | | KNN-join | | | N-body Simulation |
Dataset | Size | Dimension | #Cluster | Dataset | Dimension | #Source | Dataset | #Particle
Poker Hand | 25,010 | 11 | 158 | Harddrive1 | 64 | 68,411 | P-1 | 16,384
Smartwatch Sens | 58,371 | 12 | 242 | Kegg Net Directed | 24 | 53,413 | P-2 | 32,768
Healthy Older People | 75,128 | 9 | 274 | 3D Spatial Network | 3 | 434,874 | P-3 | 59,049
KDD Cup 2004 | 285,409 | 74 | 534 | KDD Cup 1998 | 56 | 95,413 | P-4 | 78,125
Kegg Net Undirected | 65,554 | 28 | 256 | Skin NonSkin | 4 | 245,057 | P-5 | 177,147
Ipums | 70,187 | 60 | 265 | Protein | 11 | 26,611 | P-6 | 262,144
TABLE V: Datasets for Evaluation.
[Figure 8: three bar charts, (a) K-means, (b) KNN-join, and (c) N-body Simulation, plotting Speedup (x) of TOP, CBLAS, and AccD for each dataset in Table V.]
Fig. 8: Performance Comparison (TOP, CBLAS, AccD): (a) K-means (b) KNN-Join (c) N-body Simulation. Note: Speedup is
normalized w.r.t Baseline.
[Figure 9: three bar charts, (a) K-means, (b) KNN-join, and (c) N-body Simulation, plotting Energy Efficiency (x) of TOP, CBLAS, and AccD for each dataset in Table V.]
Fig. 9: Energy Efficiency Comparison (TOP, CBLAS, AccD): (a) K-means. (b) KNN-Join. (c) N-body Simulation. Note: Energy
Efficiency is normalized w.r.t Baseline.
achieves 39.78× speedup, since the fine-grained point-based
TI optimization of TOP can remove most (more than 90%)
of the unnecessary distance computations, which benefits the
overall performance to a great extent. Note that the intrinsic
point distribution of the dataset also affects the filtering
performance of TOP, but in general, a larger dataset lets
TOP spot and remove more redundant computations.
We also notice that the CBLAS implementation performs best
on datasets with relatively high dimensionality. For example,
on dataset KDD Cup 2004 (d = 74) in the K-means algorithm,
CBLAS achieves 11.78× speedup over Baseline, which is higher
than its speedup on the other K-means datasets. This is because,
on high-dimensional datasets, the CBLAS implementation benefits
more from parallelized computing and more regular memory access,
whereas in low-dimensional settings the same optimization
yields only a minor speedup.
Our AccD design achieves a considerable speedup on
datasets with large size and high dimensionality. For example,
on datasets KDD Cup 2004 (n = 285,409, d = 74) and Ipums
(n = 70,187, d = 60) in K-means, AccD achieves 51.61× and
66.61× speedup over Baseline, which is also significantly higher
than both the TOP and CBLAS implementations. This conclusion
also extends to KNN-join, for example the 88.95× speedup
on dataset KDD Cup 1998 (n = 95,413, d = 56). The reason is
that our AccD design effectively reconciles the benefits of
both the GTI optimization and the FPGA acceleration, where
the former provides the opportunity to reduce the distance
computation at the algorithm level, and the latter boosts the
performance from the hardware acceleration perspective. More
importantly, our AccD design can balance these two
benefits to maximize the overall performance.
b) Energy Comparison: The energy efficiency of the AccD
design is also significant. For example, on the K-means
algorithm, AccD designs deliver on average 116.85× better
energy efficiency than Baseline, which is significantly
higher than the TOP and CBLAS implementations. There are
mainly two reasons behind these results: 1) Much lower power
consumption. The AccD CPU-FPGA design consumes only 5 W to
17.12 W across all algorithm and dataset settings, whereas the
Intel Xeon CPU consumes at least 20.9 W and 42.49 W on the TOP
and CBLAS implementations, respectively; 2) Considerable
performance. The AccD design achieves a much better speedup
(more than 5× on average) compared with TOP and
CBLAS, which contributes to the overall energy efficiency.
Among these implementations, the CBLAS implementation has
the lowest energy efficiency, since it relies on the multi-core
parallel processing capability of the CPU, which improves
performance at the cost of much higher power consumption
(average 65.79 W). TOP only uses the single-core processing
capability of the CPU and achieves moderate performance
with effective distance computation reduction, which results in
less power consumption (average 25.59 W) and higher energy
efficiency (average 9.12×) compared with Baseline. Different
from the TOP and CBLAS implementations, the AccD design is
built upon a low-power platform with considerable performance,
which gives a far better energy-performance trade-off.
C. Performance Benefits Analysis
To analyze the performance benefits of the AccD CPU-FPGA
design in detail, we use K-means as the example algorithm.
Specifically, we build four implementations for comparison:
1) TOP K-means on CPU; 2) TOP K-means on the CPU-FPGA platform;
3) AccD K-means on CPU; 4) AccD K-means on the CPU-FPGA
platform. Note that TOP K-means is designed for sequential
CPUs, and there is no publicly available TOP implementation
on CPU-FPGA platforms. For a fair comparison,
we implement TOP K-means on the CPU-FPGA platform with
memory optimizations (inter-group and intra-group memory
optimization) and distance computation kernel optimization
(Vector-Matrix multiplication). These optimizations improve
data reuse and memory access performance. We compute
[Figure 10: bar chart of Speedup (x) per dataset for TOP (CPU), TOP (CPU-FPGA), AccD (CPU), and AccD (CPU-FPGA).]
Fig. 10: AccD Performance Benefits Breakdown.
the normalized speedup performance of each implementation
w.r.t the naive for-loop based K-means implementation on
CPU.
As shown in Figure 10, AccD K-means on the CPU-FPGA platform
always delivers the best overall speedup among these
implementations. We also observe that TOP K-means achieves an
average 3.77× speedup on CPU; however, directly porting this
optimization to the CPU-FPGA platform can even lead to
inferior performance (average 2.63×). Even though we add
several possible optimizations, applying the fine-grained TI
optimization from TOP still causes a large divergence of
computation among points, leading to low data reuse and
inefficient memory access.
We also notice that the AccD design on CPU achieves a lower
speedup (average 2.69×) than TOP (average 3.77×), since its
coarse-grained GTI optimization spots fewer unnecessary
distance computations. However, when the AccD design is
combined with the CPU-FPGA platform, the benefits of the AccD
GTI optimization become prominent (average 37.37×), since it
maintains computation regularity while reducing memory overhead
to facilitate hardware acceleration on the FPGA. In contrast,
applying optimizations that maximize the algorithm-level
benefits while ignoring hardware-level properties results in
poor performance, as the TOP (CPU-FPGA) implementation shows.
Moreover, the comparison of AccD (CPU) and AccD (CPU-FPGA)
demonstrates the effectiveness of using the FPGA as a hardware
accelerator, which delivers an additional 9.68× to 15.71×
speedup compared with the software-only solution.
VIII. CONCLUSION
In this paper, we present the AccD compiler framework
for accelerating distance-related algorithms on CPU-FPGA
platforms. Specifically, AccD uses a simple but
expressive language construct (DDSL) to unify distance-related
algorithms, and an optimizing compiler to improve the
design performance from both the algorithmic and the hardware
perspective systematically and automatically. Rigorous
experiments on three popular algorithms (K-means, KNN-join, and
N-body simulation) demonstrate that AccD is a powerful and
comprehensive framework for hardware acceleration of
distance-related algorithms on modern CPU-FPGA platforms.
REFERENCES
[1] S. Lloyd. Least squares quantization in pcm. IEEE Transactions on
Information Theory, 1982.
[2] Naomi S Altman. An introduction to kernel and nearest-neighbor
nonparametric regression. The American Statistician, 1992.
[3] M. Trenti and P. Hut. N-body simulations (gravitational). Scholarpedia,
2008.
[4] H. M. Hussain, K. Benkrid, H. Seker, and A. T. Erdogan. Fpga
implementation of k-means algorithm for bioinformatics application: An
accelerated approach to clustering microarray data. In 2011 NASA/ESA
Conference on Adaptive Hardware and Systems (AHS), 2011.
[5] Zhongduo Lin, Charles Lo, and Paul Chow. K-means implementation
on fpga for high-dimensional data using triangle inequality. In 22nd In-
ternational Conference on Field Programmable Logic and Applications
(FPL), 2012.
[6] T. Saegusa and T. Maruyama. An fpga implementation of k-means
clustering for color images based on kd-tree. In 2006 International
Conference on Field Programmable Logic and Applications (FPL),
2006.
[7] Zhe-Hao Li, Ji-Fang Jin, Xue-Gong Zhou, and Zhi-Hua Feng. K-nearest
neighbor algorithm implementation on fpga using high level synthesis. In
2016 13th IEEE International Conference on Solid-State and Integrated
Circuit Technology (ICSICT), 2016.
[8] Charles Elkan. Using the triangle inequality to accelerate k-means. In
Proceedings of the 20th International Conference on Machine Learning
(ICML), 2003.
[9] Yufei Ding, Yue Zhao, Xipeng Shen, Madanlal Musuvathi, and Todd
Mytkowicz. Yinyang k-means: A drop-in replacement of the classic k-
means with consistent speedup. In International Conference on Machine
Learning (ICML), 2015.
[10] J. Canilho, M. Véstias, and H. Neto. Multi-core for k-means clustering
on fpga. In 2016 26th International Conference on Field Programmable
Logic and Applications (FPL), 2016.
[11] Yufei Ding, Xipeng Shen, Madanlal Musuvathi, and Todd Mytkowicz.
Top: A framework for enabling algorithmic optimizations for distance-
related problems. Proc. VLDB Endow., 2015.
[12] Guoyang Chen, Yufei Ding, and Xipeng Shen. Sweet knn: An efficient
knn on gpu through reconciliation between redundancy removal and
regularity. In 2017 IEEE 33rd International Conference on Data
Engineering (ICDE), 2017.
[13] Ioannis Stamoulias and Elias S. Manolakos. Parallel architectures for
the knn classifier – design of soft ip cores and fpga implementations.
ACM Trans. Embed. Comput. Syst., 2013.
[14] E. S. Manolakos and I. Stamoulias. Ip-cores design for the knn classifier.
In Proceedings of 2010 IEEE International Symposium on Circuits and
Systems, 2010.
[15] H. M. Hussain, K. Benkrid, A. T. Erdogan, and H. Seker. Highly
parameterized k-means clustering on fpgas: Comparative results with
gpps and gpus. In 2011 International Conference on Reconfigurable
Computing and FPGAs, 2011.
[16] Dominique Lavenier. Fpga implementation of the k-means clustering
algorithm for hyperspectral images. Los Alamos National Laboratory
LAUR, 2000.
[17] G. Di Fatta and D. Pettinger. Dynamic load balancing in parallel kd-tree
k-means. In 2010 10th IEEE International Conference on Computer and
Information Technology (ICCIT), 2010.
[18] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D
Piatko, Ruth Silverman, and Angela Y Wu. An efficient k-means
clustering algorithm: Analysis and implementation. IEEE Transactions
on Pattern Analysis & Machine Intelligence (PAMI), 2002.
[19] C. Yang, X. Yu, and Y. Liu. Towards efficient knn joins on data streams.
In 2014 IEEE International Congress on Big Data, 2014.
[20] P. Sirait and A. M. Arymurthy. Cluster centres determination based
on kd tree in k-means clustering for area change detection. In 2010
International Conference on Distributed Frameworks for Multimedia
Applications, 2010.
[21] Ruicheng Zhong, Guoliang Li, Kian-Lee Tan, and Lizhu Zhou. G-tree:
An efficient index for knn search on road networks. In Proceedings of
the 22nd ACM International Conference on Information & Knowledge
Management (CIKM), 2013.
[22] Y. Wang, Z. Zeng, B. Feng, L. Deng, and Y. Ding. Kpynq: A work-
efficient triangle-inequality based k-means on fpga. In 2019 IEEE
27th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), 2019.
[23] Xing Zhang Zhiqiang Yang Jipan Huang Miren Tian, Xin’an Wang and
Hao Chen. The implementation of a knn classifier on fpga with a parallel
and pipelined architecture based on predetermined range search. In
2016 13th IEEE International Conference on Solid-State and Integrated
Circuit Technology (ICSICT), 2016.
[24] H. Hussain, K. Benkrid, C. Hong, and H. Seker. An adaptive fpga
implementation of multi-core k-nearest neighbour ensemble classifier
using dynamic partial reconfiguration. In 22nd International Conference
on Field Programmable Logic and Applications (FPL), 2012.
[25] H. M. Hussain, K. Benkrid, and H. Seker. An adaptive implementation
of a dynamically reconfigurable k-nearest neighbour classifier on fpga.
In 2012 NASA/ESA Conference on Adaptive Hardware and Systems
(AHS), 2012.
[26] Greg Hamerly. Making k-means even faster. In Proceedings of the 2010
SIAM international conference on data mining (SIAM), 2010.
[27] H. Hong, G. Juan, and W. Ben. An improved knn algorithm based on
adaptive cluster distance bounding for high dimensional indexing. In
2012 Third Global Congress on Intelligent Systems, 2012.
[28] Anil K Jain. Data clustering: 50 years beyond k-means. Pattern
recognition letters, 2010.
[29] Adam Coates and Andrew Y Ng. Learning feature representations with
k-means. In Neural networks: Tricks of the trade. 2012.
[30] Siddheswar Ray and Rose H Turi. Determination of number of clusters
in k-means clustering and application in colour image segmentation. In
Proceedings of the 4th international conference on advances in pattern
recognition and digital techniques, 1999.
[31] Michael Gowanlock. Knn-joins using a hybrid approach: Exploiting
cpu/gpu workload characteristics. In Proceedings of the 12th Workshop
on General Purpose Processing Using GPUs (GPGPU), 2019.
[32] Lars Nyland. Fast n-body simulation with cuda. 2007.
[33] Shigeru Ida and Junichiro Makino. N-body simulation of gravitational
interaction between planetesimals and a protoplanet: I. velocity distri-
bution of planetesimals. 1992.
[34] Terasic de10-pro stratix 10 gx/sx fpga development kit. URL http://www.terasic.com.tw/cgi-bin/page/archive.pl?CategoryNo=248&No=1144.
[35] Intel xeon silver 4110 processor. URL https://www.intel.com/content/www/us/en/products/processors/xeon/scalable/silver-processors/silver-4110.html.
[36] Intel quartus prime software suite. URL https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/overview.html.
[37] Qian Wang, Xianyi Zhang, Yunquan Zhang, and Qing Yi. Augem:
automatically generate high performance dense linear algebra kernels
on x86 cpus. In Proceedings of the International Conference on High
Performance Computing, Networking, Storage and Analysis (SC), 2013.
[38] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
URL http://archive.ics.uci.edu/ml.
