Stream processing dual-track CGRA for object inference by Fan, X et al.
Stream Processing Dual Track CGRA
for Object Inference
Xitian Fan, Di Wu, Wei Cao, Wayne Luk, Fellow, IEEE, and Lingli Wang, Member, IEEE
Abstract—With the development of machine learning technol-
ogy, the exploration of energy-efficient and flexible architectures
for object inference algorithms is of growing interest in recent
years. However, not many publications concentrate on coarse-
grained reconfigurable architecture (CGRA) for object inference
algorithms. This paper provides a stream processing, dual-track
programming CGRA-based approach to address the inherent
computing characteristics of algorithms in object inference.
Based on the proposed approach, an architecture called SDT-
CGRA is presented as an implementation prototype. To evaluate
the performance, the SDT-CGRA is realized in Verilog HDL
and implemented in SMIC 55nm process, with the footprint
of 5.19 mm2 at 450 MHz. Seven object inference algorithms
including CNN, k-means, PCA, SPM, linear-SVM, Softmax and
Joint-Bayesian are selected as benchmarks. Compared with the
Intel Xeon E5-2637 CPU and the Nvidia TitanX GPU, the experimental
results show that the SDT-CGRA achieves on average 343.8 times and
17.7 times higher energy efficiency, respectively, for Softmax, PCA
and CNN, and 621.0 times and 1261.8 times higher energy efficiency,
respectively, for k-means, SPM, linear-SVM and Joint-Bayesian.
Compared with the state-of-the-art solutions for AlexNet on FPGA
and CGRA, the proposed SDT-CGRA achieves a 1.78 times increase in
energy efficiency and a 13 times speedup, respectively.
Index Terms—coarse-grained reconfigurable architecture,
domain-specific computing, machine learning, deep learning,
object inference.
I. INTRODUCTION
WITH the breakthrough of deep learning technology in
speech applications [2], computer vision [3] and other
tasks in artificial intelligence (AI) [4], the architecture explo-
ration of related algorithms is a hot research topic in terms
of energy-efficiency and flexibility. For example, Google’s
Tensor Processing Unit (TPU) is built specifically for machine
learning acceleration and tailored for the TensorFlow software
framework [5], [6]. It is reported that TPU can achieve an order
of magnitude energy-efficiency enhancement compared to the
traditional approaches [5], [6]. In academia, deep learning
accelerators such as DaDianNao [7], ShiDianNao [8], Eyeriss [9]
and Cambricon [10] have shown impressive improvements
This paper is an extended version of an earlier paper [1] published in the
26th International Conference on Field-Programmable Logic and Applications
(FPL2016). This work is supported by National Natural Science Foundation
of China 61131001.
X. Fan, D. Wu, W. Cao and L. Wang are with the State Key
Laboratory of ASIC and System, Fudan University, Shanghai, 201203,
China. The corresponding authors are Wei Cao and Lingli Wang (E-mail:
xtfan14@fudan.edu.cn, 15110720011@fudan.edu.cn, wcao@fudan.edu.cn, ll-
wang@fudan.edu.cn).
W. Luk is with the Department of Computing, Imperial College London,
London SW7 2AZ, United Kingdom (E-mail: wl@doc.ic.ac.uk).
Digital Object Identifier XXXXXX.
[Figure 1: three-stage flow: Feature Extraction (CNN, ...), Feature Selection (k-means, SPM, PCA, ...), Inference (SVM, Joint-Bayesian, Softmax, ...); example output label: Cat.]
Fig. 1. A general flow of object inference.
in energy-efficiency compared to general purpose processors
and GPUs. However, application-specific accelerators (e.g.
DaDianNao, ShiDianNao, Eyeriss) are often tailored to specific
algorithms (e.g. Convolutional Neural Network, CNN). The
hardwired logic prevents them from migrating from one algorithm
to another. As a result, they may not be suitable for
accelerating the algorithms in an object inference flow, since
a general object inference flow contains not only CNN but
also other algorithms, as shown in Fig. 1. As for
application-specific instruction processors (ASIPs, e.g.
Cambricon), their energy efficiency is restricted by the
logic and memory overhead needed to support flexibility. In
contrast, coarse-grained reconfigurable architecture (CGRA), a
promising paradigm for domain-specific computing, has been
shown to outperform ASIPs in energy efficiency and to offer
more flexibility than application-specific accelerators [11].
However, few CGRA architectures concentrate on the domain
of machine learning, especially the object inference flow.
The increasing demand to process data streams more efficiently
for data-centric applications in embedded systems is
another motivation to design an appropriate CGRA. As shown
in Section III, the computing processes in an object inference
flow have three inherent characteristics. The first is stream
processing, which means an algorithm can be divided into
multiple computing kernels, each of which computes in a
stream manner. The second is fixed kernel-level operations,
which means the computing patterns of the kernels remain
unchanged over a long execution period (even up to thousands
of execution cycles), as in image filtering and the
gradient computation of an image in edge detection. The third
is the large amount of memory storage required for input data,
intermediate data and output results.
To design a CGRA that caters to these characteristics,
we first employ a cluster of processing elements (e.g. ALU,
multiplier) as an elementary reconfigurable cell (RC) for
kernel-level operations. This design choice is made because a
[Figure 2: (a) a group of operations (OP1, OP2, OP3, ...) that performs a kernel is loaded into the ALU repeatedly; (b) a Static Configuration Context (SCC) configures a kernel-level RC.]
Fig. 2. (a) Loading OPs in every execution cycle; (b) static configuration
[Figure 3: block diagrams; labels: ALU, Register File, RC interface, OP, VLIW; Cluster of Processing Elements, Stream interface, Stream Buffer Unit (Memory Block, Address Generator).]
Fig. 3. Migration from (a) word-level granularity to (b) stream-level
granularity.
complex cell is more likely to map a complex kernel
within a single cell rather than across multiple cells. As a result,
the cost of data transmission among multiple cells for complex
kernel operations can be reduced and hence the computation
can be sped up. This approach is similar to EGRA [12]
and FPCA [13], which have shown that a clustered RC array
can outperform traditional CGRAs that map only one
computing operation in each cell. Secondly, stream processing
is adopted as the programming paradigm. Given the characteristic
of fixed kernel-level operations, we propose a dual-track
programming model based on stream processing. The
key idea is to adopt static configuration to construct the
functionalities of RCs for kernel-level operations, and to apply
dynamic configuration to manage the transmission of data streams.
This approach is similar to the differentiation between the
configuration and the I/O data transfers in Morphsys [14].
However, the major difference is that our model applies
static configuration to fix the RCs’ functionalities and local
interconnections over a long execution period, while Morphsys
loads the operations (OPs) from context memory for its RCs
in every execution cycle, as illustrated in Fig. 2(a). Because
of the fixed kernel-level operations in
object inference applications, the group of OPs for a specific
kernel is executed repeatedly. In this case, the repeated OPs
can be replaced by a static configuration in our approach (as
shown in Fig. 2(b)), which reduces the bandwidth needed to load
OPs and decreases the overall power consumption. Thirdly,
as the granularity of input data increases from word level
to stream level, the intermediate storage (e.g. the register file in
a traditional reconfigurable block) of each RC should also
increase from word-level granularity to stream-level granularity.
As a result, stream memory is provided for each RC. This
approach also meets the requirement for larger
intermediate storage in the object inference flow. Fig. 3 shows this
migration from word-level granularity (register file) to
stream-level granularity (Stream Buffer Unit, SBU).
To summarize, our contributions in this paper are as follows.
• A novel CGRA-based approach to accelerate algorithms
in a general object inference flow is provided. This approach
has three aspects. Firstly, it adopts stream processing,
such that each kernel-level RC computes in a stream
manner. Secondly, it employs both static configuration
and dynamic configuration for stream processing, such
that static configuration constructs kernel functionalities,
while dynamic configuration schedules the
data streams. Thirdly, stream memory is used for buffering
the input, output and intermediate streams.
• A CGRA implementation prototype, called Stream Dual-Track
CGRA (SDT-CGRA), is provided to realize the
above approach. Novelties include the composable and
decomposable RC architectures that are interconnected in
a Reverse-S topology, and a stream-driven control mechanism
that simplifies control behavior for the cluster-based RC
architecture, as shown in Section IV-D.
The rest of this paper is organized as follows. Section II
covers the previous work in this domain. Section III presents
the design approach based on the analysis of algorithms on
the object inference flow. Section IV introduces an archi-
tecture called SDT-CGRA to implement the proposed design
approach. Section V includes several examples to demonstrate
the mapping strategies. Section VI contains the experimental
results. Finally, Section VII discusses future work and
concludes the paper.
II. RELATED WORK
A. Architecture Perspective
In the past decades, several CGRA frameworks have been
proposed. Among them, ADRES [15] is one of the most widely studied
templates based on data flow computing. Its tight coupling
with a host processor allows multiple customized
instructions to be efficiently executed on ADRES. As a result,
instruction-level parallelism can be exploited to accelerate
algorithms. Another CGRA template is Morphsys [14], which
is organized as a 2-D mesh of homogeneous reconfigurable cells
operating in single instruction multiple data (SIMD) fashion. It
exploits data-level parallelism to accelerate applications.
These two approaches exploit parallelism from different
perspectives to improve the computing performance for the
applications in a specific domain. However, for the algorithms
in an object inference flow, the characteristic of fixed
kernel-level operations makes ADRES and Morphsys less
energy-efficient. For example, both ADRES and Morphsys
have to load context instructions in every execution cycle,
which is unnecessary given the
fixed kernel-level operations in our applications.
In contrast, DySER [16], a CGRA architecture that
explores functionality and parallelism specialized in a single
array, has shown that specializing the common data paths
of a program for certain execution periods
can improve the energy efficiency. We
further extend this specialization approach to the object inference
domain, where the kernel-level operations are suitable for
specialization.
Kernel-level operations in stream processing require kernel-based
processing units in order to compute efficiently. EGRA
[12], an expression-level granularity CGRA framework, has
shown that an expression-level or kernel-level CGRA fabric
can outperform traditional CGRA approaches. In this paper,
a reconfigurable block with kernel-level granularity is employed as
the elementary reconfigurable cell (RC). The major difference
is that we derive the RC architecture and its computing
manner from the computing characteristics of our
target applications.
BilRC [17] and Elastic CGRA [18] can be categorized
as another CGRA template. Both of them are similar to
commercial FPGAs in architecture arrangement and configuration
manner. For BilRC, applications’ dataflows are mapped
statically and scheduled dynamically by execution triggering,
while Elastic CGRA depends on elastic interconnection [19] to
manage dataflows. In our approach, elastic interconnection is
adopted for data transmission among RCs and global memory,
interconnections in RC units.
An earlier version of our work was presented in [1]. This
paper further optimizes the architecture of the RC and the SDT-CGRA.
More specifically, the numbers of ALUs, FIFOs and
static configuration bits are reduced by 20%, 20% and 38.5%,
respectively. Details will be discussed in Section IV. More
complementary experiments are provided in Section VI.
B. Application Perspective
Most of the CGRA architectures proposed in the past mainly
concentrate on digital signal processing. For example,
ADRES, BilRC and Elastic CGRA mainly target the acceleration
of FFT, DCT/IDCT, etc. In [20], a CGRA architecture is
proposed to accelerate video decoding in multiple standards.
In [13], an architecture called FPCA is designed for medical
image processing.
As for machine learning, MAPLE [21] introduces an FPGA-based
reconfigurable accelerator for classification. It abstracts
all selected algorithms as matrix multiplications, and
designs a matrix multiplication engine for all of them. However,
this approach limits the exploitation of locality properties for
some machine learning algorithms, e.g. CNN. In [22], a CNN
acceleration approach based on CGRA (called EMAX) is
proposed. However, limited experiments are done to evaluate
the performance of EMAX on CNN. In [23], a multithread
CGRA (called M-CGRA) is proposed to accelerate CNN
only. However, the object inference flow contains not only
CNN but also other traditional algorithms. If one
wants to design an architecture for CNN acceleration only,
algorithm-specific approaches like Eyeriss [9], DNPU [24],
DNA [25] and the hybrid-neural-network processor [26] are more
suitable and energy-efficient.
III. COMPUTING CHARACTERISTICS ANALYSIS
A. Algorithms Characteristics
Object inference is one of the most important topics in
computer vision. In the competitions of PASCAL VOC [27]
and ImageNet [28], a general processing flow of the proposed
algorithms in the past several years can be abstracted into three
key stages: feature extraction, feature selection and inference,
TABLE I
THE SELECTED REPRESENTATIVE ALGORITHMS
Stage              | Representative Algorithms
Feature Extraction | Convolutional Neural Network (CNN)
Feature Selection  | k-means, SPM, PCA
Inference          | SVM, Softmax, Joint-Bayesian
TABLE II
SUMMARY OF MAJOR COMPUTATION PATTERNS FROM REPRESENTATIVE
ALGORITHMS IN GENERAL OBJECT INFERENCE FLOW
Pattern Description   | Pattern Equation | Related Algorithm
conv                  | $\sum_i a_i \times b_i$ | CNN
interpolation         | $a_i \times c + b_i$; $c_i \le c < c_{i+1}$; $i = 1, 2, \ldots, N$ | CNN, Softmax
div                   | $a / b$ | CNN
sqrt                  | $\sqrt{a}$ | k-means
distance              | $\sum_i \lvert a_i - b_i \rvert$, $\sum_i (a_i - b_i)^2$ | k-means
matrix multiplication | $A_{a_1 \times a_2} \times B_{a_2 \times b_2}$ | CNN, SVM, PCA, Joint-Bayesian, Softmax
histogram             | $\sum_i (I_i == K_j \,?\, w : 0)$ | SPM
[Figure 4: (a) a kernel scope moved across the input by a stride; (b) scanning over several layers (Layer, Layer, Layer, ...).]
Fig. 4. The computation procedure of a kernel over a scope: (a) A scanning
behavior example; (b) Illustration of multi-scale scanning.
as shown in Fig. 1. In this paper, several representative algo-
rithms covering these three stages are selected as case studies.
Among them, convolutional neural network (CNN) is used for
feature extraction; k-means, spatial pyramid matching (SPM)
[29] and principal component analysis (PCA) are adopted for
feature selection; the linear support vector machine (SVM),
Softmax [3] and Joint Bayesian [30] are used for inference.
Table I summarizes these algorithms, while their computing
patterns are analyzed and shown in Table II.
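The patterns in Table II can be written down in a few lines of software. The following Python sketch only illustrates the arithmetic behind each pattern and is not the hardware mapping used in this paper; the function names and the lookup-table layout of the interpolation (knots, slopes, offsets) are assumptions made for this example.

```python
import bisect

def conv(a, b):
    """conv pattern: sum_i a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))

def interpolation(c, knots, slopes, offsets):
    """Piecewise-linear pattern: slopes[i] * c + offsets[i] for knots[i] <= c < knots[i+1]."""
    i = bisect.bisect_right(knots, c) - 1
    i = max(0, min(i, len(slopes) - 1))  # clamp outside the table (assumption of this sketch)
    return slopes[i] * c + offsets[i]

def distance_l1(a, b):
    """distance pattern, absolute form: sum_i |a_i - b_i|."""
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_sq(a, b):
    """distance pattern, squared form: sum_i (a_i - b_i)^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def matmul(A, B):
    """matrix multiplication pattern, built from the conv (dot product) pattern."""
    return [[conv(row, col) for col in zip(*B)] for row in A]

def histogram(I, K_j, w):
    """histogram pattern: sum_i (I_i == K_j ? w : 0)."""
    return sum(w if v == K_j else 0 for v in I)
```

Writing matrix multiplication in terms of the dot product makes the shared multiply-accumulate core of the conv and matrix multiplication rows visible.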
Generally, the detection of objects requires processing
images or feature maps by scanning operations within specific
processing scopes at multiple scales. Each scanning operation
can be regarded as a kernel function executing over a limited
scope of the input image/feature map, and then shifting to the
next scope in a specific order, as illustrated in Fig. 4(a). As
for multi-scale detection, general methods include classifiers
running on a pyramid of images/feature maps (e.g. DPM
[31]), a pyramid of filters running on the feature maps (e.g.
SURF [32]) and a pyramid of reference bounding boxes on
the final regression functions (e.g. Faster R-CNN [33]). For
simplicity, multi-scale detection can be regarded as a multi-layer
scanning operation, as illustrated in Fig. 4(b).
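The scanning behavior of Fig. 4 can be sketched in Python as a fixed kernel applied to a stream of scopes. This is a minimal illustration under stated assumptions: the row-major scan order, the dot-product kernel and the factor-of-two downsampling used to build the pyramid are choices made for this example and are not prescribed by the text.

```python
def scan_scopes(fmap, kh, kw, stride):
    """Yield every kh x kw kernel scope of a 2-D map, in row-major scan order, as a flat list."""
    rows, cols = len(fmap), len(fmap[0])
    for r in range(0, rows - kh + 1, stride):
        for c in range(0, cols - kw + 1, stride):
            yield [fmap[r + i][c + j] for i in range(kh) for j in range(kw)]

def apply_kernel(fmap, weights, kh, kw, stride):
    """Run one fixed kernel (a dot product with `weights`) over the stream of scopes."""
    return [sum(x * w for x, w in zip(scope, weights))
            for scope in scan_scopes(fmap, kh, kw, stride)]

def pyramid(fmap, levels):
    """Multi-layer input: each level keeps every other row and column (assumed downsampling)."""
    out = [fmap]
    for _ in range(levels - 1):
        fmap = [row[::2] for row in fmap[::2]]
        out.append(fmap)
    return out

# Hypothetical 4 x 4 input; the same kernel is reused unchanged on every layer.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
results = [apply_kernel(layer, [1, 1, 1, 1], 2, 2, 2) for layer in pyramid(fmap, 2)]
```

Only the input data change from scope to scope and from layer to layer; the kernel itself stays the same.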
Each process in Fig. 4 has the following inherent properties.
The functionality of the computing kernel remains unchanged
for the same input image/feature map. The only difference is
the input data for the computing kernel. As a result, any
specific kernel will execute repeatedly over all the corresponding
kernel scopes. In this process, the computing patterns of the
kernels remain unchanged for a long execution period, which
corresponds to the fixed kernel-level operations. In this case,
the input data from all the kernel scopes can be organized as
bunches of data streams, while the kernel executes
in a stream processing manner. Besides, the multi-scale/multi-layer
operations have pyramids of input images/feature maps,
or pyramids of intermediate results, which means the
storage requirement (including intermediate storage) is much
more critical than in CGRAs for other applications, e.g. digital
signal processing. In this case, another computing characteristic of
algorithms in object inference is the requirement for sufficient
storage. To summarize, the computing characteristics of
algorithms in the object inference flow are
• stream processing,
• fixed kernel-level operations,
• ample storage.
In our selected representative algorithms, kernel-level
operations in stream processing account for most of the
computation. In CNN, taking AlexNet [3] as an example,
92% of the computation is convolution and pooling.
The rest, which includes the fully connected layers (containing
the matrix multiplication pattern in Table II) and the softmax layer
(containing the matrix multiplication and interpolation patterns
in Table II), can also be organized as kernel-level operations in
a stream processing manner. More importantly, the computing
pattern of a specific layer, e.g. the first convolutional layer, can
remain unchanged for nearly one million execution cycles. In
this case, if we consider the power consumed in loading
an OP for each ALU in every execution cycle as in Fig. 2(a),
the power saving of our static configuration approach is
substantial.
B. Guidelines for Architecture Design
As discussed in Section III-A, the computing characteristics
lead to the following guidelines for designing a CGRA.
• Design kernel-level granularity RCs that compute in a stream manner.
• Increase the size of intermediate storage.
• Employ static configuration for RCs to construct their
functionalities, and dynamic configuration for scheduling
data streams.
The first two guidelines explain how to design a CGRA architecture
for our target applications, while the last one, which is called the
dual-track programming model in this paper, explains how the
architecture works. We will discuss these two parts in detail
below.
1) How to design a CGRA architecture for target applications:
Convolution and matrix multiplication are two
of the most important computing patterns in the algorithms of
the object inference flow. These two computing patterns involve
many multiplication-accumulation operations. As a result, the
elementary RC should be based on multiplier-ALU (MUL-ALU
for short) units. Placing multipliers
and ALUs in different parts, as in BilRC [17], would increase the
routing resource requirements and could even lead to placement
or routing conflicts. As a result, tightly coupling the multiplier
and the ALU is a proper approach for our applications.
On the other hand, extra operations are required to support
further additions, accumulations or logic operations in
convolution, matrix multiplication, distance calculation, etc. If a single
MUL-ALU unit is employed as the elementary RC, the extra
additions, accumulations or logic operations will lead to a low
utilization ratio of the multipliers in MUL-ALU units. As a result,
combining MUL-ALU units and extra ALUs into an elementary RC
is necessary.
We studied the problem of determining the number of MUL-ALU
units and extra ALUs in each RC by collecting run-time
statistics over several representative CNN algorithms, since
CNN is the most important part in terms of computation and
inference accuracy. Fig. 5(a) shows the statistics of different
sizes of convolution kernels in AlexNet, VGG-16 [34] and
GoogleNet [35]. The 3 × 3 convolution kernel
(denoted as conv 3x3) is the most numerous in AlexNet
and VGG-16, while GoogleNet has the largest number of
1 × 1 convolution kernels (denoted as conv 1x1). However,
from the perspective of computation, the 3 × 3 convolution
[Figure 5: bar charts for AlexNet, VGG-16 and GoogleNet. (a) Vertical axis: Number of Convolution Kernel (x 100000). (b) Vertical axis: Computation Ratio (0% to 100%); bar values in plotted order: AlexNet 0%, 50.53%, 33.64%, 0%, 15.83%; VGG-16 0%, 100%, 0%, 0%, 0%; GoogleNet 19.30%, 64.16%, 8.67%, 7.87%, 0%.]
Fig. 5. Information of convolution kernels in AlexNet, VGG-16 and
GoogleNet; (a) The number of convolution kernels with different sizes in three
CNN architectures; (b) The computing workload ratios of different sizes of
convolution kernels in all convolutional layers for a specific CNN architecture.
kernel still ranks number one in all three CNN architectures (see
Fig. 5(b)). In VGG-16, the computation ratio of conv 3x3 in
all the convolutional layers is as high as 100%. As a result,
efficient support for the 3 × 3 convolution kernel can be regarded
as the foundation for designing an RC unit. In this case, three
MUL-ALU units and an extra ALU are combined
as the elementary RC in our CGRA approach. As for other
sizes of convolution kernel and other computation patterns,
flexible interconnections inside and outside RCs are provided
to support them. Details will be discussed in Section IV.
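The workload ratio per kernel size can be estimated by counting multiply-accumulate operations (MACs) layer by layer. The Python sketch below does this for AlexNet; the layer shapes are the commonly quoted two-group AlexNet configuration and are an assumption of this example (they are not listed in this paper), and counting MACs is one possible workload measure. Under these assumptions the result appears consistent with the AlexNet values of Fig. 5(b).

```python
from collections import defaultdict

# (kernel size, input channels per group, output channels, output height, output width)
# Assumed AlexNet convolutional layers (two-group formulation), not taken from the text.
ALEXNET_CONV = [
    (11, 3, 96, 55, 55),
    (5, 48, 256, 27, 27),
    (3, 256, 384, 13, 13),
    (3, 192, 384, 13, 13),
    (3, 192, 256, 13, 13),
]

def mac_ratio_by_kernel_size(layers):
    """Return the share of convolution MACs contributed by each kernel size."""
    macs = defaultdict(int)
    for k, c_in, c_out, h, w in layers:
        macs[k] += k * k * c_in * c_out * h * w  # MACs of one convolutional layer
    total = sum(macs.values())
    return {k: m / total for k, m in macs.items()}

ratios = mac_ratio_by_kernel_size(ALEXNET_CONV)
for size in sorted(ratios):
    print(f"conv {size}x{size}: {100 * ratios[size]:.2f}%")
```

Although AlexNet contains 11 × 11 and 5 × 5 kernels, roughly half of its convolution workload falls on 3 × 3 kernels in this count.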
The whole CGRA architecture can be regarded as a 2-D
array of RC units, as illustrated in Fig. 6(a). In order to
introduce the static and dynamic configurations clearly, we
can re-organize the architecture in Fig. 6(a) as Fig. 6(b), where
all the SBUs are lined up as one column. In our dual-track
programming model, the SBUs are controlled by the
dynamic context in VLIW format, while the RC array is
configured statically according to the computing kernels. The
CGRA architecture based on the stream processing, dual-track
programming model is called SDT-CGRA in this paper.
2) How the architecture works: The RC array in Fig. 6(b) is
guided by the dual-track programming model, whose configuration
flow is shown in Fig. 7(a). For each computing kernel in
a kernel-level iteration, the functionality of the given computing
kernel is first constructed by static configuration. Then the
data stream manager is configured according to the scheduling
requirements, such as loading data from the off-chip memory
to the SBU or from the SBU to the RC array, and storing data
from the RC array to the SBU or from the SBU to the off-chip
memory. After the configurations, the address generator
in the SBU starts to generate addresses to issue load or store
operations (indicated by the innermost loop in Fig. 7(a)) while
the RC array acts as both a consumer and a producer in
this process. It is worth mentioning that the dynamic VLIW
context only supports load and store operations.
To illustrate the proposed model, a convolution example,
which is widely used in the image processing and machine learning
domains, is presented. Fig. 7(b) shows the process of static
mapping and dynamic scheduling of the data streams for the
convolution operation. For simplicity, we assume the input
image Imap contains two rows of data, denoted as L1 and L2.
The size of the convolution kernel is 1 × 3. Based on the dual-track
model, the convolution operation is statically mapped
[Figure 6: (a) a 2-D array of RC units, each paired with an SBU; (b) the same array with all SBUs lined up in one column; labels: VLIW, Static Configuration.]
Fig. 6. (a) RC Array; (b) RC array re-organization
into one RC, e.g. RC1. The corresponding scheduling of the
data streams is compiled into two VLIWs:
Instruction 1 and Instruction 2. Each of them can issue two
concurrent operations, which are used to load the input data
and store back the results. For example, in Instruction 1, L1 is
read out from the SBU and then sent to RC1. At the same time,
the output stream R1 is stored back to the SBU as soon
as it is available. Instruction 2 performs the same operations
except that the input data stream and the output stream are
different. In this way, RC1 is configured to be a stream
processing unit for efficient convolution over the
input image. To demonstrate the benefits of the dual-track
programming model for SDT-CGRA, we assume that the two
configuration methods in Fig. 2 are applied to RC1 in turn.
Suppose that both methods take one clock cycle
to finish a configuration, with a bandwidth requirement
of BW on RC1. For the approach that loads
OPs in every execution cycle, the total bandwidth requirement
for loading OPs is 14 × BW (The Imap in Fig. 7(b) has
[Figure 7: (a) flow chart of a kernel-level iteration: static configuration, loading dynamic configuration, data loading, kernel execution in a stream mode and saving results, with loops ending at the end of data loading, the end of dynamic configuration and the end of the kernel. Step 1: RC configuration. Step 2: data streams scheduling and execution (VLIWs, AddrGen, SBU, MEM, RC array). (b) Static mapping of the convolution to RC1 and generation of the dynamic context: Instruction 1 (Load L1, Store R1) and Instruction 2 (Load L2, Store R2), with
$Imap \otimes \omega = R$, $\omega = (\omega_1, \omega_2, \omega_3)$,
$R_{11} = L_{11} \times \omega_1 + L_{12} \times \omega_2 + L_{13} \times \omega_3$,
$R_{12} = L_{12} \times \omega_1 + L_{13} \times \omega_2 + L_{14} \times \omega_3$.]
Fig. 7. The programming model of the SDT-CGRA and an example
of convolution operation based on the proposed dual-track model; (a) the
programming model; (b) an example for the model.
[Figure 8: system block diagram. Inside the SDT-CGRA: global memory (SBUs), crossbar, local data bus, computing array (RCs plus a column of IRC and PRC units), Static Config. Ctr. Unit, Dynamic Config. Ctr. Unit, external memory DMA interface, static configuration interface and dynamic configuration interface. Outside: off-chip memory and host processor. The legend distinguishes interconnections between the local data bus and RCs, between RCs in the horizontal direction, and between RCs in the vertical direction.]
Fig. 8. The typical acceleration system consists of an SDT-CGRA architecture,
an off-chip memory and a host processor.
14 elements). For our static configuration approach, the
bandwidth requirement is only BW since RC1 needs to be
configured only once. From this point of view,
the static configuration is better. If we also consider the power
consumed in loading configuration contexts, the advantages
of our approach are more evident.
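The convolution example of Fig. 7(b) and the bandwidth comparison above can be mimicked in software. The Python sketch below is a behavioral illustration only: the row length of 7 (two rows giving 14 elements), the data values, the weights and all function names are assumptions of this example, and a dictionary stands in for the SBU.

```python
def conv_1x3(row, w):
    """Kernel fixed by static configuration: R_j = L_j*w1 + L_{j+1}*w2 + L_{j+2}*w3."""
    return [row[j] * w[0] + row[j + 1] * w[1] + row[j + 2] * w[2]
            for j in range(len(row) - 2)]

def run_dual_track(imap, w):
    """Static track: configure RC1 once. Dynamic track: one VLIW (load + store) per row."""
    static_config_loads = 1                       # RC1 receives its 1 x 3 kernel once
    sbu = {f"L{n + 1}": row for n, row in enumerate(imap)}
    vliws = [(f"L{n + 1}", f"R{n + 1}") for n in range(len(imap))]
    for load_name, store_name in vliws:
        stream = sbu[load_name]                   # Load: SBU -> RC1
        sbu[store_name] = conv_1x3(stream, w)     # Store: RC1 -> SBU
    return sbu, static_config_loads

def per_cycle_config_loads(imap):
    """Scheme of Fig. 2(a): one OP load per execution cycle, i.e. per input element."""
    return sum(len(row) for row in imap)

imap = [[1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1]]  # 14 elements, hypothetical values
weights = [1, 0, -1]                                    # hypothetical kernel weights
sbu, static_loads = run_dual_track(imap, weights)
print(sbu["R1"], sbu["R2"], per_cycle_config_loads(imap), static_loads)
```

The two counters correspond to the 14 × BW and BW figures in the text: the per-cycle scheme reloads the OPs for every input element, while the static scheme configures RC1 once and then only issues load and store instructions.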
IV. SDT-CGRA ARCHITECTURE: A PROTOTYPE
According to the design strategies introduced in Section III,
we present a prototype to implement SDT-CGRA. Each part
of SDT-CGRA is introduced as follows.
A. The Overview of SDT-CGRA
The top-level architecture of the proposed SDT-CGRA and
a typical system are shown in Fig. 8. The SDT-CGRA unit
is organized into a global memory section and
a computing array section according to the difference in
configuration manner. The global memory section is used
to cache data streams and issue load or store operations
through dynamic configuration. In contrast, the computing
array section, which works in a static configuration manner,
consists of several columns of RCs and one column of special
RCs (shown as IRC and PRC in Fig. 8). Special RCs are used
to support special operations, such as the power function
(corresponding to PRC) and transcendental functions that can
be approximated by interpolation (corresponding to IRC).
Since such operations generally account for a small share of the
computation, only several special RC units are provided. Details of
these units are introduced in Section IV-C.
In addition to the memory blocks and the computing units,
the interconnections among them play a key role in data
transmission and the selection of mapping strategies. For example,
the interconnections marked as blue arrows in Fig. 8 are
used to connect the RC units in a Reverse-S topology, which
),)22
M1 D1
D2
D3 D4
M2 D3
D4
D5 D6
M3 D5 D6
M1 M2 M3
S1 S2 S3
S1S2 S3
Pre2
Pre2 D3
Ctr_Unit
D1 D2
RC input IF
RC output IF
  From Local Data 
Transmission  Channel
To Multi-channel data bus To the Next RC
From Previous RC
“Map”
part
“Reduce”
part
S3
N1 N1
Next2
S3
N2
N2
Pre1 Next1
S2
FIFO
65$0
Fig. 9. The micro architecture of RC from the data path perspective.
ALU ALU ALU ALU ALU ALU ALU ALU ALU
… …
(a) Independent style (b) Broadcast style (c) Systolic style
ݖି௡ ݖି௡
Fig. 10. Three different work style of MUL-ALU units. The zn means to
delay n clock cycles of the data.
is designed to provide the composition and decomposition
capability among adjacent RCs in the horizontal direction. As a
result, a larger computing kernel that exceeds the computing
capacity of one RC can be realized by several RCs, which helps to
reduce the number of idle processing elements. Details can be found in
Section IV-D. The green arrows, on the other hand, enable
the results from one RC to pass directly to the next RC in
the vertical direction. This type of direct data transmission
is inspired by FDR-CGRA [36] and reduces
global memory communication congestion. The local data bus
is designed for data transmission between SBUs and RCs.
With the help of a scalable crossbar switch [37], each SBU can
be accessed by any RC, IRC and PRC through the local data
transmission channels.
There are three interfaces in total for the proposed SDT-
CGRA. The external memory DMA interface provides a direct
access to the off-chip memory for the SBUs. The remaining
two interfaces are used for configuration: one for static config-
uration and the other for loading dynamic context instructions.
B. RC Architecture
1) Detailed Architecture: The guidelines in Section III-B
show that the RC architecture can be designed with three MUL-ALU
units and an extra ALU for our target applications. Fig. 9
shows a detailed RC architecture in SDT-CGRA. The three MUL-ALU
units can be configured to execute multiply-accumulate
operations, distance calculations or other computing patterns
in an independent manner, a broadcast manner or a systolic manner,
as shown in Fig. 10. The extra ALU is used to perform further
addition, accumulation or other logic operations to reduce
the bandwidth of the output interface. This idea follows [38],
where it is called the “map-reduce” structure. In our architecture,
the three MUL-ALU units belong to the “map” part, while the
remaining ALU belongs to the “reduce” part (see Fig. 9). The
“map” part can be used to execute concurrent operations while
the “reduce” part is used to collect the results from the “map”
part. The “map” part is composable and decomposable, which
will be introduced in Section IV-D. Compared to the RC
architecture in our previous work [1], the numbers of FIFOs
and ALUs are both reduced from 5 to 4 (20% reduction)
without any impact on the mapping of computing kernels.
Besides, the number of multiplexers, which provide internal
interconnections to efficiently support other computing patterns,
is reduced from 37 to 22 (40.5% reduction). In addition, the
static configuration bits for multiplexers are reduced by 38.5%.
The intention of designing such a tightly coupled RC unit with three
MUL-ALU units is to increase the operation intensity per input
data, which can help to improve computation performance
according to the roofline model [39].
To implement stream processing, the input and output
stream interface in Fig.3 are realized with two types of local
memories (FIFOs and SRAMs). FIFOs are used to maintain
the working status of Ctr Unit by the “full” and “empty”
signals. The SRAMs are adopted to cache the data that are
used frequently, e.g. the weights of convolution kernels. In
many cases, the SRAM can also be used to perform as double-
buffer to overlap the time cost in data transmission from SBU
to RC by the time consumed in computing.
2) Stream-Driven Control Mechanism: The control mechanism of the RC
is determined by two characteristics: stream processing and static
configuration. To accommodate stream processing, the FIFOs in the
input and output stream interfaces and the interconnection channels
based on elastic interconnection [19] (see Section IV-D) are used to
issue a “processing flag”. That is to say, if the input FIFOs are not
empty and the output FIFO is not full, or the input interconnection
channels have valid data and the output interconnection channels are
writable, the Ctr Unit in Fig. 9 will control the RC to process the
input data. Otherwise, the Ctr Unit will stop the RC from processing.
This stream-driven control mechanism of the Ctr Unit is realized by
finite state machines (FSMs) and counters, which are configured in the
static configuration stage and designed to generate control signals
such as read enable signals for input FIFOs, read addresses for input
SRAMs, the write enable signal for the output FIFO, clear signals for
ALUs when they are configured to be accumulators, ready and valid
signals in interconnection channels for adjacent RCs, etc.
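The firing rule described above can be sketched in a few lines. This is only an illustrative model, not the authors' Ctr Unit: the names `FifoStatus` and `processing_flag` are hypothetical, and the sketch covers the FIFO path only (the interconnection-channel path would test “valid” and “stop” in the same way).

```python
from dataclasses import dataclass


@dataclass
class FifoStatus:
    """Status flags of one FIFO as seen by the control unit (hypothetical model)."""
    empty: bool
    full: bool


def processing_flag(input_fifos, output_fifo):
    """Return True when the RC may process data in the current cycle.

    The RC fires only if every configured input FIFO holds data and
    the output FIFO can still accept a result; otherwise it stalls.
    """
    inputs_ready = all(not fifo.empty for fifo in input_fifos)
    output_ready = not output_fifo.full
    return inputs_ready and output_ready
```

Because the decision depends only on local status flags, no global scheduler is needed to start or stop an RC, which is the point of the stream-driven scheme.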
Algorithm 1 Calculate $y = x^p$ [40], in C language syntax
Input: x, p; $-1 \le p \le 1$; $x > 0$, x is in floating-point format
Output: $y = x^p$
1: float xorig = x;
2: int i = (int)x;
3: float j = (1.0 - p) * 1064975338 + (p * i); // 1064975338 is called magic number
4: i = int(j);
5: x = (float)i;
6: y = (1.0 - p) * x + p * pow(x, (p - 1)/p) * xorig;
Fig. 11. The elastic control mechanism [19]. (Signals shown between
RCi-1, RCi and RCi+1: Data, Valid, Stop.)
C. Special RC Design
Two types of special RCs, PRC and IRC, are developed to support
power functions and piecewise functions respectively. Consider first
the PRC, designed to calculate $x^{1/2}$, $x^{-1/2}$ and $x^{-1}$
based on the fast inverse square root algorithm [40] shown in
Algorithm 1. Many multiplications in Algorithm 1 can be simplified to
shift or addition operations, while pow(x, (p - 1)/p) can be
calculated by multiplications when p is set to 1/2, -1/2 or -1.
Because the fast inverse square root algorithm requires a
floating-point representation of the input value x, the input x is
first converted from the fixed-point format to the floating-point
format. In line 6 of Algorithm 1, all the values required to
calculate y are converted back to fixed-point values before the final
result is calculated, in order to reduce the hardware complexity.
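A software sketch of Algorithm 1 may help to check the arithmetic. It assumes that the casts on lines 2 and 5 reinterpret the IEEE-754 single-precision bit pattern (as in the well-known fast inverse square root trick) rather than converting the numeric value; the helper names are hypothetical, and the fixed-point conversions of the actual PRC are not modelled.

```python
import struct

MAGIC = 1064975338  # the "magic number" from line 3 of Algorithm 1


def float_to_bits(x):
    """Bit pattern of x as a single-precision float, read as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(i):
    """Single-precision float whose bit pattern is the integer i."""
    return struct.unpack("<f", struct.pack("<I", i))[0]


def fast_pow(x, p):
    """Approximate x**p for x > 0 and p in {0.5, -0.5, -1.0}."""
    x_orig = x
    i = float_to_bits(x)                  # lines 1-2
    j = (1.0 - p) * MAGIC + p * i         # line 3: first guess in the integer domain
    x = bits_to_float(int(j))             # lines 4-5
    # line 6: one Newton step for y**(1/p) = x_orig
    return (1.0 - p) * x + p * pow(x, (p - 1.0) / p) * x_orig
```

For p = -1/2 the term (1 - p) * MAGIC equals 1597463007 (0x5f3759df), the constant of the classic inverse square root routine, and the exponent (p - 1)/p becomes 3, so line 6 needs multiplications only.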
Consider next the IRC, designed to calculate the interpolation
for transcendental functions that can be approximated by
piecewise functions, as shown by the following expression.
$$f(x) = a_i \times x + b_i, \quad x \in [x_i, x_{i+1}), \quad i = 0, 1, \ldots, N-1. \tag{1}$$
Specifically, the input data x is compared with the boundary
values of each interval from $x_0$ to $x_{N-1}$ in parallel to
generate an address for the look-up tables. The coefficients $a_i$
and $b_i$ read from the look-up tables are then used to calculate the
interpolation result f(x). It is worth mentioning that both PRC and
IRC are independent of the RCs. As a result, data are transmitted
between them through SBUs and local data buses.
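The look-up procedure behind equation (1) can be illustrated as follows. This is a sketch under assumptions: the coefficient tables are filled with secant lines through the segment end points (the paper does not say how $a_i$ and $b_i$ are chosen), tanh is used only as an example target function, and the function names are hypothetical.

```python
import math


def build_tables(func, boundaries):
    """Build slope/intercept tables (a_i, b_i) from the segment end points."""
    a, b = [], []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        slope = (func(hi) - func(lo)) / (hi - lo)
        a.append(slope)
        b.append(func(lo) - slope * lo)
    return a, b


def irc_eval(x, boundaries, a, b):
    """Evaluate f(x) = a_i * x + b_i for the segment [x_i, x_{i+1}) containing x."""
    # Compare x with x_0 .. x_{N-1}; hardware would do all comparisons in parallel.
    hits = [x >= xi for xi in boundaries[:-1]]
    index = sum(hits) - 1  # the number of boundaries passed gives the table address
    return a[index] * x + b[index]


# Example setup: 16 segments of width 0.5 covering [-4, 4).
BOUNDS = [0.5 * k for k in range(-8, 9)]
A, B = build_tables(math.tanh, BOUNDS)
```

With this choice of coefficients the approximation is exact at every boundary and the error inside a segment shrinks quadratically with the segment width.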
D. Interconnections
The interconnections of SDT-CGRA are organized into
two types. The first is the interconnections between RCs,
while the second is the crossbar between the SBUs and the RC
array. The main strategy for these interconnections is the
elastic data transmission mechanism (Fig. 11), which can be
used to simplify the control procedure by converting dynamic
scheduling to dataflow control [19]. The “stop” and “valid”
signals determine the handshaking process and maintain
reliable data transmission between different nodes.
Fig. 12. (a) Illustration of the RC decomposition and combination in different
rows. (b) Illustration of elastic interconnection to support RC decomposition
and combination; ALUs are configured to be accumulators (ACC).
Fig. 13. (a) Crossbar switch. (b) The illustration of the dynamic configuration
and execution relationship.
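The effect of the “valid” and “stop” signals can be illustrated with a small cycle-level simulation. This is a simplified sketch, not the circuit of [19]: each stage is modelled as a one-entry register, a stage is "valid" when it holds an item, and "stop" is implied when the next register is occupied. The function name and the stall callback are hypothetical.

```python
def simulate_elastic_chain(items, n_stages, sink_stalls):
    """Move items through n_stages one-entry registers with valid/stop handshakes.

    sink_stalls(cycle) returns True when the sink asserts "stop" in that cycle.
    Returns the items received by the sink (in order) and the cycles used.
    """
    stages = [None] * n_stages        # None means "valid" is low
    pending = list(items)
    received = []
    cycle = 0
    while len(received) < len(items):
        # The sink takes the data of the last stage unless it asserts stop.
        if stages[-1] is not None and not sink_stalls(cycle):
            received.append(stages[-1])
            stages[-1] = None
        # Walk backwards so that a freed register is refilled in the same cycle.
        for k in range(n_stages - 1, 0, -1):
            if stages[k] is None and stages[k - 1] is not None:
                stages[k] = stages[k - 1]
                stages[k - 1] = None
        # The source injects a new item when the first stage is free.
        if stages[0] is None and pending:
            stages[0] = pending.pop(0)
        cycle += 1
    return received, cycle
```

Back-pressure from the sink only delays the stream; no item is lost or reordered, which is why each node can be controlled locally.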
1) The Interconnections Among RCs: A simple example is
used to introduce the functionality of the Reverse-S intercon-
nections among RCs. Assuming that five multiplier-ALUs are
required to implement an expression E, three multiplier-ALUs
in the first RC and two multiplier-ALUs in the second RC can
be combined together to map E. The unused resources of the
second RC can be further combined with other RCs. Fig. 12(a)
illustrates such an approach, where five RCs in two adjacent
rows are configured to calculate three such expressions in
parallel.
Fig. 12(b) illustrates the details of the decomposition and
combination process. The elastic interconnections are used to
transfer the input data and the results that are required by
the next RC. Although the second RC is split into two parts,
its control behavior is still independent from other RCs as a
result of elastic interconnections. That is to say, the working
status of the second RC just depends on the “valid” and “stop”
signals of the elastic interconnections as well as the “full” and
“empty” signals of the FIFOs in the input and output interfaces.
No other control signals from other RCs are required.
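The decomposition and combination example above (three expressions of five multiplier-ALUs each on five RCs) can be reproduced with a simple slot allocator. This is an illustrative sketch: it assumes that multiplier-ALUs are handed out consecutively along the chain of RCs, and the function name is hypothetical.

```python
MUL_ALUS_PER_RC = 3  # each RC holds three multiplier-ALU units


def allocate(expr_sizes):
    """Assign consecutive multiplier-ALU slots along the RC chain.

    expr_sizes lists the number of multiplier-ALUs each expression needs.
    Returns one list per expression holding (rc_index, slot_index) pairs.
    """
    mapping = []
    next_slot = 0  # global slot counter along the chain
    for size in expr_sizes:
        slots = []
        for _ in range(size):
            slots.append(divmod(next_slot, MUL_ALUS_PER_RC))
            next_slot += 1
        mapping.append(slots)
    return mapping
```

For `allocate([5, 5, 5])` the second RC is shared between the first two expressions and the fourth RC between the last two, so fifteen multiplier-ALUs fit exactly into five RCs.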
2) The Scalable Crossbar Switch: The scalable crossbar
switch in SDT-CGRA acts as a bridge that interconnects the
SBUs and the RC array. It allows all the RCs to access each SBU.
To accommodate the characteristics of stream processing and
dynamic data stream scheduling, the crossbar switch is controlled
dynamically by a select signal that travels along with the data
stream in each input channel, as indicated by “ctr” in Fig. 13(a).
E. SBU Architecture
As indicated in Fig. 3, each SBU contains a memory block
and an address generator that can provide read and write
addresses simultaneously. The operations of the address generator
in each SBU, such as reading from or writing to the RC array or
the off-chip memory, are controlled by dynamic configuration
contexts. To demonstrate the control flow of the data streams,
suppose that the k-th SBU issues several write operations
to the j-th RC in the i-th row (denoted as $RC_{ij}$ in Fig.
13(a)). In the first VLIW cycle, the k-th SBU and the control
signals corresponding to the output channel in the crossbar
are configured. Once configuration ends, the address
generator in the k-th SBU starts to generate addresses to read
data from the memory block for $RC_{ij}$. At the same time, the
next VLIW instruction is fetched and waits for the current
instruction to finish, as illustrated in Fig. 13(b). This
double-buffer technique adopted in the dynamic configuration process
helps to reduce the configuration overhead.
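The benefit of fetching the next VLIW instruction during execution can be estimated with a simple timing model. This is a sketch under assumptions: the cycle counts are illustrative placeholders rather than measured values, and the function name is hypothetical.

```python
def total_time(fetch, execute, overlap=True):
    """Total cycles for a sequence of dynamic configuration instructions.

    fetch[i] and execute[i] are the fetch and execution cycles of
    instruction i. With overlap, instruction i+1 is fetched while
    instruction i executes, so only the longer of the two is paid.
    """
    if not fetch:
        return 0
    if not overlap:
        return sum(fetch) + sum(execute)
    total = fetch[0]  # the first fetch cannot be hidden
    for i, run in enumerate(execute):
        next_fetch = fetch[i + 1] if i + 1 < len(fetch) else 0
        total += max(run, next_fetch)
    return total
```

With three instructions of 10 fetch cycles and 30 execution cycles each, the overlapped schedule takes 100 cycles instead of 120; as long as execution is longer than the next fetch, only the first fetch remains visible.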
V. ALGORITHM MAPPING EXAMPLES
In machine learning, convolution operation and large-scale
matrix multiplication are two of the most common computing
patterns. For example, the convolutional layers and the fully
connected layers account for nearly 92% and 8% of the computation
workload respectively in AlexNet. In this section, several
strategies are demonstrated to map computing kernels to the
proposed SDT-CGRA based on the convolution operation
and matrix multiplication. It should be mentioned that these
strategies are only part of the computing methods in the
considered algorithms; other methods can also be mapped on
SDT-CGRA.
A. Mapping Strategies of the Convolution Layers
One of the strategies to map convolution operations in the
convolutional layers can support various sizes of convolution
kernels with arbitrary strides. Without loss of generality,
suppose that the mapped kernel is $5 \times 5$, the stride is
$2 \times 2$, and the width of the 2D input feature map is N. The
number of multiplier-ALUs required to map such a convolution
kernel is $\lceil 5/2 \rceil \times \lceil 5/2 \rceil = 9$. There are
two working phases (depending on the stride of the kernel) for data
scheduling, where each phase corresponds to one dynamic VLIW
configuration instruction. Fig. 14 shows the mapping and computing
strategies. The convolution kernel is mapped into three RCs
with two working phases, as shown in Fig. 14(a). Before the
computation begins, the weights of the convolution kernel are
loaded into the three SRAM blocks in each RC. To simplify the
control procedure, the convolution kernel is extended
to $6 \times 6$ by padding 0s at the right and bottom sides of the
kernel. In this case, the weights in the first two rows of the
extended convolution kernel, $(w_{0,0}, w_{0,1}, \ldots, w_{0,4}, 0)$
and $(w_{1,0}, w_{1,1}, \ldots, w_{1,4}, 0)$, are loaded into all the
SRAM blocks in the first RC. The weights of the third and fourth rows
are loaded into the second RC, while the last two rows are loaded
into the last RC. After the weights are initialized, the input data
are read from the source SBU in row order and broadcast
to all the mapped RCs.
Fig. 14. A $5 \times 5$ convolution kernel is mapped into three RCs with
two working phases. Each phase corresponds to a dynamic configuration
instruction. (Panel (a) shows the 0-th to 2-nd RCs, the 0-th to 2-nd SBUs
and the source SBU in #Phase 0 and #Phase 1; panel (b) shows the data sent
to the multipliers in each clock cycle.)
For the even rows (row numbers start from 0) of the
input feature map, the three RCs work in #Phase 0. For the odd
rows, the mapped RCs work in #Phase 1. The computation
processes of both phases are illustrated in Fig. 14(b), where
$L^{In}_{i,p}$ and $L^{Out}_{i-1+p,p}$ represent the input and output
of intermediate results buffered in the i-th SBU at phase p, with
$i = 0, 1, 2$ and $p = 0, 1$; $i - 1 + p = -1$ means
$L^{Out}_{i-1+p,p} = 0$. For example, the first row of the input data
is loaded into the 0-th RC to convolve with the first row of the
weights. The corresponding results, denoted as $L^{In}_{0,0}$, are
buffered in the 0-th SBU. After the end of #Phase 0, the 0-th RC
switches to #Phase 1 to compute the convolution of the second row of
the input data and the second row of the weights. At the same time,
the results from #Phase 0 (denoted as $L^{In}_{0,0}$ above) are read
from the 0-th SBU (denoted as $L^{Out}_{0,1}$) and sent to the 0-th
RC, where they are added to the current convolution results to
generate $L^{In}_{0,1}$. This process is repeated until all the data
of the input feature map have been sent to the RCs. In this way, the
input data are fully reused and the convolutions are computed in
parallel.
Generally speaking, if the size of the convolution kernel is
$k_x \times k_y$ with a stride of $s_x \times s_y$, then
$\lceil k_x/s_x \rceil \times \lceil k_y/s_y \rceil$ multiplier-ALUs
are needed according to this mapping strategy. The decomposition and
combination characteristics of RCs introduced in Section IV-D provide
the capability to map larger convolution kernels flexibly and
efficiently.
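The resource formula above is easy to evaluate for other kernel shapes. In the sketch below the function names are hypothetical; padding the kernel up to a multiple of the stride generalizes the $5 \times 5$ to $6 \times 6$ example and is an assumption rather than a rule stated in the text.

```python
MUL_ALUS_PER_RC = 3  # three multiplier-ALU units per RC


def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def mul_alus_needed(kx, ky, sx, sy):
    """Multiplier-ALUs required for a kx-by-ky kernel with stride sx-by-sy."""
    return ceil_div(kx, sx) * ceil_div(ky, sy)


def rcs_needed(kx, ky, sx, sy):
    """RCs required when multiplier-ALUs are packed three per RC."""
    return ceil_div(mul_alus_needed(kx, ky, sx, sy), MUL_ALUS_PER_RC)


def padded_kernel_size(kx, ky, sx, sy):
    """Kernel size after zero padding up to a multiple of the stride (assumed rule)."""
    return ceil_div(kx, sx) * sx, ceil_div(ky, sy) * sy
```

For the running example, `mul_alus_needed(5, 5, 2, 2)` gives 9 multiplier-ALUs, which fit into 3 RCs, and the padded kernel is 6 by 6.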
B. Mapping Strategy for the Fully Connected Layers
Matrix multiplication is the major computing pattern in each
fully connected layer. For example, the operations in the first
fully connected layer of AlexNet can be expressed as:
$$O_{1 \times 4096} = F_{1 \times 9216} \times \omega_{9216 \times 4096} \tag{2}$$
If the input feature F from the previous layer is directly used to
compute the output result O, the system suffers from frequent
loading of the weights $\omega$ from the SBUs to the SRAM blocks in
the RCs. In this case, the overhead of transmitting the weights
degrades the overall performance. One idea is to
reuse the weights as much as possible. A direct method is to
adopt a “batch” strategy, in which a certain number of
input features, e.g. 100, are batched together to construct a
larger matrix, such as $F_{100 \times 9216}$. In this way, the weights
can be reused 100 times so that the time cost to load new weights
into the SRAM blocks can be hidden by the computing time.
Due to the capacity limitation of each SRAM block, the weight
matrix has to be divided into several smaller sub-matrices so
that each one can be loaded into one SRAM block. In order
to adopt the double-buffer strategy, the dimension of each
sub-matrix should not exceed half of the size of an SRAM block,
e.g. 128. As a result, the first dimension of the weight matrix
can be divided into 72 parts ($72 \times 128 = 9216$). Similarly,
the input feature matrix $F_{100 \times 9216}$ is divided into
$100 \times 72$ blocks. Each block is denoted as
$F^{i,j}_{1 \times 128}$, where i and j are
the block indices. Suppose there are 25 RCs in SDT-CGRA.
The number of SRAM blocks in all RCs is then 75. The size of
the second dimension of the weight matrix is not divisible by
75 ($4096 = 54 \times 75 + 46$). As a result, the weight matrix is
divided unequally into two types of sub-matrices. One is $t_1$:
$\omega^{m,n}_{128 \times 75}$ and the other is $t_2$:
$\omega^{m,n}_{128 \times 46}$, where m and n are the block indices.
After the partition, equation (2) can be expressed as
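$$O_{100 \times 4096} =
\begin{bmatrix}
F^{0,0} & \cdots & F^{0,71} \\
F^{1,0} & \cdots & F^{1,71} \\
\vdots & \vdots & \vdots \\
F^{99,0} & \cdots & F^{99,71}
\end{bmatrix}
\begin{bmatrix}
\omega^{0,0}_{t_1} & \cdots & \omega^{0,53}_{t_1} & \omega^{0,54}_{t_2} \\
\omega^{1,0}_{t_1} & \cdots & \omega^{1,53}_{t_1} & \omega^{1,54}_{t_2} \\
\vdots & \ddots & \vdots & \vdots \\
\omega^{71,0}_{t_1} & \cdots & \omega^{71,53}_{t_1} & \omega^{71,54}_{t_2}
\end{bmatrix}
\tag{3}$$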
To illustrate the computing process of equation (3) on the
SDT-CGRA, we take the multiplication of $F_{100 \times 9216}$
with $\omega^{0,0}_{t_1}$ as an example. The computing process mapped
into the SDT-CGRA is shown in Fig. 15. Firstly, the 75 columns of
$\omega^{0,0}_{t_1}$ are loaded into the 75 SRAMs in the RC array.
Then the sub-matrices $F^{0,0}, F^{1,0}, \ldots, F^{99,0}$ in the
first column of $F_{100 \times 9216}$ are broadcast to all the RCs one
by one for multiplication and accumulation. The results are sent to
the SBUs and then added to the products of $\omega^{1,0}_{t_1}$ and
the second column of $F_{100 \times 9216}$. This process is repeated
until the final matrix result is computed.
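The partition and accumulation order described above can be checked on small matrices with a blocked multiplication. This is a functional sketch only: the loop order mirrors the text (load one weight tile, broadcast every batch row, accumulate partial products), the function names are hypothetical, and SRAM, FIFO and SBU timing is not modelled.

```python
def split(total, block):
    """Split a dimension into full blocks plus an optional remainder block."""
    full, rest = divmod(total, block)
    return [block] * full + ([rest] if rest else [])


def blocked_matmul(F, W, row_block, col_block):
    """Multiply F (B x K) by W (K x M) one weight tile at a time."""
    B, K, M = len(F), len(W), len(W[0])
    out = [[0] * M for _ in range(B)]
    col_start = 0
    for cols in split(M, col_block):            # tile width (t1 or t2)
        row_start = 0
        for rows in split(K, row_block):        # tile height (half an SRAM block)
            for b in range(B):                  # broadcast each batch row
                for c in range(col_start, col_start + cols):
                    acc = 0
                    for r in range(row_start, row_start + rows):
                        acc += F[b][r] * W[r][c]
                    out[b][c] += acc            # accumulate partial products
            row_start += rows
        col_start += cols
    return out
```

For the AlexNet layer in the text, `split(9216, 128)` yields 72 blocks and `split(4096, 75)` yields 54 blocks of width 75 plus one of width 46, matching the $t_1$ and $t_2$ tiles.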
VI. EXPERIMENTAL RESULTS
A. Evaluation Setup
In order to evaluate the proposed architecture, seven al-
gorithms shown in Table I (see Section III) are selected as
benchmarks. The typical implementations and problem sizes
are listed in Table III. For example, we take k-means and SPM
from [29] for object detection, with a codebook vocabulary size
of 200. The evaluations of these algorithms on
different platforms are all based on the same problem size. It is
worth noting that all the algorithms are evaluated for inference,
while the training stage is not evaluated in this paper.
Fig. 15. The mapping strategy of large-scale matrix multiplication.
(Labels in the figure: input parameter, input data, partial products,
SBU, FIFO, SRAM.)
TABLE III
TYPICAL IMPLEMENTATIONS, APPLICATIONS AND PROBLEM SIZES OF
THE SELECTED ALGORITHMS
| Algorithm | Implementation | Problem Size |
|---|---|---|
| CNN | Caffe based on MKL and cuBLAS | AlexNet: 227 × 227 × 3 |
| k-means | MKL and cuBLAS | Vocabulary size: 200; Feature dimension: 128 |
| PCA | MKL and cuBLAS | Input dimension: 320000; Output dimension: 150 |
| SPM | MKL and cuBLAS | Pyramid layer: 3; Vocabulary size: 200 |
| Softmax | Caffe based on MKL and cuBLAS | Class number: 1000; Feature dimension: 4096 |
| SVM | MKL and cuBLAS | Feature dimension: 3780 |
| Joint Bayesian | MKL and cuBLAS | Feature dimension: 150 |
To map the algorithms into the SDT-CGRA, the static
configurations and dynamic VLIW instructions are written
manually in microcode format. To reduce the effort of
writing static and dynamic configurations, we encapsulate the
static configurations of the most common computing patterns
and the SBU read/write operations into a library. A given task
is then mapped into the SDT-CGRA by programming it with the
APIs provided by the library.
B. Implementation Details
The proposed SDT-CGRA contains $5 \times 5$ (25) RCs, 5
special RCs, 54 KByte of global SRAM (27 SBUs) and 54.6
KByte of local memory (including FIFOs and SRAMs). The
detailed information is shown in Table IV, where all the
computing units are designed based on 16-bit fixed-point
arithmetic, except for several 32-bit fixed-point adders and
shifters in the PRCs.
The whole SDT-CGRA is implemented in Verilog HDL and
then synthesized, placed and routed with Synopsys Design
Compiler and IC Compiler based on the SMIC 55 nm library.
The final results reported by the IC Compiler show that the
area of the proposed architecture is 5.19 mm2. The average
chip-only power consumption (dynamic power plus static
power) of SDT-CGRA is evaluated based on the simulation
wave files over the 7 selected benchmarks. The results show
that the average power dissipation is 0.84 W. The breakdown
of the area, the average power dissipation and the chip-only
energy consumption is listed in Table V, where we can
see that the RC array accounts for 65% of the chip area and
62.3% of the average power dissipation. The SBUs rank
second in all metrics. Besides, the delay of the critical
path is 2.21 ns, which means the architecture can run at 450
MHz. Since the SDT-CGRA contains 86 multipliers and 119
ALUs, its peak performance can reach 92.3 GOP/s. The layout
of the SDT-CGRA generated by IC Compiler is shown in
Fig. 16. It is worth noting that IO pads are not added,
since the SDT-CGRA is not designed as an independent chip
for acceleration. Instead, it is designed as a reconfigurable
accelerator in a typical System-on-Chip (SoC) for object
inference applications.
TABLE IV
DETAILED INFORMATION OF EACH UNIT
| Unit Type | Details | Number |
|---|---|---|
| RC | 16-bit fixed-point | 5 × 5 |
| PRC | 4 16-bit fixed-point multipliers; 2 16-bit fixed-point adders; 1 32-bit fixed-point adder; 2 32-bit fixed-point shifters | 2 |
| IRC | 16-bit fixed-point | 3 |
| FIFO | 16-bit × 64, 128 Byte | 4 for each RC, 3 for each PRC, 2 for each IRC |
| SRAM in RC | 16-bit × 256, 512 Byte | 3 |
| SBU | 1 SRAM, 1 Address Generator | 27 |
| SRAM in SBU | 16-bit × 1024, 2 KByte | 27 |
TABLE V
CHARACTERISTICS OF THE LAYOUT OF SDT-CGRA AND THE AVERAGE
POWER DISSIPATION AND TOTAL ENERGY CONSUMPTION OVER 7
BENCHMARKS
| Component | Area (µm²) | Chip-only Power (W) | Chip-only Energy (mJ) |
|---|---|---|---|
| SDT-CGRA | 5193865.40 (100%) | 0.841 (100%) | 29.443 (100%) |
| RC array | 3381494.93 (65.11%) | 0.524 (62.31%) | 18.327 (62.25%) |
| SBUs | 1449493.83 (27.91%) | 0.199 (23.66%) | 6.973 (23.68%) |
| Switch & Interfaces | 362876.64 (6.99%) | 0.118 (14.03%) | 4.143 (14.07%) |
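The peak performance and clock figures quoted above can be reproduced with two short helpers. The sketch assumes that every multiplier and every ALU completes one operation per cycle, which is consistent with the reported 92.3 GOP/s but is not stated explicitly in the text; the function names are hypothetical.

```python
def peak_gops(n_multipliers, n_alus, freq_mhz):
    """Peak throughput in GOP/s, assuming one operation per unit per cycle."""
    return (n_multipliers + n_alus) * freq_mhz / 1000.0


def max_frequency_mhz(critical_path_ns):
    """Highest clock frequency allowed by the critical path delay."""
    return 1000.0 / critical_path_ns
```

With 86 multipliers, 119 ALUs and a 450 MHz clock the first helper gives 92.25 GOP/s, and a 2.21 ns critical path allows roughly 452 MHz, just above the 450 MHz operating point.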
As for system power evaluation, we adopt an approximate
evaluation method proposed in [25] to estimate the energy
consumption of the whole system:
$$Energy = Energy_{SDT\text{-}CGRA} + Energy_{off\text{-}chip} \tag{4}$$
where $Energy_{SDT\text{-}CGRA}$ is the chip-only energy consumption
and $Energy_{off\text{-}chip}$ is the energy consumed by off-chip
memory accesses. According to [41], the typical energy
consumption of off-chip memory (e.g. DDR3) is 70 pJ/bit.
Consequently, we can estimate $Energy_{off\text{-}chip}$ from
the amount of data accessed in off-chip memory. For example, the
estimated power consumption of AlexNet is shown in the
following comparisons.
Fig. 16. The layout of SDT-CGRA (SMIC 55nm).
Fig. 17. Speedup of SDT-CGRA and GPU over CPU (the higher the better).
(Vertical axis: log(Speedup) vs. CPU; series: SDT-CGRA and GPU.)
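Equation (4) can be turned into a small estimator. This is a sketch under assumptions: the 70 pJ/bit figure is the one quoted from [41], the byte counts passed in are placeholders chosen by the caller, and the function names are hypothetical.

```python
OFF_CHIP_PJ_PER_BIT = 70.0  # typical DDR3 access energy quoted from [41]


def off_chip_energy_mj(bytes_accessed):
    """Energy of off-chip memory traffic in millijoules."""
    bits = bytes_accessed * 8
    return bits * OFF_CHIP_PJ_PER_BIT * 1e-12 * 1e3  # pJ -> J -> mJ


def system_energy_mj(chip_energy_mj, bytes_accessed):
    """Equation (4): chip-only energy plus off-chip access energy."""
    return chip_energy_mj + off_chip_energy_mj(bytes_accessed)
```

As an order-of-magnitude check, moving 1 MB of data to or from off-chip memory costs about 0.56 mJ under this model.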
C. Comparisons with CPU and GPU
The algorithms in Table III are mapped on the proposed
SDT-CGRA. We also implement these algorithms on both
CPU and GPU. The CPU solution is based on the Intel
E5-2637 (8 threads, 22nm process) with the state-of-the-art
Intel MKL library and Caffe [42] library, which are
multi-threaded and widely used in linear algebra computing and deep
learning applications. For the GPU solution, the algorithms are
programmed in CUDA with the cuBLAS [43] and Caffe libraries
on the Nvidia TitanX GPU (3584 CUDA cores, 16nm
process). In our evaluation approach, all the data are stored in
the device memories before execution. Hence, the time
cost of the GPU implementations does not include the time of data
transmission from host memory to the internal device memory.
For SDT-CGRA, we assume the data are stored in off-chip
memory with a bandwidth of 12.5 GB/s, which is a typical
value for DDR3. Besides, we take the thermal design power
(TDP) of the CPU (80 W) and the GPU (250 W) provided by the vendors
as the power consumption of these two devices. As we do not
take the energy consumption of off-chip memory in the CPU and
GPU into account, only $Energy_{SDT\text{-}CGRA}$ is used for a fair
comparison.
To evaluate the speedup, the CPU solution is selected
as the baseline for comparison. Fig. 17 shows the speedups
of different algorithms on SDT-CGRA and GPU vs.
CPU. It is clear that the SDT-CGRA is faster than the CPU
(log(Speedup) > 0) for all algorithms. For heavyweight
algorithms (including Softmax, PCA and CNN (AlexNet)),
whose computation complexities are much larger than those of the
other algorithms in our experimental setup, the GPU greatly
outperforms both the CPU and SDT-CGRA. For the lightweight algorithms
(including SVM, SPM, k-means and Joint Bayesian), SDT-CGRA
achieves a better speedup than the GPU, and the CPU is even faster
than the GPU for SPM and Joint Bayesian.
Fig. 18. Energy efficiency of SDT-CGRA compared to CPU and GPU.
(Vertical axis: log(Energy Efficiency); series: CPU/SDT-CGRA and
GPU/SDT-CGRA.)
Fig. 19. The time cost in dynamic configuration over the total computation
time. (Vertical axis: % Dynamic Configuration Time/Computation Time.)
However, in terms of energy efficiency, the SDT-CGRA
outperforms the CPU and GPU for all algorithms listed in Table
III. We select the SDT-CGRA as the baseline for comparison.
The results in Fig. 18 show that the energy cost of SDT-CGRA
is smaller than that of the CPU and GPU. More specifically,
SDT-CGRA achieves on average 343.8 times and 17.7
times higher energy efficiency for heavyweight algorithms (including
Softmax, PCA and CNN), and 621.0 times and 1261.8 times higher
energy efficiency for lightweight algorithms (including SVM,
SPM, k-means and Joint Bayesian) when compared to the CPU and
GPU respectively.
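The relation between speedup, power and energy efficiency used in this comparison can be written as a one-line helper. This is a sketch only: it takes energy as power multiplied by execution time, the function name is hypothetical, and the numbers in the usage note are illustrative placeholders, not measurements from the paper.

```python
def energy_efficiency_gain(ref_power_w, ref_time_s, acc_power_w, acc_time_s):
    """Ratio of the reference platform's energy to the accelerator's energy."""
    return (ref_power_w * ref_time_s) / (acc_power_w * acc_time_s)
```

For instance, a hypothetical accelerator drawing 0.8 W that finishes a task in half the time of an 80 W reference yields `energy_efficiency_gain(80, 1.0, 0.8, 0.5)`, i.e. a 200 times gain. This also shows how a platform can win on energy even where it loses on speed, since a speedup below 1 is outweighed by a large enough power ratio.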
As the SDT-CGRA is designed based on the proposed
dual-track programming model, the overheads of static
configuration on RCs are small enough to be neglected in the
selected algorithms. In contrast, the time cost of dynamic
configuration, which is used to schedule data streams, is much
higher. Fig. 19 shows the time cost of dynamic configuration
relative to the total computing time for each algorithm.
As illustrated in Fig. 13(b), the time cost of fetching
dynamic configuration contexts is mainly hidden by the time
cost of computation. As a result, dynamic configuration has
little effect on the overall processing performance.
D. Comparisons with FPGA and ASIC
Several representative highly customized implementations
of CNN on FPGA and ASIC are adopted for comparison. In
[44], the acceleration of the five convolutional layers of AlexNet
on a Xilinx VC707 FPGA board is reported. In [45], a fully
pipelined architecture (each layer is one pipeline stage) is
proposed to accelerate all the layers of AlexNet on FPGA.
In [9], an ASIC designed for CNN acceleration only is
presented. The results are shown in Table VI, where the power
consumption refers to the system power, which includes the
power dissipation in off-chip memory. It is worth noting that
only the results of convolutional layers are reported in [44]
and [9]. As a result, we calculate the energy consumption and
the energy efficiency in two different ways: convolutional layers
only and all layers. From Table VI we can see that SDT-CGRA
achieves better results than the FPGA implementations in energy
consumption, GOPS per watt and frames per watt. Specifically,
SDT-CGRA is 1.78 times better in energy efficiency than the
state-of-the-art acceleration of AlexNet on FPGA [45]. As an ASIC
implementation, [9] achieves a better energy efficiency than
SDT-CGRA, due to its highly customized (for CNN) architecture and
memory system. However, SDT-CGRA is a flexible architecture that
is designed to support not only CNN but a wide range of
algorithms.
E. Comparison with CGRA Implementations
EMAX [22] is proposed to accelerate convolutional neural
networks. In [22], the number of operations per memory bandwidth
is adopted as the criterion to evaluate the architecture, and only
the mapping results of the second convolutional layer of AlexNet
are provided. According to [22], the number of operations
per memory bandwidth in EMAX is about 6. SDT-CGRA as a whole
can reach 78.75 GOP/s when mapped
with the second convolutional layer of AlexNet, while
requiring 4.5 GB/s of off-chip memory bandwidth. As
a result, the number of operations per memory bandwidth
in SDT-CGRA reaches 17.5, which is almost 3 times that of
EMAX.
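The ratio quoted above follows directly from the two reported figures. As a worked check, using only the numbers in the text and assuming the EMAX figure of about 6 is expressed in the same unit:

```latex
\frac{78.75\ \text{GOP/s}}{4.5\ \text{GB/s}} = 17.5\ \text{operations per byte},
\qquad
\frac{17.5}{6} \approx 2.9
```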
M-CGRA [23] is a CGRA architecture that is designed
for CNN acceleration. For comparison purposes, we list the
mapping results of AlexNet on M-CGRA and SDT-CGRA in
Table VII. The power or energy consumption is not available
in [23]. From Table VII we can see that SDT-CGRA achieves a
13.4 times speedup compared to M-CGRA.
TABLE VI
COMPARISON WITH FPGA AND ASIC ACCELERATORS
| | FPGA’15 [44] | FPL’16 [45] | JSSC’17 [9] | SDT-CGRA |
|---|---|---|---|---|
| Process | 28nm | 28nm | 65nm | 55nm |
| Conv1 time (ms) | 7.67 | <2.56 | 5.23 | 4.23 |
| Conv2 time (ms) | 5.35 | <2.56 | 10.48 | 8.21 |
| Conv3 time (ms) | 3.79 | <2.56 | 5.9 | 4.82 |
| Conv4 time (ms) | 2.88 | <2.56 | 4.6 | 3.62 |
| Conv5 time (ms) | 1.93 | <2.56 | 2.63 | 2.36 |
| FC6 time (ms) | – | <2.56 | – | 3.31 |
| FC7 time (ms) | – | <2.56 | – | 1.47 |
| Softmax time (ms) | – | <2.56 | – | 0.36 |
| Total time (ms) | 21.62 | 2.56 | 28.84 | 28.38 |
| Frequency (MHz) | 100 | 156 | 200 | 450 |
| GOPS | 61.62 | 565.94 | 46.2 | 77.4 |
| System Power (W) | 18.61 | 30.2 | 0.577 | 1.526 |
| Energy (mJ), c | 402.3 | – | 16.6 | 35.6 |
| Energy (mJ), a | – | 77.3 | – | 43.3 |
| Energy Efficiency, c | 1 | – | 24.2 | 11.3 |
| Energy Efficiency, a | – | 1 | – | 1.78 |
| GOPS/W | 3.31 | 18.7 | 80.07 | 50.78 |
| Frame/W | 2.49 | 12.9 | 60.09 | 28.20 |
c: contains convolutional layers only. a: contains all layers.
TABLE VII
COMPARISON BETWEEN M-CGRA AND SDT-CGRA
| | M-CGRA [23] | SDT-CGRA |
|---|---|---|
| Conv1 time (ms) | 80.72 | 4.23 |
| Conv2 time (ms) | 98.30 | 8.21 |
| Conv3 time (ms) | 106.17 | 4.82 |
| Conv4 time (ms) | 19.91 | 3.62 |
| Conv5 time (ms) | 13.27 | 2.36 |
| FC6 time (ms) | 55.38 | 3.31 |
| FC7 time (ms) | 5.24 | 1.47 |
| FC8 time (ms) | 1.31 | 0.36 |
| Total time (ms) | 380.3 | 28.38 |
| Speedup | 1 | 13.4 |
| GOPS | 86.37 | 77.49 |
| System Power (W) | – | 1.526 |
F. Performance Scalability
The performance scalability of the SDT-CGRA is explored
using several computing patterns. According to [46], [47],
the number of operations per word (denoted by $V_a$) of a
given algorithm determines the upper bound of its computing
performance. As a result, computing patterns that
have different values of $V_a$ are selected for this study, including
(1) convolution in a convolutional layer: $V_a \approx 2 \times k^2 \times O$
($O \times k^2 \ll N^2$) or $V_a \approx 2N^2$ ($O \times k^2 \gg N^2$),
where k is the size of the convolution kernel, N is the size of the input
feature map and O is the number of output feature maps; (2) vector-vector
multiplication: $V_a = 1$; (3) vector-matrix multiplication:
$V_a = 2N/(N - 1)$, where N is the size of the vector; (4) matrix-matrix
multiplication: $V_a = N$, where N is the size of the matrix.
Fig. 20. The scalability of different computing patterns with different
numbers of RCs. conv1 refers to the first convolutional layer of AlexNet;
softmax-batchsize10 refers to softmax with a batch size of 10; RC256
means that the number of RCs is 256. The other labels follow the same
pattern. (Vertical axis: Speedup, logarithmic scale; series: RC1, RC2, RC4,
RC9, RC16, RC25, RC64, RC100, RC256.)
The bandwidth of the off-chip memory for the SDT-CGRA
is assumed to be 12.5 GB/s. With this premise, the performance
scalability of SDT-CGRA is estimated with different
numbers of RCs. The results are shown in Fig. 20, where the
performance improvements of conv1 (the first convolutional
layer in AlexNet) and conv3 (the third convolutional layer in
AlexNet) are nearly linear with the increase of the number
of RCs. For conv1, the performance under RC100 and
RC256 is nearly the same. The reason is that the performance
saturates when the number of output feature maps O is less than
the number of RCs. As for svm, which involves
vector-vector multiplication, the performance relies heavily on
off-chip memory bandwidth. The softmax, an algorithm that
contains a vector-matrix multiplication, suffers
from the same problem. However, when we adopt the batch
processing technique (the vector-matrix multiplication is
converted to a matrix-matrix multiplication), the problem can be
alleviated. For example, when the batch size increases from 1
to 50, the performance improves correspondingly with the
increase in the number of RCs. It is worth noting that batch
processing is only useful for applications that are not sensitive
to latency.
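The bandwidth-bound behaviour discussed above can be approximated with a roofline-style estimate. This is a sketch under stated assumptions, not the authors' estimation method: each RC is taken to contribute 7 operations per cycle (3 multipliers plus 4 ALUs), words are 16 bits wide, the convolution ratio counts input and weight words only, and the function names are hypothetical. Saturation caused by too few output feature maps is not modelled.

```python
FREQ_GHZ = 0.45
UNITS_PER_RC = 7         # assumed: 3 multipliers + 4 ALUs per RC
BANDWIDTH_GB_S = 12.5    # off-chip bandwidth assumed in the text
BYTES_PER_WORD = 2       # assumed: 16-bit fixed-point words


def conv_ops_per_word(k, n, o):
    """Operations per word of a convolution, counting input and weight words."""
    return 2 * k * k * n * n * o / (n * n + o * k * k)


def attainable_gops(n_rc, ops_per_word):
    """Roofline bound: the smaller of the compute and memory limits."""
    compute_bound = n_rc * UNITS_PER_RC * FREQ_GHZ
    memory_bound = ops_per_word * BANDWIDTH_GB_S / BYTES_PER_WORD
    return min(compute_bound, memory_bound)
```

Under these assumptions 25 RCs give a compute bound of 78.75 GOP/s, while a pattern with one operation per word stays at 6.25 GOP/s however many RCs are added, which matches the qualitative behaviour reported for svm.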
VII. CONCLUSION
This paper proposes a stream processing, dual-track
programming coarse-grained reconfigurable architecture which
targets algorithms in the object inference flow. The proposed
SDT-CGRA is implemented using the SMIC 55nm standard
cell technology with a footprint of 5.19 mm2. When running
at 450 MHz over 7 typical algorithms, the average chip-only
power consumption of SDT-CGRA is 0.84 W. When compared
to CPU and GPU, the SDT-CGRA gains on average
343.8 times and 17.7 times higher energy efficiency for heavyweight
algorithms, and 621.0 times and 1261.8 times higher energy efficiency
for lightweight algorithms, respectively. Although the SDT-CGRA
does not achieve competitive energy efficiency compared
to the ASIC solution for specific algorithms, SDT-CGRA in 55nm
technology achieves a 1.78 times improvement in
energy efficiency compared to the state-of-the-art solution for
AlexNet on FPGA in 28nm technology. When compared to other
CGRA approaches, SDT-CGRA is 3 times better
than EMAX in terms of operations per memory bandwidth
and 13 times faster than M-CGRA. Current and
future work includes extending the proposed approach to
other applications, and automating the development of the
associated compilation and debugging tools.
REFERENCES
[1] X. Fan, H. Li, W. Cao, and L. Wang, “DT-CGRA: Dual-Track Coarse-
Grained Reconfigurable Architecture for Stream Applications,” in Inter-
national Conference on Field Programmable Logic and Applications,
pp. 78–86, 2016.
[2] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying
Convolutional Neural Networks Concepts to Hybrid NN-HMM Model
for Speech Recognition,” pp. 4277–4280, 2012.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classifica-
tion with Deep Convolutional Neural Networks,” Advances in Neural
Information Processing Systems, vol. 25, no. 2, p. 2012, 2012.
[4] R. Ranganath, A. Perotte, N. Elhadad, and D. Blei, “Deep Survival
Analysis,” arXiv preprint arXiv:1608.02158, 2016.
[5] N. Jouppi, “Google Supercharges Machine Learning Tasks with
TPU Custom Chip.” https://cloudplatform.googleblog.com/2016/05/
Google-supercharges-machine-learning-tasks-with-custom-chip.html.
Accessed November 26, 2016.
[6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, et al.,
“In-Datacenter Performance Analysis of a Tensor Processing Unit,” in
2017 ACM/IEEE 44th Annual International Symposium on Computer
Architecture (ISCA), Early Accessed, 2017.
[7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, and N. Sun, “DaDianNao: A Machine-Learning Supercomputer,”
in IEEE/ACM International Symposium on Microarchitecture, pp. 609–
622, 2014.
[8] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the
Sensor,” in 2015 ACM/IEEE 42nd Annual International Symposium on
Computer Architecture (ISCA), pp. 92–104, 2015.
[9] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–
138, 2017.
[10] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen,
“Cambricon: An Instruction Set Architecture for Neural Networks,” in
2016 ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), pp. 393–405, 2016.
[11] M. J. Flynn and W. Luk, Computer System Design: System-on-Chip.
Wiley, 2011.
[12] G. Ansaloni, P. Bonzini, and L. Pozzi, “EGRA: A Coarse Grained
Reconfigurable Architectural Template,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 19, no. 6, pp. 1062–1074,
2011.
[13] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou, “A Fully Pipelined
and Dynamically Composable Architecture of CGRA,” in 2014 IEEE
22nd Annual International Symposium on Field-Programmable Custom
Computing Machines, pp. 9–16, 2014.
[14] J. Davila, A. de Torres, J. M. Sanchez, M. Sanchez-Elez,
N. Bagherzadeh, and F. Rivera, “Design and Implementation of a Ren-
dering Algorithm in a SIMD Reconfigurable Architecture (MorphoSys),”
in Proceedings of the Conference on Design, Automation and Test in
Europe: Designers’ Forum, pp. 52–57, 2006.
[15] F. Bouwens, M. Berekovic, A. Kanstein, and G. Gaydadjiev, “Architec-
tural Exploration of the ADRES Coarse-Grained Reconfigurable Array,”
in Reconfigurable Computing: Architectures, Tools and Applications,
Third International Workshop, ARC 2007, Mangaratiba, Brazil, March,
pp. V–G, 2007.
[16] V. Govindaraju, C. H. Ho, T. Nowatzki, J. Chhugani, N. Satish,
K. Sankaralingam, and C. Kim, “DySER: Unifying Functionality and
Parallelism Specialization for Energy-Efficient Computing,” IEEE Mi-
cro, vol. 32, no. 5, pp. 38–51, 2012.
[17] O. Atak and A. Atalar, “BilRC: An Execution Triggered Coarse Grained
Reconfigurable Architecture,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 21, no. 7, pp. 1285–1298, 2013.
[18] Y. Huang, P. Ienne, O. Temam, Y. Chen, and C. Wu, “Elastic CGRAs,”
in ACM/SIGDA International Symposium on Field Programmable Gate
Arrays, pp. 171–180, 2013.
[19] J. Cortadella, M. Kishinevsky, and B. Grundmann, “Synthesis of Syn-
chronous Elastic Architectures,” in Proceedings of the 43rd Annual Design
Automation Conference, pp. 657–662, 2006.
[20] L. Liu, D. Wang, M. Zhu, and Y. Wang, “An Energy-Efficient Coarse-
Grained Reconfigurable Processing Unit for Multiple-Standard Video
Decoding,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1706–
1720, 2015.
[21] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf,
“A Programmable Parallel Accelerator for Learning and Classification,”
in Proceedings of the 19th International Conference on Parallel Archi-
tectures and Compilation Techniques, pp. 273–284, ACM, 2010.
[22] M. Tanomoto, S. Takamaeda-Yamazaki, J. Yao, and Y. Nakashima,
“A CGRA-Based Approach for Accelerating Convolutional Neural
Networks,” in IEEE International Symposium on Embedded
Multicore/Many-core Systems-on-Chip, pp. 73–80, 2015.
[23] K. Ando, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, and M. Motomura,
“A Multithreaded CGRA for Convolutional Neural Network Process-
ing,” Circuits and Systems, vol. 8, no. 6, pp. 149–170, 2017.
[24] D. Shin, J. Lee, J. Lee, and H. J. Yoo, “14.2 DNPU: An 8.1TOPS/W
Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural
Networks,” in IEEE International Solid-State Circuits Conference,
pp. 240–241, 2017.
[25] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, “Deep Convolu-
tional Neural Network Architecture With Reconfigurable Computation
Patterns,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 25, no. 8, pp. 2220–2233, 2017.
[26] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, L. Liu, and S. Wei, “A 1.06-
to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for
Deep Learning Applications,” in 2017 Symposium on VLSI Circuits,
pp. C26–C27, 2017.
[27] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zis-
serman, “The Pascal Visual Object Classes (VOC) Challenge,” Interna-
tional Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A Large-Scale Hierarchical Image Database,” in IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2009.
[29] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond Bags of Features: Spatial
Pyramid Matching for Recognizing Natural Scene Categories,” in IEEE
Computer Society Conference on Computer Vision and Pattern Recognition,
pp. 2169–2178, 2006.
[30] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian Face
Revisited: A Joint Formulation,” in European Conference on Computer
Vision, pp. 566–579, 2012.
[31] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A Discriminatively
Trained, Multiscale, Deformable Part Model,” in IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1–8, 2008.
[32] H. Bay, T. Tuytelaars, and L. V. Gool, “SURF: Speeded Up Robust
Features,” Computer Vision and Image Understanding, vol. 110, no. 3,
pp. 404–417, 2006.
[33] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2015.
[34] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for
Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
[35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going Deeper with Convolutions,”
in 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 1–9, 2015.
[36] L. Wan, C. Dong, and D. Chen, “A Coarse-Grained Reconfigurable
Architecture with Compilation for High Performance,” International
Journal of Reconfigurable Computing, vol. 2012, no. 2, 2012.
[37] F. Bistouni and M. Jahanshahi, “Scalable Crossbar Network: a Non-
blocking Interconnection Network for Large-scale Systems,” The Jour-
nal of Supercomputing, vol. 71, no. 2, pp. 697–728, 2015.
[38] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and
M. A. Horowitz, “Convolution Engine: Balancing Efficiency & Flexibility
in Specialized Computing,” ACM SIGARCH Computer Architecture News,
vol. 41, no. 3, pp. 24–35, 2013.
[39] S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful
Visual Performance Model for Multicore Architectures,” Communica-
tions of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[40] C. Lomont, “Fast Inverse Square Root,” Tech. Rep., 2003.
[41] JEDEC, “DDR3 SDRAM Standard, JESD79-3F.” http://www.jedec.org/standards-documents/
docs/jesd-79-3d. Accessed November 26, 2016.
[42] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, et al.,
“Caffe: Convolutional Architecture for Fast Feature Embedding,”
arXiv e-print, pp. 675–678, 2014.
[43] Nvidia, “cuBLAS.” https://developer.nvidia.com/cublas. Accessed
November 26, 2016.
[44] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based Accelerator Design for Deep Convolutional Neural Net-
works,” in ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, pp. 161–170, 2015.
[45] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A High
Performance FPGA-based Accelerator for Large-Scale Convolutional
Neural Networks,” in International Conference on Field Programmable
Logic and Applications, pp. 69–77, 2016.
[46] R. Gruber, P. Volgers, A. D. Vita, M. Stengel, and T. M. Tran,
“Parameterisation to Tailor Commodity Clusters to Applications,” Future
Generation Computer Systems, vol. 19, no. 1, pp. 111–120, 2003.
[47] S. Rousseau, D. Hubaux, and P. Guisset, “A High Performance FPGA-
Based Accelerator for BLAS Library Implementation,” 2007.
Xitian Fan received the B.S. degree from the School of Physics and
Engineering, Sun Yat-Sen University, Guangzhou, China, in 2012, and the
M.S. degree from the School of Microelectronics, Fudan University, Shanghai,
China, in 2014. He is currently pursuing the Ph.D. degree in the School of
Information Science and Technology, Fudan University, Shanghai, China. His
research interests include reconfigurable computing, computer architecture,
and machine learning acceleration on FPGAs.
Di Wu received the B.S. degree from the School of Information Science and
Technology, Fudan University, Shanghai, China, in 2015. He is currently
pursuing the Ph.D. degree in the School of Information Science and
Technology, Fudan University, Shanghai, China. His research interests include
computer architecture and the development of reconfigurable systems.
Wei Cao received the B.S. and M.S. degrees from Heilongjiang University
in 1996 and 2000, respectively, and the Ph.D. degree from Harbin Institute of
Technology in 2006. Since September 2009, he has worked as an
assistant researcher in the State Key Lab of ASIC and System, Fudan University.
His research interests include reconfigurable computing, FPGA architectures,
and VLSI architectures for digital video and image processing.
Wayne Luk (F’09) received the M.A., M.Sc. and D.Phil. degrees in
engineering and computing science from Oxford University, Oxford, U.K.
Currently Professor of Computer Engineering at Imperial College, he
founded and leads the Computer Systems Section and the Custom Computing
Group in the Department of Computing, and was Visiting Professor at Stanford
University and Queen’s University Belfast. He is a member of the Program
Committee of many international conferences such as FCCM, FPL and FPT.
He has been an author or editor for 6 books and 4 special journal issues.
Dr. Luk had 15 papers that received awards from various conferences such
as ASAP, FPL, FPT, SAMOS, SPL and ERSA, and he also won a Research
Excellence Award from Imperial College in 2006. He is a Fellow of the Royal
Academy of Engineering and the BCS, and was founding Editor-in-Chief for
ACM Transactions on Reconfigurable Technology and Systems.
Lingli Wang (M’99) received the M.S. degree from Zhejiang University,
Hangzhou, China, in 1997, and the Ph.D. degree from Edinburgh Napier
University, Edinburgh, U.K., in 2001, both in electrical engineering. He
was with the Altera European Technology Center for four years. In 2005, he
joined Fudan University, Shanghai, China, where he is currently a Full
Professor with the State Key Laboratory of ASIC and System in the School
of Microelectronics. His current research interests include logic synthesis,
reconfigurable computing, and quantum computing.
