Multi-Mode Inference Engine for Convolutional Neural Networks by Ardakani, Arash et al.
Multi-Mode Inference Engine for Convolutional Neural
Networks
Arash Ardakani, Carlo Condo and Warren J. Gross
Electrical and Computer Engineering Department, McGill University, Montreal, Quebec, Canada
ABSTRACT
During the past few years, interest in convolutional neural networks (CNNs) has
risen constantly, thanks to their excellent performance on a wide range of
recognition and classification tasks. However, they suffer from the high level of
complexity imposed by the high-dimensional convolutions in convolutional layers.
In scenarios with limited hardware resources and tight power and latency
constraints, the high computational complexity of CNNs makes them difficult to
deploy. Hardware solutions have striven to reduce the power consumption using
low-power techniques, and to limit the processing time by increasing the number
of processing elements (PEs). While most ASIC designs claim a peak performance
of a few hundred giga operations per second, their average performance is
substantially lower when applied to state-of-the-art CNNs such as AlexNet,
VGGNet and ResNet, leading to low resource utilization. Their performance
efficiency is limited to less than 55% on average, which leads to unnecessarily
high processing latency and silicon area. In this paper, we propose a dataflow
that enables both the fully-connected and convolutional computations to be
performed for any filter/layer size using the same PEs. We then introduce a
multi-mode inference engine (MMIE) based on the proposed dataflow. Finally, we
show that the proposed MMIE achieves a performance efficiency of more than 84%
when performing the computations of three renowned CNNs (i.e., AlexNet, VGGNet
and ResNet), outperforming the best architecture in the state of the art in terms
of energy consumption, processing latency and silicon area.
1. INTRODUCTION
Deep neural networks (DNNs), especially convolutional neural networks (CNNs) [1],
have received tremendous attention due to their ability to surpass human-level
accuracy on a wide range of complex tasks such as recognition, classification
and detection [2]. Depending on their size and complexity, these networks achieve
different degrees of classification/recognition accuracy. A CNN is a stack of
multiple convolutional layers followed by fully-connected layers: the
convolutional layers extract high-level abstractions and features of raw data,
whereas the fully-connected layers are used to learn non-linear combinations of
the extracted features. In 2012, a CNN called AlexNet [3] was introduced: it
consists of 5 convolutional layers followed by 3 fully-connected layers and
achieves a 42.9% misclassification rate (MCR) on
the ImageNet dataset. AlexNet contains 2.3M weights and
58.6M weights in its convolutional and fully-connected lay-
ers, respectively, performing 1332M operations (i.e., 666M
multiplications-accumulations) in its convolutional layers and
117.2M operations (i.e., 58.6M multiplications-accumulations)
in its fully-connected layers. VGGNet-16 [4] is another well-
known CNN, containing 13 convolutional layers with 14.7M
weights and 3 fully-connected layers with 124M weights.
VGGNet-16 performs 30.6G operations in its convolutional
layers and 248M operations in its fully-connected layers,
achieving 27% MCR on ImageNet. Recently, ResNet-50 [5],
containing 49 convolutional layers with 23.5M weights and
1 fully-connected layer with 2M weights, achieved a bet-
ter MCR (i.e., 22.85% on ImageNet) by going even deeper.
ResNet-50 respectively performs 7G and 4M operations within
the two types of layers. All these CNNs have won the Ima-
geNet Large Scale Visual Recognition Challenge (ILSVRC)
[6].
Although in almost all the aforementioned CNNs the majority of weights are found
in fully-connected layers, the number of operations is dominated by convolutions.
As a result, the processing time of CNNs is also dominated by the convolutional
processes. This issue can be addressed by exploiting parallel processing elements
(PEs) to increase throughput. However, a straightforward parallelization requires
high data movement and bandwidth, leading to high energy consumption [7]. It is
worth noting that accesses to off-chip memory are more expensive than accesses to
on-chip storage, as shown in [8].
Pruning techniques were first introduced in [9, 10] to re-
duce the number of parameters and memory accesses to off-
chip memory. In [9] CPU/GPU implementations were con-
sidered, showing that 3× to 4× layer-wise speedup can be
obtained for fully-connected layers without any practical
speedup for convolutional layers. To accelerate convolutional
processes on GPUs and CPUs, a new method was also in-
troduced in [11], achieving up to 5.1× speedup. The work
presented in [12] introduces a fully-connected accelerator,
called efficient inference engine (EIE), for the pruning tech-
nique introduced in [9, 10]. EIE can obtain 13× to 307×
speedup, and save 2700× to 24000× energy compared to
CPUs or GPUs for fully-connected computations. Recently,
a new pruning technique and its custom hardware were intro-
duced in [13], using low-cost linear-feedback shift registers
(LFSRs) to prune the connectivity of fully-connected lay-
ers. This technique also saves up to 90% energy compared
to conventional implementations of fully-connected layers.
However, as discussed earlier, convolutional processes are
the bottleneck of the processing time of CNNs.
arXiv:1712.03994v1 [cs.AR] 11 Dec 2017
During the past few years, many convolutional accelerators
with different dataflows have been introduced in literature [14–
19]. While these ASIC architectures can successfully reduce
the energy consumption of convolutional processes and meet
the latency constraints of small CNNs such as AlexNet, they
fail to employ the full potential of their architectures, resulting
in a low performance efficiency. In fact, there is a huge
gap between their peak performance and average runtime
performance. For instance, in [14] the architecture known
as Eyeriss achieves a peak performance of 84 Gops, where
each MAC is considered as two operations. However, its
performance efficiency is limited to 55% and 26% when
performing the convolutional computations of AlexNet and
VGGNet-16, respectively.
To improve the performance efficiency and to accelerate
the convolutional processes for VGG and VGG-like networks,
a dataflow, called fully-connected inspired dataflow (FID),
and the architecture implementing it were introduced in [20].
This architecture achieves a high performance efficiency of
90% on the convolutional processes of VGGNet-16. Despite
its high performance efficiency, throughput and low silicon
area, it is limited to architectures with 3×3 filters.
In this paper, we propose a dataflow supporting all the filter sizes used in
state-of-the-art CNNs by generalizing FID. We provide a theoretical framework
showing that
the proposed generalized FID (GFID) can perform both the
fully-connected and convolutional processes while using the
same hardware resources, resulting in a high utilization factor.
We then propose a CNN accelerator based on the proposed
GFID, that performs both fully-connected and convolutional
computations, which is hereafter referred to as multi-mode
inference engine (MMIE). MMIE is optimized to achieve
high performance efficiency and low memory accesses to
the off-chip memory, while keeping the power consumption
below the budget of mobile/embedded devices. Finally, we
evaluate the performance of MMIE on the state-of-the-art
CNN models (i.e., AlexNet, VGGNet-16 and ResNet-50) and
show that MMIE performs the convolutional computations of
these CNNs with an 83% minimum performance efficiency.
2. PRELIMINARIES
A fully-connected network is a stack of layers where each neuron is connected to
every neuron in the previous and next layers, and a weight w is associated with
each connection. A fully-connected layer with n inputs and m outputs performs the
following computation:
$$
y = \mathrm{ReLU}\left(w_{m \times n}\, x_{n \times 1} + b_{m \times 1}\right), \tag{1}
$$
where x denotes the input pixels, y the output pixels, b the biases, and ReLU is
the non-linear activation function ReLU(x) = max(0, x). According to (1), the
fully-connected
computational kernel calculates numerous vector-matrix mul-
tiplications followed by the ReLU. Due to parallel mem-
ory access requirement for fully-parallel implementations of
such networks, a semi-parallel implementation is a typical
approach for their hardware implementations [20]. In semi-
parallel implementations, only a limited number of PEs is
instantiated, and computations for each neuron are performed
serially [21]. In fact, different trade-offs between area occu-
pation and latency can be obtained by changing the degree of
parallelism.

Figure 1: The high-dimensional convolutions in a convolutional layer. The input
activation maps (Hin × Win × Cin) are convolved with Cout filters (Hf × Wf × Cin
each) to produce the output activation maps (Hout × Wout × Cout).
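As a rough illustration of the semi-parallel scheme described above, the following Python sketch evaluates Eq. (1) with a limited number of accumulators. The function name `fc_semi_parallel`, the `num_pes` parameter and the cycle counter are illustrative assumptions and are not taken from [20] or [21]; the sketch only mirrors the idea that every PE sees the same input pixel while holding a different row of weights.

```python
def relu(v):
    """ReLU(x) = max(0, x)."""
    return v if v > 0 else 0


def fc_semi_parallel(w, x, b, num_pes):
    """Evaluate y = ReLU(w x + b) using num_pes accumulators at a time.

    w is an m x n list of lists, x has n entries, b has m entries.
    Returns the outputs and the number of clock cycles of this simple model.
    """
    m, n = len(w), len(x)
    y = [0] * m
    cycles = 0
    for start in range(0, m, num_pes):
        group = range(start, min(start + num_pes, m))
        acc = {r: b[r] for r in group}      # each PE starts from its bias
        for c in range(n):                  # one input pixel per clock cycle
            for r in group:                 # all PEs share the same pixel x[c]
                acc[r] += w[r][c] * x[c]
            cycles += 1
        for r in group:
            y[r] = relu(acc[r])
    return y, cycles


# Made-up numbers: 4 neurons, 3 inputs, 2 PEs.
w = [[1, -2, 3], [0, 1, -1], [2, 2, 2], [-1, -1, -1]]
x = [1, 2, 3]
b = [0, 1, -20, 0]
y, cycles = fc_semi_parallel(w, x, b, num_pes=2)
print(y, cycles)
```

With 4 neurons and 2 PEs, this model needs two passes over the 3 inputs, i.e. 6 clock cycles; raising `num_pes` trades area for latency.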
Inspired by the organization of the animal visual cortex, it
was shown that the connectivity of neurons in convolutional
layers can be mathematically described by a convolution
operation [22]. All neurons in a convolutional layer share a
set of weights, also referred to as a filter.
The main computational kernel of a convolutional layer
involves high-dimensional convolutions, as shown in Fig. 1.
The convolutional layers take input pixels, which are also
called input activation maps, arranged in 3 dimensions (i.e.,
height Hin, width Win and channel Cin), and generate output
pixels, which are also called output activation maps, arranged
in 3 dimensions (i.e., height Hout , width Wout and channel
Cout). This transformation is a result of the convolution be-
tween the input activation maps and a set of Cout 3D filters.
More precisely, every single 2D Hout × Wout plane of the output activation maps
is the result of the convolution between the 3D input activation maps and one of
the Cout 3D filters. In fact, a summation of multiple plane-wise 2D convolutions
forms a 3D convolution. At the end, a 1D bias is added to the results of the 3D
convolutions. In summary, with the input activation maps, the output activation
maps, the filters and the bias matrices denoted as X, Y, W and B, respectively,
the convolutional processes can be expressed as
$$
\begin{aligned}
Y(z,t,q) &= B(q) + \sum_{k=1}^{C_{in}} \sum_{j=1}^{H_f} \sum_{i=1}^{W_f} X(zS+j,\; tS+i,\; k) \times W(j,i,k,q), \\
H_{out} &= (H_{in} - H_f + S)/S, \\
W_{out} &= (W_{in} - W_f + S)/S,
\end{aligned} \tag{2}
$$
where 1 ≤ z ≤ Hout, 1 ≤ t ≤ Wout and 1 ≤ q ≤ Cout. The stride S represents the
number of activation map pixels by which the filter is shifted after each
convolution. Contrary to the fully-connected layers, convolutional computations
are dominated by numerous MACs according to Eq. (2), leading
to a high degree of computational complexity.
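Eq. (2) can be read as six nested loops. The sketch below is a plain Python reference of that computation, assuming zero-based indices (so the input index zS + j of Eq. (2) becomes `z*S + j` with z and j starting at 0), no padding, and a nested-list layout `X[h][w][c]`, `W[j][i][k][q]` chosen here purely for illustration. The numeric values in the example are made up.

```python
def conv_layer(X, W, B, S):
    """Direct evaluation of Eq. (2) with zero-based indices.

    X[h][w][c]    : input activation maps  (H_in x W_in x C_in)
    W[j][i][k][q] : filters                (H_f x W_f x C_in x C_out)
    B[q]          : biases                 (C_out)
    Returns Y[z][t][q] of size H_out x W_out x C_out.
    """
    H_in, W_in, C_in = len(X), len(X[0]), len(X[0][0])
    H_f, W_f, C_out = len(W), len(W[0]), len(W[0][0][0])
    H_out = (H_in - H_f + S) // S
    W_out = (W_in - W_f + S) // S
    Y = [[[B[q] for q in range(C_out)] for _ in range(W_out)]
         for _ in range(H_out)]
    for z in range(H_out):
        for t in range(W_out):
            for q in range(C_out):
                for k in range(C_in):
                    for j in range(H_f):
                        for i in range(W_f):
                            Y[z][t][q] += X[z * S + j][t * S + i][k] * W[j][i][k][q]
    return Y


# A 2 x 8 single-channel input holding 1..16 and one 1 x 3 filter (1, 10, 100).
X = [[[8 * h + w + 1] for w in range(8)] for h in range(2)]
W = [[[[1]], [[10]], [[100]]]]
B = [0]
Y = conv_layer(X, W, B, S=1)
print([px[0] for px in Y[0]])
print([px[0] for px in Y[1]])
```

The MAC count of this loop nest is Hout × Wout × Cout × Cin × Hf × Wf, which is why the convolutional layers dominate the operation count.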
2.1 Fully-Connected Inspired Dataflow for Convolutional Computations
In [20], FID was introduced. It can be used to efficiently perform the
computations of convolutional layers with the filter parameter Wf fixed to 3.
Note that 2D convolution is the weighted summation of each pixel of an input
image with its neighboring pixels. Consider an input image as a 2×8 matrix X, a
filter as a 1×3 matrix W and an output as a 2×6 matrix Y, such that
$$
X = \begin{bmatrix} X_1 & X_2 & \cdots & X_8 \\ X_9 & X_{10} & \cdots & X_{16} \end{bmatrix}, \quad
W = \begin{bmatrix} W_1 & W_2 & W_3 \end{bmatrix}, \quad
Y = \begin{bmatrix} Y_1 & Y_2 & \cdots & Y_6 \\ Y_7 & Y_8 & \cdots & Y_{12} \end{bmatrix}.
$$

Table 1: The FID for Convolutional Computations.

            | Outputs                 | Outputs
            | 1st row of output map   | 2nd row of output map
CC   Input  | #1  #2  #3  #4  #5  #6  | #7  #8  #9  #10 #11 #12
#1   X1 ×   | W1  0   0   0   0   0   | 0   0   0   0   0   0
#2   X2 ×   | W2  W1  0   0   0   0   | 0   0   0   0   0   0
#3   X3 ×   | W3  W2  W1  0   0   0   | 0   0   0   0   0   0
#4   X4 ×   | 0   W3  W2  W1  0   0   | 0   0   0   0   0   0
#5   X5 ×   | 0   0   W3  W2  W1  0   | 0   0   0   0   0   0
#6   X6 ×   | 0   0   0   W3  W2  W1  | 0   0   0   0   0   0
#7   X7 ×   | 0   0   0   0   W3  W2  | 0   0   0   0   0   0
#8   X8 ×   | 0   0   0   0   0   W3  | 0   0   0   0   0   0
#9   X9 ×   | 0   0   0   0   0   0   | W1  0   0   0   0   0
#10  X10 ×  | 0   0   0   0   0   0   | W2  W1  0   0   0   0
#11  X11 ×  | 0   0   0   0   0   0   | W3  W2  W1  0   0   0
#12  X12 ×  | 0   0   0   0   0   0   | 0   W3  W2  W1  0   0
#13  X13 ×  | 0   0   0   0   0   0   | 0   0   W3  W2  W1  0
#14  X14 ×  | 0   0   0   0   0   0   | 0   0   0   W3  W2  W1
#15  X15 ×  | 0   0   0   0   0   0   | 0   0   0   0   W3  W2
#16  X16 ×  | 0   0   0   0   0   0   | 0   0   0   0   0   W3
∑           | Y1  Y2  Y3  Y4  Y5  Y6  | Y7  Y8  Y9  Y10 Y11 Y12

Considering each output pixel assigned to a neuron, Table 1 shows the
convolutional process of this example in a way similar to the fully-connected
layer computations, where input pixels are read sequentially at each clock cycle
(CC) and the neurons share the same input pixels. This example considers Cin = 1,
Cout = 1, Hin = 2, Win = 8, Hf = 1, Wf = 3 and S = 1. Similar to the
fully-connected dataflow, each neuron loads a different weight at each time step,
subsequently accumulating the weighted input pixels. The number of time steps
required to perform the convolutional computations is also equal to the number of
input pixels, Hin × Win. When passed to the next neuron belonging to the same row
of the output activation map, the weights need to be shifted by one position.
However, weight passing between neurons of different rows requires a shift of Wf
positions, as can be observed between outputs #6 and #7 in Table 1.
A direct implementation of the convolutional process in
Table 1 requires a large number of PEs, or neurons, each
of them with a low utilization factor (UF). In [20] it was
shown that 3 PEs, which are denoted by different colors in
Table 1, are sufficient to perform the convolutions. In fact,
there are at most 3 active neurons at each time step. The three PEs thus start
accumulating a new output pixel at clock cycles 3×i+1, 3×i+2 and 3×i+3,
respectively. Their outputs are also valid after 3 clock cycles in the given
example. So far, we only considered a case with Hf = 1. In case of Hf = 3, the
procedure in Table 1 has to be repeated two more times: the first iteration with
W1, W2 and W3, the second with W4, W5 and W6, and the final one with W7, W8 and
W9. Similarly, for higher values of Cin, the process has to be repeated Cin
times. Therefore, a
memory is required to store the partial values generated by
the 3 neurons for each output pixel. In general, N output
pixels can be computed using 3 neurons (i.e. PEs) and 3
separate N/3-element SRAM memories working in parallel.
The unit generating the N output pixels of an output activation
map is referred to as a 1D tile. Parallel 1D tiles can be also
exploited to generate p out of Cout output activation maps in
parallel. Using p parallel 1D tiles reduces both the latency
and memory access by a factor of p. The input pixels are
shared among all the p 1D tiles.
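The streaming behaviour of Table 1 can be mimicked in a few lines of Python. The sketch below is only a behavioural model under simplifying assumptions (S = 1, Hf = 1, Cin = 1, and output pixel n of a row mapped to PE n mod 3); the function name `fid_stream` and the numeric pixel and weight values are hypothetical and do not come from [20].

```python
def fid_stream(x_rows, w, num_pes=3):
    """Stream input pixels one per clock cycle through accumulating PEs.

    x_rows : list of input rows (each a list of pixels)
    w      : filter row W1..W_Wf (stride 1 is assumed)
    Returns the completed output pixels in order and, for every clock cycle,
    the number of PEs that performed a MAC.
    """
    w_f = len(w)
    assert num_pes >= w_f, "one PE per simultaneously open output is needed"
    n_out = len(x_rows[0]) - w_f + 1   # outputs per row for S = 1
    acc = [0] * num_pes                # one accumulator per PE
    outputs, busy = [], []
    for row in x_rows:
        for col, pixel in enumerate(row):
            active = 0
            for n in range(n_out):
                tap = col - n          # weight index output n needs for this pixel
                if 0 <= tap < w_f:
                    pe = n % num_pes
                    acc[pe] += w[tap] * pixel
                    active += 1
                    if tap == w_f - 1:         # last tap: output pixel complete
                        outputs.append(acc[pe])
                        acc[pe] = 0
            busy.append(active)
    return outputs, busy


x_rows = [list(range(1, 9)), list(range(9, 17))]   # X1..X8 and X9..X16
outputs, busy = fid_stream(x_rows, w=[1, 10, 100])
print(outputs)
print(busy)
```

The `busy` list makes the ramp-up and ramp-down at the start and end of each row visible, which is the source of the N + 2 cycle count for N outputs.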
3. GENERALIZED FULLY-CONNECTED INSPIRED DATAFLOW (GFID)
Let us define a generalized form of the FID as a matrix M:
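$$
M =
\begin{bmatrix}
W_1 & 0 & 0 & \cdots & 0 \\
W_2 & \vdots & \vdots & & \vdots \\
\vdots & W_1 & 0 & & \\
W_{W_f} & W_2 & \vdots & & \\
0 & \vdots & W_1 & & \\
\vdots & W_{W_f} & W_2 & \ddots & 0 \\
 & 0 & \vdots & & W_1 \\
 & \vdots & W_{W_f} & & W_2 \\
\vdots & & 0 & & \vdots \\
0 & \cdots & 0 & \cdots & W_{W_f}
\end{bmatrix},
\qquad \text{adjacent columns offset by } S \text{ rows}, \tag{3}
$$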
where each column of the matrix M can contain only Wf
non-zero elements at most. The shift amount within each
row of the output activation map is equal to S, denoted with
a dashed line in the matrix M. The number of columns of
the matrix M indicates the N output pixels that belong to the
same row of the output activation map, while the number of
rows of M denotes the required number of clock cycles.
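To make the structure of M concrete, the following sketch builds it for a given filter width, stride and number of output pixels. It assumes, consistently with the examples in Sections 3.1 to 3.5, that column c (counting from 0) holds W1, ..., W_Wf starting at row S×c; the helper names `gfid_matrix` and `max_active` are illustrative.

```python
def gfid_matrix(w_f, s, n):
    """Build the GFID matrix M for filter width w_f, stride s and n outputs.

    Entry k (1 <= k <= w_f) stands for weight W_k; 0 means no weight.
    The matrix has s * (n - 1) + w_f rows (clock cycles) and n columns.
    """
    rows = s * (n - 1) + w_f
    m = [[0] * n for _ in range(rows)]
    for col in range(n):
        for k in range(w_f):
            m[s * col + k][col] = k + 1    # column col is shifted down by s * col
    return m


def max_active(m):
    """Largest number of non-zero entries in any row (active neurons per cycle)."""
    return max(sum(1 for e in row if e) for row in m)


for w_f, s, n in [(3, 1, 6), (5, 1, 6), (1, 1, 5), (7, 2, 5), (11, 4, 4)]:
    m = gfid_matrix(w_f, s, n)
    print(f"Wf={w_f}, S={s}: {len(m)}x{n} matrix, {max_active(m)} active neurons at most")
```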
In this Section, we use the GFID matrix M to represent
different filter sizes used in the state-of-the-art CNNs (i.e.,
AlexNet, VGGNet and ResNet). AlexNet uses filter sizes of
11×11 with S = 4, 5×5 with S = 1, and 3×3 with S = 1.
The filter sizes used in VGGNets are fixed to 3×3 with S= 1.
Finally, ResNets use filter sizes of 7× 7 with S = 2, 3× 3
with S= 1, and 1×1 with S= 1.
3.1 Filters with Wf = 3 and S= 1
In Section 2.1, we showed that 3 PEs are sufficient to
perform the convolutions for filter size of 3×3 with S = 1.
Therefore, a 1D tile containing only 3 neurons can perform
the convolutional computations. Considering a convolution
of a row of a filter map with its corresponding input pixels,
N+2 clock cycles are required to generate N output pixels
which belong to the same row of the output activation map.
For instance, in the given example in Table 1, 8 clock cycles
are required to generate the output pixels of the first row of
the output activation map (i.e., the first 6 output pixels). This
example can also be expressed using the GFID matrix M as
follows:
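$$
M_{8 \times 6} =
\begin{bmatrix}
W_1 & 0 & 0 & 0 & 0 & 0 \\
W_2 & W_1 & 0 & 0 & 0 & 0 \\
W_3 & W_2 & W_1 & 0 & 0 & 0 \\
0 & W_3 & W_2 & W_1 & 0 & 0 \\
0 & 0 & W_3 & W_2 & W_1 & 0 \\
0 & 0 & 0 & W_3 & W_2 & W_1 \\
0 & 0 & 0 & 0 & W_3 & W_2 \\
0 & 0 & 0 & 0 & 0 & W_3
\end{bmatrix}. \tag{4}
$$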
The matrix M also confirms that at most 3 neurons are active at each time step.
3.2 Filters with Wf = 5 and S= 1
The convolutional computations for filters with Wf = 5
and S= 1 are performed in a way similar to the convolutional
computations of the filters with Wf = 3 and S = 1, with the
difference that 5 neurons are active at each time step. Thus, a
1D tile with 5 PEs can perform the computations for this filter
size. Moreover, N+4 clock cycles are required to generate
N output pixels which belong to the same row of the output
activation map.
3.3 Filters with Wf = 1 and S= 1
The following matrix M shows the convolutional computa-
tions for filters with Wf = 1 and S= 1.
$$
M_{5 \times 5} =
\begin{bmatrix}
W_1 & 0 & 0 & 0 & 0 \\
0 & W_1 & 0 & 0 & 0 \\
0 & 0 & W_1 & 0 & 0 \\
0 & 0 & 0 & W_1 & 0 \\
0 & 0 & 0 & 0 & W_1
\end{bmatrix}. \tag{5}
$$
Contrary to other filter sizes, its GFID matrix M is square:
the number of clock cycles required to generate N output
pixels is equal to N. As denoted in the matrix M, there is
only one active neuron at each clock cycle. Consequently,
its 1D tile requires only one PE to perform the convolutional
computations.
3.4 Filters with Wf = 7 and S= 2
So far, we only considered a stride value S = 1. However, both AlexNet and ResNet
contain layers computing convolutions with S > 1. Considering filters with
Wf = 7 and S = 2, the shift amount within each row of the output activation map
is equal to 2, as shown in the following matrix M:
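$$
M_{15 \times 5} =
\begin{bmatrix}
W_1 & 0 & 0 & 0 & 0 \\
W_2 & 0 & 0 & 0 & 0 \\
W_3 & W_1 & 0 & 0 & 0 \\
W_4 & W_2 & 0 & 0 & 0 \\
W_5 & W_3 & W_1 & 0 & 0 \\
W_6 & W_4 & W_2 & 0 & 0 \\
W_7 & W_5 & W_3 & W_1 & 0 \\
0 & W_6 & W_4 & W_2 & 0 \\
0 & W_7 & W_5 & W_3 & W_1 \\
0 & 0 & W_6 & W_4 & W_2 \\
0 & 0 & W_7 & W_5 & W_3 \\
0 & 0 & 0 & W_6 & W_4 \\
0 & 0 & 0 & W_7 & W_5 \\
0 & 0 & 0 & 0 & W_6 \\
0 & 0 & 0 & 0 & W_7
\end{bmatrix}. \tag{6}
$$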
While the higher stride value linearly decreases the number of pixels in the
output activation maps, it also reduces the number of neurons required to perform
the convolutional computations. For instance, the above matrix M shows that there
are at most 4 active neurons at each time step, while the width of the filter is
Wf = 7. According to the matrix M, 15 clock cycles are required to generate 5
output pixels in the given example.
3.5 Filters with Wf = 11 and S = 4
The matrix M for filters with Wf = 11 and S = 4 is as
follows:
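$$
M_{23 \times 4} =
\begin{bmatrix}
W_1 & 0 & 0 & 0 \\
W_2 & 0 & 0 & 0 \\
W_3 & 0 & 0 & 0 \\
W_4 & 0 & 0 & 0 \\
W_5 & W_1 & 0 & 0 \\
W_6 & W_2 & 0 & 0 \\
W_7 & W_3 & 0 & 0 \\
W_8 & W_4 & 0 & 0 \\
W_9 & W_5 & W_1 & 0 \\
W_{10} & W_6 & W_2 & 0 \\
W_{11} & W_7 & W_3 & 0 \\
0 & W_8 & W_4 & 0 \\
0 & W_9 & W_5 & W_1 \\
0 & W_{10} & W_6 & W_2 \\
0 & W_{11} & W_7 & W_3 \\
0 & 0 & W_8 & W_4 \\
0 & 0 & W_9 & W_5 \\
0 & 0 & W_{10} & W_6 \\
0 & 0 & W_{11} & W_7 \\
0 & 0 & 0 & W_8 \\
0 & 0 & 0 & W_9 \\
0 & 0 & 0 & W_{10} \\
0 & 0 & 0 & W_{11}
\end{bmatrix}. \tag{7}
$$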
Despite the large width of the filter, the number of active neurons at each time
step is at most 3, thanks to the large stride value. However, the number of clock
cycles required to generate 4 output pixels is 23, which is rather high and can
result in a long latency.
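The numbers of active neurons observed in matrices (4) to (7) all agree with the ceiling of Wf / S. This closed form is not stated in the text, so the helper below (`min_pes`, a hypothetical name) should be read as an observation that matches the five cases considered here, under the assumption that consecutive columns of M are offset by exactly S rows.

```python
def min_pes(w_f, s):
    """Smallest number of PEs per 1D tile: ceil(w_f / s), in integer arithmetic."""
    return -(-w_f // s)


# (filter width, stride) pairs used by AlexNet, VGGNet and ResNet.
cases = [(11, 4), (5, 1), (3, 1), (7, 2), (1, 1)]
for w_f, s in cases:
    print(f"Wf={w_f}, S={s}: T={min_pes(w_f, s)}")
```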
3.6 Utilization Factor for Different Filter Sizes
As discussed in Section 2.1, the number of clock cycles
required to perform the convolutions using FID is equal to
the number of input pixels, and it is the same for GFID.
Considering Cin = 1 and Hf = 1, in order to generate N pixels of an output
activation map, S×N+Wf−S clock cycles are required to perform the convolutions
according to Eq. (2). Let us define the number of required PEs in the 1D tile as
T. The number of pixels computed by each neuron is equal to N/T when N is a
multiple of T. Each neuron also requires Wf clock cycles to generate an output
pixel. Therefore, the utilization factor of GFID can be expressed as
$$
UF = \frac{\frac{N}{T} \times W_f}{S \times N + W_f - S} \times 100. \tag{8}
$$
In Section 1, we discussed the importance of high performance efficiency. The
utilization factor of PEs in a convolutional accelerator is also linearly
proportional to its performance efficiency. Any increase in the utilization
factor of PEs exploited in the 1D tile results in an increase in performance
efficiency. Considering the fact that Wf and S are usually small, a high UF is
achieved for a large value of N.
4
Table 2: Breakdown of Number of PEs Required Per Tile
Network H f ×Wf S T # layers
AlexNet [3]
11×11 4 3 1 out of 5
5×5 1 5 1 out of 5
3×3 1 3 3 out of 5
ResNet-50 [5]
7×7 2 4 1 out of 49
3×3 1 3 16 out of 49
1×1 1 1 32 out of 49
VGG-16 [4] 3×3 1 3 13 out of 13
In other word, the maximum achievable utilization factor can
be obtained as
UFmax = lim
N→∞
UF =
Wf
T ×S ×100. (9)
Eq. (9) suggests that the highest performance efficiency
is obtained when N (Wf −S). The maximum utilization
factors for filters with [Wf ,S] equal to [1, 1], [3, 1], [5, 1],
[7, 2] and [11, 4] are 100%, 100%, 100%, 88% and 92%,
respectively, showing the high performance efficiency of the
proposed GFID.
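Eqs. (8) and (9) are straightforward to evaluate numerically. The following sketch (function names are illustrative) computes both, which makes it easy to see how quickly UF approaches its limit as N grows.

```python
def utilization(n, t, w_f, s):
    """Utilization factor of Eq. (8), in percent."""
    return (n / t) * w_f / (s * n + w_f - s) * 100


def utilization_max(t, w_f, s):
    """Limit of Eq. (8) for large N, i.e. Eq. (9), in percent."""
    return w_f / (t * s) * 100


# (Wf, S, T) triples as listed for the three networks.
for w_f, s, t in [(1, 1, 1), (3, 1, 3), (5, 1, 5), (7, 2, 4), (11, 4, 3)]:
    row = ", ".join(f"N={n}: {utilization(n, t, w_f, s):.1f}%" for n in (12, 60, 240))
    print(f"Wf={w_f}, S={s}, T={t}: {row}, max {utilization_max(t, w_f, s):.1f}%")
```

For the 6-pixel row of the example in Section 2.1 (Wf = 3, S = 1, T = 3), Eq. (8) gives 75%, which illustrates why short rows hurt efficiency.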
4. MULTI-MODE INFERENCE ENGINE
In Section 3, we showed that different filter sizes require different numbers of
PEs per tile. Table 2 summarizes the number of required PEs per tile for each
layer of AlexNet, VGGNet-16 and ResNet-50. AlexNet consists of 5 convolutional
layers with filter sizes of 11×11, 5×5 and 3×3. Performing the GFID on the
AlexNet layers shows that 4 out of 5 layers (i.e., the layers with filter sizes
of 11×11 and 3×3) only require 3 PEs to perform the computations, while the
remaining layer requires 5 PEs per tile. Therefore, T = 3 is the most frequent
number in AlexNet for convolutional processes. The filter size and stride are
fixed to 3×3 and one pixel, respectively, for the convolutional computations of
VGGNets [4]: as a result, all the convolutional computations of VGGNets can be
performed using 3 PEs per tile. There are different VGGNet models in the
literature: in this paper, we use VGGNet-16, which contains 13 convolutional
layers and 3 fully-connected layers, for experimental purposes. Similar
to VGGNets, ResNets also come in different flavors. The
first layer of ResNets is fixed to the receptive field of 7×7
and stride of S = 2. The filter sizes of the remaining layers
are either fixed to 3×3 (for ResNet-18 and ResNet-34) or a
combination of 1×1 and 3×3 (for ResNet-50, ResNet-101
and ResNet-152) [5]. Therefore, the dominant filter sizes
are 1× 1 and 3× 3, which require one and 3 PEs per tile
to perform the convolutional computations, respectively. In
Table 2, we report the requirements for ResNet-50.
Fig. 2 shows the high-level architecture of the 1D tile. It consists of two main
sub-blocks: the weight generator and K PEs working in parallel. All the PEs share
the same input activation pixel while their weights are different. Each PE takes
an input activation pixel and its corresponding weight according to the proposed
GFID and performs the multiply-accumulate operations for the first row of the
first input filter, i.e., W1, W2, ..., W_Wf. This process takes Wf clock cycles
and the computed partial value is stored in a memory of L elements. Afterwards,
the PE starts the processing of another output activation pixel, using the same
weights. The convolutional computations of the first row of the first input
filter require S×N+Wf−S clock cycles, as discussed in Section 3.6. Upon reaching
this point, the partial value of the first output activation pixel is read from
the memory and the computations of the second row of the first input filter are
performed for S×N+Wf−S clock cycles. In general, this procedure is repeated Hf
times until the computations of the first filter are finished (i.e., upon
completion of Hf×(S×N+Wf−S) clock cycles). At this point, the computation of the
second of the Cin filters starts. Upon completion of Cin×Hf×(S×N+Wf−S) clock
cycles, the output value of each PE is passed through the ReLU and the result is
stored in the off-chip memory.

Figure 2: The high-level architecture of a 1D reconfigurable tile. The weight
generator supplies Weight #1 to Weight #T to PE #1 to PE #T, which share the
input activation pixels; each PE contains an adder, an L-element 24-bit SRAM and
a ReLU.
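The schedule described above implies a simple cycle count for each group of N output pixels. The helper below just evaluates Cin × Hf × (S×N + Wf − S); the name `tile_cycles` and the second example shape are assumptions made for illustration, and the extra cycles caused by weight passing between rows are ignored.

```python
def tile_cycles(c_in, h_f, w_f, s, n):
    """Clock cycles a 1D tile needs before N finished output pixels pass the ReLU.

    One filter row takes s * n + w_f - s cycles; this is repeated for the
    h_f rows of each of the c_in input channels.
    """
    return c_in * h_f * (s * n + w_f - s)


# One row of the small example: Cin = 1, Hf = 1, Wf = 3, S = 1, N = 6.
print(tile_cycles(c_in=1, h_f=1, w_f=3, s=1, n=6))
# A hypothetical 3x3 layer with 64 input channels and 224 output pixels per row.
print(tile_cycles(c_in=64, h_f=3, w_f=3, s=1, n=224))
```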
So far, we introduced a high-level architecture for the 1D tile and explained the
high-level procedure of convolutional computations. In order to perform the
computations while achieving a high performance efficiency, the number of PEs per
tile has to be reconfigurable. In other words, K instantiated PEs have to
dynamically adapt to act as a multiple of T PEs to achieve the maximum possible
utilization factor. The closed-form solution for this strategy is
$$
K = \mathrm{LCM}(T_i), \quad T_i \in \{1, 3, 4, 5\}, \tag{10}
$$
where LCM denotes the least common multiple. Using this approach, 60 PEs are
required to achieve the maximum possible utilization factor for all the networks
and filter sizes listed in Table 2. Depending on the required T, the 60 PEs can
dynamically behave as sets of T PEs. For instance, they can act as 60, 20, 15 and
12 parallel tiles for T equal to 1, 3, 4 and 5, respectively, where each tile
contains 1, 3, 4 and 5 PEs, respectively. However, using 60 reconfigurable PEs is
not trivial and results in a complex address generator.
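Eq. (10) and the resulting tile counts can be checked with a few lines of Python; the variable names are illustrative.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


t_values = [1, 3, 4, 5]                      # minimum PEs per tile for each filter type
k = reduce(lcm, t_values)                    # Eq. (10)
tiles = {t: k // t for t in t_values}        # parallel tiles available for each T
print(k, tiles)
```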
Table 2 shows that T = 1 and T = 3 are the dominant minimum numbers of PEs for
the three well-known CNNs. More precisely, the two filters with Wf = 5 and Wf = 7
have the least impact on the overall performance efficiency of CNNs, since each
of them is used in only one layer of the considered CNNs. Therefore, we use K = 6
PEs inside the reconfigurable tile, for two reasons. First, 6 PEs can be easily
used as 2 tiles of 3 PEs for T = 3 and as 6 tiles of 1 PE for T = 1, which are
the dominant cases for the three well-known CNNs. Second, they can perform the
computations for T = 4 and T = 5 with a minimum level of complexity for the
address generator unit. In this case, with K larger than what is strictly
necessary, the number of clock cycles required to perform the convolutional
computations remains the same. However, the utilization factor of the PEs
decreases in these cases.
4.1 Reconfigurable Weight Generator Unit
The weight generator unit provides each neuron an appro-
priate weight according to the proposed GFID. The weight
generator unit consists of 6 register sets where each set con-
tains 11 registers. The appropriate weight for each neuron is
provided by selecting among these shift registers.
4.1.1 Filters With Wf = 3 and S = 1
As discussed in Section 4, in case of Wf = 3 and S = 1,
the 1D reconfigurable tile containing 6 neurons can function
as two tiles of 3 neurons each. Fig. 3(a) shows the weight
generator unit and its working path highlighted in black when
using Wf = 3 and S = 1. It is worth noting that tiles are
separated using a dashed line. Each tile loads the weights of
the first row of the first filter (i.e., W1, W2 and W3) through
the input ports denoted as In #1 and In #2 in Fig. 3(a). These
weights then loop through the first register of each set to provide a
one-clock-cycle delay for each neuron, as shown in Section 3.1. Considering
Eq. (8), the utilization factor of each neuron for this case can be computed as
$$
UF = \frac{N}{N+2}, \tag{11}
$$
which approaches 100% for large values of N.
4.1.2 Filters With Wf = 5 and S = 1
In case of Wf = 5 and S = 1, we use 6 neurons to perform the convolutional
processes, although the minimum required number of neurons is 5 for this case
(see Section 4). Therefore, the reconfigurable tile works as a single tile
containing 6 PEs, as shown in Fig. 3(b). The tile takes the first row of the
first filter (i.e., W1, W2, ..., W5) through the input port denoted as In #1. It
then provides the required one-clock-cycle delay for each PE by passing the
weights through the first register of each register set, as highlighted in black
in Fig. 3(b). It is worth noting that 6 registers are used in this configuration
while only 5 of them are required to store the weights. Therefore, the value of
one register among the 6 registers is always zero to cancel out its effect on the
computations. More precisely, we can treat the 5 weights (i.e., W1, W2, ..., W5)
as a set of 6 weights in which one is zero (i.e., W1, W2, ..., W5 and 0). The
utilization factor of each PE can also be expressed as
$$
UF = \frac{5N}{6N+24}. \tag{12}
$$
In fact, using 6 neurons to perform the convolutions of Wf = 5 reduces the
maximum achievable utilization factor from 100% to 83%.
4.1.3 Filters With Wf = 1 and S = 1
In Section 3.3, we showed that only one PE is sufficient to
perform the computations for Wf = 1 and S= 1. Therefore,
the reconfigurable 1D tile can be used as 6 parallel tiles,
as depicted in Fig. 3(c). The 6 tiles are separated using
dashed lines and the involved hardware units and paths are
highlighted in black. Each tile takes its weight (i.e., W1) at
the first clock cycle through the input ports In #1 to In #6.
Afterwards, the imported weight loops through each tile and
the first register of each register set. The utilization factor of
each PE is equal to 100% regardless of N, according to (8).
4.1.4 Filters With Wf = 7 and S = 2
Similar to the case of Wf = 5 and S= 1, 6 PEs are used to
compute the convolutions for Wf = 7 and S= 2, while only
4 neurons are sufficient. As a result, the reconfigurable tile
functions as a single tile containing 6 PEs (see Fig. 3(d)).
The tile loads the weights of the first row of the first filter
(i.e., W1, W2, . . . , and W7) through the input port In #1 and
they loop through the black paths in Fig. 3(d). In this scheme,
the first two registers of each register set are used to provide the required two
delays for each PE, as shown in Section 3.4. It is worth mentioning that while 12
registers are used in this case, only 7 of them contain the weights. The
utilization factor for this configuration is computed as follows:
$$
UF = \frac{7N}{12N+30}. \tag{13}
$$
Since 4 PEs are sufficient to perform the computations of this case, using 6
neurons strongly affects the utilization factor, which approaches 58% (i.e.,
7/12) for large values of N. However, the final impact of this case when
considering the computations of the whole system is negligible due to the fact
that this configuration is only used for one layer out of 49 in ResNet-50.
4.1.5 Filters With Wf = 11 and S = 4
Similar to the case with Wf = 3 and S= 1, 3 PEs are suf-
ficient to perform the convolutional processes when using
Wf = 11 and S = 4. Therefore, the reconfigurable tile func-
tions as two tiles where each contains 3 PEs. The weights
of the first row of the first filter (i.e., W1, W2, . . . , and W11)
are passed through input ports In #1 and In #4 to each tile,
as shown in Fig. 3(e). Since a stride value of 4 is used, the first four
registers of each register set are used to provide the required four clock cycle
delays (see Section 3.5). A total of 12 registers are used in each tile while
only 11 weights exist. Therefore, the remaining register is zero. The utilization
factor of this case is also computed as
$$
UF = \frac{11N}{12N+21}, \tag{14}
$$
achieving up to 92% for large values of N.
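The closed forms (11) to (14) all follow from Eq. (8) once T is replaced by the number of PEs that actually form a tile in each mode. The sketch below checks this numerically, expressing UF as a fraction between 0 and 1 as in Eqs. (11) to (14); the function and dictionary names are illustrative.

```python
def tile_uf(n, w_f, s, pes):
    """Eq. (8) as a fraction, with `pes` PEs acting as one tile."""
    return n * w_f / (pes * (s * n + w_f - s))


# Keys are (Wf, S, PEs per tile); values are the closed forms given in the text.
closed_forms = {
    (3, 1, 3): lambda n: n / (n + 2),               # Eq. (11)
    (5, 1, 6): lambda n: 5 * n / (6 * n + 24),      # Eq. (12)
    (7, 2, 6): lambda n: 7 * n / (12 * n + 30),     # Eq. (13)
    (11, 4, 3): lambda n: 11 * n / (12 * n + 21),   # Eq. (14)
}

for (w_f, s, pes), formula in closed_forms.items():
    for n in (6, 60, 600):
        # The general expression and the closed form must agree for every N.
        assert abs(tile_uf(n, w_f, s, pes) - formula(n)) < 1e-12
    print(f"Wf={w_f}, S={s}, {pes} PEs per tile: UF(N=600) = {tile_uf(600, w_f, s, pes):.3f}")
```

Writing the modes this way makes the cost of over-provisioning explicit: the large-N limit is Wf divided by (PEs per tile × S), so every PE beyond the minimum T lowers the ceiling on utilization.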
4.1.6 Fully-Connected Computations
As discussed in Section 2, semi-parallel architectures are
a common approach to implement fully-connected layers,
where the computations of each neuron are performed se-
rially. For instance, considering a single neuron with 512
inputs (i.e., n = 512 and m = 1), 512 clock cycles are re-
quired to perform the computations of (1) using a single PE.
We can perform the computations of multiple neurons by
instantiating multiple PEs in parallel as discussed in [20]. In
this way, each PE shares the same input pixels while loading
different weights. This approach can be easily realized using
the proposed reconfigurable tile as illustrated in Fig. 3(f).

Figure 3: Involved hardware resources and paths in case of
convolution computations for (a) Wf = 3 and S = 1, (b) Wf = 5
and S = 1, (c) Wf = 1 and S = 1, (d) Wf = 7 and S = 2, (e) Wf = 11
and S = 4, and (f) fully-connected computations. [Each panel
shows the tile with input ports In #1 to In #6 and output ports
Out #1 to Out #6.]

In fact, the reconfigurable tile passes the incoming 6 parallel
weights directly to each PE through multiplexers highlighted
in black. The utilization factor of PEs for fully-connected
computations is 100%.
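The semi-parallel scheme described above can be mimicked in software. The following Python sketch is a behavioural model only, not the MMIE hardware: it assumes that each PE accumulates one weight-input product per clock cycle and it omits any bias or activation function. The function name and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def semi_parallel_fc(inputs: Sequence[float],
                     weights: Sequence[Sequence[float]]) -> Tuple[List[float], int]:
    """Model PEs that share one input per clock cycle but hold different weights.

    weights[j] is the weight row of PE j (one neuron per PE).
    Returns the accumulated outputs and the number of clock cycles used.
    """
    for row in weights:
        if len(row) != len(inputs):
            raise ValueError("each PE needs exactly one weight per input")

    accumulators = [0.0] * len(weights)
    cycles = 0
    for t, x in enumerate(inputs):          # one shared input pixel per cycle
        for j, row in enumerate(weights):   # all PEs work in parallel in hardware
            accumulators[j] += row[t] * x
        cycles += 1
    return accumulators, cycles


if __name__ == "__main__":
    outputs, cycles = semi_parallel_fc(
        inputs=[1.0, 2.0, 3.0],
        weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]],
    )
    print(outputs, cycles)
```

In this model the cycle count equals the number of inputs n regardless of how many PEs run in parallel, which is why every PE stays busy in each cycle.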
4.2 Handling Weight Passing
So far, we discussed both the convolutional and fully-
connected computations while not considering the weight
passing cases for the sake of simplicity. However, weight
passing occurrence is inevitable in convolutional computa-
tions and impacts both the processing time and utilization
factor of PEs. Weight passing occurs when a tile performs the
computations of more than one row of the output activation
map. In this case, the weight passing from a neuron of a row
to a neuron of another row takes Wf clock cycles regardless
of the stride value S, resulting in a longer latency and conse-
quently a lower utilization factor for PEs. The total number
of weight passing occurrences for computations of a single
convolutional layer is equal to Hout − 1. We considered 11
registers for each register set to support the weight passing
delay up to 11 clock cycles. Therefore, in case of weight
passing in any of the PEs, its corresponding register set
provides the required delay depending on Wf.
4.3 Exploiting Parallel Tiles
While the proposed reconfigurable tile can efficiently perform
both fully-connected and convolutional computations,
using a single tile results in a long latency and numerous
memory accesses, as discussed in [20]. To address this is-
sue, p tiles are instantiated in parallel to generate multiple
activation maps in parallel. Since the reconfigurable tile it-
self can function as up to 6 parallel tiles, the upper bound
for the maximum number of tiles is 6p in MMIE. Therefore,
the computational latency of MMIE is effectively reduced
by a p factor when compared to a single reconfigurable tile.
Moreover, the memory accesses are reduced as well, since
the input pixels are shared among the parallel tiles (see Fig.
1), while each tile is fed by a different set of weights.
Exploiting parallel tiles requires an input bandwidth of
(1 + 6×p)×16 bits (6×p×16 for weights and 16 for input
pixels). However, most of the embedded and mobile devices
cannot provide such a high bandwidth. To overcome this
problem, MMIE leverages the pipelining technique first intro-
duced in [20]. As discussed in Section 3, each input pixel is
read at each clock cycle while Wf weights are read only for
the first Wf clock cycles when performing the convolutional
process of the first row of the first input filter. The parameter
Wf is also a small value compared to the processing time
of convolutions for the first row of the first input filter (i.e.,
Wf ≪ (S×N + Wf − S)). More precisely, the input bandwidth
from the (Wf)th clock cycle to the (S×N + Wf − S)th clock
cycle is only occupied with input pixels. Therefore, we can
fill out this available bandwidth by pipelining the tiles with
up to ⌊(S×N + Wf − S)/Wf⌋ stages, while the additional
latency overhead is negligible compared to the overall latency
of the system.

Table 3: The effective values of N and p in MMIE depending on Wf and S.
Hf×Wf    S    Neff    peff
11×11    4    192     64
7×7      2    384     32
5×5      1    384     32
3×3      1    192     64
1×1      1    64      192
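The bandwidth requirement and the pipelining bound discussed in Section 4.3 can be sketched in a few lines of Python. The function names are hypothetical, and the defaults of 16-bit words and 6 PEs per tile are taken from the text; the sketch only evaluates the two expressions and does not model the pipelining hardware itself.

```python
def input_bandwidth_bits(p: int, word_bits: int = 16, pes_per_tile: int = 6) -> int:
    """Input bandwidth of p parallel tiles: (1 + 6 x p) x 16 bits.

    One word carries the shared input pixel; the other 6 x p words carry weights.
    """
    return (1 + pes_per_tile * p) * word_bits


def max_pipeline_stages(s: int, n: int, w_f: int) -> int:
    """Upper bound on pipelining stages: floor((S x N + Wf - S) / Wf)."""
    return (s * n + w_f - s) // w_f


if __name__ == "__main__":
    print(input_bandwidth_bits(p=32))             # bits needed without pipelining
    print(max_pipeline_stages(s=1, n=192, w_f=3)) # bound for 3x3 filters, N = 192
```

The bound follows from the observation that the weight port is busy for only Wf out of every S×N + Wf − S clock cycles, so the remaining cycles can be handed to other pipelined tiles.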
4.4 Processing Time and Memory Accesses of
MMIE
4.4.1 Convolutional Processes
Earlier in Section 4 we showed that in convolutional processes,
a single tile computes N out of Hout × Wout pixels of
one of Cout output activation maps within Cin × Hf × (S×N +
Wf − S) clock cycles. We also showed that the total number
of weight passing occurrences for the computation of a
single convolutional layer is equal to Hout − 1, which causes
additional (Wf − 1) × (Hout − 1) clock cycles for the computations
of each row of the input filters. Considering p parallel
tiles, the number of required clock cycles is expressed as
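\[
CC = \frac{W_{out} \times H_{out}}{N} \times (S \times N + W_f - S) \times H_f \times C_{in} \times \left\lceil \frac{C_{out}}{p} \right\rceil + (W_f - 1) \times (H_{out} - 1) \times H_f \times C_{in} \times \left\lceil \frac{C_{out}}{p} \right\rceil. \tag{15}
\]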
Eq. (15) suggests that the number of required clock cycles
for convolutional computations is independent of N for large
values of N (i.e., S×N ≫ (Wf − S)) when not considering
the weight passing overheads. In Section 4.3 we showed that
input pixels are shared among all tiles and one input pixel is
read at each clock cycle. This means that the number of memory
accesses by input activation maps (MAimaps) is equal to the
number of clock cycles required to complete the convolution.
On the other hand, the weights are read in the first Wf clock
cycles out of a total of (S×N + Wf − S). As a result, the number
of required memory accesses by filters to compute N out of
Hout × Wout pixels of one out of Cout output activation maps is
equal to Cin × Hf × Wf. In general, the number of memory
accesses by filters (MAfilters) can be computed as follows:
\[
MA_{filters} = H_f \times W_f \times C_{in} \times \left\lceil \frac{W_{out} \times H_{out}}{N} \right\rceil \times C_{out}. \tag{16}
\]
Finally, the total number of memory accesses (MA) is the sum
of the memory accesses by filters, input activation maps
and output activation maps, where the number of memory
accesses by output activation maps (MAomaps) is equal to
Wout × Hout. It is worth noting that while the number of clock
cycles and MAimaps are independent of N, MAfilters depends
on it. On the other hand, MAfilters is independent of p while
the number of clock cycles and MAimaps are not. It is worth
mentioning that while higher values of p and N optimize
MMIE towards lower memory accesses and processing latencies,
they also increase its power consumption and silicon
area.
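The expressions of this subsection can be collected into a small calculator. The Python sketch below is a direct transcription of (15) and (16) together with the statements that MAimaps equals the number of clock cycles and that MAomaps equals Wout × Hout; the class and function names, as well as the toy layer in the usage example, are hypothetical. As in (15), the term Wout × Hout / N is not rounded, so the cycle count is returned as a floating-point value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    w_out: int   # output activation map width
    h_out: int   # output activation map height
    w_f: int     # filter width
    h_f: int     # filter height
    s: int       # stride
    c_in: int    # number of input activation maps
    c_out: int   # number of output activation maps


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def conv_clock_cycles(layer: ConvLayer, n: int, p: int) -> float:
    """Eq. (15): compute cycles plus weight-passing overhead."""
    passes = ceil_div(layer.c_out, p)
    compute = ((layer.w_out * layer.h_out / n)
               * (layer.s * n + layer.w_f - layer.s)
               * layer.h_f * layer.c_in * passes)
    weight_passing = ((layer.w_f - 1) * (layer.h_out - 1)
                      * layer.h_f * layer.c_in * passes)
    return compute + weight_passing


def filter_memory_accesses(layer: ConvLayer, n: int) -> int:
    """Eq. (16): memory accesses by filters."""
    return (layer.h_f * layer.w_f * layer.c_in
            * ceil_div(layer.w_out * layer.h_out, n) * layer.c_out)


def total_memory_accesses(layer: ConvLayer, n: int, p: int) -> float:
    """MA = MAfilters + MAimaps + MAomaps, as defined in the text."""
    ma_imaps = conv_clock_cycles(layer, n, p)   # one input pixel per clock cycle
    ma_omaps = layer.w_out * layer.h_out        # as stated in the text
    return filter_memory_accesses(layer, n) + ma_imaps + ma_omaps


if __name__ == "__main__":
    toy = ConvLayer(w_out=8, h_out=8, w_f=3, h_f=3, s=1, c_in=2, c_out=4)
    print(conv_clock_cycles(toy, n=64, p=2))
    print(filter_memory_accesses(toy, n=64))
    print(total_memory_accesses(toy, n=64, p=2))
```

When applying this sketch to a real layer, N and p should be replaced by the effective values Neff and peff of Table 3.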
4.4.2 Fully-Connected Computations
In Section 4.1.6, we showed that MMIE can perform the
fully-connected computations in a similar way to convolu-
tional computations, with each PE loading a different set of
weights. The processing time of each PE is thus equal to the
number of inputs n. The number of clock cycles required to
generate m output pixels can be expressed as
\[ CC = \left\lceil \frac{m}{p} \right\rceil \times n. \tag{17} \]
Unlike weights, input pixels are shared among PEs. There-
fore, the number of memory accesses by input pixels (MAip)
is equal to the number of clock cycles required for fully-
connected computations. Since each output pixel relies on a
distinct set of n weights, the number of memory accesses by
weights (MAweights) is computed as follows:
\[ MA_{weights} = m \times n. \tag{18} \]
The number of memory accesses by output pixels (MAop) is
equal to m. The total number of memory accesses (MA) is
also a summation of memory accesses by weights, input and
output pixels.
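The fully-connected expressions (17) and (18) can be evaluated in the same way. The following Python sketch uses hypothetical function names; p is passed in explicitly, since the text does not tabulate an effective value of p for the fully-connected mode.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def fc_clock_cycles(m: int, n: int, p: int) -> int:
    """Eq. (17): clock cycles to generate m outputs from n inputs."""
    return ceil_div(m, p) * n


def fc_memory_accesses(m: int, n: int, p: int) -> dict:
    """Memory accesses of a fully-connected layer, split by source."""
    ma_ip = fc_clock_cycles(m, n, p)   # shared inputs: one read per clock cycle
    ma_weights = m * n                 # Eq. (18): a distinct weight per product
    ma_op = m                          # one write per output pixel
    return {
        "inputs": ma_ip,
        "weights": ma_weights,
        "outputs": ma_op,
        "total": ma_ip + ma_weights + ma_op,
    }


if __name__ == "__main__":
    # The single-neuron example of Section 4.1.6: n = 512, m = 1, one PE.
    print(fc_clock_cycles(m=1, n=512, p=1))
    # A hypothetical layer with 10 outputs processed 6 at a time.
    print(fc_memory_accesses(m=10, n=512, p=6))
```

The sketch makes the asymmetry visible: input reads shrink as p grows, whereas the weight reads m × n do not depend on p.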
5. IMPLEMENTATION RESULTS
In this paper, we optimize MMIE for a low-latency, low-
memory access implementation while keeping its power con-
sumption below the power budget of mobile devices, limited
to a few hundred mW [23]. Fig. 4(a) shows the architecture
of MMIE, which consists of three main sub-blocks: tiles,
pipelining stages and a distributor unit. MMIE contains 32
reconfigurable tiles, each with 6 PEs. Each PE is also
associated with a memory of L = 64 24-bit words. The pipelining
stages provide the required amount of shifts depending
on the value of Wf using shift registers and multiplexers, as
discussed in Section 4.3. The distributor unit also provides
the required bandwidth for fully-connected weights using
shift registers working at a lower frequency than the off-chip
memory.
The p and N parameters do not only affect latency and
number of memory accesses, but also impact power and area
costs. Therefore, it is possible to obtain different trade-offs
between processing time, throughput and implementation
costs depending on p and N. Since the reconfigurable tile
functions differently based on Wf and S, the effective values
of N and p vary for each case. Table 3 shows the effective
values of N and p for AlexNet, VGGNet and ResNet filter
sizes. The effective values of N and p, denoted as Neff and
peff respectively, have to be used in all the equations reported
in this paper that rely on these two values.
MMIE was implemented in TSMC 65 nm GP CMOS technology
and its layout is shown in Fig. 4(b). MMIE works
at a nominal frequency of 200 MHz and 40 MHz for convo-
lutional and fully-connected processes respectively. MMIE
performs the fully-connected computations at a lower fre-
quency since they require a high input bandwidth, as each
neuron loads its own set of weights. We also used the run-length
compression technique introduced in [14] to reduce
the required bandwidth. MMIE uses the distributor unit to
decode the compressed values. Considering MMIE working
at 10× lower frequency compared to the off-chip memory for
fully-connected computations, the required bandwidth of 193
16-bit values is obtained using this technique.

Figure 4: (a) The architecture of MMIE and (b) its layout.
[Block labels in (a): Off-Chip Memory, Distributor, Controller,
Pipelining Stage, and Tile #1 to Tile #p with their Weights and
Input Pixels inputs.]
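Several figures quoted in this section can be cross-checked with simple arithmetic. The Python sketch below assumes that each PE completes one multiply-accumulate per clock cycle and that a multiply-accumulate counts as two operations; this counting convention is an assumption and is not stated in the text. All constant and function names are illustrative.

```python
NUM_TILES = 32          # reconfigurable tiles in MMIE
PES_PER_TILE = 6        # PEs per tile
MEM_WORDS_PER_PE = 64   # L = 64 words per PE memory
MEM_WORD_BITS = 24      # width of each memory word
OPS_PER_MAC = 2         # assumption: one multiply plus one add per cycle


def num_pes() -> int:
    return NUM_TILES * PES_PER_TILE


def peak_gops(freq_mhz: float) -> float:
    """Peak throughput in Gops at the given clock frequency."""
    return num_pes() * OPS_PER_MAC * freq_mhz / 1000.0


def pe_memory_bytes() -> int:
    """Total capacity of the per-PE memories in bytes."""
    return num_pes() * MEM_WORDS_PER_PE * MEM_WORD_BITS // 8


def parallel_input_values() -> int:
    """One shared input pixel plus one weight per PE."""
    return 1 + PES_PER_TILE * NUM_TILES


if __name__ == "__main__":
    print(num_pes(), peak_gops(200), peak_gops(40))
    print(pe_memory_bytes(), parallel_input_values())
```

Under these assumptions the sketch gives 192 PEs, 76.8 Gops at 200 MHz, 15.36 Gops at 40 MHz, 36,864 bytes of PE memory and 193 parallel 16-bit values, in line with the 76.8 and 15.4 Gops, 36.9 kB and 193 values reported in this section and in Table 4.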
5.1 Hardware Implementation Results on State-
of-the-Art Networks
Fig. 5(a) shows the breakdown of performance efficiency
for each layer of AlexNet, VGGNet-16 and ResNet-50 when
using MMIE. In our simulations, the input pixels and weight
values are quantized to 16 bits while using 2 and 15 fractional
bits, respectively. It is worth noting that this quantization
scheme results in less than 0.5% accuracy degradation
on the aforementioned CNNs using [24, 25]. The implementation
results show that the lowest performance efficiency of
AlexNet and VGGNet-16 was obtained at the first layer of
these networks. The number of output activation maps Cout of
the first layer of AlexNet is 96, while MMIE provides 64 parallel
tiles when Wf = 11 and S = 4. As a result, MMIE achieves
a high performance efficiency for the first 64 output activation
maps, while the remaining 32 output activation maps are
computed using only 32 of the 64 parallel tiles, which explains
the low performance efficiency of this layer. On the other
hand, MMIE successfully performs the computations of the
first layer of VGGNet-16 with a high performance efficiency.
However, since the required time for writing the computed
output activation pixels is longer than the computation time,
the low performance efficiency is inevitable. In ResNet-50,
layers with a receptive field of 1×1 show lower performance
compared to other filter sizes, while it was shown in Section
4.1.3 that such a receptive field yields a 100% performance
efficiency. Such performance efficiency degradation is expected,
as Cout of the layers with a receptive field of 1×1 is not a
multiple of the 192 available parallel tiles. For instance, the
number of output activation maps of the second layer of
ResNet-50 is 64, while 192 parallel tiles are available.
Therefore, 128 tiles are not being used for this layer.
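The effect of Cout not being a multiple of the number of parallel tiles can be illustrated with a short Python sketch. It computes only a tile-occupancy factor, i.e., the fraction of tile slots that carry useful work over all passes; this is a simplification for illustration and is not the performance efficiency metric reported in Fig. 5(a). The function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def idle_tiles_in_last_pass(c_out: int, p_eff: int) -> int:
    """Tiles left unused when the last group of output maps is computed."""
    return p_eff * ceil_div(c_out, p_eff) - c_out


def tile_occupancy(c_out: int, p_eff: int) -> float:
    """Fraction of tile slots doing useful work over all passes."""
    return c_out / (p_eff * ceil_div(c_out, p_eff))


if __name__ == "__main__":
    # First layer of AlexNet: Cout = 96 with 64 parallel tiles.
    print(idle_tiles_in_last_pass(96, 64), tile_occupancy(96, 64))
    # Second layer of ResNet-50: Cout = 64 with 192 parallel tiles.
    print(idle_tiles_in_last_pass(64, 192), tile_occupancy(64, 192))
```

For the two examples in the text, the sketch reproduces the 32 and 128 idle tiles mentioned above.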
Fig. 5(b) shows the breakdown of power consumption
for each layer of AlexNet, VGGNet-16 and ResNet-50. The
power consumption of MMIE follows a descending trend
as the number of zeros in output/input activation maps and
filters increases for each layer of AlexNet, VGGNet-16 and
ResNet-50. Moreover, it also increases as the performance
efficiency of layers rises. The power numbers reported in this
paper are obtained by measuring switching activities of all
models.
Fig. 5(c) shows the breakdown of the memory accesses
for each layer of AlexNet, VGGNet-16 and ResNet-50. The
memory accesses for each layer of the aforementioned net-
works are limited to a few MB. More precisely, AlexNet
and ResNet-50 layers require a lower number of memory
accesses compared to VGGNet-16. While the memory accesses
for each layer of AlexNet and ResNet-50 are roughly
of the same order, the total memory accesses of ResNet-50
are significantly higher due to its numerous layers. The
processing latency of each layer also follows a similar trend to
the memory accesses as shown in Fig. 5(d). In fact, the
latency of each layer in AlexNet and ResNet-50 is limited to
a few milliseconds while each layer of VGGNet-16 requires
roughly 10× more clock cycles.
5.2 Comparison With State-of-the-Art Imple-
mentations
The implementation results of MMIE on AlexNet, VGGNet-
16 and ResNet-50 are shown in Table 4. As discussed in
Section 1, MCR of these networks varies depending on their
sizes. Therefore, different implementation results are ex-
pected when running MMIE on these models. MMIE per-
forms the convolutional and fully-connected computations
of AlexNet within 20.8 ms and 7.6 ms while requiring 15.6
MB and 117.8 MB memory accesses to the off-chip mem-
ory, respectively. The convolutional and fully-connected
processes of VGGNet-16 are performed within 421.8 ms
and 16.4 ms and require 375.5 MB and 247.3 MB mem-
ory accesses, respectively. Finally, performing convolutional
and fully-connected computations of ResNet-50 on MMIE
requires 106.6 ms and 0.3 ms while memory accesses are
154.6 MB and 4.1 MB, respectively. Therefore, AlexNet com-
putations require the lowest latency while its total memory
accesses are roughly similar to those of ResNet-50. VGGNet-
16 is the most complex network in terms of both processing
latency and memory accesses. MMIE also yields 83%, 94%
and 94% performance efficiency for convolutional computa-
tions of AlexNet, VGGNet-16 and ResNet-50, respectively.
It is worth mentioning that the performance efficiency of
fully-connected computations is roughly 100% for all the
aforementioned networks.

Figure 5: (a) The performance efficiency, (b) power consumption,
(c) memory access and (d) computation latency
breakdowns of AlexNet, VGGNet-16 and ResNet-50 at 200 MHz
for convolutional processes and 40 MHz for fully-connected
computations in TSMC 65 nm CMOS technology. [Each panel
plots its quantity against the layer number: performance
efficiency (%), power consumption (mW), memory access (MB,
logarithmic scale) and processing latency (ms, logarithmic
scale).]
During the past few years, numerous works have been conducted
towards ASIC implementations of DNNs. However,
most of them were only tested on either small datasets or
outdated CNNs which require orders of magnitude fewer
parameters and computations [23,26–29]. Recently, Google released
a custom DNN accelerator tensor processing unit (TPU) [30].
TPU is a programmable and reconfigurable accelerator that
can perform both fully-connected and convolutional compu-
tations. However, its power consumption exceeds the power
budgets of embedded devices [20]. In [14, 31], a convolu-
tional accelerator, called Eyeriss, was introduced. Eyeriss
was fabricated in 65 nm CMOS technology and tested on
AlexNet and VGGNet-16. Eyeriss uses high batch sizes
to obtain a lower number of memory accesses, but using
this method results in a higher computational latency. Eyeriss
performs convolutional computations of AlexNet and
VGGNet-16 in 115.3 ms and 4.3 s while requiring 15.4 MB
and 321.1 MB memory accesses and using batch sizes of 4
and 3, respectively. Its performance efficiency is also limited
to only 55% and 26% on AlexNet and VGGNet-16, resulting
in a large silicon area of 12.52 mm² (1852 kgates). Eyeriss also
uses clock gating to reduce its power consumption.
Recently, a few works have focused on minimizing en-
ergy by modulating precision, frequency and supply voltage
of their accelerator for each convolutional layer [15, 32, 33].
In [15], a precision-scalable convolutional accelerator,
fabricated in 28 nm UTBB FD-SOI technology, was introduced.
This architecture dynamically adapts itself depending on the
required precision for each layer, instead of using a fixed
precision. More precisely, it exploits a reconfigurable multiplier
which is able to perform one 16-bit, two 8-bit or four 4-bit
multiplications, depending on the required precision. As a
result, using a dynamic fixed-point technique allows the
frequency and supply voltage to be changed over time, which
results in a lower power/energy consumption. This accelerator
performs the convolutional computations of AlexNet within
21.3 ms, and those of VGGNet-16 within 598.8 ms, while its
performance efficiency is respectively limited to 38% and 32%
on average. Similar to Eyeriss, the low performance efficiency
of this architecture results in a large gate count of 1950 kgates.

Table 4: Comparison of the Baseline Architecture with State-of-the-art Implementations.

Reference                    ISSCC’17 [19]                 ISSCC’17 [15]         JSSC’17 [14]  TCAS’17 [20]  This work
Technology                   NA/65 nm                      UTBB FD-SOI/28 nm     TSMC/65 nm    TSMC/65 nm    TSMC/65 nm
Gate Count* (NAND-2)         NA                            1950k                 1852k         1117k         1036k
Core Area* (mm²)             16                            1.87                  12.52         3.5           6 (2.45×2.45)
# PE                         c: 768(16b)-3072(4b); f: 64   256(16b)-1024(4b)     168           192           c,f: 192
On-chip SRAM (kB)            290                           144                   181.5         86            36.9
Nominal Frequency (MHz)      c,f: 50-200                   c: 200                c: 250        c: 200        c: 200; f: 40
Peak Performance (Gops)      c: 300(16b)-1200(4b); f: 25   c: 102(16b)-408(4b)   c: 84         c: 76         c: 76.8; f: 15.4
Bitwidth (bits)              c: 4-16 programmable; f: 4-7  c: 1-16 programmable  c: 16 fixed   c: 16 fixed   c,f: 16 fixed

Reference                    [19]                [15]      [15]      [14]      [14]       [20]      This work           This work           This work
CNN type for ImageNet        AlexNet             AlexNet   VGG-16    AlexNet   VGG-16     VGG-16    AlexNet             VGG-16              ResNet-50
Top-1 Error (%)              42.9                42.9      27        42.9      27         27        42.9                27                  20
Voltage (V)                  0.77-1.1            NA        NA        1         1          1         1                   1                   1
Power* (mW)                  c: 63; f: 3.5       c: 44     c: 26     c: 278    c: 236     c: 260    c: 265; f: 37       c: 301; f: 40       c: 248; f: 35.5
Total Latency (ms)           c: 5.7; f: 0.8      c: 21.3   c: 598.8  c: 115.3  c: 4309.5  c: 453.3  c: 20.8; f: 7.6     c: 421.8; f: 16.4   c: 103.6; f: 0.3
Throughput (fps)             c: 177; f: 1.2k     c: 47     c: 1.67   c: 34.7   c: 0.7     c: 2.21   c: 48.1; f: 131.6   c: 2.2; f: 61       c: 9.6; f: 3.3k
Performance (Gops)           c: 235.4; f: 140.6  c: 62.6   c: 51.3   c: 46.1   c: 21.4    c: 67.7   c: 63.9; f: 15.4    c: 72.5; f: 15.1    c: 74.5; f: 15
Performance Efficiency       c: 50%; f: 562%     c: 38%    c: 32%    c: 55%    c: 26%     c: 89%    c: 83%; f: 100%     c: 94%; f: 98%      c: 88%; f: 97%
Energy-Efficiency* (Gops/W)  c: 4200; f: 40.2k   c: 1423   c: 1973   c: 166    c: 90.7    c: 260.4  c: 241.1; f: 416.2  c: 240.9; f: 377.5  c: 300.4; f: 422.5
Memory Access / Batch (MB)   NA                  NA        NA        c: 15.4   c: 321.1   c: 331.7  c: 15.6; f: 117.8   c: 375.5; f: 247.3  c: 154.6; f: 4.1

* Including on-chip SRAM. c: convolutional. f: fully-connected.
In [19], a DNN accelerator, fabricated in 65 nm CMOS
1P8M, was introduced. This accelerator can perform both
fully-connected and convolutional computations while using
two separate cores and the dynamic fixed-point technique
to minimize power/energy consumption. This architecture
exploits a reconfigurable 16-bit multiplier for convolutional
processes which allows it to work with lower frequency and
supply voltage. This architecture performs convolutional
and fully-connected computations within 5.7 ms and 833
µs, respectively. The convolutional core of this architecture
contains 768 16-bit reconfigurable PEs, which can be used
as 3072 4-bit PEs. Despite its high convolutional through-
put, its performance efficiency is limited to 50% on average.
The fully-connected core contains only 64 PEs, and uses
a quantization table-based matrix multiplication to reduce
off-chip memory accesses and remove redundancy. This tech-
nique reduces the memory accesses by 75% and avoids 90%
of the 16-bit fixed-point multiplications in fully-connected
computations [19]. While the fully-connected core is highly
optimized, it requires separate PEs and hardware resources,
which leads to a large silicon area of 16 mm².
In [20], a convolutional accelerator was proposed as a
first attempt to improve the performance efficiency for filters
fixed to 3×3. This architecture performs the convolutional
computations of VGGNet-16 within 453.3 ms and requires
331.7 MB memory accesses.
In this paper, we proposed MMIE which supports all the
filter sizes that require less than or equal to 6 parallel PEs in
each tile. MMIE can perform both the convolutional and fully-
connected computations while using the same PEs. Since
both Eyeriss and MMIE were implemented in TSMC 65nm
CMOS technology and use 16-bit fixed-point representations,
a direct comparison of these two implementations constitutes
a fair comparison. As shown in Table 4, MMIE outperforms
Eyeriss [14] in terms of gate count (1.8× smaller),
latency (5.5× and 10.2× lower), throughput (1.4× and 3.1×
faster), performance efficiency (1.5× and 3.6× better) and
energy efficiency (1.5× and 2.7× better) while having
roughly the same number of memory accesses per batch.
It is worth noting that a direct comparison of MMIE with
the works published in [15, 19] does not constitute a fair
comparison, since they dynamically modulate precision, fre-
quency and supply voltage and use advanced technology
nodes, which allows them to instantiate more PEs while still
having a low-power/energy consumption. However, the in-
troduced performance efficiency metric can be used for a
fair comparison as it reflects the performance of the acceler-
ators independent of their technology nodes, precisions and
optimization techniques. Therefore, MMIE has a better
performance efficiency than the works published in [19] (2×
better) and [15] (2.2× and 2.9× better) when performing
convolutions of AlexNet and VGGNet-16.
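The improvement factors over Eyeriss quoted above follow directly from the convolutional entries of Table 4. The Python sketch below recomputes them; the dictionary layout and function name are illustrative, and the numbers are copied from the table.

```python
# Convolutional results copied from Table 4.
EYERISS = {
    "gate_count_k": 1852,
    "latency_ms": {"AlexNet": 115.3, "VGG-16": 4309.5},
    "throughput_fps": {"AlexNet": 34.7, "VGG-16": 0.7},
    "perf_eff_pct": {"AlexNet": 55, "VGG-16": 26},
    "energy_eff_gops_w": {"AlexNet": 166, "VGG-16": 90.7},
}
MMIE = {
    "gate_count_k": 1036,
    "latency_ms": {"AlexNet": 20.8, "VGG-16": 421.8},
    "throughput_fps": {"AlexNet": 48.1, "VGG-16": 2.2},
    "perf_eff_pct": {"AlexNet": 83, "VGG-16": 94},
    "energy_eff_gops_w": {"AlexNet": 241.1, "VGG-16": 240.9},
}


def improvement_factors() -> dict:
    """Return MMIE-over-Eyeriss improvement factors (larger is better)."""
    factors = {"gate_count": EYERISS["gate_count_k"] / MMIE["gate_count_k"]}
    for net in ("AlexNet", "VGG-16"):
        # Lower is better for latency, so the ratio is Eyeriss over MMIE.
        factors[f"latency/{net}"] = EYERISS["latency_ms"][net] / MMIE["latency_ms"][net]
        # Higher is better for the remaining metrics.
        for key in ("throughput_fps", "perf_eff_pct", "energy_eff_gops_w"):
            factors[f"{key}/{net}"] = MMIE[key][net] / EYERISS[key][net]
    return factors


if __name__ == "__main__":
    for name, value in improvement_factors().items():
        print(f"{name}: {value:.1f}x")
```

Rounded to one decimal place, these ratios match the factors listed in the comparison with Eyeriss.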
6. CONCLUSION
CNN accelerators in the literature promise a high peak
throughput, but their performance efficiency is limited to less
than 55% when running state-of-the-art networks such as
AlexNet, VGGNets and ResNets. We proposed a dataflow inspired
by the fully-connected computations to perform both convolutional
and fully-connected processes with a high utilization factor.
We then introduced a multi-mode inference engine (MMIE)
based on the proposed dataflow and theoretically formalized
its implementation performance. Finally, we implemented
MMIE in TSMC 65nm CMOS technology and tested it on
three state-of-the-art networks, AlexNet, VGGNet-16 and
ResNet-50. The implementation results show that MMIE
performs both the fully-connected and convolutional compu-
tations with performance efficiency no less than 84%, outper-
forming the state of the art also in terms of area occupation.
7. REFERENCES
[1] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” in Proceedings of the
IEEE, pp. 2278–2324, 1998.
[2] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
pp. 436–444, 5 2015.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems 25 (F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran
Associates, Inc., 2012.
[4] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” CoRR, vol. abs/1409.1556, 2014.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” CoRR, vol. abs/1512.03385, 2015.
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and
L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,”
International Journal of Computer Vision (IJCV), vol. 115, no. 3,
pp. 211–252, 2015.
[7] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: a Spatial Architecture for
Energy-efficient Dataflow for Convolutional Neural Networks,” in
Proceedings of the 43rd International Symposium on Computer
Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 367–379, IEEE
Press, 2016.
[8] M. Horowitz, “1.1 computing’s energy problem (and what we can do
about it),” in 2014 IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), pp. 10–14, Feb 2014.
[9] S. Han, H. Mao, and W. J. Dally, “Deep Compression: compressing
Deep Neural Network with Pruning, Trained Quantization and
Huffman Coding,” CoRR, vol. abs/1510.00149, 2015.
[10] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and
connections for efficient neural network,” in Advances in Neural
Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D.
Lee, M. Sugiyama, and R. Garnett, eds.), pp. 1135–1143, Curran
Associates, Inc., 2015.
[11] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured
sparsity in deep neural networks,” CoRR, vol. abs/1608.03665, 2016.
[12] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, “EIE: Efficient inference engine on compressed deep neural
network,” in 2016 ACM/IEEE 43rd Annual International Symposium
on Computer Architecture (ISCA), pp. 243–254, June 2016.
[13] A. Ardakani, C. Condo, and W. J. Gross, “Sparsely-Connected Neural
Networks: Towards Efficient VLSI Implementation of Deep Neural
Networks,” Proc. 5th Int. Conf. Learn. Represent. (ICLR), Nov. 2016.
[14] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An
Energy-Efficient Reconfigurable Accelerator for Deep Convolutional
Neural Networks,” IEEE Journal of Solid-State Circuits, vol. 52,
pp. 127–138, Jan 2017.
[15] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5
Envision: A 0.26-to-10TOPS/W subword-parallel
dynamic-voltage-accuracy-frequency-scalable Convolutional Neural
Network processor in 28nm FDSOI,” in 2017 IEEE International
Solid-State Circuits Conference (ISSCC), pp. 246–247, Feb 2017.
[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun, and O. Temam, “DaDianNao: A Machine-Learning
Supercomputer,” in 2014 47th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 609–622, Dec 2014.
[17] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and
L. Benini, “Origami: a convolutional network accelerator,” CoRR,
vol. abs/1512.04295, 2015.
[18] S. Wang, D. Zhou, X. Han, and T. Yoshimura, “Chain-NN: An
energy-efficient 1D chain architecture for accelerating deep
convolutional neural networks,” in Design, Automation Test in Europe
Conference Exhibition (DATE), 2017, pp. 1032–1037, March 2017.
[19] D. Shin, J. Lee, J. Lee, and H. J. Yoo, “14.2 DNPU: An 8.1TOPS/W
reconfigurable CNN-RNN processor for general-purpose deep neural
networks,” in 2017 IEEE International Solid-State Circuits
Conference (ISSCC), pp. 240–241, Feb 2017.
[20] A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An
Architecture to Accelerate Convolution in Deep Neural Networks,”
IEEE Transactions on Circuits and Systems I: Regular Papers, Early
Access, doi: 10.1109/TCSI.2017.2757036, 2017.
[21] F. Moreno, J. Alarcon, R. Salvador, and T. Riesgo, “Fpga
implementation of an image recognition system based on tiny neural
networks and on-line reconfiguration,” in Industrial Electronics, 2008.
IECON 2008. 34th Annual Conference of IEEE, pp. 2445–2452, Nov
2008.
[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel, “Backpropagation applied to
handwritten zip code recognition,” Neural Comput., vol. 1,
pp. 541–551, Dec. 1989.
[23] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An
Ultra-Low Power Convolutional Neural Network Accelerator Based
on Binary Weights,” in 2016 IEEE Computer Society Annual
Symposium on VLSI (ISVLSI), pp. 236–241, July 2016.
[24] A. Vedaldi and K. Lenc, “MatConvNet: Convolutional Neural
Networks for MATLAB,” in Proceedings of the 23rd ACM
International Conference on Multimedia, MM ’15, (New York, NY,
USA), pp. 689–692, ACM, 2015.
[25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for
Fast Feature Embedding,” arXiv preprint arXiv:1408.5093, 2014.
[26] L. Cavigelli, M. Magno, and L. Benini, “Accelerating real-time
embedded scene labeling with convolutional networks,” in 2015 52nd
ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6,
June 2015.
[27] L. Cavigelli and L. Benini, “A 803 GOp/s/W convolutional network
accelerator,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. PP, no. 99, pp. 1–1, 2016.
[28] A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini, “A
heterogeneous multi-core system-on-chip for energy efficient brain
inspired computing,” IEEE Transactions on Circuits and Systems II:
Express Briefs, vol. PP, no. 99, pp. 1–1, 2017.
[29] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P.
Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: a
convolutional neural network accelerator with in-situ analog arithmetic
in crossbars,” in 2016 ACM/IEEE 43rd Annual International
Symposium on Computer Architecture (ISCA), pp. 14–26, June 2016.
[30] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor
processing unit,” CoRR, vol. abs/1704.04760, 2017.
[31] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An
Energy-Efficient Reconfigurable Accelerator for Deep Convolutional
Neural Networks,” in IEEE International Solid-State Circuits
Conference, ISSCC 2016, Digest of Technical Papers, pp. 262–263,
2016.
[32] B. Moons and M. Verhelst, “An energy-efficient precision-scalable
convnet processor in a 40-nm CMOS,” IEEE Journal of Solid-State
Circuits, vol. PP, no. 99, pp. 1–12, 2016.
[33] B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable
processor for real-time large-scale ConvNets,” in 2016 IEEE
Symposium on VLSI Circuits (VLSI-Circuits), pp. 1–2, June 2016.
