EDCompress: Energy-Aware Model Compression for Dataflows by Wang, Zhehui et al.
arXiv:2006.04588v1 [cs.LG] 8 Jun 2020
EDCompress: Energy-Aware Model Compression with Dataflow
Zhehui Wang, Tao Luo, Joey Tianyi Zhou, Rick Siow Mong Goh
{wang_zhehui, luo_tao, joey_zhou, gohsm}@ihpc.a-star.edu.sg
Institute of High Performance Computing
Agency for Science, Technology and Research (A*STAR)
Abstract
Edge devices demand low energy consumption, low cost, and a small form factor. To deploy convolutional neural network (CNN) models efficiently on edge devices, energy-aware model compression becomes extremely important. However, existing work has not studied this problem well because it does not consider the diversity of dataflow in hardware architectures. In this paper, we propose EDCompress, an Energy-aware model compression method that effectively reduces the energy consumption and area overhead of hardware accelerators with different Dataflows. Considering the very nature of model compression procedures, we recast the optimization process as a multi-step problem and solve it with reinforcement learning algorithms. Experiments show that EDCompress improves energy efficiency by 20X, 17X, and 37X in the VGG-16, MobileNet, and LeNet-5 networks, respectively, with negligible loss of accuracy. EDCompress can also find the optimal dataflow type for a specific neural network in terms of energy consumption and area overhead, which can guide the deployment of CNN models on hardware systems.
1 Introduction
Convolutional neural networks (CNNs) perform well in applications such as image classification and object detection. However, traditional CNNs are large, which makes them challenging to implement on edge devices. For example, the VGG-16 network contains 528 MB of weights [28], and classifying one image requires 1.5 × 10^10 multiply–accumulate (MAC) operations. There are two consequences. First, the limited memory of edge devices cannot store the parameters. Second, the edge device becomes power hungry because the calculation and data movement operations consume a large amount of energy.
Model compression, such as quantization and pruning, is an emerging technique developed in recent years to alleviate this problem. Most model compression methods aim to reduce the model size. For example, Han et al. proposed the Deep Compression method [15], which helps fit neural networks into the on-chip memory of hardware accelerators. However, the model size does not directly determine the two significant metrics of edge devices, i.e., the energy consumption and the area overhead.
To demonstrate this, we compare our work EDCompress (EDC) with Deep Compression (DC) in Figure 1. We can see that although EDCompress shows a lower compression rate, it has higher energy and area efficiency than DC. This is because the energy consumption depends not only on the model size but also on the dataflow design, which is the way we reuse the data. In hardware accelerators, different processing elements may share the same input or output data. By reusing the data, there is no need to load it from memory multiple times. Given that a large portion of the energy is spent on data movement (e.g., around 72% in VGG-16), a good dataflow design can effectively improve the energy efficiency. In this paper, we propose EDCompress, which has the following two features:
Preprint. Under review.
[Figure 1 panels: Compression Rate (X); Energy Efficiency Improvement (X); Area Efficiency Improvement (X); each panel compares DC and EDC]
Figure 1: Comparison between our EDCompress (EDC) and Deep Compression (DC)
Algorithm 1 Computation of a typical convolutional layer
for co in range(CO) do
  for ci in range(CI) do
    for x in range(X) do
      for y in range(Y) do
        for fx from -(FX-1)/2 to (FX-1)/2 do
          for fy from -(FY-1)/2 to (FY-1)/2 do
            O[co][x][y] += I[ci][x+fx][y+fy] × W[co][ci][fx][fy]
• Dataflow Awareness: This paper is the first to study the model compression problem with knowledge of the diversity of dataflow designs. We study the impact of different dataflow designs on quantization and pruning, and identify the best model compression strategy in terms of energy consumption and area.
• Automated Approach: We first formulate energy-aware model compression as a multi-step optimization problem. At each step, we partially quantize or prune the model, and then fine-tune the model for a few epochs. We further recast it as a reinforcement learning task.
2 Related work
Many energy-based model compression methods have been proposed in the literature. For example, Wang et al. proposed a hardware-aware quantization method using the Deep Deterministic Policy Gradients (DDPG) algorithm [34]. He et al. proposed a pruning method for mobile devices using the DDPG algorithm [16]. Yang et al. proposed an energy-aware pruning method for low-power devices [38]. Cai et al. [2] and Yang et al. [39] proposed optimization methods to reduce the latency of neural networks. Several other works also focus on model compression techniques, such as [15] [12] [35] [24] [3] [25] [22] [29]. According to previous work, there are several effective model compression techniques, including pruning and quantization. In pruning [19] [21] [9] [8] [37] [14], we reduce the model size by replacing weights that have small absolute values with zeros. In quantization [6] [11] [32], we reduce the model size by decreasing the precision of the weights and the activations. These works reduce the size of the parameters stored in memory so that the compressed CNN can be deployed on edge devices. Our proposed EDCompress differs from them because it considers the diversity of dataflow designs.
3 Energy-Aware Quantization/Pruning with Dataflow
Dataflow is an important concept in accelerators. It describes the mapping strategy between the mathematical operations and the processing elements [40]. Algorithm 1 shows the computation of a typical convolutional layer. The algorithm contains six loops, each corresponding to one dimension of either the filter or the feature map. Here, CO and CI denote the number of output and input channels, X and Y denote the width and height of the feature map, and FX and FY denote the width and height of the filter. In each iteration of the innermost loop, we perform a basic operation called multiply–accumulate (MAC). Before the MAC operation, we read three elements from memory: one from the input feature map, one from the weights, and one from the output feature map. After the MAC operation, we write the result back to memory. During the whole process, most of the energy is spent on the MAC calculation and data movement. To compute one convolutional layer, we need to execute CO · CI · X · Y · FX · FY MAC operations in total.
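The six loops of Algorithm 1 can be written as a small runnable sketch. The version below is illustrative only: it assumes odd filter sizes, treats the input as zero outside the feature map (the algorithm does not say how borders are handled), and indexes the filter from 0 rather than from -(FX-1)/2. The function name `conv_layer` is a placeholder.

```python
def conv_layer(I, W, CO, CI, X, Y, FX, FY):
    """Direct convolution following the six loops of Algorithm 1.

    I is indexed I[ci][x][y]; W is indexed W[co][ci][fx][fy] with fx, fy
    counted from 0. Values outside the feature map are assumed to be zero.
    Returns the output feature map and the number of MAC operations.
    """
    O = [[[0.0] * Y for _ in range(X)] for _ in range(CO)]
    macs = 0
    for co in range(CO):
        for ci in range(CI):
            for x in range(X):
                for y in range(Y):
                    for fx in range(FX):
                        for fy in range(FY):
                            # Shift the filter index so the window is centred on (x, y).
                            ix = x + fx - (FX - 1) // 2
                            iy = y + fy - (FY - 1) // 2
                            inside = 0 <= ix < X and 0 <= iy < Y
                            value = I[ci][ix][iy] if inside else 0.0
                            O[co][x][y] += value * W[co][ci][fx][fy]
                            macs += 1  # one multiply-accumulate per innermost iteration
    return O, macs
```

The returned counter equals CO · CI · X · Y · FX · FY, which is the MAC count stated above.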
In hardware accelerators using spatial architectures, we have an array or a matrix of processing elements, each of which can execute the MAC operation independently. The strategy for mapping the operations onto those elements becomes a key consideration in the hardware design. There is a large design space to explore. For example, given an array of processing elements, we can unroll any one of the loops in the algorithm and map each iteration of the loop onto one processing element in the array.
[Figure 2 panels: (a) Four dataflow designs (X:Y, X:FX, FX:FY, CI:CO), drawn with multipliers, adders, registers, input links and output links; (b) Quantization on hardware; (c) Pruning on hardware]
Figure 2: (a) The hardware accelerators with four popular dataflow designs. In each dataflow, we show four processing elements, and each of them contains a multiplier and an adder; (b) If the weights are quantized from 4 bits to 3 bits, we can skip the first row of adders; (c) If the weights are pruned, we can skip those multipliers whose weights are zero
Table 1: Popular dataflow types
Dataflow    Applied by
X : Y       [7] [30]
FX : FY     [26]
X : FX      [5] [10] [23]
CI : CO     [4] [18] [41] [1] [27] [31]
By similar rules, we can further unroll two loops in the algorithm and map the MAC operations onto a matrix of processing elements. With six loops in total, there are C(6,2) = 15 possibilities, each corresponding to one dataflow design. Here, we introduce four popular dataflows in Table 1. They are denoted as A:B, where A and B stand for the names of the two unrolled loops.
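The count of 15 can be checked by enumerating every unordered pair of the six loops. This is a minimal sketch; the loop ordering in `LOOPS` is chosen only so that the generated labels match the A:B names used in Table 1.

```python
from itertools import combinations

# The six loop dimensions of Algorithm 1.
LOOPS = ["CI", "CO", "X", "Y", "FX", "FY"]

def dataflow_candidates(loops=LOOPS):
    """Return every unordered pair of loops as an 'A:B' label."""
    return [f"{a}:{b}" for a, b in combinations(loops, 2)]

candidates = dataflow_candidates()
print(len(candidates))   # number of two-loop dataflow designs
print(candidates)
```

The four dataflows studied in this paper (X:Y, FX:FY, X:FX, CI:CO) appear among the generated labels.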
Different dataflow designs employ different data movement policies, and thus show different energy efficiency and area overhead. In Figure 2 (a), we show examples of four popular dataflow designs. To simplify the figure, we only show four processing elements in each example. In real implementations, the A:B dataflow design requires A · B processing elements. In X:Y, we store MAC operation results in registers at the output ports of the processing elements. At each iteration, we read the last MAC operation result from the registers. In FX:FY, we store FX · FY weights in registers at the input ports of the processing elements. At each iteration, we sum up FX · FY MAC operation results. In X:FX, we store FX weights in registers at the input ports of the processing elements. At each iteration, we reuse the weights X times and sum up FX MAC operation results. In CI:CO, at each iteration, we reuse the input feature map CO times and sum up CI MAC operation results.
3.1 Improvement on Energy and Area Efficiency
Quantization and pruning are two popular techniques in model compression. To quantize a model, we lower the precision of the parameters according to the quantization depth (the number of bits representing a parameter). After quantization, the low-precision parameters may still store enough information for model inference. To prune a model, we set some of the parameters in the model to zero. A well-trained model usually contains many weights with small absolute values. We sort all the weights in the filter and replace those with the smallest absolute values by zeros.
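The two operations can be sketched on a flat list of weights. The pruning function follows the magnitude-based rule described above. The quantization function uses symmetric uniform rounding, which is an assumption made for illustration, since the text does not specify the quantization scheme; both function names are placeholders.

```python
def prune_by_magnitude(weights, remaining):
    """Zero the smallest-magnitude weights, keeping a `remaining` fraction."""
    keep = round(len(weights) * remaining)
    # Indices sorted from largest to smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]

def quantize_uniform(weights, bits):
    """Round weights to signed `bits`-bit levels (assumed symmetric uniform scheme)."""
    if bits < 2:
        return [0.0 for _ in weights]
    max_abs = max(abs(w) for w in weights) or 1.0
    levels = 2 ** (bits - 1) - 1          # e.g. 3 positive levels for 3 bits
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]

weights = [0.9, -0.05, 0.4, 0.01, -0.7]
pruned = prune_by_magnitude(weights, remaining=0.6)   # 60% pruning remaining amount
compressed = quantize_uniform(pruned, bits=3)
print(pruned)
print(compressed)
```

Pruning is applied first here so that the quantization range is set by the surviving weights; the opposite order would also be a valid design choice.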
We can save energy and reduce the area overhead of the logic circuit using quantization and pruning. Figure 2 (b) shows the inner structure of a 4 bit × 4 bit multiplier, which contains 12 adders. If the weights are quantized from 4 bits to 3 bits, we can skip the last row of adders, and thus save energy and reduce the area overhead. In real applications, a high-precision model with the 32FP data type (32-bit floating point) requires 23 bit × 23 bit multipliers, with 506 adders in total. If both the activations and the weights can be quantized, we can save a great deal of energy and area. For example, if the activations are quantized from 32FP to 16FP, and the weights are quantized from 32FP to 8INT (8-bit integer), only 10 bit × 8 bit multipliers are required, with 72 adders in total, which is 86% fewer than the original amount. Figure 2 (c) shows an array of three processing elements, each containing a multiplier and an adder. If the weights are pruned, some processing elements have inputs equal to zero. In this case, we can skip the related multipliers, and thus save energy.
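The adder counts quoted above can be reproduced with a simple counting rule. The rule used here, (M - 1) · N adders for an M bit × N bit multiplier, is an assumption inferred from the three figures in the text (12, 506 and 72); it is not stated explicitly in the paper.

```python
def adder_count(m_bits, n_bits):
    """Adders in an m x n array multiplier, assuming (m - 1) * n adders."""
    return (m_bits - 1) * n_bits

example = adder_count(4, 4)      # the 4 bit x 4 bit multiplier of Figure 2 (b)
full = adder_count(23, 23)       # 32FP activations and weights
small = adder_count(10, 8)       # 16FP activations, 8INT weights
saving = 1 - small / full        # fraction of adders removed

print(example, full, small, f"{saving:.0%}")
```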
[Figure 3 panels: (a) optimization order from Step 0 to Step t on the weights W0 to W4 of a layer, each step followed by fine-tuning and a check of the energy and of the accuracy against threshold T; (b) an agent and the model exchanging state s_t, reward r_t and per-layer actions (p, q), with fine-tuning and hardware implementation providing accuracy and energy consumption]
Figure 3: (a) The multi-step optimization; (b) The reinforcement learning based optimization model. The agent increases or decreases the quantization depth/pruning remaining amount at each step
We can also save energy and reduce the area overhead of the memory modules using quantization and pruning. To run inference on a model, we need to store all the weights and put the intermediate feature map of each layer into memory. The memory can be either on-chip or off-chip. No matter which type of memory we use, the data movement energy and the area overhead of the memory modules are proportional to the total amount of data transmitted in bits. To decrease this value, we can either reduce the size of the parameters by quantization or reduce the number of parameters by pruning. For example, if we quantize the parameters from 32FP to 16FP and prune half of the parameters, then roughly 75% of the energy and area of the memory modules can be saved.
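The 75% figure follows from the proportionality to transmitted bits, assuming that pruned parameters are neither stored nor moved and that any indexing overhead is ignored:

$$\frac{\text{bits after}}{\text{bits before}} = \frac{16}{32} \times \frac{1}{2} = \frac{1}{4}, \qquad \text{saving} = 1 - \frac{1}{4} = 75\%.$$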
3.2 Recasting to the Multi-Step Problem
We recast the model compression process as a multi-step problem. Our goal is to lower the energy consumption and area overhead of edge devices while preserving the accuracy of the model. Instead of quantizing/pruning the model directly in one step, we approach our final target through a sequence of quantization/pruning steps. This is because we cannot alter the parameters too much at one time. Otherwise, the performance of the model degrades noticeably, and it becomes too difficult to restore the model [42].
We show an example of the multi-step optimization process in Figure 3 (a). In each step, we increase or decrease the quantization depth (the precision of the parameters) or the pruning amount in different layers. For example, in step 1, we prune 40% of the weights, and the remaining weights are quantized to 7 bits. We then fine-tune the model by training for a few more epochs, and check the accuracy and energy of the model. If the accuracy is greater than the threshold, we change the quantization depth and the pruning amount, and repeat the optimization process. In step t, we prune 60% of the weights and quantize the remaining weights to 3 bits. Since the accuracy drops below the threshold at this step, we abort the optimization process. The quantization depth and the pruning amount can be adjusted independently at each step.
The search space of this problem is huge. In general, an L-layer model has 15 × 100^L × 2^{3L} possible choices, assuming a 1% pruning granularity. Designers always face many choices, and in most cases they have to make decisions based on their experience with different dataflows.
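To give a sense of scale, the count can be evaluated for a small model. The sketch reads the expression as 15 dataflows times 100 pruning levels per layer times 2^3 quantization levels per layer; this reading of the exponents is an interpretation of the formula above, and the function name is a placeholder.

```python
def search_space_size(num_layers, dataflows=15, prune_levels=100, quant_levels=2 ** 3):
    """Number of (dataflow, per-layer pruning, per-layer quantization) choices."""
    return dataflows * prune_levels ** num_layers * quant_levels ** num_layers

# A four-layer model such as the LeNet-5 variant evaluated in Table 4.
print(search_space_size(4))
```

Even for four layers the count is in the trillions, which is why exhaustive search is impractical.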
3.3 Optimization through Reinforcement Learning
Reinforcement learning is a good candidate for solving the multi-step problem. We propose a method that searches for the best model compression strategy for high energy efficiency and high area efficiency via reinforcement learning algorithms, considering the diversity of dataflow designs. This mechanism can automatically explore the design space and find the optimal quantization/pruning policies for each dataflow. We show an overview of our reinforcement learning model in Figure 3 (b). In each episode, an agent interacts with the environment (the CNN model) via a sequence of steps. In each step t, the agent generates an action vector $a_t$ based on the state vector $s_t$ of the environment. The environment responds to action $a_t$, quantizes/prunes the parameters in the model, and changes its state to $s_{t+1}$. The model is then fine-tuned for one or a few epochs, and a reward $r_t$ that considers both accuracy and energy consumption is returned. For large datasets such as ImageNet, the model is not fine-tuned in the first few steps. The agent then updates its own parameters to achieve higher rewards in later actions. In each episode, we start from a 100% pruning remaining amount and an 8 bit quantization depth. An episode ends if the number of steps exceeds the limit or the accuracy of the model drops below the predefined threshold. The purpose of the step limit is to stop the episode once we reach the optimal point.
$$Q_t^l = Q_0^l + \sum_{i=0}^{t-1} q_i^l \gamma^i, \qquad P_t^l = P_0^l + \sum_{i=0}^{t-1} p_i^l \gamma^i \tag{1}$$
The quantization depth and the pruning remaining amount can be expressed by Equation 1. Here, $Q_0^l$ and $P_0^l$ denote the original quantization depth and pruning remaining amount of the l-th layer in the CNN model before the optimization. $Q_t^l$ and $P_t^l$ denote the quantization depth and the pruning remaining amount after optimization step t − 1 (t ≥ 1). To obtain $Q_t^l$ and $P_t^l$, we need t steps of optimization. In step i, the agent changes the values of $Q^l$ and $P^l$ by $q_i^l$ and $p_i^l$, respectively. To get a better optimization result, we take smaller steps when $Q_t^l$ and $P_t^l$ are close to the optimal point. The discount factor γ is used to regulate the variance of $q_i^l$ and $p_i^l$. We test different values of γ in experiments, and find that γ = 0.9 works best.
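Equation 1 can be evaluated directly for a single layer. The sketch below is illustrative; the function name `accumulate` and the example action values are hypothetical, and only the starting point (8 bits, 100% remaining) and γ = 0.9 come from the text.

```python
def accumulate(initial, deltas, gamma=0.9):
    """Value after len(deltas) steps: initial + sum_i deltas[i] * gamma**i."""
    return initial + sum(d * gamma ** i for i, d in enumerate(deltas))

# One layer starting at 8 bits and 100% remaining weights, after three steps.
Q3 = accumulate(8.0, [-1.0, -1.0, -1.0])    # 8 - (1 + 0.9 + 0.81)
P3 = accumulate(1.0, [-0.1, -0.1, -0.1])    # 1 - (0.1 + 0.09 + 0.081)
print(Q3, round(Q3), P3)
```

Because later changes are scaled by higher powers of γ, the same raw action moves Q and P less as the episode progresses, which matches the stated goal of taking smaller steps near the optimal point.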
$$a_t = \Big(\bigcup_{l=0}^{L-1} \{q_t^l\}\Big) \cup \Big(\bigcup_{l=0}^{L-1} \{p_t^l\}\Big) \tag{2}$$
The action $a_t$ can be expressed by Equation 2. Here $a_t$ is the set containing the changes of Q and P in all layers. Although the quantization depth is a discrete variable, we use a continuous action space. This is because we do not want to lose the small changes of the quantization depth accumulated over the optimization steps. When we fine-tune the network, we round the quantization depth to the nearest integer value.
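$$s_t = \Big(\bigcup_{m=t-\tau}^{t} \bigcup_{l=0}^{L-1} \{Q_m^l\}\Big) \cup \Big(\bigcup_{m=t-\tau}^{t} \bigcup_{l=0}^{L-1} \{P_m^l\}\Big) \cup \Big(\bigcup_{m=t-\tau}^{t} \{r_m\}\Big) \cup \{t\} \tag{3}$$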
The state $s_t$ can be expressed by Equation 3. Here $s_t$ is the set containing the quantization depth Q, the pruning remaining amount P, and the reward r from step t − τ to step t. It also contains t, the index of the current step. We want the state of the environment to reflect the history of the optimization process well. Hence, the state contains the values of Q and P in the previous τ steps. To guarantee that the state set has the same dimension at every optimization step, we set $Q_{t-\tau} = Q_0$ and $P_{t-\tau} = P_0$ if t is less than τ.
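One way to assemble the state of Equation 3 as a fixed-length vector is sketched below. The function name and data layout are hypothetical. Padding Q and P with their initial values follows the rule stated above; padding the reward history with its first entry is an additional assumption, since the text only describes the padding of Q and P.

```python
def build_state(Q_hist, P_hist, r_hist, t, tau):
    """Flatten Q, P and r from step t - tau to t, followed by t itself.

    Q_hist[m] and P_hist[m] are per-layer lists for step m (0 <= m <= t);
    r_hist[m] is the reward at step m.
    """
    state = []
    window = range(t - tau, t + 1)
    for m in window:
        state.extend(Q_hist[max(m, 0)])   # Q_m = Q_0 before step 0
    for m in window:
        state.extend(P_hist[max(m, 0)])   # P_m = P_0 before step 0
    for m in window:
        # Assumption: rewards before step 0 are padded with r_0.
        state.append(r_hist[max(m, 0)])
    state.append(t)
    return state

# Two layers, history window tau = 2, currently at step t = 1.
Q_hist = [[8, 8], [7, 8]]
P_hist = [[1.0, 1.0], [0.9, 1.0]]
r_hist = [1.0, 1.2]
s1 = build_state(Q_hist, P_hist, r_hist, t=1, tau=2)
print(len(s1), s1)
```

With L layers the vector always has (τ + 1)(2L + 1) + 1 entries, so the agent sees the same input dimension at every step.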
$$r_t = (\alpha_t / \alpha_{t-1})^{\lambda} \cdot \beta_{t-1} / \beta_t \tag{4}$$
The reward $r_t$ can be expressed by Equation 4. Here, $\alpha_t$ and $\alpha_{t-1}$ are the accuracy at the current step t and the previous step t − 1, respectively. $\beta_t$ and $\beta_{t-1}$ are the energy consumption at step t and step t − 1. The area overhead is not involved in this equation because it is highly correlated with the energy consumption: low energy consumption comes with low area overhead. In the optimization process, we want to reduce the energy consumption while maintaining the accuracy of the model. Intuitively, decreasing the quantization depth and the pruning remaining amount reduces the energy consumption and, at the same time, decreases the accuracy. The reinforcement learning algorithm can automatically find the trade-off point between the accuracy α and the energy consumption β. We use a third parameter λ to weight the importance of accuracy over energy consumption. It is normally greater than 1 and is fixed during the optimization. We test different values of λ in experiments, and find that λ = 3 works best.
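Equation 4 translates into a one-line function. The example values below are hypothetical and only illustrate how the exponent λ makes an accuracy drop outweigh an equal-sized energy saving.

```python
def reward(acc_t, acc_prev, energy_t, energy_prev, lam=3.0):
    """Reward of Equation 4: (acc_t / acc_prev) ** lam * (energy_prev / energy_t)."""
    return (acc_t / acc_prev) ** lam * (energy_prev / energy_t)

# Hypothetical steps: both save 20% energy, with a small or a large accuracy drop.
good = reward(acc_t=0.92, acc_prev=0.93, energy_t=0.8, energy_prev=1.0)
bad = reward(acc_t=0.80, acc_prev=0.93, energy_t=0.8, energy_prev=1.0)
print(good, bad)
```

A reward above 1 means the step improved the trade-off relative to the previous step; a reward below 1 means the accuracy loss was not worth the energy saved.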
4 Experiment
Algorithm setup: we use a state-of-the-art reinforcement learning algorithm, SAC (soft actor-critic) [13], to train our optimization model. Compared with classical large-space problems, the search space in our problem is not large, and SAC can approach the optimal solutions very quickly (less than one day on ImageNet using a single Titan Xp graphics card). We test EDCompress on the ImageNet, CIFAR-10 and MNIST datasets using three different neural networks: VGG-16 [28], MobileNet [17] and LeNet-5 [20]. VGG-16 is a complex deep neural network. MobileNet is designed for computation efficiency. LeNet-5 is a simple neural network with only two convolutional layers. We study four dataflow types, which are the most commonly used ones. In each episode, we
Table 2: Comparison of EDCompress and HAQ [34] on ImageNet using MobileNet
              Norm. Energy      Norm. Area
Dataflow      [34]    Ours      [34]    Ours
X : Y         5.44    1.41      26.1    5.27
FX : FY       6.31    1.81      2.53    1.00
X : FX        6.32    1.81      2.53    1.00
CI : CO       4.48    1.00      505     92.0
Top-1 Acc.    64.8    68.3      64.8    68.3
Top-5 Acc.    85.9    88.3      85.9    88.3
Table 3: Comparison of EDCompress and the previous work [22] [29] on CIFAR-10 using VGG-16
              Norm. Energy              Norm. Area
Dataflow      [22]     [29]     Ours    [22]    [29]    Ours
X : Y         24.41    15.10    1.69    7.78    5.56    1.00
FX : FY       22.61    14.42    2.31    6.42    4.20    1.27
X : FX        22.17    15.10    2.73    6.42    4.20    1.42
CI : CO       19.68    12.21    1.00    434     431     47.58
Accuracy      93.1     93.4     91.3    93.1    93.4    91.3
Table 4: Comparison of EDCompress and the previous work [15] [12] [35] [24] [3] [25] on MNIST using LeNet-5 network. Total area is the maximum area that can support the function of each layer
                     Energy (µJ)                                       | Area (mm^2)
Dataflow   Layer     [15]   [12]   [35]   [24]   [3]    [25]   Ours    | [15]   [12]   [35]   [24]   [3]    [25]   Ours
X : Y      Conv1     1.62   3.34   6.29   2.93   3.61   15.76  0.27    | 0.95   5.77   5.77   5.77   5.77   5.81   0.53
X : Y      Conv2     0.60   1.47   1.75   1.20   0.92   8.29   0.57    | 0.15   0.78   0.78   0.78   0.78   0.81   0.09
X : Y      FC1       0.06   0.07   0.04   0.06   0.02   0.32   0.11    | 0.02   0.03   0.03   0.03   0.03   0.06   0.02
X : Y      FC2       0.03   0.09   0.17   0.07   0.08   1.14   0.02    | 0.08   0.63   0.63   0.62   0.62   0.66   0.07
X : Y      Total     2.31   4.96   8.25   4.25   4.62   25.5   0.96    | 0.97   5.81   5.81   5.80   5.80   5.83   0.55
FX : FY    Conv1     1.33   3.09   5.67   2.73   3.33   13.91  0.22    | 0.05   0.20   0.20   0.20   0.20   0.24   0.03
FX : FY    Conv2     0.58   1.58   1.86   1.29   0.99   7.78   0.36    | 0.06   0.23   0.23   0.23   0.23   0.26   0.04
FX : FY    FC1       0.08   0.08   0.05   0.07   0.02   0.38   0.09    | 0.04   0.20   0.21   0.20   0.20   0.24   0.03
FX : FY    FC2       0.03   0.09   0.17   0.07   0.08   1.14   0.02    | 0.08   0.63   0.63   0.62   0.62   0.66   0.06
FX : FY    Total     2.03   4.84   7.75   4.16   4.42   23.21  0.69    | 0.09   0.66   0.66   0.66   0.66   0.7    0.08
X : FX     Conv1     1.17   3.44   6.05   3.05   3.70   12.93  0.39    | 0.18   1.04   1.05   1.04   1.04   1.08   0.11
X : FX     Conv2     0.71   1.69   2.00   1.37   1.04   8.66   0.53    | 0.09   0.41   0.41   0.41   0.41   0.45   0.06
X : FX     FC1       0.10   0.09   0.05   0.07   0.02   0.41   0.20    | 0.02   0.06   0.06   0.06   0.06   0.09   0.02
X : FX     FC2       0.03   0.09   0.17   0.07   0.08   1.14   0.02    | 0.08   0.63   0.63   0.62   0.62   0.66   0.07
X : FX     Total     2.01   5.31   8.28   4.56   4.84   23.13  1.14    | 0.2    1.07   1.07   1.07   1.07   1.11   0.12
CI : CO    Conv1     2.08   4.07   7.58   3.57   4.40   18.32  0.36    | 0.02   0.06   0.06   0.06   0.06   0.10   0.02
CI : CO    Conv2     0.73   1.81   2.14   1.47   1.13   8.88   0.63    | 0.14   0.75   0.75   0.75   0.75   0.78   0.09
CI : CO    FC1       0.06   0.09   0.06   0.08   0.03   0.35   0.08    | 1.55   14.11  14.11  14.11  14.11  14.15  1.29
CI : CO    FC2       0.03   0.09   0.17   0.07   0.08   1.14   0.02    | 0.08   0.63   0.63   0.62   0.62   0.66   0.07
CI : CO    Total     2.91   6.05   9.94   5.19   5.64   28.68  1.09    | 1.56   14.14  14.14  14.14  14.13  14.17  1.3
Accuracy             99.3   99.1   99.1   99.1   99.0   99.1   98.6    | 99.3   99.1   99.1   99.1   99.0   99.1   98.6
start from a well-trained model. When an episode ends, we restore the weights from a saved checkpoint and reset the quantization depth/pruning remaining amount in each layer.
Hardware setup: we implement the popular dataflows X:Y, FX:FY, X:FX and CI:CO on the Xilinx Virtex UltraScale FPGA, and obtain the energy consumption and area overhead from the Xilinx XPE toolkit [36]. The energy can be reported in a few seconds. In the logic part, the multipliers, adders and registers are implemented on LUTs (lookup tables). An M × N multiplier requires M/2 × (N + 1) LUTs [33]. In our experiment, parameters in the feature map are quantized to 10 bits, while the weights are quantized to q bits (q ranging from 0 to 8). Hence, we need 5q LUTs for a single 10 × (q + 1) multiplier. In the memory part, the on-chip memory is implemented on RAM (Random-Access Memory) modules. During inference, to save memory space, the input feature map is not kept after the computation of each layer. Hence, the size of the memory modules must support the weights in all layers plus the largest feature map in the model.
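The memory sizing rule in the last sentence can be sketched as follows. The function name, the argument layout and the example layer sizes are all hypothetical; the sketch also assumes that pruned weights are already excluded from the per-layer weight counts and that feature maps use the 10 bit activations mentioned above.

```python
def on_chip_memory_bits(weight_counts, weight_bits, feature_map_sizes, activation_bits=10):
    """Memory needed for all layers' weights plus the single largest feature map."""
    weights = sum(n * b for n, b in zip(weight_counts, weight_bits))
    largest_map = max(feature_map_sizes) * activation_bits
    return weights + largest_map

# Hypothetical two-layer model: remaining weights, per-layer depth, feature map sizes.
total_bits = on_chip_memory_bits(
    weight_counts=[100, 200],
    weight_bits=[4, 2],
    feature_map_sizes=[50, 30],
)
print(total_bits)
```

Only the largest feature map is counted because each layer's input is discarded once that layer has been computed.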
4.1 Comparison with the State-of-the-Art
EDCompress is effective on all kinds of datasets. Table 2, Table 3 and Table 4 compare EDCompress with the state-of-the-art work on the ImageNet, CIFAR-10 and MNIST datasets. Compared with HAQ on ImageNet, EDCompress, tested on four dataflow types, achieves on average 3.8X and 3.9X improvement in energy and area efficiency with similar accuracy. We then focus on small datasets because we target edge devices running lightweight applications. The results show that, across the four dataflows, EDCompress reduces the energy consumption and area overhead more effectively, with negligible loss of accuracy. Compared with the state-of-the-art work, EDCompress shows 9X improvement in energy efficiency and 8X improvement in area efficiency on LeNet-5, averaged over the four dataflows. It also shows 11X/6X improvement in energy/area efficiency on VGG-16. If we optimize the model with EDCompress, the dataflow FX:FY is the most appropriate choice for LeNet-5 in terms of energy consumption and area overhead, and the dataflow X:Y is the most appropriate one for VGG-16.
[Figure 4 axes and legend: x-axis: layers C1, C2, L1, L2 for each of X:Y, FX:FY, X:FX, CI:CO; left y-axes: Normalized Energy and Normalized Area; right y-axis: Number of Parameters; bars for Deep Compression and EDCompress; energy breakdown: Calculation, Input feature movement, Weight movement, Output feature movement; area breakdown: Logic (LUT), RAM-Activation (each layer), RAM-Weights (all layers)]
Figure 4: Layerwise comparison of energy consumption and area overhead between EDCompress and Deep Compression on LeNet-5. The color bar denotes the breakdown of energy and area, and the red polyline denotes the number of parameters in each layer (right-hand y-axis)
[Figure 5 panels: VGG-16, MobileNet, LeNet-5; x-axis: Optimization Step (0 to 32); y-axes: Normalized Energy and Accuracy; curves: X:Y, FX:FY, X:FX, CI:CO]
Figure 5: Optimization process of EDCompress on CIFAR-10 (VGG-16/MobileNet) and MNIST (LeNet-5). In each episode, we run thirty-two steps. The curves show the energy consumption of four dataflows, and the bars show the accuracy of the model
The comparisons also indicate that, rather than merely compressing the model size, EDCompress is more efficient at reducing energy consumption and area overhead. For example, in Figure 4, we compare the energy and area of EDCompress and Deep Compression (DC) [15], layer by layer. From the figure, EDCompress shows 2.4X higher energy efficiency and 1.4X higher area efficiency than DC. We can see that in the third layer, DC performs better than EDCompress on energy consumption because this layer contains 93% of the total parameters. However, this layer does not contribute most of the energy consumption. In fact, compressing the first layer is more helpful for energy reduction, although it contains only 0.1% of the parameters. Figure 4 and Table 4 show that EDCompress reduces much more energy consumption and area overhead in the first layer than previous work. Another example is the dataflow CI:CO, whose third layer contributes most of the area overhead. From the figure, we can see that EDCompress shows higher area efficiency than DC in the third layer. This observation further shows that EDCompress is more efficient at reducing hardware resources.
4.2 Insights on Dataflow
Quantization and pruning have different effects on different dataflow designs. Figure 5 shows the
optimization process of the hardware accelerators using three neural networks in terms of energy
consumption and accuracy. We start the optimization from a model with activations in 16FP data
type and weights in 8INT data type. From the figure, we can see that the reinforcement learning al-
gorithm could effectively reduce the energy consumption, with negligible loss of accuracy. Figure 6
[Figure 6 panels: VGG-16, MobileNet, LeNet-5; x-axis: dataflow (X:Y, FX:FY, X:FX, CI:CO); y-axis: Normalized Energy; bars: Before EDCompress, After EDCompress; breakdown: Calculation, Input feature movement, Weight movement, Output feature movement]
Figure 6: Energy consumption breakdown before and after the optimization of EDCompress. The solid bar and patterned bar represent results before and after the EDCompress, respectively
[Figure 7 panels: Energy Reduction and Area Reduction (0 to 100%) for VGG-16, MobileNet, LeNet-5; x-axis: dataflow (X:Y, FX:FY, X:FX, CI:CO); bars: Improvement from quantization only, Improvement from pruning only, Improvement from both quantization and pruning]
Figure 7: The performance of EDCompress by applying quantization technique only, pruning technique only, and both quantization/pruning techniques
shows the energy consumption breakdown of each dataflow before EDCompress (model using 16FP activations and 8INT weights) and after EDCompress. If we compare the optimized result from EDCompress with the original model, the energy efficiency of the VGG-16, MobileNet, and LeNet-5 networks is improved by 20X, 17X, and 37X, respectively. More specifically, around 55% of the energy saving comes from the processing elements and the remaining 45% comes from data movement.
The results also indicate that optimization can change our choice of dataflow type. Dataflows that do not show good energy efficiency before the optimization may show very high energy efficiency after it. Take VGG-16 for example. Before the optimization, the dataflow X:Y consumes the most energy among the four dataflows. However, after the optimization, X:Y consumes the second lowest energy. This is because the energy consumption of hardware accelerators includes the energy of MAC operations on the processing elements and the energy of data movement. As we can see from Figure 6, given a fixed pruning remaining amount and quantization depth, the energy consumed on the processing elements is almost the same across dataflows. The only way to save energy is to spend less on data movement. Due to the optimization, the energy consumed on data movement decreases because the amount of delivered data is reduced. In this process, different dataflow designs see different reductions in the delivered data. X:Y, in this case, is more efficient in data movement reduction, and therefore we can save more energy on this dataflow than on the other dataflow types.
4.3 Insights on Quantization/Pruning
The effectiveness of quantization and pruning techniques on the reduction of energy consumption
and area overhead is highly related to the dataflow design. Figure 7 shows their individual contribu-
tions. From the figure, we can see that in most cases, both quantization and pruning can effectively
reduce the energy consumption and area overhead. More specifically, if we apply quantization tech-
nique only, EDCompress can achieve 5.6X improvement on energy efficiency and 4.3X improve-
ment on area efficiency. If we apply pruning techniques only, EDCompress can achieve 3.8X/1.7X
improvement on energy/area efficiency.
We have two observations in Figure 7. First, pruning shows very little improvement on the area
overhead of the CI : CO dataflow design. Second, the small-scale model LeNet-5 prefers quantization
over pruning. This is because in these cases, the accelerator demands more area for the processing
elements than for the memory modules. Pruning can effectively reduce the area of memory modules
because it reduces the model size. However, it is not good at decreasing the area of processing
elements. Quantization, on the other hand, can reduce the area of both processing elements
and memory modules effectively. Hence, the quantization technique is more useful in these
cases.
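The argument above can be sketched with a toy area model. It assumes that pruning scales only the
memory area (by the fraction of weights kept) and that quantization scales both the processing
element area and the memory area linearly with the bit width. The linear scaling and all numbers
are illustrative assumptions, not values from the paper.

```python
def area_after(pe_area, mem_area, keep_ratio=1.0, bit_ratio=1.0):
    """Toy area model (assumption).

    keep_ratio: fraction of weights remaining after pruning (affects memory only).
    bit_ratio:  quantized bit width divided by the original bit width
                (assumed to scale both processing elements and memory linearly).
    """
    return pe_area * bit_ratio + mem_area * keep_ratio * bit_ratio


# Hypothetical accelerator dominated by processing element area.
pe_area, mem_area = 9.0, 1.0

base = area_after(pe_area, mem_area)
prune_gain = base / area_after(pe_area, mem_area, keep_ratio=0.25)
quant_gain = base / area_after(pe_area, mem_area, bit_ratio=0.25)

print(f"pruning only: {prune_gain:.2f}X, quantization only: {quant_gain:.2f}X")
```

When the processing elements dominate the area, as in this hypothetical configuration, pruning
leaves most of the area untouched while quantization shrinks both terms.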
5 Conclusions
We propose EDCompress, an energy-aware model compression method with dataflow. To the best
of our knowledge, this is the first paper to study this problem with knowledge of the dataflow
design in accelerators. Considering the very nature of model compression procedures, we recast
the optimization as a multi-step problem and solve it with reinforcement learning algorithms.
EDCompress can find the optimal dataflow type for specific neural networks, which can guide the
deployment of CNN models on hardware systems. However, deciding which dataflow type to use in the
hardware accelerator depends on many other constraints, such as the expected computation speed,
the thermal design power, and the fabrication budget. Therefore, we leave the final decision to
hardware developers.
References
[1] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. Fused-layer CNN Acceler-
ators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MI-
CRO), pages 1–12. IEEE, 2016.
[2] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct Neural Architecture Search on
Target Task and Hardware. arXiv preprint arXiv:1812.00332, 2018.
[3] Jing Chang and Jin Sha. Prune Deep Neural Networks With the Modified L_{1/2} Penalty.
IEEE Access, 7:2273–2280, 2018.
[4] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier
Temam. Diannao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-
Learning. ACM SIGARCH Computer Architecture News, 42(1):269–284, 2014.
[5] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A Spatial Architecture for Energy-
Efficient Dataflow for Convolutional Neural Networks. ACM SIGARCH Computer Architec-
ture News, 44(3):367–379, 2016.
[6] Caiwen Ding, Shuo Wang, Ning Liu, Kaidi Xu, Yanzhi Wang, and Yun Liang. REQ-YOLO:
A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs. In
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, pages 33–42. ACM, 2019.
[7] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng,
Yunji Chen, and Olivier Temam. ShiDianNao: Shifting Vision Processing Closer to the Sensor.
In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pages
92–104, 2015.
[8] Biyi Fang, Xiao Zeng, and Mi Zhang. NestDNN: Resource-aware Multi-tenant On-device
Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International
Conference on Mobile Computing and Networking, pages 115–127. ACM, 2018.
[9] A. Frickenstein, C. Unger, and W. Stechele. Resource-Aware Optimization of DNNs for Em-
bedded Applications. In 2019 16th Conference on Computer and Robot Vision (CRV), pages
17–24, May 2019.
[10] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable
and Efficient Neural Network Acceleration with 3D Memory. In Proceedings of the Twenty-
Second International Conference on Architectural Support for Programming Languages and
Operating Systems, pages 751–764, 2017.
[11] Xue Geng, Jie Fu, Bin Zhao, Jie Lin, Mohamed M Sabry Aly, Christopher J Pal, and Vijay
Chandrasekhar. Dataflow-Based Joint Quantization for Deep Neural Networks. In DCC, page
574, 2019.
[12] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic Network Surgery for Efficient DNNs.
In Advances in neural information processing systems, pages 1379–1387, 2016.
[13] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-critic: Off-
policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv
preprint arXiv:1801.01290, 2018.
[14] Ghouthi Boukli Hacene, Vincent Gripon, Matthieu Arzel, Nicolas Farrugia, and Yoshua Ben-
gio. Quantized Guided Pruning for Efficient Hardware Implementations of Convolutional Neu-
ral Networks. arXiv preprint arXiv:1812.11337, 2018.
[15] Song Han, Huizi Mao, and William J Dally. Deep Compression: Compressing Deep Neu-
ral Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint
arXiv:1510.00149, 2015.
[16] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for
Model Compression and Acceleration on Mobile Devices. In Proceedings
of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
[17] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias
Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient Convolutional Neural
Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861, 2017.
[18] Norman P Jouppi et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In
Proceedings of the 44th Annual International Symposium on Computer Architecture, pages
1–12, 2017.
[19] Hyeong-Ju Kang. Accelerator-Aware Pruning for Convolutional Neural Networks. IEEE
Transactions on Circuits and Systems for Video Technology, 2019.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based Learning
Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] Carl Lemaire, Andrew Achkar, and Pierre-Marc Jodoin. Structured Pruning of Neural Net-
works with Budget-Aware Regularization. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 9108–9116, 2019.
[22] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for
Efficient Convnets. arXiv preprint arXiv:1608.08710, 2016.
[23] Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, and Lingli Wang. A High Perfor-
mance FPGA-based Accelerator for Large-Scale Convolutional Neural Networks. In 2016
26th International Conference on Field Programmable Logic and Applications (FPL), pages
1–9. IEEE, 2016.
[24] Zhenhua Liu, Jizheng Xu, Xiulian Peng, and Ruiqin Xiong. Frequency-domain Dynamic
Pruning for Convolutional Neural Networks. In Advances in Neural Information Processing
Systems, pages 1043–1053, 2018.
[25] Franco Manessi, Alessandro Rozza, Simone Bianco, Paolo Napoletano, and Raimondo Schet-
tini. Automated Pruning for Deep Neural Network Compression. In 2018 24th International
Conference on Pattern Recognition (ICPR), pages 657–664. IEEE, 2018.
[26] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi
Tang, Ningyi Xu, Sen Song, et al. Going Deeper with Embedded FPGA Platform for Convolu-
tional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, pages 26–35, 2016.
[27] Yongming Shen, Michael Ferdman, and Peter Milder. Overcoming Resource Underutilization
in Spatial CNN Accelerators. In 2016 26th International Conference on Field Programmable
Logic and Applications (FPL), pages 1–4. IEEE, 2016.
[28] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-scale
Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P Namboodiri. Play and Prune:
Adaptive Filter Pruning for Deep Model Compression. arXiv preprint arXiv:1905.04446, 2019.
[30] Mingcong Song, Jiaqi Zhang, Huixiang Chen, and Tao Li. Towards Efficient Microarchitectural
Design for Accelerating Unsupervised GAN-based Deep Learning. In 2018 IEEE International
Symposium on High Performance Computer Architecture (HPCA), pages 66–77. IEEE,
2018.
[31] Naveen Suda et al. Throughput-optimized OpenCL-based FPGA Accelerator for Large-Scale
Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Sym-
posium on Field-Programmable Gate Arrays, pages 16–25, 2016.
[32] F. Tung and G. Mori. CLIP-Q: Deep Network Compression Learning by In-parallel Pruning-
Quantization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 7873–7882, June 2018.
[33] E George Walters. Array Multipliers for High Throughput in Xilinx FPGAs with 6-input LUTs.
Computers, 5(4):20, 2016.
[34] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-Aware Automated
Quantization with Mixed Precision. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 8612–8620, 2019.
[35] Xuefeng Xiao, Lianwen Jin, Yafeng Yang, Weixin Yang, Jun Sun, and Tianhai Chang. Building
Fast and Compact Convolutional Neural Networks for Offline Handwritten Chinese Character
Recognition. Pattern Recognition, 72:72–81, 2017.
[36] Xilinx. Vivado Design Suite User Guide. Technical Publication, 2018.
[37] Haichuan Yang, Yuhao Zhu, and Ji Liu. Energy-Constrained Compression for Deep Neu-
ral Networks via Weighted Sparse Projection and Layer Input Masking. arXiv preprint
arXiv:1806.04321, 2018.
[38] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing Energy-Efficient Convolutional
Neural Networks Using Energy-Aware Pruning. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 5687–5695, 2017.
[39] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne
Sze, and Hartwig Adam. Netadapt: Platform-aware Neural Network Adaptation for Mobile
Applications. In Proceedings of the European Conference on Computer Vision (ECCV), pages
285–300, 2018.
[40] Xuan Yang, Mingyu Gao, Jing Pu, Ankita Nayak, Qiaoyi Liu, Steven Emberton Bell, Jeff Ou
Setter, Kaidi Cao, Heonjae Ha, Christos Kozyrakis, et al. DNN Dataflow Choice Is Overrated.
arXiv preprint arXiv:1809.04070, 2018.
[41] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing
FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of
the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages
161–170, 2015.
[42] Michael Zhu and Suyog Gupta. To Prune, or Not to Prune: Exploring the Efficacy of Pruning
for Model Compression. arXiv preprint arXiv:1710.01878, 2017.
