Loom: Exploiting Weight and Activation Precisions to Accelerate
  Convolutional Neural Networks by Sharify, Sayeh et al.
arXiv:1706.07853v2 [cs.DC] 16 May 2018
Loom: Exploiting Weight and Activation Precisions to Accelerate
Convolutional Neural Networks
Sayeh Sharify, Alberto Delmas Lascorz, Kevin Siu, Patrick Judd, Andreas Moshovos
University of Toronto
{sayeh,delmasl1,siukevi4,juddpatr,moshovos}@ece.utoronto.ca
ABSTRACT
Loom (LM), a hardware inference accelerator for Convolutional
Neural Networks (CNNs) is presented. In LM every bit of data pre-
cision that can be saved translates to proportional performance
gains. Specifically, for convolutional layers LM’s execution time
scales inversely proportionally with the precisions of both weights
and activations. For fully-connected layers LM’s performance scales
inversely proportionally with the precision of the weights. LM tar-
gets area- and bandwidth-constrained System-on-a-Chip designs
such as those found on mobile devices that cannot afford the multi-
megabyte buffers that would be needed to store each layer on-chip.
Accordingly, given a data bandwidth budget, LM boosts energy
efficiency and performance over an equivalent bit-parallel accel-
erator. For both weights and activations LM can exploit profile-
derived per layer precisions. However, at runtime LM further trims
activation precisions at a granularity much finer than a layer.
Moreover, it can naturally exploit weight precision variability at
a smaller granularity than a layer. On average, across several im-
age classification CNNs and for a configuration that can perform
the equivalent of 128 16b × 16b multiply-accumulate operations
per cycle LM outperforms a state-of-the-art bit-parallel accelera-
tor [1] by 4.38× without any loss in accuracy while being 3.54×
more energy efficient. LM can trade-off accuracy for additional im-
provements in execution performance and energy efficiency and
compares favorably to an accelerator that targeted only activation
precisions. We also study 2- and 4-bit LM variants and find that the
2-bit per cycle variant is the most energy efficient.
1 INTRODUCTION
Deep neural networks (DNNs) have become the state-of-the-art
technique in many recognition tasks such as object [2] and speech
recognition [3]. Given their many applications and high computa-
tion and memory demands, DNNs are prime candidates for hard-
ware acceleration. While a few different types of DNNs exist, Con-
volutional Neural Networks (CNNs) in particular dominate applica-
tions where the input is an image or video. Devices executing such
CNNs will be required to perform mostly if not only inference. An
example is computational photography where machine learning
has shown great promise in replacing classical algorithms [4].
We present Loom (LM), a hardware accelerator for inference
with CNNs targeting embedded systems where reducing the amount
of data transferred per memory connection, be it an external or internal
one, is paramount. Specifically, given a memory bandwidth
budget LM’s goal is to boost performance and energy efficiency
compared to a state-of-the-art data-parallel accelerator. LM ex-
ploits the precision requirement variability of CNNs to reduce the
memory footprint, increase bandwidth utilization, and to deliver
performance which scales inversely proportionally with precision
for both convolutional (CVLs) and fully-connected (FCLs) layers.
Ideally, compared to using a fixed precision of 16 bits, LM achieves
a speedup of $256/(P_a \times P_w)$ for CVLs and $16/P_w$ for FCLs, where
$P_w$ and $P_a$ are the precisions of weights and activations, respectively.
LM also reduces the number of weight and activation bits read by
$(16 - P_w)/16$ and $(16 - P_a)/16$, respectively. To deliver these benefits
LM processes both activations
and weights bit-serially while compensating for the loss in com-
putation bandwidth by exploiting parallelism. Judicious reuse of
activations and weights enables LM to improve performance and
energy efficiency over conventional bit-parallel designs without
requiring a wider memory interface. For both weights and activa-
tions LM utilizes profile-derived per layer precisions. For activa-
tions, LM further trims their precision at a much finer granularity
at runtime utilizing the approach of Lascorz et al. [5]. By exploit-
ing precision LM delivers benefits for all activations and weights
regardless of whether they are ineffectual or not.
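The ideal scaling stated above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original evaluation; the function name `ideal_speedups` and the example precisions (8-bit activations, 11-bit weights) are illustrative assumptions.

```python
def ideal_speedups(p_a: int, p_w: int, p_base: int = 16) -> dict:
    """Ideal LM gains versus a p_base-bit fixed-precision design (hypothetical helper)."""
    if not (1 <= p_a <= p_base and 1 <= p_w <= p_base):
        raise ValueError("precisions must lie in [1, p_base]")
    return {
        # Convolutional layers scale with both precisions: 256 / (Pa x Pw).
        "cvl_speedup": p_base * p_base / (p_a * p_w),
        # Fully-connected layers scale with the weight precision only: 16 / Pw.
        "fcl_speedup": p_base / p_w,
        # Fraction of weight and activation bits that no longer need to be read.
        "weight_bits_saved": (p_base - p_w) / p_base,
        "activation_bits_saved": (p_base - p_a) / p_base,
    }


# Illustrative precisions only; real values are profile-derived per layer.
example = ideal_speedups(p_a=8, p_w=11)
print(example)
```

These are upper bounds under the stated formulas; measured speedups also depend on utilization and memory behavior.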
We evaluate LM on an SoC and compare against a bit-parallel
fixed-precision accelerator (DPNN ) over a set of image classifica-
tion CNNs. For a configuration that is sized to match the peak com-
putation bandwidth of a bit-parallel accelerator that can perform
at peak 128 16b×16b multiply-accumulate operations per cycle, on
average LM yields a speedup of 3.25×, 1.74×, and 3.19× over DPNN
for the convolutional, fully-connected, and all layers, respectively.
The energy efficiency of LM over DPNN is 2.63×, 1.41× and 2.59×
for the aforementioned layers, respectively. LM enables trading off
accuracy for additional improvements in performance and energy
efficiency. For example, accepting a 1% relative loss in accuracy,
LM yields 3.57× higher performance and 2.87× more energy efficiency
than DPNN. We also perform a sensitivity study varying
the equivalent peak compute bandwidth and the number of bits
that LM processes per cycle. We find that LM scales well up to a
configuration equivalent to 256 16b × 16b multiply-accumulate operations
per cycle and that a 2-bit per cycle design achieves the best energy
efficiency albeit not the best performance.
The rest of this document is organized as follows: Section 2 il-
lustrates the key concepts behind LM via an example. Section 3
presents the DPNN and Loom architectures. The evaluation methodology
and experimental results are presented in Section 4. Section 5
reviews related work, and Section 6 concludes.
2 LOOM: A SIMPLIFIED EXAMPLE
This section explains how LM would process CVLs and FCLs on
an example using 2-bit activations and weights.
Conventional Bit-Parallel Processing: Figure 1a shows a bit-
parallel processing engine which multiplies two input activations
with two weights generating a single 2-bit output activation per
cycle. The engine can process two new 2-bit weights and/or activations
per cycle, for a throughput of two 2b × 2b products per cycle.
Loom’s Approach: Figure 1b shows an equivalent LM engine
which matches the bit-parallel engine’s throughput by producing 8
1b×1b products every cycle. The engine comprises a 2×2 array of
bit-serial subunits (4 in total). Each subunit accepts 2 bits of input
activations and 2 bits of weights per cycle and performs 2 1b × 1b
products. The subunits along the same column share the activation
inputs while the subunits along the same row share their weight
inputs. In total, this engine accepts 4 activation and 4 weight bits
equaling the input bandwidth of the bit-parallel engine. Each subunit
has two 1-bit Weight Registers (WRs) and one 2-bit Output Register
(OR) for accumulating its products.
Figure 1b through Figure 1f show how LM would process an
FCL. As Figure 1b shows, in cycle 1, the left column subunits receive
the least significant bits (LSBs) $a_{0/0}$ and $a_{1/0}$ of activations
$a_0$ and $a_1$, and $w^0_{0/0}$, $w^0_{1/0}$, $w^1_{0/0}$, and $w^1_{1/0}$,
the LSBs of four weights from filters 0 and 1. Each of these two
subunits calculates two 1b × 1b products (the product and accumulation
would take place in the subsequent cycle adding one more pipeline
stage, a detail the example omits for clarity) and stores their sum
into its OR. In Figure 1c and cycle 2, the left column subunits now
multiply the same weight bits with the most significant bits (MSBs)
$a_{0/1}$ and $a_{1/1}$ of activations $a_0$ and $a_1$, respectively, and
accumulate these into their ORs. In parallel, the two right column
subunits load $a_{0/0}$ and $a_{1/0}$, the LSBs of the input activations
$a_0$ and $a_1$, and multiply them by the LSBs of weights $w^2_{0/0}$,
$w^2_{1/0}$, $w^3_{0/0}$, and $w^3_{1/0}$ from filters 2 and 3. In cycle 3,
the left column subunits now load and multiply the LSBs $a_{0/0}$ and
$a_{1/0}$ with the MSBs $w^0_{0/1}$, $w^0_{1/1}$, $w^1_{0/1}$, and $w^1_{1/1}$
of the four weights from filters 0 and 1. In parallel, the right
subunits reuse their WR-held weights $w^2_{0/0}$, $w^2_{1/0}$, $w^3_{0/0}$,
and $w^3_{1/0}$ and multiply them by the most significant bits $a_{0/1}$
and $a_{1/1}$ of activations $a_0$ and $a_1$ (Figure 1d). In cycle 4 and
Figure 1e, the left column subunits multiply their WR-held weights
and $a_{0/1}$ and $a_{1/1}$, the MSBs of activations $a_0$ and $a_1$, and
finish the calculation of output activations $o_0$ and $o_1$.
Concurrently, the right column subunits load $w^2_{0/1}$, $w^2_{1/1}$,
$w^3_{0/1}$, and $w^3_{1/1}$, the MSBs of the weights from filters 2 and 3,
and multiply them with $a_{0/0}$ and $a_{1/0}$. In cycle 5 and Figure 1f,
the right subunits complete the multiplication of their WR-held
weights and $a_{0/1}$ and $a_{1/1}$, the MSBs of the two activations. By the
end of this cycle, output activations $o_2$ and $o_3$ are ready as well.
In total it took 4+1 cycles to process 32 1b × 1b products (4, 8,
8, 8, 4 products in cycles 1 through 5, respectively). Notice that
at the end of the 5th cycle, the left column subunits are idle, thus
the WRs could have loaded another set of weights commencing
the computation of a new set of outputs. In the steady state, with
2b input activations and weights, this engine will be producing 8
1b×1b terms every cycle thus matching the 2 2b×2b throughput of
the parallel engine. If the weights could be represented using only
one bit, LM would be producing two output activations per cycle,
twice the bandwidth of the bit-parallel engine.
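The walkthrough above relies on the fact that a multi-bit inner product can be rebuilt from 1b × 1b products plus shifts. The sketch below is a functional model of a single subunit under simplifying assumptions (unsigned values, no pipelining, one activation bit plane per cycle); the function name `bit_serial_dot` and the sample values are hypothetical.

```python
def bit_serial_dot(acts, weights, p_a, p_w):
    """Return (inner product, cycles) using only 1b x 1b products (unsigned sketch)."""
    total = 0
    cycles = 0
    for wb in range(p_w):        # one weight bit plane is held in the WRs
        for ab in range(p_a):    # one activation bit plane arrives per cycle
            # AND each activation bit with its weight bit, then sum the results.
            partial = sum(((a >> ab) & 1) & ((w >> wb) & 1)
                          for a, w in zip(acts, weights))
            # Shift the partial sum to the bit weight of this product.
            total += partial << (ab + wb)
            cycles += 1
    return total, cycles


acts = [3, 2]       # two illustrative 2-bit activations (a0, a1)
weights = [1, 3]    # two illustrative 2-bit weights of one filter
result, cycles = bit_serial_dot(acts, weights, p_a=2, p_w=2)
print(result, cycles)
```

With 2-bit operands one subunit needs 2 × 2 = 4 cycles per output, which is why the engine needs several subunits working in parallel to match the bit-parallel throughput.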
In general, if the bit-parallel hardware was using $P_{base}$ bits to
represent the weights while only $P_w$ bits were actually required,
for the FCLs the LM engine would outperform the bit-parallel engine
by $P_{base}/P_w$. LM would use an array of $P_{base} \times k$ units, where
k is the number of $P_{base} \times P_{base}$ products DPNN processes per cycle.
Each subunit would produce k 1b × 1b products. Since there is no
weight reuse in FCLs, 16 cycles are required to load a different set
of weights to each of the 16 columns. Thus having activations that
use fewer than 16 bits would not improve performance (but could
improve energy efficiency).
Convolutional Layers: LM processes CVLs similarly to FCLs but
uses weight reuse across different windows to exploit a reduction
in precision for both weights and activations. Specifically, in
CVLs the subunits across the same row share the same weight
bits which they load in parallel into their WRs in a single cycle.
These weight bits are multiplied by the corresponding activation
bits over $P_a$ cycles. Another set of weight bits needs to be loaded
every $P_a$ cycles, where $P_a$ is the input activation precision. Here
LM exploits weight reuse across multiple windows by having each
subunit column process a different set of activations. Assuming
that the bit-parallel engine uses P bits to represent both input
activations and weights, LM will outperform the bit-parallel engine by
$P^2/(P_w \times P_a)$, where $P_w$ and $P_a$ are the weight and activation
precisions LM uses, respectively.
3 LOOM ARCHITECTURE
This section describes the baseline fixed precision bit-parallel ac-
celerator and the Loom architecture.
3.1 Data Supply and Baseline System
Our baseline design (DPNN), shown in Figure 2a, is an appropriately
configured data-parallel engine inspired by the DaDianNao
accelerator [1], the de facto standard used for comparison in most
accelerator studies. DPNN uses 16-bit fixed-point activations and
weights. DPNN comprises k inner product units (IP) each process-
ing a different filter. Every cycle DPNN accepts as input N activa-
tions and N corresponding weights per filter out of k filters. In the
configuration shown N = 16 and k = 8. The N activations are
broadcast to all IP units. Each IP unit multiplies each of the N ac-
tivations with one out of its N weights, reduces the resulting N
32b products with an adder tree, and accumulates the result into
an output register. In total, every cycle, DPNN calculates N × k
products producing k partial output activations.
An Activation Memory (AM) and a Weight Memory (WM) sup-
ply respectively the activations and the weights. An input activa-
tion buffer (ABin) buffers the input activations while an output ac-
tivation buffer (ABout) temporarily buffers the output activations.
For clarity, in our description we assume a single tile that processes
up to 128 weights (8 filters) and 16 activations per cycle.
3.2 Loom
For LM to match our DPNN configuration it needs to process 128
filters concurrently and 16 weight bits per filter per cycle, for a
total of 128 × 16 = 2048 weight bits per cycle. Alternatively, LM could
process 32 filters over 64 windows; however, we leave this investigation
for future work. LM also accepts 256 1-bit input activations
each of which it multiplies with 128 1-bit weights thus matching
the computation bandwidth of the baseline in the worst case where both
activations and weights need 16 bits. Figure 2b shows the Loom design.
It comprises 2K Serial Inner-Product Units (SIPs) organized in
a 128 × 16 grid. Every cycle, each SIP multiplies 16 1b input activations
with 16 1b weights and reduces these products into a partial
output activation. The SIPs along the same row share a common
16b weight bus, and the SIPs along the same column share a common
16b activation bus. Accordingly, as in DPNN, the SIP array
is fed by a 2Kb weight bus and a 256b activation input bus. Similar
to DPNN, LM has an ABout and an ABin. LM processes both
activations and weights bit-serially.

Figure 1: Processing an example Fully-Connected Layer using LM’s Approach.
(a) Bit-Parallel Engine processing 2b × 2b layer over two cycles.
(b) Cycle 1: Load LSB of weights from filters 0 and 1 into the left WRs.
(c) Cycle 2: Load LSB of weights from filters 2 and 3 into the right WRs.
(d) Cycle 3: Load MSB of weights from filters 0 and 1 into the left WRs.
(e) Cycle 4: Load MSB of weights from filters 2 and 3 into the right WRs.
(f) Cycle 5: Multiply MSB of weights from filters 2 and 3 with MSB of a0 and a1.

Figure 2: The two CNN accelerators. (a) Baseline design. (b) Loom.
Reducing Memory Footprint and Bandwidth: Since both
weights and activations are processed bit-serially, LM can store
weights and activations in a bit-interleaved fashion and using only
as many bits as necessary thus boosting the effective bandwidth
and storage capacity of the weight memory and the AM. For ex-
ample, given 2K 13b weights to be processed in parallel, LM would
pack first their bit 0 onto continuous rows, then their bit 1, and so
on up to bit 12. DPNN would store them using 16 bits instead.
A transposer can rotate the output activations prior to writing
them to AM from ABout. Since each output activation entails inner
products with tens to hundreds of inputs, the transposer demand
will be low.
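The bit-interleaved layout described above can be sketched in software as a set of bit planes, where plane b holds bit b of every value. This is only an illustration of the storage idea, not the memory organization of the actual design; `pack_bit_planes`, `unpack_bit_planes`, and the sample weights are assumptions made for the example.

```python
def pack_bit_planes(values, precision):
    """Pack unsigned ints into `precision` planes; plane b holds bit b of every value."""
    planes = []
    for b in range(precision):
        plane = 0
        for i, v in enumerate(values):
            plane |= ((v >> b) & 1) << i
        planes.append(plane)
    return planes


def unpack_bit_planes(planes, count):
    """Inverse of pack_bit_planes for `count` values."""
    values = [0] * count
    for b, plane in enumerate(planes):
        for i in range(count):
            values[i] |= ((plane >> i) & 1) << b
    return values


weights = [5, 4095, 8191, 0]                  # illustrative 13-bit weights
planes = pack_bit_planes(weights, precision=13)
stored_bits = len(planes) * len(weights)      # 13 bits per weight
baseline_bits = 16 * len(weights)             # 16 bits per weight, as in DPNN
print(stored_bits, baseline_bits)
```

Only 13 of the 16 planes are stored here, which is the source of the footprint and bandwidth savings the text describes.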
Convolutional Layers: Processing starts by reading in parallel
2K weight bits from memory, loading 16 bits to all WRs per SIP
row. The loaded weights will be multiplied by 16 corresponding
activation bits per SIP column bit-serially over $P_a^L$ cycles where
$P_a^L$ is the activation precision for this layer L. Then, the second bit
of weights will be loaded into WRs and multiplied with another set
of 16 activation bits per SIP row, and so on. In total, the bit-serial
multiplication will take $P_a^L \times P_w^L$ cycles, where $P_w^L$ is the weight
precision for this layer L. Whereas DPNN would process 16 sets of 16
activations and 128 filters over 256 cycles, LM processes them
concurrently but bit-serially over $P_a^L \times P_w^L$ cycles. If $P_a^L$ and/or
$P_w^L$ are less than 16, LM will outperform DPNN by $256/(P_a^L \times P_w^L)$.
Otherwise, LM will match DPNN’s performance.

Table 1: Activation and weight (W) precision profiles in bits
for the convolutional and fully-connected layers.

Convolutional Layers
          100% Accuracy                                          99% Accuracy
Network   Act. / Per Layer                                  W    Act. / Per Layer                               W
NiN       8-8-8-9-7-8-8-9-9-8-8-8                           11   8-8-7-9-7-8-8-9-9-8-7-8                        10
AlexNet   9-8-5-5-7                                         11   9-7-4-5-7                                      11
Google    10-8-10-9-8-10-9-8-9-10-7                         11   10-8-9-8-8-9-10-8-9-10-8                       10
VGGS      7-8-9-7-9                                         12   7-8-9-7-9                                      11
VGGM      7-7-7-8-7                                         12   6-8-7-7-7                                      12
VGG19     12-12-12-11-12-10-11-11-13-12-13-13-13-13-13-13   12   9-9-9-8-12-10-10-12-13-11-12-13-13-13-13-13    12

Fully-Connected Layers
          100% Accuracy          99% Accuracy
Network   Weights / Per Layer    Weights / Per Layer
NiN       N/A                    N/A
AlexNet   10-9-9                 9-8-8
Google    7                      7
VGGS      10-9-9                 9-9-8
VGGM      10-8-8                 9-8-8
VGG19     10-9-9                 10-9-8
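As a worked illustration of the convolutional-layer bound, the sketch below plugs the AlexNet 100% accuracy profile from Table 1 (activations 9-8-5-5-7, weights 11) into the cycle counts given above. The helper names are hypothetical, and the per layer figures are ideal bounds that are not weighted by how much work each layer contains, so they are not directly comparable to the measured speedups reported later.

```python
DPNN_CYCLES = 256  # cycles DPNN needs for 16 sets of 16 activations and 128 filters


def lm_cvl_cycles(p_a, p_w):
    """Cycles LM needs for the same work: one cycle per (activation bit, weight bit) pair."""
    return min(p_a, 16) * min(p_w, 16)


def cvl_ideal_speedup(p_a, p_w):
    return DPNN_CYCLES / lm_cvl_cycles(p_a, p_w)


alexnet_act = [9, 8, 5, 5, 7]   # per layer activation precisions (Table 1, 100% accuracy)
alexnet_w = 11                  # common weight precision for the CVLs (Table 1)
speedups = [cvl_ideal_speedup(p_a, alexnet_w) for p_a in alexnet_act]
for layer, s in enumerate(speedups, start=1):
    print(f"conv{layer}: {s:.2f}x")
```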
Fully-Connected Layers: Processing starts by loading the LSBs
of a set of weights into theWR registers of the first SIP column and
multiplying the loaded weights by the LSBs of the corresponding
activations. In the second cycle, while the first column of SIPs is
still busy with multiplying the LSBs of its WRs by the second bit
of the activations, the LSBs of a new set of weights can be loaded
into the WRs of the second SIP column. Each weight bit is reused
for 16 cycles multiplying with bits 0 through bit 15 of the input
activations. Thus, there is enough time for LM to keep any single
column of SIPs busy while loading new sets of weights to the other
15 columns. For example, as shown in Figure 2b LM can load a
single bit of 2K weights to SIP(0,0)..SIP(0,127) in cycle 0, then load
a single-bit of the next 2K weights to SIP(1,0)..SIP(1,127) in cycle 1,
and so on. After the first 15 cycles, all SIPs are fully utilized. It will
take $P_w^L \times 16$ cycles for LM to process 16 sets of 16 activations and
128 filters while DPNN processes them in 256 cycles. Thus, when
$P_w^L$ is less than 16, LM will outperform DPNN by $16/P_w^L$ and it will
match DPNN’s performance otherwise.
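The staggered weight loading for fully-connected layers can be pictured as a round-robin schedule over the 16 SIP columns. The following sketch assumes exactly that schedule and the steady-state cycle count from the text; the function names are hypothetical, pipeline fill is ignored, and the AlexNet weight precisions come from Table 1.

```python
COLUMNS = 16    # SIP columns in the 128 x 16 grid
ACT_BITS = 16   # each weight bit is reused over 16 activation bits


def loading_column(cycle):
    """Column that receives a new weight bit plane in this cycle (assumed round-robin)."""
    return cycle % COLUMNS


def fcl_cycles(p_w):
    """Steady-state cycles for 16 sets of 16 activations and 128 filters."""
    return min(p_w, 16) * ACT_BITS


def fcl_speedup(p_w, dpnn_cycles=256):
    return dpnn_cycles / fcl_cycles(p_w)


# In any 16-cycle window every column is reloaded exactly once.
window = [loading_column(c) for c in range(16, 32)]
alexnet_fcl = [fcl_speedup(p) for p in (10, 9, 9)]   # Table 1, 100% accuracy
print(alexnet_fcl)
```

Because each column is busy for 16 cycles per weight bit, one weight load per cycle is enough to keep all columns occupied in the steady state.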
SIP: Bit-Serial Inner-Product Units: Figure 3 shows LM’s Bit-
Serial Inner-Product Unit (SIP). Every clock cycle, each SIP multi-
plies 16 single-bit activations by 16 single-bit weights to produce
a partial output activation. Internally, each SIP has 16 1-bit Weight
Registers (WRs), 16 2-input AND gates to multiply the weights in
the WRs with the incoming input activation bits, and a 16-input
1b adder tree that sums these partial products. AC1 accumulates
and shifts the output of the adder tree over $P_a^L$ cycles. Every $P_a^L$
cycles, AC2 shifts the output of AC1 and accumulates it into the
OR. After $P_a^L \times P_w^L$ cycles the Output Register (OR) contains the
inner-product of an activation and weight set. In each SIP, a multiplexer
after AC1 implements cascading. To support signed 2’s complement
activations, a negation block is used to subtract the sum
of the input activations corresponding to the most significant bit
(MSB) of the weights from the partial sum when the MSB is 1. Each SIP
also includes a comparator (max) to support max pooling layers.
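A functional sketch of the two-level accumulation may help here. It assumes unsigned activations and two's complement weights, models AC1 as the accumulator over activation bits and AC2 as the accumulator over weight bits, and applies the negation to the weight sign-bit plane. It is a behavioral model under these assumptions, not the circuit, and `sip_signed_weights` is a hypothetical name.

```python
def sip_signed_weights(acts, weights, p_a, p_w):
    """Inner product of unsigned activations and two's complement weights, bit-serially."""
    mask = (1 << p_w) - 1
    out_reg = 0                                  # OR, updated through AC2
    for wb in range(p_w):
        ac1 = 0                                  # AC1: accumulates over activation bits
        for ab in range(p_a):
            # 16-input AND + adder tree in hardware; any length works in this sketch.
            tree = sum(((a >> ab) & 1) & (((w & mask) >> wb) & 1)
                       for a, w in zip(acts, weights))
            ac1 += tree << ab
        if wb == p_w - 1:
            ac1 = -ac1                           # sign bit plane carries negative weight
        out_reg += ac1 << wb                     # AC2: shift and accumulate into the OR
    return out_reg


acts = [3, 1, 2]            # illustrative 2-bit unsigned activations
weights = [-2, 3, -4]       # illustrative 3-bit two's complement weights
signed_result = sip_signed_weights(acts, weights, p_a=2, p_w=3)
print(signed_result)
```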
Dynamic Precision Reduction: So far we assumed that software
provided profile-derived per layer activation and weight precisions [6].
Lascorz et al. observed that the hardware can further
shorten these precisions by inspecting the actual values at runtime [5].
LM adjusts precision per group of 256 activations
that it processes concurrently. Per bit position OR trees
produce a 16-bit vector indicating the positions where any of the
activations has a 1. A leading one detector identifies the most sig-
nificant position and thus the precision in bits that is sufficient.
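In software terms, the OR trees followed by a leading one detector amount to OR-ing all values in a group and taking the bit length of the result. The sketch below assumes non-negative activations and a minimum precision of one bit; the group shown is much smaller than the 256 activations the text mentions, and `group_precision` is a hypothetical helper.

```python
from functools import reduce
from operator import or_


def group_precision(activations):
    """Bits sufficient for every value in a group of non-negative activations."""
    ored = reduce(or_, activations, 0)   # per bit position OR across the group
    return max(1, ored.bit_length())     # position of the leading one, plus one


group = [0, 3, 17, 96, 5]                # illustrative activation values
needed_bits = group_precision(group)
print(needed_bits)
```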
Processing Layers with Few Outputs: For LM to keep all the
SIPs busy an output activation must be assigned to each SIP. This
is possible as long as the layer has at least 2K outputs. However,
in the networks studied some FCLs have only 1K output activations.
To avoid underutilization, LM implements SIP cascading,
in which SIPs along each row can form a daisy-chain, where the
output of one can feed into an input of the next via a multiplexer.
This way, the computation of an output activation can be sliced
along the bit dimension over the SIPs in the same row. In this case,
each SIP processes only a portion of the input activations resulting
into several partial output activations along the SIPs on the same
row. Over the next Sn cycles, where Sn is the number of bit slices
used, the Sn partial outputs can be reduced into the final output
activation.
Other Layers: Similar to DaDianNao (DaDN) [1], LM processes the additional
layers needed by the studied networks. To do so, LM incorporates
units for MAX pooling as in DaDN. Moreover, to apply nonlinear
activations, an activation functional unit is present at the output of
the ABout. Given that each output activation typically takes sev-
eral cycles to compute, it is not necessary to use more such func-
tional units compared to DPNN .
Total computational bandwidth: In the worst case, with 16b
activations and weights, a single 16b×16b product that would have
taken DPNN one cycle to produce, now takes LM 256 cycles. Since
DPNN calculates 128 products per cycle, LM needs to calculate the
equivalent of 256 × 128 16b × 16b products every 256 cycles. LM
has 128 × 16 = 2048 SIPs each producing 16 1b × 1b products per
cycle. Thus, over 256 cycles, LM produces 2048 × 16 × 256 1b × 1b
products matching DPNN ’s compute bandwidth.
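The bandwidth equivalence above is plain arithmetic and can be verified directly. The constant names below are illustrative; the numbers are the ones given in the text.

```python
SIPS = 128 * 16            # SIP grid size
PRODUCTS_PER_SIP = 16      # 1b x 1b products per SIP per cycle
CYCLES = 256               # worst case: 16b activations x 16b weights

loom_1b_products = SIPS * PRODUCTS_PER_SIP * CYCLES

dpnn_16b_products = 128 * CYCLES               # 128 16b x 16b products per cycle
dpnn_as_1b_products = dpnn_16b_products * 16 * 16  # each one equals 256 1b x 1b products

print(loom_1b_products, dpnn_as_1b_products)
```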
Tuning the Performance, Area and Energy Trade-off: We can
trade off some of the performance benefits to reduce the number
of SIPs and the respective area overhead by processing multiple
activation bits per cycle. The evaluation section considers 2-bit
(LM2b) and 4-bit (LM4b) LM configurations which need 8 and 4
SIP columns and accommodate precisions that are multiples of 2
and 4, respectively. For example, for LM4b reducing $P_a^L$ from 8
to 5 bits produces no performance benefit, whereas for LM1b
it would improve performance by 1.6×.

Figure 3: LM’s SIP.
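The effect of processing several activation bits per cycle can be sketched by rounding the activation precision up to a multiple of the bits consumed per cycle. This is a simplified model that looks only at the activation dimension; the function names are hypothetical.

```python
import math


def activation_cycles(p_a, bits_per_cycle):
    """Cycles spent on the activation bits when `bits_per_cycle` bits are consumed at once."""
    return math.ceil(p_a / bits_per_cycle)


def speedup_from_trimming(p_old, p_new, bits_per_cycle):
    """Gain from reducing the activation precision from p_old to p_new bits."""
    return activation_cycles(p_old, bits_per_cycle) / activation_cycles(p_new, bits_per_cycle)


lm4b_gain = speedup_from_trimming(8, 5, bits_per_cycle=4)   # 2 cycles either way
lm1b_gain = speedup_from_trimming(8, 5, bits_per_cycle=1)   # 8 cycles versus 5
print(lm4b_gain, lm1b_gain)
```

The same model suggests that a 2-bit design sits in between: going from 8 to 5 bits takes it from 4 cycles to 3.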
4 EVALUATION
This section evaluates Loom performance, energy and area and ex-
plores the trade-off between accuracy and performance comparing
to DPNN and Stripes [7].
4.1 Methodology
Execution time is modeled via a custom cycle-accurate simulator
and energy and area measurements are collected over layouts of all
designs. The designs were synthesized for worst case, typical case,
and best case corners with the Synopsys Design Compiler using
a TSMC 65nm library. Layouts were produced with Cadence In-
novus using the typical corner case synthesis results which were
more pessimistic for LM than the worst case scenario. Power re-
sults are based on the actual data-driven activity factors. The clock
frequency of all designs is set to 1GHz. The ABin and ABout SRAM
buffers were modeled with CACTI [8] and AM and WM were modeled
as eDRAM with Destiny [9]. We first evaluate LM assuming
that all the activations fit on chip and the weights can be read from
off-chip memory without any bandwidth constraint to explore the
design space without being affected by the choice of a particular
off-chip memory. We conclude by investigating performance with
a single-channel of low-power DDR4-4267.
4.2 Weight and Activation Precisions:
Table 1 reports the profile-derived per layer precisions of input
activations and network precisions of weights for the CVLs and
FCLs using the method of Judd et al. [6]. Since LM’s performance
for the CVLs depends on both $P_a^L$ and $P_w^L$, we adjust them
independently. We use per layer activation precisions and a common
weight precision across all CVLs. We found little inter-layer
variability for weight precisions but additional per layer exploration
is warranted. Since LM’s performance for FCLs depends
only on $P_w^L$ we only adjust weight precision for FCLs. The
precisions that guarantee no top-1 accuracy loss for CVLs input
activations vary from 5 to 13 bits and for weights vary from 10
to 12. When a 99% relative top-1 accuracy is still acceptable, the
activation and weight precision can be as low as 4 and 10 bits, re-
spectively. The per layer weight precisions for the FCLs vary from
7 to 10 bits.
4.3 Performance and Energy Efficiency
Figures 4a and 4b show respectively the performance and energy
efficiency of Loom, Stripes, and DStripes configurations relative to
DPNN with the 100% accuracy precision profiles of Table 1 and for all layers
combined. The Stripes configuration is based on Stripes [7], which exploits
only profile-derived per layer activation precisions and only for CVLs.
DStripes incorporates dynamic precision reduction [5].
On average, LM1b outperforms DPNN by more than 3× while being
more than 2.5× more energy efficient. When LM processes multiple
bits per cycle the performance benefits are lower but energy efficiency
improves up to 2.9×. LM1b consistently outperforms Stripes
and DStripes in performance and Stripes in energy efficiency. LM1b
is more energy efficient than DStripes except for GoogLeNet where
its energy efficiency is within 2% of DStripes.
Table 2 reports per network performance and energy efficiency
for LM configurations relative to DPNN for the FCLs and CVLs
separately, and for the 100% and 99% accuracy profiles. In general,
LM1b outperforms LM2b and LM4b in most cases with the latter
two being more energy efficient. On occasion the latter two out-
perform LM1b under the 100% accuracy profiles in FCLs. Since for
LM the performance improvement in FCLs is only due to the use
of lower weight precisions, processing multiple activation bits per
cycle does not affect performance in the steady state. However, processing
more activation bits per cycle reduces the initiation interval
per layer, an effect that becomes noticeable for small FCLs.
The table reports detailed results for Stripes. For FCLs, Stripes
performance and energy efficiency suffer as it does not exploit
weight precisions. With the 99% accuracy profiles, both perfor-
mance and energy efficiency improve considerably for FCLs and
CVLs. Performance with DStripes would be identical to Stripes
for the FCLs. We do not present detailed results for DStripes
due to space limitations noting that LM consistently outperforms
DStripes while being more energy efficient except for the CVLs for
GoogLeNet where the difference in energy efficiency is small.
4.4 Area Overhead
Post layout measurements were used to measure the area of DPNN
and Loom. The LM1b configuration requires 1.34× more area over
DPNN while achieving on average a 3.19× speedup. The LM2b and
LM4b reduce the area overhead to 1.25× and 1.16× while still im-
proving the execution time by 3.05× and 2.74×, respectively. Thus
LM exhibits better performance vs. area scaling than DPNN .
Table 2: Relative execution time speedup (Perf) and energy efficiency (Eff) with Stripes and LM
for fully-connected and convolutional layers vs. DPNN.

          FULLY-CONNECTED LAYERS                            |  CONVOLUTIONAL LAYERS
Network   Stripes     Loom 1-bit  Loom 2-bit  Loom 4-bit    |  Stripes     Loom 1-bit  Loom 2-bit  Loom 4-bit
          Perf  Eff   Perf  Eff   Perf  Eff   Perf  Eff     |  Perf  Eff   Perf  Eff   Perf  Eff   Perf  Eff
100% TOP-1 Accuracy
NiN       n/a   n/a   n/a   n/a   n/a   n/a   n/a   n/a     |  1.76  1.54  2.97  2.40  2.92  2.75  2.91  3.05
AlexNet   1.00  0.88  1.65  1.34  1.66  1.56  1.66  1.74    |  2.34  2.04  4.25  3.43  4.20  3.96  3.66  3.84
Google    0.99  0.87  2.25  1.82  2.27  2.14  2.28  2.39    |  1.76  1.50  2.63  2.12  2.49  2.34  2.12  2.22
VGGS      1.00  0.88  1.63  1.32  1.63  1.54  1.63  1.71    |  1.89  1.65  3.98  3.21  3.78  3.56  3.02  3.17
VGGM      1.00  0.88  1.63  1.32  1.64  1.54  1.64  1.72    |  2.12  1.86  4.12  3.33  3.69  3.47  3.34  3.50
VGG19     1.00  0.88  1.62  1.31  1.63  1.53  1.63  1.71    |  1.34  1.17  2.17  1.76  2.09  1.97  2.03  2.13
Geomean   1.00  0.88  1.74  1.41  1.75  1.65  1.75  1.84    |  1.84  1.61  3.25  2.63  3.10  2.92  2.78  2.92
99% TOP-1 Accuracy
NiN       n/a   n/a   n/a   n/a   n/a   n/a   n/a   n/a     |  2.31  2.02  4.21  3.40  4.09  3.85  3.78  3.96
AlexNet   1.00  0.88  1.85  1.49  1.85  1.74  1.85  1.94    |  2.57  2.25  4.62  3.73  4.49  4.23  4.36  4.57
Google    0.99  0.87  2.25  1.82  2.27  2.14  2.28  2.39    |  1.80  1.58  2.91  2.35  2.74  2.58  2.30  2.42
VGGS      1.00  0.88  1.78  1.44  1.78  1.68  1.79  1.87    |  1.89  1.65  3.98  3.21  3.78  3.56  3.15  3.30
VGGM      1.00  0.88  1.79  1.45  1.80  1.69  1.80  1.89    |  2.12  1.86  4.49  3.63  4.03  3.79  3.64  3.82
VGG19     1.00  0.88  1.63  1.32  1.63  1.54  1.63  1.71    |  1.45  1.27  2.28  1.84  2.21  2.08  2.07  2.17
Geomean   1.00  0.88  1.85  1.49  1.85  1.75  1.86  1.95    |  1.99  1.74  3.63  2.93  3.45  3.25  3.11  3.26

Figure 4: LM’s performance and energy efficiency relative to DPNN for all layers with 100% accuracy.
(a) LM’s performance relative to DPNN. (b) LM’s energy efficiency relative to DPNN.
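The Geomean rows of Table 2 can be reproduced from the per network entries. The sketch below recomputes the geometric mean of the Loom 1-bit speedups under the 100% accuracy profiles; the variable names are illustrative and the input values are copied from the table.

```python
import math


def geomean(values):
    """Geometric mean of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Loom 1-bit "Perf" column, 100% TOP-1 accuracy (Table 2).
cvl_lm1b_perf = [2.97, 4.25, 2.63, 3.98, 4.12, 2.17]   # NiN ... VGG19
fcl_lm1b_perf = [1.65, 2.25, 1.63, 1.63, 1.62]         # NiN has no FCLs

print(round(geomean(cvl_lm1b_perf), 2), round(geomean(fcl_lm1b_perf), 2))
```

Both results agree with the tabulated Geomean entries (3.25 and 1.74) to within rounding.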
4.5 Scaling
Thus far we assumed that all activations fit on chip and focused
on a single LM configuration. We next consider configurations
with practical on- and off-chip memory hierarchies. Specifically,
we size the activation memory so that most layers can fit on-chip
avoiding off-chip accesses that today require at least two orders
of magnitude more energy, a critical consideration in embedded
systems. Accordingly, DPNN requires 2MB of activation memory
(VGG19 requires 10MB which is impractical for embedded systems
and thus has to spill activations off-chip). Since LM processes both
activations and weights bit-serially, it naturally stores and commu-
nicates values on- and off-chip using the per layer precisions. As a
result, LM requires only 1MB on-chip memory for the activations.
However, since LM processes more filters concurrently compared
to DPNN , it can benefit from a larger weight memory.
Figure 5 shows how average performance over all networks
scales for different configurations where the number of SIPs is cho-
sen to match the peak compute bandwidth (x-axis) of a bit-parallel
accelerator. For example, the "128" configurations can perform the
equivalent of 128 16b×16b multiply-accumulate operations per cycle.
For each configuration Figure 5 reports performance relative
to DPNN and absolute performance as frames per second (fps). The
figure reports results for the convolutional layers only and also for
all layers. This is done because fully-connected layers are off-chip
bound (and thus are affected by our choice of off-chip memory)
whereas the convolutional layers are compute bound. Here we
restrict attention to LM1b.

Table 3: Average Effective Per Layer Weight Precisions [10]
Network   Effective Precision Per Layer
NiN       8.85-10.29-10.21-7.65-9.13-9.04-7.63-8.65-8.62-7.79-7.96-8.18
AlexNet   8.36-7.62-7.62-7.44-7.55
Google    6.19-5.75-6.80-6.28-5.34-6.70-6.31-5.02-5.49-7.89-4.83
VGGS      9.94-6.96-8.53-8.13-8.10
VGGM      9.87-7.55-8.52-8.16-8.14
VGG19     10.98-9.81-9.31-9.09-8.58-8.04-7.89-7.86-7.51-7.20-7.36-7.47-7.61-7.66-7.66-7.63
LM outperforms DPNN for all design points shown and can
achieve real-time processing rates even for the "32" configuration.
The relative performance advantage of LM drops for the larger con-
figurations since LM requires more parallelism and suffers more
from increased underutilization as the number of weight lanes
grows. DStripes’s relative performance over DPNN remains constant
for the range shown. LM outperforms DStripes up to the "128"
configurations. At "256" LM and DStripes perform nearly identi-
cally and at "512" the latter performs better.
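One way to see why underutilization grows with the number of weight lanes is the toy model below. It assumes, purely for illustration, that each lane processes one filter and that a layer's filters are processed in passes of `lanes` filters at a time; the actual mapping in LM may differ, and the 96-filter layer is hypothetical.

```python
import math

def lane_utilization(num_filters: int, lanes: int) -> float:
    """Fraction of lane-slots doing useful work when filters are processed in passes."""
    passes = math.ceil(num_filters / lanes)
    return num_filters / (passes * lanes)

# Hypothetical layer with 96 filters, evaluated for increasingly wide designs.
for lanes in (32, 64, 128, 256):
    print(lanes, "lanes:", lane_utilization(96, lanes))
```

In this model a layer whose filter count is not a multiple of the lane count leaves more lanes idle as the design widens, so the achieved speedup falls short of the peak.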
The figure also reports the weight memory capacity, the relative
(vs. DPNN) area overhead, and the energy efficiency for the various
LM configurations. For the "64" and "32" configurations LM
requires 128KB and 544KB less memory in total than DPNN, respectively.
However, for the "128" and the "256" configurations LM requires more
memory than DPNN. Regardless, the performance benefits exceed
the relative area overhead and thus LM provides a better
performance/area trade-off than DPNN. For the "256" configuration
energy efficiency suffers with LM. However, this measurement
ignores the energy of off-chip traffic which is on average 0.61× less
with LM. Moreover, as CNNs evolve to process higher resolution
images the size of activation memory increases significantly compared
to the filter sizes, which makes the effect of data compression
more important [11]. Thus we expect that for higher resolution images
LM will be ever more appealing.
4.6 Per Group Weight Precisions
Thus far we assumed that LM exploits software provided profile-
derived per layer weight precisions [6]. However, exploiting the
approach of Lascorz et al. [10] LM can further trim the weight pre-
cisions at a finer granularity to boost the performance and energy
efficiency of both FCLs and CVLs. The per group weight precisions
can be detected at runtime similarly to the activation precisions, or
can be detected statically and communicated via per group meta-
data.
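A minimal sketch of per group precision detection follows. It assumes integer weights and counts magnitude bits plus one sign bit; the detection logic used in [10] or in LM's hardware may differ, and both function names are hypothetical.

```python
def group_precision(weights, signed=True):
    """Smallest bit count that can represent every weight in the group
    (magnitude bits of the largest value, plus one sign bit if signed)."""
    max_mag = max(abs(w) for w in weights)
    bits = max(max_mag.bit_length(), 1)   # use at least one magnitude bit
    return bits + 1 if signed else bits

def effective_precision(layer_weights, group_size=16):
    """Average of the per group precisions over a layer's weights."""
    groups = [layer_weights[i:i + group_size]
              for i in range(0, len(layer_weights), group_size)]
    return sum(group_precision(g) for g in groups) / len(groups)

# Two hypothetical groups: one with small weights, one with large weights.
layer = [1] * 16 + [100] * 16
print(effective_precision(layer))
```

Because a group needs only as many bits as its largest member, groups of small weights finish in fewer bit-serial steps than a single per layer precision would allow.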
Table 3 reports the average effective weight precision per layer
for a group of 16 weights. The estimated performance and energy
efficiency of Loom configurations relative to DPNN with the precision
profiles of Table 3 and for all layers combined is shown in
Table 4. For these estimates we assume that performance scales
linearly with weight precision.

[Figure 5: plot of performance relative to DPNN versus configuration (x-axis: 32, 64, 128, 256, 512). Series: Dstripes-all, Loom-all, Dstripes-conv, Loom-conv. Annotated with absolute performance (fps), weight memory, area overhead, and energy efficiency per configuration.]
Figure 5: Scaling vs. equivalent DPNN peak compute bandwidth.
Conv: convolutional layers only. All: all layers. All
results with a LPDDR4-4267 off-chip memory.

Table 4: Relative execution time speedup and energy efficiency
with LM for all layers vs. DPNN.
All layers combined, 100% TOP-1 accuracy
Network    Loom 1-bit     Loom 2-bit     Loom 4-bit
           Perf   Eff     Perf   Eff     Perf   Eff
NiN        3.38   2.73    3.32   3.13    3.31   3.48
AlexNet    5.66   4.57    5.61   4.57    4.95   5.19
Google     3.19   2.57    3.02   2.84    2.80   2.93
VGGS       5.72   4.62    5.46   5.13    4.42   4.63
VGGM       6.03   4.87    5.46   5.14    4.60   4.83
VGG19      3.38   2.73    3.28   3.09    3.01   3.15
Geomean    4.38   3.54    4.20   3.95    3.76   3.94
Exploiting the effective weight precisions yields a speedup of
4.38×, 4.20×, and 3.76× over DPNN for the LM1b, LM2b, and LM4b
configurations, respectively. The energy efficiency of LM over DPNN
is 3.54×, 3.95×, and 3.94× for the aforementioned configurations.
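The averages quoted above are geometric means over the six networks. As a quick consistency check, the sketch below recomputes the geomean of the LM 1-bit column from the per-network values of Table 4; the helper `geomean` is a hypothetical name, and small differences from the published figures are expected because the inputs are already rounded to two decimals.

```python
import math

# Per-network LM 1-bit results from Table 4 (all layers combined).
perf_1b = {"NiN": 3.38, "AlexNet": 5.66, "Google": 3.19,
           "VGGS": 5.72, "VGGM": 6.03, "VGG19": 3.38}
eff_1b = {"NiN": 2.73, "AlexNet": 4.57, "Google": 2.57,
          "VGGS": 4.62, "VGGM": 4.87, "VGG19": 2.73}

def geomean(values):
    """Geometric mean, computed in the log domain."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(round(geomean(perf_1b.values()), 2))  # compare with the reported 4.38
print(round(geomean(eff_1b.values()), 2))   # compare with the reported 3.54
```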
5 RELATED WORK
Due to space limitations, we limit attention to a few works that are
the most related. We have already compared to Stripes [7] extended
with dynamic precision reduction [5].
Pragmatic’s performance for the CVLs depends only on the
number of activation bits that are 1, but does not improve per-
formance for FCLs [12]. Further performance improvement may
be possible by combining Pragmatic’s approach with LM’s but the
costs per SIP may make this prohibitively expensive. Proteus ex-
ploits per layer precisions reducing memory footprint and band-
width but requires crossbars per input weight [13]. Loom does not
need crossbars. Hardwired NN implementations naturally exploit
per layer precisions [14]. Loom does not require that the whole
network fit on chip nor does it hardwire precisions. Furthermore,
Loom trims activation precisions further at runtime.
Several accelerators target ineffectual weights and/or activa-
tions for dense and/or sparse networks [15–18]. Most target either
FCLs or CVLs alone. LM targets both layer types and benefits all
inputs, ineffectual or not.
6 CONCLUSION
This work presented Loom, a hardware inference accelerator for
DNNs whose execution time for the convolutional and the fully-connected
layers scales inversely proportionally with the precision
p used to represent the input data. LM can trade off accuracy for
performance and energy efficiency on the fly. Future work may
consider extending LM to further exploit weight sparsity.
REFERENCES
[1] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and
O. Temam, “Dadiannao: A machine-learning supercomputer,” in Microarchitec-
ture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 609–
622, Dec 2014.
[2] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for
accurate object detection and semantic segmentation,” CoRR, vol. abs/1311.2524,
2013.
[3] A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger,
S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-
to-end speech recognition,” CoRR, vol. abs/1412.5567, 2014.
[4] R. Lukac, Computational photography: methods and applications. CRC Press,
2016.
[5] A. D. Lascorz, S. Sharify, P. Judd, and A. Moshovos, “Dynamic stripes: Exploiting
the dynamic precision requirements of activation values in neural networks,”
CoRR, vol. abs/1706.00504, 2017.
[6] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and
A. Moshovos, “Reduced-Precision Strategies for Bounded Memory in Deep Neural
Nets,” arXiv:1511.05236v4 [cs.LG], 2015.
[7] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, “Stripes:
Bit-serial Deep Neural Network Computing,” in Proc. of the 49th Annual IEEE/ACM
Intl’ Symposium on Microarchitecture, 2016.
[8] N. Muralimanohar and R. Balasubramonian, “Cacti 6.0: A tool to understand
large caches,” 2015.
[9] M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie, “Destiny: A tool for model-
ing emerging 3d nvm and edram caches,” in Design, Automation Test in Europe
Conference Exhibition, March 2015.
[10] A. Delmas, S. Sharify, P. Judd, M. Nikolic, and A. Moshovos, “Dpred: Making
typical activation values matter in deep learning computing,” arXiv preprint
arXiv:1804.06732, 2018.
[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD:
Single Shot MultiBox Detector,” arXiv:1512.02325 [cs.CV], 2016.
[12] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov, and
A. Moshovos, “Bit-pragmatic deep neural network computing,” in Proceedings
of the 50th Annual IEEE/ACM International Symposium on Microarchitecture,
MICRO-50 ’17, pp. 382–394, 2017.
[13] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. E. Jerger, and
A. Moshovos, “Proteus: Exploiting numerical precision variability in deep neural
networks,” in Proceedings of the 2016 International Conference on Supercomputing,
p. 23.
[14] T. Szabo, L. Antoni, G. Horvath, and B. Feher, “A full-parallel digital implemen-
tation for pre-trained NNs,” in IJCNN 2000, Proceedings of the IEEE-INNS-ENNS
International Joint Conference on Neural Networks, 2000, vol. 2, pp. 49–54 vol.2,
2000.
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie:
Efficient inference engine on compressed deep neural network,” in Proceedings
of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscat-
away, NJ, USA), pp. 243–254, IEEE Press, 2016.
[16] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. Enright Jerger, and
A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network comput-
ing,” in 2016 IEEE/ACM International Conference on Computer Architecture (ISCA),
2016.
[17] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen,
“Cambricon-x: An accelerator for sparse neural networks,” in 49th Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei,
Taiwan, October 15-19, 2016, pp. 1–12, 2016.
[18] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer,
S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse
convolutional neural networks,” in Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA ’17, pp. 27–40, 2017.
