Laconic Deep Learning Computing by Sharify, Sayeh et al.
arXiv:1805.04513v1 [cs.NE] 10 May 2018
Laconic Deep Learning Computing
Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos
Electrical and Computer Engineering, University of Toronto
{sayeh, delmasl1, moshovos}@ece.utoronto.ca,
{mostafa.mahmoud, milos.nikolic}@mail.utoronto.ca
ABSTRACT
We motivate a method for transparently identifying ineffectual
computations in unmodified Deep Learning models
without affecting accuracy. Specifically, we show that if we
decompose multiplications down to the bit level the amount
of work performed during inference for image classification
models can be consistently reduced by two orders of magni-
tude. In the best case studied of a sparse variant of AlexNet,
this approach can ideally reduce computation work by more
than 500×. We present Laconic, a hardware accelerator that
implements this approach to improve execution time and
energy efficiency for inference with Deep Learning Networks.
Laconic judiciously gives up some of the work reduction po-
tential to yield a low-cost, simple, and energy efficient de-
sign that outperforms other state-of-the-art accelerators. For
example, a Laconic configuration that uses a weight memory
interface with just 128 wires outperforms a conventional
accelerator with a 2K-wire weight memory interface by 2.3×
on average while being 2.13× more energy efficient on
average. A Laconic configuration that uses a 1K-wire weight
memory interface outperforms the 2K-wire conventional
accelerator by 15.4× and is 1.95× more energy efficient.
Laconic does not require but rewards advances in model design
such as a reduction in precision, the use of alternate numeric
representations that reduce the number of bits that are “1”,
or an increase in weight or activation sparsity.
1. MOTIVATION
Modern computing hardware is energy-constrained and
thus developing techniques that reduce the amount of energy
required to perform the computation is essential for improv-
ing performance. The bulk of the work performed by convo-
lutional neural networks during inference is due to 2D con-
volutions (see Section 2). In turn, these convolutions entail
numerous multiply-accumulate operations where most of the
work is due to the multiplication of an activation A and a
weight W. To improve energy efficiency, a hardware
accelerator can thus strive to perform only those multiplications
that are effectual, which will also lead to fewer additions.
We can approach an A×W multiplication as a monolithic
action which can be either performed or avoided in
its entirety. Alternatively, we can decompose it into a col-
lection of simpler operations. For example, if A and W are
16b fixed-point numbers A×W can be approached as 256
1b× 1b multiplications or 16 16b× 1b ones.
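
The two decompositions mentioned above can be sketched in a few lines of Python. This is only an illustration, assuming unsigned 16b operands; the function names are hypothetical and do not come from the Laconic design.

```python
def mul_16x1(a: int, w: int, bits: int = 16) -> int:
    """A x W as `bits` partial products, each a (16b x 1b) multiplication."""
    total = 0
    for j in range(bits):
        w_bit = (w >> j) & 1           # one bit of the weight
        total += (a * w_bit) << j      # 16b x 1b product, weighted by 2**j
    return total


def mul_1x1(a: int, w: int, bits: int = 16) -> int:
    """A x W as bits*bits partial products, each a (1b x 1b) multiplication."""
    total = 0
    for i in range(bits):
        for j in range(bits):
            a_bit = (a >> i) & 1
            w_bit = (w >> j) & 1
            total += (a_bit & w_bit) << (i + j)   # 1b x 1b product, weighted by 2**(i+j)
    return total
```

Both functions return the same value as `a * w`; they differ only in the granularity at which a partial product can be skipped.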
Figure 1h reports the potential reduction in work for sev-
eral ineffectual work avoidance policies. The “A” policy
avoids multiplications where the activation is zero. This is
representative of the first generation of value-based acceler-
ators that were motivated by the relatively large fraction of
zero activations that occur in convolutional neural networks,
e.g., Cnvlutin [1]. The “A+W” policy skips those multiplications
where either the activation or the weight is zero and is
representative of accelerators that target sparse models where a
significant fraction of synaptic connections has been pruned,
e.g., SCNN [2]. The “Ap” (e.g., Stripes [3] or Dynamic
Stripes [4]) and “Ap+Wp” (e.g., Loom [5]) policies target
precision for the activations alone or for the activations and
the weights respectively. It has been found that neural net-
works exhibit variable per layer precision requirements. All
aforementioned measurements corroborate past work on ac-
celerator designs that exploited the respective properties.
However, we show that further potential for work reduc-
tion exists if we decompose the multiplications at the bit
level. Specifically, for our discussion we can assume with-
out loss of generality that these multiplications operate on
16b fixed-point values. The multiplication itself is given by:
$$
A \times W = \sum_{i=0}^{15} \sum_{j=0}^{15} A_i \ \mathrm{AND} \ W_j \tag{1}
$$
where Ai and Wj are bits of A and W respectively. When
decomposed down to the individual 256 single bit multipli-
cations one can observe that it is only those multiplications
where both Ai and Wj are non-zero that are effectual. Accordingly,
the “Ab” (e.g., Pragmatic [6]) and “Ab+Wb” measurements
show the potential reduction in work that is possible if
we skip those single bit multiplications where the activation
bit is zero, or where either the activation or the weight bit
is zero, respectively. The results show that the potential is
far greater than with the policies discussed thus far.
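
To give a feel for these two policies, the sketch below counts how many of the 256 single bit products remain for one A×W pair. It assumes unsigned 16b values, and the function names are hypothetical; the measurements in Figure 1 come from full networks, not from this toy calculation.

```python
def popcount(value: int, bits: int = 16) -> int:
    """Number of bits that are 1 in the low `bits` bits of value."""
    return bin(value & ((1 << bits) - 1)).count("1")


def remaining_products(a: int, w: int, bits: int = 16) -> dict:
    """Single bit products left under each policy for one A x W pair."""
    return {
        "all": bits * bits,                              # no skipping: 256 for 16b
        "Ab": popcount(a, bits) * bits,                  # skip zero activation bits
        "Ab+Wb": popcount(a, bits) * popcount(w, bits),  # skip if either bit is zero
    }
```

For example, an activation with two bits set and a weight with three bits set leaves only 6 of the 256 products under “Ab+Wb”.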
Further to our discussion, rather than representing A and
W as bit vectors, we can instead Booth-encode them as a
series of signed powers of two, or terms (higher-radix Booth
encoding is also possible). In this case the multiplication is
given by:
Figure 1: Performance improvement potential (logarithmic scale, 1x to 1000x) for: 1) skipping zero activations [1],
2) skipping zero activations and weights, 3) using static precision for activations [3], 4) using static precision for
activations and weights, 5) skipping zero bits of activations [6], 6) skipping zero bits of activations and weights,
7) skipping zero bits of activations using Booth encoding [6], 8) skipping zero bits of activations and weights using
Booth encoding. Panels: (b) AlexNet, (c) GoogLeNet, (d) VGG_S, (e) VGG_M, (f) AlexNet-Sparse, (g) ResNet-Sparse, (h) Geomean.
$$
A \times W = \sum_{i=0}^{A_{terms}} \sum_{j=0}^{W_{terms}} At_i \times Wt_j \tag{2}
$$
where At_i and Wt_j are of the form ±2^x. As with the
positional representation, it is only those products where both
At_i and Wt_j are non-zero that are effectual. Accordingly, the
figure shows the potential reduction in work with “At” where
we skip the ineffectual terms for a Booth-encoded activation
(e.g., Pragmatic [6]), and with “At+Wt” where we calculate
only those products where both the activation and the weight
terms are non-zero. The results show that the reduction in
work (and equivalently the performance improvement poten-
tial) with “At+Wt” is in most cases two orders of magnitude
higher than the zero value or the precision based approaches.
Based on these results, our goal is to develop a hardware
accelerator that computes only the effectual terms. No other
accelerator to date has exploited this potential. Moreover,
by targeting “At+Wt” we can also exploit “Ab+Wb” where
the inputs are represented in a plain positional representation
and are not Booth-encoded.
Figure 2: Convolutional Layer
2. BACKGROUND
This section provides the required background information
as follows: Section 2.1 reviews the operation of a Convolutional
Neural Network and Section 2.2 describes our
baseline system.
2.1 Convolutional Layers
Convolutional Neural Networks (CNNs) usually consist
of several Convolutional layers (CVLs) followed by a few
fully-connected layers (FCLs). In many image related CNNs
most of the operation time is spent on processing CVLs in
which a 3D convolution operation is applied to the input
activations, producing output activations. Figure 2 illustrates a
CVL with a c×x×y input activation block and N c×h×k filters.
The layer computes the dot product of each of these N filters
(denoted f0, f1, ..., fN−1) with a c×h×k subarray of the input
activations, called a window, to generate a single output activation;
sliding a filter over all windows yields an oh×ok output.
In total, convolving the N filters with the activation windows results
in N oh×ok outputs, which are passed to the input of the
next layer. The convolution of activation windows and filters
takes place in a sliding window fashion with a constant
stride S.
Fully-connected layers can be implemented as convolu-
tional layers in which filters and input activations have the
same dimensions, i.e., x = h and y = k.
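
As an illustration of the sliding window computation described above, here is a minimal Python sketch using nested lists. It assumes no padding, so the output size is taken as oh = (x − h)/S + 1 and ok = (y − k)/S + 1; that formula and the name `conv_layer` are assumptions made for this example and are not stated in the text.

```python
def conv_layer(acts, filters, stride=1):
    """acts: c x X x Y nested lists; filters: N x c x h x k nested lists.

    Returns N output planes, each of size oh x ok (no padding assumed).
    """
    c, x, y = len(acts), len(acts[0]), len(acts[0][0])
    h, k = len(filters[0][0]), len(filters[0][0][0])
    oh = (x - h) // stride + 1
    ok = (y - k) // stride + 1
    outputs = []
    for f in filters:                      # one output plane per filter
        plane = []
        for r in range(oh):
            row = []
            for q in range(ok):
                acc = 0                    # dot product of the filter and one window
                for ch in range(c):
                    for dr in range(h):
                        for dq in range(k):
                            a = acts[ch][r * stride + dr][q * stride + dq]
                            acc += a * f[ch][dr][dq]
                row.append(acc)
            plane.append(row)
        outputs.append(plane)
    return outputs
```

When the filter has the same dimensions as the input (x = h and y = k), the loop visits a single window and produces one output per filter, which is the fully-connected case.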
2.2 Baseline system
Our baseline design (BASE) is a data-parallel engine in-
spired by the DaDianNao accelerator [7] which uses 16-bit
fixed-point activations and weights. Our baseline configura-
tion has 8 inner product units (IPs) each accepting 16 input
activations and 16 weights as inputs. The 16 input activa-
tions are broadcast to all 8 IPs; however, each IP has its own
16 weights. Every cycle each IP multiplies 16 input activa-
tions by their 16 corresponding weights and reduces them
into a single partial output activation using a 16-input 32-bit
adder tree. The partial results are accumulated over the
multiple cycles to generate the final output activation. An ac-
tivation memory provides the activations and a weight mem-
ory provides the weights. Other memory configurations are
possible.
3. Laconic: A SIMPLIFIED EXAMPLE
This section illustrates the key concepts behind Laconic
via an example using 4-bit activations and weights.
Bit-Parallel Processing: Figure 3a shows a bit-parallel en-
gine multiplying two 4-bit activation and weight pairs, gener-
ating a single 4-bit output activation per cycle. Its throughput
is two 4b× 4b products per cycle.
Bit-Serial Processing: Figure 3b shows an equivalent bit-
serial engine which is representative of Loom (LM) [5]. To
match the bit-parallel engine’s throughput, LM processes 8
input activations and 8 weights every cycle producing 32
1b× 1b products. Since LM processes both activations and
weights bit-serially, it produces 16 output activations in Pa×Pw
cycles, where Pa and Pw are the activation and weight precisions,
respectively. Thus, LM outperforms the bit-parallel
engine by 16/(Pa×Pw). In this example, since both activations and
weights can be represented in three bits, the speedup of LM
over the bit-parallel engine is 1.78×. However, LM still processes
some ineffectual terms. For example, in the first cycle
27 of the 32 1b×1b products are zero and thus ineffectual
and can be removed.
Laconic: Figure 3c illustrates a simplified Laconic engine
in which both the activations and weights are represented as
vectors of essential powers of two, or one-offsets. For example,
A0 = (110) is represented as a vector of its one-offsets A0
= (2,1). Every cycle each PE accepts a 4-bit one-offset of an
input activation and a 4-bit one-offset of a weight and adds
them up to produce the power of the corresponding product
term in the output activation. Since Laconic processes acti-
vation and weight “term”-serially, it takes ta × tw cycles for
each PE to complete producing the product terms of an out-
put activation, where ta and tw are the number of one-offsets
in the corresponding input activation and weight. The engine
processes the next set of activation and weight one-offsets af-
ter T cycles, where T is the maximum ta × tw among all the
PEs. In this example, the maximum T is 6 corresponding
to the pair of A0 = (2,1) and W^1_0 = (2,1,0) from PE(1,0).
Thus, the engine can start processing the next set of activa-
tions and weights after 6 cycles achieving 2.67× speedup
over the bit-parallel engine.
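
The arithmetic of this example can be reproduced with the short Python sketch below. It assumes unsigned values in a plain positional representation, as in the example, and takes the 16 cycles of the bit-parallel engine as the reference. Only the pair A0 = (110) with the weight (2,1,0) comes from the text; the other pairs and all function names are hypothetical.

```python
def one_offsets(value: int) -> list:
    """Positions of the bits that are 1, most significant first: 0b110 -> [2, 1]."""
    return [i for i in range(value.bit_length() - 1, -1, -1) if (value >> i) & 1]


def laconic_cycles(pairs) -> int:
    """T = max over all PEs of ta * tw, for a list of (activation, weight) pairs."""
    return max(len(one_offsets(a)) * len(one_offsets(w)) for a, w in pairs)


def loom_speedup(pa: int, pw: int, reference_cycles: int = 16) -> float:
    """Speedup of the bit-serial engine: 16 / (Pa x Pw) in this example."""
    return reference_cycles / (pa * pw)


def laconic_speedup(pairs, reference_cycles: int = 16) -> float:
    """Speedup of the term-serial engine; at least one cycle is assumed per set."""
    return reference_cycles / max(laconic_cycles(pairs), 1)


# One (activation, weight) pair per PE; the first pair is the one from the text.
example_pairs = [(0b110, 0b111), (0b010, 0b001), (0b100, 0b011)]
```

With 3-bit precisions `loom_speedup(3, 3)` is 16/9, and for `example_pairs` the slowest PE needs 2 × 3 = 6 cycles, giving 16/6.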
4. Laconic
This section presents the Laconic architecture by explain-
ing its processing approach, processing elements structure,
and its high-level organization.
4.1 Approach
Laconic’s goal is to minimize the required computation
for producing the products of input activations and weights
by processing only the essential bits of both the input
activations and weights. To do so, Laconic (LAC) converts, on-the-fly,
the input activations and weights into a representation which
contains only the essential bits, and processes per cycle one
pair of essential bits, one from an activation and another from
a weight. The rest of this section is organized as follows:
Section 4.1.1 describes the activation and weight representa-
tions in LAC and Section 4.1.2 explains how LAC calculates
the product terms.
4.1.1 Activation and Weight Representation
Figure 3: a) Bit-parallel unit. b) Bit-serial unit with equivalent throughput [5]. c) Laconic unit with equivalent throughput
where for both activations and weights only essential information is processed.
Figure 4: Laconic processing element, a) 1) Calculating the exponents of the products, 2) Converting the exponents into
their corresponding one-hot format, 3) Counting the number of each power of two bucket, 4) Shifting the counting results
according to the bucket values, 5) Sign-extending the shifted values and reducing the results into a single 42-bit partial output,
6) Accumulating the partial outputs over multiple cycles. b) Enhanced steps (4) and (5).
Stage labels in the figure: exponent, one-hot, histogram, alignment, reduction, accumulation.
For clarity we present a LAC implementation that processes
the one-offsets, that is, the non-zero signed powers of
two in a Booth-encoded representation of the activations and
weights (however, LAC could be adjusted to process a regular
positional representation or adapted to process representations
other than fixed-point). LAC represents each activation
or weight as a list (on, . . . ,o0) of its one-offsets. Each
one-offset is represented as a (sign,magnitude) pair. For example,
an activation A = -2(10) = 1110(2) with a Booth-encoding
of 00(-1)0(2) would be represented as (-,1) and an A = 7(10) =
0111(2) would be represented as ((+,3),(-,0)). The sign can be
encoded using a single bit, with, for example, 0 representing
“+” and 1 representing “-”.
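
A possible way to produce such lists in software is sketched below. The text does not specify the exact encoder, so this example uses the non-adjacent form (NAF), a standard signed-digit recoding, as a stand-in; it reproduces the two examples above but may differ from the hardware Booth encoder for other values. The function names are hypothetical.

```python
def signed_one_offsets(value: int) -> list:
    """Recode an integer as signed powers of two (NAF), most significant first.

    Each term is a (sign, magnitude) pair with sign '+' or '-'.
    """
    terms = []
    position = 0
    while value != 0:
        if value & 1:
            digit = 2 - (value % 4)        # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= digit                 # clears the low bit, possibly with a carry
            terms.append(("+" if digit > 0 else "-", position))
        value //= 2
        position += 1
    return terms[::-1]


def decode(terms) -> int:
    """Inverse of signed_one_offsets: sum of the signed powers of two."""
    return sum((1 if sign == "+" else -1) * (1 << magnitude)
               for sign, magnitude in terms)
```

Here `signed_one_offsets(-2)` gives `[('-', 1)]` and `signed_one_offsets(7)` gives `[('+', 3), ('-', 0)]`, matching the two activations discussed above.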
4.1.2 Calculating a Product Term
LAC calculates the product of a weight W = (Wterms)
and an input activation A = (Aterms) where each term is a
(sign,magnitude) = (si, ti) as follows:
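$$
\begin{aligned}
W \times A &= \sum_{\forall (s,t) \in W_{terms}} (-1)^{s} 2^{t} \times \sum_{\forall (s',t') \in A_{terms}} (-1)^{s'} 2^{t'} \\
&= \left((-1)^{s_0} 2^{t_0} + (-1)^{s_1} 2^{t_1} + \cdots + (-1)^{s_n} 2^{t_n}\right) \times
   \left((-1)^{s'_0} 2^{t'_0} + (-1)^{s'_1} 2^{t'_1} + \cdots + (-1)^{s'_m} 2^{t'_m}\right) \\
&= \left((-1)^{s_0} (-1)^{s'_0} (2^{t_0} \times 2^{t'_0}) + \cdots + (-1)^{s_0} (-1)^{s'_m} (2^{t_0} \times 2^{t'_m})\right) + \cdots \\
&\quad + \left((-1)^{s_n} (-1)^{s'_0} (2^{t_n} \times 2^{t'_0}) + \cdots + (-1)^{s_n} (-1)^{s'_m} (2^{t_n} \times 2^{t'_m})\right) \\
&= \left((-1)^{(s_0+s'_0)} 2^{(t_0+t'_0)} + \cdots + (-1)^{(s_0+s'_m)} 2^{(t_0+t'_m)}\right) + \cdots \\
&\quad + \left((-1)^{(s_n+s'_0)} 2^{(t_n+t'_0)} + \cdots + (-1)^{(s_n+s'_m)} 2^{(t_n+t'_m)}\right)
\end{aligned}
\tag{3}
$$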
That is, instead of processing the full A×W product in a
single cycle, LAC processes each product of a single t′ term
of the input activation A and of a single t term of the weight
W individually. Since these terms are powers of two, so is
their product. Accordingly, LAC can first add the corresponding
exponents t′+t. If a single product is processed
per cycle, the final value 2^(t′+t) can be calculated via a decoder.
In the more likely configuration where more than one term
pair is processed per cycle, LAC can use one decoder per
term pair to calculate the individual 2^(t′+t) products and then
an efficient adder tree to accumulate them all. This is described in
more detail in the next section.
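
Equation (3) can be checked numerically with a small Python sketch. It assumes each term is given as a pair (s, t) with sign bit s (0 for “+”, 1 for “-”) and exponent t; the function name is hypothetical and the nested loop stands in for the term-serial processing, not for the actual hardware schedule.

```python
def multiply_terms(w_terms, a_terms) -> int:
    """W x A from lists of (sign_bit, exponent) terms, one term pair at a time."""
    total = 0
    for s, t in w_terms:
        for s_prime, t_prime in a_terms:
            sign = s ^ s_prime              # (-1)^(s + s') only depends on the XOR
            exponent = t + t_prime          # 2^t x 2^t' = 2^(t + t')
            magnitude = 1 << exponent       # what a decoder would output (one-hot)
            total += -magnitude if sign else magnitude
    return total
```

For W = 7, encoded as ((+,3),(-,0)), and A = -2, encoded as (-,1), the two term pairs contribute -2^4 and +2^1, which sum to -14.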
4.2 Processing Element
Figure 4 illustrates how the LAC Processing Element (PE)
calculates the product of a set of weights and their corre-
sponding input activations. Without loss of generality we
assume that each PE multiplies 16 weights, W0,...,W15, by 16
input activations, A0,...,A15. The PE calculates the 16 products
in 6 steps:
In Step 1, the PE accepts 16 4-bit weight one-offsets,
t_0,...,t_15 and their 16 corresponding sign bits s_0,...,s_15,
along with 16 4-bit activation one-offsets, t′_0,...,t′_15 and their
signs s′_0,...,s′_15, and calculates 16 one-offset pair products.
Since all one-offsets are powers of two, their products will
also be powers of two. Accordingly, to multiply 16 activations
by their corresponding weights LAC adds their one-offsets
to generate the 5-bit exponents (t_0+t′_0),...,(t_15+t′_15)
and uses 16 XOR gates to determine the signs of the products.
In Step 2, for the ith pair of activation and weight, where
i ∈ {0,...,15}, the PE calculates 2^(t_i+t′_i) via a 5b-to-32b decoder
which converts the 5-bit exponent result (t_i+t′_i) into
its corresponding one-hot format, i.e., a 32-bit number with
one “1” bit and 31 “0” bits. The single “1” bit in the jth position
of a decoder output corresponds to a value of either +2^j
or -2^j depending on the sign of the corresponding product
(Ei.sign on the figure).
Step 3: The PE generates the equivalent of a histogram
of the decoder output values. Specifically, the PE accumulates
the 16 32-bit numbers from Step 2 into 32 buckets,
N0, . . . ,N31, corresponding to the values 2^0, 2^1, ..., 2^31, as
there are 32 powers of two. The signs of these numbers,
Ei.sign from Step 1, are also taken into account. At the end
of this step, each “bucket” contains the count of the number
of inputs that had the corresponding value. Since each
bucket has 16 signed inputs, the resulting count is
a value in [-16, 16] and thus is represented by 6 bits in 2’s
complement.
Step 4: Naïvely reducing the 32 6-bit counts into the final
output would require first “shifting” the counts according
to their weight, converting all to 31 + 6 = 37b, and then
using a 32-input adder tree as shown in Figure 4a(4)-(5). Instead,
LAC reduces costs and energy by exploiting the relative
weighting of each count by grouping and concatenating
them in this stage as shown in Figure 4b(4′.1). For example,
rather than adding N0 and N6 we can simply concatenate
them as they are guaranteed to have no overlapping bits that
are “1”. This is explained in more detail in Section 4.2.1.
Step 5: As Section 4.2.1 explains in more detail, the
concatenated values from Step 4′.1 are added via a 6-input
adder tree as shown in Figure 4b(5′) producing a 38b partial
sum.
Step 6: The partial sum from the previous step is accumu-
lated with the partial sum held in an accumulator. This way,
the complete A×W product can be calculated over multiple
cycles, one effectual pair of one-offsets per cycle.
The aforementioned steps are not meant to be interpreted
as pipeline stages. They can be merged or split as desired.
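
The six steps can be mirrored in software as follows. This is a functional sketch only, using the naïve shift-and-add reduction of Figure 4a for Steps 4 and 5. It assumes each lane supplies a (sign_bit, exponent) pair with exponents in 0..15, and it allows fewer than 16 lanes for brevity; the function names are hypothetical.

```python
def histogram(w_offsets, a_offsets) -> list:
    """Steps 1 to 3: signed count per power of two, for up to 16 term pairs."""
    counts = [0] * 32
    for (s_w, t_w), (s_a, t_a) in zip(w_offsets, a_offsets):
        sign = s_w ^ s_a                   # Step 1: XOR of the two sign bits
        exponent = t_w + t_a               # Step 1: 4b + 4b gives a 5b exponent
        one_hot = 1 << exponent            # Step 2: 5b-to-32b decoder output
        bucket = one_hot.bit_length() - 1  # position of the single 1 bit
        counts[bucket] += -1 if sign else 1    # Step 3: signed histogram
    return counts


def pe_cycle(w_offsets, a_offsets, psum: int = 0) -> int:
    """One cycle of the PE: returns the updated partial sum."""
    counts = histogram(w_offsets, a_offsets)
    partial = sum(n << j for j, n in enumerate(counts))  # Steps 4-5 (naive version)
    return psum + partial                                # Step 6: accumulate
```

Feeding one term pair per cycle for W = 7 = ((+,3),(-,0)) and A = -2 = (-,1) accumulates -16 and then +2, so the partial sum ends at -14 after two cycles.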
4.2.1 Enhanced Adder Tree
Step 5 of Figure 4a has to add 32 6b counts, each weighted
by the corresponding power of 2. This section presents
an alternate design that replaces Steps 4 and 5. Specifically,
it presents an equivalent, more area and energy efficient
“adder tree” which takes advantage of the fact that
the outputs of Step 4 contain groups of numbers that have
no overlapping bits that are “1”. For example, in relation
to the naïve adder tree of Figure 4a(5), consider adding the
6th 6-bit input (N6 = n^6_5 n^6_4 n^6_3 n^6_2 n^6_1 n^6_0) with the 0th 6-bit input
(N0 = n^0_5 n^0_4 n^0_3 n^0_2 n^0_1 n^0_0). We first have to shift N6 by 6 bits, which
amounts to adding 6 zeros as the 6 least significant bits of the
result. In this case, there will be no bit position in which both
N6 << 6 and N0 have a bit that is 1. Accordingly, adding
(N6 << 6) and N0 is equivalent to concatenating either N6 and
N0 or (N6-1) and N0 based on the sign bit of N0 (Figure 5a):
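$$
\begin{aligned}
N^6 \times 2^6 + N^0 &= (N^6 \ll 6) + N^0 \\
\text{1) if } n^0_5 \text{ is zero:} \quad
&= n^6_5 n^6_4 n^6_3 n^6_2 n^6_1 n^6_0\,000000 + 000000\,n^0_5 n^0_4 n^0_3 n^0_2 n^0_1 n^0_0 \\
&= n^6_5 n^6_4 n^6_3 n^6_2 n^6_1 n^6_0\,n^0_5 n^0_4 n^0_3 n^0_2 n^0_1 n^0_0 = \{N^6, N^0\} \\
\text{2) else if } n^0_5 \text{ is one:} \quad
&= n^6_5 n^6_4 n^6_3 n^6_2 n^6_1 n^6_0\,000000 + 111111\,n^0_5 n^0_4 n^0_3 n^0_2 n^0_1 n^0_0 \\
&= (n^6_5+1)(n^6_4+1)(n^6_3+1)(n^6_2+1)(n^6_1+1)(n^6_0+1)\,n^0_5 n^0_4 n^0_3 n^0_2 n^0_1 n^0_0 \\
&= \{(N^6 - 1), N^0\}
\end{aligned}
\tag{4}
$$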
Accordingly, this process can be applied recursively, by
grouping those Ni where (i MOD 6) is equal. That is, the ith
input would be concatenated with the (i+6)th, (i+12)th, and
so on. Figure 5b shows an example unit for those Ni inputs
where (i MOD 6) = 0. While the figure shows the concatenation
done as a stack, other arrangements are possible.
For the 16 product unit described here the above process
yields the following six groups:
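$$
\begin{aligned}
G_0 &= \{N^{30}, N^{24}, N^{18}, N^{12}, N^{6}, N^{0}\} \\
G_1 &= \{N^{31}, N^{25}, N^{19}, N^{13}, N^{7}, N^{1}\} \\
G_2 &= \{N^{26}, N^{20}, N^{14}, N^{8}, N^{2}\} \\
G_3 &= \{N^{27}, N^{21}, N^{15}, N^{9}, N^{3}\} \\
G_4 &= \{N^{28}, N^{22}, N^{16}, N^{10}, N^{4}\} \\
G_5 &= \{N^{29}, N^{23}, N^{17}, N^{11}, N^{5}\}
\end{aligned}
\tag{5}
$$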
The final partial sum is then given by the following:
$$
psum = \sum_{i=0}^{5} (G_i \ll i) \tag{6}
$$
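
The equivalence between the grouped concatenation and the naïve weighted sum can be checked with the Python sketch below. It is a functional model under stated assumptions: counts are integers in [-16, 16], the N−1 adjustment of equation (4) is modeled as a borrow that propagates from each 6-bit field to the next, and a group is read back as a two's complement number. The function names are hypothetical.

```python
def group_value(counts) -> int:
    """Concatenate 6-bit fields (least significant count first) into one value."""
    bits = 0
    borrow = 0
    for k, n in enumerate(counts):
        n -= borrow                         # use (N - 1) if the lower field was negative
        bits |= (n & 0x3F) << (6 * k)       # place the 6-bit two's complement field
        borrow = 1 if n < 0 else 0
    width = 6 * len(counts)
    return bits - (1 << width) if borrow else bits   # interpret as two's complement


def psum_enhanced(counts32) -> int:
    """Equation (6): sum of the six groups, each shifted by its index."""
    return sum(group_value(counts32[i::6]) << i for i in range(6))


def psum_naive(counts32) -> int:
    """Naive Steps 4 and 5: shift every count by its bucket index and add."""
    return sum(n << j for j, n in enumerate(counts32))
```

The slice `counts32[i::6]` selects N_i, N_(i+6), N_(i+12), and so on, which yields the same six groups as equation (5): six counts for G0 and G1 and five counts for G2 to G5.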
Figure 5: One of Laconic’s concatenation units. Panels (a) and (b).
Figure 6: Laconic tile
4.3 Tile Organization
Figure 6 illustrates a LAC tile, which comprises a 2D array
of PEs processing 16 windows of input activations and K = 8
filters every cycle. PEs along the same column share the
same input activations and PEs along the same row receive
the same weights. Every cycle PE(i,j) receives the next one-
offset from each input activation from the jth window and
multiplies it by a one-offset of the corresponding weight
from the ith filter. The tile starts processing the next set of
activations and weights when all the PEs are finished with
processing the terms of the current set of 16 activations and
their corresponding weights.
Since LAC processes both activations and weights term-serially,
to match our BASE configuration it must process
more filters or more windows concurrently. Here we
consider implementations that process more filters. In the
worst case each activation and weight possesses 16 terms,
thus a LAC tile should process 8 × 16 = 128 filters in parallel
to always match the peak compute bandwidth of BASE. However,
as shown in Figure 1, with 16× more filters LAC’s potential
performance improvement over the baseline is more
than two orders of magnitude. Thus, we can trade off some
of this potential by using fewer filters.
To read weights from the WM, BASE requires 16 wires
per weight while LAC requires only one wire per weight as
it processes weights term-serially. Thus, with the same number
of filters LAC requires 16× fewer wires. In this study we
limit our attention to a BASE configuration with 8 filters, 16
weights per filter, and thus 2K weight wires (BASE2K), and
to LAC configurations with 8, 16, 32, and 64 filters, and 128,
256, 512, and 1K weight wires. In all designs, the number
of activation wires is set to 256 (Figure 7). Alternatively, we
could fix the number of filters, and accordingly the number of
weight wires, and add more parallelism to the design by increasing
the number of activation windows. The evaluation
of such a design is not reported in this document.
Figure 7: a) Laconic configurations: LAC128, LAC256,
LAC512, LAC1K. b) BASE configuration: BASE2K
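
The wire counts quoted above follow from a simple product, sketched here in Python. The function name and its parameters are hypothetical; the numbers (16 weights per filter, 16 wires per weight for BASE, one wire per weight for LAC) are the ones given in the text.

```python
def weight_wires(filters: int, weights_per_filter: int = 16,
                 bits_per_weight: int = 16, term_serial: bool = False) -> int:
    """Number of weight memory wires needed to feed `filters` filters."""
    wires_per_weight = 1 if term_serial else bits_per_weight
    return filters * weights_per_filter * wires_per_weight


base_2k = weight_wires(8)                                   # BASE2K
lac_wires = {f: weight_wires(f, term_serial=True) for f in (8, 16, 32, 64)}
```

This gives 2048 wires for the baseline and 128, 256, 512, and 1024 wires for the four LAC configurations.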
5. EVALUATION
This section evaluates LAC’s performance, energy and
area and explores different configurations of LAC compared
to BASE2K. This section considers the LAC128, LAC256, LAC512,
and LAC1K configurations which require 128, 256, 512, and
1K weight wires, respectively (Figure 7).
5.1 Methodology
Execution time is modeled via a custom cycle-accurate
simulator and energy and area results are reported based
on post layout simulations of the designs. Synopsys Design
Compiler [8] was used to synthesize the designs with
a TSMC 65nm library. Layouts were produced with Cadence
Innovus [9] using synthesis results. Intel PSG ModelSim
was used to generate data-driven activity factors to report the
power numbers. The clock frequency of all designs is set
to 1GHz. The ABin and ABout SRAM buffers were modeled
with CACTI [10], and the activation memory (AM) and weight
memory (WM) were modeled as eDRAM with Destiny [11].
5.2 Performance
Figure 8 shows the performance of LAC configurations
relative to BASE2K for convolutional layers with the 100%
relative TOP-1 accuracy precision profiles of Table 1.
Laconic targets both dense and sparse networks and improves
performance by processing only the essential
terms; however, the sparse networks benefit more as
they possess more ineffectual terms. On average, LAC128
outperforms BASE2K by more than 2×, while for AlexNet-Sparse
LAC128 achieves a speedup of 3× over the baseline.
Figure 8 shows how average performance on convolutional
layers over all networks scales for different configurations
with different numbers of weight wires. LAC256, LAC512, and
LAC1K achieve speedups of 4.0×, 8.1×, and 15.4× over
BASE2K, respectively.

Table 1: Activation and weight precision profiles in bits for the convolutional layers (100% accuracy).

Network               Activation Precision Per Layer             Weight Precision Per Network
AlexNet               9-8-5-5-7                                  11
GoogLeNet             10-8-10-9-8-10-9-8-9-10-7                  11
VGG_S                 7-8-9-7-9                                  12
VGG_M                 7-7-7-8-7                                  12
AlexNet-Sparse [12]   8-9-9-9-8                                  7
ResNet50-Sparse [13]  10-8-6-6-5-7-6-6-7-7-6-7-6-7-6-8-7-6-      13
                      8-6-5-8-8-6-8-7-7-6-9-7-5-8-7-6-8-7-
                      6-8-7-6-8-8-7-8-7-9-6-10-7-6-10-8-7

Figure 8: Laconic performance relative to BASE2K

Table 2: Laconic energy efficiency relative to BASE2K.

Network          LAC128  LAC256  LAC512  LAC1K
AlexNet          2.03    2.44    2.92    1.88
GoogLeNet        1.84    2.32    2.76    1.75
VGG_S            2.43    3.04    3.63    2.31
VGG_M            2.18    2.73    3.26    2.02
AlexNet-Sparse   2.81    2.91    3.49    2.19
ResNet-Sparse    1.69    2.02    2.41    1.61
Geomean          2.13    2.55    3.05    1.95
5.3 Energy Efficiency
Table 2 summarizes the energy efficiency of various LAC
configurations over BASE2K . On average over all networks
LAC128, LAC256, LAC512, and LAC1K are 2.13×, 2.55×,
3.05×, and 1.95× more energy efficient than BASE2K, respectively.
5.4 Area
Post layout measurements were used to measure the area
of BASE and LAC. The LAC128, LAC256, and LAC512 configurations
require 0.75×, 0.82×, and 0.96× the area of
BASE2K, respectively, while outperforming BASE2K by 2.3×,
4.0×, and 8.1×. The area overhead for LAC1K is 1.36×
while its execution time improvement over the baseline is
15.4×. Thus LAC exhibits better performance vs. area scaling
than BASE.
5.5 Scalability
Thus far we considered designs with up to 1K-wire weight
memory connections. For one of the most recent networks
studied here, GoogLeNet, we also experimented with 2K- and
4K-wire configurations. Their relative performance improvements
were 20.4× and 27.0×. As with other accelerators,
performance improves sublinearly. This is primarily
due to inter-filter imbalance, which is aggravated because in these
experiments we considered only increasing the number of
filters when scaling up. Alternate designs may consider increasing
the number of simultaneously processed activations
instead. In such configurations, minimal buffering across activation
columns as in Pragmatic [6] can also combat cross-activation
imbalance, which we expect will worsen as we increase
the number of concurrently processed activations.
6. CONCLUSION
We have shown that compared to conventional bit-parallel
processing, aiming to process only the non-zero bits (or
terms in a Booth-encoded format) of the activations and
weights has the potential to reduce work and thus improve
performance by two orders of magnitude. We presented the
first practical design, Laconic, that takes advantage of this approach,
leading to best-of-class performance improvements.
Laconic is naturally compatible with the compression approach
of Delmas et al. [14] and thus, as per their study, we
expect it to perform well with practical off-chip memory configurations
and interfaces.
7. REFERENCES
[1] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. Enright Jerger,
and A. Moshovos, “CNVLUTIN: Ineffectual-Neuron-Free Deep
Neural Network Computing,” in Proceedings of the International
Symposium on Computer Architecture, 2016.
[2] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An
accelerator for compressed-sparse convolutional neural networks,” in
Proceedings of the 44th Annual International Symposium on
Computer Architecture, ISCA ’17, (New York, NY, USA), pp. 27–40,
ACM, 2017.
[3] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos,
“Stripes: Bit-serial Deep Neural Network Computing,” in
Proceedings of the 49th Annual IEEE/ACM International Symposium
on Microarchitecture, MICRO-49, 2016.
[4] A. Delmas, P. Judd, S. Sharify, and A. Moshovos, “Dynamic stripes:
Exploiting the dynamic precision requirements of activation values in
neural networks,” CoRR, vol. abs/1706.00504, 2017.
[5] S. Sharify, A. D. Lascorz, P. Judd, and A. Moshovos, “Loom:
Exploiting weight and activation precisions to accelerate
convolutional neural networks,” CoRR, vol. abs/1706.07853, 2017.
[6] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov,
and A. Moshovos, “Bit-pragmatic deep neural network computing,”
in Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO-50 ’17, pp. 382–394,
2017.
[7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun, and O. Temam, “Dadiannao: A machine-learning
supercomputer,” in Microarchitecture (MICRO), 2014 47th Annual
IEEE/ACM International Symposium on, pp. 609–622, Dec 2014.
[8] Synopsys, “Design compiler.” http://www.synopsys.com/Tools/
Implementation/RTLSynthesis/DesignCompiler/Pages.
[9] Cadence, “Encounter rtl compiler.”
https://www.cadence.com/content/cadence-
www/global/en_US/home/training/all-courses/84441.html.
[10] N. Muralimanohar and R. Balasubramonian, “Cacti 6.0: A tool to
understand large caches,” 2015.
[11] M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie, “Destiny: A tool
for modeling emerging 3d nvm and edram caches,” in Design,
Automation Test in Europe Conference Exhibition (DATE), 2015,
pp. 1543–1546, March 2015.
[12] Yang, Tien-Ju and Chen, Yu-Hsin and Sze, Vivienne, “Designing
Energy-Efficient Convolutional Neural Networks using
Energy-Aware Pruning,” in IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2017.
[13] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey,
“Faster CNNs with Direct Sparse Convolutions and Guided Pruning,”
in 5th International Conference on Learning Representations (ICLR),
2017.
[14] A. Delmas, S. Sharify, P. Judd, M. Nikolic, and A. Moshovos,
“Dpred: Making typical activation values matter in deep learning
computing,” CoRR, vol. abs/1804.06732, 2018.
