Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration by Liu, Zhi-Gang et al.
Sparse Systolic Tensor Array
for Efficient CNN Hardware Acceleration
Zhi-Gang Liu*, Paul N. Whatmough*, and Matthew Mattina
Arm ML Research Lab, Boston, MA, USA
Abstract—Convolutional neural network (CNN) inference on
mobile devices demands efficient hardware acceleration of low-
precision (INT8) general matrix multiplication (GEMM). Exploit-
ing data sparsity is a common approach to further accelerate
GEMM for CNN inference, and in particular, structural sparsity
has the advantages of predictable load balancing and very low
index overhead. In this paper, we address a key architectural
challenge with structural sparsity: how to provide support for
a range of sparsity levels while maintaining high utilization
of the hardware. We describe a time unrolled formulation of
variable density-bound block (VDBB) sparsity that allows for a
configurable number of non-zero elements per block, at constant
utilization. We then describe a systolic array microarchitecture
that implements this scheme, with two data reuse optimizations.
Firstly, we increase reuse in both operands and partial products
by increasing the number of MACs per PE. Secondly, we
introduce a novel approach of moving the IM2COL transform
into the hardware, which allows us to achieve a 3× data bandwidth
expansion just before the operands are consumed by the datapath,
reducing the SRAM power consumption.
The optimizations for weight sparsity, activation sparsity
and data reuse are all interrelated and therefore the optimal
combination is not obvious. Therefore, we perform a design
space evaluation to find the Pareto-optimal design characteristics.
The resulting design achieves 16.8 TOPS/W in 16nm with modest
50% model sparsity and scales with model sparsity up to 55.7
TOPS/W at 87.5%. As well as successfully demonstrating the
variable DBB technique, this result significantly outperforms
previously reported sparse CNN accelerators.
I. INTRODUCTION
Convolutional neural network (CNN) inference has quickly
become an important workload on IoT [12], [13], [22] and
mobile computing devices [14], [42], which has spurred the
development of hardware accelerators in mobile SoCs [19],
[36]. CNNs are fundamentally composed of many layers of
multi-channel 2D convolution, interspersed with non-linear
activation functions. The convolutions are usually lowered
in the runtime to general matrix multiplication (GEMM) by
linearizing the 3D feature maps into a 2D structure using the
IM2COL function [35]. The resulting GEMMs are usually
compute-bound, O(N^3), and heavily dominate the runtime
of CNN inference. Therefore, GEMM is an obvious target
for acceleration [38], and being compute bound, the speedup
justifies the extra silicon real estate. For mobile computing
devices, INT8 CNN inference accelerators demand high energy
* authors with equal contribution.
[Figure 1 panels: (a) 62.5% random sparse; (b) 62.5% block sparse, BZ=4x2; (c) 62.5% 8x1 DBB sparse, BZ=8x1, NNZ=3/8.]
Fig. 1: Sparse matrix encodings, red denotes non-zero element.
BZ is block size, and NNZ is non-zero elements per block.
efficiency (TOPS/W) and area efficiency (TOPS/mm^2) to
achieve performance and price differentiation.
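
To make the IM2COL lowering concrete, the following is a minimal pure-Python sketch, assuming a stride-1, unpadded convolution over a C×H×W feature map. The function names `im2col` and `conv_as_gemm` are illustrative and are not taken from the paper or from any library.

```python
def im2col(fmap, k):
    """Lower a C x H x W feature map (nested lists) to a 2D patch matrix.

    Assumes a k x k kernel, stride 1 and no padding. Each row of the
    result is one flattened receptive field of length C*k*k.
    """
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    rows = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            rows.append([fmap[ch][y + dy][x + dx]
                         for ch in range(c)
                         for dy in range(k)
                         for dx in range(k)])
    return rows


def conv_as_gemm(fmap, filters, k):
    """Convolution expressed as a matrix multiply on the patch matrix.

    filters: list of F flattened kernels, each of length C*k*k.
    Returns a (num_patches x F) output matrix.
    """
    patches = im2col(fmap, k)
    return [[sum(a * b for a, b in zip(patch, f)) for f in filters]
            for patch in patches]


if __name__ == "__main__":
    fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3x3
    print(im2col(fmap, 2))                       # 4 patches of 4 elements
    print(conv_as_gemm(fmap, [[1, 0, 0, 1]], 2))
```

In this toy case the 9 input elements become 16 patch elements, which illustrates why the lowered matrix carries duplicated data.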
Data sparsity can be exploited in CNN inference accelera-
tors [16], [30], [32], as zeros in the data reduce the theoretical
compute and storage cost significantly. However, traditional
sparse matrix multiplication (sGEMM) from scientific work-
loads only generates speedup at very high sparsity (e.g., >95%
zeros). In contrast, CNNs typically exhibit 50–70% zeros [18],
[30], which falls well below this level. Furthermore, the zeros
in CNNs are naturally distributed randomly (Fig. 1(a)), which
leads to two challenges. Firstly, this requires each non-zero
element to be indexed explicitly, which increases the overheads
of storing and manipulating the indexes, not to mention the
cost of gathering and scattering the data. Indexing overheads
are exacerbated for INT8 data types, where the index itself may
require at least 4 bits. Secondly, load balancing is intractable
for random sparsity without inspecting the indexes at runtime,
making it difficult to achieve high utilization.
Block sparsity (Fig. 1(b)) is an alternative, where a con-
tiguous block or pattern of elements is forced to be either
all-zero or unconstrained. For hardware design, this has two
huge advantages over random sparsity. Firstly, the indexing cost
is substantially reduced, because a single index is amortized
over multiple data elements. Secondly, as the granularity is now
a block rather than a single element, the workload is entirely
predictable, which makes load balancing and hardware design
in general much easier. However, as the block size increases
to the benefit of the hardware, the CNN accuracy degrades
due to the increasingly large “holes” in the weight matrices. A
fairly new alternative formulation called density bound block
(DBB) sparsity [21], [27], introduces a bound on the number
of non-zero elements in each block (Fig. 1(c)). DBB sparsity
exhibits the hardware advantages of block sparsity and the
arXiv:2009.02381v1 [cs.AR] 4 Sep 2020
CNN performance of random sparsity. DBB has even been
implemented in the recently announced Nvidia A100 GPU
product [28], which cites 2× speedup for 50% sparsity.
A significant limitation of the two DBB architectures
published to date is that the sparsity is fixed at design time:
75% in [21], and 50% in [28]. As a result, any model that does
not meet this fixed sparsity level will be forced to fall back
to dense execution with no gains. Even worse, any models
that achieve even higher sparsity than the fixed level will
also see no further gains. Therefore, a fixed sparsity level
severely limits the usefulness of DBB for broader deployment,
as any commercial products are forced to choose a modest
sparsity level to best suit the average customer. The challenge
in supporting variable sparsity levels is that the number of
MACs required per fixed amount of weights read from SRAM
changes. For a fixed provisioning of hardware MACs, this leads
to a proportional drop in utilization, which directly impacts
energy efficiency and area efficiency.
In this paper, we introduce a novel variable DBB technique
using a time unrolled architecture. We implement this in a
reuse optimized accelerator and demonstrate state-of-the-art
results. The contributions of this paper are summarized below:
• Variable Density Bound Block (VDBB) In previous
work, the DBB compression is fixed [21], [28], which is a
big impediment to broader deployment, because CNNs can
vary widely in their weight sparsity. This paper describes
a variable DBB (VDBB) architecture achieved through
time unrolling that supports all structured block densities
from 12.5% (1/8) up to fully dense (8/8), achieving both
speedup and energy efficiency as sparsity increases.
• Reuse Optimized VDBB Microarchitecture We de-
scribe an accelerator microarchitecture to implement time
unrolled variable DBB. At the datapath array level, we im-
plement a systolic tensor array (STA) composed of a more
complex PE called a tensor PE (TPE), which increases
reuse and better amortizes the cost of data movement.
To decrease SRAM read power we introduce a novel
hardware IM2COL unit after the SRAM and just before
the datapath, which achieves 3x bandwidth magnification
for 3×3 kernels. We show how to incorporate VDBB
weight sparsity and random activation sparsity.
• Design Space and Evaluation The combination of time
unrolled VDBB and the reuse optimized implementation
results in a large number of parameters, which all have an
interlinked impact on performance, area and energy such
that the optimal design point is not obvious. Therefore,
we finally enumerate the design space and describe the
Pareto-optimal design choices. Results in 16nm for INT8
at 1GHz show the optimal nominal 4 TOPS accelerator
has effective throughput and energy that scale strongly
with model sparsity, demonstrating 16.8 TOPS/W
(50% model sparsity) up to 55.7 TOPS/W (87.5% model
sparsity). This is more than 8× greater energy efficiency
compared to the previously published Laconic [32].
The remainder of the paper is organized as follows. Section II
[Figure 2 labels: Weight Tensors (3x3 filters, 1x1 pointwise filters); Depthwise Tensor Blocking; Block Size (BZ); Raw DBB; Non-Zero Elements; Compressed DBB; Index Mask (BZ bits). Example: the raw block [0, 2, 1, 0, 0, -3, 8, 0] compresses to the non-zero elements [2, 1, -3, 8] plus the index mask M = 8'b01100110.]
Fig. 2: Density Bound Block (DBB) structured sparsity
constraints result in a maximum of NNZ non-zero values per
block of size BZ, when the weight tensors (e.g., 3×3, 1×1
etc) are decomposed in the depth (channel) dimension. The
block is compressed simply by removing the zero elements and
appending the index bitmask M, which indicates the location
of non-zero elements in the expanded block.
provides background material on DBB and presents motivation
for the paper. Section III describes the time unrolled variable
DBB (VDBB) architecture, and Section IV presents a reuse
optimized accelerator implementation. Section V describes
the experimental methodology and Section VI presents the
results. Section VII describes related work. Finally, Section VIII
concludes the paper.
II. BACKGROUND AND MOTIVATION
The main advantage of DBB weight compression for
hardware deployment is that it maintains the regularity of
GEMM. This results in speedup proportional to the compression
rate, high utilization, and low index storage and manipulation
overheads, which are in any case amortized over the block size. In
this section we give an overview of the DBB approach and
discuss preliminaries. We also explain the limitations of the
fixed sparsity ratio used in the previous work.
A. DBB Overview
Both the weights and activations of CNNs exhibit sparsity.
However, while the weights are known in advance and can
be influenced during training, activations depend on the input
image and therefore their sparsity is more difficult to influence.
Therefore, in this work, we apply DBB to weight tensors.
On top of this, we describe clock-gating schemes to exploit
activation sparsity. Density bound block [21] imposes a simple
constraint on the sparsity of a block of BZ elements, such
that there are at most NNZ non-zero elements per block. Fig.
2 gives a concrete example, using a block size of 8×1. The
tensor blocking is performed depthwise (i.e. over the channel
dimension), such that the elements in any single 3×3 kernel do
not fall into the same block, which avoids over-constraining any
single kernel. Note that this approach also works for the 1×1
(pointwise) filters that represent the majority of the compute
in depthwise separable layers [9], which are used heavily in
the influential MobileNets [3] family of models.
Model         Dataset    Baseline    --------- With DBB Pruning ---------
                         Acc. (%)    Acc. (%)    Total NNZ    Sparsity¹ (%)
LeNet-5       MNIST      99.1        98.7        1.05K        75 (2/8)
ConvNet       CIFAR10    86.0        85.3        26.8K        75 (2/8)
ResNet-50V1   ImageNet   75.2        74.2        8.79M        62.5 (3/8)
VGG-16        ImageNet   71.5        71.4        5.39M        62.5 (3/8)
MobileNetV1   ImageNet   70.9        69.8        1.6M         50 (4/8)
¹ Convolution layers only.
TABLE I: CNNs trained with INT8 DBB weights with a block
size of 8. The maximum block sparsity achieved for these
benchmark models varies from 50% (4/8) to 75% (2/8).
When the tensors are blocked in this fashion, they can be
trivially compressed in two steps. First, the non-zero elements
are stored by removing the zeros. Secondly, a simple bitmask
index M is added to encode the presence of a non-zero element
at each location in the expanded block (size BZ). The resulting
compressed size is 8·NNZ + BZ bits, assuming INT8 word
size, giving a compression ratio of 8·BZ/(8·NNZ + BZ). Any
blocks that have fewer than NNZ non-zero elements will include
one or more zero elements in the encoded form.
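
The two compression steps and the compression ratio can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' encoder: the function names are hypothetical, bit i of the mask is assumed to correspond to position i of the block, and blocks with fewer than NNZ non-zeros are assumed to be padded with trailing zero values that have no mask bit set. The example block uses the values shown in Fig. 2.

```python
def dbb_compress(block, nnz):
    """Compress one block into (values, mask).

    Bit i of mask is set when block[i] is non-zero (assumed bit order).
    values holds the non-zeros in position order, zero-padded to nnz entries.
    """
    nonzeros = [(i, v) for i, v in enumerate(block) if v != 0]
    if len(nonzeros) > nnz:
        raise ValueError("block violates the density bound")
    mask = 0
    for i, _ in nonzeros:
        mask |= 1 << i
    values = [v for _, v in nonzeros] + [0] * (nnz - len(nonzeros))
    return values, mask


def dbb_expand(values, mask, bz):
    """Rebuild the expanded block of size bz from (values, mask)."""
    it = iter(values)
    return [next(it) if (mask >> i) & 1 else 0 for i in range(bz)]


def compression_ratio(bz, nnz, word_bits=8):
    """Raw bits over compressed bits: 8*BZ / (8*NNZ + BZ) for INT8."""
    return (word_bits * bz) / (word_bits * nnz + bz)


if __name__ == "__main__":
    raw = [0, 2, 1, 0, 0, -3, 8, 0]
    values, mask = dbb_compress(raw, nnz=4)
    print(values, format(mask, "08b"))
    print(dbb_expand(values, mask, bz=8) == raw)
    print(compression_ratio(8, 4))
```

Note that the index mask adds BZ bits per block, so a 4/8 block compresses by 1.6× rather than 2×.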
B. Training DBB CNN Models
CNNs must be specially trained to meet the DBB constraint.
To demonstrate the feasibility of this, we trained five CNNs,
applying conventional INT8 quantization and magnitude-based
DBB pruning to VGG-16, MobileNetV1, ResNet-50V1, 5-layer
ConvNet and LeNet-5 on ImageNet, CIFAR10 and MNIST
datasets. The DBB sparsity hyperparameter was optimized for
each model. For MobileNet, we apply DBB to the pointwise
(1×1 kernel) layers only, which anyway constitute the vast
majority of the ops. For the depthwise separable layers,
we fall back to dense operation, which is a key feature of
this work. The training results are given in Table I. Further
details of the training methodology are given in Section V-A.
The accuracy loss from combined DBB pruning and INT8
quantization is 0.1–1.1% across all five models, which include
both a relatively big network (ResNet-50V1) and a compact
parameter-efficient network (MobileNetV1), both of which
are typically tough test cases for model optimizations. These
training experiments validate that DBB pruning achieves 2–4×
weight compression while maintaining both reasonable test accuracy
with INT8 quantization and the regularity of GEMM.
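
As a rough illustration of magnitude-based DBB pruning, the sketch below projects a flat weight vector onto the DBB constraint by keeping the NNZ largest-magnitude entries in each block of BZ. It is an assumption-laden sketch, not the training recipe of Section V-A: the function name `dbb_prune` is hypothetical, the vector is assumed to already be laid out along the channel dimension, and real training would typically interleave such a projection with fine-tuning.

```python
def dbb_prune(weights, bz, nnz):
    """Zero all but the nnz largest-magnitude entries in each block of bz.

    weights: flat list whose length is a multiple of bz.
    Ties are broken in favour of the lower index (an arbitrary choice).
    """
    if len(weights) % bz:
        raise ValueError("length must be a multiple of the block size")
    pruned = []
    for start in range(0, len(weights), bz):
        block = weights[start:start + bz]
        order = sorted(range(bz), key=lambda i: (-abs(block[i]), i))
        keep = set(order[:nnz])
        pruned.extend(v if i in keep else 0 for i, v in enumerate(block))
    return pruned


if __name__ == "__main__":
    w = [1, -9, 3, 0, 4, -2, 7, 5]
    print(dbb_prune(w, bz=8, nnz=3))   # one 3/8 block
    print(dbb_prune(w, bz=4, nnz=2))   # two 2/4 blocks
```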
C. DBB Parameters
There are only two key parameters for DBB: the block size
(BZ) and the density bound given by NNZ/BZ¹. In general,
a larger block size introduces a less severe constraint on the
optimization process, but increases the hardware cost. A larger
block size also provides a finer granularity of sparsity levels.
To illustrate this, Table II shows the training sensitivity to the
block size (BZ) using 8-bit quantized LeNet-5 on the MNIST
dataset, following the methodology in Section V-A. For a given
sparsity ratio, DBB pruned LeNet-5 models with larger block
¹ In this paper we routinely refer to the block density as a ratio of NNZ/BZ,
but also use the sparsity given as a percentage.

         BZ=2     BZ=4     BZ=8     BZ=16
NNZ=1    99.0%    98.7%    98.2%    97.9%
NNZ=2    –        99.1%    98.9%    98.6%
NNZ=4    –        –        99.1%    99.1%
TABLE II: Accuracy sensitivity to DBB block size (BZ) and
number of non-zeros (NNZ) for 8-bit LeNet-5 on MNIST.
Accuracy increases with block size at equal sparsity ratio. Cells
on the same diagonal have equal compression ratios of NNZ/BZ
(color-coded in the original).
sizes clearly achieve better prediction accuracy. For example,
the ratio of 1/4 (NNZ=1 and BZ=4) achieves 98.7% accuracy,
but the same compression ratio with a larger block size, e.g.
4/16, gives better accuracy (99.1%). We therefore use a
block size of 8, based on the results of the DBB pruning in
Table I and analysis of the hardware cost. Previous work uses
a block size of 8 in [21] and 4 in [28]. Note that any models
trained with block size of 4 are guaranteed to be supported
on hardware with a block size of 8, as a block 4 model will
always satisfy the same sparsity ratio in block 8 format.
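
The block-4 to block-8 compatibility claim follows from simple counting: if every aligned block of 4 holds at most n non-zeros, then every aligned block of 8 (two adjacent blocks of 4) holds at most 2n. A small check, using a hypothetical helper `satisfies_dbb` and assuming aligned blocks, also shows that the converse does not hold:

```python
def satisfies_dbb(weights, bz, nnz):
    """True if every aligned block of bz entries has at most nnz non-zeros."""
    return all(sum(v != 0 for v in weights[s:s + bz]) <= nnz
               for s in range(0, len(weights), bz))


if __name__ == "__main__":
    w = [0, 3, 0, 1,  2, 0, 0, 4,  0, 0, 0, 0,  5, 0, 6, 0]
    print(satisfies_dbb(w, bz=4, nnz=2))   # 2/4 model
    print(satisfies_dbb(w, bz=8, nnz=4))   # same ratio in block-8 format

    # The reverse direction is not guaranteed: a valid 4/8 block
    # can concentrate all of its non-zeros in one half.
    v = [1, 1, 1, 1, 0, 0, 0, 0]
    print(satisfies_dbb(v, bz=8, nnz=4), satisfies_dbb(v, bz=4, nnz=2))
```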
D. Motivation
We showed in Table I that popular CNN architectures
achieve a fairly wide range of DBB pruning ratios, which vary
depending on the dataset, the network architecture and even the
training recipe. Smaller models such as LeNet-5 and ConvNet
can be pruned down to 1/4 of their original size, so would
ideally benefit from a block sparsity rate of 2/8. However, very
compact models such as MobileNet are notoriously tricky to
train and optimize and achieve about 50% compression at best,
which requires a block density of 4/8. We are also very likely to
encounter a variety of sparsity levels within a single model.
For example, it is very common to avoid optimizing the first
layer of a large CNN, as this often damages accuracy. It is also
possible to optimize sparsity per-layer or even per-channel to
extract the most from the model. Therefore, all of this points
towards the need to support a range of structured sparsity
ratios natively in the hardware.
Previous implementations of DBB have demonstrated fixed
sparsity: Kang et al. [21] implement fixed DBB with a 2/8 (75%)
block, and the Nvidia A100 GPU [28] implements a 2/4
block. However, the fixed block sparsity ratios are a practical
limitation, as models with more sparsity will see no further
improvement, and models with less sparsity will have to
fall back to dense GEMM. In this work, we demonstrate
an effective approach to implement DBB with continuously
variable block sparsity. We also demonstrate the first structured
sparsity systolic array, which heavily emphasizes data reuse.
III. VARIABLE DENSITY BOUND BLOCK (VDBB)
As we outlined in the previous section, hardware support for
variable sparsity DBB (VDBB) is highly desirable. However,
varying the density bound leads to hardware underutilization.
In this section we will discuss this issue more concretely and
present an architectural solution to the VDBB requirement.
[Figure 3 panels, each multiplying a weight block (W) by an activation block (A); the DBB panels use 8:1 muxes driven by the index M:
(a) NN: Dense, HW: Dense. Operand BW = 1.0.
(b) NN: Random Sparse, HW: CG MAC. ~50% MAC power, 50% utilization. Operand BW = 1.0.
(c) NN: 4/8 DBB Sparse, HW: 4/8 DBB (Fixed). ~50% BW/power/area, 100% utilization. Operand BW = 0.625.
(d) NN: 2/8 DBB Sparse, HW: 4/8 DBB (Fixed). 50% utilization. Operand BW = 0.375.
(e) NN: 6/8 DBB Sparse, HW: 4/8 DBB (Fixed). Not supported. Operand BW = 0.875.]
Fig. 3: Spatially unrolled datapaths, all with effective throughput of 16 Ops/cycle. (a) Conventional dense datapath with no
benefit from sparsity. (b) With a random sparsity, we can clock gate (CG) on zero operands to proportionally reduce compute
power while lowering utilization. However, this does not reduce data movement power or area. (c) A DBB datapath designed
for 4/8 block sparsity has the same effective throughput as (a), but requires 62.5% operand bandwidth, and about half the
area/power. The block sparsity (4/8) is fixed at silicon design time. (d) A model with higher sparsity (2/8) has little advantage,
as the hardware is designed for (4/8) block sparsity. (e) A model with lower sparsity (e.g., 6/8) is not natively supported.
A. Spatially Unrolled Block Architecture
A conventional (dense) datapath is shown in Fig. 3(a), where
a block of 8 weights (W) are multiplied by a corresponding
block of activations (A). The most obvious approach to compute
a sparse block is to parallelize the operations across independent
hardware MACs, i.e. spatially unroll the block. For random
weight sparsity, we can add a simple mechanism to detect
zero operands and clock gate (CG) the relevant MAC lane [7],
[31], as shown in Fig. 3(b). This reduces the compute power
(proportionally to the sparsity), but also reduces the utilization
of the hardware, which impacts area efficiency (TOPS/mm^2).
It is also challenging to reduce data storage cost with random
sparsity, due to the unpredictability of the non-zero elements
per fixed SRAM access size.
In contrast, DBB results in a predictable number of non-zero
(NNZ) elements per block, which means we can easily reduce
both compute and data movement. For example, Fig. 3(c)
illustrates a datapath that supports a 4/8 density-bound block
and achieves the same throughput as Fig. 3(a). The DBB
version requires only four hardware MACs, each augmented
with an 8:1 mux to steer the correct activation element into the
MAC. The select signal for the mux is driven by the positional
index metadata (M), which is an additional byte per block
overhead in this example. Note that the implementation of
DBB by Kang [21] is similar to this, but with a 2/8 block
(75% sparsity).
As we noted in Section II, real world models can
exhibit a very wide variety of sparsity levels, but the
fixed DBB hardware in Fig. 3(c) can only support a single
fixed block sparsity ratio. If the model is sparser than 4/8, e.g.,
the 2/8 block shown in Fig. 3(d), then the utilization drops, limiting the
TOPS/mm^2 advantage on sparser models. Conversely, lower
sparsity models, such as the 6/8 example in Fig. 3(e), are not
supported and it is necessary to fall back to dense execution,
which offers no benefit at all. Therefore, instead of fixing the
sparsity at design time, we would instead like to support all
sparsity ratios from very sparse (1/8 density) up to fully dense
(8/8), from 87.5% to 0% sparsity.
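
The utilization and operand bandwidth figures quoted for the spatially unrolled cases in Fig. 3 can be reproduced with a little arithmetic. The sketch below uses hypothetical helper names and assumes INT8 operands, a block size of 8, and one mask bit per block position; it is a back-of-the-envelope model, not a description of the actual hardware.

```python
def spatial_dbb_utilization(model_nnz, hw_nnz):
    """Fraction of provisioned MACs doing useful work.

    hw_nnz MACs are provisioned per block at design time. Returns None
    when the model is denser than the hardware supports, in which case
    the only option is to fall back to dense execution.
    """
    if model_nnz > hw_nnz:
        return None
    return model_nnz / hw_nnz


def operand_bandwidth(nnz, bz=8, word_bits=8):
    """Compressed weight-block bits relative to the dense block bits."""
    return (word_bits * nnz + bz) / (word_bits * bz)


if __name__ == "__main__":
    for model_nnz in (4, 2, 6):
        print(model_nnz,
              spatial_dbb_utilization(model_nnz, hw_nnz=4),
              operand_bandwidth(model_nnz))
```

With fixed 4/8 hardware, a 4/8 model gives full utilization, a 2/8 model gives half, and a 6/8 model is unsupported.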
B. Time Unrolled Block Architecture
The big challenge with supporting variable density bound
blocks (VDBB) in hardware is that as the weight sparsity rate
is increased, the hardware utilization decreases, which leads
to low energy and area efficiency. If we implement fixed 4/8
DBB (50% DBB compression), a model that achieves 75% sparsity
(2/8 density) would result in a utilization drop of roughly 50%. On top of
this, executing a model with sparsity lower than 50% is not
supported, other than by treating it as a dense model.
To get around this issue with spatially unrolled DBB
architectures [21], we instead implement the DBB hardware
by unrolling the block in the time dimension. This simply
means that we process one element of the density bound block
per cycle using a single MAC per block. The key
advantage of this arrangement is that we can now freely vary
the block sparsity, while the datapath utilization and the operand
bandwidth both remain constant. The number of cycles per
block is the only thing that varies as we change the sparsity, i.e.
the effective throughput increases with sparsity. For example,
Fig. 4(a) shows a conventional dense block unrolled in the
time dimension and requiring eight cycles to compute on a
single MAC. While in Fig. 4(b)–(e), we illustrate that NNZ
can be freely varied, with the number of clock cycles required
to compute the block being equal to NNZ. At the extreme,
a very sparse model with 1/8 DBB sparsity only requires 1
cycle per block (8× speedup).
Although the illustrative diagrams in Fig. 3 and Fig. 4 show
both the zero and non-zero elements of the 8-element block, the
key idea with DBB is that the data can be trivially compressed
(Section II), by storing only the non-zero elements and the
index metadata M. Therefore, the non-zero elements of the
weight block are consumed one per cycle, and the skipping of
elements is achieved implicitly. The corresponding activation
[Figure 4 panels: a single MAC with an 8:1 activation mux processes one weight block (W) against one activation block (A), skipping zero weights. The panels show NN densities of 8/8 (8 cycles), 7/8 (7 cycles), 4/8 (4 cycles), 2/8 (2 cycles) and 1/8 (1 cycle).]
Fig. 4: Time unrolled structured sparse block processing, which
allows a continuously variable NNZ per block, while retaining
100% hardware utilization and constant operand bandwidth.
All NNZ options are supported, from the fully dense case (a),
through to 87.5% sparsity (e).
element is then muxed into the MAC. Note that a complex
reordering buffer is not required to implement this, and it
results in very high energy and area efficiency.
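
The time unrolled scheme can be modelled behaviourally as a single MAC that consumes one compressed weight per cycle and uses the index mask to select the matching activation. The sketch below is illustrative only: the function name is hypothetical, bit i of the mask is assumed to mark position i of the block, and padded zero weights (blocks with fewer than NNZ non-zeros) are assumed to occupy an idle cycle each.

```python
def time_unrolled_block_mac(values, mask, activations, acc=0):
    """Process one compressed weight block on a single MAC.

    values: compressed weights, length NNZ (possibly zero-padded).
    mask: index bitmask; bit i set means position i holds a non-zero.
    activations: the dense activation block (length BZ).
    Returns (accumulator, cycles), with cycles equal to len(values).
    """
    positions = [i for i in range(len(activations)) if (mask >> i) & 1]
    for cycle, w in enumerate(values):
        if cycle < len(positions):
            # The BZ:1 mux steers the matching activation into the MAC.
            acc += w * activations[positions[cycle]]
        # Otherwise this is a padded zero weight: an idle cycle.
    return acc, len(values)


if __name__ == "__main__":
    acts = [1, 2, 3, 4, 5, 6, 7, 8]
    acc, cycles = time_unrolled_block_mac([2, 1, -3, 8], 0b01100110, acts)
    dense = sum(w * a for w, a in zip([0, 2, 1, 0, 0, -3, 8, 0], acts))
    print(acc == dense, cycles, len(acts) / cycles)   # speedup = BZ / NNZ
```

Because the MAC is busy on every cycle regardless of NNZ, utilization stays constant and only the cycle count per block changes.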
IV. ACCELERATOR
This section describes an extremely efficient VDBB im-
plementation, that aggressively optimizes five types of data
reuse. The proposed accelerator (Fig. 5) includes a Systolic
Tensor Array (STA), local SRAMs for weights and activations,
IM2COL activation bandwidth magnifier, and Arm Cortex-M33
microcontrollers for DMA and vector compute operations.
A. Datapath Array
The systolic array (SA) is a very efficient and scalable
hardware implementation of GEMM, due to the local register-
to-register operand reuse. However, implementing VDBB in a
systolic architecture greatly improves energy and area efficiency.
We achieve this by generalizing the SA into the systolic tensor
array (STA), which is amenable to supporting DBB and VDBB.
1) Systolic Tensor Array (STA): The classic systolic array
(SA) of Fig. 6(a), consists of an M×N array of PEs. Each PE
is a single MAC with INT8 operand (OPR) pipeline registers and an
INT32 accumulator (ACC) register. We use an output stationary
dataflow, which allows the larger INT32 accumulators to remain
stationary. The Systolic Tensor Array (STA) of Fig. 6b extends
the SA concept, with a more complex PE called a tensor PE
(TPE). The TPE accepts a tensor (matrix) of weights and a
tensor of activations per cycle, instead of a single weight and
activation. Instead of computing a single MAC per cycle, each
TPE essentially processes a small matrix multiplication on the
input matrices of size A×C, using a B-way dotproduct (DP).
This increases the MACs to operands ratio, which we refer to
as intra-TPE reuse. Moving from a scalar MAC to a
dot product also introduces accumulator reuse. In the remainder
[Figure 5 block diagram labels: Weight Buffer (WB) SRAM; Activation Buffer (AB) SRAM; IM2COL units (~3x bandwidth magnifier); TPE array with M = 4 and N = 4; operand widths C x B x INT8 and A x B x INT8; Arm M33 cluster; AXI; DMA.]
Fig. 5: VDBB accelerator microarchitecture consisting of TPE
datapath array, local SRAM, IM2COL unit, and M33 MCUs.
of this paper, we uniquely denote an STA configuration² as
A×B×C M×N. Fig. 7a illustrates the STA dataflow, which
is similar to the classic SA, but with tensor (i.e. sub-matrix)
operands in place of scalar operands.
2) Adding DBB Support (STA-DBB): Next, we add support
for DBB weight matrices. DBB allows us to reduce the number
of MACs per block from BZ to NNZ, which reduces the
area by the compression factor NNZ/BZ. Each dot product
(DP) also requires an additional multiplexer (mux) in front of
each MAC to select the activation element that corresponds
to the non-zero weight, indicated by the bitmask index, M.
An example 2×4×2 2×2 STA-DBB configuration is shown in
Fig. 6(c). Each TPE is composed of four 4-input, 2-way Sparse
Dot Product (S4DP2) units, each of which has a 4-value
activation vector input [A0, A1, A2, A3] from the left, and
a 2/4 DBB compressed weight input [W0,W1] and associated
2-bit non-zero index from the top. Compared to a conventional
(dense) DP4, each S4DP2 trades two MACs for two 8-bit
4:1 datapath multiplexers (MUX), where each MUX costs
significantly less than a MAC. Although this architecture still
supports conventional dense GEMM, it only supports a single
sparsity ratio (50% in this example). Fig. 7(a) illustrates the
corresponding dataflow for this STA-DBB example. Here, we
multiply a 4×8 activation matrix, A, by an 8×4 weight matrix
with 2/4 DBB sparsity, W. Both matrices are first partitioned
into 2×4 or 4×2 sub-matrices respectively, with the W sub-
matrix compressed down to only non-zero elements. Then
each sub-matrix tensor is skewed by one cycle in time across
the edges, similar to a conventional SA, but at tensor granularity.
Finally, each sub-array tensor is input one column (row) per
cycle, corresponding with each TPE on the left (top) edge.
All of this behaviour is essentially the same as an SA, but
replacing scalars with sub-matrices.
3) Adding VDBB Support (STA-VDBB): Finally, we imple-
ment time unrolled variable DBB (Section III-B) as an efficient
STA-VDBB, by modifying the TPE (Fig. 6(d)). We retain the
MUX at the input to the MAC on the activation side to select
the required activation according to the bitmask index, M. But
² The classic SA (Fig. 6a) is a special case: 1×1×1 M×N. Dot product
architectures similar to DaDianNao [8] can be described as 1×B×1 1×1.
[Figure 6 panels: (a) Systolic Array (SA), 1×1×1 4×4, built from single-MAC PEs; (b) Systolic Tensor Array (STA), 2×4×2 2×2, built from DP4 units; (c) STA with DBB support (STA-DBB), 2×4×2 2×2, built from S4DP2 units; (d) STA with VDBB support (STA-VDBB), 2×8×4 2×2, built from S8DP1 units.]
Fig. 6: (a) The systolic array (SA) is efficient because operands read from SRAM are reused many times as they propagate
through PEs in the M×N array. (b) The systolic tensor array (STA), generalizes the scalar PE into a tensor PE (TPE), which
accepts two tensor operands and performs a small matrix multiplication on each cycle. This allows us to introduce intra-TPE
reuse and accumulator reuse, increasing the ratio of compute to data movement. (c) Fixed DBB is implemented inside STA by
adding a mux to the activation input on the dot product. (d) Finally, variable DBB is implemented by switching to multiple
single MACs to allow time unrolling. Notation: A×B×C M×N denotes a M×N 2-D array of A×B×C TPEs (red box). DP2
denotes a 2-way dot-product into a single accumulator register. S4DP2 denotes a 2-way sparse dot-product (SDP) with a 4:1
mux in the activation path for DBB sparsity.
to support VDBB, we move from a wide dot product with
accumulator sharing, to a single MAC with an accumulator
dedicated to a single block (S8DP1). Most importantly, as we
are unrolling in time, the occupancy (number of clock cycles) of
the S8DP1 unit for each block depends on the compression ratio.
Fig. 7(b) illustrates a corresponding data flow for computing a
4×16 by 16×8 matrix multiplication (A×W , respectively),
where W can be compressed into (4×8) in 2/8 DBB format. A
and the compressed W are then partitioned into 2×8 and 2×2
sub-matrices respectively. Each sub-matrix tensor is skewed
by one cycle in time across the array edges at the sub-matrix
granularity, before being input to one column (row) of the TPE
edge. However, in order to unroll the sparse dot product in
each data block, the compressed tensors of W are input one
row per clock. For each TPE, the corresponding tensor input
from the left edge is delayed until the whole block is
complete, i.e. 2 cycle occupancy for this example. Due to the
DBB sparsity, all TPEs have the same occupancy in a
computing stream. For higher DBB compression ratios, the
TPE has a lower number of occupancy cycles, resulting in
higher throughput, and area/energy efficiency.
4) Array Design Trade-offs: Table III summarizes the key
differences between the conventional SA, the STA, and the
sparsity optimized STA-DBB and STA-VDBB designs. This
highlights the analytical benefit we achieve in both inter- and
intra-TPE reuse. However, there are some trade-offs between
the items listed, which we touch upon in the summary below:
• Inter-TPE Operand Reuse An M×N systolic array
features weight and activation operand reuse, O(M) and
O(N) respectively, which amortizes the relatively high
cost of reading operands from SRAM at the array edge.
• Intra-TPE Operand Reuse STA extends the array-level
Trade-offs          SA          STA             STA-DBB            STA-VDBB
MACs per TPE        1           A × B × C       A × b × C          A × C
ACCs per TPE        1           A × C           A × C              A × C
OPRs per TPE        2           B(A + C)        AB + bC            AB + nC
Inter-TPE Reuse¹    MN/(M+N)    AMCN/(AM+CN)    AbCMN/(ABM+CbN)    AnCMN/(ABM+CnN)
Intra-TPE Reuse²    1/2         AC/(A+C)        AbC/(AB+bC)        AnC/(AB+nC)
ACC Reuse           1           B               b                  1
A Sparsity CG       Yes         No              No                 Yes
W Sparsity          No          No              Fixed DBB          Variable DBB
¹ Array MACs / Array input operands. ² TPE MACs / TPE input operands.
TABLE III: Summary of array design trade-offs. b indicates the
number of MACs in the SDP unit of the STA-DBB datapath.
n denotes NNZ for the block. The STA-VDBB array increases
inter- and intra-TPE reuse, while also supporting VDBB weight
sparsity and random activation sparsity clock gating (CG).
reuse, by introducing additional operand reuse inside the
TPE itself. This further amortizes data movement, by
locally performing a small matrix multiply (with many
MACs) on the input tensor operands inside the TPE.
• Accumulator Reuse STA introduces accumulator reuse
whenever a wide dot product is used. Accumulator reuse
improves area efficiency by raising the ratio of MACs to
registers and by allowing a more efficient carry-save
implementation in the dot product datapath.
• Weight Structured Sparsity To support VDBB in STA,
we must switch from wide dot products to single MACs,
sacrificing the area reduction from accumulator reuse.
However, the advantages of VDBB far outweigh this concession.
[Figure 7: (a) Data flow for STA-DBB (2×4×2 2×2); (b) Data flow for STA-VDBB (2×8×4 2×2).]
Fig. 7: DBB and VDBB dataflow examples on small arrays. (a) Example dataflow to compute a dense 4×8 by 2/4 DBB sparse
8×4 matrix multiply using an STA-DBB 2×4×2 2×2 datapath in 5 clock cycles. (b) Example dataflow to compute a dense 4×16
by 2/8 DBB sparse 16×8 matrix multiply on an STA-VDBB 2×8×4 2×2 datapath in 8 clock cycles.
• Activation Sparsity Finally, we can also exploit activation
sparsity on top of VDBB weight sparsity, by clock gating
(CG) the datapath on zero activations. This cannot be
applied to wide dot products, as all inputs would have to
be zero, which is very unlikely. However, we can apply it to
STA-VDBB, which uses single MACs anyway.
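The reuse ratios in Table III can be evaluated for concrete parameters. The sketch below simply transcribes the table's formulas with exact fractions; the function names and the example parameter values are assumptions chosen for illustration, not configurations evaluated by the authors.

```python
from fractions import Fraction


def intra_tpe_reuse(kind: str, A: int = 1, B: int = 1, C: int = 1,
                    b: int = 1, n: int = 1) -> Fraction:
    """TPE MACs divided by TPE input operands (Table III, footnote 2)."""
    if kind == "SA":
        return Fraction(1, 2)
    if kind == "STA":
        return Fraction(A * C, A + C)          # A*B*C / (B*(A + C))
    if kind == "STA-DBB":
        return Fraction(A * b * C, A * B + b * C)
    if kind == "STA-VDBB":
        return Fraction(A * n * C, A * B + n * C)
    raise ValueError(f"unknown array kind: {kind}")


def inter_tpe_reuse(kind: str, M: int, N: int, A: int = 1, B: int = 1,
                    C: int = 1, b: int = 1, n: int = 1) -> Fraction:
    """Array MACs divided by array input operands (Table III, footnote 1)."""
    if kind == "SA":
        return Fraction(M * N, M + N)
    if kind == "STA":
        return Fraction(A * M * C * N, A * M + C * N)
    if kind == "STA-DBB":
        return Fraction(A * b * C * M * N, A * B * M + C * b * N)
    if kind == "STA-VDBB":
        return Fraction(A * n * C * M * N, A * B * M + C * n * N)
    raise ValueError(f"unknown array kind: {kind}")


# Illustrative parameters only: a 32x64 SA and a 4x8x8 TPE in a 4x8 array
# with n = 3 non-zeros per block of B = 8.
sa_inter = inter_tpe_reuse("SA", M=32, N=64)
vdbb_intra = intra_tpe_reuse("STA-VDBB", A=4, B=8, C=8, n=3)
vdbb_inter = inter_tpe_reuse("STA-VDBB", M=4, N=8, A=4, B=8, C=8, n=3)
```

Using `Fraction` keeps the ratios exact, which makes it easy to compare configurations without rounding effects.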
B. Local SRAM
As is commonplace for accelerators, we heavily leverage
local software managed SRAM [26] to provide a low-cost
operand supply for the datapath array. The weight buffer (WB)
is 0.5MB and the activation buffer (AB) is 2MB (Fig. 5).
Due to the array architecture, the SRAM is grouped together,
rather than distributed, so we are able to choose large SRAM
instances which fully amortize the cost of the SRAM periphery
against the bitcell array. We further tune the bank muxing
parameter to balance power and area. The AB
and WB are both double buffered to allow them to be shared
between the datapath and the local MCUs. The input image is
loaded into the AB before operation begins, via DMA from
the MCUs.
C. IM2COL Unit
The main drawback of GEMM (compared to native con-
volution) is the memory footprint overhead from IM2COL
expansion. IM2COL is used to linearize 3D volumes of CNN
feature map and kernel data, in order to process them using
GEMM. If the stride is less than the kernel size, IM2COL
results in duplicated pixels in the output. This leads to an
increased storage requirement, and higher SRAM read power.
We directly address this issue by adding a new hardware unit
that functions as an SRAM read bandwidth magnifier (Fig. 5).
To do this, we implement IM2COL in hardware on activations
as late as possible in the microarchitecture: after they are read
from local SRAM, and just before the data reaches the datapath
(Fig. 5). This allows us to achieve the lower memory footprint
of native convolution, while taking advantage of the compute
regularity of GEMM, which can be more readily optimized in
hardware. The net result of the late IM2COL hardware unit
is a reduction in SRAM read bandwidth for 3×3 and 5×5
convolutions. The unit reads data from the SRAM and holds it
in a small internal buffer register array. By combining
the contents of this buffer, IM2COL transformed outputs are
generated. Fig. 8 illustrates the operation on a 4×6 input feature
map tile, which results in a 3× average SRAM read reduction.
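The duplication that IM2COL introduces can be counted directly. The following is a minimal software sketch of the transform on a single-channel tile, assuming stride 1 and no padding; it is not a model of the hardware unit, and the helper names are hypothetical. It counts elements within one tile only and ignores any sharing between neighbouring tiles.

```python
from typing import List


def im2col(tile: List[List[int]], k: int, stride: int = 1) -> List[List[int]]:
    """Flatten every k x k window of a 2D tile into one row."""
    rows, cols = len(tile), len(tile[0])
    patches = []
    for r in range(0, rows - k + 1, stride):
        for c in range(0, cols - k + 1, stride):
            patches.append([tile[r + i][c + j]
                            for i in range(k) for j in range(k)])
    return patches


def expansion_ratio(rows: int, cols: int, k: int, stride: int = 1) -> float:
    """IM2COL output elements divided by unique input elements."""
    tile = [[r * cols + c for c in range(cols)] for r in range(rows)]
    patches = im2col(tile, k, stride)
    return sum(len(p) for p in patches) / (rows * cols)


ratio_3x3 = expansion_ratio(rows=4, cols=6, k=3)   # 4x6 tile, 3x3 filter
```

For the 4×6 tile with a 3×3 filter, the 24 inputs expand to 8 windows of 9 elements, a ratio of 3. Reading the 24 unique values once and expanding them next to the datapath is what avoids the extra SRAM reads.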
D. Local MCU with SIMD
Although matrix multiplication represents by far the majority
of the computation for CNN inference, there are a number of
ancillary operators that must be supported to allow the whole
model to be processed in place, without moving intermediate
results back to the CPU. These operators include activation
functions, pooling, scaling, batch normalization and data type
casting. We implement these in software, using Arm Cortex-
M33 [1] microcontrollers (MCUs). The M33 MCU has 32-
bit SIMD instructions that can encode up to four INT8
operations [2]. A small 64KB SRAM is included as a program
store for the M33, which has minimal impact on the area
efficiency. Control and data movement tasks are also performed
by the MCU. The number of M33 MCUs required depends on
the peak throughput: 2 is sufficient for a design with 2 TOPS
peak throughput, 4 for 4 TOPS and 8 for 16 TOPS. The silicon
area of the M33 in 16nm technology is very small at 0.008 mm²,
and the typical power consumption is 3.9 µW/MHz [1].
Fig. 8: Overview of the hardware IM2COL unit: (a) hardware;
(b) IM2COL operation for a 6×4 activation patch; (c) hardware
operation. The IM2COL unit caches 6×4 inputs in the 6×2
buffers every 9 cycles. Two 4× outputs are generated per cycle,
reducing SRAM bandwidth by 3× for a typical 3×3 filter.
V. METHODOLOGY
A. DBB CNN Training
Models must be specially trained to exploit DBB. To demonstrate
this, we trained five popular benchmark CNN models
with INT8 DBB encoding on the standard image classification
datasets MNIST [25], CIFAR-10 [23] and ImageNet [11]. Our
DBB training procedure consists of three phases. Firstly, we
start with pre-trained models, although these can alternatively
be trained from scratch using published recipes. Secondly, we
apply magnitude-based DBB-aware weight pruning, which is
similar to random pruning [41], except that it operates within
the domain of each DBB block. This step progressively prunes
small-magnitude weights within each DBB block for about 20
epochs, until the desired block sparsity constraint is met. As is
conventional, the first convolution layer (e.g. in MobileNet-V1)
was not pruned, which has negligible impact on model size.
Finally, the model is fine-tuned for around 30 epochs with 8-bit
quantization for both DBB weights and activations, using the
straight-through estimator (STE) approach, which guarantees
that the FP value zero is converted precisely to the integer
value zero. We used TensorFlow 1.8.0 on an Intel Xeon server
running Ubuntu Linux OS, with an Nvidia V100 GPU.
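The per-block selection at the heart of the pruning phase can be sketched as follows. This is a one-shot illustration under assumptions: the procedure described above prunes progressively over about 20 epochs with training in between, which is not modelled here, and the function name `dbb_prune` and its tie-breaking rule (lower index wins) are hypothetical.

```python
from typing import List, Sequence


def dbb_prune(weights: Sequence[float], block_size: int = 8,
              nnz: int = 2) -> List[float]:
    """Keep the `nnz` largest-magnitude weights in each block, zero the rest."""
    if len(weights) % block_size != 0:
        raise ValueError("weight count must be a multiple of the block size")
    pruned: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Rank positions by descending magnitude; ties go to the lower index.
        order = sorted(range(block_size), key=lambda i: (-abs(block[i]), i))
        keep = set(order[:nnz])
        pruned.extend(w if i in keep else 0 for i, w in enumerate(block))
    return pruned


row = [1, -9, 3, 0, 7, -2, 4, 5]
pruned_2_of_8 = dbb_prune(row, block_size=8, nnz=2)
```

Because the selection is local to each block, every block ends with at most `nnz` non-zeros, which is the property the hardware relies on for fixed occupancy.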
B. RTL Generator
The proposed accelerator (Section IV) encompasses a scalable
family of configurations. It is also regular and amenable to
hierarchical implementation and validation. We implemented a
parameterized Python RTL generator to rapidly and precisely
explore the full design space. This generator produces
synthesizable Verilog RTL for the accelerator, along with a testbench
suite. Designs can be generated with arbitrary dimensions of
A×B×C M×N, along with optional support for DBB and
VDBB sparsity, activation clock gating options, etc. Each design
is automatically validated in Synopsys VCS using the generated
testbench, which can execute inference on a CNN model while
logging value change dump (VCD) switching activity traces.
C. Physical Design and Evaluation
The generated RTL was implemented in TSMC 16nm
FinFET technology to evaluate circuit area, power dissipation
and clock frequency. We also implemented a design in TSMC
65nm LP bulk CMOS to allow fair comparison with results
reported in the older technology. The EDA tool flow used
consists of Synopsys and Cadence tools, which we use with
the TSMC PDK and Arm multi-Vt standard cell libraries. The
single-ported SRAM instances were carefully chosen from the
options available in an Arm SRAM compiler. The design was
constrained for a 1 GHz clock frequency at the slow corner, with
multiple process, voltage and temperature corners used for setup
and hold timing. The 65nm design achieved 500MHz at the
slow corner. Power analysis was performed at the typical corner,
using Synopsys PrimeTimePX, with the parasitic-annotated
netlist and switching activity from VCD simulation traces.
While architecting and analyzing the performance of random-
sparse accelerators [30] is challenging due to the potentially
widely varying sparsity patterns, DBB sparse models have
fixed sparsity and easily predictable runtime. We evaluated
each design using RTL simulation in Synopsys VCS running
our DBB INT8 CNN models (Table I). This generates accurate
performance (throughput) metrics which vary depending on the
dimensions of the weight and activation matrices in each layer.
For power consumption analysis, we capture VCD traces in
RTL simulation from representative layers of ResNet50, which
are then input to PrimeTimePX.
VI. EVALUATION RESULTS
In this section, we review the implementation results of the
proposed accelerators and compare them with previous publications.
A. Design Space
The proposed microarchitecture described in Section IV has
a number of parameters to be optimized, with some
trade-offs (Section IV). We consider the design space for a typical 4
TOPS mobile CNN accelerator implementation. Area and power
consumption metrics are shown in Fig. 9, where each design
has equivalent peak throughput. Each design point shown is
described by a string which includes the array dimensions, the
optional hardware IM2COL unit (denoted “IM2C”), optional
fixed DBB (“DBB”), and optional variable DBB (“VDBB”).
Not all combinations of these parameters are valid. All the
designs are configured to have the same peak throughput of 4
TOPS, and results are normalized to the baseline of a conventional
TPU-like systolic array, which we refer to as 1×1×1 32×64. The
TOPS/W and TOPS/mm² results assume a fairly typical 3/8
DBB weight sparsity and 50% activation sparsity. The best
design is 4×8×8 4×8 VDBB IM2C, which is summarized
in full detail in Table IV. Its strong improvement over the
baseline is due to the significant advantages of exploiting
sparsity and reuse in the microarchitecture (Table III).
Fig. 10 further illustrates the improvement from the
optimizations, as it shows the design space of area and power, again
normalized to the systolic array baseline. This illustrates three
distinct groupings of designs. The first in the top right of the
design space shows a cluster of sub-optimal array configurations
(i.e. 2×8×2 and 4×8×4), without sparsity support. The second
group is in the bottom right, featuring designs with fixed DBB
sparsity support, which achieve more than 2× area reduction
compared to the baseline. Finally, the third group in the far
bottom left are pareto-optimal VDBB designs, which benefit
the most from IM2COL. These designs improve the area by
>2.5× and the power by >2×. The best design is summarized
in detail in Table IV, showing power and area breakdown for
each of the major components.
Fig. 9: Power and area breakdown of iso-throughput designs at 4 TOPS peak, with 3/8 DBB (62.5% sparsity) weights and 50%
random sparse activations. Normalized to a conventional 1×1×1 TPU-like systolic array baseline.
[Figure 10 annotations: Classic Systolic Array; Sub-Optimal TPE Configs; DBB Support (>2x Area Reduction); VDBB Support (>2.5x Area, >2x Power Reduction).]
Fig. 10: Effective power and area design space of the proposed
accelerators, normalized to the 1×1×1 systolic array baseline.
The design space includes options for array configuration,
hardware IM2COL, 4/8 DBB, and VDBB. All points have 4
TOPS of nominal datapath performance, and include typical
50% random activation sparsity.
Unlike with random sparse weight accelerators [30], the
power consumption of the proposed microarchitecture with DBB
weights is fairly constant. However, as we exploit random activation
sparsity, the power varies from layer to layer on a real world
model. To illustrate this, Fig. 11 shows normalized power for
the popular ResNet-50-v1 model. Twelve designs are
shown, which are representative of the whole design space,
all with 4 TOPS peak throughput. All our accelerators take
advantage of activation sparsity using a simple clock gating
scheme. The average activation sparsity percentage is annotated
above each bar group.
Over the whole model, the 4×8×8 VDBB IM2C design
achieves a 44.6% power reduction over the baseline. The fixed DBB
design 4×8×4 DBB IM2C gives a 24.9% power reduction.
Lastly, the variation over individual layers is mainly driven
by changes in activation sparsity.
B. Variable DBB Sparsity
The key advantage of our proposal is that we support
structured sparsity with a variable sparsity rate (Section II),
whereas in previous work this is fixed: 2/8 in Kang [21]
and 2/4 in the Nvidia A100 [28]. Fig. 12 illustrates this point
in terms of effective throughput and energy efficiency over the
full range of weight sparsity levels available with a block size
of 8. Three designs are shown. The first is the baseline systolic
array (1×1×1 array), optimized with hardware IM2COL and
activation sparsity CG. The second is the design (Table IV) with
fixed 4/8 DBB. The third is the proposed VDBB architecture
with variable DBB. Results are given with typical 50% and
80% activation sparsity. The systolic array baseline shows
no throughput improvement as model sparsity increases, but
does show some energy improvement due to less switching
activity in the circuits. The fixed DBB accelerator shows a
step improvement in throughput at and above 50% sparsity,
which corresponds with the fixed 4/8 DBB design point. The
energy is similar, but again shows a little more improvement
at high sparsity due to reduced switching. Finally, the variable
DBB (VDBB) accelerator successfully demonstrates throughput
and energy efficiency which scale very strongly with the model
sparsity. Around 50% model sparsity, VDBB is slightly superior
to fixed DBB, but above 50% the benefit is very large indeed.
Models optimized for very high weight sparsity of 87.5% can
achieve as much as 30 TOPS effective throughput and 56 TOPS/W
energy efficiency. This confirms the advantage of the time
unrolled VDBB architecture.
Fig. 11: 4 TOPS nominal designs: power for individual layers and the whole model of INT8 DBB ResNet50 V1. Normalized
to 1×1×1 with 50% average activation sparsity, closest to the blk1/unit3/conv3 layer in this ResNet50 example.
[Figure 12: (a) Throughput vs. Model Sparsity: effective throughput (TOPS) against weight sparsity (%) for Systolic Array (Dense W), DBB (4/8 W Density) and VDBB (Variable W Density). (b) Energy Efficiency vs. Model Sparsity: effective energy efficiency (TOPS/W) against weight sparsity (%) for the same three designs at 50% and 80% activation sparsity. Both panels carry the annotation "VDBB Scales with Weight Sparsity".]
Fig. 12: Scaling of (a) effective throughput and (b) effective
energy, with weight sparsity for baseline systolic array with clock
gating (1×1×1 32×64), DBB (4×8×4 4×8), and VDBB
(4×8×4 8×8). Activation sparsity (50% and 80% shown)
increases energy efficiency, but does not impact throughput.
VDBB effectively exploits increasing weight sparsity, and
outperforms both an optimized baseline (Systolic array with
CG) and a fixed DBB implementation optimized for 4/8 density.
| Component              | Power, mW (%)        | Area, mm² (%) |
|------------------------|----------------------|---------------|
| Systolic Tensor Array  | 318 (65.2%)          | 0.732 (20.0%) |
| Weight SRAM (512KB)    | 78.5 (16.1%)         | 0.54 (14.4%)  |
| Activation SRAM (2MB)  | 31.0 (6.4%), 93.0†   | 2.16 (57.8%)  |
| Cortex-M33 MCU [1] ×4  | 50.5 (10.2%)         | 0.30 (8.0%)   |
| IM2COL Unit            | 10.0 (2%)            | 0.01 (0.26%)  |
| Total                  | 487.5 (100%), 539.5† | 3.74 (100%)   |
† IM2COL disabled
TABLE IV: Summary of the pareto-optimal VDBB design:
4×8×8 4×8 VDBB IM2COL, with nominal 4 TOPS. Power
reported with nominal 3/8 (62.5%) DBB model sparsity and
50% random sparse activations, achieving an effective 21.9
TOPS/W and 2.85 TOPS/mm².
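The effective figures quoted in the caption of Table IV can be reproduced with simple arithmetic. The sketch below assumes that effective throughput equals the nominal 4 TOPS divided by the weight density (NNZ over block size); this scaling rule is an inference from the reported numbers rather than a formula stated in the text, and the function names are hypothetical. The power and area constants are the totals from Table IV.

```python
NOMINAL_TOPS = 4.0     # nominal throughput of the design (Table IV)
BLOCK_SIZE = 8         # DBB block size
TOTAL_POWER_W = 0.4875 # total power at 3/8 DBB, 50% activation sparsity
TOTAL_AREA_MM2 = 3.74  # total area


def weight_sparsity(nnz: int, block_size: int = BLOCK_SIZE) -> float:
    """Fraction of zero weights for an nnz/block_size density bound."""
    return 1.0 - nnz / block_size


def effective_tops(nnz: int, block_size: int = BLOCK_SIZE,
                   nominal_tops: float = NOMINAL_TOPS) -> float:
    """Assumed scaling: each block costs nnz cycles instead of block_size."""
    if not 1 <= nnz <= block_size:
        raise ValueError("nnz must lie between 1 and the block size")
    return nominal_tops * block_size / nnz


tops_3_of_8 = effective_tops(3)                      # 3/8 DBB, 62.5% sparsity
tops_per_watt = tops_3_of_8 / TOTAL_POWER_W
tops_per_mm2 = tops_3_of_8 / TOTAL_AREA_MM2
```

Under this assumption the 3/8 point gives roughly 10.7 effective TOPS, which divided by the Table IV totals lands close to the reported 21.9 TOPS/W and 2.85 TOPS/mm². The power total applies only to the 3/8 operating point, so energy efficiency at other sparsity levels cannot be derived from it.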
C. Comparison with Prior Work
This work has demonstrated a time unrolled variable DBB
scheme (Section III) implemented in a reuse optimized accel-
erator (Section IV). Here, we illustrate the benefit of these two
contributions by comparing our work with previously published
INT8 CNN inference accelerators in Table V. The proposed
designs shown include the nominal 4 TOPS design in Table IV
at multiple model sparsity points, as well as a 65nm version
of the same design to allow a broader comparison with results
in the older technology.
We first evaluate our results specifically against the
state-of-the-art sparse CNN accelerator work, Laconic [32]. The Laconic
paper includes a thorough comparison against recent work, including
DaDianNao++ [8], Eyeriss [7], SCNN [30], Pragmatic [4],
and BitFusion [33], which convincingly demonstrates that it is the
current state-of-the-art. It is therefore a useful comparison
point at the same INT8 precision, 1 GHz clock frequency
and comparable 15nm technology. The energy efficiency
reported for Laconic is just under 2 TOPS/W.
Our nominal 4 TOPS VDBB design (Table IV) achieves 16.8
TOPS/W at 50% model sparsity, which is more than 8× higher.
The only other DBB accelerator is Kang [21], which uses a
fixed 2/8 DBB implemented in a dot-product microarchitecture,
and reports 1.65 TOPS/W in 65nm technology for 75%
sparse DBB. Our design implements variable DBB in a reuse
optimized accelerator and in 65nm achieves 2.8 TOPS/W at the
same 75% model sparsity, which is 70% higher. We also note
that such a high fixed DBB sparsity would not be practical
for compact models like MobileNet and ResNet on the ImageNet
dataset, based on our results (Table I).

| Design           | Technology | SRAM A / W   | Freq. (GHz) | Throughput (TOPS) | Energy Efficiency¹ (TOPS/W) | Area Efficiency¹ (TOPS/mm²) | Weight Sparsity | Activation Sparsity |
|------------------|------------|--------------|-------------|-------------------|-----------------------------|-----------------------------|-----------------|---------------------|
| Ours             | 16nm       | 2MB / 512KB  | 1.0         | 4                 | 55.7                        | 28.2                        | 87.5% VDBB      | 50% CG              |
| Ours             | 16nm       | 2MB / 512KB  | 1.0         | 4                 | 31.3                        | 4.29                        | 75% VDBB        | 50% CG              |
| Ours             | 16nm       | 2MB / 512KB  | 1.0         | 4                 | 21.9                        | 2.85                        | 62.5% VDBB      | 50% CG              |
| Ours             | 16nm       | 2MB / 512KB  | 1.0         | 4                 | 16.8                        | 2.13                        | 50% VDBB        | 50% CG              |
| SMT-SA² [34]     | 16nm       | 2MB / 512KB  | 1.0         | 4                 | 7.4                         | 1.13                        | 62.5% Random    | 50% CG              |
| Laconic [32]     | 15nm       | 2MB / 512KB  | 1.0         | –                 | 1.997                       | –                           | Bit-wise        | Bit-wise            |
| SCNN³ [32]       | 16nm       | 1.2MB / –    | 1.0         | 2                 | 0.79                        | 0.7                         | Random          | –                   |
| Ours             | 65nm       | 2MB / 512KB  | 0.5         | 1                 | 2.80                        | 0.26                        | 75% VDBB        | 50% CG              |
| Ours             | 65nm       | 2MB / 512KB  | 0.5         | 1                 | 1.95                        | 0.17                        | 62.5% VDBB      | 50% CG              |
| Kang et al. [21] | 65nm       | 58KB         | 1.0         | 0.5               | 1.65                        | 1.01                        | 75% DBB         | –                   |
| Laconic [32]     | 65nm       | 2MB / 512KB  | 1.0         | –                 | 0.81                        | –                           | Bit-wise        | Bit-wise            |
| Eyeriss v2 [6]   | 65nm       | 246KB        | 0.2         | 0.40              | 0.96                        | 0.07/2.7M gates             | Random          | Random              |
¹ Effective operations. ² Our re-implementation with INT8 operands in 16nm. ³ 24-bit accumulators rather than 32-bit.
TABLE V: Comparison with published sparse INT8 CNN accelerators in 16nm and 65nm technology. Published metrics for
these works vary widely; however, even at modest 50% weight and activation sparsity, our VDBB design achieves 16.8
TOPS/W in 16nm, which far exceeds previously reported results and offers strong throughput and energy scaling with weight
sparsity.
The only other sparse systolic array design we are aware
of is SMT-SA [34]. To compare against this work, which was
reported in 45nm, we re-implemented the design ourselves in
16nm with INT8 operands (Table V). The re-implementation
achieves 7.4 TOPS/W, compared to 21.9 TOPS/W for the proposed
design at the same sparsity. The gap is largely due to the cost of
the FIFOs required in the SMT-SA array, and the advantages of
DBB over random weight sparsity (Section II).
Finally, SCNN [30] and Eyeriss v2 [6] are also included
in the comparison (Table V). In summary, we found that our
work outperforms: 1) Laconic [32], the latest sparse accelerator,
2) SMT-SA [34], the only other sparse systolic array, and 3)
Kang [21], the only other DBB accelerator.
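The headline ratios in this comparison follow directly from the energy efficiency column of Table V. The short check below recomputes them; the dictionary keys are labels chosen here for readability, and only the TOPS/W values come from the table.

```python
# Energy efficiency (TOPS/W) values copied from Table V.
TOPS_PER_WATT = {
    "ours_16nm_50pct": 16.8,
    "ours_16nm_62.5pct": 21.9,
    "ours_65nm_75pct": 2.80,
    "laconic_15nm": 1.997,
    "smt_sa_16nm_62.5pct": 7.4,
    "kang_65nm_75pct": 1.65,
}


def ratio(design: str, reference: str) -> float:
    """Energy efficiency of `design` relative to `reference`."""
    return TOPS_PER_WATT[design] / TOPS_PER_WATT[reference]


vs_laconic = ratio("ours_16nm_50pct", "laconic_15nm")
vs_smt_sa = ratio("ours_16nm_62.5pct", "smt_sa_16nm_62.5pct")
vs_kang = ratio("ours_65nm_75pct", "kang_65nm_75pct")
```

The Laconic ratio comes out a little above 8, the SMT-SA ratio just under 3, and the Kang ratio at about 1.7, matching the "more than 8×" and "70% higher" statements in the text.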
VII. RELATED WORK
Clock Gating Random Sparsity A simple and effective
approach to exploiting random sparsity is to clock-gate (CG)
to save power when one or more zero operands are encoun-
tered [7], [31]. However, CG schemes generally result in low
utilization with no area or throughput improvement. We apply
this CG for activation sparsity (which is not amenable to DBB).
Random Sparsity For RNN acceleration, EIE [16] im-
plements a fine-grained random sparse CSR-encoded INT16
matrix-vector accelerator for dense layers, and ESE [17] extends
this to LSTMs. MASR [15] also exploits random sparsity, but
uses a bitmask encoding. PermDNN [10] targets sparse dense
layers using permuted diagonal matrices. A number of papers
target random sparse matrix multiplication for very sparse
data, such as OuterSPACE [29], which uses an outer product
scheme, and SpArch [39], which further optimizes for locality.
More specific to the lower sparsity of CNNs, Cnvlutin [5]
demonstrates skipping compute for zero activations discovered
at runtime, without explicit indexes. SCNN [30] implements
a fully CSR-indexed sparse CNN accelerator using an outer
product to exploit sparse weights and activations. FixyNN [37]
demonstrates a fixed-weight accelerator that can very efficiently
exploit random sparsity. We focus on CNN structured sparsity,
but compare with SCNN and Laconic (Table V).
Structured Sparsity Cambricon-S proposes a conventional
block sparse accelerator [40]. A DBB accelerator described by
Kang [21] implements a fixed weight sparsity of 75%. The ac-
celerator design is also based on a dot product microarchitecture
with limited data reuse. The hardware implementation details of the
Nvidia structured sparsity scheme [28] are unknown, but the scheme
is fixed at 2/4 (50%) sparsity. From our pruning experiments (Table I),
2/8 (75%) is quite aggressive, but 2/4 is probably more
universally useful. Nonetheless, in both cases, the deployed
benefit is limited due to the fixed-sparsity ratio. In contrast,
our VDBB proposal demonstrates variable-sparsity DBB using
time unrolling in a reuse optimized accelerator.
Sparsity in Systolic Arrays Most sparse CNN accelerators
are based on dot-product designs reminiscent of DaDianNao [8],
which typically have lower data reuse compared to systolic
arrays (SAs) like the Google TPU [20]. SMT-SA [34] is a
sparse SA, which focuses on random sparsity (instead of
DBB). Kung et al. [24] demonstrated a preprocessing step of
column combining of sparse weight matrices, before processing
on a dense SA architecture. We implemented an INT8 version
of SMT-SA to compare against (Table V), and found that
DBB is much more efficient than the random sparsity of
SMT-SA, which requires FIFOs in the array.
VIII. CONCLUSION
Structured model sparsity is a powerful optimization to
enable improved throughput and energy efficiency in CNN
hardware accelerators, without the overheads and unpredictable
load balancing of random weight sparsity. However, unlike
with random sparsity, previous demonstrations of block sparsity
employ a fixed target sparsity ratio. Unfortunately, this is a
serious impediment to broad deployment, because real world
CNNs typically exhibit a wide range of weight sparsity ratios.
With a fixed sparsity, any models that do not achieve this
threshold must fall back to dense operation with no speedup.
At the same time, aggressively optimized models that exceed
the threshold also do not see any benefit.
In this paper, we introduce a variable density bound block
(VDBB) technique, which uses a time unrolled architecture to
implement a weight sparsity scheme that supports all possible
block sparsity levels. This enables hardware benefits from
model sparsity across the whole spectrum of models in use
today. We implement VDBB in a reuse optimized accelerator
microarchitecture, featuring a systolic tensor array (STA)
composed of more complex PEs called tensor PEs (TPEs) which
increase operand and accumulator reuse. We also describe a
hardware delayed IM2COL unit that achieves a 3× activation
SRAM bandwidth magnification effect to reduce SRAM power
consumption. The reuse optimized accelerator design introduces
a number of interdependent parameters, resulting in a non-trivial
design space, which we evaluate in 16nm process technology.
The optimal design scales strongly in throughput and energy
as a function of model sparsity, from 16.8 TOPS/W at 50%
sparsity up to 55.7 TOPS/W at 87.5%. The advantage of
both the VDBB compression and the reuse optimizations is
apparent in these results, which outperform previous sparse
CNN accelerator designs reported.
REFERENCES
[1] Arm Cortex-M33. [Online]. Available: https://developer.arm.com/ip-
products/processors/cortex-m/cortex-m33
[2] ARMv8-M Architecture Reference Manual. [Online]. Available:
https://static.docs.arm.com/ddi0553/a/DDI0553A_e_armv8m_arm.pdf
[3] MobileNet v1 TensorFlow Model. [Online]. Available:
http://download.tensorflow.org/models/mobilenet_v1_2018_08_02/mobilenet_v1_1.0_224.tgz
[4] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov,
and A. Moshovos, “Bit-pragmatic deep neural network computing,” in
Proceedings of the 50th Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO-50 ’17. New York, NY, USA:
Association for Computing Machinery, 2017, pp. 382–394. [Online].
Available: https://doi.org/10.1145/3123939.3123982
[5] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger,
and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural
network computing,” in Proceedings of the 43rd International
Symposium on Computer Architecture, ser. ISCA ’16. Piscataway,
NJ, USA: IEEE Press, 2016, pp. 1–13. [Online]. Available:
https://doi.org/10.1109/ISCA.2016.11
[6] Y. Chen, T. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator
for emerging deep neural networks on mobile devices,” IEEE Journal
on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2,
pp. 292–308, 2019.
[7] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture
for energy-efficient dataflow for convolutional neural networks,” in
Proceedings of the 43rd International Symposium on Computer
Architecture. Piscataway, NJ, USA: IEEE Press, 2016, pp. 367–379.
[Online]. Available: https://doi.org/10.1109/ISCA.2016.40
[8] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu,
N. Sun, and O. Temam, “Dadiannao: A machine-learning supercomputer,”
in Proceedings of the 47th Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO-47. Washington, DC, USA:
IEEE Computer Society, 2014, pp. 609–622. [Online]. Available:
https://doi.org/10.1109/MICRO.2014.58
[9] F. Chollet, “Xception: Deep learning with depthwise separable
convolutions,” CoRR, vol. abs/1610.02357, 2016. [Online]. Available:
http://arxiv.org/abs/1610.02357
[10] C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian, and B. Yuan,
“Permdnn: Efficient compressed dnn architecture with permuted diagonal
matrices,” in 2018 51st Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), 2018, pp. 189–202.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[12] I. Fedorov, R. P. Adams, M. Mattina, and P. N. Whatmough, “SpArSe:
Sparse architecture search for CNNs on resource-constrained micro-
controllers,” in Advances in Neural Information Processing Systems
(NeurIPS), 2019, pp. 4978–4990.
[13] I. Fedorov, M. Stamenovic, C. Jenson, L.-C. Yang, A. Mandell, Y. Gan,
M. Mattina, and P. N. Whatmough, “TinyLSTMs: Efficient Neural Speech
Enhancement for Hearing Aids,” in Conference of the International
Speech Communication Association (INTERSPEECH), 2020.
[14] Y. Feng, P. Whatmough, and Y. Zhu, “ASV: Accelerated Stereo Vision
System,” in Proceedings of the 52nd Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO ’52. New York, NY,
USA: Association for Computing Machinery, 2019, pp. 643–656. [Online].
Available: https://doi.org/10.1145/3352460.3358253
[15] U. Gupta, B. Reagen, L. Pentecost, M. Donato, T. Tambe, A. M.
Rush, G. Wei, and D. Brooks, “MASR: A modular accelerator
for sparse rnns,” in 28th International Conference on Parallel
Architectures and Compilation Techniques, PACT 2019, Seattle, WA,
USA, September 23-26, 2019. IEEE, 2019, pp. 1–14. [Online].
Available: https://doi.org/10.1109/PACT.2019.00009
[16] S. Han et al., “EIE: Efficient inference engine on compressed deep
neural network,” in Proceedings of the 43rd Int. Symp. on Computer
Architecture (ISCA), 2016.
[17] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo,
S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech
recognition engine with sparse LSTM on FPGA,” in Proceedings of the
2017 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, ser. FPGA ’17. New York, NY, USA: Association
for Computing Machinery, 2017, pp. 75–84. [Online]. Available:
https://doi.org/10.1145/3020078.3021745
[18] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing
deep neural network with pruning, trained quantization and huffman
coding,” CoRR, vol. abs/1510.00149, 2015. [Online]. Available:
http://arxiv.org/abs/1510.00149
[19] P. Hansen, A. Vilkin, Y. Khrustalev, J. Imber, D. Hanwell, M. Mattina, and
P. N. Whatmough, “ISP4ML: Understanding the Role of Image Signal
Processing in Efficient Deep Learning Vision Systems,” in International
Conference on Pattern Recognition (ICPR), 2020.
[20] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin,
C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, R. C. Ho,
D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon,
J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean,
A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami,
R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps,
J. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham,
J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian,
H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox,
and D. H. Yoon, “In-datacenter performance analysis of a tensor
processing unit,” CoRR, vol. abs/1704.04760, 2017. [Online]. Available:
http://arxiv.org/abs/1704.04760
[21] H. Kang, “Accelerator-aware pruning for convolutional neural networks,”
IEEE Transactions on Circuits and Systems for Video Technology, pp.
1–1, 2019.
[22] S. Kodali, P. Hansen, N. Mulholland, P. Whatmough, D. Brooks, and
G. Wei, “Applications of Deep Neural Networks for Ultra Low Power IoT,”
in 2017 IEEE International Conference on Computer Design (ICCD),
2017, pp. 589–592.
[23] A. Krizhevsky and G. Hinton, “Learning multiple layers of features
from tiny images,” Master’s thesis, Department of Computer Science,
University of Toronto, 2009.
[24] H. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional
neural networks for efficient systolic array implementations: Column
combining under joint optimization,” in 24th Int. Conf. on Architectural
Support for Programming Languages and Operating Systems (ASPLOS),
2019, pp. 821–834.
[25] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[26] H. Li, M. Bhargav, P. N. Whatmough, and H.-S. Philip Wong, “On-chip
memory technology design space explorations for mobile deep neural
network accelerators,” in 2019 56th ACM/IEEE Design Automation
Conference (DAC), 2019, pp. 1–6.
[27] Z. Liu, P. N. Whatmough, and M. Mattina, “Systolic Tensor Array:
An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN
Inference,” IEEE Computer Architecture Letters, vol. 19, no. 1, pp. 34–
37, 2020.
[28] Nvidia. A100 GPU Datasheet. [Online]. Avail-
able: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/
a100/pdf/nvidia-a100-datasheet.pdf
[29] S. Pal et al., “Outerspace: An outer product based sparse matrix
multiplication accelerator,” in Int. Symp. on High Performance Computer
Architecture (HPCA), Feb 2018, pp. 724–736.
[30] A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional
neural networks,” in 2017 ACM/IEEE 44th Annual International
Symposium on Computer Architecture (ISCA), June 2017, pp. 27–40.
[31] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M.
Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power,
highly-accurate deep neural network accelerators,” in Proceedings
of the 43rd International Symposium on Computer Architecture, ser.
ISCA, 2016.
[32] S. Sharify, A. D. Lascorz, M. Mahmoud, M. Nikolic, K. Siu, D. M.
Stuart, Z. Poulos, and A. Moshovos, “Laconic deep learning inference
acceleration,” in Proceedings of the 46th International Symposium
on Computer Architecture, ser. ISCA ’19. New York, NY, USA:
Association for Computing Machinery, 2019, pp. 304–317. [Online].
Available: https://doi.org/10.1145/3307650.3322255
[33] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and
H. Esmaeilzadeh, “Bit fusion: Bit-level dynamically composable
architecture for accelerating deep neural networks,” in Proceedings of
the 45th Annual International Symposium on Computer Architecture,
ser. ISCA ’18. IEEE Press, 2018, pp. 764–775. [Online]. Available:
https://doi.org/10.1109/ISCA.2018.00069
[34] G. Shomron, T. Horowitz, and U. Weiser, “SMT-SA: Simultaneous
multithreading in systolic arrays,” IEEE Comput. Archit. Lett., vol. 18,
no. 2, pp. 99–102, Jul. 2019.
[35] P. Warden. Why GEMM is at the heart of deep learning. [Online].
Available: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
[36] P. N. Whatmough, S. K. Lee, M. Donato, H. Hsueh, S. Xi, U. Gupta,
L. Pentecost, G. G. Ko, D. Brooks, and G. Wei, “A 16nm 25mm2 SoC
with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-
A53 to eFPGA and Cache-Coherent Accelerators,” in 2019 Symposium
on VLSI Circuits, 2019, pp. C34–C35.
[37] P. N. Whatmough, C. Zhou, P. Hansen, S. K. Venkataramanaiah,
J.-s. Seo, and M. Mattina, “FixyNN: Efficient Hardware for Mobile
Computer Vision via Transfer Learning,” in Proceedings of the 2nd
SysML Conference, Palo Alto, CA, USA, 2019.
[38] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao,
H. Ha, P. Raina, C. Kozyrakis, and M. Horowitz, “Interstellar: Using
Halide’s scheduling language to analyze DNN accelerators,” in Proceedings
of the Twenty-Fifth International Conference on Architectural Support
for Programming Languages and Operating Systems, ser. ASPLOS ’20.
New York, NY, USA: Association for Computing Machinery, 2020, pp.
369–383. [Online]. Available: https://doi.org/10.1145/3373376.3378514
[39] Z. Zhang, H. Wang, S. Han, and W. J. Dally, “SpArch: Efficient
architecture for sparse matrix multiplication,” in 26th IEEE International
Symposium on High Performance Computer Architecture (HPCA), 2020.
[40] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li,
T. Chen, and Y. Chen, “Cambricon-s: Addressing irregularity in sparse
neural networks through a cooperative software/hardware approach,” in
Proceedings of the 51st Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO-51. IEEE Press, 2018, pp. 15–28.
[Online]. Available: https://doi.org/10.1109/MICRO.2018.00011
[41] M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy
of pruning for model compression.” CoRR, vol. abs/1710.01878, 2017.
[Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1710.html#abs-1710-01878
[42] Y. Zhu, A. Samajdar, M. Mattina, and P. Whatmough, “Euphrates:
Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision,”
in Proceedings of the 45th Annual International Symposium on
Computer Architecture, ser. ISCA ’18. IEEE Press, 2018, pp. 547–560.
[Online]. Available: https://doi.org/10.1109/ISCA.2018.00052
